SoftMax
Applicability
| Product | Supported |
|---|---|
| | √ |
| | √ |
| | √ |
| | √ |
| | x |
| | x |
Function
Treat the product of the non-last axis lengths of the input tensor [m0, m1, ..., mt, n] (t ≥ 0) as m, so that the input tensor has shape [m, n]. SoftMax is then computed on this [m, n] tensor row by row, as follows.

For ease of understanding, the following Python script expresses the formula (using the ND input format as an example), where src is the source operand (input), and dst, sum, and max are the destination operands (outputs).
```python
import numpy as np

def softmax(src):
    # Perform rowmax (taking the maximum value by row) processing along the last axis.
    max = np.max(src, axis=-1, keepdims=True)
    sub = src - max
    exp = np.exp(sub)
    # Perform rowsum (taking the sum by row) processing along the last axis.
    sum = np.sum(exp, axis=-1, keepdims=True)
    dst = exp / sum
    return dst, max, sum
```
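To illustrate the [m, n] view described above, the following pure-Python sketch collapses all non-last axes of a nested list into rows and applies the same max, sub, exp, sum, and div sequence to each row. The helper names `flatten_to_rows` and `softmax_rows` are hypothetical and are not part of the API; nested lists stand in for tensors.

```python
import math

def flatten_to_rows(tensor):
    """Collapse all non-last axes of a nested list into one axis of rows."""
    if not isinstance(tensor[0], list):
        return [tensor]  # already a single row of length n
    rows = []
    for sub in tensor:
        rows.extend(flatten_to_rows(sub))
    return rows

def softmax_rows(rows):
    """Row-wise softmax; returns (dst, max, sum) like the script above."""
    dst, row_max, row_sum = [], [], []
    for row in rows:
        mx = max(row)                           # rowmax
        exps = [math.exp(v - mx) for v in row]  # sub, then exp
        total = sum(exps)                       # rowsum
        dst.append([e / total for e in exps])   # div
        row_max.append(mx)
        row_sum.append(total)
    return dst, row_max, row_sum

# A tensor of shape [2, 3, 4] is treated as [m, n] = [6, 4].
src = [[[float(i + j + k) for k in range(4)] for j in range(3)] for i in range(2)]
rows = flatten_to_rows(src)
dst, row_max, row_sum = softmax_rows(rows)
```

Each of the six rows in `dst` sums to 1, and `row_max` and `row_sum` each hold one value per row.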
The internal reduce process varies according to the input data format. When the input is in ND format, the internal reduce process is performed along the last axis. When the input is in NZ format, the internal reduce process is performed along the last and first axes. The following figure shows the reduce process.
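The exact NZ layout is not spelled out in this section, so the sketch below rests on an assumption: an ND matrix of shape [H, W] is stored in NZ as [W1, H1, H0, W0], with H = H1 × H0 and W = W1 × W0, and element (h, w) placed at [w // W0][h // H0][h % H0][w % W0]. Under that assumption, reducing one logical row means reducing over the first (W1) and last (W0) axes. The function names and the small block sizes are illustrative only.

```python
def nd_to_nz(nd, h0, w0):
    """Rearrange an [H, W] nested list into the assumed NZ layout [W1, H1, H0, W0]."""
    h, w = len(nd), len(nd[0])
    assert h % h0 == 0 and w % w0 == 0, "sketch assumes exact tiling"
    h1, w1 = h // h0, w // w0
    return [[[[nd[b * h0 + r][a * w0 + c] for c in range(w0)]
              for r in range(h0)]
             for b in range(h1)]
            for a in range(w1)]

def nz_row_max(nz):
    """Row max in the assumed NZ layout: reduce over the first (W1) and last (W0) axes."""
    w1, h1, h0 = len(nz), len(nz[0]), len(nz[0][0])
    return [max(max(nz[a][b][r]) for a in range(w1))
            for b in range(h1) for r in range(h0)]

# Small 4 x 8 matrix with H0 = 2 and W0 = 4 (assumed sizes, chosen for readability).
nd = [[float((3 * i + 5 * j) % 11) for j in range(8)] for i in range(4)]
nz = nd_to_nz(nd, h0=2, w0=4)
nd_max = [max(row) for row in nd]  # ND: reduce along the last axis only
nz_max = nz_row_max(nz)            # NZ: reduce along the first and last axes
```

Both reductions yield the same per-row maxima; only the axes being traversed differ.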
Principles
The following figure shows the internal algorithm diagram of the SoftMax high-level APIs by taking the input tensor of the float type, in ND format, and with shape [m, k] as an example.
The computation process is divided into the following steps, all of which are performed on vectors:
- reducemax: Compute the maximum value of each row of input x to obtain [m, 1]. The computation result is saved to the temporary space temp.
- broadcast: Pad the data [m, 1] in temp by data block. For example, for the float type, extend [m, 1] to [m, 8] and output max.
- sub: Subtract max from all data of input x by row.
- exp: Compute exp for all data after sub.
- reducesum: Sum up each row of data after exp is performed to obtain [m, 1]. The computation result is saved to temp.
- broadcast: Pad [m, 1] in temp by data block. For example, for the float type, extend [m, 1] to [m, 8] and output sum.
- div: Divide all data generated after exp by sum at each row to obtain the final result.
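The seven steps above can be sketched in plain Python as follows. This is a reference model only, not the on-device implementation: nested lists stand in for tensors, `softmax_steps` is a hypothetical name, and the 32-byte data block size is inferred from the float example ([m, 1] extended to [m, 8]).

```python
import math

BLOCK_BYTES = 32  # one data block; inferred from the [m, 1] -> [m, 8] float example

def softmax_steps(x, dtype_size=4):
    """Follow the seven steps on an [m, k] nested list of floats."""
    per_block = BLOCK_BYTES // dtype_size                      # 8 elements for float
    temp = [max(row) for row in x]                             # reducemax -> [m, 1]
    max_out = [[v] * per_block for v in temp]                  # broadcast -> [m, 8]
    sub = [[v - mx for v in row] for row, mx in zip(x, temp)]  # sub
    exp = [[math.exp(v) for v in row] for row in sub]          # exp
    temp = [sum(row) for row in exp]                           # reducesum -> [m, 1]
    sum_out = [[v] * per_block for v in temp]                  # broadcast -> [m, 8]
    dst = [[v / s for v in row] for row, s in zip(exp, temp)]  # div
    return dst, sum_out, max_out

x = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
dst, sum_out, max_out = softmax_steps(x)
```

Note that `temp` is reused for both reductions, mirroring the temporary space in the step list, while `max_out` and `sum_out` carry the broadcast [m, 8] results.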
Prototype
- Allocate the temporary space through the API framework.
- The data types of LocalTensor are the same.
```cpp
template <typename T, bool isReuseSource = false, bool isBasicBlock = false, bool isDataFormatNZ = false, const SoftmaxConfig& config = SOFTMAX_DEFAULT_CFG>
__aicore__ inline void SoftMax(const LocalTensor<T>& dstTensor, const LocalTensor<T>& sumTensor, const LocalTensor<T>& maxTensor, const LocalTensor<T>& srcTensor, const SoftMaxTiling& tiling, const SoftMaxShapeInfo& softmaxShapeInfo = {})
```
- The data types of LocalTensor are different.
```cpp
template <typename T, bool isReuseSource = false, bool isBasicBlock = false, bool isDataFormatNZ = false, const SoftmaxConfig& config = SOFTMAX_DEFAULT_CFG>
__aicore__ inline void SoftMax(const LocalTensor<half>& dstTensor, const LocalTensor<float>& sumTensor, const LocalTensor<float>& maxTensor, const LocalTensor<half>& srcTensor, const SoftMaxTiling& tiling, const SoftMaxShapeInfo& softmaxShapeInfo = {})
```
- Without sumTensor and maxTensor
```cpp
template <typename T, bool isReuseSource = false, bool isBasicBlock = false, const SoftmaxConfig& config = SOFTMAX_DEFAULT_CFG>
__aicore__ inline void SoftMax(const LocalTensor<T>& dstTensor, const LocalTensor<T>& srcTensor, const SoftMaxTiling& tiling, const SoftMaxShapeInfo& softmaxShapeInfo = {})
```
- The data types of LocalTensor are the same.
- Pass the temporary space through the sharedTmpBuffer input parameter.
- The data types of LocalTensor are the same.
```cpp
template <typename T, bool isReuseSource = false, bool isBasicBlock = false, bool isDataFormatNZ = false, const SoftmaxConfig& config = SOFTMAX_DEFAULT_CFG>
__aicore__ inline void SoftMax(const LocalTensor<T>& dstTensor, const LocalTensor<T>& sumTensor, const LocalTensor<T>& maxTensor, const LocalTensor<T>& srcTensor, const LocalTensor<uint8_t>& sharedTmpBuffer, const SoftMaxTiling& tiling, const SoftMaxShapeInfo& softmaxShapeInfo = {})
```
- The data types of LocalTensor are different.
```cpp
template <typename T, bool isReuseSource = false, bool isBasicBlock = false, bool isDataFormatNZ = false, const SoftmaxConfig& config = SOFTMAX_DEFAULT_CFG>
__aicore__ inline void SoftMax(const LocalTensor<half>& dstTensor, const LocalTensor<float>& sumTensor, const LocalTensor<float>& maxTensor, const LocalTensor<half>& srcTensor, const LocalTensor<uint8_t>& sharedTmpBuffer, const SoftMaxTiling& tiling, const SoftMaxShapeInfo& softmaxShapeInfo = {})
```
- Without sumTensor and maxTensor
```cpp
template <typename T, bool isReuseSource = false, bool isBasicBlock = false, const SoftmaxConfig& config = SOFTMAX_DEFAULT_CFG>
__aicore__ inline void SoftMax(const LocalTensor<T>& dstTensor, const LocalTensor<T>& srcTensor, const LocalTensor<uint8_t>& sharedTmpBuffer, const SoftMaxTiling& tiling, const SoftMaxShapeInfo& softmaxShapeInfo = {})
```
- The data types of LocalTensor are the same.
Due to the complex computation involved in the internal implementation of this API, extra temporary space is required to store intermediate variables generated during computation. The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter.
- When the API framework is used for temporary space allocation, developers do not need to allocate the space, but must reserve the required size for the space.
- When the temporary space is passed through the sharedTmpBuffer input parameter, that tensor serves as the temporary space and the API framework does not allocate any. Developers manage the sharedTmpBuffer space themselves and can reuse the buffer after the API call, which avoids repeated allocation and deallocation and improves flexibility and buffer utilization. For details about the memory reuse mode, see Temporary Buffer Shared by Operators and High-Level APIs.
In both cases, developers need to know the required temporary space size (BufferSize): with the API framework, they must reserve that much space; with sharedTmpBuffer, they must allocate a tensor of that size. Obtain the maximum and minimum temporary space sizes using the GetSoftMaxMaxTmpSize/GetSoftMaxMinTmpSize APIs provided in SoftMax/SimpleSoftMax Tiling. The minimum size guarantees correct functionality, while the maximum size is used to improve performance.
Parameters
| Parameter | Description |
|---|---|
| T | Data type of the operand. |
| isReuseSource | This parameter is reserved. Pass the default value false. |
| isBasicBlock | If the shape information and tiling strategy of both srcTensor and dstTensor meet the base block requirements, this parameter can be enabled to improve performance. By default, this parameter is disabled. Use either of the following methods to determine whether the base block requirements are met: |
| isDataFormatNZ | Whether the current input and output data is in NZ format. The default data format is ND, that is, the default value of this parameter is false. |
| config | (Optional) Structure template parameter of the SoftmaxConfig type. The definition is as follows: A configuration example is as follows: This parameter is used together with the tiling computation API in the kernel. Note: After oriSrcM and oriSrcK are set, isBasicBlock does not take effect. In this case, whether the computation data is a base block is determined and processed by the API. |


| Parameter | Input/Output | Description |
|---|---|---|
| dstTensor | Output | Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The shape of dst is the same as that of the source operand src. |
| sumTensor | Output | Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. It is used to store the reducesum result during SoftMax computation. |
| maxTensor | Output | Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. It is used to store the reducemax result during SoftMax computation. |
| srcTensor | Input | Source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The length of the last axis must be 32-byte aligned. |
| sharedTmpBuffer | Input | Temporary space. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. This parameter is used to store intermediate variables during complex internal API computation and is provided by developers. For details about how to obtain the temporary space size (BufferSize), see SoftMax/SimpleSoftMax Tiling. |
| tiling | Input | Tiling information required for SoftMax computation. For details about how to obtain the tiling information, see SoftMax/SimpleSoftMax Tiling. |
| softmaxShapeInfo | Input | Shape of src, of the SoftMaxShapeInfo type. The specific definition is as follows: Note that when the input and output data is in NZ format, the last axis length is the length of the reduce axis, that is, W0 × W1 in Figure 2, and the length of each non-last axis is H0 × H1. |
Returns
None
Restrictions
- The tensor space of src and dst can be reused.
- sumTensor and maxTensor are outputs, where the length of the last axis must be fixed at 32 bytes, and the size of each non-last axis must be consistent with that of src and dst.
- The data types of sumTensor and maxTensor must be the same.
- For details about the operand address alignment requirements, see General Address Alignment Restrictions.
- The address of sharedTmpBuffer must not overlap that of the source or destination operand.
- When srcM != oriSrcM or srcK != oriSrcK in softmaxShapeInfo, the original input (oriSrcM, oriSrcK) on the GM must be padded to (srcM, srcK) in the M or K direction. The padded data takes part in some of the computation. If the input and output buffers are reused, the computation result of the API overwrites the padded data in srcTensor. If they are not reused, the computation result overwrites the data in dstTensor at the positions corresponding to the padding in srcTensor.
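The size rules in the restrictions above reduce to simple arithmetic, sketched here in Python. The helper names are hypothetical. The sketch also assumes that srcK is chosen as the smallest 32-byte-aligned length that holds oriSrcK; the text only requires that the last axis be aligned, so a larger srcK may be equally valid.

```python
BLOCK_BYTES = 32  # alignment unit stated for srcTensor, sumTensor, and maxTensor

def align_last_axis(ori_k, dtype_size):
    """Round the element count ori_k up so the last axis is 32-byte aligned."""
    per_block = BLOCK_BYTES // dtype_size
    return -(-ori_k // per_block) * per_block  # ceiling division, then scale back

def reduce_output_shape(m, dtype_size):
    """Shape of sumTensor/maxTensor: the last axis is fixed at 32 bytes."""
    return [m, BLOCK_BYTES // dtype_size]

def padded_columns(ori_k, dtype_size):
    """Column indices in [oriSrcK, srcK) that hold padding rather than real data."""
    return list(range(ori_k, align_last_axis(ori_k, dtype_size)))

src_k = align_last_axis(60, dtype_size=2)                # half: 60 -> 64 elements
half_reduce_shape = reduce_output_shape(320, dtype_size=2)   # [320, 16]
float_reduce_shape = reduce_output_shape(320, dtype_size=4)  # [320, 8]
pad_cols = padded_columns(60, dtype_size=2)              # [60, 61, 62, 63]
```

For half data with 320 rows, this gives a sumTensor/maxTensor shape of [320, 16], and the columns listed in `pad_cols` are the positions whose output values should be ignored after the call.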
Example
In this example, the input src and output dst have shape [320, 64], the intermediate results sumTensor and maxTensor have shape [320, 16], the data type is half, and the input and output data format is ND. The buffers of src and dst are not reused, and base blocks are disabled. For more operator examples, see softmax operator sample.
```cpp
AscendC::LocalTensor<T> srcLocal = inQueueSrc.DeQue<T>();
AscendC::LocalTensor<T> sumTempLocal = sumQueue.AllocTensor<T>();
AscendC::LocalTensor<T> maxTempLocal = maxQueue.AllocTensor<T>();
AscendC::LocalTensor<T> dstLocal = outQueueDst.AllocTensor<T>();

AscendC::SoftMaxShapeInfo srcShape = {height, width, height, width};
AscendC::SoftMax<T>(dstLocal, sumTempLocal, maxTempLocal, srcLocal, tiling, srcShape);

// Alternatively, use the static_config parameter of the SoftmaxConfig type and pass it
// as a template parameter to turn the shape into a constant value:
// AscendC::SoftMax<T, false, false, false, static_config>(dstLocal, sumTempLocal,
//     maxTempLocal, srcLocal, tiling, srcShape);

outQueueDst.EnQue<T>(dstLocal);
maxQueue.FreeTensor(maxTempLocal);
sumQueue.FreeTensor(sumTempLocal);
inQueueSrc.FreeTensor(srcLocal);
```
