SoftmaxFlash

Applicability

Product	Supported
Atlas A3 training products/Atlas A3 inference products	√
Atlas A2 training products/Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
Atlas inference product's AI Core	√
Atlas inference product's Vector Core	x
Atlas training products	x

Function

Note: This API will be deprecated in the future. Use SoftmaxFlashV2 instead, which offers better precision and performance.

Serves as an enhanced version of SoftMax. In addition to performing softmax computation on the input tensor, it can update the result of the current softmax computation based on the sum and max values obtained in the previous softmax computation. When the last axis is tiled, each reduce operation covers only part of the axis rather than the entire axis. In this case, use this API to update the current result with the sum and max values from the previous computation.

Currently, only input shapes in ND format are supported; the NZ format is not supported. The internal reduce process is performed along the last axis. When update is disabled, this API is equivalent to SoftMax.

For ease of understanding, the formula expressed through a Python script is as follows, where src, inmax, insum, and update are inputs, and dst, x_sum, x_max, and exp_max are outputs.

import numpy as np

def softmax_flash(src, inmax=None, insum=None, update=None):
    if update is None:
        # Perform rowmax (taking the maximum value by row) processing along the last axis.
        x_max = np.max(src, axis=-1, keepdims=True)
        x_sub = src - x_max
        x_exp = np.exp(x_sub)
        # Perform rowsum (taking the sum by row) processing along the last axis.
        x_sum = np.sum(x_exp, axis=-1, keepdims=True)
        dst = x_exp / x_sum
        exp_max = None
        return dst, x_max, x_sum, exp_max
    else:
        # Combine inmax and src to obtain rowmax.
        x_max = np.max(np.concatenate((inmax, src), axis=-1), axis=-1, keepdims=True)
        x_exp = np.exp(src - x_max)
        x_sum = np.sum(x_exp, axis=-1, keepdims=True)
        # Rescale the previous sum to the new maximum and add the current sum.
        exp_max = np.exp(inmax - x_max)
        x_sum = exp_max * insum + x_sum
        # Factor that rescales results normalized with the previous max and sum.
        exp_max = exp_max * insum / x_sum
        dst = x_exp / x_sum
        return dst, x_max, x_sum, exp_max
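
The following is a minimal NumPy sketch of the last-axis tiling scenario, not device code. It restates the reference formula compactly, splits the last axis of a random input into two halves, and processes the second half with update enabled. Multiplying the first half's dst by exp_max follows algebraically from the formula; treating this as the intended use of exp_max is an assumption, as are the shapes and the names first, second, and tiled.

```python
import numpy as np

def softmax_flash(src, inmax=None, insum=None, update=None):
    # Compact restatement of the reference formula (update=None means no update).
    if update is None:
        x_max = np.max(src, axis=-1, keepdims=True)
        x_exp = np.exp(src - x_max)
        x_sum = np.sum(x_exp, axis=-1, keepdims=True)
        return x_exp / x_sum, x_max, x_sum, None
    x_max = np.max(np.concatenate((inmax, src), axis=-1), axis=-1, keepdims=True)
    x_exp = np.exp(src - x_max)
    exp_max = np.exp(inmax - x_max)
    x_sum = exp_max * insum + np.sum(x_exp, axis=-1, keepdims=True)
    exp_max = exp_max * insum / x_sum
    return x_exp / x_sum, x_max, x_sum, exp_max

# Hypothetical input: 8 rows, last axis of length 128, tiled into two halves.
rng = np.random.default_rng(0)
src = rng.standard_normal((8, 128))
first, second = src[:, :64], src[:, 64:]

# First tile: plain softmax. Second tile: update with the first tile's max and sum.
dst1, max1, sum1, _ = softmax_flash(first)
dst2, max2, sum2, exp_max = softmax_flash(second, inmax=max1, insum=sum1, update=True)

# Reference: softmax over the entire last axis in one pass.
full, full_max, full_sum, _ = softmax_flash(src)

# Rescale the first tile's result by exp_max, then join both tiles.
tiled = np.concatenate((dst1 * exp_max, dst2), axis=-1)
```

In this sketch, max2 and sum2 match the max and sum computed over the whole axis, and tiled matches the one-pass softmax up to floating-point rounding.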

Prototype

Allocate the temporary space through the API framework.

template <typename T, bool isReuseSource = false, bool isBasicBlock = false>
__aicore__ inline void SoftmaxFlash(const LocalTensor<T>& dstTensor, const LocalTensor<T>& sumTensor, const LocalTensor<T>& maxTensor, const LocalTensor<T>& srcTensor, const LocalTensor<T>& expMaxTensor, const LocalTensor<T>& inSumTensor, const LocalTensor<T>& inMaxTensor, const SoftMaxTiling& tiling, bool isUpdate = false, const SoftMaxShapeInfo& softmaxShapeInfo = {})

template <typename T, bool isReuseSource = false, bool isBasicBlock = false>
__aicore__ inline void SoftmaxFlash(const LocalTensor<half>& dstTensor, const LocalTensor<float>& sumTensor, const LocalTensor<float>& maxTensor, const LocalTensor<half>& srcTensor, const LocalTensor<half>& expMaxTensor, const LocalTensor<float>& inSumTensor, const LocalTensor<float>& inMaxTensor, const SoftMaxTiling& tiling, bool isUpdate = false, const SoftMaxShapeInfo& softmaxShapeInfo = {})

Pass the temporary space through the sharedTmpBuffer input parameter.

template <typename T, bool isReuseSource = false, bool isBasicBlock = false>
__aicore__ inline void SoftmaxFlash(const LocalTensor<T>& dstTensor, const LocalTensor<T>& sumTensor, const LocalTensor<T>& maxTensor, const LocalTensor<T>& srcTensor, const LocalTensor<T>& expMaxTensor, const LocalTensor<T>& inSumTensor, const LocalTensor<T>& inMaxTensor, const LocalTensor<uint8_t>& sharedTmpBuffer, const SoftMaxTiling& tiling, bool isUpdate = false, const SoftMaxShapeInfo& softmaxShapeInfo = {})

template <typename T, bool isReuseSource = false, bool isBasicBlock = false>
__aicore__ inline void SoftmaxFlash(const LocalTensor<half>& dstTensor, const LocalTensor<float>& sumTensor, const LocalTensor<float>& maxTensor, const LocalTensor<half>& srcTensor, const LocalTensor<half>& expMaxTensor, const LocalTensor<float>& inSumTensor, const LocalTensor<float>& inMaxTensor, const LocalTensor<uint8_t>& sharedTmpBuffer, const SoftMaxTiling& tiling, bool isUpdate = false, const SoftMaxShapeInfo& softmaxShapeInfo = {})

Due to the complex computation involved in the internal implementation of this API, extra temporary space is required to store intermediate variables generated during computation. The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter.

When the API framework is used for temporary space allocation, developers do not need to allocate the space, but must reserve the required size for the space.

When the sharedTmpBuffer input parameter is used for passing the temporary space, the tensor serves as the temporary space. In this case, the API framework is not required for temporary space allocation. This enables developers to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization.

If the API framework is used, developers must reserve the temporary space. If sharedTmpBuffer is used, developers must allocate space for the tensor. In either case, obtain the temporary space size (BufferSize) by calling the GetSoftMaxFlashMaxTmpSize/GetSoftMaxFlashMinTmpSize APIs provided in SoftmaxFlash Tiling, which return the required maximum and minimum temporary space sizes. The minimum size ensures correct functionality, while the maximum size is intended to improve performance.
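
As an illustration of how the two sizes might be used together, the sketch below picks a buffer size from the space that is available. The helper choose_tmp_size is hypothetical and not part of the API. It takes plain integers standing in for the values returned by GetSoftMaxFlashMinTmpSize and GetSoftMaxFlashMaxTmpSize, and it assumes that any size between the minimum and the maximum is acceptable, which the text above does not state explicitly.

```python
def choose_tmp_size(min_size, max_size, available):
    """Hypothetical helper: choose a temporary-buffer size in bytes.

    min_size and max_size stand in for the values reported by the
    GetSoftMaxFlashMinTmpSize and GetSoftMaxFlashMaxTmpSize APIs.
    """
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")
    if available < min_size:
        # Below the minimum, correct functionality is not ensured.
        raise ValueError("available space is smaller than the required minimum")
    # Space beyond the maximum is assumed to bring no further benefit.
    return min(available, max_size)
```

For example, with a minimum of 1024 bytes and a maximum of 8192 bytes, 4096 available bytes would all be used, while 16384 available bytes would be capped at 8192.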

Parameters

Table 1 Template parameters
Parameter	Description
T	Data type of the operand. For the Atlas A3 training products/Atlas A3 inference products, the Atlas A2 training products/Atlas A2 inference products, and the Atlas inference product's AI Core, the supported data types are half and float.
isReuseSource	This parameter is reserved. Pass the default value false.
isBasicBlock	If the shape information and tiling strategy of both srcTensor and dstTensor meet the base block requirements, this parameter can be enabled to improve performance. By default, this parameter is disabled. Use either of the following methods to determine whether the base block requirements are met: (1) Check the shape information [m, n] of srcTensor and dstTensor: the last axis length n is less than 2048, greater than or equal to 256/sizeof(T) (that is, at least 128 when the data type is half and at least 64 when the data type is float), and a multiple of 64; the product m of the non-last axis lengths is a multiple of 8. (2) Call IsBasicBlockInSoftMax to check whether the tiling strategy meets the tiling requirements of base blocks.
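
The shape requirements for isBasicBlock can be expressed as a short check. This is a sketch of the shape conditions only: the function meets_basic_block_shape is a hypothetical name, dtype_size stands for sizeof(T) in bytes (2 for half, 4 for float), and the tiling-strategy check performed by IsBasicBlockInSoftMax is not reproduced here.

```python
def meets_basic_block_shape(m, n, dtype_size):
    """Hypothetical check of the base block shape requirements.

    m: product of the non-last axis lengths.
    n: last axis length.
    dtype_size: sizeof(T) in bytes (2 for half, 4 for float).
    """
    min_n = 256 // dtype_size  # 128 for half, 64 for float
    return (
        min_n <= n < 2048      # n in [256/sizeof(T), 2048)
        and n % 64 == 0        # n is a multiple of 64
        and m % 8 == 0         # m is a multiple of 8
    )
```

For instance, a half tensor of shape [8, 64] fails the check because n is below 128, whereas a float tensor of the same shape passes.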

Table 2 API parameters

Parameter	Input/Output	Description
dstTensor	Output	Destination operand.