SoftmaxFlashV3

Function Usage

Serves as the enhanced version of SoftmaxFlash and implements the Softmax PASA algorithm. For an input tensor of shape [m0, m1, ..., mt, n] (t ≥ 0), let m be the product of the non-last-axis lengths m0, m1, ..., mt; the input can then be viewed as a tensor of shape [m, n]. This API performs the following computation on the [m, n] input row by row. Different values of update correspond to different formulas, where x, inmax, insum, and inmean are inputs, and M, S, and E are outputs.

  • If update is false, the formulas are expressed by the non-update branch of the pseudocode below.

  • If update is true, the formulas are expressed by the update branch of the pseudocode below.

Currently, this API supports input in ND format only. Internal reduction is performed along the last axis.
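The mean-based shift that the PASA algorithm applies per row does not change the final softmax result, because softmax is invariant to subtracting a per-row constant. The following NumPy sketch illustrates this property only; the `softmax` helper is not part of the API.

```python
import numpy as np

def softmax(x):
    # Numerically safe softmax along the last axis.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = np.array([[1.0, 2.0, 3.0],
              [0.5, 0.0, -0.5]])
# Subtracting any per-row constant (here, the row mean) leaves softmax unchanged.
shifted = x - x.mean(axis=-1, keepdims=True)
print(np.allclose(softmax(x), softmax(shifted)))  # True
```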

For ease of understanding, Python pseudocode is used to express the formulas. In the pseudocode, repeatSize is 64; elementNumPerBlk and BlkcntPerRepeat are 8; splitMeanCnt is 8; src, inmean, inmax, insum, and update are inputs; dst, x_mean, x_sum, x_max, and exp_max are outputs.
def softmax_flash_3(src, height, width, loopCnt, alpha, baseK, inmax=None, insum=None, inmean=None, update=False):
    scalar = alpha / (1 - alpha)
    # (m, n) -> (m, 64)
    tmpbuffer0 = BlockReduceSum(repeatSize, repeatSize, elementNumPerBlk)
    remain = int(width / repeatSize - BlkcntPerRepeat)
    tmpbuffer0 = Add(tmpbuffer0, src, remain, repeatSize * elementNumPerBlk, width)
    # (m, 64) -> (m, 8)
    tmpbuffer0 = BlockReduceSum(1, elementNumPerBlk, elementNumPerBlk)
    # width = baseK * splitMeanCnt
    rowMeanLocal = tmpbuffer0 / baseK
    rowMeanGlobal = np.mean(src, axis=-1, keepdims=True)
    rowMeanGlobalTmp = (rowMeanGlobal - rowMeanLocal) * scalar
    src = src - rowMeanGlobalTmp

    if update == False:
        x_mean = rowMeanGlobal
        maxTmp = np.max(src, axis=-1, keepdims=True)
        shiftCurr = (rowMeanGlobal - x_mean) * scalar
        x_max = shiftCurr + maxTmp
        maxTmp = x_max - shiftCurr
        x_sub = src - maxTmp
        dst = np.exp(x_sub)
        x_sum = np.sum(dst, axis=-1, keepdims=True)
        exp_max = None
        return dst, x_max, x_sum, x_mean, exp_max
    else:
        x_mean = (rowMeanGlobal + inmean * (loopCnt - 1)) / loopCnt
        maxTmp = np.max(src, axis=-1, keepdims=True)
        shiftCurr = (rowMeanGlobal - x_mean) * scalar
        shiftPrev = (inmean - x_mean) * scalar
        x_max = shiftCurr + maxTmp
        maxTmp = shiftPrev + inmax
        x_max = np.max(np.concatenate((x_max, maxTmp), axis=-1), axis=-1, keepdims=True)
        maxTmp = x_max - shiftCurr
        x_sub = src - maxTmp
        dst = np.exp(x_sub)
        exp_max = np.exp(inmax - x_max + shiftPrev)
        x_sum = np.sum(dst, axis=-1, keepdims=True)
        x_sum = exp_max * insum + x_sum
        return dst, x_max, x_sum, x_mean, exp_max
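The update branch above is an online (flash) softmax update: partial results from previous chunks are rescaled by exp_max so that, once all chunks are processed, the combined result equals the softmax over the full row. The self-contained NumPy sketch below demonstrates this invariant for the plain flash update; it omits the PASA mean-shift terms (shiftCurr/shiftPrev) for clarity, and softmax_flash_step is a simplified stand-in, not the API itself.

```python
import numpy as np

def softmax_flash_step(src, inmax=None, insum=None):
    # One simplified online-softmax update step (no PASA mean shift).
    rowmax = np.max(src, axis=-1, keepdims=True)
    if inmax is None:                      # first chunk: no history to merge
        x_max, exp_max = rowmax, None
        dst = np.exp(src - x_max)
        x_sum = dst.sum(axis=-1, keepdims=True)
    else:                                  # update: merge with running max/sum
        x_max = np.maximum(inmax, rowmax)
        exp_max = np.exp(inmax - x_max)    # rescale factor for previous chunks
        dst = np.exp(src - x_max)
        x_sum = exp_max * insum + dst.sum(axis=-1, keepdims=True)
    return dst, x_max, x_sum, exp_max

rng = np.random.default_rng(0)
a, b = rng.normal(size=(4, 64)), rng.normal(size=(4, 64))

d1, m1, s1, _ = softmax_flash_step(a)
d2, m2, s2, e2 = softmax_flash_step(b, inmax=m1, insum=s1)
online = np.concatenate((d1 * e2, d2), axis=-1) / s2  # rescale old chunk, normalize

full = np.concatenate((a, b), axis=-1)
ref = np.exp(full - full.max(axis=-1, keepdims=True))
ref /= ref.sum(axis=-1, keepdims=True)
print(np.allclose(online, ref))  # True
```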

Prototype

  • Allocate the temporary space through the API framework.
    template <typename T, typename U, bool isUpdate = false, bool isReuseSource = false, bool isBasicBlock = false, bool isDataFormatNZ = false, const SoftmaxConfig& config = SOFTMAX_DEFAULT_CFG>
    __aicore__ inline void SoftmaxFlashV3(const LocalTensor<T>& dstTensor, const LocalTensor<U>& meanTensor, const LocalTensor<U>& expSumTensor, const LocalTensor<U>& maxTensor, const LocalTensor<T>& srcTensor, const LocalTensor<T>& expMaxTensor, const LocalTensor<U>& inMeanTensor, const LocalTensor<U>& inExpSumTensor, const LocalTensor<U>& inMaxTensor, const SoftMaxTiling& tiling, const SoftMaxParams& params)
    
  • Pass the temporary space through the sharedTmpBuffer input parameter.
    template <typename T, typename U, bool isUpdate = false, bool isReuseSource = false, bool isBasicBlock = false, bool isDataFormatNZ = false, const SoftmaxConfig& config = SOFTMAX_DEFAULT_CFG>
    __aicore__ inline void SoftmaxFlashV3(const LocalTensor<T>& dstTensor, const LocalTensor<U>& meanTensor,const LocalTensor<U>& expSumTensor, const LocalTensor<U>& maxTensor, const LocalTensor<T>& srcTensor,const LocalTensor<T>& expMaxTensor, const LocalTensor<U>& inMeanTensor, const LocalTensor<U>& inExpSumTensor, const LocalTensor<U>& inMaxTensor, const LocalTensor<uint8_t>& sharedTmpBuffer, const SoftMaxTiling& tiling, const SoftMaxParams& params)
    

Due to the complex computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter.

  • When the API framework is used for temporary space allocation, developers do not need to allocate the space, but must reserve the required size for the space.
  • When the sharedTmpBuffer input parameter is used for passing the temporary space, the tensor serves as the temporary space. In this case, the API framework is not required for temporary space allocation. This enables developers to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization.

If the API framework is used, developers must reserve the temporary space. If sharedTmpBuffer is used, developers must allocate space for the tensor. The method of obtaining the temporary space size (BufferSize) is as follows: Obtain the required minimum and maximum temporary space sizes using the GetSoftMaxFlashV3MaxMinTmpSize API described in SoftmaxFlashV3 Tiling. The minimum space can ensure correct functionality, while the maximum space is used to improve performance.

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Data types of the input srcTensor and expMaxTensor, and the output dstTensor operands.

U

Data types of the input inMeanTensor, inExpSumTensor, and inMaxTensor and the output meanTensor, expSumTensor, and maxTensor operands.

isUpdate

Whether to set update to true in the computation.

isReuseSource

Reserved for future use. The default value false must be used.

isBasicBlock

Reserved for future use. The default value false must be used.

isDataFormatNZ

Reserved for future use. The default value false must be used.

config

Reserved for future use. The default value SOFTMAX_DEFAULT_CFG must be used.

Table 2 API parameters

Parameter

Input/Output

Description

dstTensor

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The shape of dstTensor is the same as that of the source operand srcTensor.

meanTensor

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

It is used to store the mean value result during softmax computation.

  • The length of the last axis of meanTensor is fixed at 32 bytes, that is, the length of a data block. All data in this data block has the same value. For example, in the float data type, all eight numbers in this data block possess an identical value, that is, the mean value obtained after reducesum.
  • The length of each non-last axis is the same as that of dstTensor.

expSumTensor

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

It is used to store the reducesum result during softmax computation.

  • The length of the last axis of expSumTensor is fixed at 32 bytes, that is, the length of a data block. All data in this data block has the same value. For example, in the float data type, all eight numbers in this data block possess an identical value obtained after reducesum.
  • The length of each non-last axis is the same as that of dstTensor.

maxTensor

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

It is used to store the reducemax result during softmax computation.

  • The length of the last axis of maxTensor is fixed at 32 bytes, that is, the length of a data block. All data in this data block has the same value. For example, in the float data type, all eight numbers in this data block possess an identical value obtained after reducemax.
  • The length of each non-last axis is the same as that of dstTensor.

srcTensor

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The length of the last axis must be 32-byte aligned.

expMaxTensor

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

  • The length of the last axis of expMaxTensor is fixed at 32 bytes, that is, the length of a data block. All data in this data block has the same value. For example, in the half data type, all 16 numbers in this data block possess an identical value.
  • The length of each non-last axis is the same as that of dstTensor.

inMeanTensor

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

Mean value required for softmax computation.

  • The length of the last axis of inMeanTensor is fixed at 32 bytes, that is, the length of a data block. All data in this data block has the same value. For example, in the float data type, all eight numbers in this data block possess an identical value.
  • The length of each non-last axis is the same as that of dstTensor.

inExpSumTensor

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

Sum value required for softmax computation.

  • The length of the last axis of inExpSumTensor is fixed at 32 bytes, that is, the length of a data block. All data in this data block has the same value. For example, in the float data type, all eight numbers in this data block possess an identical value.
  • The length of each non-last axis is the same as that of dstTensor.

inMaxTensor

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

Max value required for softmax computation.

  • The length of the last axis of inMaxTensor is fixed at 32 bytes, that is, the length of a data block. All data in this data block has the same value. For example, in the float data type, all eight numbers in this data block possess an identical value.
  • The length of each non-last axis is the same as that of dstTensor.

sharedTmpBuffer

Input

Temporary space.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The data type of this operand is fixed at uint8_t.

This parameter is used to store intermediate variables during complex internal API computation and is provided by developers.

For details about how to obtain the temporary space size (BufferSize), see SoftmaxFlashV3 Tiling.

tiling

Input

Tiling information required for SoftmaxFlashV3 computation. For details about how to obtain the tiling information, see SoftmaxFlashV3 Tiling.

params

Input

Shape information and computation parameters of srcTensor. It is of the SoftMaxParams type. The specific definition is as follows:

struct SoftMaxParams {
    uint32_t srcM;          // Product of lengths of non-last axes.
    uint32_t srcK;          // Length of the last axis, which must be 32-byte aligned.
    uint32_t oriSrcM;       // Product of lengths of original non-last axes.
    uint32_t oriSrcK;       // Length of the original last axis.
    uint32_t loopCnt;       // loopCnt value in the formula when update is true. Must be greater than or equal to 1.
    uint32_t splitMeanCnt;  // Number of blocks for calculating the mean value of each row. Currently, the value can only be 8.
    float alpha;            // Computation parameter in the formula.
};

Note: This API does not support the non-alignment scenario. Therefore, srcM is equal to oriSrcM, and srcK is equal to oriSrcK.

Returns

None

Availability

Precautions

  • For details about the alignment requirements of the operand address offset, see General Restrictions.
  • For the input srcTensor, the last-axis length n must be greater than or equal to 512 and a multiple of 64. The product m of the non-last-axis lengths must be a multiple of 8.
  • The tensor space of srcTensor and dstTensor, meanTensor and inMeanTensor, maxTensor and inMaxTensor, and expSumTensor and inExpSumTensor can be reused.
  • For meanTensor, expSumTensor, maxTensor, expMaxTensor, inMeanTensor, inExpSumTensor, and inMaxTensor, the length of the last axis must be 32 bytes.

Example

In this example, the shape of the input srcTensor and output dstTensor is [8, 1024]. The shape of the input inMeanTensor, inExpSumTensor, and inMaxTensor is [8, 8], and the data type is float. The shape of the output expMaxTensor is [8, 16], and the data type is half. The input and output data format is ND. The space of srcTensor and dstTensor is not reused. isUpdate is true.
#include "kernel_operator.h"

template <typename T, typename U>
class KernelSoftmaxFlashV3 {
public:
    __aicore__ inline KernelSoftmaxFlashV3() {}
    __aicore__ inline void Init(__gm__ uint8_t *srcGm, __gm__ uint8_t *inMaxGm, __gm__ uint8_t *inSumGm,
       __gm__ uint8_t *inMeanGm, __gm__ uint8_t *dstGm, const SoftMaxTiling &tilingData)
    {
        srcGlobal.SetGlobalBuffer((__gm__ T *)srcGm);
        dstGlobal.SetGlobalBuffer((__gm__ T *)dstGm);
        maxGlobal.SetGlobalBuffer((__gm__ U *)inMaxGm);
        sumGlobal.SetGlobalBuffer((__gm__ U *)inSumGm);
        meanGlobal.SetGlobalBuffer((__gm__ U *)inMeanGm);
        pipe.InitBuffer(inQueueSrc, 1, height * width * sizeof(T));
        elementNumPerBlk1 = 32 / sizeof(U);
        pipe.InitBuffer(maxQueue, 1, height * elementNumPerBlk1 * sizeof(U));
        pipe.InitBuffer(sumQueue, 1, height * elementNumPerBlk1 * sizeof(U));
        pipe.InitBuffer(meanQueue, 1, height * elementNumPerBlk1 * sizeof(U));
        elementNumPerBlk2 = 32 / sizeof(T);
        pipe.InitBuffer(expMaxQueue, 1, height * elementNumPerBlk2 * sizeof(T));
        pipe.InitBuffer(outQueueDst, 1, height * width * sizeof(T));
        tiling = tilingData;
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> srcLocal = inQueueSrc.AllocTensor<T>();
        AscendC::LocalTensor<U> insumLocal = sumQueue.AllocTensor<U>();
        AscendC::LocalTensor<U> inmaxLocal = maxQueue.AllocTensor<U>();
        AscendC::LocalTensor<U> inmeanLocal = meanQueue.AllocTensor<U>();
        AscendC::DataCopy(srcLocal, srcGlobal, height * width);
        AscendC::DataCopy(insumLocal, sumGlobal, height * elementNumPerBlk1);
        AscendC::DataCopy(inmaxLocal, maxGlobal, height * elementNumPerBlk1);
        AscendC::DataCopy(inmeanLocal, meanGlobal, height * elementNumPerBlk1);
        inQueueSrc.EnQue(srcLocal);
        sumQueue.EnQue(insumLocal);
        maxQueue.EnQue(inmaxLocal);
        meanQueue.EnQue(inmeanLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> srcLocal = inQueueSrc.DeQue<T>();
        AscendC::LocalTensor<U> insumLocal = sumQueue.DeQue<U>();
        AscendC::LocalTensor<U> inmaxLocal = maxQueue.DeQue<U>();
        AscendC::LocalTensor<U> inmeanLocal = meanQueue.DeQue<U>();
        AscendC::LocalTensor<T> expMaxTensor = expMaxQueue.AllocTensor<T>();
        AscendC::LocalTensor<T> dstLocal = outQueueDst.AllocTensor<T>();
        AscendC::SoftMaxParams params = {height, width, height, width, loopCnt, splitMeanCnt, alpha};
        AscendC::SoftmaxFlashV3<T, U, true>(dstLocal, inmeanLocal, insumLocal, inmaxLocal, srcLocal, expMaxTensor, inmeanLocal, insumLocal, inmaxLocal, tiling, params);

        outQueueDst.EnQue<T>(dstLocal);
        expMaxQueue.FreeTensor(expMaxTensor);
        maxQueue.FreeTensor(inmaxLocal);
        sumQueue.FreeTensor(insumLocal);
        meanQueue.FreeTensor(inmeanLocal);
        inQueueSrc.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> dstLocal = outQueueDst.DeQue<T>();
        AscendC::DataCopy(dstGlobal, dstLocal, height * width);
        outQueueDst.FreeTensor(dstLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> meanQueue;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> maxQueue;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> sumQueue;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> expMaxQueue;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<T> srcGlobal, dstGlobal;
    AscendC::GlobalTensor<U> meanGlobal, maxGlobal, sumGlobal;
    uint32_t elementNumPerBlk1 = 0;
    uint32_t elementNumPerBlk2 = 0;
    uint32_t width = 1024;
    uint32_t height = 8;
    uint32_t loopCnt = 2;
    uint32_t splitMeanCnt = 8;
    float alpha = 0.9375;
    SoftMaxTiling tiling;
};

extern "C" __global__ __aicore__ void softmax_flashv3_kernel(__gm__ uint8_t *srcGm,
    __gm__ uint8_t *inMaxGm, __gm__ uint8_t *inSumGm, __gm__ uint8_t *inMeanGm, __gm__ uint8_t *dstGm, __gm__ uint8_t *tiling)
{
    GET_TILING_DATA(tilingData, tiling);
    KernelSoftmaxFlashV3<half, float> op;
    op.Init(srcGm, inMaxGm, inSumGm, inMeanGm, dstGm, tilingData.softmaxTilingData);
    op.Process();
}