CumSum

Applicability

Product	Supported/Unsupported
Atlas A3 training products/Atlas A3 inference products	√
Atlas A2 training products/Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
Atlas inference product's AI Core	√
Atlas inference product's Vector Core	x
Atlas training products	x

Function

Computes the cumulative sum of an input tensor along the first axis (row by row) or the last axis (column by column). Each output element is the sum of the input element at the same position and all preceding elements along that axis.

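The formula is as follows, where $\text{src}_0, \text{src}_1, \dots$ are the elements along the accumulation axis.

$\text{dst}_i = \sum_{k=0}^{i} \text{src}_k$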

Row-by-row accumulation algorithm
- First axis processing (or row-wise accumulation): The first row remains unchanged, and the subsequent rows are accumulated in sequence. The formula for calculating the element in the ith row and jth column of the output is as follows.
  $\text{dst}[i][j] = \sum_{k=0}^{i} \text{src}[k][j]$
  
  Example: If the input tensor is [[0, 1, 2], [3, 4, 5]], the output tensor is [[0, 1, 2], [3, 5, 7]].
- Last axis processing (or column-wise accumulation): The first column remains unchanged, and the subsequent columns are accumulated in sequence. The formula for calculating the element in the ith row and jth column of the output is as follows.
  $\text{dst}[i][j] = \sum_{k=0}^{j} \text{src}[i][k]$
  
  Example: If the input tensor is [[0, 1, 2], [3, 4, 5]], the output tensor is [[0, 1, 3], [3, 7, 12]].
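
The two accumulation modes above can be sketched as a host-side reference in plain C++. This is only an illustrative model for checking expected results, not the device implementation; the function name CumSumReference and the row-major std::vector layout are assumptions introduced here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not part of the Ascend C API).
// src is a row-major outer x inner matrix; returns the cumulative sum
// along the last axis (isLastAxis = true) or the first axis (false).
std::vector<float> CumSumReference(const std::vector<float>& src,
                                   std::size_t outer, std::size_t inner,
                                   bool isLastAxis)
{
    std::vector<float> dst(src);
    for (std::size_t i = 0; i < outer; ++i) {
        for (std::size_t j = 0; j < inner; ++j) {
            if (isLastAxis) {
                // Column-wise accumulation: add the running sum to the left.
                if (j > 0) dst[i * inner + j] += dst[i * inner + j - 1];
            } else {
                // Row-wise accumulation: add the running sum from the row above.
                if (i > 0) dst[i * inner + j] += dst[(i - 1) * inner + j];
            }
        }
    }
    return dst;
}
```

Calling CumSumReference on the flattened input {0, 1, 2, 3, 4, 5} with outer = 2 and inner = 3 reproduces the two examples above.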

Prototype

Pass the temporary space through the sharedTmpBuffer input parameter.

template <typename T, const CumSumConfig& config = defaultCumSumConfig>
__aicore__ inline void CumSum(LocalTensor<T>& dstTensor, LocalTensor<T>& lastRowTensor, const LocalTensor<T>& srcTensor, LocalTensor<uint8_t>& sharedTmpBuffer, const CumSumInfo& cumSumInfo)

Allocate the temporary space through the API framework.

template <typename T, const CumSumConfig& config = defaultCumSumConfig>
__aicore__ inline void CumSum(LocalTensor<T>& dstTensor, LocalTensor<T>& lastRowTensor, const LocalTensor<T>& srcTensor, const CumSumInfo& cumSumInfo)

Precision conversion is involved in the internal implementation of this API. Therefore, extra temporary space is required to store intermediate variables during computation. The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter.

If the API framework is used for temporary space allocation, developers do not need to request allocation of the space, but must reserve the required size for the space.

If temporary space is passed through the sharedTmpBuffer input parameter, the tensor serves as the temporary space. In this case, the API framework is not required for temporary space allocation. This enables developers to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization.

In both cases the temporary space must be sized correctly: reserve it when the API framework performs the allocation, or allocate the tensor yourself when passing sharedTmpBuffer. To obtain the required temporary space size (BufferSize), use the API provided in GetCumSumMaxMinTmpSize.

Parameters

Table 1 Template parameters

Parameter

Description

T

Data type of the operand.

For the Atlas A3 training products/Atlas A3 inference products, the supported data types are half and float.

For the Atlas A2 training products/Atlas A2 inference products, the supported data types are half and float.

For the Atlas inference product's AI Core, the supported data types are half and float.

config

Parameters for compiling the CumSum API.

struct CumSumConfig {
    bool isLastAxis{true};
    bool isReuseSource{false};
    bool outputLastRow{false};
};

isLastAxis: If the value is true, the last axis is used for computation. If the value is false, the first axis is used for computation.
isReuseSource: Whether srcTensor's buffer can be reused. This parameter is reserved. You can set it to the default value false.
outputLastRow: Whether to output the last row of data.
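
Because the struct shown above is a plain aggregate, positional initialization follows the declaration order isLastAxis, isReuseSource, outputLastRow. The sketch below uses a local mirror of the struct (named CumSumConfigMirror here, an assumption so that it compiles without the Ascend C headers) to show how a few initializer lists map to the fields.

```cpp
// Local mirror of the CumSumConfig layout shown above, so this sketch
// compiles on a host without the Ascend C headers (illustrative only).
struct CumSumConfigMirror {
    bool isLastAxis{true};
    bool isReuseSource{false};
    bool outputLastRow{false};
};

// All defaults: last-axis accumulation, no last-row output.
constexpr CumSumConfigMirror kDefaultLike{};
// First-axis accumulation, no last-row output.
constexpr CumSumConfigMirror kFirstAxis{false, false, false};
// Last-axis accumulation that also writes the last row.
constexpr CumSumConfigMirror kLastAxisWithLastRow{true, false, true};
```

In kernel code the same initializer lists would be applied to the real CumSumConfig type and passed as the config template argument.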

Table 2 API parameters

Parameter

Input/Output

Description

dstTensor

Output

Destination operand. The input elements are processed along the first axis or the last axis and their cumulative sum is calculated.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

lastRowTensor

Output

Destination operand. If outputLastRow in config is set to true, the last row of data is output.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

srcTensor

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

sharedTmpBuffer

Input

Temporary buffer.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

This parameter is used to store intermediate variables during complex computation in CumSum and is provided by developers.

For details about how to obtain the temporary space size (BufferSize), see GetCumSumMaxMinTmpSize.

cumSumInfo

Input

Shape information of srcTensor, of type CumSumInfo. The type is defined as follows:

struct CumSumInfo
{
    uint32_t outter{0};    // outer axis length of the input data
    uint32_t inner{0};     // inner axis length of the input data
};

Note:

Both cumSumInfo.outter and cumSumInfo.inner must be greater than 0.
cumSumInfo.outter × cumSumInfo.inner cannot be greater than the number of elements in dstTensor or srcTensor.
The product of cumSumInfo.inner × sizeof(T) must be an integer multiple of 32 bytes.
If outputLastRow in the template parameter config is set to true, cumSumInfo.inner cannot be greater than the number of elements in lastRowTensor, which receives the last row of data.
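
The constraints in this note can be checked on the host before launching a kernel. The following is a minimal sketch under stated assumptions: IsValidCumSumShape and its parameters are hypothetical names introduced here, and tensor capacities are given as element counts.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical host-side check of the cumSumInfo constraints listed above.
// srcElems, dstElems and lastRowElems are tensor capacities in elements;
// elemSize is sizeof(T) in bytes (2 for half, 4 for float).
bool IsValidCumSumShape(std::uint32_t outter, std::uint32_t inner,
                        std::size_t elemSize, std::size_t srcElems,
                        std::size_t dstElems, bool outputLastRow,
                        std::size_t lastRowElems)
{
    // Both axis lengths must be greater than 0.
    if (outter == 0 || inner == 0) return false;
    // outter x inner must fit in both srcTensor and dstTensor.
    const std::uint64_t total = static_cast<std::uint64_t>(outter) * inner;
    if (total > srcElems || total > dstElems) return false;
    // inner x sizeof(T) must be an integer multiple of 32 bytes.
    if ((static_cast<std::uint64_t>(inner) * elemSize) % 32 != 0) return false;
    // When the last row is output, it must fit in lastRowTensor.
    if (outputLastRow && inner > lastRowElems) return false;
    return true;
}
```

For example, a 2 × 16 half tensor passes the check, while a 2 × 3 float tensor fails because 3 × 4 bytes is not a multiple of 32.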

Returns

None

Restrictions

For details about the operand address alignment requirements, see General Address Alignment Restrictions.
The input data must be two-dimensional.
The product of cumSumInfo.inner × sizeof(T) must be an integer multiple of 32 bytes.
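
When the logical row length does not satisfy the 32-byte restriction, one option is to round the inner length up to the next aligned value and pad each row. The helper below is an illustrative sketch only; AlignInner is a hypothetical name, and how the padded data is laid out in the tensor is left to the kernel.

```cpp
#include <cstdint>

// Hypothetical helper: smallest length >= inner whose byte size
// (length * elemSize) is a multiple of 32. Assumes elemSize divides 32,
// which holds for half (2 bytes) and float (4 bytes).
constexpr std::uint32_t AlignInner(std::uint32_t inner, std::uint32_t elemSize)
{
    const std::uint32_t elemsPerBlock = 32 / elemSize;
    return (inner + elemsPerBlock - 1) / elemsPerBlock * elemsPerBlock;
}
```

Padding placed at the end of each row does not change the cumulative sums of the valid elements that precede it, so the valid part of the output can be copied out unchanged.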

Example

#include "kernel_operator.h"

template <typename T, const AscendC::CumSumConfig& CONFIG>
class KernelCumSum
{
public:
    __aicore__ inline KernelCumSum(){}
    __aicore__ inline void Init(
        GM_ADDR srcGm, GM_ADDR dstGm, GM_ADDR lastRowGm, const AscendC::CumSumInfo& cumSumParams)
    {
        outer = cumSumParams.outter;
        inner = cumSumParams.inner;
        srcGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ T *>(srcGm), outer * inner);
        dstGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ T *>(dstGm), outer * inner);
        lastRowGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ T *>(lastRowGm), inner);
        pipe.InitBuffer(inQueueX, 1, outer * inner * sizeof(T));
        pipe.InitBuffer(outQueue, 1, outer * inner * sizeof(T));
        pipe.InitBuffer(lastRowQueue, 1, inner * sizeof(T));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> srcLocal = inQueueX.AllocTensor<T>();
        AscendC::DataCopy(srcLocal, srcGlobal, outer * inner);
        inQueueX.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> dstLocal = outQueue.AllocTensor<T>();
        AscendC::LocalTensor<T> lastRowLocal = lastRowQueue.AllocTensor<T>();
        AscendC::LocalTensor<T> srcLocal = inQueueX.DeQue<T>();
        
        const AscendC::CumSumInfo cumSumInfo{outer, inner};
        AscendC::CumSum<T, CONFIG>(dstLocal, lastRowLocal, srcLocal, cumSumInfo);
        outQueue.EnQue<T>(dstLocal);
        lastRowQueue.EnQue<T>(lastRowLocal);
        inQueueX.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> dstLocal = outQueue.DeQue<T>();
        AscendC::DataCopy(dstGlobal, dstLocal, outer * inner);
        outQueue.FreeTensor(dstLocal);
        AscendC::LocalTensor<T> lastRowLocal = lastRowQueue.DeQue<T>();
        AscendC::DataCopy(lastRowGlobal, lastRowLocal, inner);
        lastRowQueue.FreeTensor(lastRowLocal);
    }

private:
    AscendC::GlobalTensor<T> srcGlobal;
    AscendC::GlobalTensor<T> dstGlobal;
    AscendC::GlobalTensor<T> lastRowGlobal;
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueX;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueue;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> lastRowQueue;
    uint32_t outer{1};
    uint32_t inner{1};
};

constexpr AscendC::CumSumConfig cumSumConfig{true, false, true};

template <typename T>
__aicore__ inline void kernel_cumsum_operator(
    GM_ADDR srcGm, GM_ADDR dstGm, GM_ADDR lastRowGm, const AscendC::CumSumInfo &cumSumParams)
{
    KernelCumSum<T, cumSumConfig> op;
    op.Init(srcGm, dstGm, lastRowGm, cumSumParams);
    op.Process();
}
