CumSum
Applicability
| Product | Supported/Unsupported |
|---|---|
| | √ |
| | √ |
| | x |
| | √ |
| | x |
| | x |
Function
Computes the cumulative sum of a two-dimensional input tensor along its first axis (row by row) or its last axis (column by column). Each output element is the sum of the input element at the same position and all elements that precede it along the chosen axis.
The formula is as follows.

- Row-by-row accumulation algorithm
- First axis processing (row-wise accumulation): The first row remains unchanged, and each subsequent row is added to the accumulated result of the rows above it. With zero-based indices, the element in row i and column j of the output is:

$$\mathrm{dst}_{i,j} = \sum_{k=0}^{i} \mathrm{src}_{k,j}$$

Example: If the input tensor is [[0, 1, 2], [3, 4, 5]], the output tensor is [[0, 1, 2], [3, 5, 7]].
- Last axis processing (column-wise accumulation): The first column remains unchanged, and each subsequent column is added to the accumulated result of the columns before it. With zero-based indices, the element in row i and column j of the output is:

$$\mathrm{dst}_{i,j} = \sum_{k=0}^{j} \mathrm{src}_{i,k}$$

Example: If the input tensor is [[0, 1, 2], [3, 4, 5]], the output tensor is [[0, 1, 3], [3, 7, 12]].
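The two accumulation modes above can be checked on the host with a small reference model. The following is a minimal sketch in plain C++, not the Ascend C kernel API: the function name `CumSumReference` and its `lastAxis` flag are hypothetical and introduced only for illustration, and the input is assumed to be a row-major `outer` x `inner` matrix of `float`.

```cpp
#include <cstddef>
#include <vector>

// Host-side reference model (hypothetical helper, not part of Ascend C).
// src is a row-major matrix with `outer` rows and `inner` columns.
//   lastAxis == false: first axis processing, dst[i][j] = src[0][j] + ... + src[i][j]
//   lastAxis == true : last axis processing,  dst[i][j] = src[i][0] + ... + src[i][j]
std::vector<float> CumSumReference(const std::vector<float>& src,
                                   std::size_t outer, std::size_t inner,
                                   bool lastAxis)
{
    std::vector<float> dst(src);
    for (std::size_t i = 0; i < outer; ++i) {
        for (std::size_t j = 0; j < inner; ++j) {
            const std::size_t idx = i * inner + j;
            if (lastAxis) {
                // Add the running sum of the element to the left; column 0 is unchanged.
                if (j > 0) {
                    dst[idx] += dst[idx - 1];
                }
            } else {
                // Add the running sum of the element directly above; row 0 is unchanged.
                if (i > 0) {
                    dst[idx] += dst[idx - inner];
                }
            }
        }
    }
    return dst;
}
```

For the input [[0, 1, 2], [3, 4, 5]], this model reproduces both examples given above: [[0, 1, 2], [3, 5, 7]] for first axis processing and [[0, 1, 3], [3, 7, 12]] for last axis processing. A model like this can serve as golden data when comparing against kernel output, within the precision limits of the data type in use.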
Prototype
- Pass the temporary space through the sharedTmpBuffer input parameter.
```cpp
template <typename T, const CumSumConfig& config = defaultCumSumConfig>
__aicore__ inline void CumSum(LocalTensor<T>& dstTensor, LocalTensor<T>& lastRowTensor, const LocalTensor<T>& srcTensor, LocalTensor<uint8_t>& sharedTmpBuffer, const CumSumInfo& cumSumInfo)
```
- Allocate the temporary space through the API framework.
```cpp
template <typename T, const CumSumConfig& config = defaultCumSumConfig>
__aicore__ inline void CumSum(LocalTensor<T>& dstTensor, LocalTensor<T>& lastRowTensor, const LocalTensor<T>& srcTensor, const CumSumInfo& cumSumInfo)
```
Precision conversion is involved in the internal implementation of this API. Therefore, extra temporary space is required to store intermediate variables during computation. The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter.
- If the API framework allocates the temporary space, developers do not need to allocate it themselves but must reserve enough space for it.
- If the temporary space is passed through the sharedTmpBuffer input parameter, that tensor serves as the temporary space and the API framework allocates nothing. Developers manage the sharedTmpBuffer space themselves and can reuse the buffer after the call, which avoids repeated allocation and deallocation and improves flexibility and buffer utilization.
In either case, the required temporary space size (BufferSize) is obtained through the API described in GetCumSumMaxMinTmpSize.
Parameters
| Parameter | Description |
|---|---|
| T | Data type of the operand. |
| config | Compile-time configuration of the CumSum API, of type CumSumConfig. |

| Parameter | Input/Output | Description |
|---|---|---|
| dstTensor | Output | Destination operand. Holds the cumulative sum of the input elements along the first axis or the last axis. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| lastRowTensor | Output | Destination operand. If outputLastRow in config is set to true, the last row of data is output to this tensor. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| srcTensor | Input | Source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| sharedTmpBuffer | Input | Temporary buffer provided by the developer to store intermediate variables during computation in CumSum. The type is LocalTensor<uint8_t>, and the supported TPosition is VECIN, VECCALC, or VECOUT. For details about how to obtain the temporary space size (BufferSize), see GetCumSumMaxMinTmpSize. |
| cumSumInfo | Input | Shape of srcTensor, of type CumSumInfo. Its fields outter and inner, as used in the example, give the outer-axis length (number of rows) and the inner-axis length (number of elements per row) of srcTensor. Note: cumSumInfo.inner × sizeof(T) must be an integer multiple of 32 bytes. |
Returns
None
Restrictions
- For details about the operand address alignment requirements, see General Address Alignment Restrictions.
- The input data must be two-dimensional.
- cumSumInfo.inner × sizeof(T) must be an integer multiple of 32 bytes.
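The 32-byte restriction on the inner axis can be validated on the host before a kernel is launched. The following is a minimal sketch in plain C++, not part of the CumSum API: the helper names `IsInnerAligned` and `AlignInner` and the constant `kBlockBytes` are hypothetical, and the sketch assumes the element size divides 32 bytes.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical host-side helpers; these names are not part of the CumSum API.
constexpr std::uint32_t kBlockBytes = 32;

// True when inner * sizeof(T) is an integer multiple of 32 bytes.
template <typename T>
constexpr bool IsInnerAligned(std::uint32_t inner)
{
    return (static_cast<std::uint64_t>(inner) * sizeof(T)) % kBlockBytes == 0;
}

// Smallest inner-axis length >= inner that satisfies the restriction for type T.
template <typename T>
constexpr std::uint32_t AlignInner(std::uint32_t inner)
{
    static_assert(kBlockBytes % sizeof(T) == 0, "element size must divide 32 bytes");
    constexpr std::uint32_t elemsPerBlock = kBlockBytes / sizeof(T);
    return (inner + elemsPerBlock - 1) / elemsPerBlock * elemsPerBlock;
}
```

With a 4-byte element type such as `float`, an inner length of 8 passes the check, while an inner length of 10 fails and would need to be padded to 16. Appending zero-valued columns at the end of each row does not change the cumulative sums of the original columns in either accumulation mode, so padding is one possible way to meet the restriction; whether it suits a given operator is a design decision left to the developer.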
Example
```cpp
#include "kernel_operator.h"

template <typename T, const AscendC::CumSumConfig& CONFIG>
class KernelCumSum {
public:
    __aicore__ inline KernelCumSum() {}
    __aicore__ inline void Init(
        GM_ADDR srcGm, GM_ADDR dstGm, GM_ADDR lastRowGm, const AscendC::CumSumInfo& cumSumParams)
    {
        // The CumSumInfo field is spelled "outter" in the API.
        outer = cumSumParams.outter;
        inner = cumSumParams.inner;
        srcGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ T *>(srcGm), outer * inner);
        dstGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ T *>(dstGm), outer * inner);
        lastRowGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ T *>(lastRowGm), inner);
        pipe.InitBuffer(inQueueX, 1, outer * inner * sizeof(T));
        pipe.InitBuffer(outQueue, 1, outer * inner * sizeof(T));
        pipe.InitBuffer(lastRowQueue, 1, inner * sizeof(T));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    // Copy the whole outer x inner input from global memory to local memory.
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> srcLocal = inQueueX.AllocTensor<T>();
        AscendC::DataCopy(srcLocal, srcGlobal, outer * inner);
        inQueueX.EnQue(srcLocal);
    }
    // Run CumSum; the temporary space is allocated by the API framework.
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> dstLocal = outQueue.AllocTensor<T>();
        AscendC::LocalTensor<T> lastRowLocal = lastRowQueue.AllocTensor<T>();
        AscendC::LocalTensor<T> srcLocal = inQueueX.DeQue<T>();
        const AscendC::CumSumInfo cumSumInfo{outer, inner};
        AscendC::CumSum<T, CONFIG>(dstLocal, lastRowLocal, srcLocal, cumSumInfo);
        outQueue.EnQue<T>(dstLocal);
        lastRowQueue.EnQue<T>(lastRowLocal);
        inQueueX.FreeTensor(srcLocal);
    }
    // Copy the result and the last row back to global memory.
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> dstLocal = outQueue.DeQue<T>();
        AscendC::DataCopy(dstGlobal, dstLocal, outer * inner);
        outQueue.FreeTensor(dstLocal);
        AscendC::LocalTensor<T> lastRowLocal = lastRowQueue.DeQue<T>();
        AscendC::DataCopy(lastRowGlobal, lastRowLocal, inner);
        lastRowQueue.FreeTensor(lastRowLocal);
    }

private:
    AscendC::GlobalTensor<T> srcGlobal;
    AscendC::GlobalTensor<T> dstGlobal;
    AscendC::GlobalTensor<T> lastRowGlobal;
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueX;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueue;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> lastRowQueue;
    uint32_t outer{1};
    uint32_t inner{1};
};

// Compile-time configuration passed as the CONFIG template argument.
constexpr AscendC::CumSumConfig cumSumConfig{true, false, true};

template <typename T>
__aicore__ inline void kernel_cumsum_operator(
    GM_ADDR srcGm, GM_ADDR dstGm, GM_ADDR lastRowGm, const AscendC::CumSumInfo &cumSumParams)
{
    KernelCumSum<T, cumSumConfig> op;
    op.Init(srcGm, dstGm, lastRowGm, cumSumParams);
    op.Process();
}
```