Sum

Applicability

Product

Supported

Atlas A3 training products / Atlas A3 inference products

Atlas A2 training products / Atlas A2 inference products

Atlas 200I/500 A2 inference products

x

Atlas inference product's AI Core

Atlas inference product's Vector Core

x

Atlas training products

x

Function

Obtains the sum of elements in the last dimension.

If the input is a vector, the elements are added within the vector. If the input is a matrix, the elements in each row are summed along the last dimension. This API supports input of data no more than two dimensions.

As shown in the following figure, the operation is performed on a two-dimensional matrix with the shape of (2, 3), and the output result is [6, 15].

This operation relies on three quantities. The number of rows is the outer axis length (outter). The actual number of elements in each row is the actual inner axis element count (n). The padded inner axis element count (inner) is the number of elements obtained after the byte length needed to store n elements is rounded up to an integer multiple of 32 bytes. This API requires that the inner axis length of the input be an integer multiple of 32 bytes, so if n x sizeof(T) is not a multiple of 32, you must pad each row up to the next multiple of 32 bytes.

In the example shown in the figure, the element type is half, n is 3, and each row occupies 6 bytes, which is not a multiple of 32. Padding each row to 32 bytes gives 16 elements, so outter = 2, n = 3, and inner = 16. In the figure, padding indicates the padding operation. The relationship between n and inner is as follows, where each division is an integer division that rounds down:

inner = (n x sizeof(T) + 32 - 1) / 32 x 32 / sizeof(T)
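The padding rule can be sketched as a small host-side helper. This is only an illustration: `PaddedInner` and `kBlockBytes` are hypothetical names that are not part of the Ascend C API, and the sketch assumes that the element size divides 32 evenly, as it does for half (2 bytes) and float (4 bytes).

```cpp
#include <cassert>
#include <cstdint>

// Alignment unit required for the inner axis, in bytes.
static const uint32_t kBlockBytes = 32;

// Hypothetical helper (not an Ascend C API): returns the padded inner-axis
// element count for n elements that each occupy elemSize bytes.
inline uint32_t PaddedInner(uint32_t n, uint32_t elemSize) {
    // Assumption of this sketch: elemSize divides 32 (true for half and float).
    assert(elemSize != 0 && kBlockBytes % elemSize == 0);
    // Round the row's byte length up to a multiple of 32 bytes (integer
    // division), then convert the byte length back to an element count.
    return (n * elemSize + kBlockBytes - 1) / kBlockBytes * kBlockBytes / elemSize;
}
```

For the half example above, `PaddedInner(3, 2)` evaluates to 16. For float rows of 3 elements, `PaddedInner(3, 4)` evaluates to 8.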

Prototype

  • Pass the temporary space through the sharedTmpBuffer input parameter.
    template <typename T, int32_t reduceDim = -1, bool isReuseSource = false, bool isBasicBlock = false>
    __aicore__ inline void Sum(const LocalTensor<T>& dstTensor, const LocalTensor<T>& srcTensor, const LocalTensor<uint8_t>& sharedTmpBuffer, const SumParams& sumParams)
    
  • Allocate the temporary space through the API framework.
    template <typename T, int32_t reduceDim = -1, bool isReuseSource = false, bool isBasicBlock = false>
    __aicore__ inline void Sum(const LocalTensor<T>& dstTensor, const LocalTensor<T>& srcTensor, const SumParams& sumParams)
    

Due to the complex mathematical computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The temporary space can be passed by developers through the sharedTmpBuffer input parameter or allocated through the API framework.

  • When the sharedTmpBuffer input parameter is used for passing the temporary space, the tensor serves as the temporary space. In this case, the API framework is not required for temporary space allocation. This enables you to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization.
  • When the API framework is used for temporary space allocation, you do not need to allocate the space, but must reserve the required size for the temporary space.

If sharedTmpBuffer is used, you must allocate the tensor space. If the API framework is used, you must reserve the temporary space. To obtain the size of the temporary space (BufferSize) to be reserved, use the API provided in GetSumMaxMinTmpSize.

Parameters

Table 1 Template parameters

Parameter

Description

T

Data type of the operand.

For the Atlas A3 training products / Atlas A3 inference products, the supported data types are half and float.

For the Atlas A2 training products / Atlas A2 inference products, the supported data types are half and float.

For the Atlas inference product's AI Core, the supported data types are half and float.

reduceDim

Dimension along which data is summed. This API always reduces along the last dimension, so the reduceDim parameter is not supported. Pass the default value -1.

isReuseSource

Whether the source operand can be modified. This parameter is reserved. Pass the default value false.

isBasicBlock

Reserved parameter, not supported currently.

Table 2 API parameters

Parameter

Input/Output

Description

dstTensor

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The output value needs to be saved in a space with a size of outter x sizeof(T). You need to allocate the actual buffer space to dstTensor based on this size and the framework's alignment requirements.

NOTE:

The size of the allocated buffer must be 32-byte aligned, as required by the framework. If outter x sizeof(T) is not a multiple of 32 bytes, round it up to the next multiple of 32 bytes. The extra space allocated for alignment is not filled with results and contains arbitrary values.

srcTensor

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The source operand must have the same data type as the destination operand.

sharedTmpBuffer

Input

Temporary buffer.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

This parameter is used to store intermediate variables during complex computation in Sum and is provided by developers.

For details about how to obtain the temporary space size (BufferSize), see GetSumMaxMinTmpSize.

sumParams

Input

Shape of srcTensor. SumParams type. The specific definition is as follows:

struct SumParams{
    uint32_t outter = 1;    // outer axis length of input data.
    uint32_t inner;         // number of padded elements on the inner axis of input data. The value of inner x sizeof(T) must be an integer multiple of 32 bytes.
    uint32_t n;             // actual number of elements on the inner axis of input data.
};
  • The value of sumParams.inner x sizeof(T) must be an integer multiple of 32 bytes.
  • sumParams.inner is obtained by rounding the byte size of sumParams.n elements up to the nearest multiple of 32 bytes and converting the result back to an element count: inner = (n x sizeof(T) + 32 - 1) / 32 x 32 / sizeof(T), using integer division. Therefore, sumParams.n must satisfy 1 ≤ sumParams.n ≤ sumParams.inner.
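The constraints on sumParams and on the dstTensor buffer size can be checked on the host before a kernel is launched. The following is a minimal sketch under stated assumptions: `HostSumParams`, `IsValidSumParams`, and `DstBufferBytes` are hypothetical names introduced only for illustration, and `HostSumParams` is a stand-in for the real SumParams type so that the code compiles without the Ascend C headers. The sketch also assumes that inner must equal the value given by the padding formula exactly.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side mirror of the SumParams fields described above.
// This is NOT the AscendC::SumParams type.
struct HostSumParams {
    uint32_t outter = 1;  // outer axis length
    uint32_t inner = 0;   // padded element count on the inner axis
    uint32_t n = 0;       // actual element count on the inner axis
};

// Returns true if the fields satisfy the constraints listed for sumParams.
inline bool IsValidSumParams(const HostSumParams& p, uint32_t elemSize) {
    assert(elemSize != 0);
    // inner x sizeof(T) must be an integer multiple of 32 bytes.
    const bool innerAligned = (p.inner * elemSize) % 32 == 0;
    // 1 <= n <= inner.
    const bool nInRange = p.n >= 1 && p.n <= p.inner;
    // Assumption: inner must match the padding formula exactly.
    const uint32_t expectedInner = (p.n * elemSize + 32 - 1) / 32 * 32 / elemSize;
    return innerAligned && nInRange && p.inner == expectedInner;
}

// Bytes to allocate for dstTensor: outter x sizeof(T), rounded up to 32 bytes.
inline uint32_t DstBufferBytes(uint32_t outter, uint32_t elemSize) {
    return (outter * elemSize + 32 - 1) / 32 * 32;
}
```

With outter = 2, n = 3, inner = 16, and a 2-byte element type, the check passes and `DstBufferBytes(2, 2)` evaluates to 32, even though only 4 bytes hold results.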

Returns

None

Restrictions

  • For details about the operand address alignment requirements, see General Address Alignment Restrictions.
  • The source operand address must not overlap the destination operand address.
  • sharedTmpBuffer must not overlap the addresses of the source operand and destination operand.
  • Currently, only the ND format is supported.
  • For one-dimensional inputs, set outter to 1. For two-dimensional inputs, set outter and n to the actual values and calculate inner using the preceding formula. Otherwise, the computation result may be incorrect.
  • srcTensor needs to be able to accommodate the space occupied by the data after inner axis alignment, and dstTensor needs to be able to accommodate the space occupied by the outter number of aligned results.
  • The internal bottom-layer addition mode of Sum is the same as that of ReduceSum and WholeReduceSum. The binary tree mode is used to add two elements at a time.

    Assume that the source operand consists of 128 elements of the half type, [data0, data1, data2, ..., data127]. The computation can be completed in one repeat, as follows:

    1. Add data0 and data1 to obtain data00, add data2 and data3 to obtain data01, ..., add data124 and data125 to obtain data062, and add data126 and data127 to obtain data063.
    2. Add data00 and data01 to obtain data000, add data02 and data03 to obtain data001, ..., and add data062 and data063 to obtain data0031.
    3. Repeat this pairwise addition until a single element of the half type remains. This element is the destination operand.
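The binary tree addition order described above can be modeled on the host as follows. This is only a sketch of the addition order, not the hardware implementation: `PairwiseSum` is a hypothetical name, the code uses float because standard C++ has no half type, and it assumes that the number of elements is a power of two (as in the 128-element example).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the binary tree addition order.
// Assumption: level.size() is a non-zero power of two.
inline float PairwiseSum(std::vector<float> level) {
    assert(!level.empty() && (level.size() & (level.size() - 1)) == 0);
    // Each pass adds neighbouring pairs and halves the number of elements.
    while (level.size() > 1) {
        std::vector<float> next(level.size() / 2);
        for (std::size_t i = 0; i < next.size(); ++i) {
            next[i] = level[2 * i] + level[2 * i + 1];
        }
        level.swap(next);
    }
    return level[0];
}
```

For 128 elements, the loop runs seven passes (128 → 64 → 32 → ... → 1). Because floating-point addition is not associative, a result computed in this order can differ slightly from a left-to-right sequential sum, which is worth keeping in mind when comparing against a host reference.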

Example

For a complete operator example, see Sum operator sample.

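AscendC::SumParams params;
params.inner = inner;    // padded number of elements in each row
params.outter = outter;  // number of rows
params.n = n;            // actual number of elements in each row
T scalar(0);
// Fill the first out_inner elements of yLocal with zeros before the reduction.
AscendC::Duplicate<T>(yLocal, scalar, out_inner);
AscendC::Sum(yLocal, xLocal, sharedTmpBuffer, params);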

The following is an example. If the input is two-dimensional data with a shape of 2 x 3 and an element type of half, then outter is 2, n is 3, sizeof(T) is 2, and inner is 16, because (3 x 2 + 32 - 1) / 32 x 32 / 2 = 16 with integer division.

Input (srcLocal): [[1 2 3 0 0 0 0 0 0 0 0 0 0 0 0 0],
                   [4 5 6 0 0 0 0 0 0 0 0 0 0 0 0 0]]
Output (dstLocal): [6 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
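To check a kernel's output, the expected values can be computed on the host with a plain reference loop. The following is a minimal sketch: `ReferenceSum` is a hypothetical helper, it uses float instead of half, and it adds elements sequentially rather than in binary tree order. It assumes that only the first n elements of each padded row contribute to the result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference (not an Ascend C API).
// src holds outter rows, each padded to inner elements; only the first n
// elements of each row are summed.
inline std::vector<float> ReferenceSum(const std::vector<float>& src,
                                       uint32_t outter, uint32_t inner, uint32_t n) {
    assert(n >= 1 && n <= inner);
    assert(src.size() >= static_cast<std::size_t>(outter) * inner);
    std::vector<float> dst(outter, 0.0f);
    for (uint32_t row = 0; row < outter; ++row) {
        for (uint32_t col = 0; col < n; ++col) {
            dst[row] += src[static_cast<std::size_t>(row) * inner + col];
        }
    }
    return dst;
}
```

Called with the padded 2 x 16 input above and outter = 2, inner = 16, n = 3, the function returns two values, 6 and 15, which match the first two elements of dstLocal.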