ReduceSum

Applicability

Product                                                     Supported

Atlas A3 training products / Atlas A3 inference products    √
Atlas A2 training products / Atlas A2 inference products    √
Atlas 200I/500 A2 inference products                        x
Atlas inference product's AI Core                           x
Atlas inference product's Vector Core                       x
Atlas training products                                     x

Function

Sums the data of a multi-dimensional vector along a specified dimension.

The dimension being reduced is called the R axis (Reduce axis), and a dimension that is kept is called the A axis (Normal axis). For example, for a two-dimensional matrix with the shape (2, 3), summing along the first dimension yields [5, 7, 9], and summing along the second dimension yields [6, 15], as shown in the following figures.

Figure 1 Example of summation along the first dimension using ReduceSum
Figure 2 Example of summation along the last dimension using ReduceSum
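
The two cases can also be written as formulas. The sketch below uses a source x of shape (M, N) and an output y; these symbols are introduced here for illustration only. The numeric check assumes the matrix in the figures is [[1, 2, 3], [4, 5, 6]], which is an assumption that is consistent with the outputs quoted above but is not stated in the text.

```latex
% Summation along the first dimension (pattern RA): one output per column.
\[
y_j = \sum_{i=0}^{M-1} x_{i,j}, \qquad j = 0, \dots, N-1
\]
% Summation along the last dimension (pattern AR): one output per row.
\[
y_i = \sum_{j=0}^{N-1} x_{i,j}, \qquad i = 0, \dots, M-1
\]
% Check with the assumed matrix [[1, 2, 3], [4, 5, 6]], M = 2, N = 3:
\[
(1+4,\; 2+5,\; 3+6) = (5,\, 7,\, 9), \qquad (1+2+3,\; 4+5+6) = (6,\, 15)
\]
```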

Prototype

  • Pass the temporary space through the sharedTmpBuffer input parameter.
    template <class T, class pattern, bool isReuseSource = false>
    __aicore__ inline void ReduceSum(const LocalTensor<T>& dstTensor, const LocalTensor<T>& srcTensor, const LocalTensor<uint8_t>& sharedTmpBuffer, const uint32_t srcShape[], bool srcInnerPad)
    
  • Allocate the temporary space through the API framework.
    template <class T, class pattern, bool isReuseSource = false>
    __aicore__ inline void ReduceSum(const LocalTensor<T>& dstTensor, const LocalTensor<T>& srcTensor, const uint32_t srcShape[], bool srcInnerPad)
    

The internal implementation of this API involves complex computation and requires additional temporary space to store intermediate variables. The temporary space can be passed through the sharedTmpBuffer input parameter or allocated through the API framework.

  • When the sharedTmpBuffer input parameter is used for passing the temporary space, the tensor serves as the temporary space. In this case, the API framework is not required for temporary space allocation. This enables you to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization.
  • When the API framework is used for temporary space allocation, you do not need to allocate the space, but must reserve the required size for the space.

In both cases you need to know the size of the temporary space: with sharedTmpBuffer you allocate a tensor of that size, and with the API framework you reserve that much space. To obtain the size of the temporary space (BufferSize), use the API described in GetReduceSumMaxMinTmpSize.

Parameters

Table 1 Template parameters

Parameter

Description

T

Data type of the operand.

For the Atlas A3 training products / Atlas A3 inference products, the supported data type is float.

For the Atlas A2 training products / Atlas A2 inference products, the supported data type is float.

pattern

Axis pattern of the ReduceSum computation, which identifies the Reduce axis and the Normal axis. The pattern name is a sequence of the letters A (Normal axis) and R (Reduce axis), with one letter per dimension of the vector. For example, AR indicates a ReduceSum operation on a 2D vector in which the first dimension is the Normal axis and the second dimension is the Reduce axis, meaning that the data is summed along the second dimension.

pattern is a struct defined in the AscendC::Pattern::Reduce namespace. You can ignore its member variables.

Currently, pattern can only be AR or RA.

isReuseSource

Whether the source operand can be modified. The default value is false. If you allow the source operand to be modified, enable this parameter to reduce memory space usage.

If this parameter is set to true, the src memory space is reused during internal computation of this API to reduce memory space usage. If this parameter is set to false, the src memory space is not reused during internal computation of this API.

For details about how to use isReuseSource, see Example 4.
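
To illustrate how a struct passed as the pattern template parameter can select the reduction axis at compile time, here is a minimal host-side sketch. The namespace sketch, the empty structs AR and RA inside it, and the function ReduceSumRef are all hypothetical stand-ins. They are not the definitions from AscendC::Pattern::Reduce, and the real implementation may work differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

namespace sketch {
// Hypothetical tag types that only mirror the names AR and RA.
struct AR {};  // first dimension kept (A), second dimension reduced (R)
struct RA {};  // first dimension reduced (R), second dimension kept (A)

// Host-side reference sum over a row-major 2D source of shape
// (srcShape[0], srcShape[1]). Not the Ascend C implementation.
template <class T, class Pattern>
std::vector<T> ReduceSumRef(const std::vector<T>& src, const uint32_t srcShape[]) {
    static_assert(std::is_same<Pattern, AR>::value || std::is_same<Pattern, RA>::value,
                  "this sketch handles only AR and RA");
    const uint32_t rows = srcShape[0];
    const uint32_t cols = srcShape[1];
    // AR reduces the last dimension, so there is one output per row;
    // RA reduces the first dimension, so there is one output per column.
    constexpr bool reduceLast = std::is_same<Pattern, AR>::value;
    std::vector<T> dst(reduceLast ? rows : cols, T(0));
    for (uint32_t r = 0; r < rows; ++r) {
        for (uint32_t c = 0; c < cols; ++c) {
            dst[reduceLast ? r : c] += src[static_cast<std::size_t>(r) * cols + c];
        }
    }
    return dst;
}
}  // namespace sketch
```

Because the pattern is a type rather than a runtime value, the choice of axis is fixed at compile time, and an unsupported pattern is rejected by the static_assert instead of failing at run time.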

Table 2 API parameters

Parameter

Input/Output

Description

dstTensor

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

srcTensor

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The source operand must have the same data type as the destination operand.

sharedTmpBuffer

Input

Temporary buffer.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

This parameter is used to store intermediate variables during complex computation in ReduceSum and is provided by developers.

For details about how to obtain the temporary space size (BufferSize), see GetReduceSumMaxMinTmpSize.

srcShape

Input

An array of the uint32_t type, indicating the shape of the source operand. The number of dimensions in the shape must be consistent with the template parameter pattern. For example, if pattern is AR, the shape must be two-dimensional.

For the Atlas A3 training products / Atlas A3 inference products, only two-dimensional shapes are currently supported.

For the Atlas A2 training products / Atlas A2 inference products, only two-dimensional shapes are currently supported.

srcInnerPad

Input

Whether the innermost axis data to be computed is 32-byte aligned.

For the Atlas A3 training products / Atlas A3 inference products, only true is currently supported.

For the Atlas A2 training products / Atlas A2 inference products, only true is currently supported.
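
Because only srcInnerPad = true is supported, it can help to check the innermost axis of srcShape on the host before launching the kernel. The sketch below assumes that "32-byte aligned" means the byte length of the innermost axis is a multiple of 32. The names kAlignBytes, IsInnerAxisAligned, and PaddedInnerLen are hypothetical helpers, not part of the Ascend C API.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical host-side helpers, not part of the Ascend C API.
constexpr uint32_t kAlignBytes = 32;

// True if innerLen elements of type T occupy a whole number of 32-byte blocks.
template <class T>
constexpr bool IsInnerAxisAligned(uint32_t innerLen) {
    return (static_cast<uint64_t>(innerLen) * sizeof(T)) % kAlignBytes == 0;
}

// Smallest element count >= innerLen whose byte size is a multiple of 32.
template <class T>
constexpr uint32_t PaddedInnerLen(uint32_t innerLen) {
    static_assert(kAlignBytes % sizeof(T) == 0, "element size must divide 32");
    return static_cast<uint32_t>((innerLen + kAlignBytes / sizeof(T) - 1) /
                                 (kAlignBytes / sizeof(T)) * (kAlignBytes / sizeof(T)));
}
```

For float, one 32-byte block holds 8 elements, so an innermost axis of length 8 is already aligned, while a length of 3 would have to be padded to 8 under this assumption.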

Returns

None

Restrictions

  • The source operand address must not overlap the destination operand address.
  • The address of sharedTmpBuffer must not overlap that of the source or destination operand.
  • The internal algorithm does not handle data overflow during accumulation. If overflow occurs, the precision of the result is not guaranteed.
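
The non-overlap restrictions amount to a check on byte ranges. The sketch below is a generic host-side illustration of that check. RangesOverlap is a hypothetical helper, not an Ascend C API, and how the addresses and sizes of the tensors are obtained is outside the scope of this sketch.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of the Ascend C API.
// Returns true if the non-empty byte ranges [a, a + aBytes) and
// [b, b + bBytes) share at least one byte.
inline bool RangesOverlap(const void* a, std::size_t aBytes,
                          const void* b, std::size_t bBytes) {
    const auto aBegin = reinterpret_cast<std::uintptr_t>(a);
    const auto bBegin = reinterpret_cast<std::uintptr_t>(b);
    // Two ranges intersect when each one starts before the other ends.
    return aBegin < bBegin + bBytes && bBegin < aBegin + aBytes;
}
```

Under these restrictions, the check would have to return false for each of the pairs (source, destination), (sharedTmpBuffer, source), and (sharedTmpBuffer, destination).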

Example

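// Allocate the output tensor and take the input tensor from their queues.
AscendC::LocalTensor<float> dstLocal = outQueue.AllocTensor<float>();
AscendC::LocalTensor<float> srcLocal = inQueue.DeQue<float>();
// Temporary buffer for intermediate variables, passed as sharedTmpBuffer.
AscendC::LocalTensor<uint8_t> tmp = tbuf.Get<uint8_t>();
// Source shape (2, 8): 8 floats per row, so each row is 32 bytes.
uint32_t shape[] = { 2, 8 };
// Allow the API to reuse the memory of srcLocal during computation.
constexpr bool isReuse = true;
// Pattern AR: sum along the last dimension. The last argument is srcInnerPad.
AscendC::ReduceSum<float, AscendC::Pattern::Reduce::AR, isReuse>(dstLocal, srcLocal, tmp, shape, true);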

Result example:

The input and output data type is float.
Input (src):
[[ 0.0 4.0 2.0 0.0 -1.0 2.0 -1.0 7.0],
 [ 0.0 1.0 -9.0 2.0 2.0 2.0 8.0 3.0]]
Input pattern: AR
Input shape: (2, 8)
Output data (dst): [13.0 9.0]
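
The output can be checked by hand. With pattern AR, each output element is the sum of one input row:

```latex
\begin{aligned}
\mathrm{dst}[0] &= 0 + 4 + 2 + 0 + (-1) + 2 + (-1) + 7 = 13 \\
\mathrm{dst}[1] &= 0 + 1 + (-9) + 2 + 2 + 2 + 8 + 3 = 9
\end{aligned}
```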