ReduceSum

Function Usage

Sums up all input data. For details about reduction instructions, see Reduction Instructions.

ReduceSum can be implemented in either of the following ways:
  • Method 1: Binary tree accumulation is first performed within each repeat and then on the results of the different repeats.

    Assume that the source operand consists of 128 half elements [data0, data1, data2, ..., data127]; the computation can then be completed in one repeat. The computation process is as follows:

    1. Add data0 and data1 to obtain data00, add data2 and data3 to obtain data01, ..., add data124 and data125 to obtain data62, and add data126 and data127 to obtain data63.
    2. Add data00 and data01 to obtain data000, add data02 and data03 to obtain data001, ..., add data62 and data63 to obtain data031.
    3. Continue in this manner until a single half value remains; this value is the destination operand.

    If an intermediate result is greater than 65504 (the maximum value of half), it is saturated to 65504. For example, for the source operand [60000, 60000, –30000, 100]: 60000 + 60000 > 65504, so the first pairwise sum overflows and the maximum value 65504 is used instead; meanwhile, –30000 + 100 = –29900, so the final result is 65504 – 29900 = 35604. A host-side sketch of this process follows the method list.

  • Method 2: Binary tree accumulation is performed within each repeat, and the results of different repeats are accumulated sequentially.
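
    The following host-side C++ sketch (an illustration only, not the Ascend C implementation) reproduces the Method 1 binary tree for a single repeat, assuming that every pairwise half addition saturates at ±65504:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // half overflows above 65504; assume the hardware saturates instead of producing inf.
    static float SaturateHalf(float v) {
        return std::min(std::max(v, -65504.0f), 65504.0f);
    }

    // Pairwise (binary-tree) reduction; data.size() is assumed to be a power of two.
    static float BinaryTreeReduceSum(std::vector<float> data) {
        while (data.size() > 1) {
            std::vector<float> next(data.size() / 2);
            for (std::size_t i = 0; i < next.size(); ++i) {
                next[i] = SaturateHalf(data[2 * i] + data[2 * i + 1]);
            }
            data = std::move(next);
        }
        return data[0];
    }

    // BinaryTreeReduceSum({60000, 60000, -30000, 100}) returns 35604, matching the overflow example above.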

The ReduceSum method used depends on the hardware form.

For the Atlas Training Series Product, Method 1 is used.

  • workLocal supports the following processing methods:
    • Method 1: Calculate the minimum space required according to the following formula (a worked example follows this list):
      // Define a round-up function.
      int RoundUp(int a, int b)
      { 
          return (a + b - 1) / b;
      }
      
      // Define the data types involved in the computation.
      int typeSize = 2;                           // half occupies 2 bytes and float occupies 4 bytes. Set this parameter as required.
      
      // Define two units based on the data type.
      int elementsPerBlock = 32 / typeSize;       // Number of elements that a data block can hold
      int elementsPerRepeat = 256 / typeSize;     // Number of elements that can be processed in a repeat
      
      // Determine the first maximum repeat value.
      // For high-dimensional tensor sharding computation APIs, firstMaxRepeat is repeatTimes.
      // For APIs that compute the first n data elements of a tensor, firstMaxRepeat is
      // count / elementsPerRepeat (count / 128 for half, count / 64 for float);
      // when count < elementsPerRepeat, firstMaxRepeat is 1. Set this parameter as required.
      int firstMaxRepeat = repeatTimes;
      
      int iter1OutputCount = firstMaxRepeat;                                              // Number of elements generated in the first repeat
      int iter1AlignEnd = RoundUp(iter1OutputCount, elementsPerBlock) * elementsPerBlock; // Round up the number of elements generated in the first repeat.
      int finalWorkLocalNeedSize = iter1AlignEnd;                                         // The size of elements space required by workLocal is the round-up number of elements generated in the first repeat.
      
    • Method 2: Pass workLocal of any size. The contents of workLocal remain unchanged.
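
    As a worked example of the Method 1 formula, take the values used in the complete example in the Example section (half data, repeatTimes = 65):

      elementsPerBlock       = 32 / 2 = 16
      iter1OutputCount       = firstMaxRepeat = 65
      iter1AlignEnd          = RoundUp(65, 16) * 16 = 5 * 16 = 80
      finalWorkLocalNeedSize = 80 elements, that is, 160 bytes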

Prototype

  • Computation of the first n data elements of a tensor
    template <typename T, bool isSetMask = true>
    __aicore__ inline void ReduceSum(const LocalTensor<T>& dstLocal, const LocalTensor<T>& srcLocal, const LocalTensor<T>& workLocal, const int32_t count)
    
  • High-dimensional tensor sharding computation
    • Bitwise mask mode
      template <typename T>
      __aicore__ inline void ReduceSum(const LocalTensor<T>& dstLocal, const LocalTensor<T>& srcLocal, const LocalTensor<T>& workLocal, const uint64_t mask[], const int32_t repeatTimes, const int32_t srcRepStride)
      
    • Contiguous mask mode
      template <typename T>
      __aicore__ inline void ReduceSum(const LocalTensor<T>& dstLocal, const LocalTensor<T>& srcLocal, const LocalTensor<T>& workLocal, const int32_t mask, const int32_t repeatTimes, const int32_t srcRepStride)
      

Parameters

Table 1 Template parameters

  T
    Operand data type.
    For the Atlas Training Series Product, the supported data type is half.

  isSetMask
    Reserved parameter for future functions. Retain the default value.

Table 2 Parameters

  dstLocal (output)
    Destination operand.
    The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
    The start address of the LocalTensor must be 2-byte aligned (for half data) or 4-byte aligned (for float data).

  srcLocal (input)
    Source operand.
    The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
    The start address of the LocalTensor must be 32-byte aligned.
    The source operand must have the same data type as the destination operand.

  workLocal (input)
    A tensor that stores intermediate results during instruction execution. Pay attention to its required size; for details, see Constraints.
    The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
    The start address of the LocalTensor must be 32-byte aligned.
    workLocal must have the same data type as the destination operand.

  count (input)
    Number of input elements.
    The value range depends on the operand data type: the maximum number of elements that can be processed varies with the type, and the total data size must not exceed the UB size limit.

  mask (input)
    Controls which elements participate in the computation in each iteration.
    • Contiguous mode: specifies the number of contiguous elements that participate in each iteration. The value range depends on the operand data type: mask ∈ [1, 128] for 16-bit operands, mask ∈ [1, 64] for 32-bit operands, and mask ∈ [1, 32] for 64-bit operands.
    • Bitwise mode: selects the participating elements bit by bit. If a bit is 1, the corresponding element participates in the computation; if it is 0, the element is masked out. The parameter type is a uint64_t array of length 2.

      For example, if mask = [8, 0] and 8 = 0b1000, only the fourth element participates in the computation.

      The value range depends on the operand data type. For 16-bit operands, mask[0] and mask[1] ∈ [0, 2^64 – 1] and cannot both be 0. For 32-bit operands, mask[1] must be 0 and mask[0] ∈ (0, 2^64 – 1]. For 64-bit operands, mask[1] must be 0 and mask[0] ∈ (0, 2^32 – 1].

  repeatTimes (input)
    Number of repeats (iterations). Unlike the range described in Common Parameters, this parameter supports a larger value range; ensure that the value does not exceed the maximum value of int32_t.

  srcRepStride (input)
    Address stride between adjacent iterations of the source operand, that is, the number of data blocks skipped between consecutive iterations of the source operand. For details, see repeatStride.
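
For example, with half data and mask = 128, each iteration consumes 128 elements × 2 bytes = 256 bytes, that is, 8 data blocks of 32 bytes, so srcRepStride = 8 makes consecutive iterations read contiguous source data (as in the examples below).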

Returns

None

Availability

Atlas Training Series Product

Constraints

  • To save address space, developers can define a single tensor to be used by srcLocal, dstLocal, and workLocal at the same time (address overlapping). The constraints on address overlapping are as follows:
    • If there is a dependency between the source operand and the destination operand, that is, the destination operand of the Nth iteration is the source operand of the (N + 1)th iteration, address overlapping is not allowed.
    • If the address of srcLocal or dstLocal overlaps with that of workLocal, workLocal must meet the minimum space requirement; otherwise, address overlapping is not supported.
    • If the addresses of the operands overlap, they must overlap completely. Partial overlapping is not supported.
  • The ReduceSum API is implemented through software simulation. In some scenarios, its performance may be lower than that of the BlockReduceSum and WholeReduceSum APIs, which are implemented with hardware instructions. Choosing the appropriate reduction instruction for each scenario can improve performance; see Using the Reduction Instruction Properly in Different Scenarios for an introduction and ReduceCustom for examples.

Example

  • Example of high-dimensional tensor sharding computation (contiguous mask mode)
    // dstLocal, srcLocal, and workLocal are of the half type. srcLocal holds 8320 contiguously
    // arranged elements. The high-dimensional tensor sharding computation API is used, with
    // repeatTimes set to 65 and mask set so that all elements participate in the computation.
    int32_t mask = 128;
    AscendC::ReduceSum<half>(dstLocal, srcLocal, workLocal, mask, 65, 8);
    
  • Example of high-dimensional tensor sharding computation (bitwise mask mode)
    // dstLocal, srcLocal, and workLocal are of the half type. srcLocal holds 8320 contiguously
    // arranged elements. The high-dimensional tensor sharding computation API is used, with
    // repeatTimes set to 65 and mask set so that all elements participate in the computation.
    uint64_t mask[2] = { 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF };
    AscendC::ReduceSum<half>(dstLocal, srcLocal, workLocal, mask, 65, 8);
    
  • Example of computing the first n data elements of a tensor
    // dstLocal, srcLocal, and workLocal are of the half type. srcLocal holds 8320 contiguously
    // arranged elements. The API for computing the first n data elements of a tensor is used.
    AscendC::ReduceSum<half>(dstLocal, srcLocal, workLocal, 8320);
    
  • The following is a complete example of the high-dimensional tensor sharding computation API:
    #include "kernel_operator.h"
    class KernelReduce {
    public:
        __aicore__ inline KernelReduce() {}
        __aicore__ inline void Init(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
        {
            srcGlobal.SetGlobalBuffer((__gm__ half*)src);
            dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
            repeat = srcDataSize / mask;
            pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
        pipe.InitBuffer(workQueue, 1, 80 * sizeof(half)); // Based on the formula, the minimum workspace required is 80 elements, that is, 160 bytes.
            pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
            AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
            inQueueSrc.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
            AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
            AscendC::LocalTensor<half> workLocal = workQueue.AllocTensor<half>();
        // Reduce all elements across the 65 repeats; the scalar result is written to dstLocal[0].
        AscendC::ReduceSum<half>(dstLocal, srcLocal, workLocal, mask, repeat, repStride);
            outQueueDst.EnQue<half>(dstLocal);
            inQueueSrc.FreeTensor(srcLocal);
            workQueue.FreeTensor(workLocal);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
            AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
            outQueueDst.FreeTensor(dstLocal);
        }
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc;
        AscendC::TQue<AscendC::QuePosition::VECOUT, 1> workQueue;
        AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
        AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
        int srcDataSize = 8320;
        int dstDataSize = 16;
        int mask = 128;
        int repStride = 8;
        int repeat = 0;
    };
    

    Sample input and output:

    Input (src_gm):
    [1. 1. 1. ... 1. 1. 1.]
    Output (dst_gm):
    [8320.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
        0.    0.    0.    0.]
  • The following is a complete example of the API for computing the first n data elements of a tensor:
    #include "kernel_operator.h"
    class KernelReduce {
    public:
        __aicore__ inline KernelReduce() {}
        __aicore__ inline void Init(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
        {
            srcGlobal.SetGlobalBuffer((__gm__ half*)src);
            dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
            repeat = srcDataSize / mask;
            pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
        pipe.InitBuffer(workQueue, 1, 16 * sizeof(half)); // Based on the formula, the minimum workspace required is 16 elements, that is, 32 bytes.
            pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
            AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
            inQueueSrc.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
            AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
            AscendC::LocalTensor<half> workLocal = workQueue.AllocTensor<half>();
            AscendC::ReduceSum<half>(dstLocal, srcLocal, workLocal, srcDataSize);
            outQueueDst.EnQue<half>(dstLocal);
            inQueueSrc.FreeTensor(srcLocal);
            workQueue.FreeTensor(workLocal);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
            AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
            outQueueDst.FreeTensor(dstLocal);
        }
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc;
        AscendC::TQue<AscendC::QuePosition::VECOUT, 1> workQueue;
        AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
        AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
        int srcDataSize = 288;
        int dstDataSize = 16;
        int mask = 128;
        int repStride = 8;
        int repeat = 0;
    };
    

    Sample input and output:

    Input (src_gm):
    [1. 1. 1. ... 1. 1. 1.]
    Output (dst_gm):
    [288.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
       0.   0.]