Using the Reduction Instruction Properly in Different Scenarios

[Priority] Medium

[Description] When all elements in a contiguous buffer need to be accumulated into a single element, WholeReduceSum reduces more elements per instruction than BlockReduceSum, so fewer instructions are needed; however, each WholeReduceSum instruction executes more slowly than a BlockReduceSum instruction. To balance the number of instructions against per-instruction speed, different instruction combinations need to be used for different shapes to achieve optimal performance.

For example, for a float input with a shape of 256, the accumulation result of the 256 floats can be obtained with two WholeReduceSum calls (256 -> 4 -> 1) or with three BlockReduceSum calls (256 -> 32 -> 4 -> 1). Considering combinations of reduction instructions, one BlockReduceSum followed by one WholeReduceSum (256 -> 32 -> 1) also achieves the same accumulation. Because a single BlockReduceSum instruction executes faster than a single WholeReduceSum instruction, the combination achieves better performance than two WholeReduceSum calls. In addition, because the combination uses only two reduction instructions instead of three, it also achieves better performance than three BlockReduceSum calls.

[Negative Example]

Negative example 1:
...
static constexpr uint32_t REP_LEN = 256;
TBuf<QuePosition::VECCALC> calcBuf;
pipe.InitBuffer(calcBuf, totalLength * sizeof(float));
AscendC::LocalTensor<float> tempTensor1 = calcBuf.Get<float>();
constexpr uint32_t repCount = REP_LEN / sizeof(float);              // 64 floats per repeat
const uint32_t repNum0 = (totalLength + repCount - 1) / repCount;   // partial sums after the first pass
AscendC::SetMaskCount();
AscendC::SetVectorMask<float>(0, totalLength);
// First WholeReduceSum: 256 floats -> 4 partial sums.
AscendC::WholeReduceSum<float, false>(tempTensor1, xLocal, AscendC::MASK_PLACEHOLDER, 1,
    DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_REP_STRIDE);
AscendC::PipeBarrier<PIPE_V>();
AscendC::SetVectorMask<float>(0, repNum0);
// Second WholeReduceSum: 4 partial sums -> 1 final result.
AscendC::WholeReduceSum<float, false>(zLocal, tempTensor1, AscendC::MASK_PLACEHOLDER, 1,
    DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_REP_STRIDE);
AscendC::PipeBarrier<PIPE_V>();
AscendC::SetMaskNorm();
...
Negative example 2:
...
constexpr uint32_t c0Count = BLK_LEN / sizeof(DTYPE_X);             // elements per 32-byte block
const uint32_t blockNum0 = (totalLength + c0Count - 1) / c0Count;   // partial sums after the first pass
const uint32_t blockNum1 = (blockNum0 + c0Count - 1) / c0Count;     // partial sums after the second pass
AscendC::SetMaskCount();
AscendC::SetVectorMask<DTYPE_X>(0, totalLength);
// First BlockReduceSum: 256 elements -> 32 partial sums.
AscendC::BlockReduceSum<DTYPE_X, false>(tempTensor1, xLocal, AscendC::MASK_PLACEHOLDER, 1,
    DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_REP_STRIDE);
AscendC::PipeBarrier<PIPE_V>();
AscendC::SetVectorMask<DTYPE_X>(0, blockNum0);
// Second BlockReduceSum: 32 -> 4 partial sums.
AscendC::BlockReduceSum<DTYPE_X, false>(tempTensor1, tempTensor1, AscendC::MASK_PLACEHOLDER, 1,
    DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_REP_STRIDE);
AscendC::PipeBarrier<PIPE_V>();
AscendC::SetVectorMask<DTYPE_X>(0, blockNum1);
// Third BlockReduceSum: 4 -> 1 final result.
AscendC::BlockReduceSum<DTYPE_X, false>(zLocal, tempTensor1, AscendC::MASK_PLACEHOLDER, 1,
    DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_REP_STRIDE);
AscendC::PipeBarrier<PIPE_V>();
AscendC::SetMaskNorm();
...

[Positive Example]

When the input shape is 256 and the input type is float, a combination of one BlockReduceSum and one WholeReduceSum sums up the 256 float elements. For details about the complete sample, see ReduceCustom.

...
static constexpr uint32_t BLK_LEN = 32;
TBuf<QuePosition::VECCALC> calcBuf;
pipe.InitBuffer(calcBuf, totalLength * sizeof(float));
AscendC::LocalTensor<float> tempTensor1 = calcBuf.Get<float>();
constexpr uint32_t c0Count = BLK_LEN / sizeof(float);               // 8 floats per 32-byte block
const uint32_t blockNum0 = (totalLength + c0Count - 1) / c0Count;   // partial sums after the block pass
AscendC::SetMaskCount();
AscendC::SetVectorMask<float>(0, totalLength);
// BlockReduceSum: 256 floats -> 32 per-block partial sums.
AscendC::BlockReduceSum<float, false>(tempTensor1, xLocal, AscendC::MASK_PLACEHOLDER, 1,
    DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_REP_STRIDE);
AscendC::PipeBarrier<PIPE_V>();
AscendC::SetVectorMask<float>(0, blockNum0);
// WholeReduceSum: the 32 partial sums fit in one repeat -> 1 final result.
AscendC::WholeReduceSum<float, false>(zLocal, tempTensor1, AscendC::MASK_PLACEHOLDER, 1,
    DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_REP_STRIDE);
AscendC::PipeBarrier<PIPE_V>();
AscendC::SetMaskNorm();
...

[Performance Data]

The input shape is 256 and the data type is float. The profiling data for the preceding examples is as follows:

Table 1 Performance data of the three accumulation modes (total time over 100 runs): two WholeReduceSum (negative example 1), three BlockReduceSum (negative example 2), and one BlockReduceSum plus one WholeReduceSum (positive example)

Negative Example 1    Negative Example 2    Positive Example
13 us                 13.94 us              8.44 us