Selecting Low-Latency Instructions to Optimize Reduction Operation Performance
[Priority] High
[Description]
Instruction execution latency refers to the duration from the time when an instruction is executed to the time when the instruction is completely executed (that is, all operations are complete and the result is available). It directly affects the response speed and real-time performance of a program. In latency-sensitive scenarios, reducing the instruction execution latency is the key to improving performance. The following uses the reduction operation as an example to compare the performance of several reduction solutions, so that you can select a higher-performance solution based on the specific data scale and scenario when using reduction instructions.
Comparison between the binary accumulation solution and the reduction instruction solution
According to the single-instruction performance test data (you can test it yourself), the latency of reduction instructions such as WholeReduceSum is about 2 to 5 times that of the Add instruction. Therefore, for reduction operations on continuous data, you can use a combination of the Add and WholeReduceSum instructions to optimize the overall performance. This solution is referred to as the binary accumulation solution. The details are as follows:
- Binary accumulation: Tile the data into two pieces and use the Add instruction to add the two pieces of data. Tile the sum into two pieces of data again and continue to use the Add instruction to accumulate the two pieces of data. Repeat this process.
- When the data size after binary accumulation is less than or equal to 256 bytes (the data operation volume of one instruction in one Repeat), use the WholeReduceSum instruction to obtain the reduction result at a time.
Assume that the data type of the input data is float and the shape is (5, 256). The following figure shows the execution process of a row of data.
The preceding process is executed for each row to obtain the final reduction result, that is, the data whose shape is (m, k). After the reduction is complete, the shape is (m, 1).
The ReduceSum API is implemented by combining multiple instructions. Generally, in scenarios where the data size is large and the number of cycles is large, the performance of the binary accumulation solution > WholeReduceSum single-instruction operation performance > ReduceSum API performance. In scenarios with a small data size or special shapes, data tilting is required and you need to analyze the problem based on the instruction execution time and number of executed instructions.
The following compares the core code snippets and profile data of the binary accumulation solution and the reduction instruction solution. For details about the complete example, click ReduceCustom.
[Performance data]
When the input shape is 30000 and the data type is float, the profile data is compared as follows. The data unit is cycle, and the data is obtained by using the GetSystemCycle API.
|
Binary accumulation solution |
WholeReduceSum single-instruction operation |
|---|---|
|
172 |
242 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
__aicore__ inline void BinaryReduceSumImpl(const AscendC::LocalTensor<float>& dst, const AscendC::LocalTensor<float>& src, const uint32_t bsLength, const uint32_t hLength) { // src is two-dimensional data, and the shape is (bsLength, hLength). The shape of dst is (bsLength, 1). AscendC::BinaryRepeatParams binaryParams; AscendC::UnaryRepeatParams unaryParams; AscendC::SetMaskCount(); for (uint32_t i = 0; i < bsLength; i++) { AscendC::LocalTensor<float> srcTmp = src[i * hLength]; AscendC::LocalTensor<float> dstTmp = dst[i * hLength]; uint32_t totalNum = hLength / 16 * 16; uint32_t remaining = hLength - totalNum; AscendC::LocalTensor<float> remainingTensor = srcTmp[totalNum]; while (totalNum > ONE_REPEAT_FLOAT_SIZE) { uint32_t halfNum = AscendC::DivCeil(totalNum, 16) * DEFAULT_REP_STRIDE; AscendC::SetVectorMask<uint8_t, AscendC::MaskMode::COUNTER>(0, totalNum - halfNum); AscendC::Add<float, false>(dstTmp, srcTmp, srcTmp[halfNum], AscendC::MASK_PLACEHOLDER, 1, binaryParams); totalNum = halfNum; srcTmp = dstTmp; } if (remaining != 0 && hLength > ONE_REPEAT_FLOAT_SIZE) { AscendC::SetVectorMask<uint8_t, AscendC::MaskMode::COUNTER>(0, remaining); AscendC::Add<float, false>(dstTmp, dstTmp, remainingTensor, AscendC::MASK_PLACEHOLDER, 1, binaryParams); } AscendC::SetVectorMask<uint8_t, AscendC::MaskMode::COUNTER>(0, totalNum); AscendC::WholeReduceSum<float, false>(dstTmp, srcTmp, AscendC::MASK_PLACEHOLDER, 1, DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_REP_STRIDE); } AscendC::ResetMask(); AscendC::SetMaskNorm(); } |
[WholeReduceSum single-instruction operation]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
__aicore__ inline void WholeReduceSumImpl(const AscendC::LocalTensor<float>& dst, const AscendC::LocalTensor<float>& src, const uint32_t bsLength, const uint32_t hLength) { // src is a two-dimensional array, and the shape is (bsLength, hLength). The shape of dst is (bsLength, 1). AscendC::SetMaskCount(); for (uint32_t i = 0; i < bsLength; i++) { uint32_t totalNum = hLength; AscendC::LocalTensor<float> srcTmp = src[i * hLength]; AscendC::LocalTensor<float> dstTmp = dst[i * hLength]; while (totalNum > 1) { AscendC::SetVectorMask<uint8_t, AscendC::MaskMode::COUNTER>(0, totalNum); AscendC::WholeReduceSum<float, false>(dstTmp, srcTmp, AscendC::MASK_PLACEHOLDER, 1, DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_REP_STRIDE); totalNum = AscendC::DivCeil(totalNum, ONE_REPEAT_FLOAT_SIZE); srcTmp = dstTmp; } } AscendC::ResetMask(); AscendC::SetMaskNorm(); } |
Comparison Between BlockReduceSum and WholeReduceSum Reduction Solutions
According to further test analysis, the execution efficiency of BlockReduceSum is better than that of WholeReduceSum. Therefore, different instruction combinations can be used to achieve better performance based on different shapes.
For example, for data of the float type and with a shape size of 256, the reduction result can be obtained in the following three ways:
- Two WholeReduceSum operations;
- Three BlockReduceSum operations;
- One BlockReduceSum operation and one WholeReduceSum operation.
According to the analysis of the single-instruction performance data (you can test it by yourself), the performance of one BlockReduceSum operation + one WholeReduceSum operation is better than that of two WholeReduceSum operations or three BlockReduceSum operations.
The following provides the core code snippets and profile data comparison of the three solutions. For details about the complete example, click ReduceCustom.
[Performance data]
The input shape is 256 and the data type is float. The profile data is as follows:
|
Two WholeReduceSum operations |
Three BlockReduceSum operations |
One BlockReduceSum operation + one WholeReduceSum operation |
|---|---|---|
|
13us |
13.94us |
8.44us |
[Two WholeReduceSum operations]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
... pipe.InitBuffer(calcBuf, totalLength * sizeof(DTYPE)); AscendC::LocalTensor<DTYPE> tempTensor1 = calcBuf.Get<DTYPE>(); const uint32_t repeatNum = (totalLength * sizeof(DTYPE) + REP_LEN - 1) / REP_LEN; AscendC::SetMaskCount(); AscendC::SetVectorMask<DTYPE>(0, totalLength); AscendC::WholeReduceSum<DTYPE, false>(tempTensor1, xLocal, 1, AscendC::MASK_PLACEHOLDER, DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_REP_STRIDE); AscendC::PipeBarrier<PIPE_V>(); AscendC::SetVectorMask<DTYPE>(0, repeatNum); AscendC::WholeReduceSum<DTYPE, false>(zLocal, tempTensor1, 1, AscendC::MASK_PLACEHOLDER, DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_REP_STRIDE); AscendC::PipeBarrier<PIPE_V>(); AscendC::SetMaskNorm(); ... |
[Three BlockReduceSum operations]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
... static constexpr uint32_t BLK_LEN = 32; TBuf<TPosition::VECCALC> calcBuf; constexpr uint32_t c0Count = BLK_LEN / sizeof(DTYPE_X); const uint32_t blockNum0 = (totalLength + c0Count - 1) / c0Count; const uint32_t blockNum1 = (blockNum0 + c0Count - 1) / c0Count; AscendC::SetMaskCount(); AscendC::SetVectorMask<DTYPE_X>(0, totalLength); AscendC::BlockReduceSum<DTYPE_X, false>(tempTensor1, xLocal, AscendC::MASK_PLACEHOLDER, 1, DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_REP_STRIDE); AscendC::PipeBarrier<PIPE_V>(); AscendC::SetVectorMask<DTYPE_X>(0, blockNum0); AscendC::BlockReduceSum<DTYPE_X, false>(tempTensor1, tempTensor1, AscendC::MASK_PLACEHOLDER, 1, DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_REP_STRIDE); AscendC::PipeBarrier<PIPE_V>(); AscendC::SetVectorMask<DTYPE_X>(0, blockNum1); AscendC::BlockReduceSum<DTYPE_X, false>(zLocal, tempTensor1, AscendC::MASK_PLACEHOLDER, 1, DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_REP_STRIDE); AscendC::PipeBarrier<PIPE_V>(); AscendC::SetMaskNorm(); ... |
[BlockReduceSum + WholeReduceSum operations]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
... pipe.InitBuffer(calcBuf, totalLength * sizeof(DTYPE)); AscendC::LocalTensor<DTYPE> tempTensor1 = calcBuf.Get<DTYPE>(); constexpr uint32_t c0Count = BLK_LEN / sizeof(DTYPE); const uint32_t blockNum0 = (totalLength + c0Count - 1) / c0Count; AscendC::SetMaskCount(); AscendC::SetVectorMask<DTYPE>(0, totalLength); AscendC::BlockReduceSum<DTYPE, false>(tempTensor1, xLocal, 1, AscendC::MASK_PLACEHOLDER, DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_REP_STRIDE); AscendC::PipeBarrier<PIPE_V>(); AscendC::SetVectorMask<DTYPE>(0, blockNum0); AscendC::WholeReduceSum<DTYPE, false>(zLocal, tempTensor1, 1, AscendC::MASK_PLACEHOLDER, DEFAULT_BLK_STRIDE, DEFAULT_BLK_STRIDE, DEFAULT_REP_STRIDE); AscendC::PipeBarrier<PIPE_V>(); AscendC::SetMaskNorm(); ... |