Using the Counter Mode for Vector Operators

[Priority] High

[Description] In normal mode, the amount of data computed by the vector compute API of a vector operator is controlled by repeatTimes (the number of iterations) and mask. To compute a given total number of elements, you must check whether the data splits into main blocks and a tail block. For the main blocks, set mask so that all elements of each iteration participate in the computation and compute the number of iterations required. Then reset mask to the number of remaining elements and compute the tail block. This process involves a large number of scalar computations. In counter mode, you do not need to compute the number of iterations or determine whether there is a tail block. After the mask mode is set to counter mode, you only need to set mask to a value in the range of {0, total number of elements} and then call the corresponding API. This simplifies the processing logic and reduces both the amount of code and the amount of scalar computation.

Note: For details about the normal mode, counter mode, and mask, see Mask Operations.

The following negative and positive examples use the AddCustom operator. The call to the Add API shown below is rewritten in each example to illustrate the advantages of counter mode.

AscendC::Add(zLocal, xLocal, yLocal, this->tileLength);

[Negative Example]

The inputs xLocal and yLocal are of the half type and contain 15,000 elements each. In normal mode, the maximum number of elements computed in each iteration is 256 B / sizeof(half) = 128. The 15,000 Add computations are therefore divided into 117 iterations (15000 / 128, rounded down) for the main blocks, with 128 elements in each iteration, plus one iteration for the tail block, which contains the remaining 24 elements (15000 - 117 x 128). In code, this requires computing repeatTimes for the main blocks (mask = 128) and the number of elements in the tail block (mask = 24). All of these steps are scalar computations.

uint32_t ELE_SIZE = 15000;
AscendC::BinaryRepeatParams binaryParams;

uint32_t numPerRepeat = 256/sizeof(DTYPE_X); // DTYPE_X is of the half type.
uint32_t mainRepeatTimes = ELE_SIZE / numPerRepeat;
uint32_t tailEleNum = ELE_SIZE % numPerRepeat;

AscendC::SetMaskNorm();
AscendC::SetVectorMask<DTYPE_X, AscendC::MaskMode::NORMAL>(numPerRepeat); // Normal mode: 128 elements are computed in each iteration.
AscendC::Add<DTYPE_X, false>(zLocal, xLocal, yLocal, AscendC::MASK_PLACEHOLDER, mainRepeatTimes, binaryParams);   // MASK_PLACEHOLDER (value 0) is only a placeholder; the mask that takes effect is the one set by SetVectorMask.
if (tailEleNum > 0) {
     AscendC::SetVectorMask<DTYPE_X, AscendC::MaskMode::NORMAL>(tailEleNum); // Normal mode: 24 elements are computed in the tail iteration.
     // Offset the tensors to the tail block, which starts at element index 14976 of xLocal and yLocal.
     AscendC::Add<DTYPE_X, false>(zLocal[mainRepeatTimes * numPerRepeat], xLocal[mainRepeatTimes * numPerRepeat],
           yLocal[mainRepeatTimes * numPerRepeat], AscendC::MASK_PLACEHOLDER, 1, binaryParams);
}
AscendC::ResetMask(); // Reset the mask value.
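The main/tail split that the negative example performs can be checked on the host with ordinary C++. The sketch below is illustrative only: RepeatSplit and SplitForNormalMode are hypothetical names that are not part of Ascend C, and the 256-byte-per-iteration figure is taken from the text above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type (not an Ascend C API): how a normal-mode
// computation splits into full iterations and one partial iteration.
struct RepeatSplit {
    uint32_t numPerRepeat;     // elements handled by one full iteration
    uint32_t mainRepeatTimes;  // number of full iterations (main blocks)
    uint32_t tailEleNum;       // elements left for the tail block, 0 if none
    uint32_t tailOffset;       // element index at which the tail block starts
};

// Assumption from the text: one iteration processes 256 bytes.
constexpr uint32_t kBytesPerRepeat = 256;

inline RepeatSplit SplitForNormalMode(uint32_t totalElements, uint32_t bytesPerElement)
{
    assert(bytesPerElement > 0 && bytesPerElement <= kBytesPerRepeat);
    RepeatSplit s{};
    s.numPerRepeat = kBytesPerRepeat / bytesPerElement;
    s.mainRepeatTimes = totalElements / s.numPerRepeat;   // integer division
    s.tailEleNum = totalElements % s.numPerRepeat;
    s.tailOffset = s.mainRepeatTimes * s.numPerRepeat;
    return s;
}
```

For 15,000 elements of a 2-byte type, SplitForNormalMode(15000, 2) yields 128 elements per iteration, 117 main iterations, a 24-element tail, and a tail offset of 14976, matching the figures in the negative example.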

[Positive Example]

The inputs xLocal and yLocal are of the half type and contain 15,000 elements each. In counter mode, you only need to set mask to the total number of elements to compute (15,000) and then call the Add instruction once to complete all computations. This removes the main-block and tail-block calculations and keeps the code clean.

The advantage of counter mode is more obvious when several vector computations of 15,000 elements are performed in sequence, because you no longer need to switch between the mask values of the main blocks and the tail block for each computation.

uint32_t ELE_SIZE = 15000;
AscendC::BinaryRepeatParams binaryParams;
AscendC::SetMaskCount();
AscendC::SetVectorMask<DTYPE_X, AscendC::MaskMode::COUNTER>(ELE_SIZE);  // Counter mode: a total of 15000 elements are computed.
AscendC::Add<DTYPE_X, false>(zLocal, xLocal, yLocal, AscendC::MASK_PLACEHOLDER, 1, binaryParams);  // MASK_PLACEHOLDER (value 0) is only a placeholder; the mask that takes effect is the one set by SetVectorMask.
AscendC::ResetMask(); // Reset the mask value.
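To make the difference between the two modes concrete, the following host-side model mimics their element coverage in plain C++. It is a sketch under simplifying assumptions, not the hardware behavior: AddOneCallModel, AddNormalModel, and AddCounterModel are hypothetical functions that are not part of Ascend C, float stands in for half, and each iteration is assumed to cover numPerRepeat contiguous elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model only; none of these functions are Ascend C APIs.
// One call runs repeatTimes iterations; each iteration spans numPerRepeat
// contiguous elements, of which the first `mask` are computed.
inline void AddOneCallModel(std::vector<float>& z, const std::vector<float>& x,
                            const std::vector<float>& y, std::size_t offset,
                            uint32_t mask, uint32_t repeatTimes, uint32_t numPerRepeat)
{
    for (uint32_t r = 0; r < repeatTimes; ++r) {
        const std::size_t base = offset + static_cast<std::size_t>(r) * numPerRepeat;
        for (uint32_t i = 0; i < mask; ++i) {
            z[base + i] = x[base + i] + y[base + i];
        }
    }
}

// Normal-mode model: one call for the main blocks, one for the tail block.
// Returns the number of calls issued.
inline int AddNormalModel(std::vector<float>& z, const std::vector<float>& x,
                          const std::vector<float>& y, uint32_t numPerRepeat)
{
    assert(numPerRepeat > 0 && x.size() == y.size() && z.size() == x.size());
    const uint32_t total = static_cast<uint32_t>(x.size());
    const uint32_t mainRepeatTimes = total / numPerRepeat;  // scalar work
    const uint32_t tailEleNum = total % numPerRepeat;       // scalar work
    int calls = 0;
    if (mainRepeatTimes > 0) {
        AddOneCallModel(z, x, y, 0, numPerRepeat, mainRepeatTimes, numPerRepeat);
        ++calls;
    }
    if (tailEleNum > 0) {
        const std::size_t tailOffset = static_cast<std::size_t>(mainRepeatTimes) * numPerRepeat;
        AddOneCallModel(z, x, y, tailOffset, tailEleNum, 1, numPerRepeat);
        ++calls;
    }
    return calls;
}

// Counter-mode model: a single call that covers `count` elements.
inline int AddCounterModel(std::vector<float>& z, const std::vector<float>& x,
                           const std::vector<float>& y, uint32_t count)
{
    assert(count <= x.size() && x.size() == y.size() && z.size() == x.size());
    for (uint32_t i = 0; i < count; ++i) {
        z[i] = x[i] + y[i];
    }
    return 1;
}
```

With 15,000-element inputs and numPerRepeat = 128, the normal-mode model issues two calls and needs the main/tail arithmetic, while the counter-mode model issues one call and produces the same output.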

[Performance Data]

Figure 1 Scalar time in normal mode vs. counter mode
Figure 2 AIV cycles in normal mode vs. counter mode
Figure 3 Total time in normal mode vs. counter mode

The preceding figures and sample code show that counter mode greatly simplifies the code, eases maintenance, and shortens the time spent on scalar and vector computation, which improves performance.