Tail Block Processing

As shown in the following figure, the input shape of the operator is (1, 2048), and the supported data type is half. The data can be aligned to the size of a data block (32 bytes) and evenly allocated to each core (assume that eight cores are used). Each core processes 256 elements (512 bytes), that is, 16 data blocks. In this case, tail block processing is not required.

Figure 1 Shape-aligned scenario

For some shapes, for example, an input shape of (1, 1999) with the data type half, the data can be neither aligned to the size of a data block (32 bytes) nor evenly allocated to each core. In this case, tail block processing is required after multi-core splitting.
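The two cases above can be told apart with a small host-side check. The following is a minimal sketch in plain C++ and is not part of the Ascend C API: the helper names `IsEvenlyTileable` and `BlocksPerCore` are hypothetical, and the constants mirror the 32-byte data block and the 2-byte half type used in this section.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t BLOCK_SIZE = 32;   // bytes in one data block
constexpr uint32_t SIZE_OF_HALF = 2;  // bytes in one half element

// Hypothetical helper: true when totalLength half elements fill whole data
// blocks and those blocks divide evenly among coreNum cores.
inline bool IsEvenlyTileable(uint32_t totalLength, uint32_t coreNum)
{
    assert(coreNum > 0);
    const uint32_t totalBytes = totalLength * SIZE_OF_HALF;
    if (totalBytes % BLOCK_SIZE != 0) {
        return false;  // the data does not end on a data block boundary
    }
    return (totalBytes / BLOCK_SIZE) % coreNum == 0;
}

// Hypothetical helper: data blocks per core. Only meaningful when the data
// is evenly tileable, which the assertion enforces.
inline uint32_t BlocksPerCore(uint32_t totalLength, uint32_t coreNum)
{
    assert(IsEvenlyTileable(totalLength, coreNum));
    return totalLength * SIZE_OF_HALF / BLOCK_SIZE / coreNum;
}
```

With eight cores, `IsEvenlyTileable(2048, 8)` holds and `BlocksPerCore(2048, 8)` gives 16, whereas `IsEvenlyTileable(1999, 8)` does not hold, so the second shape needs tail block processing.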

For data movement and Vector computation, both the data length and the start address in the UB must be 32-byte aligned. Therefore, the data to be processed must first be aligned to the size of a data block. For details about subsequent data movement and computation in this scenario, see Non-Alignment Scenario. The following figure and code snippet show an example of aligning data to the data block size.

Figure 2 Aligning to the size of a data block

constexpr uint32_t SIZE_OF_HALF = 2;
constexpr uint32_t BLOCK_SIZE = 32;
constexpr uint32_t ALIGN_NUM = BLOCK_SIZE / SIZE_OF_HALF;
// Round the length up to a whole number of data blocks. For example, an original totalLength of 1999 is rounded up to 2000 (125 data blocks of 16 half elements) to meet the 32-byte alignment requirement.
uint32_t totalLengthAligned = ((totalLength + ALIGN_NUM - 1) / ALIGN_NUM) * ALIGN_NUM;

After alignment, the data must be distributed across the cores as evenly as possible, in units of data blocks. If the data blocks cannot be divided evenly, each core first receives the same number of blocks, and the remaining blocks are then assigned one each to some of the cores, so those cores process one additional data block. The following figure shows an example of multi-core tiling in this case. After alignment to the data block size, there are 2000 half elements, that is, 125 data blocks in total. Dividing 125 by 8 gives 15 with a remainder of 5, so each core is first allocated 15 data blocks, and the remaining five data blocks are allocated to five cores. As a result, five cores process 16 data blocks each, and the other three cores process 15 data blocks each.

Figure 3 Example when data cannot be evenly allocated to each core
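The distribution described above can be sketched in plain C++ as follows. This is an illustration only: `DistributeBlocks` is a hypothetical helper, and it assumes that the extra data blocks go to the cores with the lowest indexes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the number of data blocks assigned to each
// core when totalBlocks blocks are spread over coreNum cores as evenly as
// possible. Assumption: the cores with the lowest indexes take the extras.
inline std::vector<uint32_t> DistributeBlocks(uint32_t totalBlocks, uint32_t coreNum)
{
    assert(coreNum > 0);
    const uint32_t base = totalBlocks / coreNum;   // blocks every core receives
    const uint32_t extra = totalBlocks % coreNum;  // cores that receive one more
    std::vector<uint32_t> blocksPerCore(coreNum, base);
    for (uint32_t i = 0; i < extra; ++i) {
        ++blocksPerCore[i];
    }
    return blocksPerCore;
}
```

For 125 data blocks and eight cores, `DistributeBlocks(125, 8)` yields 16 blocks for cores 0 to 4 and 15 blocks for cores 5 to 7.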

Based on the above description, the following tiling parameters can be defined:

formerNum: number of cores allocated with large blocks
tailNum: number of cores allocated with small blocks
formerLength: number of elements processed by a core allocated with a large block
tailLength: number of elements processed by a core allocated with a small block
alignNum: number of elements contained in a data block

The parameters are computed as follows:

constexpr uint32_t BLOCK_DIM = 8;
constexpr uint32_t SIZE_OF_HALF = 2;
constexpr uint32_t BLOCK_SIZE = 32;
// Minimum unit to which the shape needs to be aligned.
constexpr uint32_t ALIGN_NUM = BLOCK_SIZE / SIZE_OF_HALF;
...
uint8_t *GenerateTiling()
{
    // Align the shape to the data block size. For example, an original totalLength of 1999 is rounded up to 2000 to meet the 32-byte alignment requirement.
    uint32_t totalLengthAligned = ((totalLength + ALIGN_NUM - 1) / ALIGN_NUM) * ALIGN_NUM;
    // Allocate the data to the cores as evenly as possible.
    // If the data blocks cannot be divided evenly, each core first receives the same number of blocks. The remaining blocks are assigned one each to some of the cores, so those cores compute one additional data block.
    // Compute the number of cores that compute one additional data block and the number of remaining cores.
    // For example, 1999 elements are rounded up to 2000 elements, there are 8 cores, and each data block contains 16 elements.
    // Total number of data blocks: 2000 / 16 = 125
    // Cores allocated with 16 data blocks (large blocks): 125 % 8 = 5
    // Cores allocated with 15 data blocks (small blocks): 8 - 5 = 3
    uint32_t formerNum = (totalLengthAligned / ALIGN_NUM) % BLOCK_DIM;
    uint32_t tailNum = BLOCK_DIM - formerNum;
    // Large-block length: totalLengthAligned / BLOCK_DIM is the average number of elements per core, and formerLength rounds it up to a multiple of ALIGN_NUM (32-byte alignment).
    uint32_t formerLength = ((totalLengthAligned / BLOCK_DIM + ALIGN_NUM - 1) / ALIGN_NUM) * ALIGN_NUM;
    // Small-block length: totalLengthAligned / BLOCK_DIM is the average number of elements per core, and tailLength rounds it down to a multiple of ALIGN_NUM (32-byte alignment).
    uint32_t tailLength = (totalLengthAligned / BLOCK_DIM / ALIGN_NUM) * ALIGN_NUM;

...
}
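As a sanity check, the same formulas can be run on the host in plain C++ to confirm that the large and small blocks together cover exactly the aligned length. The following is a sketch, not the operator's actual tiling code: the `TilingSketch` struct and the `ComputeTiling` and `CoveredLength` helpers are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the tiling parameters described above.
struct TilingSketch {
    uint32_t formerNum;     // cores that process a large block
    uint32_t tailNum;       // cores that process a small block
    uint32_t formerLength;  // elements in a large block
    uint32_t tailLength;    // elements in a small block
};

// Applies the same arithmetic as the tiling snippet above, with the core
// count (blockDim) and elements per data block (alignNum) as arguments.
inline TilingSketch ComputeTiling(uint32_t totalLength, uint32_t blockDim, uint32_t alignNum)
{
    assert(blockDim > 0 && alignNum > 0);
    const uint32_t totalLengthAligned = ((totalLength + alignNum - 1) / alignNum) * alignNum;
    TilingSketch t{};
    t.formerNum = (totalLengthAligned / alignNum) % blockDim;
    t.tailNum = blockDim - t.formerNum;
    t.formerLength = ((totalLengthAligned / blockDim + alignNum - 1) / alignNum) * alignNum;
    t.tailLength = (totalLengthAligned / blockDim / alignNum) * alignNum;
    return t;
}

// Total number of elements covered by all cores under this tiling.
inline uint32_t CoveredLength(const TilingSketch &t)
{
    return t.formerNum * t.formerLength + t.tailNum * t.tailLength;
}
```

For `ComputeTiling(1999, 8, 16)` this gives formerNum = 5, tailNum = 3, formerLength = 256, and tailLength = 240, and `CoveredLength` returns 5 * 256 + 3 * 240 = 2000. The sketch has been checked only against the values used in this section (eight cores, 16 elements per data block).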


On the kernel side, the offset and the data length of each core can be computed as follows:

__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t formerNum, uint32_t tailNum, uint32_t formerLength, uint32_t tailLength, uint32_t alignNum)
{
    if (GetBlockIdx() < formerNum) {
        this->tileLength = formerLength;
        xGm.SetGlobalBuffer((__gm__ half *)x + formerLength * GetBlockIdx(), formerLength);
        yGm.SetGlobalBuffer((__gm__ half *)y + formerLength * GetBlockIdx(), formerLength);
        zGm.SetGlobalBuffer((__gm__ half *)z + formerLength * GetBlockIdx(), formerLength);
    } else {
        this->tileLength = tailLength;
        xGm.SetGlobalBuffer((__gm__ half *)x + formerLength * formerNum + tailLength * (GetBlockIdx() - formerNum), tailLength);
        yGm.SetGlobalBuffer((__gm__ half *)y + formerLength * formerNum + tailLength * (GetBlockIdx() - formerNum), tailLength);
        zGm.SetGlobalBuffer((__gm__ half *)z + formerLength * formerNum + tailLength * (GetBlockIdx() - formerNum), tailLength);
    }
    pipe.InitBuffer(inQueueX, BUFFER_NUM, this->tileLength * sizeof(half));
    pipe.InitBuffer(inQueueY, BUFFER_NUM, this->tileLength * sizeof(half));
    pipe.InitBuffer(outQueueZ, BUFFER_NUM, this->tileLength * sizeof(half));
}
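The offset arithmetic in Init can be checked outside the kernel with a plain C++ sketch. The helpers `CoreOffset` and `CoreLength` below are hypothetical, and the `blockIdx` argument stands in for the value returned by `GetBlockIdx()`. Offsets are in elements, matching the pointer arithmetic on `half *` above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: element offset of the core with index blockIdx.
// Cores below formerNum process large blocks; the rest process small blocks.
inline uint32_t CoreOffset(uint32_t blockIdx, uint32_t formerNum,
                           uint32_t formerLength, uint32_t tailLength)
{
    if (blockIdx < formerNum) {
        return formerLength * blockIdx;
    }
    // Skip all large blocks, then the small blocks of the preceding cores.
    return formerLength * formerNum + tailLength * (blockIdx - formerNum);
}

// Hypothetical helper: number of elements processed by the core.
inline uint32_t CoreLength(uint32_t blockIdx, uint32_t formerNum,
                           uint32_t formerLength, uint32_t tailLength)
{
    return (blockIdx < formerNum) ? formerLength : tailLength;
}
```

With formerNum = 5, formerLength = 256, and tailLength = 240, core 4 starts at element 1024, core 5 at element 1280, and core 7 at element 1760. Core 7 processes 240 elements, so the last element it covers ends at 2000, the aligned total length.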

