Tail Core Tiling

When inputs of varying shapes are tiled, the data may not divide evenly across cores. For example, if the input shape of an operator is [1, 1999], eight cores are used, and the data type is half, the total amount of data to compute is 1 x 1999 x sizeof(half) = 3998 bytes. 3998 bytes is neither a multiple of 32 bytes nor evenly divisible across eight cores. After multi-core tiling, the cores therefore compute different amounts of data, and the goal is to keep the allocation as even as possible. Based on the amount of data they compute, cores are divided into whole cores (which compute more data) and tail cores (which compute less).

Figure 1 Data alignment

Tiling Implementation

For data movement and vector computation on the Ascend AI Processor, both the data length and the start address in the Unified Buffer must be 32-byte aligned. The length of the data to be processed must therefore be rounded up to a multiple of 32 bytes. For details about subsequent data movement and computation in this scenario, see Non-Alignment Scenario. The following code snippet shows an example of aligning data to the data block size.

constexpr uint32_t SIZE_OF_HALF = 2;
constexpr uint32_t BLOCK_SIZE = 32;
constexpr uint32_t BLOCK_DIM = 8;
constexpr uint32_t ALIGN_NUM = BLOCK_SIZE / SIZE_OF_HALF;
// The shape needs to be aligned to 32 bytes. Assuming the original totalLength is 1999, rounding up to meet the 32-byte alignment requirement would make it 2000.
uint32_t totalLengthAligned = ((totalLength + ALIGN_NUM - 1) / ALIGN_NUM) * ALIGN_NUM;
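
The round-up above can be checked on the host at compile time. The following is a minimal sketch, not part of the Ascend C API: AlignUp is a hypothetical helper name introduced only for illustration.

```cpp
#include <cstdint>

constexpr uint32_t SIZE_OF_HALF = 2;                       // bytes per half element
constexpr uint32_t BLOCK_SIZE = 32;                        // bytes per data block
constexpr uint32_t ALIGN_NUM = BLOCK_SIZE / SIZE_OF_HALF;  // 16 elements per data block

// Hypothetical helper (assumption, not an Ascend C API):
// rounds value up to the nearest multiple of align. align must be non-zero.
constexpr uint32_t AlignUp(uint32_t value, uint32_t align)
{
    return ((value + align - 1) / align) * align;
}

// 1999 half elements are padded to 2000 elements.
static_assert(AlignUp(1999, ALIGN_NUM) == 2000, "1999 elements pad to 2000");
// A length that is already aligned is left unchanged.
static_assert(AlignUp(2000, ALIGN_NUM) == 2000, "aligned lengths are unchanged");
// 2000 elements x 2 bytes = 4000 bytes, which is a multiple of 32.
static_assert(AlignUp(1999, ALIGN_NUM) * SIZE_OF_HALF % BLOCK_SIZE == 0, "padded size is 32-byte aligned");
```

Because the arithmetic uses unsigned integer division, the expression is the same one used for totalLengthAligned; only the helper wrapper is new.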

The 32-byte-aligned data should then be allocated across the cores as evenly as possible. If it cannot be divided evenly, first allocate the part that can be, then spread the remainder over some of the cores, so that those cores each compute one additional data block. To keep each core's share 32-byte aligned, allocate the data at the granularity of ALIGN_NUM elements (ALIGN_NUM elements occupy 32 bytes, that is, one data block). In this sample, the data type is half, so ALIGN_NUM = BLOCK_SIZE / sizeof(half) = 16. The aligned data is therefore split into x data blocks of ALIGN_NUM elements each, where x = 2000 / 16 = 125.
The number of cores used (BLOCK_DIM) is 8, and 125 data blocks cannot be divided evenly across eight cores. To allocate the data blocks as evenly as possible:
1. Compute the quotient x / BLOCK_DIM = 15 (integer division).
2. Compute the remainder x % BLOCK_DIM = 5.
Allocating 15 data blocks to each core leaves five data blocks, which are allocated one each to five cores. The result is five whole cores that compute 16 data blocks each and three tail cores that compute 15 data blocks each. The following figure shows an example of multi-core tiling when data cannot be evenly allocated.
Figure 2 Example in which data cannot be evenly allocated across cores
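
The quotient-and-remainder steps above can be sketched as a small host-side function. This is an illustrative sketch only: BlocksPerCore is a hypothetical name, and giving the extra blocks to the lowest-numbered cores is an assumption chosen to match the kernel code later in this section, which treats cores with a block index below formerNum as whole cores.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (assumption, not an Ascend C API): returns the number of
// data blocks assigned to each core when totalBlocks blocks are spread as
// evenly as possible over coreNum cores. coreNum must be non-zero.
std::vector<uint32_t> BlocksPerCore(uint32_t totalBlocks, uint32_t coreNum)
{
    const uint32_t base = totalBlocks / coreNum;   // blocks every core receives
    const uint32_t extra = totalBlocks % coreNum;  // blocks left after the even split
    std::vector<uint32_t> blocks(coreNum, base);
    for (uint32_t i = 0; i < extra; ++i) {
        blocks[i] += 1;  // the first `extra` cores become whole cores
    }
    return blocks;
}
```

For the sample values, BlocksPerCore(125, 8) gives 16 blocks to each of cores 0 to 4 and 15 blocks to each of cores 5 to 7.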

Based on the above description, the following operator tiling structure members are designed:

formerNum: number of cores allocated with a larger amount of data, that is, the number of whole cores
tailNum: number of cores allocated with a smaller amount of data, that is, the number of tail cores
formerLength: length of data computed by whole cores
tailLength: length of data computed by tail cores

The tiling parameters are computed as follows:

constexpr uint32_t BLOCK_DIM = 8;
constexpr uint32_t SIZE_OF_HALF = 2;
constexpr uint32_t BLOCK_SIZE = 32;
// Minimum unit to which the shape needs to be aligned.
constexpr uint32_t ALIGN_NUM = BLOCK_SIZE / SIZE_OF_HALF;
...
uint8_t *GenerateTiling()
{
    // Align the shape to the data block size. If the original totalLength is 1999, rounding up to 32-byte alignment gives 2000.
    uint32_t totalLengthAligned = ((totalLength + ALIGN_NUM - 1) / ALIGN_NUM) * ALIGN_NUM;
    // With 8 cores and 16 elements per data block, the total number of data blocks is 2000 / 16 = 125.
    // Whole cores: 125 % 8 = 5 cores are allocated 16 data blocks each.
    // Tail cores: 8 - 5 = 3 cores are allocated 15 data blocks each.
    uint32_t formerNum = (totalLengthAligned / ALIGN_NUM) % BLOCK_DIM;
    uint32_t tailNum = BLOCK_DIM - formerNum;
    // Length of data computed by a whole core: the average number of elements per core (totalLengthAligned / BLOCK_DIM), rounded up to 32-byte alignment.
    uint32_t formerLength = ((totalLengthAligned / BLOCK_DIM + ALIGN_NUM - 1) / ALIGN_NUM) * ALIGN_NUM;
    // Length of data computed by a tail core: the same average, rounded down to 32-byte alignment.
    uint32_t tailLength = (totalLengthAligned / BLOCK_DIM / ALIGN_NUM) * ALIGN_NUM;
    ...
}
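
As a sanity check on the formulas above, the following host-side sketch recomputes the four tiling members and verifies that the whole cores and tail cores together cover exactly the aligned length. The TailCoreTiling struct and the ComputeTiling and CoveredLength functions are hypothetical names used only for illustration; they are not the tiling structure or API of the actual sample.

```cpp
#include <cstdint>

constexpr uint32_t BLOCK_DIM = 8;
constexpr uint32_t SIZE_OF_HALF = 2;
constexpr uint32_t BLOCK_SIZE = 32;
constexpr uint32_t ALIGN_NUM = BLOCK_SIZE / SIZE_OF_HALF;

// Hypothetical container for the four tiling members described above.
struct TailCoreTiling {
    uint32_t formerNum;     // number of whole cores
    uint32_t tailNum;       // number of tail cores
    uint32_t formerLength;  // elements computed by each whole core
    uint32_t tailLength;    // elements computed by each tail core
};

// Applies the same formulas as the listing above to a given element count.
TailCoreTiling ComputeTiling(uint32_t totalLength)
{
    const uint32_t totalLengthAligned = ((totalLength + ALIGN_NUM - 1) / ALIGN_NUM) * ALIGN_NUM;
    TailCoreTiling t{};
    t.formerNum = (totalLengthAligned / ALIGN_NUM) % BLOCK_DIM;
    t.tailNum = BLOCK_DIM - t.formerNum;
    t.formerLength = ((totalLengthAligned / BLOCK_DIM + ALIGN_NUM - 1) / ALIGN_NUM) * ALIGN_NUM;
    t.tailLength = (totalLengthAligned / BLOCK_DIM / ALIGN_NUM) * ALIGN_NUM;
    return t;
}

// Total number of elements covered by all cores under a given tiling.
uint32_t CoveredLength(const TailCoreTiling &t)
{
    return t.formerNum * t.formerLength + t.tailNum * t.tailLength;
}
```

For totalLength = 1999, this yields formerNum = 5, tailNum = 3, formerLength = 256, and tailLength = 240, so the cores cover 5 x 256 + 3 x 240 = 2000 elements. When the number of data blocks is a multiple of BLOCK_DIM, formerNum is 0 and every core is treated as a tail core with the same length.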


Operator Class Implementation

In the Init function on the kernel side, the offset of each core's input in global memory must be computed differently for whole cores and tail cores.

On a whole core, the memory offset address of the input is computed as follows:

xGm.SetGlobalBuffer((__gm__ half *)x + formerLength * AscendC::GetBlockIdx(), formerLength);

On a tail core, the offset is the total data length of all whole cores plus the data length of the tail cores that precede the current core. The code is as follows:

xGm.SetGlobalBuffer((__gm__ half *)x + formerLength * formerNum + tailLength * (AscendC::GetBlockIdx() - formerNum), tailLength);

The complete implementation of the Init function is as follows:

__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, AddCustomTilingData tiling)
{
    // formerNum, formerLength, and tailLength are assumed to have been obtained from the tiling data.
    if (AscendC::GetBlockIdx() < formerNum) {
        this->tileLength = formerLength;
        xGm.SetGlobalBuffer((__gm__ half *)x + formerLength * AscendC::GetBlockIdx(), formerLength);
        yGm.SetGlobalBuffer((__gm__ half *)y + formerLength * AscendC::GetBlockIdx(), formerLength);
        zGm.SetGlobalBuffer((__gm__ half *)z + formerLength * AscendC::GetBlockIdx(), formerLength);
    } else {
        this->tileLength = tailLength;
        xGm.SetGlobalBuffer((__gm__ half *)x + formerLength * formerNum + tailLength * (AscendC::GetBlockIdx() - formerNum), tailLength);
        yGm.SetGlobalBuffer((__gm__ half *)y + formerLength * formerNum + tailLength * (AscendC::GetBlockIdx() - formerNum), tailLength);
        zGm.SetGlobalBuffer((__gm__ half *)z + formerLength * formerNum + tailLength * (AscendC::GetBlockIdx() - formerNum), tailLength);
    }
    pipe.InitBuffer(inQueueX, 1, this->tileLength * sizeof(half));
    pipe.InitBuffer(inQueueY, 1, this->tileLength * sizeof(half));
    pipe.InitBuffer(outQueueZ, 1, this->tileLength * sizeof(half));
}
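
The offset arithmetic in Init can be verified on the host without any device code. The sketch below mirrors the two branches and checks that the per-core ranges are contiguous and end exactly at the aligned length. CoreRange, RangeForCore, and RangesCoverInput are hypothetical names introduced for illustration, and the block index is passed in as a plain parameter rather than read from AscendC::GetBlockIdx().

```cpp
#include <cstdint>

// Hypothetical description of the slice of global memory handled by one core,
// measured in elements from the start of the input.
struct CoreRange {
    uint32_t offset;
    uint32_t length;
};

// Mirrors the whole-core and tail-core branches of Init on the host.
CoreRange RangeForCore(uint32_t blockIdx, uint32_t formerNum,
                       uint32_t formerLength, uint32_t tailLength)
{
    if (blockIdx < formerNum) {
        // Whole core: preceded only by other whole cores.
        return {formerLength * blockIdx, formerLength};
    }
    // Tail core: preceded by all whole cores and by the earlier tail cores.
    return {formerLength * formerNum + tailLength * (blockIdx - formerNum), tailLength};
}

// Returns true if the ranges of cores 0..coreNum-1 follow one another without
// gaps or overlap and together end at totalLengthAligned.
bool RangesCoverInput(uint32_t coreNum, uint32_t formerNum, uint32_t formerLength,
                      uint32_t tailLength, uint32_t totalLengthAligned)
{
    uint32_t expectedOffset = 0;
    for (uint32_t idx = 0; idx < coreNum; ++idx) {
        const CoreRange r = RangeForCore(idx, formerNum, formerLength, tailLength);
        if (r.offset != expectedOffset) {
            return false;
        }
        expectedOffset += r.length;
    }
    return expectedOffset == totalLengthAligned;
}
```

With the sample values (formerNum = 5, formerLength = 256, tailLength = 240), core 5 is the first tail core and starts at element 1280, and core 7 ends at element 2000.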


The other functions are implemented in the same way as in multi-core tiling.
