Tail Cores and Tail Blocks

When inputs of varying shapes are tiled, the data may not divide evenly across cores, and the data assigned to each core may not divide evenly into tiles either. In that case, both the whole cores and the tail cores have tail blocks. To handle them, combine the methods described in the tail block processing in even allocation across cores and the tail core processing in uneven allocation across cores.

Tiling Implementation

In this scenario, data can be evenly allocated neither across cores nor within a core. Therefore, two member variables are added to the tiling structure defined in the tail core processing in uneven allocation across cores:

  • formerLastTileLength: size of the last block on the cores with a larger amount of data (the whole cores), that is, the size of the tail block of the whole cores

    First, the number of whole cores and the amount of data on each whole core are computed based on the core allocation strategy described in Tail Core Tiling.

    // Round the total length up to a multiple of ALIGN_NUM.
    uint32_t totalLengthAligned = ((totalLength + ALIGN_NUM - 1) / ALIGN_NUM) * ALIGN_NUM;
    // Compute the number of whole cores (cores with a larger amount of data).
    uint32_t formerNum = (totalLengthAligned / ALIGN_NUM) % BLOCK_DIM;
    // Compute the amount of data on each whole core, rounded up to a multiple of ALIGN_NUM.
    uint32_t formerLength = ((totalLengthAligned / BLOCK_DIM + ALIGN_NUM - 1) / ALIGN_NUM) * ALIGN_NUM;
    

    Then, the length of the tail block is computed based on the tiling strategy in Tail Block Tiling.

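    // Number of full tiles on each whole core (each tile holds UB_BLOCK_NUM aligned blocks).
    uint32_t formerTileNum = formerLength / ALIGN_NUM / UB_BLOCK_NUM;
    if (formerTileNum == 0) {
        // Less than one full tile: all data goes into the tail block.
        formerTileLength = 0;
        formerLastTileLength = formerLength / ALIGN_NUM * ALIGN_NUM;
    } else if ((formerLength / ALIGN_NUM) % UB_BLOCK_NUM == 0) {
        // The data divides evenly into full tiles: there is no tail block.
        formerTileLength = UB_BLOCK_NUM * ALIGN_NUM;
        formerLastTileLength = 0;
    } else {
        // Full tiles followed by a shorter tail block.
        formerTileLength = UB_BLOCK_NUM * ALIGN_NUM;
        formerLastTileLength = formerLength - formerTileNum * formerTileLength;
    }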
    
  • tailLastTileLength: size of the last block on the cores with a smaller amount of data (the tail cores), that is, the size of the tail block of the tail cores

    First, the number of tail cores and the amount of data on each tail core are computed based on the core allocation strategy described in Tail Core Tiling.

    // Compute the number of tail cores.
    uint32_t tailNum = BLOCK_DIM - formerNum;
    // Compute the amount of data of the tail core.
    uint32_t tailLength = (totalLengthAligned / BLOCK_DIM / ALIGN_NUM) * ALIGN_NUM;
    

    Then, the length of the tail block is computed based on the tiling strategy in Tail Block Tiling.

    uint32_t tailTileNum = tailLength / ALIGN_NUM / UB_BLOCK_NUM;
    if (tailTileNum == 0) {
        tailTileLength = 0;
        tailLastTileLength = tailLength / ALIGN_NUM * ALIGN_NUM;
    } else if ((tailLength / ALIGN_NUM) % UB_BLOCK_NUM == 0) {
        tailTileLength = UB_BLOCK_NUM * ALIGN_NUM;
        tailLastTileLength = 0;
    } else {
        tailTileLength = UB_BLOCK_NUM * ALIGN_NUM;
        tailLastTileLength = tailLength - tailTileNum * tailTileLength;
    }
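
The two steps above (core allocation, then tile splitting on each kind of core) can be put together in a small host-side sketch. It is plain C++ rather than a real tiling function: the names TilingSketch, SplitTiles, and ComputeTilingSketch are hypothetical, and ALIGN_NUM, BLOCK_DIM, and UB_BLOCK_NUM are passed as ordinary parameters so that the arithmetic can be checked with concrete numbers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the values computed in this section.
struct TilingSketch {
    uint32_t formerNum, formerLength, formerTileNum, formerTileLength, formerLastTileLength;
    uint32_t tailNum, tailLength, tailTileNum, tailTileLength, tailLastTileLength;
};

// Splits the data of one core into full tiles plus a tail block,
// using the same three branches as the snippets in this section.
inline void SplitTiles(uint32_t length, uint32_t alignNum, uint32_t ubBlockNum,
                       uint32_t &tileNum, uint32_t &tileLength, uint32_t &lastTileLength)
{
    tileNum = length / alignNum / ubBlockNum;
    if (tileNum == 0) {
        tileLength = 0;
        lastTileLength = length / alignNum * alignNum;
    } else if ((length / alignNum) % ubBlockNum == 0) {
        tileLength = ubBlockNum * alignNum;
        lastTileLength = 0;
    } else {
        tileLength = ubBlockNum * alignNum;
        lastTileLength = length - tileNum * tileLength;
    }
}

// Computes the core split first, then the tile split for whole cores and tail cores.
inline TilingSketch ComputeTilingSketch(uint32_t totalLength, uint32_t alignNum,
                                        uint32_t blockDim, uint32_t ubBlockNum)
{
    assert(alignNum > 0 && blockDim > 0 && ubBlockNum > 0);
    TilingSketch t{};
    uint32_t totalLengthAligned = ((totalLength + alignNum - 1) / alignNum) * alignNum;
    t.formerNum = (totalLengthAligned / alignNum) % blockDim;
    t.formerLength = ((totalLengthAligned / blockDim + alignNum - 1) / alignNum) * alignNum;
    t.tailNum = blockDim - t.formerNum;
    t.tailLength = (totalLengthAligned / blockDim / alignNum) * alignNum;
    SplitTiles(t.formerLength, alignNum, ubBlockNum,
               t.formerTileNum, t.formerTileLength, t.formerLastTileLength);
    SplitTiles(t.tailLength, alignNum, ubBlockNum,
               t.tailTileNum, t.tailTileLength, t.tailLastTileLength);
    return t;
}
```

With the illustrative values totalLength = 1999, ALIGN_NUM = 16, BLOCK_DIM = 8, and UB_BLOCK_NUM = 6, the aligned length is 2000. The sketch yields 5 whole cores of 256 elements each (two tiles of 96 plus a tail block of 64) and 3 tail cores of 240 elements each (two tiles of 96 plus a tail block of 48), which adds up to 2000.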
    

Operator Class Implementation

The Init and Process functions on the kernel side are implemented by combining the tail block processing in even allocation across cores with the tail core processing in uneven allocation across cores.

In the Init function, the whole cores and the tail cores have different tileNum, tileLength, and lastTileLength values. Therefore, the two kinds of cores must be initialized separately, as described in the tail core processing in uneven allocation across cores. After that, the CopyIn, Compute, and CopyOut functions process the main blocks and the tail block in the same way as in the tail block processing in even allocation across cores.
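
As a rough illustration of what handling main blocks and the tail block means for one core, the sketch below lists the number of elements handled in each iteration. It is plain host-side C++, not Ascend C kernel code. The helper name TileLengthsSketch is hypothetical, and it assumes that tileNum counts only the full tiles, with one extra iteration for the tail block when lastTileLength is nonzero. The actual Process loop is not shown in this section and may be organized differently.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: element count handled in each loop iteration on one core.
// Assumption: tileNum full tiles come first, followed by one tail block
// when lastTileLength is greater than 0.
inline std::vector<uint32_t> TileLengthsSketch(uint32_t tileNum, uint32_t tileLength,
                                               uint32_t lastTileLength)
{
    // Full tiles must have a nonzero length.
    assert(tileNum == 0 || tileLength > 0);
    std::vector<uint32_t> lengths(tileNum, tileLength);
    if (lastTileLength > 0) {
        lengths.push_back(lastTileLength);
    }
    return lengths;
}
```

For a core with tileNum = 2, tileLength = 96, and lastTileLength = 64, the iterations handle 96, 96, and 64 elements. For a core that has only a tail block (tileNum = 0, tileLength = 0, lastTileLength = 48), there is a single iteration of 48 elements.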

The implementation code of the Init function is as follows:

__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, AddCustomTilingData tiling)
{
    if (AscendC::GetBlockIdx() < tiling.formerNum) {
        this->tileNum = tiling.formerTileNum;
        this->tileLength = tiling.formerTileLength;
        this->lastTileLength = tiling.formerLastTileLength;

        xGm.SetGlobalBuffer((__gm__ half *)x + tiling.formerLength * AscendC::GetBlockIdx(),
            tiling.formerLength);
        yGm.SetGlobalBuffer((__gm__ half *)y + tiling.formerLength * AscendC::GetBlockIdx(),
            tiling.formerLength);
        zGm.SetGlobalBuffer((__gm__ half *)z + tiling.formerLength * AscendC::GetBlockIdx(),
            tiling.formerLength);
    } else {
        this->tileNum = tiling.tailTileNum;
        this->tileLength = tiling.tailTileLength;
        this->lastTileLength = tiling.tailLastTileLength;

        xGm.SetGlobalBuffer((__gm__ half *)x + tiling.formerLength * tiling.formerNum +
            tiling.tailLength * (AscendC::GetBlockIdx() - tiling.formerNum), tiling.tailLength);
        yGm.SetGlobalBuffer((__gm__ half *)y + tiling.formerLength * tiling.formerNum +
            tiling.tailLength * (AscendC::GetBlockIdx() - tiling.formerNum), tiling.tailLength);
        zGm.SetGlobalBuffer((__gm__ half *)z + tiling.formerLength * tiling.formerNum +
            tiling.tailLength * (AscendC::GetBlockIdx() - tiling.formerNum), tiling.tailLength);
    }
    
    // If there are tail blocks only, tileLength is 0. Therefore, the larger value of tileLength and lastTileLength is used for initialization.
    uint32_t initBufferLength = AscendC::Std::max(this->tileLength, this->lastTileLength);
    pipe.InitBuffer(inQueueX, 1, initBufferLength * sizeof(half));
    pipe.InitBuffer(inQueueY, 1, initBufferLength * sizeof(half));
    pipe.InitBuffer(outQueueZ, 1, initBufferLength * sizeof(half));
}
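
The global memory offsets used in the two branches of Init can be checked on the host with a short sketch. It is plain C++ with hypothetical names (CoreRangeSketch, CoreRange) and mirrors only the offset arithmetic passed to SetGlobalBuffer. It does not call any Ascend C API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of the slice of the input owned by one core.
struct CoreRangeSketch {
    uint32_t offset;  // index of the first element owned by the core
    uint32_t length;  // number of elements owned by the core
};

// Mirrors the offset arithmetic of Init: whole cores come first and each owns
// formerLength elements; tail cores follow and each owns tailLength elements.
inline CoreRangeSketch CoreRange(uint32_t blockIdx, uint32_t formerNum,
                                 uint32_t formerLength, uint32_t tailLength)
{
    // Whole cores never hold less data than tail cores.
    assert(formerLength >= tailLength);
    if (blockIdx < formerNum) {
        return {formerLength * blockIdx, formerLength};
    }
    return {formerLength * formerNum + tailLength * (blockIdx - formerNum), tailLength};
}
```

For example, with formerNum = 5, formerLength = 256, and tailLength = 240 on 8 cores, core 4 starts at element 1024, core 5 at 1280, and core 7 at 1760. The last core ends at element 2000, so the slices are contiguous and do not overlap.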