Tail Block Tiling

As shown in the following figure, the input shape of the operator is (1, 2048), and the supported data type is half. The input data can be aligned to the size of a data block (32 bytes) and contains 128 (2048 × 2/32) data blocks. Therefore, the data can be evenly allocated to each core (assuming that eight cores are used). Each core processes 256 data elements, that is, 16 data blocks. In this case, tail block processing is not required.

Figure 1 Shape-aligned scenario

For example, the input shape of the operator is (1, 1904), and the supported data type is half. The input data can be aligned to the size of a data block (32 bytes) or evenly allocated to each core (assuming that eight cores are used). Each core processes 238 data elements, which cannot be evenly allocated to the data blocks. After 14 data blocks are fully allocated, 14 data elements (28 bytes) are left. Tail block processing is required after multi-core tiling.

When data is tiled for inputs of varying shapes, the tiled data may be evenly allocated across cores, but the data in each core cannot be evenly allocated. In this scenario, the variable lastTileLength is added to the tiling parameters to indicate the size of the last block, that is, the tail block. Therefore, the following four members are included when the tiling structure of the operator is defined:

  • blockLength: length of data computed by each core.
  • tileNum: number of main data blocks tiled on each core.
  • tileLength: length of each main data block on each core.
  • lastTileLength: length of the tail block on each core.
Figure 2 Tail block after multi-core tiling

Tiling Implementation

The tiling structure of the operator is defined as follows:

1
2
3
4
5
6
struct AddCustomTilingData {
    uint32_t blockLength;
    uint32_t tileNum;
    uint32_t tileLength;
    uint32_t lastTileLength;
};

The main content of tiling implementation on the host is to compute the preceding four member variables. The procedure is as follows:

  1. Check whether the total data length totalLength is 32-byte aligned. If it is not 32-byte aligned, compute the length totalLengthAligned after totalLength is rounded up to 32-byte aligned.
    1
    2
    3
    4
    5
    6
    constexpr uint32_t BLOCK_SIZE = 32;
    // To facilitate computation, define the variable alignNum as the alignment number based on the data type.
    uint32_t alignNum = BLOCK_SIZE / dataTypeSize;
    // totalLength indicates the total amount of data.
    uint32_t totalLengthAligned = (totalLength % alignNum == 0)?
            totalLength : ((totalLength + alignNum - 1) / alignNum) * alignNum;
    
  2. Check whether totalLengthAligned can be evenly allocated based on the number of used cores (BlockDim). If it can be evenly allocated, compute the data length blockLength on each core.
    1
    2
    3
    4
    5
    6
    constexpr uint32_t BLOCK_DIM = 8;
    constexpr uint32_t UB_BLOCK_NUM = 20;  // To facilitate verification, UB_BLOCK_NUM is used as the number of available blocks on Unified Buffer. Therefore, the available UB space size is UB_BLOCK_NUM * BLOCK_SIZE.
    uint32_t blockLength, tileNum;
    if (totalLengthAligned % BLOCK_DIM == 0) {
        blockLength = totalLengthAligned / BLOCK_DIM;
    }
    
  3. Compute tileNum. To reduce data movement overhead, Unified Buffer in the core should be fully utilized. tileNum is computed based on the computation amount on each core and the size of the available space on Unified Buffer.
    1
    tileNum = blockLength / alignNum / UB_BLOCK_NUM;
    
  4. Compute tileLength and lastTileLength based on the computed tileNum.

    If the computation amount of each core can be evenly allocated to the available space on Unified Buffer, the scenario without tail blocks is used.

    1
    2
    3
    4
    5
    if ((blockLength / alignNum) % UB_BLOCK_NUM == 0) {
        // The computation amount of each core can be evenly allocated to the available UB space. There are main blocks only, and no tail block exists.
        tileLength = UB_BLOCK_NUM * alignNum;
        lastTileLength = 0;
    }
    

    Otherwise, the scenario with tail blocks is used. The length of the tail block is the length of data computed by a single core: tileNum × tileLength.

    1
    2
    3
    4
    5
    6
    7
    8
    9
    if (tileNum == 0) {
        // If the length of data to be computed by a single core is smaller than the available UB space, only tail block processing is required.
        tileLength = 0;
        lastTileLength = ((blockLength + alignNum - 1) / alignNum) * alignNum;
    } else {
        // Both main and tail blocks are available.
        tileLength = UB_BLOCK_NUM * alignNum;
        lastTileLength = blockLength - tileNum * tileLength;
    }
    

The tiling implementation code on the host is as follows:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
constexpr uint32_t BLOCK_SIZE = 32;
constexpr uint32_t BLOCK_DIM = 8;
constexpr uint32_t UB_BLOCK_NUM = 20;  // To facilitate verification, UB_BLOCK_NUM is used as the number of available blocks on UB. Therefore, the available UB space size is UB_BLOCK_NUM * BLOCK_SIZE.
...

uint32_t alignNum = BLOCK_SIZE / dataTypeSize;  // For ease of computation, alignNum is defined as the alignment number based on the data type, and dataTypeSize is defined as the number of bytes corresponding to the computing data type.
// totalLength indicates the total amount of data.
uint32_t totalLengthAligned = (totalLength % alignNum == 0)?
        totalLength : ((totalLength + alignNum - 1) / alignNum) * alignNum;
uint32_t blockLength, tileNum;
if (totalLengthAligned % BLOCK_DIM == 0) {
    blockLength = totalLengthAligned / BLOCK_DIM;
    tileNum = blockLength / alignNum / UB_BLOCK_NUM;

    if (tileNum == 0) {
        // If the length of data to be computed by a single core is smaller than the available UB space, only tail block processing is required.
        tileLength = 0;
        lastTileLength = ((blockLength + alignNum - 1) / alignNum) * alignNum;
    } else if ((blockLength / alignNum) % UB_BLOCK_NUM == 0) {
        // The computation amount of each core can be evenly allocated to the available UB space. There are main blocks only, and no tail block exists.
        tileLength = UB_BLOCK_NUM * alignNum;
        lastTileLength = 0;
    } else {
        // Both main and tail blocks are available.
        tileLength = UB_BLOCK_NUM * alignNum;
        lastTileLength = blockLength - tileNum * tileLength;
    }
    ...
}

After the input data with the shape of (1, 1904) is computed, the values of variables in the tiling structure are as follows:

1
2
3
4
5
6
struct AddCustomTilingData {
    uint32_t blockLength = 238; // Each core computes 238 pieces of half-type data, and eight cores compute a total of 1904 pieces of half-type data.
    uint32_t tileNum = 0; // The available UB space is enough, and only tail block processing is required.
    uint32_t tileLength = 0; // There is no main block, and the length of the main block is 0.
    uint32_t lastTileLength = 240; // 238 pieces of half-type data are not 32-byte aligned and are rounded up to 240 pieces of half-type data for movement.
};

Operator Class Implementation

Compared with multi-core tiling, when TPipe is used in the Init function to allocate memory for the input and output queues, the larger value of tileLength and lastTileLength is used as the allocated memory length. For example, when the length of data to be computed by a single core is smaller than the available UB space, only tail block processing is required. In this case, the length tileLength is 0, and the length lastTileLength is the length of the data block. Therefore, the larger value of the two is used to allocate memory.

1
2
uint32_t initBufferLength = AscendC::Std::max(this->tileLength, this->lastTileLength);
pipe.InitBuffer(inQueueX, 1, initBufferLength * sizeof(dataType));

The length of the tail block is lastTileLength, which is different from the length of the main data blocks. Therefore, the length parameter tileLength of the data blocks to be processed in this loop, that is, the length of to-be-processed data in the main blocks or tail block, is passed to the CopyIn, Compute, and CopyOut functions.

The implementation code of the Process function is as follows:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
__aicore__ inline void Process()
{
    // Calculate the data in main blocks. The corresponding data block length is tileLength.
    for (uint32_t i = 0; i < this->tileNum; i++) {
        CopyIn(i, this->tileLength);
        Compute(i, this->tileLength);
        CopyOut(i, this->tileLength);
    }
    // Calculate the data in the tail block. The corresponding data block length is lastTileLength.
    if (this->lastTileLength > 0) {
        CopyIn(this->tileNum, this->lastTileLength);
        Compute(this->tileNum, this->lastTileLength);
        CopyOut(this->tileNum, this->lastTileLength);
    }
}
The implementation code of the CopyIn function is as follows:
1
2
3
4
5
6
7
8
9
__aicore__ inline void CopyIn(int32_t progress, uint32_t tileLength)
{
    AscendC::LocalTensor<dataType> xLocal = inQueueX.AllocTensor<dataType>();
    AscendC::LocalTensor<dataType> yLocal = inQueueY.AllocTensor<dataType>();
    AscendC::DataCopy(xLocal, xGm[progress * this->tileLength], tileLength);
    AscendC::DataCopy(yLocal, yGm[progress * this->tileLength], tileLength);
    inQueueX.EnQue(xLocal);
    inQueueY.EnQue(yLocal);
}
The implementation code of the Compute function is as follows:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
__aicore__ inline void Compute(int32_t progress, uint32_t tileLength)
{
    AscendC::LocalTensor<dataType> xLocal = inQueueX.DeQue<dataType>();
    AscendC::LocalTensor<dataType> yLocal = inQueueY.DeQue<dataType>();
    AscendC::LocalTensor<dataType> zLocal = outQueueZ.AllocTensor<dataType>();
    AscendC::Add(zLocal, xLocal, yLocal, tileLength);
    outQueueZ.EnQue<dataType>(zLocal);
    inQueueX.FreeTensor(xLocal);
    inQueueY.FreeTensor(yLocal);
}
The implementation code of the CopyOut function is as follows:
1
2
3
4
5
6
__aicore__ inline void CopyOut(int32_t progress, uint32_t tileLength)
{
    AscendC::LocalTensor<dataType> zLocal = outQueueZ.DeQue<dataType>();
    AscendC::DataCopy(zGm[progress * this->tileLength], zLocal, tileLength);
    outQueueZ.FreeTensor(zLocal);
}