DoubleBuffer Scenario

Operators move data in and out many times, so the DoubleBuffer mechanism is used to run multiple pipelines in parallel and make full use of hardware resources. DoubleBuffer tiles input data into two equally sized blocks so that data copy-in, computation, and data copy-out can execute in parallel on the AI Core. The following uses the scenario of "different amounts of data in a core and across cores" to describe how DoubleBuffer is implemented in operators. For the complete sample code, see Add operator sample with DoubleBuffer.

Figure 1 DoubleBuffer data tiling

Tiling Implementation

After DoubleBuffer is enabled, each data block is tiled into two equally sized sub-blocks, so the data must be evenly divisible. To simplify processing, the available space on the Unified Buffer is tiled into n data blocks at a granularity of 32 bytes; if n is odd, it is reduced by 1. In this way, the same code works whether DoubleBuffer is enabled or disabled. The procedure is as follows:

  1. Check whether the total data length totalLength is 32-byte aligned. If it is not, compute totalLengthAligned by rounding totalLength up to the next 32-byte boundary.
    constexpr uint32_t BLOCK_SIZE = 32;
    // alignNum is the number of elements of the data type that fit in one 32-byte block.
    uint32_t alignNum = BLOCK_SIZE / dataTypeSize;
    // totalLength indicates the total amount of data.
    uint32_t totalLengthAligned = (totalLength % alignNum == 0)?
            totalLength : ((totalLength + alignNum - 1) / alignNum) * alignNum;
    
  2. Compute the data length blockLength of each core based on totalLengthAligned. For details about the core allocation strategy, see Tail Core Tiling.
  3. Compute other tiling parameters.
    The available space on the Unified Buffer is tiled at a granularity of 32 bytes to obtain the number of data blocks, UB_BLOCK_NUM. The maximum number of usable data blocks, MAX_AVAILABLE_UB_BLOCK_NUM, is then derived depending on whether DoubleBuffer is enabled. Finally, blockLength is tiled in units of MAX_AVAILABLE_UB_BLOCK_NUM blocks. For ease of demonstration, the following code hard-codes UB_BLOCK_NUM, the number of 32-byte blocks in the available Unified Buffer space.
    constexpr uint32_t BUFFER_NUM = 2;
    constexpr uint32_t UB_BLOCK_NUM = 21;  // Maximum number of available blocks on UB.
    constexpr uint32_t MAX_AVAILABLE_UB_BLOCK_NUM = UB_BLOCK_NUM / BUFFER_NUM * BUFFER_NUM;
    
    tileNum = blockLength / (alignNum * MAX_AVAILABLE_UB_BLOCK_NUM);
    if (tileNum == 0) {
        // If the length of data to be computed by a single core is smaller than the available UB space, only tail block processing is required.
        tileLength = 0;
        lastTileLength = (blockLength + alignNum - 1) / alignNum * alignNum;
    } else if ((blockLength / alignNum) % MAX_AVAILABLE_UB_BLOCK_NUM == 0) {
        // The data of each core divides evenly into full-size tiles. There are main blocks only, and no tail block exists.
        tileLength = MAX_AVAILABLE_UB_BLOCK_NUM * alignNum;
        lastTileLength = 0;
    } else {
        // Both main blocks and a tail block exist.
        tileLength = MAX_AVAILABLE_UB_BLOCK_NUM * alignNum;
        lastTileLength = blockLength - tileNum * tileLength;
    }
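
The steps above can be tried out on the host with a small standalone sketch. It is an illustration rather than part of the sample: the TileParams struct and the AlignUp and ComputeTileParams helpers are hypothetical names introduced here, and the constants simply repeat the demonstration values used above (UB_BLOCK_NUM = 21). All lengths are in elements.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t BLOCK_SIZE = 32;
constexpr uint32_t BUFFER_NUM = 2;
constexpr uint32_t UB_BLOCK_NUM = 21;  // demonstration value, as in the text
constexpr uint32_t MAX_AVAILABLE_UB_BLOCK_NUM = UB_BLOCK_NUM / BUFFER_NUM * BUFFER_NUM;  // 20

// Hypothetical container for the per-core tiling result.
struct TileParams {
    uint32_t tileNum;
    uint32_t tileLength;
    uint32_t lastTileLength;
};

// Round length up to a multiple of alignNum elements (step 1).
uint32_t AlignUp(uint32_t length, uint32_t alignNum)
{
    return (length + alignNum - 1) / alignNum * alignNum;
}

// Hypothetical helper that mirrors the three branches of step 3.
TileParams ComputeTileParams(uint32_t blockLength, uint32_t dataTypeSize)
{
    assert(dataTypeSize != 0 && BLOCK_SIZE % dataTypeSize == 0);
    const uint32_t alignNum = BLOCK_SIZE / dataTypeSize;
    const uint32_t fullTile = MAX_AVAILABLE_UB_BLOCK_NUM * alignNum;
    const uint32_t tileNum = blockLength / fullTile;
    if (tileNum == 0) {
        // Less than one full tile: tail block only.
        return {0, 0, AlignUp(blockLength, alignNum)};
    }
    if ((blockLength / alignNum) % MAX_AVAILABLE_UB_BLOCK_NUM == 0) {
        // Full tiles only.
        return {tileNum, fullTile, 0};
    }
    // Full tiles plus a tail block.
    return {tileNum, fullTile, blockLength - tileNum * fullTile};
}
```

For a 2-byte data type, alignNum is 16 and a full tile holds 320 elements. A blockLength of 100 then yields a tail block of 112 elements only, 640 yields two main blocks with no tail, and 1008 yields three main blocks plus a 48-element tail.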
    

Operator Class Implementation

If DoubleBuffer is disabled, you only need to process the start address of the last block on each core. If DoubleBuffer is enabled, the length of the data block to be processed is half of the original length. Therefore, you need to process the start addresses of the last two data blocks.

To enable DoubleBuffer, set the num parameter of the InitBuffer API to 2, which is BUFFER_NUM. For details, see the prototype of the InitBuffer API.

this->initBufferLength = AscendC::Std::max(this->tileLength, this->lastTileLength);
pipe.InitBuffer(inQueueX, BUFFER_NUM, this->initBufferLength * sizeof(dataType));
pipe.InitBuffer(inQueueY, BUFFER_NUM, this->initBufferLength * sizeof(dataType));
pipe.InitBuffer(outQueueZ, BUFFER_NUM, this->initBufferLength * sizeof(dataType));

In addition, in the DoubleBuffer scenario, the length of each main data block in the core must be divided by the number of buffers (BUFFER_NUM = 2).

this->tileLength = tiling.tileLength / BUFFER_NUM;

The tail block is not guaranteed to split evenly into two sub-blocks, so it is not tiled further.

this->lastTileLength = tiling.lastTileLength;
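
As a host-side illustration of how these lengths relate, the following sketch derives the per-iteration length, the tail length, and the per-buffer allocation from the tiling values. It assumes plain standard C++ in place of the kernel environment (std::max instead of AscendC::Std::max); the TilingLengths and KernelLengths structs and the DeriveKernelLengths helper are hypothetical names used only here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr uint32_t BUFFER_NUM = 2;

// Hypothetical stand-in for the tiling fields used in this step (lengths in elements).
struct TilingLengths {
    uint32_t tileLength;      // main-block length computed on the host
    uint32_t lastTileLength;  // tail-block length computed on the host
};

// Hypothetical result holder for the values the kernel keeps.
struct KernelLengths {
    uint32_t tileLength;        // length processed per loop iteration
    uint32_t lastTileLength;    // tail block, left unsplit
    uint32_t initBufferLength;  // elements allocated per queue buffer
};

KernelLengths DeriveKernelLengths(const TilingLengths &tiling)
{
    // The host keeps the main block an even number of 32-byte blocks, so it halves cleanly.
    assert(tiling.tileLength % BUFFER_NUM == 0);
    KernelLengths k;
    k.tileLength = tiling.tileLength / BUFFER_NUM;  // main block split across two buffers
    k.lastTileLength = tiling.lastTileLength;       // tail block is not split
    // Each buffer must hold whichever of the two is larger.
    k.initBufferLength = std::max(k.tileLength, k.lastTileLength);
    return k;
}
```

With a main block of 320 elements and a tail of 304, the halved main block is 160 elements but each buffer still needs 304 elements, which is why the maximum of the two lengths is used for the allocation.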

The implementation code of the Init function is as follows:

__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, AddCustomTilingData tiling)
{
    if (tiling.isEvenCore) {
        this->blockLength = tiling.blockLength;
        this->tileNum = tiling.tileNum;
        this->tileLength = tiling.tileLength / BUFFER_NUM;
        this->lastTileLength = tiling.lastTileLength;

        xGm.SetGlobalBuffer((__gm__ dataType *)x + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
        yGm.SetGlobalBuffer((__gm__ dataType *)y + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
        zGm.SetGlobalBuffer((__gm__ dataType *)z + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
    } else {
        if (AscendC::GetBlockIdx() < tiling.formerNum) {
            this->tileNum = tiling.formerTileNum;
            this->tileLength = tiling.formerTileLength / BUFFER_NUM;
            this->lastTileLength = tiling.formerLastTileLength;

            xGm.SetGlobalBuffer((__gm__ dataType *)x + tiling.formerLength * AscendC::GetBlockIdx(), tiling.formerLength);
            yGm.SetGlobalBuffer((__gm__ dataType *)y + tiling.formerLength * AscendC::GetBlockIdx(), tiling.formerLength);
            zGm.SetGlobalBuffer((__gm__ dataType *)z + tiling.formerLength * AscendC::GetBlockIdx(), tiling.formerLength);
        } else {
            this->tileNum = tiling.tailTileNum;
            this->tileLength = tiling.tailTileLength / BUFFER_NUM;
            this->lastTileLength = tiling.tailLastTileLength;

            xGm.SetGlobalBuffer((__gm__ dataType *)x + tiling.formerLength * tiling.formerNum +
                tiling.tailLength * (AscendC::GetBlockIdx() - tiling.formerNum), tiling.tailLength);
            yGm.SetGlobalBuffer((__gm__ dataType *)y + tiling.formerLength * tiling.formerNum +
                tiling.tailLength * (AscendC::GetBlockIdx() - tiling.formerNum), tiling.tailLength);
            zGm.SetGlobalBuffer((__gm__ dataType *)z + tiling.formerLength * tiling.formerNum +
                tiling.tailLength * (AscendC::GetBlockIdx() - tiling.formerNum), tiling.tailLength);
        }
    }

    uint32_t initBufferLength = AscendC::Std::max(this->tileLength, this->lastTileLength);
    pipe.InitBuffer(inQueueX, BUFFER_NUM, initBufferLength * sizeof(dataType));
    pipe.InitBuffer(inQueueY, BUFFER_NUM, initBufferLength * sizeof(dataType));
    pipe.InitBuffer(outQueueZ, BUFFER_NUM, initBufferLength * sizeof(dataType));
}

After DoubleBuffer is enabled, the number of main data blocks is doubled, so the loop count in the Process function is tileNum multiplied by BUFFER_NUM. The tail block is computed separately, without DoubleBuffer. The subsequent processing of the main and tail blocks in the CopyIn, Compute, and CopyOut functions is the same as in the tail block tiling scenario.

__aicore__ inline void Process()
{
    // DoubleBuffer computation is performed on main blocks. Therefore, loopCount needs to be multiplied by 2.
    uint32_t loopCount = this->tileNum * BUFFER_NUM;
    for (uint32_t i = 0; i < loopCount; i++) {
        CopyIn(i, this->tileLength);
        Compute(i, this->tileLength);
        CopyOut(i, this->tileLength);
    }
    // Tail blocks are computed without DoubleBuffer.
    if (this->lastTileLength > 0U) {
        CopyIn(loopCount, this->lastTileLength);
        Compute(loopCount, this->lastTileLength);
        CopyOut(loopCount, this->lastTileLength);
    }
}
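
To check that this loop structure covers all data of a core exactly once, the iteration pattern can be simulated on the host. The sketch below is an assumption-laden model rather than the sample's code: it assumes that CopyIn and CopyOut address global memory at offset progress * tileLength, which the text above does not show, and the Segment struct and the SimulateProcess and CoveredLength helpers are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t BUFFER_NUM = 2;

// One simulated CopyIn/Compute/CopyOut pass over [offset, offset + length).
struct Segment {
    uint32_t offset;
    uint32_t length;
};

// tileNum and lastTileLength come from the host tiling;
// tileLength is the already halved per-iteration length.
std::vector<Segment> SimulateProcess(uint32_t tileNum, uint32_t tileLength, uint32_t lastTileLength)
{
    std::vector<Segment> segments;
    const uint32_t loopCount = tileNum * BUFFER_NUM;
    for (uint32_t i = 0; i < loopCount; i++) {
        // Assumed addressing rule: offset = progress * tileLength.
        segments.push_back({i * tileLength, tileLength});
    }
    if (lastTileLength > 0U) {
        // The tail block starts right after the last main sub-block.
        segments.push_back({loopCount * tileLength, lastTileLength});
    }
    return segments;
}

// Returns the total length covered, asserting that the segments are contiguous.
uint32_t CoveredLength(const std::vector<Segment> &segments)
{
    uint32_t next = 0;
    for (const Segment &s : segments) {
        assert(s.offset == next);
        next += s.length;
    }
    return next;
}
```

For tileNum = 3, a halved tileLength of 160, and lastTileLength = 48, the simulation produces six main passes followed by one tail pass starting at offset 960, covering 1008 elements in total.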