Multi-Core Tiling

The following figure shows the development process of an operator with tiling based on Ascend C.

Figure 1 Operator development process

Operator Analysis

In this sample, input data can be evenly allocated across cores and within a core. The tiling strategy is as follows: The total data length (TOTAL_LENGTH) is defined as 8 × 2048. The data is evenly allocated across eight cores, each processing a data block of length BLOCK_LENGTH = 2048. For example only (not indicating optimal performance), the data on a single core is further partitioned into 16 tiles, where the length of each tile (TILE_LENGTH) is 128. The following figure shows the data tiling.

Figure 2 Data tiling
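
The arithmetic behind this strategy can be checked at compile time. The following is a minimal sketch in plain C++ with no Ascend C headers. The constant names mirror those used in the text; CORE_NUM and TILE_NUM are illustrative names for the eight cores and 16 tiles described above.

```cpp
#include <cstdint>

constexpr uint32_t TOTAL_LENGTH = 8 * 2048;  // total number of elements
constexpr uint32_t CORE_NUM = 8;             // cores sharing the data
constexpr uint32_t TILE_NUM = 16;            // tiles per core

constexpr uint32_t BLOCK_LENGTH = TOTAL_LENGTH / CORE_NUM;  // elements per core
constexpr uint32_t TILE_LENGTH = BLOCK_LENGTH / TILE_NUM;   // elements per tile

// Even allocation only works when both divisions are exact.
static_assert(TOTAL_LENGTH % CORE_NUM == 0, "data must split evenly across cores");
static_assert(BLOCK_LENGTH % TILE_NUM == 0, "a block must split evenly into tiles");
static_assert(BLOCK_LENGTH == 2048, "each core processes 2048 elements");
static_assert(TILE_LENGTH == 128, "each tile holds 128 elements");
```

If either divisibility check fails for a different shape, the even-allocation assumption of this sample no longer holds and a tail-handling strategy is needed.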
Table 1 Design specifications of the Ascend C Add operator

OpType: Add

Operator Input and Output:

  name          shape        data type    format
  x (input)     (8, 2048)    half         ND
  y (input)     (8, 2048)    half         ND
  z (output)    (8, 2048)    half         ND

Kernel Function Name: add_custom

Main APIs:

  • DataCopy: data movement API
  • Add: vector basic arithmetic API
  • EnQue, DeQue, and others: queue management APIs

Operator Implementation File: add_custom.cpp

Tiling Implementation

In the preceding scenario, the operator input and output have static shapes. However, in real-world operator development, the shapes can dynamically change, leading to more flexible and complex scenarios. In the dynamic-shape scenario, the input shape is unknown. Some variables related to the input shape (such as the size of a block moved each time) need to be computed using tiling and then passed to the kernel. The kernel uses this parameter for subsequent computation.

The implementation method is as follows: First, analyze and design the tiling parameters and define the tiling structure. On the host, obtain the input and output shape information through the context, compute the tiling parameters based on the shape information, and write them to the tiling structure. Then pass the tiling information to the kernel function through its entry parameter. In the kernel function, parse the tiling structure to obtain the parameters and use them to implement the internal logic. For details, see Tiling Implementation on the Host. This section uses the tiling strategy mentioned above as an example to describe how to implement tiling.

Based on the tiling strategy in this section, the following parameters need to be defined for tiling:

  • blockLength: length of data computed by each core
  • tileNum: number of data blocks to be computed by each core
  • tileLength: length of each data block in each core

Use the C++ syntax to define the TilingData structure of the operator in a header file based on the determined tiling parameters. The header file is named {Operator name}_tiling.h. In this section, the operator name is add_custom, so the header file is add_custom_tiling.h. The code is as follows:

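#include <cstdint>

struct AddCustomTilingData {
    uint32_t blockLength;  // Length of data computed by each core.
    uint32_t tileNum;      // Number of tiles computed by each core.
    uint32_t tileLength;   // Length of each tile in a core.
};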
Create the add_custom_tiling.cpp file corresponding to the header file of the tiling structure. The tiling parameters are computed in this file. Since the data is allocated across eight cores and the data in each core is partitioned into 16 tiles, the tiling parameters are computed based on CORE_NUM = 8 and TILE_NUM = 16, and written to the tiling structure. A code example is as follows:
#include "add_custom_tiling.h"
// Total data length used in this sample. In practice, obtain it as required. This section describes tiling only.
constexpr uint32_t TOTAL_LENGTH = 8 * 2048;
constexpr int32_t CORE_NUM = 8;   // Number of used cores.
constexpr int32_t TILE_NUM = 16;  // Number of tiles in a core.
void GenerateTilingData(uint8_t* tilingBuf)
{
    AddCustomTilingData *tiling = reinterpret_cast<AddCustomTilingData *>(tilingBuf);
    uint32_t blockLength = TOTAL_LENGTH / CORE_NUM;  // Length of data on each core.
    uint32_t tileNum = TILE_NUM;
    uint32_t tileLength = blockLength / tileNum;     // Length of each tile on a core.

    tiling->blockLength = blockLength;
    tiling->tileNum = tileNum;
    tiling->tileLength = tileLength;
}
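
GenerateTilingData above derives its values from a fixed TOTAL_LENGTH. For the dynamic-shape scenario described earlier, a variant that receives the total length at run time could look like the following sketch. The names GenerateTilingDataFor and ReadTilingData are hypothetical and not part of Ascend C. The sketch assumes the data splits evenly and reports failure otherwise rather than handling a tail.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct AddCustomTilingData {
    uint32_t blockLength;
    uint32_t tileNum;
    uint32_t tileLength;
};

// Hypothetical dynamic-shape variant: totalLength, coreNum, and tileNum arrive at run time.
// Returns false when the data cannot be split evenly (tail handling is out of scope here).
inline bool GenerateTilingDataFor(uint8_t *tilingBuf, uint32_t totalLength,
                                  uint32_t coreNum, uint32_t tileNum)
{
    if (tilingBuf == nullptr || coreNum == 0 || tileNum == 0) {
        return false;
    }
    if (totalLength % coreNum != 0) {
        return false;
    }
    const uint32_t blockLength = totalLength / coreNum;
    if (blockLength % tileNum != 0) {
        return false;
    }
    const AddCustomTilingData tiling{blockLength, tileNum, blockLength / tileNum};
    // A byte copy avoids any alignment assumption about tilingBuf.
    std::memcpy(tilingBuf, &tiling, sizeof(tiling));
    return true;
}

// Read the structure back from the byte buffer, as a consumer of the buffer would.
inline AddCustomTilingData ReadTilingData(const uint8_t *tilingBuf)
{
    assert(tilingBuf != nullptr);
    AddCustomTilingData tiling{};
    std::memcpy(&tiling, tilingBuf, sizeof(tiling));
    return tiling;
}
```

Calling GenerateTilingDataFor(buf, 8 * 2048, 8, 16) reproduces the values of this sample: blockLength = 2048, tileNum = 16, and tileLength = 128.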

Finally, the calling program on the host calls the preceding tiling parameter computation function to compute the parameters, which are then passed to the kernel function.

extern void GenerateTilingData(uint8_t* tilingBuf);
constexpr int32_t CORE_NUM = 8;
    ...
    uint8_t *tiling = nullptr;
    size_t tilingSize = sizeof(AddCustomTilingData);
#ifdef ASCENDC_CPU_DEBUG
    tiling = (uint8_t *)AscendC::GmAlloc(tilingSize);  // CPU debug mode.
    ...
#else
    ...
    CHECK_ACL(aclrtMallocHost((void **)(&tiling), tilingSize));  // NPU mode.
    ...
#endif
    GenerateTilingData(tiling);  // Call the tiling parameter computation function.
    ...
#ifdef ASCENDC_CPU_DEBUG
    ...
    ICPU_RUN_KF(add_custom, CORE_NUM, x, y, z,
                *reinterpret_cast<AddCustomTilingData *>(tiling));  // Call the kernel function in CPU debug mode.
    ...
#else
    ...
    ACLRT_LAUNCH_KERNEL(add_custom)(CORE_NUM, stream, xDevice, yDevice, zDevice,  // Call the kernel function in NPU mode.
        reinterpret_cast<AddCustomTilingData *>(tiling));
    ...
#endif

Operator Class Implementation

The operator implementation on the kernel still complies with the implementation process of the vector operator kernel function. The following focuses on the differences in operator class implementation in this scenario.

  • Set the global memory address of the input and output global tensors.

    In this sample, data is allocated across cores for processing, and each core processes different data. Therefore, the addresses of the data to be processed by different cores in the global memory differ. In the initialization function Init, you need to obtain the memory offset addresses of the input and output data to be processed by a single core in the global memory and set the offset addresses to GlobalTensor.

    Take input x as an example. The total data length (TOTAL_LENGTH) is 8 × 2048, and the data is evenly allocated across eight cores, each processing a data block of length blockLength = 2048. GetBlockIdx returns the index of the current core, so x + blockLength * GetBlockIdx() is the start address in the global memory of the part of x that the current core processes. After the address is obtained, SetGlobalBuffer of the GlobalTensor class is called to set the start address and length of the global memory for the core. For details, see Figure 3. The code is as follows:

    xGm.SetGlobalBuffer((__gm__ half *)x + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
    
    Figure 3 Multi-core parallel processing
  • Allocate memory for the input and output queues through TPipe.

    Data processed on a single core can be tiled. In this sample, the data on a single core (2048 elements) is partitioned into 16 tiles, each with tileLength = 128. This partitioning is for reference only and does not imply optimal performance. Figure 4 shows the data tiling.

    Figure 4 Single-core data tiling

    Compared with basic vector operators, when TPipe is used to allocate memory for the input and output queues, the tile length tileLength on a single core is used as the allocated memory length. For example, the following code snippet allocates memory for the queue of input x. The pipe allocates a memory block of tileLength * sizeof(half) bytes for inQueueX. Each memory block can hold tileLength (128) elements of type half.

    pipe.InitBuffer(inQueueX, 1, this->tileLength * sizeof(half));
    
The initialization function code is as follows:
__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, AddCustomTilingData tiling)
{
    this->blockLength = tiling.blockLength;
    this->tileNum = tiling.tileNum;
    this->tileLength = tiling.tileLength;
    // Compute the address offset of each core.
    xGm.SetGlobalBuffer((__gm__ half *)x + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
    yGm.SetGlobalBuffer((__gm__ half *)y + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
    zGm.SetGlobalBuffer((__gm__ half *)z + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
    // Allocate memory to the queues through the pipe. The unit is bytes.
    pipe.InitBuffer(inQueueX, 1, this->tileLength * sizeof(half));
    pipe.InitBuffer(inQueueY, 1, this->tileLength * sizeof(half));
    pipe.InitBuffer(outQueueZ, 1, this->tileLength * sizeof(half));
}
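
The two quantities that Init computes, the per-core start offset and the size of each queue buffer, can be sketched in plain C++ as follows. CoreOffset and TileBufferBytes are hypothetical helper names used only for illustration. The blockIdx argument stands in for the value GetBlockIdx returns on a given core, and the element size of 2 bytes assumes half is a 16-bit floating-point type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Element offset at which the core with index blockIdx starts in global memory.
// blockIdx stands in for the value GetBlockIdx returns on that core.
inline uint32_t CoreOffset(uint32_t blockLength, uint32_t blockIdx, uint32_t coreNum)
{
    assert(blockIdx < coreNum);
    return blockLength * blockIdx;
}

// Bytes requested for one queue buffer that holds a single tile.
// elemSize is 2 for a 16-bit type such as half (assumption stated above).
inline std::size_t TileBufferBytes(uint32_t tileLength, std::size_t elemSize)
{
    assert(elemSize > 0);
    return static_cast<std::size_t>(tileLength) * elemSize;
}
```

With the values of this sample, core 0 starts at element 0, core 7 starts at element 7 × 2048 = 14336, and each queue buffer takes 128 × 2 = 256 bytes.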
Each core needs to move in, compute, and move out tileNum data blocks. Therefore, tileNum is used as the upper limit of the loop in the Process function.
__aicore__ inline void Process()
{
    int32_t loopCount = this->tileNum;
    // tiling strategy, pipeline parallel
    for (int32_t i = 0; i < loopCount; i++) {
        CopyIn(i);
        Compute(i);
        CopyOut(i);
    }
}

Accordingly, when each data block is moved in or out of each core, its memory offset address in the global memory needs to be located. Therefore, when the DataCopy API is used in the CopyIn and CopyOut functions, the address offset of each data block needs to be added. The Compute function is the same as that in Basic Vector Operators.
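
The combined addressing, a per-core offset set in Init plus a per-tile offset applied at each loop iteration, can be checked with a small plain C++ sketch. TileStart and CountVisits are hypothetical helpers, not Ascend C APIs. The sketch counts how often each element is touched when every core walks all of its tiles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Global element index where tile `progress` of core `blockIdx` starts.
// The first term is the per-core offset, the second the per-tile offset.
inline uint32_t TileStart(uint32_t blockIdx, uint32_t progress,
                          uint32_t blockLength, uint32_t tileLength)
{
    return blockIdx * blockLength + progress * tileLength;
}

// Count how often each element is touched when every core walks all its tiles.
inline std::vector<uint32_t> CountVisits(uint32_t coreNum, uint32_t tileNum, uint32_t tileLength)
{
    assert(coreNum > 0 && tileNum > 0 && tileLength > 0);
    const uint32_t blockLength = tileNum * tileLength;
    std::vector<uint32_t> visits(static_cast<std::size_t>(coreNum) * blockLength, 0);
    for (uint32_t blockIdx = 0; blockIdx < coreNum; ++blockIdx) {
        for (uint32_t progress = 0; progress < tileNum; ++progress) {
            const uint32_t start = TileStart(blockIdx, progress, blockLength, tileLength);
            for (uint32_t k = 0; k < tileLength; ++k) {
                ++visits[start + k];
            }
        }
    }
    return visits;
}
```

For eight cores, 16 tiles, and a tile length of 128, every one of the 16384 elements is visited exactly once, which is what an even, non-overlapping tiling requires.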

The implementation code of the CopyIn function is as follows:
__aicore__ inline void CopyIn(int32_t progress)
{
    ...
    // copy progress_th tile from global tensor to local tensor
    AscendC::DataCopy(xLocal, xGm[progress * this->tileLength], this->tileLength);
    AscendC::DataCopy(yLocal, yGm[progress * this->tileLength], this->tileLength);
    ...
}
The implementation code of the CopyOut function is as follows:
__aicore__ inline void CopyOut(int32_t progress)
{
    ...
    // copy progress_th tile from local tensor to global tensor
    AscendC::DataCopy(zGm[progress * this->tileLength], zLocal, this->tileLength);
    ...
}
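
To see the whole flow in one place, the following plain C++ sketch imitates the CopyIn, Compute, and CopyOut stages for every core and tile. It is a host-side illustration under stated assumptions, not the kernel implementation: float replaces half, std::vector replaces the global and local tensors, the cores run one after another instead of in parallel, and RunCore and AddTiled are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Plain C++ stand-in for the work of one core. No Ascend C API is involved.
inline void RunCore(uint32_t blockIdx, uint32_t tileNum, uint32_t tileLength,
                    const std::vector<float> &x, const std::vector<float> &y,
                    std::vector<float> &z)
{
    const uint32_t blockLength = tileNum * tileLength;
    const uint32_t base = blockIdx * blockLength;  // per-core offset, as set in Init
    std::vector<float> xLocal(tileLength), yLocal(tileLength), zLocal(tileLength);
    for (uint32_t progress = 0; progress < tileNum; ++progress) {
        const uint32_t start = base + progress * tileLength;  // per-tile offset
        for (uint32_t k = 0; k < tileLength; ++k) {  // CopyIn: global to local
            xLocal[k] = x[start + k];
            yLocal[k] = y[start + k];
        }
        for (uint32_t k = 0; k < tileLength; ++k) {  // Compute: element-wise add
            zLocal[k] = xLocal[k] + yLocal[k];
        }
        for (uint32_t k = 0; k < tileLength; ++k) {  // CopyOut: local to global
            z[start + k] = zLocal[k];
        }
    }
}

// Runs the cores sequentially. On the device they would run in parallel.
inline std::vector<float> AddTiled(const std::vector<float> &x, const std::vector<float> &y,
                                   uint32_t coreNum, uint32_t tileNum)
{
    assert(x.size() == y.size());
    assert(coreNum > 0 && tileNum > 0);
    const std::size_t tilesTotal = static_cast<std::size_t>(coreNum) * tileNum;
    assert(x.size() % tilesTotal == 0);  // even allocation only, no tail handling
    const uint32_t tileLength = static_cast<uint32_t>(x.size() / tilesTotal);
    std::vector<float> z(x.size(), 0.0f);
    for (uint32_t blockIdx = 0; blockIdx < coreNum; ++blockIdx) {
        RunCore(blockIdx, tileNum, tileLength, x, y, z);
    }
    return z;
}
```

Calling AddTiled with two vectors of 8 × 2048 elements, coreNum = 8, and tileNum = 16 reproduces the tiling of this sample, with each simulated core handling 2048 elements in 16 tiles of 128.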