SPMD Model
Ascend C operator programming is SPMD (Single-Program Multiple-Data) programming, which is a common parallel computing method and an effective means to improve the computing speed.
Assume that data needs to be processed through three phases (T1, T2, and T3) from input to output. As shown in the following figure, in SPMD mode, the system starts a group of processes to concurrently process data. First, the data is divided into a plurality of data slices, and the data slices are then distributed to different processes for processing. Each process receives one or more data slices, and independently performs tasks of three phases on these slices.
Specifically, an application in the Ascend C programming model is to split data that needs to be processed and run on multiple compute cores (similar to multiple processes) at the same time, to obtain higher performance. Multiple AI Cores share the same instruction code. The only difference between running instances on each core is that block_idx is different. Each core identifies itself through different block_idx. Block is similar to the process described above. block_idx is the process ID that uniquely identifies a process. The following figure shows the parallel computing process.
The following code snippet is from the implementation code of the Ascend C Add operator. When the operator is called, all compute cores execute the same implementation code, and input parameters of the entry function of each core are the same. The address of the data processed on each core needs to be obtained by adding an offset of GetBlockIdx()*BLOCK_LENGTH (length of the data processed by each block) to the start address. In this way, data slicing of multi-core parallel computing is implemented.
class KernelAdd {
public:
__aicore__ inline KernelAdd() {}
__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
// get start index for current core, core parallel
xGm.SetGlobalBuffer((__gm__ half*)x + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
yGm.SetGlobalBuffer((__gm__ half*)y + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
zGm.SetGlobalBuffer((__gm__ half*)z + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
// pipe alloc memory to queue, the unit is Bytes
pipe.InitBuffer(inQueueX, BUFFER_NUM, TILE_LENGTH * sizeof(half));
pipe.InitBuffer(inQueueY, BUFFER_NUM, TILE_LENGTH * sizeof(half));
pipe.InitBuffer(outQueueZ, BUFFER_NUM, TILE_LENGTH * sizeof(half));
}
...
}
// Implement the kernel function.
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
// Initialize the operator class. The operator class provides methods such as operator initialization and core processing.
KernelAdd op;
// Initialization function, which is used to obtain the input and output addresses to be processed by the kernel function and initialize the memory.
op.Init(x, y, z);
// Core processing function, which is used to complete core logic such as data transfer and computation of operators.
op.Process();
}