Heterogeneous Parallelism Programming Model
Host-Device Heterogeneous Collaboration Mechanism
The Ascend C heterogeneous parallel programming model addresses the challenges of heterogeneous computing architectures, in particular the efficiency and scalability limits that traditional programming models run into when handling complex compute tasks.
The heterogeneous compute architecture consists of a host and a device (the Ascend AI Processor), which work together to complete compute tasks. The host handles runtime management, including storage management, device management, and stream management, so that tasks are scheduled efficiently and resources are allocated properly. The device executes kernel functions written in Ascend C syntax and compiled for the device, accelerating compute-intensive work such as matrix and vector operations on batch data.
As shown in the following figure, when a kernel is delivered to the AI Core (the compute core of the Ascend AI Processor) for execution, the runtime management module starts the corresponding task based on the number of cores and the task type set by the developer. The task is loaded from the host into the stream's run queue on the device, and the scheduling unit assigns each ready task to an idle AI Core. The data to be processed is split and processed on multiple compute cores at the same time (see SPMD Parallel Compute below), which yields higher performance.
The host and device have different memory spaces. The host cannot directly access the device memory, and vice versa. Therefore, the input data needs to be copied from the host to the device memory for computation on the device, and the output result needs to be copied back from the device memory to the host for further use.
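The copy-in, compute, copy-out flow described above can be sketched as a host-only simulation in standard C++. Nothing here is the Ascend runtime API: `DeviceBuffer`, `CopyHostToDevice`, `CopyDeviceToHost`, `LaunchAddOne`, and `RunOnDevice` are hypothetical names that only model the rule that the kernel touches device memory while the host touches host memory.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a device-memory allocation. Host code below never
// reads `storage` directly; it only goes through the copy helpers.
struct DeviceBuffer {
    std::vector<float> storage;
    explicit DeviceBuffer(std::size_t count) : storage(count, 0.0f) {}
};

// Models a host-to-device copy of the input data.
void CopyHostToDevice(DeviceBuffer& dst, const std::vector<float>& src) {
    if (src.size() != dst.storage.size()) {
        throw std::invalid_argument("size mismatch");
    }
    dst.storage = src;
}

// Models a device-to-host copy of the result.
std::vector<float> CopyDeviceToHost(const DeviceBuffer& src) {
    return src.storage;
}

// Models a kernel: it reads and writes device buffers only.
void LaunchAddOne(const DeviceBuffer& in, DeviceBuffer& out) {
    for (std::size_t i = 0; i < in.storage.size(); ++i) {
        out.storage[i] = in.storage[i] + 1.0f;
    }
}

// Host-side flow: allocate device buffers, copy in, compute, copy out.
std::vector<float> RunOnDevice(const std::vector<float>& hostInput) {
    DeviceBuffer devIn(hostInput.size());
    DeviceBuffer devOut(hostInput.size());
    CopyHostToDevice(devIn, hostInput);
    LaunchAddOne(devIn, devOut);
    return CopyDeviceToHost(devOut);
}
```

Calling `RunOnDevice({1.0f, 2.0f, 3.0f})` returns a host-side vector with each element incremented by one. In a real application the allocation, copy, and launch steps would go through the runtime interfaces instead of these helpers.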
SPMD Parallel Compute
Ascend C operator programming follows the Single-Program Multiple-Data (SPMD) model: the same code runs in multiple places, and each instance processes different data. SPMD is a common parallel computing method and an effective way to increase compute speed.
Specifically, the Ascend C programming model splits the data to be processed and runs it on multiple compute cores (similar to multiple processes) at the same time to obtain higher performance. All AI Cores share the same instruction code. The only difference between the instances running on each core is the value of block_idx, which each core uses to identify itself. A block is analogous to a process, and block_idx is analogous to the process ID that uniquely identifies it. The following figure shows the parallel compute process.
The following code snippet is from the implementation of the Ascend C Add operator. When the operator is called, all compute cores execute the same implementation code, and the input parameters of the entry-point function are the same for all cores. Each core obtains the address of its own data by adding an offset of GetBlockIdx() * BLOCK_LENGTH to the start address, where BLOCK_LENGTH is the length of the data processed by each core. This is how data tiling for multi-core parallel compute is implemented. The GetBlockIdx API in the code returns the block_idx of the current core.
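The idea of one program running as several instances that differ only in their block index can be sketched in standard C++. This is a host-side simulation under stated assumptions, not Ascend C: `AddBlock` and `SpmdAdd` are hypothetical names, and the loop over `blockIdx` stands in for the hardware starting all instances at the same time. Because each instance writes a disjoint slice, the order in which instances run does not affect the result.

```cpp
#include <cstddef>
#include <vector>

// The single "program": every block runs this same function and differs
// only in the blockIdx it receives.
void AddBlock(std::size_t blockIdx, std::size_t blockLength,
              const std::vector<int>& x, const std::vector<int>& y,
              std::vector<int>& z) {
    const std::size_t begin = blockIdx * blockLength;  // start of this block's slice
    for (std::size_t i = begin; i < begin + blockLength; ++i) {
        z[i] = x[i] + y[i];
    }
}

// Runs blockDim instances of AddBlock, one per simulated core.
// Assumes x and y have the same size and that size is a multiple of blockDim.
std::vector<int> SpmdAdd(const std::vector<int>& x, const std::vector<int>& y,
                         std::size_t blockDim) {
    std::vector<int> z(x.size(), 0);
    const std::size_t blockLength = x.size() / blockDim;
    // Iterate in reverse to underline that instance order is irrelevant.
    for (std::size_t n = blockDim; n > 0; --n) {
        AddBlock(n - 1, blockLength, x, y, z);
    }
    return z;
}
```

With eight elements and `blockDim` set to 4, each simulated core handles two consecutive elements, and the combined output equals the element-wise sum of the inputs.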
class KernelAdd {
public:
    __aicore__ inline KernelAdd() {}
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z)
    {
        // Set the data address based on block_idx of each core.
        xGm.SetGlobalBuffer((__gm__ half*)x + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
        yGm.SetGlobalBuffer((__gm__ half*)y + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
        zGm.SetGlobalBuffer((__gm__ half*)z + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
        // Initialize the queues. Buffer sizes are given in bytes.
        pipe.InitBuffer(inQueueX, BUFFER_NUM, TILE_LENGTH * sizeof(half));
        pipe.InitBuffer(inQueueY, BUFFER_NUM, TILE_LENGTH * sizeof(half));
        pipe.InitBuffer(outQueueZ, BUFFER_NUM, TILE_LENGTH * sizeof(half));
    }
    ...
};

// Implement the kernel function.
__global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    // Initialize the operator class. The operator class provides methods such as operator initialization and core processing.
    KernelAdd op;
    // Initialization function, which obtains the input and output addresses to be processed by the kernel function and initializes the memory.
    op.Init(x, y, z);
    // Core processing function, which carries out the operator's core logic, such as data movement and computation.
    op.Process();
}
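The offset arithmetic in Init() can be checked with a few lines of standard C++. The values of `TOTAL_LENGTH` and `USE_CORE_NUM` below are assumptions chosen for illustration, since the snippet does not show how `BLOCK_LENGTH` is defined, and `BlockOffset` and `BlockByteOffset` are hypothetical helpers. The sketch also makes explicit that the offset added to a `half*` pointer counts elements, not bytes.

```cpp
#include <cstdint>

// Assumed values for illustration; the snippet above does not define them.
constexpr int32_t TOTAL_LENGTH = 8 * 2048;  // total number of half elements
constexpr int32_t USE_CORE_NUM = 8;         // number of cores (blocks) launched
constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;  // elements per core

static_assert(TOTAL_LENGTH % USE_CORE_NUM == 0,
              "data must split evenly across cores");

// Element offset of the slice owned by the core with the given block index,
// mirroring the BLOCK_LENGTH * GetBlockIdx() term in Init().
constexpr int32_t BlockOffset(int32_t blockIdx) {
    return BLOCK_LENGTH * blockIdx;
}

// Byte offset of the same slice, given that a half element occupies 2 bytes.
constexpr int32_t BlockByteOffset(int32_t blockIdx) {
    return BlockOffset(blockIdx) * 2;
}
```

Under these assumed values, core 0 starts at element 0, core 3 at element 6144, and the last core's slice ends exactly at `TOTAL_LENGTH`, so the slices cover the whole buffer without overlap.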
