HCCL Instructions
Ascend C provides a group of high-level HCCL communication APIs, enabling operator kernel developers to flexibly manage the execution order of computation and communication tasks in the AI Core for MC2 operators. Before using high-level HCCL communication APIs for operator development, learn the necessary background knowledge by referring to Basics.
HCCL is a collective communication task client, providing the collective communication primitive APIs (Prepare APIs) aligned with C++ APIs. For details, see HCCL API Reference. Currently, the AllReduce,AllGather,ReduceScatter, and AlltoAll APIs are supported. All APIs described in this section run on the AI Core and do not execute communication tasks. You can call the Prepare APIs to send the communication task information of the corresponding type to the AI CPU server and call the Commit API to instruct the AI CPU server to execute the corresponding communication task at a proper time.
The proper time depends on whether it is a task of communication before computing or a task of computing before communication. The two scenarios are described as follows:
- Task of communication before computing: A typical example is the task of AllGather communication + Matmul computing. In this scenario, after calling the AllGather API to deliver a communication task, you can call the Commit API to instruct the server to execute the task corresponding to the handleId returned by the AllGather API, and call the Wait blocking API to wait till the task is complete as notified by the server. After that, the computing task is executed.
- Task of computing before communication: A typical example is the task of Matmul computing + AllReduce communication. In this scenario, you can call the AllReduce API to instruct the server to deliver a communication task and then call the Matmul computing API to perform computing. In this way, the assembly, delivery, and execution processes of the AllReduce task can be overlapped by the Matmul computing pipeline. After the computing task is complete, you can call the Commit API to instruct the server to execute the AllReduce task. You do not need to call the Wait API to wait until the communication task is complete.
If there is no more communication task, the Finalize API is called to instruct the server to exit after the current task is complete. The client detects and waits until the final communication task is complete. The following figure shows the mechanism for the AI Core to deliver HCCL communication tasks.
For the
Perform the following steps to deliver a communication task by using the AI Core:
- Create an HCCL object and call the InitV2 initialization API.
1 2 3 4 5 6 7
// Recommended method of Init API calling with the input initTiling address GET_TILING_DATA_WITH_STRUCT(AllGatherCustomTilingData, tilingData, tilingGM); // AllGatherCustomTilingData is a structure defined in the operator header file. Hccl hccl; GM_ADDR contextGM = GetHcclContext<0>(); // Obtain the HCCL context on the kernel of the Ascend C custom operator. hccl.InitV2(contextGM, &tilingData);
To call InitV2, you must use the standard C++ syntax to define the development mode of the TilingData structure. For details, see Using the Standard C++ Syntax to Define the Tiling Structure. In the preceding sample code, tilingGM is the GM address of the operator TilingData passed from host and used as the input parameter of the kernel function. TilingData is obtained through GET_TILING_DATA_WITH_STRUCT. When calling the InitV2 initialization API, the communication context needs to be passed. The communication context can be obtained using the GetHcclContext API provided by the framework.
- Set the Tiling address of the corresponding communication algorithm.Use the SetCcTilingV2 API to set the tiling address of the corresponding communication algorithm. After Commit is called, the address is sent to the server for parsing. The SetCcTilingV2 API must be used together with the InitV2 API. The following is an example:
1 2 3 4 5 6 7 8 9 10
// Method of Init API calling with the input initTiling address GET_TILING_DATA_WITH_STRUCT(AllGatherCustomTilingData, tilingData, tilingGM); Hccl hccl; GM_ADDR contextGM = GetHcclContext<0>(); // Obtain the HCCL context on the kernel of the Ascend C custom operator. hccl.InitV2(contextGM, &tilingData); if (SetCcTilingV2(offsetof(AllGatherCustomTilingData, mc2CcTiling)) != HCCL_SUCCESS) { return; }
- Use the corresponding Prepare API to asynchronously deliver the corresponding communication task and obtain the handleId of the task. After receiving the handleId, the server starts to deploy and deliver the communication task. The following is an example:
1 2 3 4 5 6 7 8
auto handleId = hccl.ReduceScatter<false>(aGM, cGM, recvCount, AscendC::HCCL_DATA_TYPE_FP16, HCCL_REDUCE_SUM, strideCount, 1); // For the Prepare API, add abnormal value verification and PRINTF statements during debugging. // if (handleId == INVALID_HANDLE_ID) { // PRINTF("[ERROR] call ReduceScatter failed, handleId is -1."); // return; // }
In the example, the Prepare API is ReduceScatter. For details about other APIs, see the following sections. The AscendC::HCCL_DATA_TYPE_FP16 parameter indicates the data type of the HCCL task. The data structure is HcclDataType. For details about the parameter, see Table 1. HCCL_REDUCE_SUM is a reduction operation. For the reduction operation types supported by AllReduce and ReduceScatter, see Table 2.
Table 1 HcclDataType parameter Data Type
Description
HcclDataType
Data type of an HCCL task.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
enum HcclDataType { HCCL_DATA_TYPE_INT8 = 0, /* int8 */ HCCL_DATA_TYPE_INT16 = 1, /* int16 */ HCCL_DATA_TYPE_INT32 = 2, /* int32 */ HCCL_DATA_TYPE_FP16 = 3, /* half or float16 */ HCCL_DATA_TYPE_FP32 = 4, /* float */ HCCL_DATA_TYPE_INT64 = 5, /* int64 */ HCCL_DATA_TYPE_UINT64 = 6, /* uint64 */ HCCL_DATA_TYPE_UINT8 = 7, /* uint8 */ HCCL_DATA_TYPE_UINT16 = 8, /* uint16 */ HCCL_DATA_TYPE_UINT32 = 9, /* uint32 */ HCCL_DATA_TYPE_FP64 = 10, /* float64 */ HCCL_DATA_TYPE_BFP16 = 11, /* bfloat16 */ HCCL_DATA_TYPE_RESERVED /* reserved */ }
- Call the Commit API to instruct the server to execute the communication task corresponding to handleId.
1 2
// Call the Commit API to instruct the server to execute the communication task at a proper time. hccl.Commit(handleId);
- Call the Wait blocking API and wait till the server completes the corresponding communication task.
1 2 3 4 5 6 7 8 9
auto ret = hccl.Wait(handleId); // For the Wait and Query APIs, add abnormal value verification and PRINTF statements during debugging. // if (ret == HCCL_FAILED) { // PRINTF("[ERROR] call Wait for handleId[%d] failed.", handleId); // return; // } // Call the inter-core synchronization API to prevent some cores from exiting quickly, which may trigger HCCL destruction and affect the slow cores. // You can call the APIs in , , and based on the service scenario to ensure that all cores complete tasks before exit.
- Call the Finalize API to instruct the server that there is no subsequent communication task. After the execution is complete, the client exits. The client detects and waits until the last communication task is complete.
1hccl.Finalize();
Note: If the template parameter of the HCCL object does not specify the core for delivering communication tasks, the Prepare API can run only on the AIC or AIV. Before calling the APIs in steps 2 to 5, you must specify that the API code runs on the AIC or AIV core. The implementation code is as follows:
1 2 3 4 5 |
// Use the built-in constant g_coreType to determine whether the AIC or AIV core is used. if (g_coreType == AIV) { // if (g_coreType == AIC) { Call the HCCL API. } |
The repeat parameter in each Prepare API can be flexibly used based on the understanding of single communication task delivery. One Prepare API call corresponds to one handleId. The repeat parameter in each Prepare API indicates the number of communication tasks for the Prepare call. The value must be the same as the number of times to call the Commit and Wait APIs for the handleId. In Figure 2 ReduceScatter communication example, four devices are used, source data on each device is evenly divided into four parts based on rank ID, and each part is further tiled into three sub-parts. The data count of each sub-part is TileLen. Only one group of tiled data (for example, 0-0, 1-0, 2-0, and 3-0 as shown in the figure) is communicated during each ReduceScatter communication. Therefore, three ReduceScatter operations are required to complete the communication of all data.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 |
extern "C" __global__ __aicore__ void reduce_scatter_custom(GM_ADDR xGM, GM_ADDR yGM, GM_ADDR workspaceGM, GM_ADDR tilingGM) { auto sendBuf = xGM; // xGM is the GM address of ReduceScatter input. auto recvBuf = yGM; // yGM is the GM address of ReduceScatter output. constexpr size_t rankSize = 4U; // Four devices constexpr size_t tileCnt = 3U; // Data on each device is evenly divided into rankSize parts, and each part is split into three sub-parts. constexpr size_t tileLen = 100U;// Data count of each sub-part uint64_t strideCount = tileLen*tileCnt; // Offset between the start addresses of adjacent data blocks in sendBuf REGISTER_TILING_DEFAULT(ReduceScatterCustomTilingData); // ReduceScatterCustomTilingData is a structure defined in the operator header file. GET_TILING_DATA_WITH_STRUCT(ReduceScatterCustomTilingData, tilingData, tilingGM); Hccl hccl; GM_ADDR contextGM = AscendC::GetHcclContext<0>(); // Obtain the HCCL context on the kernel of the Ascend C custom operator. if (AscendC::g_coreType == AIV) { // Specify AIV Cores for communication. hccl.InitV2(contextGM, &tilingData); auto ret = hccl.SetCcTilingV2(offsetof(ReduceScatterCustomTilingData, reduceScatterCcTiling)); if (ret != HCCL_SUCCESS) { return; } // Three handleIds are generated in the for loop. The Commit and Wait APIs are called only once (repeat = 1) for each handleId. for (int i = 0; i < tileCnt; ++i) { auto handleId = hccl.ReduceScatter(sendBuf, recvBuf, tileLen, HcclDataType::HCCL_DATA_TYPE_FP32, HcclReduceOp::HCCL_REDUCE_SUM, strideCount, 1); // For parameter details, see the ReduceScatter API description. hccl.Commit(handleId); auto ret = hccl.Wait(handleId); // Execute other computation logic. // Update the receiving and sending addresses of ReduceScatter. sendBuf += tileLen * sizeOf(float32); recvBuf += tileLen * sizeOf(float32); } AscendC::SyncAll<true>(); // All AIV Cores are synchronized to prevent too fast execution on core 0. Calling the hccl.Finalize() API prematurely can cause suspension of other cores during the Wait operation. hccl.Finalize(); } } |
The source addresses (SendBuf) of the three pieces of data on each card are contiguous, and the memory on each card used by the destination address (recvBuf) to store the three pieces of communication data is also contiguous. Therefore, the preceding code can be optimized, namely, the repeat parameter in the ReduceScatter API can be set to 3 so that three communication tasks can be delivered after the ReduceScatter API is called once. In this case, there is only one handleId task, but the Commit and Wait APIs need to be called for three times. The code snippet is as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
extern "C" __global__ __aicore__ void reduce_scatter_custom(GM_ADDR xGM, GM_ADDR yGM, GM_ADDR workspaceGM, GM_ADDR tilingGM) { auto sendBuf = xGM; // xGM is the GM address of ReduceScatter input. auto recvBuf = yGM; // yGM is the GM address of ReduceScatter output. constexpr size_t rankSize = 4U; // Four devices constexpr size_t tileCnt = 3U; // Data on each device is evenly divided into rankSize parts, and each part is split into three sub-parts. constexpr size_t tileLen = 100U;// Data count of each sub-part uint64_t strideCount = tileLen*tileCnt; // Offset between the start addresses of adjacent data blocks in sendBuf REGISTER_TILING_DEFAULT(ReduceScatterCustomTilingData); // ReduceScatterCustomTilingData is a structure defined in the operator header file. GET_TILING_DATA_WITH_STRUCT(ReduceScatterCustomTilingData, tilingData, tilingGM); Hccl hccl; GM_ADDR contextGM = AscendC::GetHcclContext<0>(); // Obtain the HCCL context on the kernel of the Ascend C custom operator. if (AscendC::g_coreType == AIV) { // Specify AIV Cores for communication. hccl.InitV2(contextGM, &tilingData); auto ret = hccl.SetCcTilingV2(offsetof(ReduceScatterCustomTilingData, reduceScatterCcTiling)); if (ret != HCCL_SUCCESS) { return; } auto handleId = hccl.ReduceScatter(sendBuf, recvBuf, tileLen, HcclDataType::HCCL_DATA_TYPE_FP32, HcclReduceOp::HCCL_REDUCE_SUM, strideCount, tileCnt); // For parameter details, see the ReduceScatter API description. for (int i = 0; i < tileCnt; ++i) { hccl.Commit(handleId); auto ret = hccl.Wait(handleId); // Execute other computation logic. } AscendC::SyncAll<true>(); // All AIV Cores are synchronized to prevent too fast execution on core 0. Calling the hccl.Finalize() API prematurely can cause suspension of other cores during the Wait operation. hccl.Finalize(); } } |
|
Data Type |
Description |
|---|---|
|
MC2_BUFFER_LOCATION |
Reserved parameter, which is the buffer for storing the intermediate results of computing and communication. This field can be set in the Tiling process. |
Note: When an operator with high-level APIs of HCCL is debugged, the build option -DASCENDC_DEBUG can be added in the operator build project to enable exception interception. For details, see assert.
