AllReduce
Applicability
|
Product |
Supported |
|---|---|
|
|
√ |
|
|
√ |
|
|
x |
|
|
x |
|
|
x |
|
|
x |
Function
This is the task delivery API of AllReduce, a collective communication operator. It returns handleId of the task to users. The AllReduce function is as follows: Perform the reduce operation on tensors with the same name on all nodes in a communicator, and then send the result to the output buffers of all nodes.

Prototype
1 2 |
template <bool commit = false> __aicore__ inline HcclHandle AllReduce(GM_ADDR sendBuf, GM_ADDR recvBuf, uint64_t count, HcclDataType dataType, HcclReduceOp op, uint8_t repeat = 1) |
Parameters
|
Parameter |
Input/Output |
Description |
|---|---|---|
|
commit |
Input |
Bool. Values:
|
|
Parameter |
Input/Output |
Description |
|---|---|---|
|
sendBuf |
Input |
Address of the source data buffer. |
|
recvBuf |
Output |
Address of the destination data buffer to receive collective communication result. |
|
count |
Input |
Number of data elements involved in the AllReduce operation. For example, if only one int32 data record is involved, set count to 1. |
|
dataType |
Input |
Data type of the AllReduce operation, which can be float, half (float16), int8_t, int16_t, bfloat16_t and int32_t. Specifically, the options can be HCCL_DATA_TYPE_FP32, HCCL_DATA_TYPE_FP16, HCCL_DATA_TYPE_INT8, HCCL_DATA_TYPE_INT16, HCCL_DATA_TYPE_INT32, and HCCL_DATA_TYPE_BFP16. For details about the HcclDataType data type, see Table 1. |
|
op |
Input |
Reduce operation type, which can be HCCL_REDUCE_SUM, HCCL_REDUCE_MAX, or HCCL_REDUCE_MIN. Currently, sum, max, and min operations are supported. For details about the HcclReduceOp data type, see Table 2. |
|
repeat |
Input |
Number of AllReduce communication tasks delivered at a time. The value of repeat is greater than or equal to 1, and the default value is 1. When repeat is greater than 1, the server automatically computes the sendBuf and recvBuf addresses of each AllReduce task. The calculation formulas are as follows: sendBuf[i] = sendBuf + count x sizeof(datatype) x i, i∈[0, repeat) recvBuf[i] = recvBuf + count x sizeof(datatype) x i, i∈[0, repeat) Note: When the value of repeat is greater than 1, the count parameter must be used together to plan the communication data address. |
Returns
The task ID handleId is returned. The value of handleId is greater than or equal to 0. If the API fails to be called, the value -1 is returned.
Restrictions
- Before calling this API, ensure that the InitV2 and SetCcTilingV2 APIs have been called.
- If the core for delivering the communication task is not specified in the config template parameters of the HCCL object, this API can be called only on the AIC or AIV core. If the core for delivering the communication task is specified in the config template parameters of the HCCL object, this API can be called on both the AIC and AIV cores. The API delivers the communication task only on the AIC or AIV core based on the specified core type.
- For the
Atlas A2 training products /Atlas A2 inference products , the total number of times that Prepare APIs are called in a communicator cannot exceed 63. - For the
Atlas A3 training products /Atlas A3 inference products , the total number of times that all Prepare and InterHcclGroupSync APIs are called in a communicator cannot exceed 63.
Example
- Non-multi-round tiling scenario
As shown in the following figure, each of the four cards has 300 float16 data records (count = 300). Each card obtains data from the xGM memory. After the reduce sum operation is performed on the data of each card, the result is output to the yGM of each card.
Figure 2 AllReduce communication between four cards in non-multi-round tiling scenario
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
extern "C" __global__ __aicore__ void all_reduce_custom(GM_ADDR xGM, GM_ADDR yGM, GM_ADDR workspaceGM, GM_ADDR tilingGM) { auto sendBuf = xGM; // xGM is the input GM address of AllReduce. auto recvBuf = yGM; // yGM is the output GM address of AllReduce. uint64_t sendCount = 300; // Each card has 300 float16 data records. HcclReduceOp reduceOp = HcclReduceOp::HCCL_REDUCE_SUM; REGISTER_TILING_DEFAULT(AllReduceCustomTilingData); // AllReduceCustomTilingData is a structure defined in the operator header file. GET_TILING_DATA_WITH_STRUCT(AllReduceCustomTilingData, tilingData, tilingGM); Hccl hccl; GM_ADDR contextGM = AscendC::GetHcclContext<0>(); // Obtain the HCCL context on the kernel of the Ascend C custom operator. if (AscendC::g_coreType == AIV) { // Specify AIV Cores for communication. hccl.InitV2(contextGM, &tilingData); auto ret = hccl.SetCcTilingV2(offsetof(AllReduceCustomTilingData, mc2CcTiling)); if (ret) { return; } HcclHandle handleId1 = hccl.AllReduce<true>(sendBuf, recvBuf, sendCount, HcclDataType::HCCL_DATA_TYPE_FP16, reduceOp); hccl.Wait(handleId1); AscendC::SyncAll<true>(); // All AIV Cores are synchronized to prevent too fast execution on core 0. Calling the hccl.Finalize() API prematurely can cause suspension of other cores during the Wait operation. hccl.Finalize(); } }
- Multi-round tiling scenario
Multi-round tiling is enabled to process communication equivalently to the foregoing non-multi-round tiling example. As shown in the following figure, the 300 float16 data records on each card are tiled into two first data blocks and one tail data block. Each first block includes 128 float16 data records (tileLen = 128), and each tail block includes 44 float16 data records (tailLen = 44). During implementation in the operator, the tiled data is divided into three rounds for AllReduce communication, achieving communication result equivalent to that of non-multi-round tiling.
Figure 3 Data tiling of each card
Specifically, in the first round of communication, AllReduce operation is performed on 0-0\1-0\2-0\3-0 data blocks on each rank. In the second round of communication, AllReduce operation is performed on 0-1\1-1\2-1\3-1 data blocks on each rank. In the third round of communication, the AllReduce operation is performed on 0-2\1-2\2-2\3-2 data blocks on each rank. The following figure shows the three rounds of communication and provides a code example.
Figure 4 AllReduce between four devices
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
extern "C" __global__ __aicore__ void all_reduce_custom(GM_ADDR xGM, GM_ADDR yGM, GM_ADDR workspaceGM, GM_ADDR tilingGM) { constexpr uint32_t tileNum = 2U; // Number of first blocks constexpr uint64_t tileLen = 128U; // Number of data elements in first blocks constexpr uint32_t tailNum = 1U; // Number of tail blocks constexpr uint64_t tailLen = 44U; // Number of data elements in tail blocks auto sendBuf = xGM; // xGM is the input GM address of AllReduce. auto recvBuf = yGM; // yGM is the output GM address of AllReduce. HcclReduceOp reduceOp = HcclReduceOp::HCCL_REDUCE_SUM; REGISTER_TILING_DEFAULT(AllReduceCustomTilingData); // AllReduceCustomTilingData is a structure defined in the operator header file. GET_TILING_DATA_WITH_STRUCT(AllReduceCustomTilingData, tilingData, tilingGM); Hccl hccl; GM_ADDR contextGM = AscendC::GetHcclContext<0>(); // Obtain the HCCL context on the kernel of the Ascend C custom operator. if (AscendC::g_coreType == AIV) { // Specify AIV Cores for communication. hccl.InitV2(contextGM, &tilingData); auto ret = hccl.SetCcTilingV2(offsetof(AllReduceCustomTilingData, mc2CcTiling)); if (ret != HCCL_SUCCESS) { return; } // Process two first blocks. constexpr uint32_t tileRepeat = tileNum; // Except the input parameters sendBuf and recvBuf, other parameters for processing the two head blocks are the same. Therefore, repeat is set to 2, and sendBuf and recvBuf of the AllReduce task of the second head block are updated in the API. HcclHandle handleId1 = hccl.AllReduce<true>(sendBuf, recvBuf, tileLen, HcclDataType::HCCL_DATA_TYPE_FP16, reduceOp, tileRepeat); // Process one tail block. constexpr uint32_t kSizeOfFloat16 = 2U; sendBuf += tileLen * tileNum * kSizeOfFloat16; recvBuf += tileLen * tileNum * kSizeOfFloat16; constexpr uint32_t tailRepeat = tailNum; HcclHandle handleId2 = hccl.AllReduce<true>(sendBuf, recvBuf, tailLen, HcclDataType::HCCL_DATA_TYPE_FP16, reduceOp, tailRepeat); for (uint8_t i=0; i<tileRepeat; i++) { hccl.Wait(handleId1); } hccl.Wait(handleId2); AscendC::SyncAll<true>(); // All AIV Cores are synchronized to prevent too fast execution on core 0. Calling the hccl.Finalize() API prematurely can cause suspension of other cores during the Wait operation. hccl.Finalize(); } }