Iterate

Applicability

Product

Supported

Atlas A3 training products / Atlas A3 inference products

Atlas A2 training products / Atlas A2 inference products

Atlas 200I/500 A2 inference products

x

Atlas inference product 's AI Core

x

Atlas inference product 's Vector Core

x

Atlas training products

x

Function

In some algorithms, a complete collective communication task can be divided into multiple steps. The point-to-point communication task for the data of each step is called fine-grained communication. The following uses the AlltoAllV communication task with the communication algorithm "AlltoAll=level0:fullmesh;level1:pairwise" and the communication step length of 1 as an example. The level0 parameter indicates the intra-server communication algorithm (an Ascend AI Server is generally a server equipped with 8 or 16 Ascend NPUs), and the level1 parameter indicates the inter-server communication algorithm. The fullmesh parameter indicates the full-mesh communication algorithm, and the pairwise parameter indicates the pairwise communication algorithm. For details about the algorithms, see "Collective Communication Algorithm Introduction". As shown in the following figure, this example illustrates all data to be sent in the AlltoAllV communication and the data received by each device after each communication step is complete.

Figure 1 AlltoAllV communication steps using the pairwise algorithm

In MC2 operators, you can call this API to obtain the input or output of each step of the communication algorithm based on the corresponding Prepare primitive. In this way, the calculation and communication can be arranged in a finer granularity, thereby obtaining better performance benefits.

Prototype

1
2
template <bool sync = true>
__aicore__ inline int32_t Iterate(HcclHandle handleId, uint16_t *seqSlices, uint16_t seqSliceLen)

Parameters

Table 1 Template parameters

Parameter

Input/Output

Description

sync

Input

Boolean type. Whether to wait for the current communication step to complete before proceeding with subsequent computation or communication tasks, parameter values are as follows:

  • true: default value. Indicates that the current communication step is blocked and waits for its completion. If this parameter is set to true, you do not need to call the Wait API to wait for the communication task to complete.
  • false: indicates not waiting for the completion of the current communication step.
Table 2 API parameters

Parameter

Input/Output

Description

handleId

Input

ID of the corresponding communication task. Only the return value of the API corresponding to the Prepare primitive can be used.

1
using HcclHandle = int8_t;

seqSlices

Output

Stack space requested by the user, which is used to store the index subscripts of the input or output data blocks in the current communication step. In the compute-first-then-communicate scenario, this parameter returns the index of the input data block required in the current communication step. In the communicate-first-then-compute scenario, this parameter returns the index of the output data block in the current communication step.

seqSliceLen

Input

Length of the seqSlices array. Based on the communication step and algorithm logic, the number of data block indexes that need to be saved in each communication step is used as the length of the array.

Returns

  • When the communication task is not complete:
    • In the computation before communication scenario, the return value is the number of input data blocks required in the current communication step, which is the same as the value of the seqSliceLen parameter.
    • In the communication before computation scenario, the return value is the number of output data blocks produced in the current communication step, which is the same as the value of the seqSliceLen parameter.
  • When the communication task is complete, the return value is 0.

Restrictions

  • Before calling this API, ensure that the InitV2 and SetCcTilingV2 APIs have been called.
  • The input parameter handleId can use only the return value of the Prepare primitive API.
  • Currently, this API supports the communication algorithm "AlltoAll=level0:fullmesh;level1:pairwise".

Example

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
extern "C" __global__ __aicore__ void alltoallv_custom(GM_ADDR sendBuf, GM_ADDR recvBuf, GM_ADDR workspaceGM, GM_ADDR tilingGM) {
    // Specify the AIV core for communication.
    if (AscendC::g_coreType != AIV) {
        return;
    }

    constexpr uint32_t RANK_NUM = 4U;
    constexpr uint32_t STEP_SIZE = 1U; // Fine-grained communication step, which is usually set by calling the SetStepSize API. In this example, the step is simplified to 1.
    constexpr uint64_t sendCounts[RANK_NUM][RANK_NUM] = {
        {3, 3, 3, 3}, {2, 2, 3, 2},
        {1, 4, 4, 4}, {3, 3, 3, 3}
    };
    constexpr uint64_t sDisplacements[RANK_NUM][RANK_NUM] = {
        {0, 3, 6, 9}, {0, 2, 4, 7},
        {0, 1, 5, 9}, {0, 3, 6, 9}
    };
    constexpr uint64_t recvCounts[RANK_NUM][RANK_NUM] = {
        {3, 2, 1, 3}, {3, 2, 4, 3},
        {3, 3, 4, 3}, {3, 2, 4, 3}
    };
    constexpr uint64_t rDisplacements[RANK_NUM][RANK_NUM] = {
        {0, 3, 5, 6}, {0, 3, 5, 9},
        {0, 3, 6, 10}, {0, 3, 5, 9}
    };
    HcclDataType dtype = HcclDataType::HCCL_DATA_TYPE_FP16;
    REGISTER_TILING_DEFAULT(AllToAllVCustomTilingData); // AllToAllVCustomTilingData is the structure defined in the operator header file.
    GET_TILING_DATA_WITH_STRUCT(AllToAllVCustomTilingData, tilingData, tilingGM);
    GM_ADDR contextGM = AscendC::GetHcclContext<0>();  // Obtain the HCCL context on the kernel of the Ascend C custom operator.
    Hccl hccl;
    hccl.InitV2(contextGM, &tilingData);
    auto ret = hccl.SetCcTilingV2(offsetof(AllToAllVCustomTilingData, alltoallvCcTiling));
    if (ret != HCCL_SUCCESS) {
        return;
    }
    const uint32_t selfRankId = hccl.GetRankId();
    // When the communication task is "AlltoAll=level0:fullmesh;level1:pairwise":
    // 1. The number of data blocks generated in each communication step is equal to STEP_SIZE.
    // 2. The total number of communication steps is RANK_NUM/STEP_SIZE*repeat.
    uint16_t sliceInfo[STEP_SIZE];

    if (TILING_KEY_IS(1000UL)) {
        // In the "communication before computation" scenario of communication-computing fusion, communication is performed in each step, and then the output of communication is used as the input of computation and computation is performed.
        const auto handleId = hccl.AlltoAllV<true>(sendBuf, sendCounts[selfRankId], sDisplacements[selfRankId], dtype,
                                                   recvBuf, recvCounts[selfRankId], rDisplacements[selfRankId], dtype);
        // The template parameter sync = true indicates that the API blocks and waits for the communication result of each step, and fills the subscript index of the output data block in sliceInfo.
        while (hccl.Iterate<true>(handleId, sliceInfo, sizeof(sliceInfo) / sizeof(sliceInfo[0]))) {
            // The subscript index of the output data block of each communication step is stored in sliceInfo. The corresponding computation process can be inserted to implement fine-grained communication-computation convergence.
        }
        // Iterate already blocks and waits, so Wait is no longer needed.
        // hccl.Wait(handleId);
    } else if (TILING_KEY_IS(1001UL)) {
        // In the "compute before communication" scenario of compute-communication convergence, each step is computed first, and then the computation result is used as the input for communication and the communication transaction is submitted.
        const uint8_t tileNum = 2U;
        const auto handleId = hccl.AlltoAllV<false>(sendBuf, sendCounts[selfRankId], sDisplacements[selfRankId], dtype,
                                                    recvBuf, recvCounts[selfRankId], rDisplacements[selfRankId], dtype,
                                                    tileNum);
        for (uint8_t i = 0; i < tileNum; ++i) {
            for (uint8_t j = 0; j < RANK_NUM; ++j) {
                // Template parameter sync = false indicates that this API does not block waiting and only fills the input data block of the current communication step into sliceInfo.
                if (hccl.Iterate<false>(handleId, sliceInfo, sizeof(sliceInfo) / sizeof(sliceInfo[0])) <= 0) {
                    break;
                }
                // sliceInfo indicates the relative address offset. The GM address needs to be calculated based on sDisplacements to ensure that the communication input is correct.
                // After the calculation is complete, inter-core synchronization is required. Then, the server is instructed to perform collective communication through the Commit interface.
                hccl.Commit(handleId);
            }
        }
        for (uint8_t i = 0; i < tileNum * RANK_NUM; ++i) {
            hccl.Wait(handleId);
        }
    }
    AscendC::SyncAll<true>();
    hccl.Finalize();
}