AlltoAllV

Applicability

Product

Supported

Atlas A3 training products / Atlas A3 inference products

Atlas A2 training products / Atlas A2 inference products

x

Atlas 200I/500 A2 inference products

x

Atlas inference product 's AI Core

x

Atlas inference product 's Vector Core

x

Atlas training products

x

Function

This is the task delivery API of AlltoAllV, a collective communication operator. It returns handleId of the task to users. AlltoAll is a subset of AlltoAllV. AlltoAll requires that the amount of data sent and received by all devices be the same, while AlltoAllV does not have the requirement and can be used more flexibly.

The function of AlltoAllV is as follows: Devices in a communicator send and receive data to and from each other. It can customize the size of data sent by each device to other devices, the size of data received from other devices, and the offset of the sent and received data in the memory. The API function is described as follows with reference to the parameters in the prototype: The ith card sends the jth data block in the sendBuf to the jth card. The offset of this block in the sendBuf is sdispls[j], and the data count is sendCounts[j]. The jth card stores the data in its recvBuf. The offset is rdispls[i], and the received data count is recvCounts[i]. sendCounts[j] must be equal to recvCounts[i]. Note: The offset and data count are the number of data elements. The unit is sizeof(sendType).

Prototype

1
2
template <bool commit = false>
__aicore__ inline HcclHandle AlltoAllV(GM_ADDR sendBuf, void *sendCounts, void *sdispls, HcclDataType sendType, GM_ADDR recvBuf, void *recvCounts, void *rdispls, HcclDataType recvType, uint8_t repeat = 1);

Parameters

Table 1 Template parameters

Parameter

Input/Output

Description

commit

Input

Bool. Values:

  • true: When the Prepare API is called, the Commit API instructs the server to execute the communication task.
  • false: When the Prepare API is called, the server is not instructed to execute the communication task.
Table 2 API parameters

Parameter

Input/Output

Description

sendBuf

Input

Address of the buffer to send source data.

sendCounts

Input

Amount of data sent by the current device to each other card in a communicator. sendCounts[i] indicates the amount of data sent by the current device to rank_i of the ith card . The unit is sizeof(sendType).

sendCounts is an array of the uint64_t type. The array length must be the total number of ranks in a communicator.

For example, if the type of data in the sendBuf is fp16, sendCounts[0] = 1, and sendCounts[1] = 2, the current rank sends one fp16 data record to rank 0 and two fp16 data records to rank 1.

sdispls

Input

Offset of the data transmitted from the current device to other cards in the sendBuf. sdispls[i]=n indicates that the offset data amount of the data blocks transmitted from the current device to rank_i in the sendBuf is n.

sdispls is an array of the uint64_t type. The array length must be the total number of ranks in a communicator.

sendType

Input

Type of data in the sendBuf. Currently, all data types contained in HcclDataType are supported. For details about HcclDataType, see Table 1.

recvBuf

Output

Address of the destination data buffer to receive collective communication result.

recvCounts

Input

Amount of data received by the current device from other cards. recvCounts[i] indicates the amount of data received by the current rank from rank_i. The unit is sizeof(recvType).

recvCounts is an array of the uint64_t type. The array length must be the total number of ranks in a communicator.

For example, if the type of data in the recvBuf is fp16, recvCounts[0] = 1, and recvCounts[1] = 2, the current rank receives one fp16 data record from rank 0 and two fp16 data records from rank 1.

rdispls

Input

Offset of the data received by the card in the recvBuf. rdispls[i] = n indicates that the offset data amount of the data blocks received by the card from rank_i in the recvBuf is n.

rdispls is an array of the uint64_t type. The array length must be the total number of ranks in a communicator.

recvType

Input

Type of data in the recvBuf. Currently, all data types contained in HcclDataType are supported. For details about HcclDataType, see Table 1.

Note: The values of recvType and sendType must be the same.

repeat

Input

Number of AlltoAllV communication tasks delivered at a time. The value of repeat is greater than or equal to 1, and the default value is 1. When repeat >1, the sendBuf, sendCounts, recvBuf, and recvCounts parameters of each AlltoAllV task remain unchanged, and sdispls and rdispls are updated by the server. The update formula of task i in each round is as follows:

sdispls[i] = sdispls[i] + sendCounts[i], i∈[0, sdispls.size())

rdispls[i] = rdispls[i] + recvCounts[i], i∈[0, rdispls.size())

Note: When repeat >1, you need to plan the communication memory based on this formula.

Returns

The task ID handleId is returned. The value of handleId is greater than or equal to 0. If the API fails to be called, the value -1 is returned.

Restrictions

  • Before calling this API, ensure that the InitV2 and SetCcTilingV2 APIs have been called.
  • If the core for delivering the communication task is not specified in the config template parameters of the HCCL object, this API can be called only on the AIC or AIV core. If the core for delivering the communication task is specified in the config template parameters of the HCCL object, this API can be called on both the AIC and AIV cores. The API delivers the communication task only on the AIC or AIV core based on the specified core type.
  • The total number of times that all Prepare and InterHcclGroupSync APIs are called in a communicator cannot exceed 63.
  • The number of data elements sent by each device to rank_j (sendCounts[j]) must be equal to that received by rank_j from rank_i (recvCounts[i]).
  • For the Atlas A3 training products / Atlas A3 inference products , a maximum of 128 cards can be used for communication in a communicator.

Example

  • Using AlltoAllV to implement AlltoAll communication between four cards

    Call the AlltoAllV API for four cards. In the non-multi-round tiling scenario, the number of data blocks and data records on each card are the same. As shown in the following figure, the number of data elements in blocks A, B, C, and D on each card is dataCount.

    Figure 1 AlltoAllV communication between four cards in non-tiling scenarios
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    extern "C" __global__ __aicore__ void alltoallv_custom(GM_ADDR xGM, GM_ADDR yGM, GM_ADDR workspaceGM, GM_ADDR tilingGM) {
        constexpr uint32_t rankNum = 4U;	
        constexpr uint32_t dataCount = 10U;   // Assume that the number of data elements in blocks A, B, C, and D is 10.
        uint64_t sendCounts[rankNum] = {0};
        uint64_t sDisplacements[rankNum] = {0};
        uint64_t recvCounts[rankNum] = {0};
        uint64_t rDisplacements[rankNum] = {0};
        for (uint32_t i = 0U; i < rankNum; ++i) {
            sendCounts[i] = dataCount;
            sDisplacements[i] = i * dataCount;
            recvCounts[i] = dataCount;
            rDisplacements[i] = i * dataCount;
        }	
        auto sendBuf = xGM;  // xGM is the input GM address of AlltoAllV.
        auto recvBuf = yGM;  // yGM is the output GM address of AlltoAllV.
        auto dtype = HcclDataType::HCCL_DATA_TYPE_FP16;
        REGISTER_TILING_DEFAULT(AllToAllVCustomTilingData); // AllToAllVCustomTilingData is the structure defined in the operator header file.
        GET_TILING_DATA_WITH_STRUCT(AllToAllVCustomTilingData, tilingData, tilingGM);
    
        Hccl hccl;
        GM_ADDR contextGM = AscendC::GetHcclContext<0>();  // Obtain the HCCL context on the kernel of the Ascend C custom operator.
    
        if (AscendC::g_coreType == AIV) {  // Specify AIV Cores for communication.
            hccl.InitV2(contextGM, &tilingData);
            auto ret = hccl.SetCcTilingV2(offsetof(AllToAllVCustomTilingData, alltoallvCcTiling));
            if (ret != HCCL_SUCCESS) {
              return;
            }
            auto handleId1 = hccl.AlltoAllV<true>(sendBuf, sendCounts, sDisplacements, dtype, 
                                                   recvBuf, recvCounts, rDisplacements, dtype);
    
            hccl.Wait(handleId1);    
            AscendC::SyncAll<true>();  // All AIV Cores are synchronized to prevent too fast execution on core 0. Calling the hccl.Finalize() API prematurely can cause suspension of other cores during the Wait operation.
            hccl.Finalize();
        }
    }
    
  • Using AlltoAllV to transmit and receive data of different sizes between four cards

    In the following figure, the numbers in the cells under each rank indicate the number of data pieces sent or received. Take rank 1 as an example. The rank sends 2, 2, 3, and 2 pieces of data to rank 0, rank 1, rank 2, and rank 3, respectively, and receives 3, 2, 4, and 3 pieces of data from rank 0, rank 1, rank 2, and rank 3, respectively. The corresponding code example is as follows:

    Figure 2 Uneven data receiving and sending between four cards in non-tiling scenarios
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    extern "C" __global__ __aicore__ void alltoallv_custom(GM_ADDR xGM, GM_ADDR yGM, GM_ADDR workspaceGM, GM_ADDR tilingGM) {
        constexpr uint32_t rankNum = 4U;	
        uint64_t sendCounts[rankNum] = {0};
        uint64_t sDisplacements[rankNum] = {0};
        uint64_t recvCounts[rankNum] = {0};
        uint64_t rDisplacements[rankNum] = {0};
        auto sendBuf = xGM;  // xGM is the input GM address of AlltoAllV.
        auto recvBuf = yGM;  // yGM is the output GM address of AlltoAllV.
        auto dtype = HcclDataType::HCCL_DATA_TYPE_FP16;
        REGISTER_TILING_DEFAULT(AllToAllVCustomTilingData); // AllToAllVCustomTilingData is the structure defined in the operator header file.
        GET_TILING_DATA_WITH_STRUCT(AllToAllVCustomTilingData, tilingData, tilingGM);
    
        Hccl hccl;
        GM_ADDR contextGM = AscendC::GetHcclContext<0>();  // Obtain the HCCL context on the kernel of the Ascend C custom operator.
    
        if (AscendC::g_coreType == AIV) {  // Specify AIV Cores for communication.
            hccl.InitV2(contextGM, &tilingData);
            auto ret = hccl.SetCcTilingV2(offsetof(AllToAllVCustomTilingData, alltoallvCcTiling));
            if (ret != HCCL_SUCCESS) {
              return;
            }
            if(hccl.GetRankId() == 0) {
                sendCounts[0] = 3; sendCounts[1] = 3; sendCounts[2] = 3; sendCounts[3] = 3;
                sDisplacements[1] = 3; sDisplacements[2] = 6; sDisplacements[2] = 9;
                recvCounts[0] = 3; recvCounts[1] = 2; recvCounts[2] = 1; recvCounts[3] = 3;
                rDisplacements[1] = 3; rDisplacements[2] = 5; rDisplacements[3] = 6;
            } else if(hccl.GetRankId() == 1) {
                sendCounts[0] = 2; sendCounts[1] = 2; sendCounts[2] = 3; sendCounts[3] = 2;
                sDisplacements[1] = 2; sDisplacements[2] = 4; sDisplacements[2] = 7;
                recvCounts[0] = 3; recvCounts[1] = 2; recvCounts[2] = 4; recvCounts[3] = 3;
                rDisplacements[1] = 3; rDisplacements[2] = 5; rDisplacements[3] = 9;
            } else if(hccl.GetRankId() == 2) {
                sendCounts[0] = 1; sendCounts[1] = 4; sendCounts[2] = 4; sendCounts[3] = 4;
                sDisplacements[1] = 1; sDisplacements[2] = 5; sDisplacements[2] = 9;
                recvCounts[0] = 3; recvCounts[1] = 3; recvCounts[2] = 4; recvCounts[3] = 3;
                rDisplacements[1] = 3; rDisplacements[2] = 6; rDisplacements[3] = 10;
            } else if(hccl.GetRankId() == 3) {
                sendCounts[0] = 3; sendCounts[1] = 3; sendCounts[2] = 3; sendCounts[3] = 3;
                sDisplacements[1] = 3; sDisplacements[2] = 6; sDisplacements[2] = 9;
                recvCounts[0] = 3; recvCounts[1] = 2; recvCounts[2] = 4; recvCounts[3] = 3;
                rDisplacements[1] = 3; rDisplacements[2] = 5; rDisplacements[3] = 9;
            }
            auto handleId = hccl.AlltoAllV<true>(sendBuf, sendCounts, sDisplacements, dtype, 
                                                   recvBuf, recvCounts, rDisplacements, dtype);
    
            hccl.Wait(handleId);    
            AscendC::SyncAll<true>();  // All AIV Cores are synchronized to prevent too fast execution on core 0. Calling the hccl.Finalize() API prematurely can cause suspension of other cores during the Wait operation.
            hccl.Finalize();
        }
    }