AlltoAll

Applicability

Product

Supported

Atlas A3 training products / Atlas A3 inference products

Atlas A2 training products / Atlas A2 inference products

Atlas 200I/500 A2 inference products

x

Atlas inference product 's AI Core

x

Atlas inference product 's Vector Core

x

Atlas training products

x

Function

This is the task delivery API of AlltoAll, a collective communication operator. It returns handleId of the task to users. The function of AlltoAll is as follows: Each card sends data of the same size to all other cards in a communicator, and receives data of the same size from other cards. The API function is described as follows with reference to the parameters in the prototype: the card j receives data block j from the sendBuf of card i, and stores the data in block i in the recvBuf of the current card.

Prototype

1
2
template <bool commit = false>
__aicore__ inline HcclHandle AlltoAll(GM_ADDR sendBuf, GM_ADDR recvBuf, uint64_t dataCount, HcclDataType dataType, uint64_t strideCount = 0, uint8_t repeat = 1);

Parameters

Table 1 Template parameters

Parameter

Input/Output

Description

commit

Input

Bool. Values:

  • true: When the Prepare API is called, the Commit API instructs the server to execute the communication task.
  • false: When the Prepare API is called, the server is not instructed to execute the communication task.
Table 2 API parameters

Parameter

Input/Output

Description

sendBuf

Input

Address of the source data buffer.

recvBuf

Output

Address of the destination data buffer to receive collective communication result.

dataCount

Input

Amount of data sent and received by the current card to and from other cards in the communicator. The unit is sizeof(dataType).

For example, if four cards are used in a communicator, and four fp16 data records exist in the sendBuf of each card, the value of dataCount is 1.

dataType

Input

Data type of the AlltoAll operation. All data types of HcclDataType are supported. For details about HcclDataType, see Table 1.

strideCount

Input

Spacing between data blocks on each card that are involved in communication in an AlltoAll task in the multi-round tiling scenario. The default value is 0, indicating that data blocks in the memory are contiguous.

  • If strideCount = 0, data blocks on each card that are involved in communication are continuous. Card rank_j receives the jth data block from the sendBuf of card rank_i, and the offset data amount between blocks is j x dataCount. Then card rank_j stores this data block in the ith block in its recvBuf, and the offset data amount is i x dataCount.
  • If strideCount is greater than 0, the offset data amount of start addresses of adjacent data blocks on each card that are involved in communication is strideCount. Card rank_j receives the jth data block from the sendBuf of card rank_i, and the offset data amount between blocks is j x strideCount. Then card rank_j stores this data block in the ith block in its recvBuf, and the offset data amount is i x strideCount.

Note: The preceding offset data amount is the number of data elements. The unit is sizeof(dataType).

repeat

Input

Number of AlltoAll communication tasks delivered at a time. The value of repeat is greater than or equal to 1, and the default value is 1. When repeat >1, the sendBuf and recvBuf addresses of each round of AlltoAll task are updated by the server. The update formula of task round i is as follows:

sendBuf[i] = sendBuf + dataCount x sizeof(datatype) x i, i∈[0, repeat)

recvBuf[i] = recvBuf + dataCount x sizeof(datatype) x i, i∈[0, repeat)

Note: When the value of repeat is greater than 1, the strideCount parameter must be used together to plan the communication data address.

Returns

The task ID handleId is returned. The value of handleId is greater than or equal to 0. If the API fails to be called, the value -1 is returned.

Restrictions

  • Before calling this API, ensure that the InitV2 and SetCcTilingV2 APIs have been called.
  • If the core for delivering the communication task is not specified in the config template parameters of the HCCL object, this API can be called only on the AIC or AIV core. If the core for delivering the communication task is specified in the config template parameters of the HCCL object, this API can be called on both the AIC and AIV cores. The API delivers the communication task only on the AIC or AIV core based on the specified core type.
  • For the Atlas A2 training products / Atlas A2 inference products , the total number of times that Prepare APIs are called in a communicator cannot exceed 63.
  • For the Atlas A3 training products / Atlas A3 inference products , the total number of times that all Prepare and InterHcclGroupSync APIs are called in a communicator cannot exceed 63.
  • For the Atlas A3 training products / Atlas A3 inference products , a maximum of 128 cards can be used for communication in a communicator.

Example

  • Non-multi-round tiling scenario

    Four cards are used to execute the AlltoAll communication task. In the non-multi-round tiling scenario, the number of data blocks and data records on each card are the same. As shown in the following figure, the number of data elements in blocks A, B, C, and D on each card is dataCount.

    Figure 1 AlltoAll communication between four cards in non-multi-round tiling scenario
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    extern "C" __global__ __aicore__ void alltoall_custom(GM_ADDR xGM, GM_ADDR yGM, GM_ADDR workspaceGM, GM_ADDR tilingGM) {
        constexpr uint64_t dataCount = 128U; // Data volume
        auto sendBuf = xGM;  // xGM is the GM address of AlltoAll input.
        auto recvBuf = yGM;  // yGM is the GM address of AlltoAll output.
        REGISTER_TILING_DEFAULT(AllToAllCustomTilingData); // AllToAllCustomTilingData is the structure defined in the operator header file.
        GET_TILING_DATA_WITH_STRUCT(AllToAllCustomTilingData, tilingData, tilingGM);
    
        Hccl hccl;
        GM_ADDR contextGM = AscendC::GetHcclContext<0>();  // Obtain the HCCL context on the kernel of the Ascend C custom operator.
    
        if (AscendC::g_coreType == AIV) {  // Specify AIV Cores for communication.
            hccl.InitV2(contextGM, &tilingData);
            auto ret = hccl.SetCcTilingV2(offsetof(AllToAllCustomTilingData, alltoallCcTiling));
    	if (ret != HCCL_SUCCESS) {
    	    return;
    	}
    	HcclHandle handleId = hccl.AlltoAll<true>(sendBuf, recvBuf, dataCount, HcclDataType::HCCL_DATA_TYPE_FP16);   
    	hccl.Wait(handleId);   
    	AscendC::SyncAll<true>();  // All AIV Cores are synchronized to prevent too fast execution on core 0. Calling the hccl.Finalize() API prematurely can cause suspension of other cores during the Wait operation.
    	hccl.Finalize();
        }
    }
    
  • Multi-round tiling scenario

    Multi-round tiling is enabled to process communication equivalently to the foregoing non-multi-round tiling example. Data on each card is evenly divided into four blocks (A, B, C, and D), and each block is further split into various data tiles. In this example, a data block is split into three tiles. As shown in the following figure, the three tiles include two tiles with tileLen data records and one tail tile with tailLen data records. After tiling, three rounds of AlltoAll communication are performed, and communication result equivalent to that of non-multi-round tiling is achieved.

    Figure 2 AlltoAll communication between four cards in three-round tiling scenario

    In the first round of communication, AlltoAll operation is performed on 0-0\1-0\2-0\3-0 data blocks on each rank. The spacing between adjacent data blocks on the same card that are involved in communication is the value of strideCount. In the second round of communication, AlltoAll operation is performed on 0-1\1-1\2-1\3-1 data blocks on each rank. In the third round of communication, AlltoAll operation is performed on 0-2\1-2\2-2\3-2 data blocks on each rank. The following figure shows the first round of communication and provides a code example.

    Figure 3 First round of AlltoAll communication between four cards
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    extern "C" __global__ __aicore__ void alltoall_custom(GM_ADDR xGM, GM_ADDR yGM, GM_ADDR workspaceGM, GM_ADDR tilingGM) {
        constexpr uint32_t tileNum = 2U;   // Number of first blocks
        constexpr uint64_t tileLen = 128U; // Number of data elements in first blocks
        constexpr uint32_t tailNum = 1U;   // Number of tail blocks
        constexpr uint64_t tailLen = 100U; // Number of data elements in tail blocks
        auto sendBuf = xGM;  // xGM is the GM address of AlltoAll input.
        auto recvBuf = yGM;  // yGM is the GM address of AlltoAll output.
        REGISTER_TILING_DEFAULT(AllToAllCustomTilingData); // AllToAllCustomTilingData is the structure defined in the operator header file.
        GET_TILING_DATA_WITH_STRUCT(AllToAllCustomTilingData, tilingData, tilingGM);
    
        Hccl hccl;
        GM_ADDR contextGM = AscendC::GetHcclContext<0>();  // Obtain the HCCL context on the kernel of the Ascend C custom operator.
    
        if (AscendC::g_coreType == AIV) {  // Specify AIV Cores for communication.
            hccl.InitV2(contextGM, &tilingData);
            auto ret = hccl.SetCcTilingV2(offsetof(AllToAllCustomTilingData, alltoallCcTiling));
            if (ret != HCCL_SUCCESS) {
              return;
            }
            uint64_t strideCount = tileLen * tileNum + tailLen * tailNum;
            // Process two first blocks.
            HcclHandle handleId1 = hccl.AlltoAll<true>(sendBuf, recvBuf, tileLen, HcclDataType::HCCL_DATA_TYPE_FP16, strideCount, tileNum);
            // Process one tail block.
            constexpr uint32_t kSizeOfFloat16 = 2U;
            sendBuf += tileLen * tileNum * kSizeOfFloat16;
            recvBuf += tileLen * tileNum * kSizeOfFloat16;
            HcclHandle handleId2 = hccl.AlltoAll<true>(sendBuf, recvBuf, tailLen, HcclDataType::HCCL_DATA_TYPE_FP16, strideCount, tailNum);
            
            for (uint8_t i=0; i<tileNum; i++) {
                hccl.Wait(handleId1);
            }
            hccl.Wait(handleId2);  
            AscendC::SyncAll<true>();  // All AIV Cores are synchronized to prevent too fast execution on core 0. Calling the hccl.Finalize() API prematurely can cause suspension of other cores during the Wait operation.
            hccl.Finalize();
        }
    }