ReduceScatter

Applicability

Product	Supported
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
Atlas inference product 's AI Core	x
Atlas inference product 's Vector Core	x
Atlas training products	x

Function

Task delivery API of ReduceScatter, a collective communication operator. It returns handleId of the task to users. The function of ReduceScatter is as follows: Add (or perform other reduction operations on) the inputs of all ranks, and then distribute the results evenly to the output buffers of ranks according to their rank numbers. Each process receives 1/ranksize portion of data from other processes for reduction.

Prototype

      
           template <bool commit = false>
__aicore__ inline HcclHandle ReduceScatter(GM_ADDR sendBuf, GM_ADDR recvBuf, uint64_t recvCount, HcclDataType dataType, HcclReduceOp op, uint64_t strideCount, uint8_t repeat = 1)

Parameters

**Table 1** Template parameters
Parameter	Input/Output	Description
commit	Input	Bool. Values: true: When the Prepare API is called, the Commit API instructs the server to execute the communication task. false: When the Prepare API is called, the server is not instructed to execute the communication task.

**Table 2** API parameters
Parameter	Input/Output	Description
sendBuf	Input	Address of the buffer to send source data.
recvBuf	Output	Address of the destination data buffer to receive collective communication result.
recvCount	Input	Number of recvBuf data elements involved in the ReduceScatter operation. The number of sendBuf data elements is computed as: recvCount × rank size.
dataType	Input	Data type of the ReduceScatter operation, which can be float, half, int8_t, int16_t, bfloat16_t and int32_t. Specifically, the options can be HCCL_DATA_TYPE_FP32, HCCL_DATA_TYPE_FP16, HCCL_DATA_TYPE_INT8, HCCL_DATA_TYPE_INT16, HCCL_DATA_TYPE_INT32, and HCCL_DATA_TYPE_BFP16. For details about the HcclDataType data type, see Table 1.
op	Input	ReduceScatter operation type, which can be HCCL_REDUCE_SUM, HCCL_REDUCE_MAX, or HCCL_REDUCE_MIN. Currently, sum, max, and min operations are supported. For details about the HcclReduceOp data type, see Table 2.
strideCount	Input	This parameter indicates the offset of start addresses of adjacent data blocks in the sendBuf when data in the sendBuf of a card is scattered to the recvBuf of multiple cards. If the value of strideCount is 0, the addresses of adjacent data blocks are consecutive when data is transmitted from the current device to other cards. The current card transmits data to card rank[i], and the offset of data blocks in sendBuf of the card is computed as: i × recvCount. In non-multi-round tiling scenarios, you are advised to set this parameter to 0. If strideCount is greater than 0, the offset of start addresses of adjacent data blocks in the sendBuf is strideCount when data is transmitted from the current device to other cards. The current card transmits data to card rank[i], and the offset of data blocks in SendBuf of the card is computed as: i × strideCount. Note: The preceding offset data amount is the number of data elements. The unit is sizeof(dataType).
repeat	Input	Number of ReduceScatter communication tasks delivered at a time. The value of repeat is greater than or equal to 1, and the default value is 1. When repeat is greater than 1, the server automatically computes the sendBuf and recvBuf addresses of each ReduceScatter task. The calculation formulas are as follows: sendBuf[i] = sendBuf + recvCount × sizeof(datatype) × i, i∈[0, repeat) recvBuf[i] = recvBuf + recvCount × sizeof(datatype) × i, i∈[0, repeat) Note: When the value of repeat is greater than 1, the strideCount parameter must be used together to plan the communication data address.

Figure 1 ReduceScatter communication example

In the preceding figure, four cards are used, each data record is tiled into three blocks (TileCnt = 3), and blocks 0-0, 0-1, and 0-2 on each card are reduced and scattered to the recvBuf of rank0. The rest blocks 1-y, 2-y, and 3-y are reduced and scattered to the recvBuf of rank1, rank2, and rank3, respectively. Therefore, the ReduceScatter API is called for three times for data on one card to complete the communication of three tiled data blocks of each data record. For each data record, the value of recvCount is TileLen, and the value of strideCount is TileLen x TileCnt (that is, the number of data elements between blocks 0-0 and 1-0). In this example, the memory is contiguous. Therefore, you can call the ReduceScatter API only once and set the repeat parameter to 3.

Returns

The task ID handleId is returned. The value of handleId is greater than or equal to 0. If the API fails to be called, the value -1 is returned.

Restrictions

Before calling this API, ensure that the InitV2 and SetCcTilingV2 APIs have been called.
If the core for delivering the communication task is not specified in the config template parameters of the HCCL object, this API can be called only on the AIC or AIV core. If the core for delivering the communication task is specified in the config template parameters of the HCCL object, this API can be called on both the AIC and AIV cores. The API delivers the communication task only on the AIC or AIV core based on the specified core type.
For the Atlas A2 training products / Atlas A2 inference products , the total number of times that Prepare APIs are called in a communicator cannot exceed 63.
For the Atlas A3 training products / Atlas A3 inference products , the total number of times that all Prepare and InterHcclGroupSync APIs are called in a communicator cannot exceed 63.

Example

Non-multi-round tiling scenario

As shown in the following figure, each of the four cards has 1200 (300 x 4) float16 data records. Each card obtains its own data from the xGM memory, and scatters the result data after the reduce sum calculation is performed on the data of each card, finally, each card obtains 300 float16 data records after the reduce sum operation.

Figure 2 ReduceScatter communication between four devices in non-multi-round tiling scenario

          
           
             
             
               extern "C" __global__ __aicore__ void reduce_scatter_custom(GM_ADDR xGM, GM_ADDR yGM, GM_ADDR workspaceGM, GM_ADDR tilingGM) {
    auto sendBuf = xGM;  // xGM is the GM address of ReduceScatter input.
    auto recvBuf = yGM;  // yGM is the GM address of ReduceScatter output.
    uint64_t recvCount = 300;  // Number of communication data records of each card.
    uint64_t strideCount = 0;  // strideCount can be set to 0 in non-tiling scenarios.
    HcclReduceOp reduceOp = HcclReduceOp::HCCL_REDUCE_SUM;
    REGISTER_TILING_DEFAULT(ReduceScatterCustomTilingData); // ReduceScatterCustomTilingData is a structure defined in the operator header file.
    GET_TILING_DATA_WITH_STRUCT(ReduceScatterCustomTilingData, tilingData, tilingGM);

    Hccl hccl;
    GM_ADDR contextGM = AscendC::GetHcclContext<0>();  // Obtain the HCCL context on the kernel of the Ascend C custom operator.
    if (AscendC::g_coreType == AIV) {  // Specify AIV Cores for communication.
        hccl.InitV2(contextGM, &tilingData);
        auto ret = hccl.SetCcTilingV2(offsetof(ReduceScatterCustomTilingData, reduceScatterCcTiling));
        if (ret != HCCL_SUCCESS) {
          return;
        }
        HcclHandle handleId1 = hccl.ReduceScatter<true>(sendBuf, recvBuf, recvCount, HcclDataType::HCCL_DATA_TYPE_FP16, reduceOp, strideCount);
        hccl.Wait(handleId1);    
        AscendC::SyncAll<true>();  // All AIV Cores are synchronized to prevent too fast execution on core 0. Calling the hccl.Finalize() API prematurely can cause suspension of other cores during the Wait operation.
        hccl.Finalize();
    }
}

              

            

          
         

Multi-round tiling scenario

Multi-round tiling is enabled to process communication equivalently to the foregoing non-multi-round tiling example. As shown in the following figure, the 300 float16 data records on each card are tiled into two first data blocks and one tail data block. Each first block includes 128 float16 data records (tileLen = 128), and each tail block includes 44 float16 data records (tailLen = 44). During implementation in the operator, the tiled data is divided into three rounds for ReduceScatter communication, achieving communication result equivalent to that of non-multi-round tiling.

Figure 3 Data tiling of each card

Specifically, in the first round of communication, the ReduceScatter operation is performed on 0-0\1-0\2-0\3-0 data blocks on each rank. In the second round of communication, ReduceScatter operation is performed on 0-1\1-1\2-1\3-1 data blocks on each rank. In the third round of communication, ReduceScatter operation is performed on 0-2\1-2\2-2\3-2 data blocks on each rank. In the input data of each round of communication, the number of data elements between start addresses of adjacent data blocks on each card is strideCount. Take the first round of communication as an example. The number of data elements between start addresses of blocks 0-0 and 1-0 on rank0 or between 1-0 and 2-0 is computed as: strideCount = 2 x tileLen + 1 x tailLen=300.

Figure 4 First round of ReduceScatter between four devices

        
         
           
           
             extern "C" __global__ __aicore__ void reduce_scatter_custom(GM_ADDR xGM, GM_ADDR yGM, GM_ADDR workspaceGM, GM_ADDR tilingGM) {
    constexpr uint32_t tileNum = 2U;   // Number of first blocks
    constexpr uint64_t tileLen = 128U; // Number of data elements in first blocks
    constexpr uint32_t tailNum = 1U;   // Number of tail blocks
    constexpr uint64_t tailLen = 44U;  // Number of data elements in tail blocks
    auto sendBuf = xGM;  // xGM is the GM address of ReduceScatter input.
    auto recvBuf = yGM;  // yGM is the GM address of ReduceScatter output.
    HcclReduceOp reduceOp = HcclReduceOp::HCCL_REDUCE_SUM;
    uint64_t strideCount = tileLen * tileNum + tailLen * tailNum;
    REGISTER_TILING_DEFAULT(ReduceScatterCustomTilingData); // ReduceScatterCustomTilingData is a structure defined in the operator header file.
    GET_TILING_DATA_WITH_STRUCT(ReduceScatterCustomTilingData, tilingData, tilingGM);

    Hccl hccl;
    GM_ADDR contextGM = AscendC::GetHcclContext<0>();  // Obtain the HCCL context on the kernel of the Ascend C custom operator.
    if (AscendC::g_coreType == AIV) {  // Specify AIV Cores for communication.
        hccl.InitV2(contextGM, &tilingData);
        auto ret = hccl.SetCcTilingV2(offsetof(ReduceScatterCustomTilingData, reduceScatterCcTiling));
        if (ret != HCCL_SUCCESS) {
          return;
        }
        // Process two first blocks.
        constexpr uint32_t tileRepeat = tileNum; 
        // Except the input parameters sendBuf and recvBuf, other parameters for processing the two head blocks are the same. Therefore, repeat is set to 2, and sendBuf and recvBuf of the ReduceScatter task of the second head block are updated in the API.
        HcclHandle handleId1 = hccl.ReduceScatter<true>(sendBuf, recvBuf, tileLen, HcclDataType::HCCL_DATA_TYPE_FP16, reduceOp, strideCount, tileRepeat); 
        // Process one tail block.
        constexpr uint32_t kSizeOfFloat16 = 2U;
        sendBuf += tileLen * tileNum * kSizeOfFloat16;
        recvBuf += tileLen * tileNum * kSizeOfFloat16;
        constexpr uint32_t tailRepeat = tailNum; 
        HcclHandle handleId2 = hccl.ReduceScatter<true>(sendBuf, recvBuf, tailLen, HcclDataType::HCCL_DATA_TYPE_FP16, reduceOp, strideCount, tailRepeat);
        
        for (uint8_t i=0; i<tileRepeat; i++) {
            hccl.Wait(handleId1);
        }
        hccl.Wait(handleId2);  
        AscendC::SyncAll<true>();  // All AIV Cores are synchronized to prevent too fast execution on core 0. Calling the hccl.Finalize() API prematurely can cause suspension of other cores during the Wait operation.
        hccl.Finalize();
    }
}

            

          

        
       

Parent topic: HCCL Kernel APIs