HcclAllReduce
Applicability
| Product | Supported |
|---|---|
| | √ |
| | √ |
| | ☓ |
| | √ |
| | √ |
Description
Adds the input data of all nodes in the communicator (or performs other reduction operations) and sends the result to the output buffer of all nodes. The reduction operation type is specified by the op parameter.
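The semantics described above can be modeled on the host without any device or HCCL header. The following is a minimal sketch, not the HCCL implementation: `RefReduceOp`, `combine`, and `referenceAllReduce` are hypothetical names introduced only for illustration, and the data type is fixed to `int` for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the reduction types named on this page.
// This is NOT HCCL's HcclReduceOp.
enum class RefReduceOp { Sum, Prod, Max, Min };

// Combine two values according to the reduction type.
inline int combine(int a, int b, RefReduceOp op) {
    switch (op) {
        case RefReduceOp::Sum:  return a + b;
        case RefReduceOp::Prod: return a * b;
        case RefReduceOp::Max:  return std::max(a, b);
        case RefReduceOp::Min:  return std::min(a, b);
    }
    return a;  // Not reached for valid enum values.
}

// Host-side model: inputs[r] plays the role of rank r's send buffer.
// Returns one receive buffer per rank; every rank gets the same result.
inline std::vector<std::vector<int>> referenceAllReduce(
        const std::vector<std::vector<int>>& inputs, RefReduceOp op) {
    assert(!inputs.empty());
    const std::size_t count = inputs[0].size();
    std::vector<int> reduced = inputs[0];
    for (std::size_t r = 1; r < inputs.size(); ++r) {
        assert(inputs[r].size() == count);  // All ranks use the same count.
        for (std::size_t i = 0; i < count; ++i) {
            reduced[i] = combine(reduced[i], inputs[r][i], op);
        }
    }
    return std::vector<std::vector<int>>(inputs.size(), reduced);
}
```

With three ranks holding {1, 2}, {3, 4}, and {5, 6}, a sum reduction in this model leaves {9, 12} in every rank's output.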

Prototype
    HcclResult HcclAllReduce(void *sendBuf, void *recvBuf, uint64_t count, HcclDataType dataType, HcclReduceOp op, HcclComm comm, aclrtStream stream)
Parameters
| Parameter | Input/Output | Description |
|---|---|---|
| sendBuf | Input | Address of the send buffer. |
| recvBuf | Output | Address of the buffer that receives the collective communication result. |
| count | Input | Number of data elements that take part in the AllReduce operation. For example, if only one int32 element is involved, count=1. |
| dataType | Input | Data type of the elements to reduce, given as an HcclDataType value. Atlas 300I Duo inference card: the supported data types are int8, int16, int32, float16, and float32. |
| op | Input | Reduction operation type. The following operation types are currently supported: sum, prod, max, and min. NOTE: On the Atlas 300I Duo inference card, the prod, max, and min operations do not support the int16 data type. |
| comm | Input | Communicator in which the operation is performed. |
| stream | Input | Stream of the rank. |
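Since count is a number of elements rather than bytes, the size of each buffer depends on dataType. The sketch below shows one way to compute that size and to pre-check the Atlas 300I Duo restriction from the NOTE before issuing the call. `DemoType`, `DemoOp`, and the three helpers are hypothetical names for illustration; they are not part of HCCL, and real code would use HcclDataType and HcclReduceOp instead.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical enums for illustration only (not HcclDataType / HcclReduceOp).
// They list the types this page names for the Atlas 300I Duo inference card.
enum class DemoType { Int8, Int16, Int32, Float16, Float32 };
enum class DemoOp { Sum, Prod, Max, Min };

// Size in bytes of one element of the given type.
inline std::size_t elementSize(DemoType t) {
    switch (t) {
        case DemoType::Int8:    return 1;
        case DemoType::Int16:
        case DemoType::Float16: return 2;
        case DemoType::Int32:
        case DemoType::Float32: return 4;
    }
    assert(false && "unknown DemoType");
    return 0;
}

// Bytes needed for each of sendBuf and recvBuf: count elements of type t.
inline std::size_t bufferBytes(std::size_t count, DemoType t) {
    return count * elementSize(t);
}

// Per the NOTE above: on Atlas 300I Duo, prod, max, and min do not support int16.
inline bool isSupportedOn300IDuo(DemoType t, DemoOp op) {
    return !(t == DemoType::Int16 && op != DemoOp::Sum);
}
```

For instance, `bufferBytes(8, DemoType::Float32)` evaluates to 32, which is the same size as `count * sizeof(float)` for a count of 8.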
Returns
HcclResult: HCCL_SUCCESS on success; any other value indicates failure.
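Because any value other than the success code signals failure, callers usually check every call. One common pattern is a small macro that returns early on the first failure. The sketch below uses stand-in names (`DemoResult`, `DEMO_SUCCESS`, `DEMO_FAILURE`, `DEMO_CHECK`, `fakeCollective`, `runSteps`) so that it compiles without HCCL headers; none of them are HCCL identifiers. In real code, HcclResult and HCCL_SUCCESS would presumably take the place of the stand-in type and value.

```cpp
#include <cassert>
#include <cstdio>

// Stand-in result type so the sketch compiles without HCCL headers.
enum DemoResult { DEMO_SUCCESS = 0, DEMO_FAILURE = 1 };

// Evaluate the call once; on any non-success value, log the failing
// expression and return that value from the enclosing function.
#define DEMO_CHECK(call)                                                     \
    do {                                                                     \
        DemoResult ret_ = (call);                                            \
        if (ret_ != DEMO_SUCCESS) {                                          \
            std::fprintf(stderr, "%s failed with code %d (%s:%d)\n", #call,  \
                         static_cast<int>(ret_), __FILE__, __LINE__);        \
            return ret_;                                                     \
        }                                                                    \
    } while (0)

// Hypothetical call that succeeds or fails on demand.
inline DemoResult fakeCollective(bool ok) {
    return ok ? DEMO_SUCCESS : DEMO_FAILURE;
}

// Runs two checked steps and records how many completed.
inline DemoResult runSteps(bool firstOk, bool secondOk, int* stepsDone) {
    assert(stepsDone != nullptr);
    *stepsDone = 0;
    DEMO_CHECK(fakeCollective(firstOk));
    ++*stepsDone;
    DEMO_CHECK(fakeCollective(secondOk));
    ++*stepsDone;
    return DEMO_SUCCESS;
}
```

The early return keeps later steps from running on a communicator or stream that is already in a failed state; resource cleanup on the error path is left out of this sketch.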
Constraints
- All ranks must have the same count, dataType, and op.
- Each rank has only one input.
- The input and output addresses (sendBuf and recvBuf) of the operator must meet the following alignment requirements, depending on the data type:
  - int8: 1-byte aligned
  - int16, float16, and bfp16: 2-byte aligned
  - int32 and float32: 4-byte aligned
  - int64: 8-byte aligned
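The alignment rules above can be checked before the call. For every type listed, the required alignment equals the element size, so a single helper covers all cases. This is a minimal sketch; `isAligned` and `buffersAligned` are hypothetical helper names, not HCCL functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if ptr is aligned to `alignment` bytes.
// `alignment` must be a nonzero power of two (1, 2, 4, or 8 here).
inline bool isAligned(const void* ptr, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}

// Both buffers must satisfy the alignment of the element type,
// which for the types listed above equals the element size in bytes.
inline bool buffersAligned(const void* sendBuf, const void* recvBuf,
                           std::size_t elemSize) {
    return isAligned(sendBuf, elemSize) && isAligned(recvBuf, elemSize);
}
```

A caller could run `buffersAligned(sendBuf, recvBuf, sizeof(float))` ahead of a float32 AllReduce and report a parameter error itself if the check fails.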
Example
    // Allocate device memory for collective communication.
    void *sendBuf = nullptr;
    void *recvBuf = nullptr;
    uint64_t count = 8;
    size_t mallocSize = count * sizeof(float);
    aclrtMalloc((void **)&sendBuf, mallocSize, ACL_MEM_MALLOC_HUGE_ONLY);
    aclrtMalloc((void **)&recvBuf, mallocSize, ACL_MEM_MALLOC_HUGE_ONLY);

    // Initialize the communicator.
    // rootInfo and deviceId are not defined in this excerpt; they are assumed
    // to have been set up earlier in the program.
    uint32_t rankSize = 8;
    HcclComm hcclComm;
    HcclCommInitRootInfo(rankSize, &rootInfo, deviceId, &hcclComm);

    // Create a stream.
    aclrtStream stream;
    aclrtCreateStream(&stream);

    // Execute AllReduce: sum the input data of all ranks in the communicator
    // and write the result to the output buffer of every rank.
    HcclAllReduce(sendBuf, recvBuf, count, HCCL_DATA_TYPE_FP32, HCCL_REDUCE_SUM, hcclComm, stream);

    // Wait until the collective communication task on the stream is complete.
    aclrtSynchronizeStream(stream);

    // Free resources.
    aclrtFree(sendBuf);          // Free the device memory.
    aclrtFree(recvBuf);          // Free the device memory.
    aclrtDestroyStream(stream);  // Destroy the stream.
    HcclCommDestroy(hcclComm);   // Destroy the communicator.