HCCL_ALGO
Description
Configures the communication algorithms between servers and supernodes. The algorithms can be configured globally or by operator.
This section only briefly describes the algorithm functions. For details, see Introduction to Collective Communication Algorithms.
- HCCL provides adaptive algorithm selection: by default, it selects an appropriate algorithm based on the product form, data size, and number of servers, so no manual configuration is required. Setting this environment variable to specify the inter-server or inter-supernode communication algorithm disables adaptive algorithm selection.
- For some communication operators, when a specific type of AI processor is used and the data size is small, HCCL still selects the communication algorithm adaptively and this environment variable has no effect.
- The global configuration method is as follows:
export HCCL_ALGO="level0:NA;level1:<algo>;level2:<algo>"
- level0 indicates the intra-server communication algorithm. Currently, this value can only be set to NA.
- level1 indicates the inter-server communication algorithm. This value can be set to the following:
- ring: a communication algorithm based on the ring topology. It features a large number of communication steps (linear complexity) and relatively high latency. However, it has a simple communication relationship and is less affected by network congestion. This algorithm applies to scenarios where the communicator contains only a few servers, the communication data size is small, network congestion is evident, and the pipeline algorithm is not applicable.
- H-D_R: Recursive Halving-Doubling (RHD) algorithm. It features a small number of communication steps (logarithmic complexity) and relatively low latency. However, extra traffic is introduced if the number of nodes is not an integral power of 2. This algorithm applies to scenarios where the number of servers in the communicator is an integral power of 2 and the pipeline algorithm is not applicable, or the number of servers is not an integral power of 2 but the communication data size is small.
- NHR: Nonuniform Hierarchical Ring (NHR) algorithm. It features a small number of communication steps (logarithmic complexity) and relatively low latency. This algorithm applies to scenarios where there are lots of servers in the communicator and the pipeline algorithm is not applicable.
- NHR_V1: an earlier version of the NHR algorithm. It features a small number of communication steps (square-root complexity) and relatively low latency. This algorithm applies to scenarios where the number of servers in the communicator is not an integral power of 2 and the pipeline algorithm is not applicable. In theory, NHR_V1 performs worse than the new NHR algorithm. This configuration option will be deprecated in a future version, and developers are advised to use the NHR algorithm instead.
- NB: Nonuniform Bruck (NB) algorithm. It features a small number of communication steps (logarithmic complexity) and relatively low latency. This algorithm applies to scenarios where there are lots of servers in the communicator and the pipeline algorithm is not applicable.
- AHC: Asymmetric Hierarchical Concatenate (AHC) algorithm. This algorithm applies to scenarios where NPUs are distributed symmetrically or asymmetrically (that is, with unequal numbers of devices) across multiple layers in the communicator. Its relative benefit is greater when bandwidth convergence exists between layers in the communicator.
Note: When level1 (inter-server communication algorithm) is set to AHC, level2 (inter-supernode communication algorithm) automatically uses the AHC algorithm without additional configuration. Even if level2 is set to another algorithm, the setting does not take effect.
- pipeline: pipeline parallel algorithm. It can concurrently use the intra-server and inter-server links. This algorithm applies to scenarios where the communication data size is large and each server in the communicator contains multiple devices.
- pairwise: pairwise communication algorithm. It is used only for the AlltoAll, AlltoAllV, and AlltoAllVC operators. It features a large number of communication steps (linear complexity) and relatively high latency, and requires additional memory allocation in direct proportion to the data size. However, it avoids one-to-many traffic on the network. This algorithm applies to scenarios where the communication data size is large and one-to-many network traffic needs to be avoided.
For details about the communication operators, data types, network running modes, and products supported by each inter-server communication algorithm, see Supported Algorithms for Communication Between Servers.
If level1 is not set:
- For the Atlas A3 training products/Atlas A3 inference products, the algorithm is automatically selected based on the product form, number of nodes, and data size.
- For the Atlas A2 training products/Atlas A2 inference products, the algorithm is automatically selected based on the product form, number of nodes, and data size.
- For the Atlas training products, if the number of servers in a communicator is not an integral power of 2, the ring algorithm is used by default. In other scenarios, the H-D_R algorithm is used by default.
- level2 indicates the inter-supernode communication algorithm. This value can be set to the following:
- ring: a communication algorithm based on the ring topology. It features a large number of communication steps (linear complexity) and relatively high latency. However, it has a simple communication relationship and is less affected by network congestion. This algorithm applies to scenarios where the number of supernodes in the communicator is small and is not an integral power of 2.
- H-D_R: Recursive Halving-Doubling (RHD) algorithm. It features a small number of communication steps (logarithmic complexity) and relatively low latency. However, extra traffic is introduced if the number of nodes is not an integral power of 2. This algorithm applies to scenarios where the number of supernodes in the communicator is an integral power of 2, or the number of supernodes is not an integral power of 2 but the communication data size is small.
- NHR: Nonuniform Hierarchical Ring (NHR) algorithm. It features a small number of communication steps (logarithmic complexity) and relatively low latency. This algorithm applies to scenarios where the number of supernodes in the communicator is large.
- NB: Nonuniform Bruck (NB) algorithm. It features a small number of communication steps (logarithmic complexity) and relatively low latency. This algorithm applies to scenarios where the number of supernodes in the communicator is large.
- pipeline: pipeline parallel algorithm. It can concurrently use the intra-supernode and inter-supernode links. This algorithm applies to scenarios where the communication data size is large and each supernode in the communicator contains multiple devices.
For details about the communication operators, data types, and network running modes supported by each inter-supernode communication algorithm, see Supported Algorithms for Communication Between Supernodes.
If level2 is not set, the ring algorithm is used when the number of supernodes in the communicator is less than 8 and is not an integral power of 2. In other scenarios, the H-D_R algorithm is used.
Currently, the level2 configuration applies only to the following scenarios:
- It is supported only by the Atlas A3 training products/Atlas A3 inference products.
- It is supported only when the orchestration expansion position of the communication algorithm is AI_CPU. The orchestration expansion position of the communication algorithm can be configured using the environment variable HCCL_OP_EXPANSION_MODE.
- Configure the communication algorithm based on the operator type as follows:
export HCCL_ALGO="<op0>=level0:NA;level1:<algo0>;level2:<algo1>/<op1>=level0:NA;level1:<algo3>;level2:<algo4>"
Where:
- <op> indicates the type of the communication operator. The options are as follows:
- allgather: corresponds to the communication operators AllGather and AllGatherV.
- reducescatter: corresponds to the communication operators ReduceScatter and ReduceScatterV.
- allreduce: corresponds to the communication operator AllReduce.
- broadcast: corresponds to the communication operator Broadcast.
- reduce: corresponds to the communication operator Reduce.
- scatter: corresponds to the communication operator Scatter.
- alltoall: corresponds to the communication operators AlltoAll, AlltoAllV, and AlltoAllVC.
- <algo> specifies the communication algorithm used by the specified communication operator. The supported configuration is the same as that of level1 and level2 in the global configuration method. Ensure that the specified communication algorithm is supported by the communication operator. For details about the communication operators supported by each algorithm, see Supported Algorithms for Communication Between Servers and Supported Algorithms for Communication Between Supernodes. If no communication algorithm is specified for a communication operator, the system automatically selects a communication algorithm based on the product form, number of nodes, and data size.
- Use slashes (/) to separate the configurations of multiple operators.
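The documented fallback rules for an unset level1 (on Atlas training products) and an unset level2 both depend on whether a count is an integral power of 2. The following shell sketch only mirrors those two stated rules so that they can be checked by hand. The function names are hypothetical and it is not HCCL's internal implementation. Adaptive selection on the Atlas A2 and Atlas A3 products also depends on the product form and data size, which this sketch does not model.

```shell
# Returns success (0) when $1 is a positive integral power of 2.
is_power_of_two() {
    n=$1
    [ "$n" -ge 1 ] || return 1
    while [ $((n % 2)) -eq 0 ]; do
        n=$((n / 2))
    done
    [ "$n" -eq 1 ]
}

# Documented level1 default on Atlas training products (hypothetical helper).
# $1: number of servers in the communicator.
default_level1_atlas_training() {
    if is_power_of_two "$1"; then
        echo "H-D_R"
    else
        echo "ring"
    fi
}

# Documented level2 default (hypothetical helper).
# $1: number of supernodes in the communicator.
default_level2() {
    if [ "$1" -lt 8 ] && ! is_power_of_two "$1"; then
        echo "ring"
    else
        echo "H-D_R"
    fi
}

default_level1_atlas_training 6   # prints: ring
default_level1_atlas_training 8   # prints: H-D_R
default_level2 6                  # prints: ring
default_level2 12                 # prints: H-D_R
```

Exporting HCCL_ALGO with an explicit level1 or level2 value replaces these defaults.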
Example
- Global configuration
export HCCL_ALGO="level0:NA;level1:H-D_R"
- Configuration by operator
# The AllReduce operator uses the ring algorithm and the AllGather operator uses the RHD algorithm. Other operators automatically select a communication algorithm based on the product form, number of nodes, and data size.
export HCCL_ALGO="allreduce=level0:NA;level1:ring/allgather=level0:NA;level1:H-D_R"
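When several operators are configured, the per-operator string becomes long and easy to mistype. The following sketch assembles it from one helper call per operator. The helper name hccl_op_entry is hypothetical and not part of HCCL. It only concatenates the documented `<op>=level0:NA;level1:<algo>;level2:<algo>` fields and does not check whether the operator supports the chosen algorithm.

```shell
# Hypothetical helper: build one per-operator entry.
# $1: operator type (for example, allreduce)
# $2: level1 (inter-server) algorithm
# $3: optional level2 (inter-supernode) algorithm
hccl_op_entry() {
    entry="$1=level0:NA;level1:$2"
    if [ -n "${3:-}" ]; then
        entry="$entry;level2:$3"
    fi
    printf '%s' "$entry"
}

# Entries for different operators are joined with a slash (/).
HCCL_ALGO="$(hccl_op_entry allreduce ring)/$(hccl_op_entry allgather H-D_R)"
export HCCL_ALGO
echo "$HCCL_ALGO"
```

The resulting value is identical to the per-operator example above. Operators that are not listed continue to use automatic selection.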
Restrictions
- In the current version, the intra-server communication algorithm can only be set to NA.
- For Atlas A2 training products/Atlas A2 inference products, you are advised not to configure the HCCL_ALGO environment variable in the order-preserving scenario of strict deterministic computing.
- If you call the HCCL C APIs to initialize a communicator with specific configurations and specify the communication algorithm using the hcclAlgo parameter of HcclCommConfig, the configuration of the communicator takes precedence over this environment variable.
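Some of the constraints in this section can be checked before a job is launched. The following preflight sketch is a hypothetical helper, not an HCCL tool. It accepts both the global and the per-operator syntax and checks that level0 is NA, that each algorithm name is one listed in this section for its level, and that level2 is set only when HCCL_OP_EXPANSION_MODE is AI_CPU. It assumes algorithm names are matched exactly as spelled here. It cannot verify the product type or whether an operator supports the chosen algorithm.

```shell
# Hypothetical preflight check for an HCCL_ALGO value.
# $1: HCCL_ALGO value
# $2: HCCL_OP_EXPANSION_MODE value (may be empty)
# Prints one line per problem and returns non-zero if any problem is found.
check_hccl_algo() {
    status=0
    mode="${2:-}"
    # Split operators on "/", drop any "<op>=" prefix, then split fields on ";".
    fields=$(printf '%s\n' "$1" | tr '/' '\n' | sed 's/^[^=;]*=//' | tr ';' '\n')
    for field in $fields; do
        case "$field" in
            level0:NA) ;;
            level0:*)
                echo "level0 can only be NA: $field"; status=1 ;;
            level1:ring|level1:H-D_R|level1:NHR|level1:NHR_V1|level1:NB|level1:AHC|level1:pipeline|level1:pairwise) ;;
            level1:*)
                echo "unknown level1 algorithm: $field"; status=1 ;;
            level2:ring|level2:H-D_R|level2:NHR|level2:NB|level2:pipeline)
                if [ "$mode" != "AI_CPU" ]; then
                    echo "level2 requires HCCL_OP_EXPANSION_MODE=AI_CPU: $field"
                    status=1
                fi ;;
            level2:*)
                echo "unknown level2 algorithm: $field"; status=1 ;;
            *)
                echo "unrecognized field: $field"; status=1 ;;
        esac
    done
    return "$status"
}

check_hccl_algo "level0:NA;level1:NHR;level2:NB" "AI_CPU" && echo "configuration looks consistent"
```

A typical use is to call `check_hccl_algo "$HCCL_ALGO" "$HCCL_OP_EXPANSION_MODE"` in a launch script and stop if it returns non-zero.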
Supported Algorithms for Communication Between Servers
| Algorithm Type | Collective Communication Operator | Data Type | Network Running Mode | Deterministic Computing | Supported Products | Unsupported Operator Processing Method |
|---|---|---|---|---|---|---|
| ring | ReduceScatter, AllGather, AllReduce, and Reduce | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The NHR or H-D_R algorithm is automatically selected. |
| ring | ReduceScatterV | int8, int16, int32, int64 (supported only in single-operator mode on the |  | Supported |  | The NHR or H-D_R algorithm is automatically selected. |
| ring | Scatter | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The NHR or H-D_R algorithm is automatically selected. |
| ring | AllGatherV | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The NHR or H-D_R algorithm is automatically selected. |
| H-D_R | ReduceScatter, AllGather, AllReduce, Broadcast, and Reduce | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The NHR or ring algorithm is automatically selected. |
| NHR | ReduceScatter, AllGather, AllReduce, and Broadcast | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The H-D_R or ring algorithm is automatically selected. |
| NHR | ReduceScatterV | int8, int16, int32, int64 (supported only in single-operator mode on the |  | Supported |  | The H-D_R or ring algorithm is automatically selected. |
| NHR | Scatter | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The H-D_R or ring algorithm is automatically selected. |
| NHR | AllGatherV | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The H-D_R or ring algorithm is automatically selected. |
| NHR_V1 | ReduceScatter, AllGather, AllReduce, and Broadcast | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The NHR, H-D_R, or ring algorithm is automatically selected. |
| NB | ReduceScatter, AllGather, AllReduce, and Broadcast | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The NHR, H-D_R, or ring algorithm is automatically selected. |
| NB | ReduceScatterV | int8, int16, int32, int64 (supported only in single-operator mode on the |  | Supported |  | The NHR, H-D_R, or ring algorithm is automatically selected. |
| NB | AllGatherV | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The NHR, H-D_R, or ring algorithm is automatically selected. |
| NB | Scatter | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The NHR, H-D_R, or ring algorithm is automatically selected. |
| AHC | ReduceScatter, AllGather, and AllReduce | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The NHR, H-D_R, or ring algorithm is automatically selected. |
| pipeline | AllReduce | int8, int16, int32, float16, float32, and bfp16 |  | Not supported |  | The NHR, H-D_R, or ring algorithm is automatically selected. |
| pipeline | AllGather | int8, int16, int32, int64, float16, float32, and bfp16 |  | - |  | The NHR, H-D_R, or ring algorithm is automatically selected. |
| pipeline | AllGatherV | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The NHR, H-D_R, or ring algorithm is automatically selected. |
| pipeline | ReduceScatter | int8, int16, int32, float16, float32, and bfp16 |  | Not supported |  | The NHR, H-D_R, or ring algorithm is automatically selected. |
| pipeline | AlltoAll | int8, int16, int32, int64, float16, float32, and bfp16 |  | - |  | The pairwise algorithm is automatically selected. |
| pipeline | AlltoAllV | int8, int16, int32, int64, float16, float32, and bfp16 |  | - |  | The pairwise algorithm is automatically selected. |
| pipeline | AlltoAllVC | int8, int16, int32, int64, float16, float32, and bfp16 | Dynamic shape scenario of the graph mode (Ascend IR) | - |  | The pairwise algorithm is automatically selected. |
| pairwise | AlltoAll, AlltoAllV, and AlltoAllVC | int8, int16, int32, int64, float16, float32, and bfp16 |  | - |  | The NHR, H-D_R, or ring algorithm is automatically selected. |
Supported Algorithms for Communication Between Supernodes
| Algorithm Type | Collective Communication Operator | Data Type | Network Running Mode | Deterministic Computing | Supported Products | Unsupported Operator Processing Method |
|---|---|---|---|---|---|---|
| ring | ReduceScatter, AllGather, AllReduce, and Reduce | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The NHR or H-D_R algorithm is automatically selected. |
| ring | ReduceScatterV | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The NHR or H-D_R algorithm is automatically selected. |
| ring | Scatter and AllGatherV | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The NHR or H-D_R algorithm is automatically selected. |
| H-D_R | AllReduce, Broadcast, and Reduce | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The NHR or ring algorithm is automatically selected. |
| NHR | ReduceScatter, AllGather, AllReduce, and Broadcast | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The H-D_R or ring algorithm is automatically selected. |
| NHR | ReduceScatterV | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The H-D_R or ring algorithm is automatically selected. |
| NHR | Scatter and AllGatherV | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The H-D_R or ring algorithm is automatically selected. |
| NB | ReduceScatter, AllGather, AllReduce, and Broadcast | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The NHR, H-D_R, or ring algorithm is automatically selected. |
| NB | ReduceScatterV | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The NHR, H-D_R, or ring algorithm is automatically selected. |
| NB | Scatter and AllGatherV | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The NHR, H-D_R, or ring algorithm is automatically selected. |
| pipeline | AllGather | int8, int16, int32, int64, float16, float32, and bfp16 |  | Supported |  | The NHR, H-D_R, or ring algorithm is automatically selected. |
| pipeline | ReduceScatter | int8, int16, int32, float16, float32, and bfp16 |  | Supported |  | The NHR, H-D_R, or ring algorithm is automatically selected. |