HCCL_ALGO
Description
Configures the communication algorithm between servers. The algorithm can be configured globally or by operator.
HCCL provides adaptive algorithm selection: by default, it chooses an appropriate algorithm based on the product form, data size, and number of servers, so no manual configuration is required. Setting this environment variable to specify the inter-server communication algorithm disables adaptive algorithm selection.
- The global configuration method is as follows:
export HCCL_ALGO="level0:NA;level1:<algo>"
level0 indicates the intra-server communication algorithm and can only be set to NA; level1 indicates the inter-server communication algorithm and can be set to the following values:
- ring: a communication algorithm based on the ring topology. It requires a large number of communication steps (linear complexity) and has relatively high latency, but its communication pattern is simple and it is less affected by network congestion. This algorithm applies when the communicator contains only a few servers, the communication data is small, network congestion is obvious, and the pipeline algorithm is not applicable.
- H-D_R: Recursive Halving-Doubling (RHD) algorithm. It requires a small number of communication steps (logarithmic complexity) and has relatively low latency; however, extra traffic is introduced when the number of nodes is not a power of 2. This algorithm applies when the number of servers in the communicator is an integer power of 2 and the pipeline algorithm is not applicable, or when the number of servers is not an integer power of 2 but the communication data is small.
- NHR: Nonuniform Hierarchical Ring (NHR) algorithm. It requires a small number of communication steps (logarithmic complexity) and has relatively low latency. This algorithm applies when the communicator contains many servers and the pipeline algorithm is not applicable.
- NHR_V1: an earlier version of the NHR algorithm. It requires a small number of communication steps (square-root complexity) and has relatively low latency. This algorithm applies when the number of servers in the communicator is not an integer power of 2 and the pipeline algorithm is not applicable. Theoretically, NHR_V1 performs worse than the newer NHR algorithm; this option will be deprecated in the future, and developers are advised to use NHR instead.
- NB: Nonuniform Bruck (NB) algorithm. It requires a small number of communication steps (logarithmic complexity) and has relatively low latency. This algorithm applies when the communicator contains many servers and the pipeline algorithm is not applicable.
For details about the communication operators, data types, reduction types, network running modes, and product models supported by each communication algorithm, see Supported Algorithms for Communication Between Servers.
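As a rough illustration of the complexity difference noted above (linear for ring versus logarithmic for H-D_R), the step counts can be sketched as follows. The formulas are textbook approximations for illustration only, not HCCL internals:

```shell
# Illustrative step counts for a collective across p servers (assumed formulas):
#   ring:  p - 1 steps (linear complexity)
#   RHD:   ceil(log2(p)) steps (logarithmic complexity)
p=16
ring_steps=$((p - 1))
rhd_steps=0
n=$p
while [ "$n" -gt 1 ]; do
  n=$(( (n + 1) / 2 ))      # ceiling halving
  rhd_steps=$((rhd_steps + 1))
done
echo "servers=$p ring=$ring_steps rhd=$rhd_steps"
```

With 16 servers, ring needs 15 steps while RHD needs only 4, which is why the logarithmic algorithms are preferred as the server count grows.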
- Configure the communication algorithm based on the operator type as follows:
export HCCL_ALGO="<op0>=level0:NA;level1:<algo0>/<op1>=level0:NA;level1:<algo1>"
where:
- <op> specifies the type of a communication operator. The value can be allgather, reducescatter, allreduce, broadcast, reduce, scatter, or alltoall.
- <algo> specifies the inter-server communication algorithm used by the specified communication operator. The supported configuration is the same as that of level1 in the global configuration method. Ensure that the specified communication algorithm is supported by the communication operator. For details about the communication operators supported by each algorithm, see Table 1. If no communication algorithm is specified for a communication operator, the system automatically selects a communication algorithm based on the product form, number of nodes, and data size.
- Use slashes (/) to separate the configurations of multiple operators.
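For instance, a slash-separated configuration for two operators might look like this (the choice of allreduce with ring and allgather with NHR is purely illustrative):

```shell
# allreduce uses the ring algorithm, allgather uses NHR;
# all other operators keep adaptive algorithm selection.
export HCCL_ALGO="allreduce=level0:NA;level1:ring/allgather=level0:NA;level1:NHR"
echo "$HCCL_ALGO"
```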
- For the Atlas Training Series Product, if the number of servers in a communicator is not an integer power of 2, the ring algorithm is used by default. In other scenarios, the H-D_R algorithm is used by default.
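The power-of-two condition above can be checked with a small shell helper. This is only a sketch of the selection rule; HCCL performs this check internally:

```shell
# Returns 0 if $1 is a positive integer power of 2 (1, 2, 4, 8, ...).
is_pow2() {
  n=$1
  [ "$n" -ge 1 ] || return 1
  while [ $((n % 2)) -eq 0 ]; do n=$((n / 2)); done
  [ "$n" -eq 1 ]
}

servers=6
if is_pow2 "$servers"; then
  echo "default: H-D_R"
else
  echo "default: ring"   # 6 is not a power of 2
fi
```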
Example
- Global configuration
export HCCL_ALGO="level0:NA;level1:H-D_R"
- Configuration by operator
# The AllReduce operator uses the ring algorithm. Other operators automatically select a communication algorithm based on the product form, number of nodes, and data size.
export HCCL_ALGO="allreduce=level0:NA;level1:ring"
Supported Algorithms for Communication Between Servers
Table 1 Supported algorithms for communication between servers

| Algorithm Type | Collective Communication Operator | Data Type | Reduction Type | Network Running Mode | Deterministic Computing | Supported Product | Unsupported Operator Processing Method |
|---|---|---|---|---|---|---|---|
| ring | ReduceScatter, AllGather, AllReduce, Reduce, Scatter | int8, int16, int32, int64, float16, float32, bfp16 | sum, prod, max, min | Single-operator/Graph mode | Supported | | The H-D_R algorithm is automatically selected. |
| H-D_R | ReduceScatter, AllGather, AllReduce, Broadcast, Reduce | int8, int16, int32, int64, float16, float32, bfp16 | sum, prod, max, min | Single-operator/Graph mode | Supported | | The ring algorithm is automatically selected. |
| NHR | ReduceScatter, AllGather, AllReduce, Broadcast, Scatter | int8, int16, int32, int64, float16, float32, bfp16 | sum, prod, max, min | Single-operator/Graph mode | Supported | | The H-D_R or ring algorithm is automatically selected. |
| NHR_V1 | ReduceScatter, AllGather, AllReduce, Broadcast | int8, int16, int32, int64, float16, float32, bfp16 | sum, prod, max, min | Single-operator/Graph mode | Supported | | The H-D_R or ring algorithm is automatically selected. |
| NB | ReduceScatter, AllGather, AllReduce, Broadcast, Scatter | int8, int16, int32, int64, float16, float32, bfp16 | sum, prod, max, min | Single-operator/Graph mode | Supported | | The H-D_R or ring algorithm is automatically selected. |
Restrictions
In the current version, the intra-server communication algorithm can only be set to NA.