HCCL_ALGO

Description

Configures the communication algorithm between servers. The algorithm can be configured globally or by operator.

HCCL provides adaptive algorithm selection: by default, it selects an appropriate algorithm based on the product form, data size, and number of servers, so no manual configuration is required. Setting this environment variable to specify the inter-server communication algorithm disables adaptive algorithm selection.

  • The global configuration method is as follows:

    export HCCL_ALGO="level0:NA;level1:<algo>"

    level0 indicates the intra-server communication algorithm and can only be set to NA; level1 indicates the inter-server communication algorithm and can be set to the following values:
    • ring: a communication algorithm based on the ring topology. It requires a large number of communication steps (linear complexity) and has relatively high latency, but its communication pattern is simple and it is less affected by network congestion. This algorithm applies when the communicator contains only a few servers, the communication data is small, or network congestion is significant, and the pipeline algorithm is not applicable.
    • H-D_R: Recursive Halving-Doubling (RHD) algorithm. It requires a small number of communication steps (logarithmic complexity) and has relatively low latency, but introduces extra traffic when the number of nodes is not a power of 2. This algorithm applies when the number of servers in the communicator is an integer power of 2 and the pipeline algorithm is not applicable, or when the number of servers is not an integer power of 2 but the communication data is small.
    • NHR: Nonuniform Hierarchical Ring (NHR) algorithm. It requires a small number of communication steps (logarithmic complexity) and has relatively low latency. This algorithm applies when the communicator contains many servers and the pipeline algorithm is not applicable.
    • NHR_V1: an earlier version of the NHR algorithm. It requires a small number of communication steps (square-root complexity) and has relatively low latency. This algorithm applies when the number of servers in the communicator is not an integer power of 2 and the pipeline algorithm is not applicable. Theoretically, NHR_V1 performs worse than the newer NHR algorithm. This configuration option will be deprecated in the future; developers are advised to use the NHR algorithm instead.
    • NB: Nonuniform Bruck (NB) algorithm. It requires a small number of communication steps (logarithmic complexity) and has relatively low latency. This algorithm applies when the communicator contains many servers and the pipeline algorithm is not applicable.

    For details about the communication operators, data types, reduction types, network running modes, and product models supported by each communication algorithm, see Supported Algorithms for Communication Between Servers.

  • The per-operator configuration method is as follows:

    export HCCL_ALGO="<op0>=level0:NA;level1:<algo0>/<op1>=level0:NA;level1:<algo1>"

    where
    • <op> specifies the type of a communication operator. The value can be allgather, reducescatter, allreduce, broadcast, reduce, scatter, or alltoall.
    • <algo> specifies the inter-server communication algorithm used by the specified communication operator. The supported configuration is the same as that of level1 in the global configuration method. Ensure that the specified communication algorithm is supported by the communication operator. For details about the communication operators supported by each algorithm, see Table 1. If no communication algorithm is specified for a communication operator, the system automatically selects a communication algorithm based on the product form, number of nodes, and data size.
    • Use slashes (/) to separate the configurations of multiple operators.
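
    The configuration grammar above (colon-separated level/value pairs, semicolon-separated levels, slash-separated operator entries) can be validated before export. The following is a hypothetical helper sketch, not part of HCCL; the function name and error messages are made up for illustration:

    ```python
    # Illustrative validator for HCCL_ALGO strings (not an HCCL API).
    VALID_OPS = {"allgather", "reducescatter", "allreduce",
                 "broadcast", "reduce", "scatter", "alltoall"}
    VALID_LEVEL1 = {"ring", "H-D_R", "NHR", "NHR_V1", "NB"}

    def parse_hccl_algo(value: str) -> dict:
        """Parse an HCCL_ALGO string into {operator_or_None: {"level0": ..., "level1": ...}}.

        Global form:       "level0:NA;level1:<algo>"
        Per-operator form: "<op0>=level0:NA;level1:<algo0>/<op1>=..."
        """
        result = {}
        for entry in value.split("/"):          # one entry per operator
            head, sep, tail = entry.partition("=")
            # Without "=", the entry is a global configuration (key None).
            op, levels = (head.lower(), tail) if sep else (None, head)
            if op is not None and op not in VALID_OPS:
                raise ValueError(f"unknown operator: {op}")
            cfg = {}
            for part in levels.split(";"):      # "level0:NA", "level1:<algo>"
                level, _, algo = part.partition(":")
                cfg[level] = algo
            if cfg.get("level0") != "NA":
                raise ValueError("level0 can only be set to NA in the current version")
            if cfg.get("level1") not in VALID_LEVEL1:
                raise ValueError(f"unknown level1 algorithm: {cfg.get('level1')}")
            result[op] = cfg
        return result
    ```

    For example, `parse_hccl_algo("allreduce=level0:NA;level1:ring/allgather=level0:NA;level1:NHR")` accepts a two-operator configuration, while a typo such as `level1:rng` raises an error before the value is exported.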
When this environment variable is not set:
  • For the Atlas Training Series Product: if the number of servers in the communicator is not an integer power of 2, the ring algorithm is used by default; otherwise, the H-D_R algorithm is used by default.
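
The default rule above can be expressed as a short sketch. The function name is hypothetical and the logic only mirrors the behavior described for the Atlas Training Series Product; the actual adaptive selection also considers product form and data size:

```python
def default_level1_algo(server_count: int) -> str:
    """Default inter-server algorithm when HCCL_ALGO is unset
    (Atlas Training Series Product, per the rule described above):
    H-D_R when the server count is an integer power of 2, ring otherwise."""
    is_power_of_two = server_count > 0 and (server_count & (server_count - 1)) == 0
    return "H-D_R" if is_power_of_two else "ring"
```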

Example

  • Global configuration
    export HCCL_ALGO="level0:NA;level1:H-D_R"
  • Configuration by operator
    # The AllReduce operator uses the ring algorithm. Other operators automatically select a communication algorithm based on the product form, number of nodes, and data size.
    export HCCL_ALGO="allreduce=level0:NA;level1:ring"

Supported Algorithms for Communication Between Servers

Table 1 Supported algorithms for communication between servers

| Algorithm Type | Collective Communication Operator | Data Type | Reduction Type | Network Running Mode | Deterministic Computing | Supported Product | Unsupported Operator Processing Method |
|---|---|---|---|---|---|---|---|
| ring | ReduceScatter, AllGather, AllReduce, Reduce, and Scatter | int8, int16, int32, int64, float16, float32, and bfp16 | sum, prod, max, and min | Single-operator/Graph mode | Supported | Atlas Training Series Product | The H-D_R algorithm is automatically selected. |
| H-D_R | ReduceScatter, AllGather, AllReduce, Broadcast, and Reduce | int8, int16, int32, int64, float16, float32, and bfp16 | sum, prod, max, and min | Single-operator/Graph mode | Supported | Atlas Training Series Product | The ring algorithm is automatically selected. |
| NHR | ReduceScatter, AllGather, AllReduce, Broadcast, and Scatter | int8, int16, int32, int64, float16, float32, and bfp16 | sum, prod, max, and min | Single-operator/Graph mode | Supported | Atlas Training Series Product | The H-D_R or ring algorithm is automatically selected. |
| NHR_V1 | ReduceScatter, AllGather, AllReduce, and Broadcast | int8, int16, int32, int64, float16, float32, and bfp16 | sum, prod, max, and min | Single-operator/Graph mode | Supported | Atlas Training Series Product | The H-D_R or ring algorithm is automatically selected. |
| NB | ReduceScatter, AllGather, AllReduce, Broadcast, and Scatter | int8, int16, int32, int64, float16, float32, and bfp16 | sum, prod, max, and min | Single-operator/Graph mode | Supported | Atlas Training Series Product | The H-D_R or ring algorithm is automatically selected. |

Restrictions

In the current version, the intra-server communication algorithm can only be set to NA.

Applicability

Atlas Training Series Product