HCCL_ALGO

Description

Configures the communication algorithms between servers and supernodes. The algorithms can be configured globally or by operator.

This section only briefly describes the algorithm functions. For details, see Introduction to Collective Communication Algorithms.

HCCL provides adaptive algorithm selection: by default, it selects an appropriate algorithm based on the product form, data size, and number of servers, so no manual configuration is required. Setting this environment variable to specify the inter-server or inter-supernode communication algorithm overrides adaptive algorithm selection.
For some communication operators on certain AI processor types, HCCL still selects the communication algorithm adaptively when the data size is small, and this environment variable does not control it.

The global configuration method is as follows:
export HCCL_ALGO="level0:NA;level1:<algo>;level2:<algo>"
- level0 indicates the intra-server communication algorithm. Currently, this value can only be set to NA.
- level1 indicates the inter-server communication algorithm. This value can be set to the following:
  - ring: a communication algorithm based on the ring topology. It requires a large number of communication steps (linear complexity) and has relatively high latency. However, its communication relationship is simple and it is less affected by network congestion. This algorithm applies to scenarios where the communicator contains only a few servers, the communication data size is small, the network is noticeably congested, and the pipeline algorithm is not applicable.
  - H-D_R: Recursive Halving-Doubling (RHD) algorithm. It features a small number of communication steps (logarithmic complexity) and relatively low latency. However, extra traffic is introduced if the number of nodes is not an integral power of 2. This algorithm applies to scenarios where the number of servers in the communicator is an integral power of 2 and the pipeline algorithm is not applicable, or the number of servers is not an integral power of 2 but the communication data size is small.
  - NHR: Nonuniform Hierarchical Ring (NHR) algorithm. It features a small number of communication steps (logarithmic complexity) and relatively low latency. This algorithm applies to scenarios where there are lots of servers in the communicator and the pipeline algorithm is not applicable.
  - NHR_V1: an earlier version of the NHR algorithm. It features a small number of communication steps (square-root complexity) and relatively low latency. This algorithm applies to scenarios where the number of servers in the communicator is not an integral power of 2 and the pipeline algorithm is not applicable. Theoretically, the performance of NHR_V1 is lower than that of the new NHR algorithm. This option will be deprecated in a future release; use the NHR algorithm instead.
  - NB: Nonuniform Bruck (NB) algorithm. It features a small number of communication steps (logarithmic complexity) and relatively low latency. This algorithm applies to scenarios where there are lots of servers in the communicator and the pipeline algorithm is not applicable.
  - AHC: Asymmetric Hierarchical Concatenate (AHC) algorithm. This algorithm applies to scenarios where NPUs are distributed symmetrically or asymmetrically (that is, with unequal numbers of devices) across multiple layers in the communicator. The benefit is greater when bandwidth convergence exists between layers in the communicator.
    Note: When level1 (inter-server communication algorithm) is set to AHC, level2 (inter-supernode communication algorithm) automatically uses the AHC algorithm without additional configuration. Even if level2 is set to another algorithm, the setting does not take effect.
  - pipeline: pipeline parallel algorithm. It can concurrently use the intra-server and inter-server links. This algorithm applies to scenarios where the communication data size is large and each server in the communicator contains multiple devices.
  - pairwise: pairwise communication algorithm. It is used only for the AlltoAll, AlltoAllV, and AlltoAllVC operators. It requires a large number of communication steps (linear complexity), has relatively high latency, and needs additional memory in direct proportion to the data size. However, it avoids one-to-many traffic in the network. This algorithm applies to scenarios where the communication data size is large and one-to-many traffic needs to be avoided.
  For details about the communication operators, data types, network running modes, and products supported by each inter-server communication algorithm, see Supported Algorithms for Communication Between Servers.
  If level1 is not set:
  - For the Atlas A3 training products / Atlas A3 inference products, the algorithm is automatically selected based on the product form, number of nodes, and data size.
  - For the Atlas A2 training products / Atlas A2 inference products, the algorithm is automatically selected based on the product form, number of nodes, and data size.
  - For the Atlas training products, if the number of servers in a communicator is not an integral power of 2, the ring algorithm is used by default. In other scenarios, the H-D_R algorithm is used by default.
- level2 indicates the inter-supernode communication algorithm. This value can be set to the following:
  - ring: a communication algorithm based on the ring topology. It features a large number of communication steps (linear complexity) and relatively high latency. However, it has a simple communication relationship and is less affected by network congestion. This algorithm applies to scenarios where the number of supernodes in the communicator is small and is not an integral power of 2.
  - H-D_R: Recursive Halving-Doubling (RHD) algorithm. It features a small number of communication steps (logarithmic complexity) and relatively low latency. However, extra traffic is introduced if the number of nodes is not an integral power of 2. This algorithm applies to scenarios where the number of supernodes in the communicator is an integral power of 2, or the number of supernodes is not an integral power of 2 but the communication data size is small.
  - NHR: Nonuniform Hierarchical Ring (NHR) algorithm. It features a small number of communication steps (logarithmic complexity) and relatively low latency. This algorithm applies to scenarios where the number of supernodes in the communicator is large.
  - NB: Nonuniform Bruck (NB) algorithm. It features a small number of communication steps (logarithmic complexity) and relatively low latency. This algorithm applies to scenarios where the number of supernodes in the communicator is large.
  - pipeline: pipeline parallel algorithm. It can concurrently use the intra-supernode and inter-supernode links. This algorithm applies to scenarios where the communication data size is large and each supernode in the communicator contains multiple devices.
  For details about the communication operators, data types, and network running modes supported by each inter-supernode communication algorithm, see Supported Algorithms for Communication Between Supernodes.
  
  If level2 is not set, the ring algorithm is used when the number of supernodes in the communicator is less than 8 and is not an integral power of 2. In other scenarios, the H-D_R algorithm is used.
  Currently, the level2 configuration applies only to the following scenarios:
  - It is supported only by the Atlas A3 training products / Atlas A3 inference products.
  - It is supported only when the orchestration expansion position of the communication algorithm is AI_CPU. The orchestration expansion position of the communication algorithm can be configured using the environment variable HCCL_OP_EXPANSION_MODE.
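
The fixed fallback rules stated above (the level1 default on the Atlas training products and the level2 default) can be sketched in shell as follows. This is only an illustration of the documented rules: the function names are hypothetical, and HCCL makes the actual selection internally.

```shell
# Sketch of the documented fallback rules only; HCCL makes the real choice internally.
# All function names here are hypothetical helpers, not part of HCCL.

# Succeeds when $1 is an integral power of 2 (1, 2, 4, 8, ...).
is_power_of_two() {
    [ "$1" -gt 0 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}

# Documented level1 default on the Atlas training products; $1 = number of servers.
default_level1_atlas_training() {
    if is_power_of_two "$1"; then echo "H-D_R"; else echo "ring"; fi
}

# Documented level2 default; $1 = number of supernodes.
default_level2() {
    if [ "$1" -lt 8 ] && ! is_power_of_two "$1"; then echo "ring"; else echo "H-D_R"; fi
}

default_level1_atlas_training 6   # prints: ring
default_level2 12                 # prints: H-D_R (12 is not less than 8)
```
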
Configure the communication algorithm based on the operator type as follows:
export HCCL_ALGO="<op0>=level0:NA;level1:<algo0>;level2:<algo1>/<op1>=level0:NA;level1:<algo3>;level2:<algo4>"
Where:
- <op> indicates the type of the communication operator. The options are as follows:
  - allgather: corresponds to the communication operators AllGather and AllGatherV.
  - reducescatter: corresponds to the communication operators ReduceScatter and ReduceScatterV.
  - allreduce: corresponds to the communication operator AllReduce.
  - broadcast: corresponds to the communication operator Broadcast.
  - reduce: corresponds to the communication operator Reduce.
  - scatter: corresponds to the communication operator Scatter.
  - alltoall: corresponds to the communication operators AlltoAll, AlltoAllV, and AlltoAllVC.
- <algo> specifies the communication algorithm used by the specified communication operator. The supported configuration is the same as that of level1 and level2 in the global configuration method. Ensure that the specified communication algorithm is supported by the communication operator. For details about the communication operators supported by each algorithm, see Supported Algorithms for Communication Between Servers and Supported Algorithms for Communication Between Supernodes. If no communication algorithm is specified for a communication operator, the system automatically selects a communication algorithm based on the product form, number of nodes, and data size.
- Use slashes (/) to separate the configurations of multiple operators.
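
The per-operator syntax can be assembled programmatically. The following is a minimal sketch, assuming only level1 is configured for each operator; build_hccl_algo is a hypothetical helper and not part of HCCL. It rejects operator types that are not in the list above, but it does not check whether the chosen algorithm supports the operator.

```shell
# Hypothetical helper: builds a per-operator HCCL_ALGO value from
# "<op>=<level1 algorithm>" pairs. level0 is always NA; level2 is left unset.
build_hccl_algo() (
    result=""
    for pair in "$@"; do
        op=${pair%%=*}      # text before the first "="
        algo=${pair#*=}     # text after the first "="
        case "$op" in
            allgather|reducescatter|allreduce|broadcast|reduce|scatter|alltoall) ;;
            *) echo "unknown operator type: $op" >&2; exit 1 ;;
        esac
        entry="$op=level0:NA;level1:$algo"
        # Operator entries are separated by slashes.
        if [ -z "$result" ]; then result=$entry; else result="$result/$entry"; fi
    done
    printf '%s\n' "$result"
)

HCCL_ALGO=$(build_hccl_algo allreduce=ring allgather=H-D_R) && export HCCL_ALGO
echo "$HCCL_ALGO"
# prints: allreduce=level0:NA;level1:ring/allgather=level0:NA;level1:H-D_R
```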

Example

Global configuration

export HCCL_ALGO="level0:NA;level1:H-D_R"
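
Where level2 applies (Atlas A3 training products / Atlas A3 inference products with the AI_CPU expansion position), a global configuration can also name an inter-supernode algorithm. The sketch below assumes that "AI_CPU" is the literal value accepted by HCCL_OP_EXPANSION_MODE; check the reference for that variable before relying on it. The choice of NHR and ring is only illustrative.

```shell
# Assumption: "AI_CPU" is the literal value that selects the AI_CPU expansion position.
export HCCL_OP_EXPANSION_MODE="AI_CPU"
# NHR between servers, ring between supernodes; level0 must stay NA.
export HCCL_ALGO="level0:NA;level1:NHR;level2:ring"
```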

Configuration by operator

# The AllReduce operator uses the ring algorithm and the AllGather operator uses the RHD algorithm. Other operators automatically select a communication algorithm based on the product form, number of nodes, and data size.
export HCCL_ALGO="allreduce=level0:NA;level1:ring/allgather=level0:NA;level1:H-D_R"

Restrictions

In the current version, the intra-server communication algorithm can only be set to NA.
For Atlas A2 training products / Atlas A2 inference products, you are advised not to configure the HCCL_ALGO environment variable in the order-preserving scenario of strict deterministic computing.
If you call the HCCL C APIs to initialize a communicator with specific configurations and specify the communication algorithm using the hcclAlgo parameter of HcclCommConfig, the configuration of the communicator takes precedence.
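
The level0 restriction can be checked before a job is launched. The following is a minimal sketch; level0_is_na is a hypothetical helper, and it verifies only that every level0 entry in a value is NA, in either the global or the per-operator syntax.

```shell
# Hypothetical pre-flight check: succeeds only if every level0 entry in $1 is NA.
level0_is_na() (
    # Emit one "levelN:<algo>" field per line: split on "/" and ";",
    # then drop any leading "<op>=" prefix.
    printf '%s\n' "$1" | tr '/;' '\n\n' | sed 's/^[^=]*=//' |
    while IFS=: read -r level algo; do
        if [ "$level" = "level0" ] && [ "$algo" != "NA" ]; then exit 1; fi
    done
)

if level0_is_na "level0:NA;level1:H-D_R"; then
    echo "level0 setting is valid"
fi
```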

Supported Algorithms for Communication Between Servers

**Table 1** Supported algorithms for communication between servers
Algorithm Type	Collective Communication Operator	Data Type	Network Running Mode	Deterministic Computing	Supported Products	Fallback When Operator Is Unsupported
ring	ReduceScatter, AllGather, AllReduce, and Reduce	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR)	Supported	Atlas training products; Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products	The NHR or H-D_R algorithm is automatically selected.
ring	ReduceScatterV	int8, int16, int32, int64 (supported only in single-operator mode on the Atlas A3 training products / Atlas A3 inference products), float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR), supported only on the Atlas A2 training products / Atlas A2 inference products	Supported	Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products	The NHR or H-D_R algorithm is automatically selected.
ring	Scatter	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode	Supported	Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products	The NHR or H-D_R algorithm is automatically selected.
ring	AllGatherV	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR), supported only on the Atlas A2 training products / Atlas A2 inference products	Supported	Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products	The NHR or H-D_R algorithm is automatically selected.
H-D_R	ReduceScatter, AllGather, AllReduce, Broadcast, and Reduce	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR)	Supported	Atlas training products; Atlas A2 training products / Atlas A2 inference products	The NHR or ring algorithm is automatically selected.
NHR	ReduceScatter, AllGather, AllReduce, and Broadcast	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR)	Supported	Atlas training products; Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products	The H-D_R or ring algorithm is automatically selected.
NHR	ReduceScatterV	int8, int16, int32, int64 (supported only in single-operator mode on the Atlas A3 training products / Atlas A3 inference products), float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR), supported only on the Atlas A2 training products / Atlas A2 inference products	Supported	Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products	The H-D_R or ring algorithm is automatically selected.
NHR	Scatter	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode	Supported	Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products	The H-D_R or ring algorithm is automatically selected.
NHR	AllGatherV	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR), supported only on the Atlas A2 training products / Atlas A2 inference products	Supported	Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products	The H-D_R or ring algorithm is automatically selected.
NHR_V1	ReduceScatter, AllGather, AllReduce, and Broadcast	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR)	Supported	Atlas training products; Atlas A2 training products / Atlas A2 inference products	The NHR, H-D_R, or ring algorithm is automatically selected.
NB	ReduceScatter, AllGather, AllReduce, and Broadcast	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR)	Supported	Atlas training products; Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products	The NHR, H-D_R, or ring algorithm is automatically selected.
NB	ReduceScatterV	int8, int16, int32, int64 (supported only in single-operator mode on the Atlas A3 training products / Atlas A3 inference products), float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR), supported only on the Atlas A2 training products / Atlas A2 inference products	Supported	Atlas training products; Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products	The NHR, H-D_R, or ring algorithm is automatically selected.
NB	AllGatherV	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR), supported only on the Atlas A2 training products / Atlas A2 inference products	Supported	Atlas training products; Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products	The NHR, H-D_R, or ring algorithm is automatically selected.
NB	Scatter	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode	Supported	Atlas training products; Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products	The NHR, H-D_R, or ring algorithm is automatically selected.
AHC	ReduceScatter, AllGather, and AllReduce	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR)	Supported	Atlas A3 training products / Atlas A3 inference products	The NHR, H-D_R, or ring algorithm is automatically selected.
pipeline	AllReduce	int8, int16, int32, float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR): for the floating-point compute overflow mode, the saturation mode is not supported and only the INF/NaN mode is supported	Not supported	Atlas A2 training products / Atlas A2 inference products	The NHR, H-D_R, or ring algorithm is automatically selected.
pipeline	AllGather	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR)	-	Atlas A2 training products / Atlas A2 inference products	The NHR, H-D_R, or ring algorithm is automatically selected.
pipeline	AllGatherV	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR)	Supported	Atlas A2 training products / Atlas A2 inference products	The NHR, H-D_R, or ring algorithm is automatically selected.
pipeline	ReduceScatter	int8, int16, int32, float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR)	Not supported	Atlas A2 training products / Atlas A2 inference products	The NHR, H-D_R, or ring algorithm is automatically selected.
pipeline	AlltoAll	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode; dynamic shape scenario of the graph mode (Ascend IR)	-	Atlas A2 training products / Atlas A2 inference products	The pairwise algorithm is automatically selected.
pipeline	AlltoAllV	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode; dynamic shape scenario of the graph mode (Ascend IR)	-	Atlas A2 training products / Atlas A2 inference products	The pairwise algorithm is automatically selected.
pipeline	AlltoAllVC	int8, int16, int32, int64, float16, float32, and bfp16	Dynamic shape scenario of the graph mode (Ascend IR)	-	Atlas A2 training products / Atlas A2 inference products	The pairwise algorithm is automatically selected.
pairwise	AlltoAll, AlltoAllV, and AlltoAllVC	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR)	-	Atlas A2 training products / Atlas A2 inference products	The NHR, H-D_R, or ring algorithm is automatically selected.

Supported Algorithms for Communication Between Supernodes

**Table 2** Supported algorithms for communication between supernodes
Algorithm Type	Collective Communication Operator	Data Type	Network Running Mode	Deterministic Computing	Supported Products	Fallback When Operator Is Unsupported
ring	ReduceScatter, AllGather, AllReduce, and Reduce	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR)	Supported	Atlas A3 training products / Atlas A3 inference products	The NHR or H-D_R algorithm is automatically selected.
ring	ReduceScatterV	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode	Supported	Atlas A3 training products / Atlas A3 inference products	The NHR or H-D_R algorithm is automatically selected.
ring	Scatter and AllGatherV	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode	Supported	Atlas A3 training products / Atlas A3 inference products	The NHR or H-D_R algorithm is automatically selected.
H-D_R	AllReduce, Broadcast, and Reduce	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR)	Supported	Atlas A3 training products / Atlas A3 inference products	The NHR or ring algorithm is automatically selected.
NHR	ReduceScatter, AllGather, AllReduce, and Broadcast	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR)	Supported	Atlas A3 training products / Atlas A3 inference products	The H-D_R or ring algorithm is automatically selected.
NHR	ReduceScatterV	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode	Supported	Atlas A3 training products / Atlas A3 inference products	The H-D_R or ring algorithm is automatically selected.
NHR	Scatter and AllGatherV	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode	Supported	Atlas A3 training products / Atlas A3 inference products	The H-D_R or ring algorithm is automatically selected.
NB	ReduceScatter, AllGather, AllReduce, and Broadcast	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode; graph mode (Ascend IR)	Supported	Atlas A3 training products / Atlas A3 inference products	The NHR, H-D_R, or ring algorithm is automatically selected.
NB	ReduceScatterV	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode	Supported	Atlas A3 training products / Atlas A3 inference products	The NHR, H-D_R, or ring algorithm is automatically selected.
NB	Scatter and AllGatherV	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode	Supported	Atlas A3 training products / Atlas A3 inference products	The NHR, H-D_R, or ring algorithm is automatically selected.
pipeline	AllGather	int8, int16, int32, int64, float16, float32, and bfp16	Single-operator mode (valid only when zero copy is enabled)	Supported	Atlas A3 training products / Atlas A3 inference products	The NHR, H-D_R, or ring algorithm is automatically selected.
pipeline	ReduceScatter	int8, int16, int32, float16, float32, and bfp16	Single-operator mode (valid only when zero copy is enabled)	Supported	Atlas A3 training products / Atlas A3 inference products	The NHR, H-D_R, or ring algorithm is automatically selected.

Applicability

Atlas training products

Atlas A2 training products / Atlas A2 inference products

Atlas A3 training products / Atlas A3 inference products
