HCCL_ALGO

Description

Configures the communication algorithms between servers and supernodes. The algorithms can be configured globally or by operator.

This section only briefly describes the algorithm functions. For details, see Introduction to Collective Communication Algorithms.

  • HCCL selects algorithms adaptively: by default, it picks an appropriate algorithm based on the product form, data size, and number of servers, so no manual configuration is required. Setting this environment variable to specify the inter-server or inter-supernode communication algorithm disables adaptive algorithm selection.
  • For some communication operators, when a specific type of AI processor is used and the data size is small, HCCL still selects the communication algorithm adaptively and this environment variable has no effect.
  • The global configuration method is as follows:

    export HCCL_ALGO="level0:NA;level1:<algo>;level2:<algo>"

    • level0 indicates the intra-server communication algorithm. Currently, this value can only be set to NA.
    • level1 indicates the inter-server communication algorithm. This value can be set to the following:
      • ring: a communication algorithm based on the ring topology. It requires a large number of communication steps (linear complexity) and has relatively high latency. However, its communication relationship is simple and it is less affected by network congestion. This algorithm applies when the communicator contains only a few servers, the communication data size is small, the network is noticeably congested, and the pipeline algorithm is not applicable.
      • H-D_R: Recursive Halving-Doubling (RHD) algorithm. It features a small number of communication steps (logarithmic complexity) and relatively low latency. However, extra traffic is introduced if the number of nodes is not an integral power of 2. This algorithm applies to scenarios where the number of servers in the communicator is an integral power of 2 and the pipeline algorithm is not applicable, or the number of servers is not an integral power of 2 but the communication data size is small.
      • NHR: Nonuniform Hierarchical Ring (NHR) algorithm. It features a small number of communication steps (logarithmic complexity) and relatively low latency. This algorithm applies to scenarios where there are lots of servers in the communicator and the pipeline algorithm is not applicable.
      • NHR_V1: an earlier version of the NHR algorithm. It features a small number of communication steps (root complexity) and relatively low latency. This algorithm applies to scenarios where the number of servers in the communicator is not an integral power of 2 and the pipeline algorithm is not applicable. In theory, NHR_V1 performs worse than the new NHR algorithm. This configuration option will be deprecated in the future; use the NHR algorithm instead.
      • NB: Nonuniform Bruck (NB) algorithm. It features a small number of communication steps (logarithmic complexity) and relatively low latency. This algorithm applies to scenarios where there are lots of servers in the communicator and the pipeline algorithm is not applicable.
      • AHC: Asymmetric Hierarchical Concatenate (AHC) algorithm. This algorithm applies to scenarios where NPUs are distributed symmetrically or asymmetrically (that is, with unequal numbers of devices) across multiple layers in the communicator. The benefit is greater when bandwidth convergence exists between layers in the communicator.

        Note: When level1 (inter-server communication algorithm) is set to AHC, level2 (inter-supernode communication algorithm) automatically uses the AHC algorithm without additional configuration. Even if level2 is set to another algorithm, the setting does not take effect.

      • pipeline: pipeline parallel algorithm. It can concurrently use the intra-server and inter-server links. This algorithm applies to scenarios where the communication data size is large and each server in the communicator contains multiple devices.
      • pairwise: pairwise communication algorithm. It is used only for the AlltoAll, AlltoAllV, and AlltoAllVC operators. It requires a large number of communication steps (linear complexity), has relatively high latency, and needs additional memory in direct proportion to the data size. However, it avoids one-to-many communication on the network. This algorithm applies to scenarios where the communication data size is large and one-to-many communication on the network needs to be avoided.

      For details about the communication operators, data types, network running modes, and products supported by each inter-server communication algorithm, see Supported Algorithms for Communication Between Servers.

      If level1 is not set:
      • For the Atlas A3 training products / Atlas A3 inference products and the Atlas A2 training products / Atlas A2 inference products, the algorithm is automatically selected based on the product form, number of nodes, and data size.
      • For the Atlas training products, if the number of servers in a communicator is not an integral power of 2, the ring algorithm is used by default. In other scenarios, the H-D_R algorithm is used by default.
    • level2 indicates the inter-supernode communication algorithm. This value can be set to the following:
      • ring: a communication algorithm based on the ring topology. It features a large number of communication steps (linear complexity) and relatively high latency. However, it has a simple communication relationship and is less affected by network congestion. This algorithm applies to scenarios where the number of supernodes in the communicator is small and is not an integral power of 2.
      • H-D_R: Recursive Halving-Doubling (RHD) algorithm. It features a small number of communication steps (logarithmic complexity) and relatively low latency. However, extra traffic is introduced if the number of nodes is not an integral power of 2. This algorithm applies to scenarios where the number of supernodes in the communicator is an integral power of 2, or the number of supernodes is not an integral power of 2 but the communication data size is small.
      • NHR: Nonuniform Hierarchical Ring (NHR) algorithm. It features a small number of communication steps (logarithmic complexity) and relatively low latency. This algorithm applies to scenarios where the number of supernodes in the communicator is large.
      • NB: Nonuniform Bruck (NB) algorithm. It features a small number of communication steps (logarithmic complexity) and relatively low latency. This algorithm applies to scenarios where the number of supernodes in the communicator is large.
      • pipeline: pipeline parallel algorithm. It can concurrently use the intra-supernode and inter-supernode links. This algorithm applies to scenarios where the communication data size is large and each supernode in the communicator contains multiple devices.

      For details about the communication operators, data types, and network running modes supported by each inter-supernode communication algorithm, see Supported Algorithms for Communication Between Supernodes.

      If level2 is not set, the ring algorithm is used when the number of supernodes in the communicator is less than 8 and is not an integral power of 2. In other scenarios, the H-D_R algorithm is used.

      Currently, the level2 configuration applies only to the following scenarios:
      • It is supported only by the Atlas A3 training products / Atlas A3 inference products.
      • It is supported only when the orchestration expansion position of the communication algorithm is AI_CPU. The orchestration expansion position of the communication algorithm can be configured using the environment variable HCCL_OP_EXPANSION_MODE.
  • The per-operator configuration method is as follows:

    export HCCL_ALGO="<op0>=level0:NA;level1:<algo0>;level2:<algo1>/<op1>=level0:NA;level1:<algo3>;level2:<algo4>"

    Where:
    • <op> indicates the type of the communication operator. The options are as follows:
      • allgather: corresponds to the communication operators AllGather and AllGatherV.
      • reducescatter: corresponds to the communication operators ReduceScatter and ReduceScatterV.
      • allreduce: corresponds to the communication operator AllReduce.
      • broadcast: corresponds to the communication operator Broadcast.
      • reduce: corresponds to the communication operator Reduce.
      • scatter: corresponds to the communication operator Scatter.
      • alltoall: corresponds to the communication operators AlltoAll, AlltoAllV, and AlltoAllVC.
    • <algo> specifies the communication algorithm used by the specified communication operator. The supported configuration is the same as that of level1 and level2 in the global configuration method. Ensure that the specified communication algorithm is supported by the communication operator. For details about the communication operators supported by each algorithm, see Supported Algorithms for Communication Between Servers and Supported Algorithms for Communication Between Supernodes. If no communication algorithm is specified for a communication operator, the system automatically selects a communication algorithm based on the product form, number of nodes, and data size.
    • Use slashes (/) to separate the configurations of multiple operators.
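The defaults described above for an unset level1 (on the Atlas training products) and an unset level2 can be sketched as a few POSIX shell functions. This is only an illustration of the documented rules, not HCCL's actual selection code; the function names are hypothetical, and the adaptive selection used on the Atlas A2 and A3 products is not modeled.

```shell
# Succeeds when $1 is a positive integral power of 2.
is_power_of_two() {
    [ "$1" -gt 0 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}

# Hypothetical helper: documented level1 default on the Atlas training
# products, given the number of servers in the communicator.
default_level1_atlas_training() {
    if is_power_of_two "$1"; then
        echo "H-D_R"
    else
        echo "ring"
    fi
}

# Hypothetical helper: documented level2 default, given the number of
# supernodes in the communicator.
default_level2() {
    if [ "$1" -lt 8 ] && ! is_power_of_two "$1"; then
        echo "ring"
    else
        echo "H-D_R"
    fi
}

default_level1_atlas_training 6   # prints ring
default_level1_atlas_training 8   # prints H-D_R
default_level2 6                  # prints ring
default_level2 12                 # prints H-D_R
```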

Example

  • Global configuration
    export HCCL_ALGO="level0:NA;level1:H-D_R"
  • Configuration by operator
    # The AllReduce operator uses the ring algorithm and the AllGather operator uses the RHD algorithm. Other operators automatically select a communication algorithm based on the product form, number of nodes, and data size.
    export HCCL_ALGO="allreduce=level0:NA;level1:ring/allgather=level0:NA;level1:H-D_R"
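When several operators are configured, the value can be easier to maintain if each operator entry is written separately and joined with slashes. The following is a minimal sketch; join_hccl_algo is a hypothetical helper, not part of HCCL, and the alltoall entry assumes a product on which the pairwise algorithm is supported.

```shell
# Hypothetical helper: joins per-operator entries with "/".
join_hccl_algo() {
    result=""
    for entry in "$@"; do
        if [ -z "$result" ]; then
            result="$entry"
        else
            result="$result/$entry"
        fi
    done
    printf '%s\n' "$result"
}

# Each argument is one "<op>=level0:NA;level1:<algo>" entry.
HCCL_ALGO="$(join_hccl_algo \
    "allreduce=level0:NA;level1:ring" \
    "allgather=level0:NA;level1:H-D_R" \
    "alltoall=level0:NA;level1:pairwise")"
export HCCL_ALGO

echo "$HCCL_ALGO"
```

Running unset HCCL_ALGO afterwards removes the setting, so HCCL returns to its default adaptive algorithm selection.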

Restrictions

  • In the current version, the intra-server communication algorithm can only be set to NA.
  • For the Atlas A2 training products / Atlas A2 inference products, you are advised not to configure the HCCL_ALGO environment variable in the order-preserving scenario of strict deterministic computing.
  • If you call the HCCL C APIs to initialize a communicator with specific configurations and specify the communication algorithm using the hcclAlgo parameter of HcclCommConfig, the configuration of the communicator takes precedence.
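The first restriction can be checked before a job is launched. The sketch below is a hypothetical pre-flight check, not an HCCL tool: it extracts every level0 value from a candidate string and fails if any of them is not NA. It assumes a grep that supports the -o option, which is common but not strictly POSIX.

```shell
# Hypothetical check, not part of HCCL: fails if any level0 value is not NA.
check_level0_na() {
    # Pull out every "level0:<value>" token, then look for one that is not
    # exactly "level0:NA".
    if printf '%s\n' "$1" | grep -o 'level0:[^;/]*' | grep -qvx 'level0:NA'; then
        return 1
    fi
    return 0
}

if check_level0_na "allreduce=level0:NA;level1:ring/allgather=level0:NA;level1:H-D_R"; then
    echo "level0 settings look valid"
fi
```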

Supported Algorithms for Communication Between Servers

Table 1 Supported algorithms for communication between servers

| Algorithm Type | Collective Communication Operator | Data Type | Network Running Mode | Deterministic Computing | Supported Products | Unsupported Operator Processing Method |
| --- | --- | --- | --- | --- | --- | --- |
| ring | ReduceScatter, AllGather, AllReduce, and Reduce | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) | Supported | Atlas training products; Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products | The NHR or H-D_R algorithm is automatically selected. |
| ring | ReduceScatterV | int8, int16, int32, int64 (supported only in single-operator mode on the Atlas A3 training products / Atlas A3 inference products), float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) only on the Atlas A2 training products / Atlas A2 inference products | Supported | Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products | The NHR or H-D_R algorithm is automatically selected. |
| ring | Scatter | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode | Supported | Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products | The NHR or H-D_R algorithm is automatically selected. |
| ring | AllGatherV | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) only on the Atlas A2 training products / Atlas A2 inference products | Supported | Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products | The NHR or H-D_R algorithm is automatically selected. |
| H-D_R | ReduceScatter, AllGather, AllReduce, Broadcast, and Reduce | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) | Supported | Atlas training products; Atlas A2 training products / Atlas A2 inference products | The NHR or ring algorithm is automatically selected. |
| NHR | ReduceScatter, AllGather, AllReduce, and Broadcast | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) | Supported | Atlas training products; Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products | The H-D_R or ring algorithm is automatically selected. |
| NHR | ReduceScatterV | int8, int16, int32, int64 (supported only in single-operator mode on the Atlas A3 training products / Atlas A3 inference products), float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) only on the Atlas A2 training products / Atlas A2 inference products | Supported | Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products | The H-D_R or ring algorithm is automatically selected. |
| NHR | Scatter | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode | Supported | Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products | The H-D_R or ring algorithm is automatically selected. |
| NHR | AllGatherV | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) only on the Atlas A2 training products / Atlas A2 inference products | Supported | Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products | The H-D_R or ring algorithm is automatically selected. |
| NHR_V1 | ReduceScatter, AllGather, AllReduce, and Broadcast | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) | Supported | Atlas training products; Atlas A2 training products / Atlas A2 inference products | The NHR, H-D_R, or ring algorithm is automatically selected. |
| NB | ReduceScatter, AllGather, AllReduce, and Broadcast | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) | Supported | Atlas training products; Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products | The NHR, H-D_R, or ring algorithm is automatically selected. |
| NB | ReduceScatterV | int8, int16, int32, int64 (supported only in single-operator mode on the Atlas A3 training products / Atlas A3 inference products), float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) only on the Atlas A2 training products / Atlas A2 inference products | Supported | Atlas training products; Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products | The NHR, H-D_R, or ring algorithm is automatically selected. |
| NB | AllGatherV | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) only on the Atlas A2 training products / Atlas A2 inference products | Supported | Atlas training products; Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products | The NHR, H-D_R, or ring algorithm is automatically selected. |
| NB | Scatter | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode | Supported | Atlas training products; Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products | The NHR, H-D_R, or ring algorithm is automatically selected. |
| AHC | ReduceScatter, AllGather, and AllReduce | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) | Supported | Atlas A3 training products / Atlas A3 inference products | The NHR, H-D_R, or ring algorithm is automatically selected. |
| pipeline | AllReduce | int8, int16, int32, float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR), where only the INF/NaN floating-point overflow mode is supported (the saturation mode is not supported) | Not supported | Atlas A2 training products / Atlas A2 inference products | The NHR, H-D_R, or ring algorithm is automatically selected. |
| pipeline | AllGather | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) | - | Atlas A2 training products / Atlas A2 inference products | The NHR, H-D_R, or ring algorithm is automatically selected. |
| pipeline | AllGatherV | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) | Supported | Atlas A2 training products / Atlas A2 inference products | The NHR, H-D_R, or ring algorithm is automatically selected. |
| pipeline | ReduceScatter | int8, int16, int32, float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) | Not supported | Atlas A2 training products / Atlas A2 inference products | The NHR, H-D_R, or ring algorithm is automatically selected. |
| pipeline | AlltoAll | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode; dynamic shape scenario of the graph mode (Ascend IR) | - | Atlas A2 training products / Atlas A2 inference products | The pairwise algorithm is automatically selected. |
| pipeline | AlltoAllV | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode; dynamic shape scenario of the graph mode (Ascend IR) | - | Atlas A2 training products / Atlas A2 inference products | The pairwise algorithm is automatically selected. |
| pipeline | AlltoAllVC | int8, int16, int32, int64, float16, float32, and bfp16 | Dynamic shape scenario of the graph mode (Ascend IR) | - | Atlas A2 training products / Atlas A2 inference products | The pairwise algorithm is automatically selected. |
| pairwise | AlltoAll, AlltoAllV, and AlltoAllVC | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) | - | Atlas A2 training products / Atlas A2 inference products | The NHR, H-D_R, or ring algorithm is automatically selected. |

Supported Algorithms for Communication Between Supernodes

Table 2 Supported algorithms for communication between supernodes

| Algorithm Type | Collective Communication Operator | Data Type | Network Running Mode | Deterministic Computing | Supported Products | Unsupported Operator Processing Method |
| --- | --- | --- | --- | --- | --- | --- |
| ring | ReduceScatter, AllGather, AllReduce, and Reduce | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) | Supported | Atlas A3 training products / Atlas A3 inference products | The NHR or H-D_R algorithm is automatically selected. |
| ring | ReduceScatterV | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode | Supported | Atlas A3 training products / Atlas A3 inference products | The NHR or H-D_R algorithm is automatically selected. |
| ring | Scatter and AllGatherV | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode | Supported | Atlas A3 training products / Atlas A3 inference products | The NHR or H-D_R algorithm is automatically selected. |
| H-D_R | AllReduce, Broadcast, and Reduce | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) | Supported | Atlas A3 training products / Atlas A3 inference products | The NHR or ring algorithm is automatically selected. |
| NHR | ReduceScatter, AllGather, AllReduce, and Broadcast | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) | Supported | Atlas A3 training products / Atlas A3 inference products | The H-D_R or ring algorithm is automatically selected. |
| NHR | ReduceScatterV | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode | Supported | Atlas A3 training products / Atlas A3 inference products | The H-D_R or ring algorithm is automatically selected. |
| NHR | Scatter and AllGatherV | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode | Supported | Atlas A3 training products / Atlas A3 inference products | The H-D_R or ring algorithm is automatically selected. |
| NB | ReduceScatter, AllGather, AllReduce, and Broadcast | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode; graph mode (Ascend IR) | Supported | Atlas A3 training products / Atlas A3 inference products | The NHR, H-D_R, or ring algorithm is automatically selected. |
| NB | ReduceScatterV | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode | Supported | Atlas A3 training products / Atlas A3 inference products | The NHR, H-D_R, or ring algorithm is automatically selected. |
| NB | Scatter and AllGatherV | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode | Supported | Atlas A3 training products / Atlas A3 inference products | The NHR, H-D_R, or ring algorithm is automatically selected. |
| pipeline | AllGather | int8, int16, int32, int64, float16, float32, and bfp16 | Single-operator mode (valid only when zero copy is enabled) | Supported | Atlas A3 training products / Atlas A3 inference products | The NHR, H-D_R, or ring algorithm is automatically selected. |
| pipeline | ReduceScatter | int8, int16, int32, float16, float32, and bfp16 | Single-operator mode (valid only when zero copy is enabled) | Supported | Atlas A3 training products / Atlas A3 inference products | The NHR, H-D_R, or ring algorithm is automatically selected. |

Applicability

Atlas training products

Atlas A2 training products / Atlas A2 inference products

Atlas A3 training products / Atlas A3 inference products