Algorithm Overview

The same communication operator may use different communication algorithms according to different network topologies, communication data sizes, hardware resources, and more to maximize cluster communication performance. HCCL provides topology algorithms such as Mesh, Ring, Recursive Halving-Doubling (RHD), Pairwise, and Pipeline for intra-server, inter-server, and inter-supernode collective communication.

Intra-Server Communication Algorithms

Within a server in the HCCL communicator, the supported algorithms are Mesh, Ring, Double-Ring, and Star. They are automatically selected based on the hardware topology, and cannot be configured by users.

Inter-Server/Inter-Supernode Communication Algorithms

Between servers and supernodes in the HCCL communicator, the following algorithms can be adaptively selected based on the product form, data size, and number of servers. By default, they do not need to be configured by users.
  • Ring algorithm: This is a communication algorithm based on the ring topology. It features a large number of communication steps (linear complexity) and relatively high latency. However, it has a simple communication relationship and is less affected by network congestion. This algorithm applies to scenarios where there are only a few servers in the communicator, small-sized communication data, and obvious network congestion and the Pipeline algorithm is not applicable.
  • RHD algorithm: This algorithm features a small number of communication steps (logarithmic complexity) and relatively low latency. However, it introduces extra communication traffic when the number of servers is not a power of 2. This algorithm applies to scenarios where the number of servers in the communicator is an integral power of 2 and the Pipeline algorithm is not applicable, or the number of servers is not an integral power of 2 but the communication data size is small.
  • Nonuniform Hierarchical Ring (NHR) algorithm: This algorithm features a small number of communication steps (logarithmic complexity) and relatively low latency. It applies to scenarios where there are lots of servers in the communicator and the Pipeline algorithm is not applicable.
  • Nonuniform Bruck (NB) algorithm: This algorithm features a small number of communication steps (logarithmic complexity) and relatively low latency. It applies to scenarios where there are lots of servers in the communicator and the Pipeline algorithm is not applicable.
  • Pipeline algorithm: This algorithm can concurrently use the intra-server and inter-server links or intra-supernode and inter-supernode links. It applies to scenarios where the communication data size is large and each server in the communicator contains multiple devices.
  • Pairwise algorithm: This algorithm is used only for the AlltoAll, AlltoAllV, and AlltoAllVC operators. It features a large number of communication steps (linear complexity) and relatively high latency. However, it effectively mitigates the one-to-many network congestion issue (that is, one rank sends data to multiple other ranks through the same port). This algorithm applies to scenarios where the communication data size is large and one-to-many network bottlenecks need to be avoided.
  • Asymmetric Hierarchical Concatenate (AHC) algorithm: This algorithm is used only for the ReduceScatter, AllGather, and AllReduce operators, and applies to scenarios where NPUs are distributed at multiple layers in the communicator and NPUs are symmetrically or asymmetrically distributed at multiple layers. When bandwidth convergence exists between layers in the communicator, the relative benefits are better.
  • You can specify an inter-server or inter-supernode communication algorithm using the environment variable HCCL_ALGO. Note that if an inter-server or inter-supernode communication algorithm is specified using the environment variable HCCL_ALGO, the function for adaptively selecting a communication algorithm does not take effect. The algorithm specified by the user is used.
  • For details about the operators and products supported by each algorithm, see the environment variable HCCL_ALGO.

Estimated Required Time

HCCL uses the α–β model (Hockney) to evaluate performance. The variables used for calculating the required time of an algorithm are defined as follows:

  • α: fixed latency between ranks, in seconds. It is determined by the communication hardware and underlying software stack.
  • β: time required for transmitting each byte of data, in s/byte. It is determined by the communication link capability.
  • n: size of the data for inter-rank communication, in bytes. It is determined by the communication algorithm.
  • γ: time required for performing reduction calculation on each byte of data, in s/byte. It is determined by the compute hardware capability.
  • p: number of ranks in the communicator, which affects the number of communication steps. It is determined by the communicator where the communication operator is located.

The time required for one-step transmission and reduction of n bytes of data is calculated as follows: D = α + nβ + nγ.

The collective communication algorithm optimizes the communication relationship and steps based on the network topology, reduces the number of communication times to reduce the fixed latency, and reduces the actual communication data size to reduce the time required for transmission and computing, thereby optimizing the collective communication performance.

Hierarchical Communication Principles

HCCL is usually divided into a two-level topology (intra-rank and inter-rank) or three-level topology (intra-rank, inter-rank, and inter-supernode) to implement hierarchical collective communication. The link bandwidth varies with the level. Hierarchical communication enables communication task orchestration to be affinity with the network topology, maximizing the utilization of the link capability.

The following table describes the hierarchical communication process of each collective communication operator in the single-operator mode of the Atlas A2 training products/Atlas A2 inference products with a two-level topology (intra-rank and inter-rank).

Collective Communication Operator

Phase 1

Phase 2

Phase 3

ReduceScatter

Inter-server ReduceScatter

Intra-server ReduceScatter

/

ReduceScatterV

Inter-server ReduceScatterV

Inter-server ReduceScatterV

/

AllGather

Intra-server AllGather

Inter-server AllGather

/

AllGatherV

Intra-server AllGatherV

Inter-server AllGatherV

/

AllReduce

Intra-server ReduceScatter

Inter-server AllReduce

Intra-server AllGather

Scatter

Inter-server Scatter

Intra-server scatter

/

Broadcast

Intra-server scatter

Inter-server Broadcast

Intra-server AllGather

Reduce

Intra-server ReduceScatter

Inter-server Reduce

Intra-server Gather

HCCL does not provide the Gather operator. The difference between the Gather operation and the AllGather operation is that the Gather operation sends the result only to the output buffer of the root rank.

AlltoAll

Intra-server AlltoAll

Inter-server AlltoAll

/

AlltoAllV

Intra-server AlltoAllV

Inter-server AlltoAllV

/

AlltoAllVC

Inter-server AlltoAllVC

Inter-server AlltoAllVC

/

For details about the hierarchical communication process, see Hierarchical Communication Principles.