HCCL Overview

Huawei Collective Communication Library (HCCL) is a high-performance collective communication library built on Ascend AI processors. It provides collective communication support for data-parallel and model-parallel training in single-server multi-device and multi-server multi-device scenarios.

HCCL supports communication primitives such as AllReduce, Broadcast, AllGather, ReduceScatter, and AlltoAll, as well as communication algorithms such as Ring, Mesh, and Halving-Doubling (HD). It implements collective communication over high-speed links including HCCS, RoCE, and PCIe.
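The Ring algorithm mentioned above can be illustrated with a small, self-contained simulation: each rank first reduce-scatters its buffer around a logical ring, then all-gathers the fully reduced chunks. This is a plain-Python sketch of the data movement only (the function and variable names are illustrative, not HCCL APIs); real HCCL executes the same pattern as device tasks over HCCS, RoCE, or PCIe links.

```python
def ring_allreduce(buffers):
    """Simulate AllReduce (sum) over a logical ring.

    ``buffers`` holds one equal-length list per rank; on return, every
    rank's list contains the element-wise sum across all ranks.
    """
    n = len(buffers)
    chunk = len(buffers[0]) // n          # each rank "owns" one chunk
    assert all(len(b) == n * chunk for b in buffers)

    def span(idx):
        return idx * chunk, (idx + 1) * chunk

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) mod n
    # to its right neighbour, which accumulates it. Messages are collected
    # first to mimic all ranks sending simultaneously.
    for s in range(n - 1):
        msgs = []
        for r in range(n):
            lo, hi = span((r - s) % n)
            msgs.append(((r + 1) % n, lo, buffers[r][lo:hi]))
        for dst, lo, data in msgs:
            for i, v in enumerate(data):
                buffers[dst][lo + i] += v

    # Phase 2: all-gather. Each rank now owns one fully reduced chunk and
    # forwards completed chunks around the ring until every rank has all.
    for s in range(n - 1):
        msgs = []
        for r in range(n):
            lo, hi = span((r + 1 - s) % n)
            msgs.append(((r + 1) % n, lo, buffers[r][lo:hi]))
        for dst, lo, data in msgs:
            buffers[dst][lo:lo + len(data)] = data
    return buffers
```

With three ranks, `ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])` leaves every rank holding `[111, 222, 333]`. Each rank transfers roughly 2·(n−1)/n times its buffer size in total, which is why Ring performs well in bandwidth-bound cases.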

Supported Products

Atlas Training Series Products

HCCL's Position in the System

Figure 1 HCCL's position in the system

HCCL provides APIs in both C and Python. The C APIs are used for framework adaptation in single-operator mode: for example, the HCCL single-operator APIs are embedded in the PyTorch backend code, so PyTorch users can directly use the native PyTorch collective communication APIs to enable distributed training. The Python APIs are used for framework adaptation in graph mode: for example, TensorFlow networks implement distributed optimization based on the Python APIs of HCCL.

HCCL Software Architecture

Figure 2 HCCL software architecture
The HCCL software architecture is divided into three layers:
  • Adaptation layer: consists of graph engine (GE) adaptation and single-operator adaptation, providing communicator management and communication operator APIs.
  • Service layer: consists of the communication framework and communication algorithm modules.
    • Communication framework: manages the communicator and connects the services of communication operators. It works with the communication algorithm module to select algorithms, and with the platform module to apply for resources and deliver collective communication tasks.
    • Communication algorithm: hosts the collective communication algorithms. This module computes the resources required for a specific collective communication operation and orchestrates communication tasks based on the communicator information.
  • Platform layer: provides resource abstraction related to collective communication on the NPU, and provides maintenance and test capabilities related to collective communication.

Collective Communication Process

In distributed scenarios, HCCL provides high-performance collective communication between servers. The communication process is shown in Figure 3.

Figure 3 Collective communication process in distributed scenarios

The communication between servers goes through four phases:
  1. Communicator initialization: Obtain necessary collective communication parameters and initialize the communicator.

    This phase does not involve communications between NPU devices.

  2. Communication setup: Establish a socket connection and exchange communication parameters and memory information between the two communicating endpoints.

    In this phase, HCCL establishes links with the other NPU devices based on the cluster information provided by the user and the network topology, and exchanges the parameters used for communication. If no response is received from another NPU device within the link setup timeout period, a link setup timeout error is reported and the service process exits.

  3. Communication operations: Synchronize device execution status and transfer data in memory through the wait/notify mechanism.

    In this phase, HCCL orchestrates the communication algorithm into tasks (such as memory access and synchronization tasks) and delivers them through the Runtime to the Task Scheduler of the Ascend device. The device then schedules and executes these tasks based on the orchestration information.

  4. Communicator destruction: Destroy the communicator and release communication resources.
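The wait/notify synchronization used in phase 3 above can be approximated in-process with ordinary threading primitives. The sketch below is only an analogy (the `Notify` class and function names are invented for illustration): in real HCCL, the notify is a device-side resource that tasks record and wait on via the Task Scheduler, and a missing notify likewise surfaces as a timeout error.

```python
import threading

class Notify:
    """Toy stand-in for a notify: one end records, the other end waits."""

    def __init__(self):
        self._event = threading.Event()

    def record(self):
        self._event.set()                  # signal: "my data is ready"

    def wait(self, timeout):
        if not self._event.wait(timeout):  # block until the peer records
            raise TimeoutError("notify wait timed out")
        self._event.clear()                # re-arm for the next task

def demo():
    shared = {}          # stands in for peer-accessible device memory
    ready = Notify()
    result = []

    def rank0():         # producer: write data first, then notify the peer
        shared["chunk"] = [1, 2, 3]
        ready.record()

    def rank1():         # consumer: wait for the notify before reading
        ready.wait(timeout=5.0)
        result.append(sum(shared["chunk"]))

    threads = [threading.Thread(target=f) for f in (rank1, rank0)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

`demo()` returns `6`. If `rank0` never called `record`, the `wait` in `rank1` would raise after five seconds, analogous to the execution timeout errors HCCL reports when a peer's notify never arrives.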