Overview
HCCL provides C and Python development APIs to implement distributed capabilities.
- C APIs are used for framework adaptation in single-operator mode.
- Python APIs are used for framework adaptation in graph mode. Currently, they are used only for distributed optimization of TensorFlow networks on the Ascend AI Processor.
This section describes how to call the HCCL C APIs to develop collective communication functions.
The following figure shows the process of calling the HCCL C APIs to implement collective communication.
Figure 1 Collective communication process
- Configure the cluster information, create a communicator handle, and initialize the HCCL communicator.
- Implement HCCL communications, including point-to-point communication and collective communication.
  - Point-to-point communication transmits data directly between two NPUs in a multi-NPU system. It is typically used to send and receive activation values in pipeline parallel scenarios. HCCL provides point-to-point communication at different granularities: single-rank receive (RX) and send (TX) interfaces, and batch receive and send interfaces.
  - Collective communication is a data transmission operation in which multiple NPUs participate, such as AllReduce, AllGather, or Broadcast. It is typically used for gradient synchronization and parameter updates across NPUs in a large-scale cluster. It lets all compute nodes exchange data in a parallel, efficient, and orderly manner, which improves data transmission efficiency.
- After communication is complete, destroy the communicator and free the related memory and stream resources.
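
To make the semantics of the collective operations named above concrete, the following is a minimal host-side sketch in plain C. It does not call HCCL: the functions `sim_broadcast`, `sim_allreduce_sum`, and `sim_allgather` are hypothetical helpers written for illustration only, and each "rank" is simulated as one row of a contiguous host array rather than a buffer on a real NPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative simulation only: these helpers are NOT HCCL APIs.
 * Rank r's data occupies elements [r * count, (r + 1) * count) of a
 * contiguous host array that stands in for per-NPU device buffers. */

/* Broadcast: every rank ends up with a copy of the root rank's buffer. */
void sim_broadcast(float *bufs, size_t nranks, size_t count, size_t root)
{
    assert(nranks > 0 && root < nranks);
    for (size_t r = 0; r < nranks; ++r) {
        if (r != root) {
            memcpy(bufs + r * count, bufs + root * count,
                   count * sizeof(float));
        }
    }
}

/* AllReduce (sum): every rank receives the element-wise sum over all ranks. */
void sim_allreduce_sum(const float *send, float *recv,
                       size_t nranks, size_t count)
{
    assert(nranks > 0);
    for (size_t i = 0; i < count; ++i) {
        float sum = 0.0f;
        for (size_t r = 0; r < nranks; ++r) {
            sum += send[r * count + i];
        }
        for (size_t r = 0; r < nranks; ++r) {
            recv[r * count + i] = sum;
        }
    }
}

/* AllGather: every rank receives all ranks' inputs concatenated in rank
 * order, so each rank's receive buffer holds nranks * count elements. */
void sim_allgather(const float *send, float *recv,
                   size_t nranks, size_t count)
{
    assert(nranks > 0);
    size_t total = nranks * count;
    for (size_t r = 0; r < nranks; ++r) {
        /* The send array is already laid out in rank order. */
        memcpy(recv + r * total, send, total * sizeof(float));
    }
}
```

For example, with two simulated ranks holding {1, 2, 3} and {10, 20, 30}, `sim_allreduce_sum` leaves {11, 22, 33} in both ranks' receive buffers. In a real program each rank owns only its own device buffer, and the communication library moves the data between NPUs on a stream; the loops here only model the result that each rank observes.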