Introduction
Application Scenario
To test the performance of the Huawei Collective Communication Library (HCCL) in distributed training scenarios, use the HCCL Performance Tester.
The tool supports only network performance tests based on HCCL single-operator APIs.
Obtaining Source Package of the Tool
After the CANN Toolkit software package is installed, you can find the source code of the HCCL Performance Tester in ${INSTALL_DIR}/tools/hccl_test. Replace ${INSTALL_DIR} with the CANN component directory. For example, if the installation is performed by the root user, the default file storage path is /usr/local/Ascend/cann.
Compilation is required before you use the tool.
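As a rough illustration of the path substitution described above, the following Python sketch builds the expected source directory and reports whether it exists. The function name and the use of an INSTALL_DIR environment variable are assumptions made for this example; only the default path and the tools/hccl_test subdirectory come from the text above.

```python
import os
from pathlib import Path

# Default CANN directory for an installation by the root user (from the text above).
DEFAULT_INSTALL_DIR = "/usr/local/Ascend/cann"


def hccl_test_source_dir(install_dir=None):
    """Return the expected location of the HCCL Performance Tester sources.

    Hypothetical helper: reading INSTALL_DIR from the environment is an
    assumption of this sketch, not documented tool behavior.
    """
    base = install_dir or os.environ.get("INSTALL_DIR", DEFAULT_INSTALL_DIR)
    return Path(base) / "tools" / "hccl_test"


if __name__ == "__main__":
    src = hccl_test_source_dir()
    status = "found" if src.is_dir() else "not found"
    print(f"{src.as_posix()}: {status}")
```

If the directory is reported as not found, check that ${INSTALL_DIR} points to the actual CANN component directory of your installation.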
Supported Products
For the
For the
Restrictions
- For the Atlas A3 training products/Atlas A3 inference products, the HCCL performance tester supports the performance test of a maximum of 32000 communication ranks in a cluster. For the AlltoAll and AlltoAllV operators, the HCCL performance tester supports the performance test of a maximum of 8000 communication ranks in a cluster.
- For the Atlas A2 training products/Atlas A2 inference products, the HCCL performance tester supports the performance test of a maximum of 32000 communication ranks in a cluster.
- For the Atlas training products, the HCCL performance tester supports the performance test of a maximum of 4096 communication ranks in a cluster.
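The limits above can be checked before a test is launched. The following Python sketch is one possible way to do this. The product labels and function names are hypothetical, and the sketch assumes that the 8000-rank limit for AlltoAll and AlltoAllV applies only to the Atlas A3 products, which is how the restriction is grouped above.

```python
# Maximum communication ranks per cluster, copied from the restrictions above.
# The short product labels are hypothetical keys chosen for this sketch.
MAX_RANKS = {
    "Atlas A3": 32000,
    "Atlas A2": 32000,
    "Atlas training": 4096,
}

# Assumption: the lower AlltoAll/AlltoAllV limit applies to Atlas A3 products only.
A3_ALLTOALL_MAX_RANKS = 8000
ALLTOALL_OPS = {"AlltoAll", "AlltoAllV"}


def max_supported_ranks(product, operator):
    """Return the maximum rank count for a product and operator."""
    if product == "Atlas A3" and operator in ALLTOALL_OPS:
        return A3_ALLTOALL_MAX_RANKS
    return MAX_RANKS[product]


def check_rank_count(product, operator, ranks):
    """Return True if the requested rank count is within the supported limit."""
    return 1 <= ranks <= max_supported_ranks(product, operator)


if __name__ == "__main__":
    print(check_rank_count("Atlas A3", "AllReduce", 16000))   # True
    print(check_rank_count("Atlas A3", "AlltoAllV", 16000))   # False
```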
Background Knowledge
- Bandwidth for collective communication
The collective communication bandwidth refers to the algorithm bandwidth, that is, the data volume divided by the time taken to perform a collective communication operation.
For example, if the AllReduce operation is performed on eight devices on a single server, the algorithm bandwidth of the AllReduce operator is the data size divided by the time required to complete the AllReduce operation.
The bandwidth data reported by the HCCL performance tester is the algorithm bandwidth.
The algorithm bandwidth is affected by the following factors:
  - RDMA bandwidth between servers (RoCE link)
  - SDMA communication bandwidth between devices in a server (HCCS link)
  - PCIe link bandwidth
  - Implementation of communication algorithm orchestration
- Physical bandwidth
The physical bandwidth in a cluster includes the physical bandwidth of HCCS links and RoCE links. The physical bandwidth is a factor that affects the algorithm bandwidth.
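The definition of algorithm bandwidth given above (data size divided by elapsed time) can be sketched in a few lines of Python. The function name is hypothetical, and expressing the result in GB/s with 1 GB = 10^9 bytes is an assumption of this sketch, not a statement about the unit used by the tool.

```python
def algorithm_bandwidth_gbps(data_size_bytes, elapsed_seconds):
    """Algorithm bandwidth: data size divided by the time the operation took.

    Returns GB/s, assuming 1 GB = 1e9 bytes (an assumption of this sketch).
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return data_size_bytes / elapsed_seconds / 1e9


if __name__ == "__main__":
    # Illustrative numbers only: an AllReduce over 1e9 bytes that takes 0.25 s.
    print(algorithm_bandwidth_gbps(1_000_000_000, 0.25))  # 4.0
```

The number of participating devices does not appear in the calculation. Factors such as link bandwidth and algorithm orchestration affect the result only through the measured time.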