Introduction

Application Scenario

To test the performance of the Huawei Collective Communication Library (HCCL) in distributed training scenarios, use the HCCL Performance Tester.

This tool supports only network performance tests based on HCCL single-operator APIs.

Obtaining Source Package of the Tool

After the CANN Toolkit software package is installed, you can find the source code of the HCCL Performance Tester in ${INSTALL_DIR}/tools/hccl_test. Replace ${INSTALL_DIR} with the CANN component directory. For example, if the installation is performed by the root user, the default file storage path is /usr/local/Ascend/cann.

You must compile the tool before you can use it.

Supported Products

Atlas A3 training products/Atlas A3 inference products

Atlas A2 training products/Atlas A2 inference products

Atlas training products

Atlas inference products

For the Atlas A2 training products/Atlas A2 inference products, only the Atlas 800I A2 inference server, Atlas 300I A2 inference card, and A200I A2 Box heterogeneous components are supported.

For the Atlas inference products, only the Atlas 300I Duo inference card is supported.

Restrictions

  • For the Atlas A3 training products/Atlas A3 inference products, the HCCL Performance Tester supports performance tests on clusters of up to 32000 communication ranks.

    For the AlltoAll and AlltoAllV operators, the HCCL Performance Tester supports performance tests on clusters of up to 8000 communication ranks.

  • For the Atlas A2 training products/Atlas A2 inference products, the HCCL Performance Tester supports performance tests on clusters of up to 32000 communication ranks.
  • For the Atlas training products, the HCCL Performance Tester supports performance tests on clusters of up to 4096 communication ranks.
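The limits above can be captured in a small pre-flight check before a test run is launched. The following is a minimal sketch and not part of the tool: the product labels, the function names, and the reading that the 8000-rank AlltoAll/AlltoAllV limit applies to the Atlas A3 products (as the indentation of the list suggests) are assumptions made for illustration.

```python
# Rank limits copied from the Restrictions list.
# The dictionary keys are illustrative labels, not official product identifiers.
MAX_RANKS = {
    "Atlas A3": 32000,
    "Atlas A2": 32000,
    "Atlas training": 4096,
}

# Assumption: the lower AlltoAll/AlltoAllV limit applies to Atlas A3 only.
A3_ALLTOALL_MAX_RANKS = 8000
ALLTOALL_OPERATORS = {"AlltoAll", "AlltoAllV"}


def max_supported_ranks(product: str, operator: str) -> int:
    """Return the documented maximum rank count for a product/operator pair."""
    if product == "Atlas A3" and operator in ALLTOALL_OPERATORS:
        return A3_ALLTOALL_MAX_RANKS
    # Raises KeyError for products with no documented limit.
    return MAX_RANKS[product]


def check_rank_count(product: str, operator: str, ranks: int) -> bool:
    """Return True if the planned rank count is within the documented limit."""
    return 1 <= ranks <= max_supported_ranks(product, operator)


if __name__ == "__main__":
    print(check_rank_count("Atlas A3", "AllReduce", 16384))
    print(check_rank_count("Atlas A3", "AlltoAllV", 16384))
```

A product that has no documented limit in the list raises `KeyError` rather than silently passing the check.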

Background Knowledge

  • Bandwidth for collective communication

    The collective communication bandwidth refers to the algorithm bandwidth, that is, the data volume divided by the time consumed when a collective communication operation is performed.

    For example, if the AllReduce operation is performed on eight devices on a single server, the algorithm bandwidth of the AllReduce operator is the data size divided by the time required for completing the AllReduce operation.

    The bandwidth reported by the HCCL Performance Tester is the algorithm bandwidth.

    The algorithm bandwidth is affected by the following factors:
    • RDMA bandwidth between servers (RoCE link)
    • SDMA communication bandwidth between devices in a server (HCCS link)
    • PCIe link bandwidth
    • Implementation of communication algorithm orchestration
  • Physical bandwidth

    The physical bandwidth in a cluster includes the physical bandwidth of HCCS links and RoCE links. It is one of the factors that affect the algorithm bandwidth.
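The algorithm bandwidth definition above (data size divided by completion time) can be illustrated with a short calculation. This is a minimal sketch: the function name, the GB/s unit (1 GB taken as 10^9 bytes), and the sample figures are assumptions chosen for illustration, not values produced by the tool.

```python
def algorithm_bandwidth_gb_per_s(data_bytes: int, elapsed_seconds: float) -> float:
    """Algorithm bandwidth = data size / time to complete the collective operation.

    The result is in GB/s, where 1 GB is taken as 1e9 bytes (an assumption).
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return data_bytes / elapsed_seconds / 1e9


if __name__ == "__main__":
    # Hypothetical figures: an AllReduce on 1e9 bytes that completes in 50 ms.
    data_bytes = 1_000_000_000
    elapsed_seconds = 0.05
    bandwidth = algorithm_bandwidth_gb_per_s(data_bytes, elapsed_seconds)
    print(f"Algorithm bandwidth: {bandwidth:.1f} GB/s")
```

The number of devices does not appear in the calculation: the algorithm bandwidth depends only on the data size and the elapsed time, which is why link bandwidth and algorithm orchestration show up indirectly through the measured time.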