Operator-Level Online Recovery

On Atlas A3 training products, HCCL can retransmit communication operators when a parameter plane network fault occurs. As long as the faulty process does not exit, operator-level online recovery lets training jobs tolerate network exceptions for a longer time without being interrupted.

If operator-level online recovery (HCCL communication operator retransmission) fails to rectify network faults, process-level online recovery is triggered.

For details about the key configuration steps of operator-level online recovery, see Configuring Operator-Level Online Recovery.

HCCL is a distributed collective communication library designed by Huawei for Ascend AI processors. It enables efficient collaboration between multiple devices (such as NPUs) to accelerate distributed training of deep learning models, supporting AI scenarios that demand large-scale computing power. In distributed training, HCCL coordinates data synchronization (such as gradient aggregation and parameter updates) between multiple Ascend processors, reducing communication overhead and improving training efficiency.

Scenario

Currently, operator-level online recovery can be used to rectify the following faults.

  • For processor network faults, if operator retransmission is successful, Volcano processes the current job as an unhealthy job. If operator retransmission fails, Volcano triggers rescheduling.
  • For UnifiedBus interconnect device faults, if operator-level online recovery is performed by HCCL, Volcano processes the current job as an unhealthy job.
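The two cases above can be read as a small decision table. The following is a minimal sketch of that table, not Volcano code: the names `Fault`, `Action`, and `volcano_action` are hypothetical, and the UnifiedBus case where HCCL recovery does not succeed returns `None` because this section does not describe it.

```python
from enum import Enum
from typing import Optional


class Fault(Enum):
    """Fault types listed in this section (names are illustrative)."""
    PROCESSOR_NETWORK = "processor network fault"
    UNIFIEDBUS_DEVICE = "UnifiedBus interconnect device fault"


class Action(Enum):
    """Volcano reactions named in this section (names are illustrative)."""
    MARK_UNHEALTHY = "process the job as an unhealthy job"
    RESCHEDULE = "trigger rescheduling"


def volcano_action(fault: Fault, recovered_by_hccl: bool) -> Optional[Action]:
    """Map a fault and the outcome of operator retransmission to an action."""
    if recovered_by_hccl:
        # Both fault types: the job keeps running but is treated as unhealthy.
        return Action.MARK_UNHEALTHY
    if fault is Fault.PROCESSOR_NETWORK:
        # Retransmission failed for a processor network fault.
        return Action.RESCHEDULE
    # Not specified in this section; left undefined on purpose.
    return None


if __name__ == "__main__":
    for fault in Fault:
        for recovered in (True, False):
            print(fault.value, recovered, volcano_action(fault, recovered))
```

Returning `None` for the unspecified case keeps the sketch from asserting behavior that the text does not state.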

Restrictions

  • This feature does not support the scenario where MC2 is enabled.
  • Watchdog cannot be enabled.

Supported Products and Frameworks

Table 1 Supported products and frameworks

Product Type                  Product                                           Training Framework
Atlas A3 training products    Atlas 900 A3 SuperPoD cluster computing system    -

Operator-Level Online Recovery Principles

Figure 1 Schematic diagram

The details of each step are as follows:

  1. During training, a linkdown fault occurs on the HCCS or RoCE network plane.
  2. CANN detects the network fault. After the current operator is terminated, the system attempts to recover the network link by switching BGP links on the HCCS plane or by enabling link failover communication on the RoCE network plane. After the link is recovered, the communication operator is re-executed.
  3. After the operator is re-executed successfully, the training iteration is resumed.
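The steps above can be sketched as a retry loop around a single communication operator. This is an illustrative simulation only and does not call HCCL or CANN: `LinkDownError`, `run_with_retransmission`, `recover_link`, and the `max_retries` limit are all assumptions introduced for the example, and the actual retry policy is not described in this section.

```python
from typing import Callable


class LinkDownError(Exception):
    """Hypothetical stand-in for a linkdown fault on the HCCS or RoCE plane."""


def run_with_retransmission(
    operator: Callable[[], str],
    recover_link: Callable[[], bool],
    max_retries: int = 3,  # assumed limit; not taken from the source
) -> str:
    """Run an operator, re-executing it after each successful link recovery."""
    for attempt in range(max_retries + 1):
        try:
            # Step 3: a successful (re-)execution lets the iteration resume.
            return operator()
        except LinkDownError:
            # Step 2: the operator was terminated; try to restore the link.
            if attempt == max_retries or not recover_link():
                break
    # Operator-level recovery failed; the next level takes over.
    raise RuntimeError("escalate to process-level online recovery")


def make_flaky_all_reduce(failures: int) -> Callable[[], str]:
    """Build a fake operator that hits a linkdown fault `failures` times."""
    remaining = [failures]

    def all_reduce() -> str:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise LinkDownError("linkdown")  # Step 1: fault during training
        return "all_reduce done"

    return all_reduce


if __name__ == "__main__":
    print(run_with_retransmission(make_flaky_all_reduce(2), lambda: True))
```

The final `RuntimeError` stands in for the handoff to process-level online recovery, which is triggered when operator retransmission cannot rectify the fault.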