Operator-Level Online Recovery
If operator-level online recovery (HCCL communication operator retransmission) fails to rectify network faults, process-level online recovery is triggered.
For details about the key configuration steps of operator-level online recovery, see Configuring Operator-Level Online Recovery.
HCCL (Huawei Collective Communication Library) is a distributed collective communication library designed by Huawei for Ascend AI processors. It enables efficient collaboration between multiple devices (such as NPUs and GPUs) to accelerate distributed training of deep learning models in AI scenarios that demand large-scale computing power. In distributed training, HCCL coordinates data synchronization (such as gradient aggregation and parameter updates) between multiple Ascend processors, reducing communication overhead and improving training efficiency.
Scenario
Currently, operator-level online recovery can be used to rectify the following faults.
- For processor network faults, if operator retransmission is successful, Volcano processes the current job as an unhealthy job. If operator retransmission fails, Volcano triggers rescheduling.
- For UnifiedBus interconnect device faults, if operator-level online recovery is performed by HCCL, Volcano processes the current job as an unhealthy job.
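The handling rules above can be read as a small decision table. The following is a minimal sketch of that reading, not Volcano's actual implementation: the names `FaultType`, `Action`, and `scheduler_action` are hypothetical, and the function returns `None` for the one case this section does not describe (a UnifiedBus interconnect device fault that HCCL does not recover).

```python
from enum import Enum
from typing import Optional


class FaultType(Enum):
    # Hypothetical labels for the two fault classes listed in this section.
    PROCESSOR_NETWORK = "processor_network"
    UNIFIEDBUS_DEVICE = "unifiedbus_device"


class Action(Enum):
    # Hypothetical labels for the two scheduler reactions named in this section.
    MARK_UNHEALTHY = "mark_unhealthy"
    RESCHEDULE = "reschedule"


def scheduler_action(fault: FaultType, recovered: bool) -> Optional[Action]:
    """Map a fault and its operator-level recovery result to a scheduler action.

    Returns None when this section does not state what happens.
    """
    if recovered:
        # Both fault classes: the job continues and is treated as unhealthy.
        return Action.MARK_UNHEALTHY
    if fault is FaultType.PROCESSOR_NETWORK:
        # Retransmission failed for a processor network fault: reschedule.
        return Action.RESCHEDULE
    # Unrecovered UnifiedBus device fault: not described here.
    return None
```

For example, `scheduler_action(FaultType.PROCESSOR_NETWORK, recovered=False)` yields `Action.RESCHEDULE`, matching the first bullet.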
Restrictions
- This feature does not support the scenario where MC2 is enabled.
- Watchdog cannot be enabled.
Supported Products and Frameworks
| Product Type | Product | Training Framework |
|---|---|---|
| Atlas A3 training products | Atlas 900 A3 SuperPoD cluster computing system | - |
Operator-Level Online Recovery Principles

The details of each step are as follows:
- During training, a linkdown fault occurs on the HCCS or RoCE network plane.
- CANN detects the network fault. Once the current operator is terminated, the system attempts to recover the network link by switching BGP links on the HCCS plane or by enabling link failover on the RoCE network plane. After the link is recovered, the terminated communication operator is re-executed.
- After the operator is re-executed successfully, the training iteration is resumed.
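The steps above amount to a recover-then-retry loop around a communication operator. The sketch below illustrates only that control flow and is not how CANN or HCCL implements it: `LinkDownError`, `run_with_operator_recovery`, `recover_link`, and the `max_retries` limit are all hypothetical names and assumptions introduced for illustration. When link recovery fails, the sketch returns an "escalate" status, standing in for the hand-off to process-level online recovery.

```python
class LinkDownError(Exception):
    """Hypothetical signal that a link-down fault interrupted an operator."""


def run_with_operator_recovery(operator, recover_link, max_retries=1):
    """Run operator(); on LinkDownError, recover the link and re-execute.

    Returns ("ok", result) on success, or ("escalate", None) when the link
    cannot be recovered or the retry budget (an assumption) is exhausted.
    """
    attempts = 0
    while True:
        try:
            return "ok", operator()
        except LinkDownError:
            # The operator was terminated by the fault; try to restore the link.
            if attempts >= max_retries or not recover_link():
                return "escalate", None
            attempts += 1


# Usage: a stand-in operator that fails once, then succeeds on re-execution.
calls = {"n": 0}


def flaky_allreduce():
    calls["n"] += 1
    if calls["n"] == 1:
        raise LinkDownError("link down")
    return [2, 4, 6]


status, result = run_with_operator_recovery(flaky_allreduce, recover_link=lambda: True)
```

In this usage, the first call raises, the stand-in link recovery succeeds, and the second call returns normally, so the training iteration would resume with the operator's result.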