Suspension and Switchback of Link Failover Communication
For details about how to configure suspension and switchback of link failover communication, see Configuring Suspension and Switchback of Link Failover Communication.
- Before calling the link failover and switchback APIs, make sure you understand the NPU networking and confirm that the network link of the target NPU is normal. If the target NPU is in the linkdown state, the operation fails.
- The following uses the interface interconnection in the networking guide as an example to describe how dev and op map to each other when SwitchNicTrack is called. The two lists have the same length, and op[i] applies to dev[i]: true means failover and false means switchback.
  - If device 0 and device 8 are switched from QDD8 to QDD7, dev should be [device0, device8] and op should be [true, true].
  - If device 0 and device 8 are switched back from QDD7 to QDD8, dev should be [device0, device8] and op should be [false, false].
  - If device 0 is switched from PortA of QDD8 to PortA of QDD7, dev should be [device0] and op should be [true].
  - If device 0 is switched back from PortA of QDD7 to PortA of QDD8, dev should be [device0] and op should be [false].
  - If all devices of leaf 1 are switched to leaf 2, dev should be [device0, device8, device2, device10, device4, device12, device6, device14] and op should be [true, true, true, true, true, true, true, true].
  - If all devices of leaf 2 are switched back to leaf 1, dev should be [device0, device8, device2, device10, device4, device12, device6, device14] and op should be [false, false, false, false, false, false, false, false].
Figure 1 Port interconnection relationship
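The dev-op mapping in the examples above can be sketched in Python as follows. This is only an illustration of how the two parallel lists line up. The helper name `build_switch_args`, the string labels such as `device0`, and the constant `LEAF1_DEVICES` are assumptions made for this sketch; the actual parameter types of SwitchNicTrack are not described in this section.

```python
from typing import List, Tuple


def build_switch_args(device_ids: List[int], to_backup: bool) -> Tuple[List[str], List[bool]]:
    """Build the parallel dev/op lists for one failover or switchback request.

    to_backup=True  -> failover (for example QDD8 -> QDD7), every op entry is True
    to_backup=False -> switchback (for example QDD7 -> QDD8), every op entry is False
    """
    if not device_ids:
        raise ValueError("at least one device is required")
    if len(set(device_ids)) != len(device_ids):
        raise ValueError("duplicate device IDs")
    # Labels mirror the notation used in the text; this is not a real API type.
    dev = [f"device{i}" for i in device_ids]
    op = [to_backup] * len(dev)
    return dev, op


# Devices connected to leaf 1 in the example networking (taken from the list above).
LEAF1_DEVICES = [0, 8, 2, 10, 4, 12, 6, 14]

dev, op = build_switch_args([0, 8], to_backup=True)
print(dev, op)
```

Keeping the operation as a single flag per request matches all six examples above, where every device in one call moves in the same direction.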
Scenario
Currently, this feature can be used in the following two scenarios:
- Switch upgrade: Link failover is manually triggered so that the switches can be upgraded. After the upgrade, the links are switched back.
- Troubleshooting: After the faulty port that caused the link failover recovers, the links are manually switched back.
Restrictions
- Deliver the link failover or switchback command only when training iterations are running normally.
- Ensure that process-level recovery has been enabled.
- For MindSpore, set the TASKD_PROCESS_ENABLE environment variable to on (export TASKD_PROCESS_ENABLE=on) before starting TaskD Manager.
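A platform could check the restrictions above before delivering a command. The following is a minimal sketch under stated assumptions: the function `check_failover_preconditions` and its boolean inputs are hypothetical, and only the variable name TASKD_PROCESS_ENABLE comes from this document. How iteration health and process-level recovery are actually detected is left to the caller.

```python
import os
from typing import List, Mapping


def check_failover_preconditions(env: Mapping[str, str],
                                 framework: str,
                                 iteration_healthy: bool,
                                 process_recovery_enabled: bool) -> List[str]:
    """Return a list of unmet restrictions; an empty list means all checks passed."""
    problems = []
    if not iteration_healthy:
        problems.append("training iteration is not running normally")
    if not process_recovery_enabled:
        problems.append("process-level recovery is not enabled")
    # The environment variable is only required for MindSpore.
    if framework.lower() == "mindspore" and env.get("TASKD_PROCESS_ENABLE") != "on":
        problems.append("TASKD_PROCESS_ENABLE must be set to on for MindSpore")
    return problems


# Check the current process environment (hypothetical inputs for the two flags).
for problem in check_failover_preconditions(os.environ, "MindSpore", True, True):
    print("blocked:", problem)
```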
Supported Products and AI Frameworks
| Product Type | Product | Training Framework |
|---|---|---|
| Atlas A3 training products | | |
Principles of Suspension and Switchback of Link Failover Communication

The details of each step are as follows:
- An AI platform integrates ClusterD and calls the gRPC interface of ClusterD to deliver a failover operation and specify the target NPU.
- ClusterD instructs MindIO to suspend training.
- TaskD Manager instructs all TaskD Workers to call the training framework interface to perform the failover operation.
- The training framework calls the CANN interface for each communicator in turn to perform the failover operation.
- After ClusterD determines that the failover operation of all NPUs is complete, TaskD instructs MindIO to continue the next step of training.
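The ordering of the steps above can be simulated with a small Python sketch. It does not use any real ClusterD, TaskD, MindIO, or CANN interface: the `Worker` class, the `run_failover` function, the communicator names, and the behavior when a worker does not complete are all assumptions made for illustration. The steps are numbered in the order they are listed above.

```python
from typing import List


class Worker:
    """Hypothetical stand-in for one TaskD Worker and the NPU it drives."""

    def __init__(self, device: str, communicators: List[str]):
        self.device = device
        self.communicators = communicators

    def switch_links(self, to_backup: bool, log: List[str]) -> bool:
        # Step 4: communicators are handled one at a time.
        for comm in self.communicators:
            log.append(f"switch {self.device}/{comm} to_backup={to_backup}")
        return True  # a real worker would report the actual result


def run_failover(workers: List[Worker], to_backup: bool) -> List[str]:
    log = ["request received"]      # step 1: AI platform -> ClusterD
    log.append("training paused")   # step 2: ClusterD -> MindIO
    # Step 3: every worker is instructed to perform the operation.
    results = [w.switch_links(to_backup, log) for w in workers]
    # Step 5: training continues only after all NPUs report completion.
    if all(results):
        log.append("training resumed")
    else:
        # Assumption: the source does not describe the failure path.
        log.append("operation incomplete, training stays paused")
    return log


workers = [Worker("device0", ["dp", "tp"]), Worker("device8", ["dp", "tp"])]
for line in run_failover(workers, to_backup=True):
    print(line)
```

The point of the sketch is the ordering: training is suspended before any link is touched and resumes only once every worker has finished.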
Function Adaptation Points
To support suspension and switchback of link failover communication, the framework starts the MindIO service during initialization. After the service is started, the optimizer reports its update status to MindIO. The graceful suspension mechanism is used to suspend the job and perform the failover. The cluster brain needs to provide an external interface to receive failover instructions and manage the link failover communication process.
For non-MindSpeed-LLM/MindCluster users, adapt the following functions as listed in Table 2.
| Function | Description | Adapted Component | Reference Link |
|---|---|---|---|
| Startup during initialization | The MindIO service is started while the training framework is initialized. | Distributed training framework | |
| Optimizer update status reporting | The start and end of each optimizer update are reported. | Distributed training framework | |
| Graceful suspension | The MindIO function is called at the end of the training iteration to implement active suspension. | Distributed training framework | |
| Link failover management | Delivers link failover requests and controls the suspension and restart of training processes. | AI platform | |
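The three framework-side adaptation points in the table can be pictured as hooks in a training loop. The sketch below uses a stub class, `StubMindIO`, whose methods (`start`, `report_optimizer`, `pause_point`) are invented for this example and are not the real MindIO interface. The `pause_at` argument simulates a failover request arriving during a given step.

```python
from typing import List, Optional


class StubMindIO:
    """Hypothetical stand-in that records events; not the real MindIO API."""

    def __init__(self) -> None:
        self.events: List[str] = []
        self.pause_requested = False

    def start(self) -> None:
        self.events.append("service_started")

    def report_optimizer(self, phase: str, step: int) -> None:
        self.events.append(f"optimizer_{phase}:{step}")

    def pause_point(self, step: int) -> None:
        # Suspend only at an iteration boundary, then continue (simplified).
        if self.pause_requested:
            self.events.append(f"paused:{step}")
            self.pause_requested = False


def train(mindio: StubMindIO, steps: int, pause_at: Optional[int] = None) -> List[str]:
    mindio.start()                              # startup during initialization
    for step in range(steps):
        # forward and backward passes would run here
        mindio.report_optimizer("start", step)  # optimizer update begins
        # the optimizer update would run here
        mindio.report_optimizer("end", step)    # optimizer update ends
        if step == pause_at:
            mindio.pause_requested = True       # simulate an incoming failover request
        mindio.pause_point(step)                # graceful suspension at end of iteration
    return mindio.events


for event in train(StubMindIO(), steps=2, pause_at=0):
    print(event)
```

Placing the suspension check after the optimizer report means a pause never interrupts an update that is in progress, which is the behavior the graceful suspension row describes.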