Suspension and Switchback of Link Failover Communication

On Atlas A3 training products, the MindCluster cluster scheduling components provide suspension and switchback functions for link failover communication in training jobs. Through the active link failover and switchback interfaces, you can switch the RoCE network ports used by NPUs during training.

For details about how to configure suspension and switchback of link failover communication, see Configuring Suspension and Switchback of Link Failover Communication.

  • Before calling the link failover and switchback APIs, make sure you understand the NPU networking and confirm that the network link of the target NPU is normal. If the target NPU is in the linkdown state, the operation fails.
  • The following uses the interface interconnection in the networking guide as an example to describe the dev-op mapping when SwitchNicTrack is called.
    1. If device 0 and device 8 are switched from QDD8 to QDD7, dev should be [device0, device8] and op should be [true, true].
    2. If device 0 and device 8 are switched back from QDD7 to QDD8, dev should be [device0, device8] and op should be [false, false].
    3. If device 0 is switched from PortA of QDD8 to PortA of QDD7, dev should be [device0] and op should be [true].
    4. If device 0 is switched back from PortA of QDD7 to PortA of QDD8, dev should be [device0] and op should be [false].
    5. If all devices of leaf 1 are switched to leaf 2, dev should be [device0, device8, device2, device10, device4, device12, device6, device14] and op should be [true, true, true, true, true, true, true, true].
    6. If all devices of leaf 2 are switched back to leaf 1, dev should be [device0, device8, device2, device10, device4, device12, device6, device14] and op should be [false, false, false, false, false, false, false, false].
    Figure 1 Port interconnection relationship
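
The dev-op mapping in the numbered examples can be sketched in Python. This is a minimal illustration of how the two lists pair up, not the SwitchNicTrack client API: the function name build_switch_lists and the string device labels are hypothetical. The only facts taken from the examples are that op is true for failover, false for switchback, and has one entry per device.

```python
from typing import List, Tuple


def build_switch_lists(devices: List[int], switch_back: bool) -> Tuple[List[str], List[bool]]:
    """Pair each device with its op flag (hypothetical helper, not a MindCluster API).

    op is True for failover and False for switchback,
    following the numbered examples above.
    """
    dev = [f"device{d}" for d in devices]
    op = [not switch_back] * len(devices)
    return dev, op


# Example 1: fail over device 0 and device 8.
dev, op = build_switch_lists([0, 8], switch_back=False)
print(dev, op)

# Example 6: switch all eight leaf devices back.
leaf_devices = [0, 8, 2, 10, 4, 12, 6, 14]
dev_back, op_back = build_switch_lists(leaf_devices, switch_back=True)
print(dev_back, op_back)
```

Keeping dev and op the same length and in the same order is the point of the sketch: each op entry applies to the device at the same index.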

Scenario

Currently, this feature can be used in the following two scenarios:

  • Switch upgrade: Manually trigger link failover before upgrading a switch, then switch the links back after the upgrade.
  • Troubleshooting: After the faulty port that caused link failover recovers, manually switch the links back.

Restrictions

  • Deliver the link failover or switchback command only after training iterations are running normally.
  • Ensure that process-level recovery has been enabled.
  • For MindSpore, set the TASKD_PROCESS_ENABLE environment variable to on (export TASKD_PROCESS_ENABLE=on) before starting TaskD Manager.
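
The MindSpore restriction can be checked before launch with a small script. The sketch below is only an illustration: check_preconditions is a hypothetical helper, not part of TaskD, and it assumes the value is compared as the exact lowercase string on. It does not cover the other two restrictions, which depend on runtime state.

```python
import os
from typing import List, Mapping


def check_preconditions(env: Mapping[str, str], framework: str) -> List[str]:
    """Return a list of unmet preconditions (hypothetical helper, not a TaskD API)."""
    problems = []
    # Assumption: the variable must hold exactly "on"; case handling is not documented here.
    if framework.lower() == "mindspore" and env.get("TASKD_PROCESS_ENABLE") != "on":
        problems.append("TASKD_PROCESS_ENABLE must be set to on before starting TaskD Manager")
    return problems


# Check the current process environment for a MindSpore job.
for problem in check_preconditions(os.environ, "MindSpore"):
    print("precondition not met:", problem)
```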

Supported Products and AI Frameworks

Table 1 Supported products and frameworks

Product Type | Product | Training Framework
Atlas A3 training products | Atlas 900 A3 SuperPoD; Atlas 800T A3 SuperPoD Server | MindSpore; PyTorch

Principles of Suspension and Switchback of Link Failover Communication

Figure 2 Schematic diagram

The details of each step are as follows:

  1. An AI platform integrates ClusterD and calls the gRPC interface of ClusterD to deliver a failover operation and specify the target NPU.
  2. ClusterD instructs MindIO to suspend training.
  3. TaskD Manager instructs all TaskD Workers to call the training framework interface to perform the failover operation.
  4. The training framework calls the CANN interfaces for each communicator in turn to perform the failover operation.
  5. After ClusterD determines that the failover operation is complete on all NPUs, TaskD instructs MindIO to resume training from the next step.
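
The ordering of these steps can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the ClusterD, TaskD, or MindIO implementation: the Worker class, run_failover, and the log strings are all hypothetical, and the "failover incomplete" branch is an assumption because the steps above do not describe failure handling.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Worker:
    """Stand-in for a TaskD Worker and the NPUs it manages (hypothetical)."""
    npus: List[int]
    done: Dict[int, bool] = field(default_factory=dict)

    def switch(self, targets: Dict[int, bool]) -> None:
        # Stand-in for step 4: a real worker would go through the
        # training framework and CANN, communicator by communicator.
        for npu in self.npus:
            if npu in targets:
                self.done[npu] = True


def run_failover(workers: List[Worker], targets: Dict[int, bool]) -> List[str]:
    """targets maps an NPU id to its op flag (True = failover, False = switchback)."""
    log = ["request received"]           # step 1: platform delivers the request
    log.append("training suspended")     # step 2: training is paused first
    for worker in workers:               # step 3: every worker is instructed
        worker.switch(targets)
    finished = {npu for w in workers for npu, ok in w.done.items() if ok}
    if finished == set(targets):         # step 5: resume only when all NPUs are done
        log.append("training resumed")
    else:
        log.append("failover incomplete")  # assumption: not described in the source
    return log


workers = [Worker(npus=[0, 1]), Worker(npus=[8, 9])]
print(run_failover(workers, {0: True, 8: True}))
```

The sketch keeps the one property the steps make explicit: training resumes only after every target NPU has reported completion.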

Function Adaptation Points

To support suspension and switchback of link failover communication, the training framework starts the MindIO service during initialization. After the service starts, the optimizer reports its update status to MindIO. The graceful suspension mechanism is then used to suspend the job and perform failover. The cluster brain must provide an external interface that receives failover instructions and manages the link failover communication process.

If you do not use MindSpeed-LLM or MindCluster, adapt the functions listed in Table 2.

Table 2 Functions adapted for suspension and switchback of link failover communication

Function | Description | Adapted Component | Reference Link
Startup during initialization | The MindIO service is started while the training framework is initialized. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework
Optimizer update status reporting | The start and end of each optimizer update are reported. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework
Graceful suspension | The MindIO function is called at the end of the training iteration to implement active suspension. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework
Link failover management | Delivers link failover requests and controls the suspension and restart of training processes. | AI platform | See here.
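
The three framework-side adaptation points in Table 2 can be pictured as hooks in a training loop. The sketch below is purely illustrative: StatusReporter, its methods, and the event names are hypothetical and do not correspond to the MindIO API. Refer to the linked adaptation guide for the real interfaces.

```python
from typing import Callable, List


class StatusReporter:
    """Hypothetical stand-in for the MindIO-side service; not the MindIO API."""

    def __init__(self) -> None:
        self.events: List[str] = []
        self.pause_requested = False

    def report(self, event: str) -> None:
        self.events.append(event)

    def wait_if_paused(self, do_failover: Callable[[], None]) -> None:
        # Graceful suspension: only checked at the end of an iteration.
        if self.pause_requested:
            self.events.append("paused")
            do_failover()
            self.pause_requested = False
            self.events.append("resumed")


def train(reporter: StatusReporter, steps: int,
          optimizer_step: Callable[[int], None],
          do_failover: Callable[[], None]) -> None:
    reporter.report("service_started")  # startup during initialization
    for step in range(steps):
        # Forward and backward passes are omitted in this sketch.
        reporter.report(f"optimizer_update_start:{step}")
        optimizer_step(step)
        reporter.report(f"optimizer_update_end:{step}")
        reporter.wait_if_paused(do_failover)  # suspension point at iteration end


reporter = StatusReporter()


def fake_optimizer_step(step: int) -> None:
    # Simulate a failover request arriving during the first iteration.
    if step == 0:
        reporter.pause_requested = True


train(reporter, 2, fake_optimizer_step, lambda: reporter.report("failover"))
print(reporter.events)
```

Placing the suspension check after the optimizer update means a pause never interrupts a partially applied update, which is the reason the table ties graceful suspension to the end of the iteration.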