Online Stress Testing

MindCluster supports online stress testing during training. You call the online stress testing interface to suspend a specified training job and run a hardware P2P or AIC stress test on the nodes running the job. If no fault is found, training resumes. If a fault is found, the faulty node is isolated and resumable training is triggered.

Restrictions

  • For PyTorch, this function must be used with MindSpeed-LLM 2.3.0. For details about the version mapping, see MindSpeed-LLM.
  • For MindSpore, this function must be used with MindFormers master. For details about the version mapping, see MindSpore MindFormers.
  • Deliver the online stress test command only after training iterations are running normally.
  • Ensure that process-level recovery has been enabled.
  • Do not restart ClusterD during stress testing. If ClusterD restarts unexpectedly, restart training and deliver the stress test again.
  • Disable hot reset during stress testing.
  • P2P stress testing requires the device to have more than 10 GB of idle memory.
  • Add the nodeDEnable=on label to the node to ensure that the node where stress testing is performed can be isolated.
  • For MindSpore, set the TASKD_PROCESS_ENABLE environment variable to on (export TASKD_PROCESS_ENABLE=on) before starting TaskD Manager.
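
The restrictions above can be read as a pre-flight checklist. The following is a minimal sketch of such a check, assuming the platform can already collect the relevant state. The NodeState and JobState classes, their field names, and the preflight_problems function are hypothetical and are not part of MindCluster; only the nodeDEnable label, the TASKD_PROCESS_ENABLE variable, and the 10 GB threshold come from the list above.

```python
from dataclasses import dataclass, field


# Hypothetical containers: the field names only mirror the restrictions above.
@dataclass
class NodeState:
    name: str
    labels: dict = field(default_factory=dict)
    idle_device_memory_gb: float = 0.0


@dataclass
class JobState:
    iterations_normal: bool
    process_level_recovery: bool
    hot_reset_enabled: bool
    framework: str  # "pytorch" or "mindspore"
    env: dict = field(default_factory=dict)


def preflight_problems(job, nodes, test_type):
    """Return a list of violated restrictions; an empty list means none were found."""
    problems = []
    if not job.iterations_normal:
        problems.append("training iterations are not yet normal")
    if not job.process_level_recovery:
        problems.append("process-level recovery is not enabled")
    if job.hot_reset_enabled:
        problems.append("hot reset must be disabled")
    if job.framework == "mindspore" and job.env.get("TASKD_PROCESS_ENABLE") != "on":
        problems.append("TASKD_PROCESS_ENABLE is not set to on")
    for node in nodes:
        if node.labels.get("nodeDEnable") != "on":
            problems.append(f"{node.name}: missing label nodeDEnable=on")
        # The P2P test needs strictly more than 10 GB of idle device memory.
        if test_type == "p2p" and node.idle_device_memory_gb <= 10:
            problems.append(f"{node.name}: P2P needs more than 10 GB of idle device memory")
    return problems
```

Delivering the stress test command only when the returned list is empty keeps the checks in one place. The sketch does not cover the ClusterD restart restriction, which applies while the test is running rather than before it.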

Supported Products and AI Frameworks

Table 1 Products and frameworks supported for online stress testing

Product Type               | Hardware Form                  | Training Framework
Atlas A2 training product  | Atlas 800T A2 training server  | MindSpore, PyTorch
Atlas A3 training product  | Atlas 900 A3 SuperPoD          | MindSpore, PyTorch

Online Stress Testing Principles

Figure 1 Schematic diagram

The details of each step are as follows:

  1. An AI platform integrates ClusterD and calls the gRPC interface of ClusterD to deliver a stress test and specify the node to be tested.
  2. ClusterD instructs MindIO to suspend training.
  3. TaskD Manager instructs the specified TaskD Workers to call the training framework interface to perform stress testing.
  4. The training framework calls the CANN interface on the specified NPU to perform stress testing.
  5. After ClusterD determines that stress testing of the specified NPU is complete, TaskD instructs MindIO to continue with the next training step.
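
The five steps, together with the two outcomes described at the start of this section, can be sketched as a small simulation. This is an illustration of the control flow only: run_online_stress_test, Outcome, and the run_test_on_node callback are hypothetical names, and no real ClusterD, TaskD, MindIO, or CANN interface is called.

```python
from enum import Enum


class Outcome(Enum):
    RESUMED = "training resumed"
    ISOLATED = "faulty node isolated, resumable training triggered"


def run_online_stress_test(target_nodes, run_test_on_node, log=print):
    """Simulate the flow; run_test_on_node(node) returns True if the node passes.

    All names are hypothetical stand-ins for the components in the steps above.
    """
    log("1. AI platform -> ClusterD: deliver stress test for " + ", ".join(target_nodes))
    log("2. ClusterD -> MindIO: suspend training")
    faulty = []
    for node in target_nodes:
        log(f"3. TaskD Manager -> TaskD Worker on {node}: start stress test")
        log(f"4. training framework -> CANN on {node}: run stress test")
        if not run_test_on_node(node):
            faulty.append(node)
    if faulty:
        log("5. fault found: isolate " + ", ".join(faulty) + " and trigger resumable training")
        return Outcome.ISOLATED, faulty
    log("5. stress test complete: MindIO continues with the next training step")
    return Outcome.RESUMED, faulty
```

Passing a list's append method as log captures the sequence of events for inspection instead of printing it.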

Function Adaptation Points

During online stress testing, the framework initializes the MindIO service. After the service is started, the optimizer reports its update status to MindIO. The job is suspended through the graceful suspension mechanism, and a hardware stress test is performed while the job is suspended. After the test is complete, training continues. The cluster brain needs to provide an external interface to receive stress test instructions and manage the stress test process.
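
As a rough sketch of where these adaptation points sit in a training loop, the example below uses a stub service in place of MindIO. StubPauseService, its methods, and the train function are hypothetical and do not reflect the actual MindIO interface; the point is only the ordering: start the service at initialization, report the start and end of each optimizer update, and check for a pause request at the end of the iteration.

```python
class StubPauseService:
    """Hypothetical stand-in for the suspension service; not the MindIO API."""

    def __init__(self):
        self.started = False
        self.optimizer_updating = False
        self.pause_requested = False
        self.events = []

    def start(self):
        self.started = True
        self.events.append("service started")

    def report_optimizer(self, phase):
        # phase is "start" or "end"
        self.optimizer_updating = (phase == "start")
        self.events.append(f"optimizer update {phase}")

    def pause_if_requested(self, run_stress_test):
        # Called at the end of an iteration, so a pause never lands mid-update.
        if self.pause_requested:
            assert not self.optimizer_updating
            self.events.append("paused")
            run_stress_test()
            self.pause_requested = False
            self.events.append("resumed")


def train(service, num_iterations, request_pause_at, run_stress_test):
    service.start()  # startup during framework initialization
    for step in range(num_iterations):
        # Forward and backward passes would run here.
        service.report_optimizer("start")
        # The optimizer step would run here.
        service.report_optimizer("end")
        if step == request_pause_at:
            service.pause_requested = True  # simulates an external stress test request
        service.pause_if_requested(run_stress_test)  # graceful suspension point
    return service.events
```

Checking for the pause request only at the iteration boundary is what makes the suspension graceful in this sketch: the optimizer state is never interrupted partway through an update.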

For non-MindSpeed-LLM/MindCluster users, adapt the functions listed in Table 2.

Table 2 Functions adapted for online stress testing

Function                           | Description                                                                                                      | Adapted Component              | Reference Link
Boot while initialization          | The MindIO service is started while a training framework is initialized.                                         | Distributed training framework | Adapting to non-MindSpeed-LLM Framework
Optimizer update status reporting  | Before optimizer update, the start and end of the update process are reported.                                   | Distributed training framework | Adapting to non-MindSpeed-LLM Framework
Graceful suspension                | The MindIO function is called at the end of the training iteration to implement active suspension.               | Distributed training framework | Adapting to non-MindSpeed-LLM Framework
Online stress testing management   | Used to deliver online stress testing requests and control the suspension and resumption of training processes.  | AI platform                    | See here.