Online Stress Testing
MindCluster supports online stress tests during training. That is, you can call the online stress testing interface to suspend a specified training job and perform hardware P2P or AIC stress tests on the nodes running the job. If no fault exists, training resumes. If a fault exists, the faulty node is isolated and resumable training is triggered.
Restrictions
- For PyTorch, this function must be used with MindSpeed-LLM 2.3.0. For details about the version mapping, see MindSpeed-LLM.
- For MindSpore, this function must be used with MindFormers master. For details about the version mapping, see MindSpore MindFormers.
- Deliver the online stress test command only after training iterations are running normally.
- Ensure that process-level recovery has been enabled.
- Do not restart ClusterD during stress testing. If ClusterD restarts unexpectedly, restart the training job and deliver the stress test again.
- Disable hot reset during stress testing.
- P2P stress testing requires the device to have more than 10 GB of idle memory.
- Add the nodeDEnable=on label to the node to ensure that the node where stress testing is performed can be isolated.
- For MindSpore, set the TASKD_PROCESS_ENABLE environment variable to on (export TASKD_PROCESS_ENABLE=on) before starting TaskD Manager.
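The restrictions above can be turned into a pre-flight check that runs before a stress test is delivered. The following is a minimal sketch, not part of MindCluster: `NodeState`, `JobState`, and `preflight` are hypothetical names, and the caller is assumed to have collected the node labels, idle device memory, and job settings by its own means.

```python
from dataclasses import dataclass, field

MIN_IDLE_MEMORY_GB = 10  # P2P stress testing needs more than 10 GB idle per device


@dataclass
class NodeState:
    """Hypothetical snapshot of one node; not a MindCluster data structure."""
    name: str
    labels: dict = field(default_factory=dict)
    idle_memory_gb: list = field(default_factory=list)  # one entry per device


@dataclass
class JobState:
    """Hypothetical snapshot of the training job and its environment."""
    framework: str                  # "pytorch" or "mindspore"
    iterations_normal: bool         # training iterations are running normally
    process_recovery_enabled: bool  # process-level recovery is enabled
    hot_reset_enabled: bool         # must be False during stress testing
    env: dict = field(default_factory=dict)


def preflight(job, nodes, test_type):
    """Return the list of violated restrictions; an empty list means ready."""
    problems = []
    if not job.iterations_normal:
        problems.append("training iterations are not yet running normally")
    if not job.process_recovery_enabled:
        problems.append("process-level recovery is not enabled")
    if job.hot_reset_enabled:
        problems.append("hot reset must be disabled during stress testing")
    if job.framework == "mindspore" and job.env.get("TASKD_PROCESS_ENABLE") != "on":
        problems.append("TASKD_PROCESS_ENABLE is not set to on")
    for node in nodes:
        if node.labels.get("nodeDEnable") != "on":
            problems.append(f"{node.name}: label nodeDEnable=on is missing")
        if test_type == "p2p":
            # The idle-memory restriction applies to P2P stress testing only.
            for dev, idle in enumerate(node.idle_memory_gb):
                if idle <= MIN_IDLE_MEMORY_GB:
                    problems.append(f"{node.name}: device {dev} has only {idle} GB idle")
    return problems
```

Returning every violation at once, rather than stopping at the first, lets an operator fix all of them before retrying.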
Supported Products and AI Frameworks
| Product Type | Hardware Form | Training Framework |
|---|---|---|
| Atlas 800T A2 training server | | |
| Atlas 900 A3 SuperPoD | | |
Online Stress Testing Principles

The details of each step are as follows:
- An AI platform integrates ClusterD and calls the gRPC interface of ClusterD to deliver a stress test and specify the node to be tested.
- ClusterD instructs MindIO to suspend training.
- TaskD Manager instructs the specified TaskD Workers to call the training framework interface to perform stress testing.
- The training framework calls the CANN interface on the specified NPU to perform stress testing.
- After ClusterD determines that stress testing of the specified NPU is complete, TaskD instructs MindIO to continue training from the next step.
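The steps above, together with the resume-or-isolate outcome described in the overview, can be sketched as a small simulation. This is an illustrative model only: `run_online_stress_test`, `Outcome`, and the `run_hardware_test` callback are hypothetical names, and no real ClusterD, TaskD, MindIO, or CANN interface is called.

```python
from enum import Enum


class Outcome(Enum):
    RESUMED = "training resumed"
    ISOLATED = "faulty node isolated, resumable training triggered"


def run_online_stress_test(nodes, run_hardware_test, log=None):
    """Walk through the suspend -> test -> resume-or-isolate flow.

    nodes: names of the nodes to test.
    run_hardware_test: hypothetical callable, node name -> True if healthy.
    log: optional list that receives one entry per step.
    """
    log = log if log is not None else []
    log.append("platform -> ClusterD: deliver stress test for " + ", ".join(nodes))
    log.append("ClusterD -> MindIO: suspend training")
    faulty = []
    for node in nodes:
        log.append(f"TaskD Manager -> TaskD Worker on {node}: start stress test")
        if not run_hardware_test(node):
            faulty.append(node)
    if faulty:
        # A fault was found: the node is isolated instead of resuming in place.
        log.append("isolate: " + ", ".join(faulty))
        return Outcome.ISOLATED, faulty
    log.append("TaskD -> MindIO: continue training")
    return Outcome.RESUMED, faulty
```

For example, `run_online_stress_test(["node-0"], lambda node: True)` returns `(Outcome.RESUMED, [])`, while a callback that reports a failure for any node yields `Outcome.ISOLATED` with that node listed.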
Function Adaptation Points
During online stress testing, the framework initializes the MindIO service. After the service starts, the optimizer reports its update status to MindIO. The job is suspended through the graceful suspension mechanism, a hardware stress test is performed, and training continues after the test is complete. The cluster brain must provide an external interface to receive stress test instructions and manage the stress test process.
Users who do not use MindSpeed-LLM or MindCluster must adapt the functions listed in Table 2.
| Function | Description | Adapted Component | Reference Link |
|---|---|---|---|
| Startup during initialization | The MindIO service is started when the training framework is initialized. | Distributed training framework | |
| Optimizer update status reporting | The start and end of each optimizer update are reported. | | |
| Graceful suspension | The MindIO function is called at the end of a training iteration to actively suspend training. | | |
| Online stress testing management | Delivers online stress testing requests and controls the suspension and resumption of training processes. | AI platform | |
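To show where the three framework-side adaptation points sit in a training loop, the sketch below uses a stand-in class. `PauseController` and `train` are hypothetical and do not reproduce the MindIO API; the assumption is only that the service is started at initialization, that each optimizer update is bracketed by start and end reports, and that the pause check runs at the end of an iteration.

```python
import threading


class PauseController:
    """Hypothetical stand-in for the MindIO service hooks; not the MindIO API."""

    def __init__(self):
        self._pause_requested = threading.Event()
        self._resume = threading.Event()
        self.events = []

    def start(self):
        # Adaptation point 1: start the service during framework initialization.
        self.events.append("service_started")

    def report_optimizer_update(self, phase, step):
        # Adaptation point 2: report the start and end of each optimizer update.
        self.events.append(f"optimizer_{phase}:{step}")

    def request_pause(self):
        self._resume.clear()
        self._pause_requested.set()

    def resume(self):
        self._pause_requested.clear()
        self._resume.set()

    def pause_if_requested(self, step, on_paused=None):
        # Adaptation point 3: suspend only at the end of an iteration.
        if not self._pause_requested.is_set():
            return
        self.events.append(f"paused:{step}")
        if on_paused is not None:
            on_paused()      # e.g. run the stress test, then call resume()
        self._resume.wait()  # block until resume() is called
        self.events.append(f"resumed:{step}")


def train(controller, num_steps, optimizer_step, on_paused=None):
    controller.start()
    for step in range(num_steps):
        controller.report_optimizer_update("start", step)
        optimizer_step(step)
        controller.report_optimizer_update("end", step)
        controller.pause_if_requested(step, on_paused)
```

Checking for a pause only after the optimizer update has finished means training is never suspended in the middle of a parameter update, which is the point of a graceful suspension.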