Hardware Stress Test Case
Case: In a cluster of nearly 5,000 cards, the loss does not align with the expected curve, and numerous spikes appear in grad_norm.
For a large-scale cluster, run hardware stress tests first to check for faulty nodes.
The 4,800 cards are divided into 100 groups of 48 cards (3 x 16 cards), and each group runs the same training task. With fixed random seeds and deterministic computing enabled, every group should produce the same loss curve, so any group whose final loss curve deviates from the others is abnormal. If an abnormal group is found, run the ascend-dmi stress test on that group.
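The grouping and comparison step can be sketched as follows. This is a minimal illustration, not the tooling used in this case: the function names `partition_cards` and `find_abnormal_groups`, the tolerance `tol`, and the use of a per-step median as the reference curve are all assumptions. It presumes that each group's loss values have already been collected into a list per group.

```python
import math
from statistics import median


def partition_cards(num_cards, group_size):
    """Split card indices 0..num_cards-1 into consecutive, equally sized groups."""
    if group_size <= 0 or num_cards % group_size != 0:
        raise ValueError("num_cards must be a positive multiple of group_size")
    return [list(range(start, start + group_size))
            for start in range(0, num_cards, group_size)]


def find_abnormal_groups(loss_curves, tol=1e-6):
    """Return the IDs of groups whose loss curve deviates from the majority.

    loss_curves maps a group ID to its list of per-step loss values.
    tol is a hypothetical tolerance; with fully deterministic computing
    the healthy curves are expected to coincide.
    """
    lengths = {len(curve) for curve in loss_curves.values()}
    if len(lengths) != 1:
        raise ValueError("all loss curves must cover the same number of steps")
    num_steps = lengths.pop()

    # Build the reference only from curves without NaN/inf values.
    finite = [c for c in loss_curves.values()
              if all(math.isfinite(v) for v in c)]
    if not finite:
        return sorted(loss_curves)
    # Per-step median across groups: robust as long as most groups are healthy.
    reference = [median(c[s] for c in finite) for s in range(num_steps)]

    abnormal = []
    for group_id in sorted(loss_curves):
        curve = loss_curves[group_id]
        if not all(math.isfinite(v) for v in curve):
            abnormal.append(group_id)  # NaN or inf loss is always abnormal
            continue
        max_dev = max((abs(a - b) for a, b in zip(curve, reference)),
                      default=0.0)
        if max_dev > tol:
            abnormal.append(group_id)
    return abnormal
```

Calling `partition_cards(4800, 48)` yields the 100 groups described above. The median reference only works when faulty groups are a minority, which is the premise of this kind of screening.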
Run the `ascend-dmi -dg -i aicore -s -sc 60 -q` command to perform a stress test on the machine and view the fault detection result.
| Command Output | Meaning |
|---|---|
| PASS | The stress test is passed, and the result is normal. |
| SKIP | The current device does not support P2P stress tests. |
| EMERGENCY_WARN | Emergency warning. The stress test fails. Contact Huawei engineers to replace the hardware. |
| FAIL | The P2P stress test fails. Contact Huawei technical support. |
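When many nodes are tested, the result codes in the table can be triaged with a small lookup. The sketch below assumes the result code for each node has already been extracted from the command output by some other means, because the exact output format is not described here. The names `RESULT_MEANINGS`, `FAULTY_RESULTS`, and `nodes_to_exclude` and the node labels are hypothetical, and treating an unrecognized code as faulty is a conservative choice of this sketch.

```python
# Result codes and meanings, taken from the table above.
RESULT_MEANINGS = {
    "PASS": "The stress test is passed, and the result is normal.",
    "SKIP": "The current device does not support P2P stress tests.",
    "EMERGENCY_WARN": "Emergency warning. The stress test fails. "
                      "Contact Huawei engineers to replace the hardware.",
    "FAIL": "The P2P stress test fails. Contact Huawei technical support.",
}

# Codes that indicate a failed stress test.
FAULTY_RESULTS = {"EMERGENCY_WARN", "FAIL"}


def nodes_to_exclude(results):
    """Return the sorted node names that should be excluded from training.

    results maps a node name to the result code reported for that node.
    Unrecognized codes are treated as faulty (an assumption of this sketch).
    """
    exclude = []
    for node, status in sorted(results.items()):
        if status in FAULTY_RESULTS or status not in RESULT_MEANINGS:
            exclude.append(node)
    return exclude
```

The returned list can then be used to build the node list for the next training run.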
The detection results show that faulty nodes exist. After the faulty nodes are excluded, the accuracy returns to normal, as shown in the following figures.
- After the hardware fault is rectified, the loss spikes disappear.
Figure 3 Loss spikes disappeared
- After the hardware fault is rectified, the grad_norm spikes are significantly reduced.
Figure 4 grad_norm spikes reduced
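To compare runs before and after the fix numerically rather than by eye, spikes in a logged loss or grad_norm series can be counted. The following is a rough sketch under stated assumptions: the function name `count_spikes` is hypothetical, and the definition of a spike (a value more than `ratio` times the median of the preceding `window` steps) is one possible heuristic, not the criterion used to produce the figures.

```python
from statistics import median


def count_spikes(values, window=20, ratio=3.0):
    """Count steps whose value exceeds ratio times the trailing median.

    values is a per-step series such as grad_norm or loss. The first
    `window` steps are used only as a baseline and are never counted.
    """
    spikes = 0
    for i in range(window, len(values)):
        baseline = median(values[i - window:i])
        if baseline > 0 and values[i] > ratio * baseline:
            spikes += 1
    return spikes
```

Applying the same `window` and `ratio` to both runs gives comparable counts. The thresholds need tuning for each model and training phase.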