(Optional) Creating a Fault
This section describes how to construct simple faults, including node faults, parameter plane network faults, and service plane faults.
Constructing a processor fault may cause security risks. Contact Huawei technical support to perform this operation.
Constructing a Node Fault
Restart the training node to simulate node status loss caused by node power-off. This fault can be automatically rectified after the node is restarted.
- After an iteration completes in a normally running training job, log in to the node where the training job is running.
- Run the following command to restart the training node to simulate a node status loss fault:
reboot
- Run the following command on the master node several times to check the pod status:
kubectl get pod -A
The pod status changes from Terminating to Pending and finally to Running, indicating that the training job has been restarted.
- Run the following command on the master node to view the training logs and record the time when the training is successfully resumed:
kubectl logs -n Namespace_name Pod_name
The sample output below shows that the latest checkpoint file from the ninth iteration is used to resume training at the tenth iteration after a fault occurs.
[2025-06-22 14:47:00] iteration 10/ 5000 | consumed samples: 640 | elapsed time per iteration (ms): 1932.5 | learning rate: 2.500000E-07 | global batch size: 64 | lm loss: 1.053084E+01 | loss scale: 1.0 | grad norm: 56.739 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-06-22 14:47:02] iteration 11/ 5000 | consumed samples: 704 | elapsed time per iteration (ms): 1981.0 | learning rate: 2.750000E-07 | global batch size: 64 | lm loss: 1.044677E+01 | loss scale: 1.0 | grad norm: 57.590 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
......
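Repeatedly running the pod status command by hand can be replaced with a small polling loop. The following is a minimal sketch, not part of the product tooling: pod_status and wait_for_running are hypothetical helper names, and the sketch assumes that the restarted pod keeps the same name and that the standard kubectl get column layout (NAMESPACE, NAME, READY, STATUS) applies.

```shell
# Sketch only: pod_status and wait_for_running are hypothetical helper names.

# Read `kubectl get pod -A --no-headers` output on stdin and print the STATUS
# column (4th field: NAMESPACE NAME READY STATUS ...) of the pod named $1.
pod_status() {
    awk -v pod="$1" '$2 == pod { print $4 }'
}

# Poll every $2 seconds (default 5) until pod $1 reports Running, printing a
# timestamp with each observation so that the recovery time can be recorded.
wait_for_running() {
    wfr_pod="$1"
    wfr_interval="${2:-5}"
    while :; do
        wfr_status="$(kubectl get pod -A --no-headers | pod_status "$wfr_pod")"
        echo "$(date '+%Y-%m-%d %H:%M:%S') ${wfr_pod}: ${wfr_status:-not found}"
        [ "$wfr_status" = "Running" ] && return 0
        sleep "$wfr_interval"
    done
}

# Usage (Pod_name is a placeholder): wait_for_running Pod_name 5
```

Start the loop only after the pod has left the Running state; otherwise it returns immediately. A Running pod does not by itself prove that training has resumed, so the training logs still need to be checked.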
Constructing a Parameter Plane Network Fault
A parameter plane network fault can be simulated by disconnecting the NPU network link. NPU network faults do not affect single-server training jobs. After the link is disconnected, you need to manually restore the link. Otherwise, the fault persists.
- After an iteration completes in a normally running training job, log in to the node where the training job is running.
- Run the following command to create the NPU network link fault:
hccn_tool -i {device_id} -link -s down
device_id indicates the NPU ID. You can run the npu-smi info command to view the NPU ID.
- Run the following command to check the status of the NPU network link:
hccn_tool -i {device_id} -net_health -g
If the following information is displayed, the NPU network link fault is successfully created.
net health status: Fault
- Run the following command on the master node several times to check the pod status:
kubectl get pod -A
The pod status changes from Terminating to Pending and finally to Running, indicating that the training job has been restarted.
- Run the following command on the master node to view the training logs and record the time when the training is successfully resumed:
kubectl logs -n Namespace_name Pod_name
The sample output below shows that the latest checkpoint file from the ninth iteration is used to resume training at the tenth iteration after a fault occurs.
[2025-06-22 14:47:00] iteration 10/ 5000 | consumed samples: 640 | elapsed time per iteration (ms): 1932.5 | learning rate: 2.500000E-07 | global batch size: 64 | lm loss: 1.053084E+01 | loss scale: 1.0 | grad norm: 56.739 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-06-22 14:47:02] iteration 11/ 5000 | consumed samples: 704 | elapsed time per iteration (ms): 1981.0 | learning rate: 2.750000E-07 | global batch size: 64 | lm loss: 1.044677E+01 | loss scale: 1.0 | grad norm: 57.590 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
......
- Run the following command to rectify the NPU network link fault:
hccn_tool -i {device_id} -cfg recovery
- Run the following command to check the status of the NPU network link:
hccn_tool -i {device_id} -net_health -g
If the following information is displayed, the NPU network link fault is rectified.
net health status: Success
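When several NPUs are involved, the link status check can be scripted. The sketch below is illustrative only: net_health_ok and check_links are hypothetical helper names, and the matching relies solely on the two status strings shown in this section (net health status: Success and net health status: Fault). Any other output is reported as not healthy.

```shell
# Sketch only: net_health_ok and check_links are hypothetical helper names.

# Succeed (exit status 0) only if the text on stdin reports a healthy link.
net_health_ok() {
    grep -q 'net health status: Success'
}

# Query each device ID passed as an argument and print one summary line per
# device. Anything other than "Success" is treated as faulty or unknown.
check_links() {
    for cl_dev in "$@"; do
        if hccn_tool -i "$cl_dev" -net_health -g | net_health_ok; then
            echo "device ${cl_dev}: link healthy"
        else
            echo "device ${cl_dev}: link faulty or status unknown"
        fi
    done
}

# Usage (device IDs are examples; use the IDs reported by npu-smi info):
#   check_links 0 1
```

Running check_links once after creating the fault and once after the recovery command gives a quick before-and-after comparison for each device.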
Constructing a Service Plane Fault
A service plane fault can be simulated by killing the training process.
- After an iteration completes in a normally running training job, log in to the node where the training job is running.
- Run the following command to query the training processes by the name of the training startup script:
ps -ef | grep python | grep Training_startup_script.py
- Run the following command to kill the training process with the smallest PID:
kill -9 pid
- Run the following command on the master node several times to check the pod status:
kubectl get pod -A
The pod status changes from Terminating to Pending and finally to Running, indicating that the training job has been restarted.
- Run the following command on the master node to view the training logs and record the time when the training is successfully resumed:
kubectl logs -n Namespace_name Pod_name
The sample output below shows that the latest checkpoint file from the ninth iteration is used to resume training at the tenth iteration after a fault occurs.
[2025-06-22 14:47:00] iteration 10/ 5000 | consumed samples: 640 | elapsed time per iteration (ms): 1932.5 | learning rate: 2.500000E-07 | global batch size: 64 | lm loss: 1.053084E+01 | loss scale: 1.0 | grad norm: 56.739 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-06-22 14:47:02] iteration 11/ 5000 | consumed samples: 704 | elapsed time per iteration (ms): 1981.0 | learning rate: 2.750000E-07 | global batch size: 64 | lm loss: 1.044677E+01 | loss scale: 1.0 | grad norm: 57.590 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
......
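Selecting the training process with the smallest PID can also be done without reading the ps output by eye. The following is a minimal sketch under the assumption that the input is standard ps -ef output, where the PID is the second column. smallest_pid is a hypothetical helper name, and Training_startup_script.py remains a placeholder for the actual script name.

```shell
# Sketch only: smallest_pid is a hypothetical helper name.

# Read `ps -ef` lines on stdin (columns: UID PID PPID ...) and print the
# numerically smallest PID, or nothing if there is no input.
smallest_pid() {
    awk '{ print $2 }' | sort -n | head -n 1
}

# Usage (Training_startup_script.py is a placeholder):
#   pid="$(ps -ef | grep python | grep Training_startup_script.py | smallest_pid)"
#   [ -n "$pid" ] && kill -9 "$pid"
```

Checking that the variable is non-empty before calling kill avoids running kill -9 with no argument when no training process matches.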