(Optional) Creating a Fault

This section describes how to construct simple faults, including node faults, parameter plane network faults, and service plane faults.

Constructing a processor fault may cause security risks. Contact Huawei technical support to perform this operation.

Constructing a Node Fault

Restart the training node to simulate the loss of node status caused by a node power-off. This fault is rectified automatically after the node restarts.

  1. After an iteration of a running training job completes, log in to the node on which the training job is running.
  2. Run the following command to restart the training node to simulate a node status loss fault:
    reboot
  3. Run the following command on the master node multiple times to check the pod status:
    kubectl get pod -A

    The pod status changes from Terminating to Pending and finally to Running, indicating that the training job has been restarted.

  4. Run the following command on the master node to view the training logs and record the time when the training is successfully resumed:
    kubectl logs -n Namespace_name Pod_name
    The sample output below shows that the latest checkpoint file from the ninth iteration is used to resume training at the tenth iteration after a fault occurs.
    [2025-06-22 14:47:00] iteration       10/    5000 | consumed samples:          640 | elapsed time per iteration (ms): 1932.5 | learning rate: 2.500000E-07 | global batch size:    64 | lm loss: 1.053084E+01 | loss scale: 1.0 | grad norm: 56.739 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
    [2025-06-22 14:47:02] iteration       11/    5000 | consumed samples:          704 | elapsed time per iteration (ms): 1981.0 | learning rate: 2.750000E-07 | global batch size:    64 | lm loss: 1.044677E+01 | loss scale: 1.0 | grad norm: 57.590 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
    ......
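
    Recording the resume time in step 4 can be scripted. The following is a minimal Python sketch, not part of the product: it assumes that the text returned by kubectl logs has been saved to a string and that progress lines follow the format of the sample output above. The function name first_resumed_iteration and the abbreviated sample text are hypothetical.

```python
import re
from datetime import datetime

# Matches progress lines shaped like the sample output:
# "[YYYY-MM-DD hh:mm:ss] iteration   <n>/   <total> | ..."
PROGRESS_RE = re.compile(
    r"^\s*\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s+"
    r"iteration\s+(?P<it>\d+)/\s*(?P<total>\d+)"
)


def first_resumed_iteration(log_text):
    """Return (timestamp, iteration) of the first progress line, or None."""
    for line in log_text.splitlines():
        match = PROGRESS_RE.match(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            return ts, int(match.group("it"))
    return None


# Abbreviated copy of the sample output (trailing fields omitted).
sample = """\
[2025-06-22 14:47:00] iteration       10/    5000 | consumed samples:          640 |
[2025-06-22 14:47:02] iteration       11/    5000 | consumed samples:          704 |
"""

result = first_resumed_iteration(sample)
if result is not None:
    resumed_at, iteration = result
    print(f"Training resumed at iteration {iteration} on {resumed_at}")
```

    The sketch reports only the first progress line it finds, so it should be given the logs of the restarted pod rather than logs that also cover the period before the fault.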
    

Constructing a Parameter Plane Network Fault

A parameter plane network fault can be simulated by disconnecting the NPU network link. NPU network faults do not affect single-server training jobs. After the link is disconnected, you need to manually restore the link. Otherwise, the fault persists.

  1. After an iteration of a running training job completes, log in to the node on which the training job is running.
  2. Run the following command to create the NPU network link fault:
    hccn_tool -i {device_id} -link -s down

    device_id indicates the NPU ID. You can run the npu-smi info command to view the NPU ID.

  3. Run the following command to check the status of the NPU network link:
    hccn_tool -i {device_id} -net_health -g
    If the following information is displayed, the NPU network link fault is successfully created.
    net health status: Fault
    
  4. Run the following command on the master node multiple times to check the pod status:
    kubectl get pod -A

    The pod status changes from Terminating to Pending and finally to Running, indicating that the training job has been restarted.

  5. Run the following command on the master node to view the training logs and record the time when the training is successfully resumed:
    kubectl logs -n Namespace_name Pod_name
    The sample output below shows that the latest checkpoint file from the ninth iteration is used to resume training at the tenth iteration after a fault occurs.
    [2025-06-22 14:47:00] iteration       10/    5000 | consumed samples:          640 | elapsed time per iteration (ms): 1932.5 | learning rate: 2.500000E-07 | global batch size:    64 | lm loss: 1.053084E+01 | loss scale: 1.0 | grad norm: 56.739 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
    [2025-06-22 14:47:02] iteration       11/    5000 | consumed samples:          704 | elapsed time per iteration (ms): 1981.0 | learning rate: 2.750000E-07 | global batch size:    64 | lm loss: 1.044677E+01 | loss scale: 1.0 | grad norm: 57.590 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
    ......
    
  6. Run the following command to rectify the NPU network link fault:
    hccn_tool -i {device_id} -cfg recovery
  7. Run the following command to check the status of the NPU network link:
    hccn_tool -i {device_id} -net_health -g
    If the following information is displayed, the NPU network link fault is rectified.
    net health status: Success
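
    The checks in steps 3 and 7 can be automated by reading the net health status line. The following is a minimal Python sketch, not part of the product: it assumes that the output of the hccn_tool -net_health query has already been captured as text, and it handles only the two values shown in this section (Fault and Success). The function names parse_net_health and link_fault_present are hypothetical.

```python
def parse_net_health(output):
    """Return the value after 'net health status:', or None if the line is absent."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "net health status":
            return value.strip()
    return None


def link_fault_present(output):
    """Return True for 'Fault' and False for 'Success'; reject anything else.

    Assumption: only the two status values shown in this section are handled.
    """
    status = parse_net_health(output)
    if status == "Fault":
        return True
    if status == "Success":
        return False
    raise ValueError(f"unexpected net health status: {status!r}")


# Usage with the two outputs shown in this section.
print(link_fault_present("net health status: Fault"))
print(link_fault_present("net health status: Success"))
```

    Raising an error for any other value is a deliberate choice: a status that this section does not document should be inspected manually rather than silently treated as healthy or faulty.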
    

Constructing a Service Plane Fault

A service plane fault can be simulated by killing the training process.

  1. After an iteration of a running training job completes, log in to the node on which the training job is running.
  2. Run the following command to query the training processes started by the training startup script:
    ps -ef | grep python | grep Training_startup_script.py

    Training_startup_script indicates the name of the training startup script.

  3. Run the following command to forcibly terminate the training process with the smallest PID:
    kill -9 pid

    pid indicates the process ID obtained in the previous step.

  4. Run the following command on the master node multiple times to check the pod status:
    kubectl get pod -A

    The pod status changes from Terminating to Pending and finally to Running, indicating that the training job has been restarted.

  5. Run the following command on the master node to view the training logs and record the time when the training is successfully resumed:
    kubectl logs -n Namespace_name Pod_name

    The sample output below shows that the latest checkpoint file from the ninth iteration is used to resume training at the tenth iteration after a fault occurs.

    [2025-06-22 14:47:00] iteration       10/    5000 | consumed samples:          640 | elapsed time per iteration (ms): 1932.5 | learning rate: 2.500000E-07 | global batch size:    64 | lm loss: 1.053084E+01 | loss scale: 1.0 | grad norm: 56.739 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
    [2025-06-22 14:47:02] iteration       11/    5000 | consumed samples:          704 | elapsed time per iteration (ms): 1981.0 | learning rate: 2.750000E-07 | global batch size:    64 | lm loss: 1.044677E+01 | loss scale: 1.0 | grad norm: 57.590 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
    ......
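
    Selecting the process with the smallest PID (steps 2 and 3) can be scripted instead of read off the screen. The following is a minimal Python sketch, not part of the product: it assumes the standard ps -ef column layout (UID PID PPID C STIME TTY TIME CMD) and that the output has been captured as text. The function name smallest_pid, the script name train.py, and the sample process list are hypothetical.

```python
def smallest_pid(ps_output, script_name):
    """Return the smallest PID among python processes running script_name, or None."""
    pids = []
    for line in ps_output.splitlines():
        # Split into at most 8 fields; the 8th field is the full command line.
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header line or malformed line
        command = fields[7]
        if "python" in command and script_name in command:
            pids.append(int(fields[1]))
    return min(pids) if pids else None


# Hypothetical `ps -ef` output for illustration only.
sample_ps = """\
UID          PID    PPID  C STIME TTY          TIME CMD
root        4022    3990 99 14:40 ?        00:10:11 python train.py --rank 1
root        4021    3990 99 14:40 ?        00:10:12 python train.py --rank 0
root        5100       1  0 14:41 ?        00:00:00 python monitor.py
"""

print(smallest_pid(sample_ps, "train.py"))
```

    The returned value is the PID to pass to kill -9 in step 3. The sketch only selects the PID and does not terminate anything.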