Creating a Fault
This section describes how to create a fault.
(Optional) Creating an NPU Fault
You can simulate a parameter plane network fault by disconnecting the NPU network link. NPU network faults do not affect single-server training jobs. After disconnecting the link, you must restore it manually; otherwise, the fault persists.
- Log in to a compute node.
- Run the following command to create an NPU network link fault:
hccn_tool -i {device_id} -link -s down
device_id indicates the NPU ID. You can run the npu-smi info command to view the NPU ID.
- Run the following command to check the status of the NPU network link:
hccn_tool -i {device_id} -net_health -g
If the following information is displayed, the NPU network link fault is successfully created.
net health status: Fault
- Run the following command to rectify the NPU network link fault:
hccn_tool -i {device_id} -cfg recovery
- Run the following command to check the status of the NPU network link:
hccn_tool -i {device_id} -net_health -g
If the following information is displayed, the NPU network link fault is rectified.
net health status: Success
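The two status checks in the steps above can be scripted instead of read by eye. The following is a minimal sketch, not part of the product tooling: the helper names net_health_status and check_net_health are hypothetical. It assumes the hccn_tool output contains a line of the form "net health status: <Status>", as shown in the steps above; any other output format is not handled.

```shell
#!/bin/sh
# Hypothetical helper: read `hccn_tool -i <device_id> -net_health -g` output
# on stdin and print only the status word (for example, Fault or Success).
# Assumes a line of the form "net health status: <Status>"; prints nothing
# if no such line is found.
net_health_status() {
    sed -n 's/^.*net health status:[[:space:]]*\([A-Za-z]*\).*$/\1/p' | head -n 1
}

# Hypothetical helper: succeed (exit code 0) only if the status reported for
# the given device matches the expected one.
#   $1 = device_id, $2 = expected status (Fault or Success)
check_net_health() {
    nh_status=$(hccn_tool -i "$1" -net_health -g | net_health_status)
    [ "$nh_status" = "$2" ]
}
```

For example, after disconnecting the link on device 0, running check_net_health 0 Fault should succeed; after the recovery command, check_net_health 0 Success should succeed. The device ID 0 is only an illustration; use the NPU ID reported by npu-smi info.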