Creating a Fault

This section describes how to create a fault.

(Optional) Creating an NPU Fault

A parameter plane network fault can be simulated by disconnecting an NPU network link. NPU network faults do not affect single-server training jobs. The link is not restored automatically: after disconnecting it, you must restore it manually, or the fault persists.

  1. Log in to a compute node.
  2. Run the following command to create an NPU network link fault:
    hccn_tool -i {device_id} -link -s down

    device_id indicates the NPU ID. You can run the npu-smi info command to view the NPU ID.

  3. Run the following command to check the status of the NPU network link:
    hccn_tool -i {device_id} -net_health -g
    If the following information is displayed, the NPU network link fault is successfully created.
    net health status: Fault
    
  4. Run the following command to rectify the NPU network link fault:
    hccn_tool -i {device_id} -cfg recovery
  5. Run the following command to check the status of the NPU network link:
    hccn_tool -i {device_id} -net_health -g
    If the following information is displayed, the NPU network link fault is rectified.
    net health status: Success
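
The steps above can be sketched as a small shell helper. This is a minimal sketch, not a supported tool: the function names parse_net_health and simulate_link_fault are hypothetical, and the parsing assumes that the -net_health -g output contains a line of the form "net health status: <Status>", as shown in the steps above. Only the hccn_tool commands already listed in this section are used.

```shell
#!/bin/sh

# Hypothetical helper: read `hccn_tool ... -net_health -g` output on stdin and
# print only the status word (for example, Fault or Success).
# Assumes a line of the form "net health status: <Status>".
parse_net_health() {
    sed -n 's/^.*net health status:[[:space:]]*\(.*\)$/\1/p' | head -n 1
}

# Hypothetical wrapper: disconnect the link, report the status, restore the
# link, and report the status again. $1 is the NPU ID (device_id).
simulate_link_fault() {
    device_id="$1"

    # Step 2: create the NPU network link fault.
    hccn_tool -i "$device_id" -link -s down || return 1

    # Step 3: check the link status (Fault is expected here).
    status=$(hccn_tool -i "$device_id" -net_health -g | parse_net_health)
    echo "after link down: $status"

    # Step 4: restore the link.
    hccn_tool -i "$device_id" -cfg recovery || return 1

    # Step 5: check the link status again (Success is expected here).
    status=$(hccn_tool -i "$device_id" -net_health -g | parse_net_health)
    echo "after recovery: $status"

    # Return success only if the link reports Success after recovery.
    [ "$status" = "Success" ]
}
```

On a compute node, the wrapper could be called as simulate_link_fault 0, where 0 stands for an NPU ID reported by npu-smi info. The sketch does not wait or retry between steps, so if the reported status does not change immediately, a delay before each check may be needed.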