Diagnosing Faults on a Single Server
- Create a directory for storing the single-server diagnosis result.
mkdir Single-server_diagnosis_result_output_directory
- Run the following command to start the diagnosis. By default, single-server diagnosis returns the data of the fault event module.
ascend-fd single-diag -i Collection_directory -o Single-server_diagnosis_result_output_directory
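The two steps above can also be scripted. The following is a minimal Python sketch, assuming that ascend-fd is available on the PATH. The function names are hypothetical; only the single-diag subcommand and the -i and -o options are taken from the command above.

```python
import subprocess
from pathlib import Path


def build_single_diag_command(collection_dir, output_dir):
    """Return the argument list for a single-server diagnosis run."""
    return ["ascend-fd", "single-diag", "-i", str(collection_dir), "-o", str(output_dir)]


def run_single_diag(collection_dir, output_dir):
    """Create the output directory, then start the diagnosis."""
    # Like the mkdir step, but also creates missing parent directories
    # and does not fail if the directory already exists.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if ascend-fd exits with a nonzero status.
    return subprocess.run(build_single_diag_command(collection_dir, output_dir), check=True)
```

Calling run_single_diag with the collection directory and the output directory is then equivalent to running the two commands by hand.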
The following information is displayed when a training job exits abnormally during diagnosis:
The single-diag job starts. Please wait. Job id: [****], run log file is [****].
+-----------------------------------------------------------------------------------------------------------+
| Ascend Fault-Diag Report |
+----------------------+-----------------------+------------------------------------------------------------+
| Version              | Type                  | Version |
+----------------------+-----------------------+------------------------------------------------------------+
|                      | Fault-Diag            | 7.3.0 |
+----------------------+-----------------------+------------------------------------------------------------+
| Fault event analysis | Type                  | Description |
+----------------------+-----------------------+------------------------------------------------------------+
|                      | Note                  | 1. Multiple faults are diagnosed and sorted by occurrence time. Check the faults that are occurred earlier. |
|                      |                       | 2. Only 16 faulty devices are displayed. All faulty devices can be queried in the diag_report.json file. |
+----------------------+-----------------------+------------------------------------------------------------+
|                      | Status code           | xxx |
|                      | Fault type            | Type: Network Component: Network Module: Network |
|                      | Faulty device         | ['worker-0 device-2'] |
|                      | Fault name            | Link Down: NPU intermittent disconnection |
|                      | Fault description     | The link of an NPU network port on the server is down for more than 30s. |
|                      | Solution              | 1. Contact physical network O&M personnel to collect switch logs and check whether hardware faults occur (for example, whether optical modules work properly and whether switch links are intermittently disconnected). |
|                      | Key log               | /usr/local/Ascend/driver/tools/hccn_tool -i 2 -link_stat -g |
|                      |                       | [devid 2]current time : Fri Sep 1 06:37:26 2023 |
|                      |                       | [devid 2]link up count : 2 |
|                      |                       | [devid 2]link change records : |
|                      |                       | [devid 2] Fri Sep 1 06:34:43 2023 LINK DOWN |
|                      |                       | [devid 2] Thu Aug 31 07:30:46 2023 LINK UP |
|                      |                       | [devid 2] Thu Aug 31 07:30:44 2023 LINK DOWN |
|                      |                       | [devid 2] Thu Aug 31 07:30:43 2023 LINK UP |
|                      | Key fault propagation | ['worker-0'] |
|                      |                       | Fault code 1 (Link Down: NPU intermittent disconnection error) -> Fault code 2 (excessive RDMA retransmissions) -> Fault code 3 (notify wait timeout) |
+----------------------+-----------------------+------------------------------------------------------------+
The diag job is complete.
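The link change records in the key log show why this fault is reported as a link that stayed down for more than 30s. The following Python sketch computes the length of each LINK DOWN period from log lines in the format shown above. It is an illustration only: the function and constant names are hypothetical, and it assumes that all records belong to one device, that UP and DOWN records alternate, and that a link still down at the end is measured up to the reported current time.

```python
import re
from datetime import datetime

TIME_FORMAT = "%a %b %d %H:%M:%S %Y"
DATE = r"\w{3}\s+\w{3}\s+\d+\s+[\d:]+\s+\d{4}"
RECORD = re.compile(r"\[devid \d+\]\s*(" + DATE + r")\s+LINK (UP|DOWN)")
CURRENT = re.compile(r"\[devid \d+\]\s*current time\s*:\s*(" + DATE + r")")


def parse_time(text):
    """Parse a timestamp such as 'Fri Sep 1 06:37:26 2023'."""
    # Collapse repeated spaces so single-digit days parse the same way.
    return datetime.strptime(" ".join(text.split()), TIME_FORMAT)


def link_down_seconds(log_text):
    """Return the length in seconds of each LINK DOWN period, oldest first."""
    now = None
    events = []
    for line in log_text.splitlines():
        match = CURRENT.search(line)
        if match:
            now = parse_time(match.group(1))
            continue
        match = RECORD.search(line)
        if match:
            events.append((parse_time(match.group(1)), match.group(2)))
    events.sort()  # the log lists the newest record first
    durations = []
    for index, (when, state) in enumerate(events):
        if state != "DOWN":
            continue
        # A DOWN period ends at the next record, or at "current time" if none follows.
        end = events[index + 1][0] if index + 1 < len(events) else now
        if end is not None:
            durations.append((end - when).total_seconds())
    return durations
```

For the key log in the example output, this returns [2.0, 163.0]: the first disconnection lasted 2 seconds, and the second was still in effect 163 seconds later when the log was collected, which exceeds the 30s threshold in the fault description.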
The following table describes key parameters in the command output.

Table 1 Parameter description

| Level-1 Parameter | Level-2 Parameter | Description |
|---|---|---|
| Fault event analysis | - | Analyzes the root cause of the fault on the device where the root cause node is located. |
| - | Status code | If a fault is diagnosed, the specific fault code is displayed. If no fault is diagnosed, NORMAL OR UNSUPPORTED is displayed. |
| - | Fault name | Specific fault name. |
| - | Fault type | Fault type, and the component and module where the fault occurs. |
| - | Faulty device | Device where the fault occurs. |
| - | Fault description | Detailed description of the fault. |
| - | Solution | Handling suggestions for the fault. |
| - | Key log | Key logs of the fault. |
| - | Key fault propagation | Displays the longest fault propagation chain. |
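To show how the level-2 parameters map onto rows of the console report, the following Python sketch groups the third-column text of each row under its level-2 parameter name, treating rows with an empty second column as continuation lines. The function name is hypothetical, and the sketch assumes three-column rows delimited by | characters with no | inside a cell. Screen-scraping the console is fragile, so for complete results the diag_report.json file remains the better source.

```python
def parse_report_rows(report_text):
    """Group third-column values by the level-2 parameter in the second column."""
    fields = {}
    current = None
    for line in report_text.splitlines():
        line = line.strip()
        # Skip border lines (+---+) and anything that is not a table row.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if len(cells) != 3:
            continue  # for example, the single-cell report title row
        _, name, value = cells
        if name:
            # A non-empty second column starts a new parameter.
            current = name
            fields.setdefault(current, [])
        if current is not None and value:
            fields[current].append(value)
    return fields
```

Applied to the example report, fields["Status code"] holds the fault code and fields["Key log"] holds the hccn_tool command followed by its output lines.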
Notes:
- During single-server diagnosis, fault events in all valid logs on the node are scanned. If results of fault event analysis are displayed in the command output, the current fault may cause the training or inference job to exit abnormally.
After the diagnosis is complete, you can perform optimization based on the recommended solution in the single-server diagnosis result.
Single-server_diagnosis_result_output_directory
├── fault_diag_result
    ├── diag_report.json    # Diagnosis result
- If an error occurs during single-server fault diagnosis, the description (or analysis failure) field of the fault event analysis in the command output displays the failure information. To view all exception information, see the diag_report.json file.
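Because the console output is truncated (for example, only 16 faulty devices are displayed), reading diag_report.json directly is often more convenient. The structure of that file is not described here, so the following Python sketch makes no assumption about its schema: it loads the file and collects every value stored under a given key at any nesting depth. The function names are hypothetical, and the path assumes that diag_report.json is located in the fault_diag_result subdirectory of the output directory.

```python
import json
from pathlib import Path


def load_report(output_dir):
    """Load diag_report.json from the fault_diag_result subdirectory (assumed location)."""
    path = Path(output_dir) / "fault_diag_result" / "diag_report.json"
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def find_values(node, key):
    """Yield every value stored under `key`, at any nesting depth."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            # Keep descending, since the same key may appear deeper in the tree.
            yield from find_values(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)
```

For example, list(find_values(load_report(output_dir), "description")) gathers all values whose key is description, where the key name must be replaced with one that actually appears in the file.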