Diagnosing Faults on a Single Server

  1. Create a directory for storing the single-server diagnosis result.
    mkdir Single-server_diagnosis_result_output_directory
  2. Run the following command to start the diagnosis.
    By default, single-server diagnosis returns the results of the fault event analysis module.
    ascend-fd single-diag -i Collection_directory -o Single-server_diagnosis_result_output_directory
    When the diagnosed training job has exited abnormally, output similar to the following is displayed:
    The single-diag job starts. Please wait. Job id: [****], run log file is [****].
    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |                                                                                       Ascend Fault-Diag Report                                                                                       |
    +--------------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Version | Type | Version                                                                                                                                                                      |
    +--------------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |              | Fault-Diag | 7.3.0                                                                                                                                                                    |
    +--------------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Fault event analysis |    Type    | Description                                                                                                                                                          |
    +--------------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |              |    Note    | 1. Multiple faults are diagnosed and sorted by occurrence time. Check the earliest faults first.                                                                         |
    |              |            | 2. Only 16 faulty devices are displayed. All faulty devices can be queried in the diag_report.json file.                                                                 |
    +--------------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |              |   Status code   |  xxx                                                                                                                                                                      |
    |              |  Fault type | Type: Network Component: Network Module: Network                                                                                                                                   |
    |              | Faulty device | ['worker-0 device-2']                                                                                                                                                    |
    |              |  Fault name | Link Down: NPU intermittent disconnection                                                                                                                                                 |
    |             |  Fault description  | The link of an NPU network port on the server is down for more than 30s.                                                                                                            |
    |              |  Solution  | 1. Contact physical network O&M personnel to collect switch logs and check whether hardware faults occur (for example, whether optical modules work properly and whether switch links are intermittently disconnected).                                                              |
    |              |  Key log  | /usr/local/Ascend/driver/tools/hccn_tool -i 2 -link_stat -g                                                                                                               |
    |              |            | [devid 2]current time        : Fri Sep  1 06:37:26 2023                                                                                                                  |
    |              |            | [devid 2]link up count       : 2                                                                                                                                         |
    |              |            | [devid 2]link change records :                                                                                                                                           |
    |              |            | [devid 2]    Fri Sep  1 06:34:43 2023    LINK DOWN                                                                                                                       |
    |              |            | [devid 2]    Thu Aug 31 07:30:46 2023    LINK UP                                                                                                                         |
    |              |            | [devid 2]    Thu Aug 31 07:30:44 2023    LINK DOWN                                                                                                                       |
    |              |            | [devid 2]    Thu Aug 31 07:30:43 2023    LINK UP                                                                                                                         |
    |              | Key fault propagation | ['worker-0']                                                                                                                                                             |
    |              |            | Fault code 1 (Link Down: NPU intermittent disconnection error) -> Fault code 2 (excessive RDMA retransmissions) -> Fault code 3 (notify wait timeout)                                                                           |
    +--------------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    The diag job is complete.
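Steps 1 and 2 can be combined into one small script. The following is a minimal sketch, not an official wrapper: the default paths `./collected_logs` and `./single_diag_output` and the function name `run_single_diag` are hypothetical. Only the `ascend-fd single-diag -i ... -o ...` invocation comes from the procedure above.

```shell
#!/usr/bin/env bash
# Sketch only: both default paths below are hypothetical placeholders.
COLLECT_DIR="${COLLECT_DIR:-./collected_logs}"
OUTPUT_DIR="${OUTPUT_DIR:-./single_diag_output}"

run_single_diag() {
    local collect_dir="$1" output_dir="$2"

    # Step 1: create the output directory (-p tolerates an existing one).
    mkdir -p "$output_dir" || return 1

    # Guard: the collection directory must already exist.
    if [ ! -d "$collect_dir" ]; then
        echo "collection directory not found: $collect_dir" >&2
        return 2
    fi

    # Guard: ascend-fd must be installed and on PATH.
    if ! command -v ascend-fd >/dev/null 2>&1; then
        echo "ascend-fd not found on PATH" >&2
        return 127
    fi

    # Step 2: start the diagnosis with the documented options.
    ascend-fd single-diag -i "$collect_dir" -o "$output_dir"
}

run_single_diag "$COLLECT_DIR" "$OUTPUT_DIR" \
    || echo "diagnosis did not run or failed (exit code $?)"
```

Set `COLLECT_DIR` and `OUTPUT_DIR` in the environment to point at your own directories before running the script.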
    The following table describes key parameters in the command output.
    Table 1 Parameter description

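    +----------------------+-----------------------+-------------------------------------------------------------------+
    | Level-1 Parameter    | Level-2 Parameter     | Description                                                       |
    +----------------------+-----------------------+-------------------------------------------------------------------+
    | Fault event analysis | -                     | Root cause analysis of the device on the root cause node.         |
    | -                    | Status code           | • If a fault is diagnosed, the specific fault code is displayed.  |
    |                      |                       | • If no fault is diagnosed, NORMAL OR UNSUPPORTED is displayed.   |
    | -                    | Fault name            | Specific name of the fault.                                       |
    | -                    | Fault type            | Fault type, plus the component and module where the fault occurs. |
    | -                    | Faulty device         | Device where the fault occurs.                                    |
    | -                    | Fault description     | Detailed description of the fault.                                |
    | -                    | Solution              | Handling suggestions for the fault.                               |
    | -                    | Key log               | Key logs of the fault.                                            |
    | -                    | Key fault propagation | Longest fault propagation chain.                                  |
    +----------------------+-----------------------+-------------------------------------------------------------------+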
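The key log in the sample output is the result of the `hccn_tool -i 2 -link_stat -g` command. To re-check the link history on the faulty server by hand, something like the following sketch could be used. The helper name `count_link_down` is hypothetical, and the tool path and device ID are taken from the sample output, so they may differ on your system.

```shell
#!/usr/bin/env bash
HCCN_TOOL="/usr/local/Ascend/driver/tools/hccn_tool"   # path as shown in the sample key log
DEVICE_ID="${DEVICE_ID:-2}"                            # device ID from the sample output

# Count "LINK DOWN" records in link-status text read from stdin.
count_link_down() {
    # grep -c prints the count but exits non-zero on zero matches; mask that.
    grep -c 'LINK DOWN' || true
}

if [ -x "$HCCN_TOOL" ]; then
    "$HCCN_TOOL" -i "$DEVICE_ID" -link_stat -g | count_link_down
else
    echo "hccn_tool not found at $HCCN_TOOL; run this on the faulty server" >&2
fi
```

A count that keeps growing between runs suggests that the link is still flapping.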

    Notes:

    • During single-server diagnosis, fault events in all valid logs on the node are scanned. If fault event analysis results appear in the command output, the reported fault may have caused the training or inference job to exit abnormally.
    • After the diagnosis is complete, handle the fault based on the recommended solution in the single-server diagnosis result. The result file is located as follows:
    Single-server_diagnosis_result_output_directory
    └── fault_diag_result
        └── diag_report.json    # Diagnosis result
    • If an error occurs during single-server fault diagnosis, the failure information is displayed in the description (or analysis failure) field of the fault event analysis output. To view all exception information, check the diag_report.json file.
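To read the full report, the JSON file can be pretty-printed from the shell. The sketch below makes no assumption about the fields inside diag_report.json. It only validates and formats the file with Python's standard `json.tool` module. The default output directory and the function name `show_report` are hypothetical. The relative path `fault_diag_result/diag_report.json` follows the directory tree above.

```shell
#!/usr/bin/env bash
# Hypothetical default; point OUTPUT_DIR at your own result directory.
REPORT="${OUTPUT_DIR:-./single_diag_output}/fault_diag_result/diag_report.json"

show_report() {
    local report="$1"
    if [ ! -f "$report" ]; then
        echo "report not found: $report" >&2
        return 1
    fi
    # json.tool fails on malformed JSON and otherwise prints it indented.
    python3 -m json.tool "$report"
}

show_report "$REPORT" || echo "no report to display yet"
```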