Single-Server Fault Diagnosis

API Prototype

  • Clean all logs, process log cleaning results, diagnose fault events, and output analysis reports on a single server.
    ascend-fd single-diag -i Collection_directory -o Output_directory_of_the_single-server_diagnosis_result
  • Specify individual log directories for single-server diagnosis.
    ascend-fd single-diag --host_log Collection_directory_of_OS_logs_on_the_host --device_log Collection_directory_of_device_logs --train_log Collection_directory_of_user_training_or_inference_logs --process_log Collection_directory_of_CANN_App_logs --env_check Collection_directory_of_NPU_network_port,_status,_and_resource_information --dl_log Collection_directory_of_MindCluster_component_logs --mindie_log Collection_directory_of_MindIE_component_logs --amct_log Collection_directory_of_AMCT_logs -o Cleaning_result_output_directory
  • If -i is used together with any of the detailed log collection directory parameters, the system first reads the directories given by those parameters and then reads the remaining log collection directories from the path given by -i.
  • If -i and all eight detailed log collection directory parameters are configured at the same time, -i does not take effect.
  • At least one of --input_path, --host_log, --device_log, --train_log, --process_log, --env_check, --dl_log, --mindie_log, and --amct_log must be specified. Otherwise, the command fails.
  • The output directory specified in the command must have more than 5 GB of free drive space. If the space is insufficient, some cleaning results may be lost, causing abnormal or inaccurate diagnosis results.
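
The last two notes in the list above can be checked before the command is run. The following is a minimal shell sketch, not part of ascend-fd. The helper names has_input_option, free_kb, and has_enough_space are hypothetical. It assumes that options are passed as separate words from their values and that 5 GB means 5 x 1024 x 1024 blocks of 1 KiB.

```shell
#!/bin/sh
# Pre-flight checks for an ascend-fd single-diag call (illustrative helpers only).

# Assumption: "5 GB" is read as 5 * 1024 * 1024 blocks of 1 KiB.
MIN_FREE_KB=$((5 * 1024 * 1024))

# Succeed if at least one log input option appears among the arguments.
has_input_option() {
    for arg in "$@"; do
        case "$arg" in
            -i|--input_path|--host_log|--device_log|--train_log|--process_log|--env_check|--dl_log|--mindie_log|--amct_log)
                return 0 ;;
        esac
    done
    return 1
}

# Print the free space, in KiB, of the file system that holds directory $1.
# With -P, df prints one header line and one data line; column 4 is "Available".
free_kb() {
    df -Pk -- "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if the free space is strictly greater than MIN_FREE_KB.
has_enough_space() {
    avail=$(free_kb "$1")
    [ -n "$avail" ] && [ "$avail" -gt "$MIN_FREE_KB" ]
}
```

A wrapper script could call has_input_option with its own arguments and has_enough_space with the output directory, and run ascend-fd single-diag only when both succeed.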

Description

This API starts a single-server diagnosis task. After a training or inference job fails, it diagnoses the original logs of a single server, such as run logs and NPU environment check files.

Parameters

Table 1 Parameters

| Parameter | Abbreviation | Required (Yes/No) | Value Type | Description |
|---|---|---|---|---|
| --host_log | None | No | String | Collection directory of OS logs on the host. The value can contain only digits, uppercase letters, lowercase letters, tildes (~), hyphens (-), plus signs (+), underscores (_), periods (.), slashes (/), and spaces. |
| --device_log | None | No | String | Collection directory of device logs. The value can contain only digits, uppercase letters, lowercase letters, tildes (~), hyphens (-), plus signs (+), underscores (_), periods (.), slashes (/), and spaces. |
| --train_log | None | No | String | Collection directory of user training or inference logs. --train_log supports multiple paths. Each path can be the name of a single collected log file or the collection directory of dumped logs. A maximum of 20 paths are read, and any extra paths are discarded. When --train_log specifies a file name, there are no naming restrictions on the user training and inference logs. When --train_log specifies a directory, the files in it whose names end with .txt or .log are regarded as training and inference logs. |
| --process_log | None | No | String | Collection directory of CANN App logs. The value can contain only digits, uppercase letters, lowercase letters, tildes (~), hyphens (-), plus signs (+), underscores (_), periods (.), slashes (/), and spaces. |
| --env_check | None | No | String | Collection directory of NPU network ports, status information, and resource information. The value can contain only digits, uppercase letters, lowercase letters, tildes (~), hyphens (-), plus signs (+), underscores (_), periods (.), slashes (/), and spaces. |
| --dl_log | None | No | String | Collection directory of Ascend Device Plugin, NodeD, Ascend Docker Runtime, NPU Exporter, and Volcano logs. The value can contain only digits, uppercase letters, lowercase letters, tildes (~), hyphens (-), plus signs (+), underscores (_), periods (.), slashes (/), and spaces. |
| --mindie_log | None | No | String | Collection directory of logs generated by MindIE Server, MindIE LLM, MindIE SD, MindIE RT, MindIE Torch, MindIE MS, MindIE Benchmark, and MindIE Client. The value can contain only digits, uppercase letters, lowercase letters, tildes (~), hyphens (-), plus signs (+), underscores (_), periods (.), slashes (/), and spaces. |
| --amct_log | None | No | String | Collection directory of AMCT logs. The value can contain only digits, uppercase letters, lowercase letters, tildes (~), hyphens (-), plus signs (+), underscores (_), periods (.), slashes (/), and spaces. |
| --input_path | -i | No | String | Path for storing preprocessed data. The value can contain only digits, uppercase letters, lowercase letters, tildes (~), hyphens (-), plus signs (+), underscores (_), periods (.), slashes (/), and spaces. |
| --output_path | -o | Yes | String | Output path of cleaned data. The value can contain only digits, uppercase letters, lowercase letters, tildes (~), hyphens (-), plus signs (+), underscores (_), periods (.), slashes (/), and spaces. |
| --help | -h | No | - | Displays the meanings and usage instructions of level-2 commands and parameters. |
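
The character restriction that Table 1 states for most path values can be checked before a path is passed to the command. The following is a minimal shell sketch of that rule, not the tool's own validation. The helper name is_allowed_path is hypothetical, and rejecting an empty value is an assumption.

```shell
#!/bin/sh
# Check a path value against the character set listed in Table 1:
# digits, uppercase and lowercase letters, ~ - + _ . / and spaces.

is_allowed_path() {
    # Assumption: force the C locale so that A-Z and a-z cover ASCII letters only.
    LC_ALL=C
    export LC_ALL
    case "$1" in
        # Assumption: an empty value is treated as invalid.
        '') return 1 ;;
        # Any single character outside the allowed set makes the value invalid.
        *[!0-9A-Za-z~+_./\ -]*) return 1 ;;
    esac
    return 0
}
```

For example, is_allowed_path "/home/user/collect logs/host_log" succeeds, while is_allowed_path "/tmp/a;b" fails because of the semicolon.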

Return Value

Execution status of a single-server diagnosis task:
The single-diag job starts. Please wait. Job id: [****], run log file is [****].
Diagnosis content
The single-diag job is complete.
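
A wrapper script can look for the completion message listed above. The following is a minimal shell sketch, assuming the status messages are written to standard output exactly as shown. The helper name diag_completed is hypothetical, and this section does not state whether the command's exit code also reflects success.

```shell
#!/bin/sh
# Succeed if the text on standard input contains the completion message
# from the Return Value section. -F matches the string literally, -q suppresses output.

diag_completed() {
    grep -Fq 'The single-diag job is complete.'
}
```

If the command output has been saved to a file, the check can be run by redirecting that file to the standard input of diag_completed.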