Overview

MindCluster Ascend FaultDiag exposes its features through two sets of interfaces: command APIs and SDK APIs.

Command APIs are run directly from the command line. Available features include log cleaning, fault diagnosis (single-server and general), custom configuration files, custom fault entities, fault log masking, version queries, and help information.

SDK APIs are code-level interfaces that you call as functions and methods. Available features include service flow cleaning, root cause node cleaning, root cause node diagnosis, fault event cleaning, and fault event diagnosis.

Table 1 Command APIs

| Command | Function |
| --- | --- |
| ascend-fd parse | Command for cleaning logs. When a log cleaning task starts, this command cleans the intermediate result data collected during training or inference. |
| ascend-fd diag | Command for fault diagnosis. When a fault analysis task starts, this command analyzes the root cause of the fault and outputs an analysis report. |
| ascend-fd single-diag | Command for single-server fault diagnosis. It starts a single-server fault analysis task and generates an analysis report. |
| ascend-fd entity | Command for customizing fault entities. MindCluster Ascend FaultDiag supports log cleaning, fault diagnosis, and fault log masking for custom faults. |
| ascend-fd blacklist | Command for masking fault logs. Log information that contains the masked fault keywords is not recorded in the output file after log cleaning. |
| ascend-fd config | Command for customizing configuration files. You can specify whether to clean key ModelArts logs, set the size of console logs to be read, and configure file parsing. |
| ascend-fd version | Command for querying the component version information. |
| ascend-fd -h | Command for querying help information. |
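As a minimal sketch of how the two argument-free commands in Table 1 might be called from a script, the following Python snippet wraps `ascend-fd version` and `ascend-fd -h`. The helper names `build_query` and `run_query` are hypothetical and are not part of MindCluster Ascend FaultDiag. The other commands take options that this overview does not list, so they are deliberately left out.

```python
import shutil
import subprocess

# Only the two commands that Table 1 shows without further options.
# The dictionary keys ("version", "help") are labels chosen for this sketch.
QUERIES = {
    "version": ["ascend-fd", "version"],
    "help": ["ascend-fd", "-h"],
}


def build_query(kind):
    """Return the argument list for a query command listed in Table 1."""
    if kind not in QUERIES:
        raise ValueError(f"unsupported query: {kind!r}")
    return list(QUERIES[kind])


def run_query(kind):
    """Run the query if ascend-fd is on PATH; return None when it is not installed."""
    argv = build_query(kind)
    if shutil.which(argv[0]) is None:
        return None
    # check=False: leave the interpretation of the exit code to the caller.
    return subprocess.run(argv, capture_output=True, text=True, check=False)
```

Calling `run_query("version")` returns a `subprocess.CompletedProcess` whose `stdout` holds whatever the command printed, or `None` if the tool is not installed on the machine.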

Table 2 SDK APIs

| API | Description |
| --- | --- |
| parse_fault_type | Cleans the service flow. |
| parse_root_cluster | Cleans the root cause node. |
| diag_root_cluster | Diagnoses the root cause node. |
| parse_knowledge_graph | Cleans the fault event. |
| diag_knowledge_graph | Diagnoses the fault event. |

When this component is used, operation logs and run logs are generated in the ${HOME}/.ascend_faultdiag directory. The directory structure is as follows:

${HOME}/.ascend_faultdiag
├── ascend_faultdiag_operation.log    # Operation logs
└── RUN_LOG                           # Run logs
    └── 20241104142355468743_6797877f-7143-443f-a9c6-361e33032c5c
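
To check what the component has written, the layout above can be inspected with a few lines of Python. This is only a sketch: the function name `describe_log_dir` and the shape of its return value are assumptions made for illustration, while the directory and file names are the ones shown in the tree.

```python
from pathlib import Path


def describe_log_dir(base=None):
    """Summarize the FaultDiag log directory (default: ${HOME}/.ascend_faultdiag)."""
    base = Path(base) if base is not None else Path.home() / ".ascend_faultdiag"
    operation_log = base / "ascend_faultdiag_operation.log"
    run_log_dir = base / "RUN_LOG"

    # List the entries under RUN_LOG by name; empty if the directory is absent.
    run_entries = []
    if run_log_dir.is_dir():
        run_entries = sorted(entry.name for entry in run_log_dir.iterdir())

    return {
        "operation_log_exists": operation_log.is_file(),
        "run_log_entries": run_entries,
    }
```

Passing an explicit `base` makes it possible to point the function at a copy of the directory, for example logs collected from another server.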

Log saving mechanism: A single log file cannot exceed 10 MB. When a log file reaches this limit, the system automatically rolls over to a new log file. At most 10 log files are kept for the same PID. When this number is exceeded, the earliest log files are overwritten.
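
A comparable size-and-count policy can be reproduced with Python's standard `logging.handlers.RotatingFileHandler`. The sketch below illustrates the policy described above; it is not the component's own implementation. The function name `make_rotating_logger`, the logger naming scheme, and the log format are assumptions.

```python
import logging
from logging.handlers import RotatingFileHandler

MAX_BYTES = 10 * 1024 * 1024  # 10 MB per log file, as described above
MAX_FILES = 10                # at most 10 log files are kept


def make_rotating_logger(log_path, max_bytes=MAX_BYTES, max_files=MAX_FILES):
    """Return a logger that rolls over by size and discards the oldest file."""
    # Hypothetical naming scheme: one logger per target path.
    logger = logging.getLogger(f"rotation_sketch.{log_path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        # backupCount counts rotated files only, so the active file plus
        # (max_files - 1) backups gives max_files files in total.
        handler = RotatingFileHandler(
            log_path,
            maxBytes=max_bytes,
            backupCount=max_files - 1,
            encoding="utf-8",
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(process)d %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

With this handler, the active file is renamed with a numeric suffix once the next record would push it past the size limit, and the oldest backup is dropped once the file count is reached.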