Overview
MindCluster Ascend FaultDiag provides two kinds of interfaces, command APIs and SDK APIs, for running its log cleaning and fault diagnosis features.
Command APIs are invoked directly from the command line. Available features include log cleaning, fault diagnosis (single-server and general), custom configuration files, custom fault entities, fault log masking, version queries, and help information.
SDK APIs are code-level interfaces that you call as functions and methods. Available features include service flow cleaning, root cause node cleaning, root cause node diagnosis, fault event cleaning, and fault event diagnosis.
| Command | Function |
|---|---|
| ascend-fd parse | Command for cleaning logs. When a log cleaning task starts, this command cleans the intermediate result data collected during training or inference. |
| ascend-fd diag | Command for fault diagnosis. When a fault analysis task starts, this command analyzes the root cause of the fault and outputs an analysis report. |
| ascend-fd single-diag | Command for single-server fault diagnosis. It starts a single-server fault analysis task and generates an analysis report. |
| ascend-fd entity | Command for customizing fault entities. MindCluster Ascend FaultDiag supports log cleaning, fault diagnosis, and fault log masking for custom faults. |
| ascend-fd blacklist | Command for masking fault logs. Log entries that contain fault keywords are not recorded in the file after log cleaning. |
| ascend-fd config | Command for customizing configuration files. You can choose whether to clean key ModelArts logs, set the size of console logs to be read, and configure file parsing. |
| ascend-fd version | Command for querying the component version information. |
| ascend-fd -h | Command for querying help information. |
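
The command APIs in the table above can also be driven from a script. The following is a minimal sketch, assuming `ascend-fd` is installed and on `PATH`. The helper name `run_ascend_fd` and the `SUBCOMMANDS` mapping are hypothetical and only restate the table. Only `ascend-fd version` is invoked, because the arguments of the other subcommands are not described in this overview.

```python
import shutil
import subprocess

# Subcommands listed in the table above (descriptions abbreviated).
SUBCOMMANDS = {
    "parse": "log cleaning",
    "diag": "fault diagnosis",
    "single-diag": "single-server fault diagnosis",
    "entity": "custom fault entities",
    "blacklist": "fault log masking",
    "config": "custom configuration files",
    "version": "component version query",
}


def run_ascend_fd(*args, executable="ascend-fd"):
    """Run the CLI with the given arguments (hypothetical helper).

    Returns the captured standard output, or None when the executable
    cannot be found on PATH.
    """
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, *args], capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    output = run_ascend_fd("version")
    if output is None:
        print("ascend-fd was not found on PATH")
    else:
        print(output)
```

Returning `None` for a missing executable lets a calling script distinguish "tool not installed" from "tool ran and printed nothing".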
| API | Description |
|---|---|
| parse_fault_type | Cleans the service flow. |
| parse_root_cluster | Cleans the root cause node. |
| diag_root_cluster | Diagnoses the root cause node. |
| parse_knowledge_graph | Cleans the fault event. |
| diag_knowledge_graph | Diagnoses the fault event. |
When this component is used, operation logs and run logs are generated in the ${HOME}/.ascend_faultdiag directory. The directory structure is as follows:
${HOME}/.ascend_faultdiag
├── ascend_faultdiag_operation.log    # Operation logs
└── RUN_LOG                           # Run logs
    └── 20241104142355468743_6797877f-7143-443f-a9c6-361e33032c5c
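
To check what the component has written, the directory layout above can be inspected with a few lines of Python. This is a minimal sketch, assuming the run log entries under RUN_LOG are subdirectories as the layout suggests. The function name `describe_log_dir` is hypothetical.

```python
from pathlib import Path


def describe_log_dir(base):
    """Summarize a FaultDiag log directory (hypothetical helper).

    Reports whether the operation log exists and lists the names of the
    per-run entries under RUN_LOG, sorted alphabetically.
    """
    base = Path(base)
    operation_log = base / "ascend_faultdiag_operation.log"
    run_root = base / "RUN_LOG"
    run_dirs = []
    if run_root.is_dir():
        run_dirs = sorted(p.name for p in run_root.iterdir() if p.is_dir())
    return {
        "operation_log_exists": operation_log.is_file(),
        "run_dirs": run_dirs,
    }


if __name__ == "__main__":
    print(describe_log_dir(Path.home() / ".ascend_faultdiag"))
```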
Mechanism for saving logs: A single log file cannot exceed 10 MB. When a log file reaches this limit, the system automatically rolls over to a new log file. At most 10 log files are kept for the same PID. When this limit is exceeded, the earliest log files are overwritten.
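
The size and count limits behave much like a rotating file handler. The sketch below reproduces the same policy with Python's standard `logging.handlers.RotatingFileHandler`. It illustrates the policy only and is not the component's actual implementation. The function name `make_rotating_logger` and the logger name are hypothetical.

```python
import logging
from logging.handlers import RotatingFileHandler

MAX_BYTES = 10 * 1024 * 1024  # 10 MB per log file
MAX_FILES = 10                # total files kept, including the active one


def make_rotating_logger(log_path, max_bytes=MAX_BYTES, max_files=MAX_FILES,
                         name="faultdiag_demo"):
    """Create a logger that rolls over by size (hypothetical helper).

    The active file plus (max_files - 1) backups are kept. Once that many
    files exist, each rollover discards the oldest backup.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = RotatingFileHandler(
        log_path, maxBytes=max_bytes, backupCount=max_files - 1
    )
    # Include the process ID in each record, since the limit is per PID.
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(process)d %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Setting `backupCount` to `max_files - 1` is what keeps the total at 10 files, because the handler counts backups separately from the file currently being written.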