Application Scenarios and Solutions
The intelligent fault diagnosis feature addresses the difficulty of locating and demarcating faults in cluster training and inference jobs. Pinpointing such faults is challenging and time-consuming because clusters produce a vast number of logs, full-stack AI log analysis is complex, and problems span the computing, networking, and storage domains. The work therefore requires expertise in multiple domains.
Consequently, the intelligent fault diagnosis feature is designed to improve fault locating for training and inference jobs. It also extends the product ecosystem, and you are encouraged to try it.
Specifically, this feature provides log cleaning and fault diagnosis functions for each device in a training or inference cluster. It helps you collect and clean logs, dump the cleaned information files to a specified path for diagnosis, and quickly locate faults from the analysis results. You can also customize fault entities or mask error logs in CANN App logs.
Two application scenarios are supported, depending on the actual service.

| Scenario | User | Job Type | Description |
|---|---|---|---|
| Full-process application scenario | Enterprises, governments, and public institutions (with AI cluster O&M platform capabilities) | Training and inference job | Requires data related to training, inference, CANN, host resources, and hardware. The data to be collected is complex and can be used by AI cluster O&M platform users to diagnose complex tasks. |
| Basic application scenario | Individual | Training and inference job | Requires training or inference logs and CANN logs. The content to be collected and the collection method are simple, which suits basic task diagnosis by individual users. |

Full-Process Application Scenario
In the training scenario, multiple types of logs and metrics, including training logs, host resource logs, NPU logs, and hardware logs, are required.
In the inference scenario, inference job logs, CANN App logs, device logs, and MindIE component logs are required.
Because some metrics must be collected in addition to the logs, this scenario is recommended for the AI cluster O&M platform.
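The two lists above can be written down as a simple checklist to see what is still missing for a given job type. The sketch below is illustrative only: the `REQUIRED_LOGS` dictionary and the `missing_log_types` function are hypothetical names made up for this example and are not part of MindCluster Ascend FaultDiag. The category strings are copied from the text, and the training list is introduced there with "including", so it may not be exhaustive.

```python
# Log categories named in the text for the full-process scenario.
# The data structure and names are hypothetical, for illustration only.
# The training list comes from an "including ..." sentence and may be incomplete.
REQUIRED_LOGS = {
    "training": {
        "training logs",
        "host resource logs",
        "NPU logs",
        "hardware logs",
    },
    "inference": {
        "inference job logs",
        "CANN App logs",
        "device logs",
        "MindIE component logs",
    },
}


def missing_log_types(job_type, collected):
    """Return the required categories not yet collected, sorted by name."""
    if job_type not in REQUIRED_LOGS:
        raise ValueError(f"unknown job type: {job_type!r}")
    return sorted(REQUIRED_LOGS[job_type] - set(collected))


if __name__ == "__main__":
    # Two of the four inference categories have been collected so far.
    print(missing_log_types("inference", ["inference job logs", "CANN App logs"]))
```

A check like this could run on each device before cleaning starts, so that gaps are found while the logs are still available.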
The following figure shows the full-process application solution. Install MindCluster Ascend FaultDiag on all training or inference devices. After a training or inference job is complete, collect all of the preceding logs and metrics on each device, and then use the cleaning function of MindCluster Ascend FaultDiag to filter and extract the valid information. Finally, dump the original logs, metric information, and cleaning results of all devices to the AI cluster O&M platform, which uses the diagnosis function of MindCluster Ascend FaultDiag to analyze the root cause of the fault. You can also customize fault entities or mask error logs in CANN App logs.
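The flow described above (clean on every device, dump everything to the platform, then diagnose once) can be sketched as a small driver. This is a minimal sketch under stated assumptions: `clean_logs` and `diagnose` are hypothetical callables standing in for the cleaning and diagnosis functions of MindCluster Ascend FaultDiag, whose actual invocation is not shown in this section, and the `raw` and `cleaned` directory names are invented for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def run_full_process(device_log_dirs, platform_root, clean_logs, diagnose):
    """Clean per device, dump raw and cleaned data to the platform, then diagnose.

    device_log_dirs: mapping of device name -> directory with that device's logs.
    clean_logs(raw_dir, out_dir) and diagnose(platform_root) are hypothetical
    stand-ins for the real cleaning and diagnosis functions.
    """
    platform_root = Path(platform_root)
    for device, raw_dir in device_log_dirs.items():
        raw_dir = Path(raw_dir)
        # Step 1 (on the device): clean the collected logs into a local directory.
        with tempfile.TemporaryDirectory() as tmp:
            clean_logs(raw_dir, Path(tmp))
            # Step 2: dump the original data and the cleaning result to the platform.
            shutil.copytree(raw_dir, platform_root / device / "raw", dirs_exist_ok=True)
            shutil.copytree(tmp, platform_root / device / "cleaned", dirs_exist_ok=True)
    # Step 3 (on the platform): diagnose once, across the data of all devices.
    return diagnose(platform_root)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end without the real tool.
    def fake_clean(raw_dir, out_dir):
        names = sorted(p.name for p in raw_dir.iterdir())
        (out_dir / "summary.txt").write_text("\n".join(names))

    def fake_diagnose(root):
        return sorted(p.name for p in root.iterdir())

    with tempfile.TemporaryDirectory() as work:
        work = Path(work)
        devices = {}
        for name in ("worker-0", "worker-1"):
            d = work / "devices" / name
            d.mkdir(parents=True)
            (d / "train.log").write_text("example line\n")
            devices[name] = d
        print(run_full_process(devices, work / "platform", fake_clean, fake_diagnose))
```

Passing the two functions in as parameters keeps the sketch independent of how the tool is actually launched. Diagnosis runs only after every device has dumped its data, which mirrors the order in the text.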

Table 1 describes the data sources and usage of metrics and logs to be collected in the full-process application scenario.

| Data Type | Log Description | Data Source | Data Usage |
|---|---|---|---|
| Training job logs | Logs generated in the model training process. | Training job | Used for fault event analysis |
| NPU network port check files before and after training | Before and after a training job is executed, use hccn_tool to check the network port information of each NPU. | Training job | Used for fault event analysis |
| Host resource information | NPU status monitoring metrics, including the CPU usage (%CPU) used by the main training process of each NPU and the used physical memory (RES). | Training job | Used for device resource analysis |
| NPU network port resource information | Statistics about packets sent and received by the NPU network port. | Training job | Used for network congestion analysis |
| OS logs | Linux system logs. | Training job | Used for fault event analysis |
| MindCluster component logs | SuperPoD logs, AI server logs, and component logs collected by Ascend Device Plugin, NodeD, Ascend Docker Runtime, NPU Exporter, and Volcano. | Training job | Used for fault event analysis |
| Inference job logs | Logs generated in the inference process. | Inference job | Used for fault event analysis |
| NPU device run logs | Logs and files on the device, including slog and hisi_logs. | Training and inference job | Used for fault event analysis |
| CANN App logs | Run logs generated by CANN. | Training and inference job | Used for root cause node analysis and fault event analysis |
| MindIE component logs | Logs generated by MindIE Server, MindIE LLM, MindIE SD, MindIE RT, MindIE Torch, MindIE MS, MindIE Benchmark, and MindIE Client. | Inference job | Used for fault event analysis |
| AMCT logs | Logs generated when AMCT compresses models. | Model compression | Used by AMCT for fault event analysis |
| MindIE pod console logs | MindIE pod console logs | Inference job | Used for root cause node analysis |

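
The host resource row above refers to the %CPU and RES values of the main training process. As a rough illustration of what those two values are, the sketch below extracts them from one process row of Linux `top` batch output. It assumes the default procps-ng column order (`PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND`). The function names are hypothetical, and the file format that MindCluster Ascend FaultDiag actually expects for host resource information is not described in this section.

```python
# Multipliers for converting top's size suffixes to KiB (top reports KiB by default).
_UNIT_TO_KIB = {"k": 1, "m": 1024, "g": 1024 ** 2, "t": 1024 ** 3}


def _to_kib(value):
    """Convert a top memory field such as '512000' or '3.2g' to KiB."""
    suffix = value[-1].lower()
    if suffix in _UNIT_TO_KIB:
        return float(value[:-1]) * _UNIT_TO_KIB[suffix]
    return float(value)


def parse_top_row(row):
    """Extract PID, %CPU, and RES (KiB) from one process row of `top -b` output.

    Assumes the default column order:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    """
    fields = row.split()
    if len(fields) < 12:
        raise ValueError("not a process row")
    return {
        "pid": int(fields[0]),
        "cpu_percent": float(fields[8]),
        "res_kib": _to_kib(fields[5]),
    }


if __name__ == "__main__":
    sample = "12345 root      20   0   25.1g   3.2g 512000 S 101.3   1.2   5:01.22 python"
    print(parse_top_row(sample))
```

If `top` is configured with a different field order, the column indexes in `parse_top_row` need to be adjusted accordingly.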
Basic Application Scenario
To suit different user requirements, a basic application scenario is also supported, in which only training or inference logs and CANN App logs are diagnosed. These logs are generated by the training or inference job itself and do not need to be collected additionally.
The following figure shows the basic application scenario. Install MindCluster Ascend FaultDiag on all training or inference devices. After a training or inference job is complete, collect at least the training or inference logs and the CANN App logs on each device. Then use the cleaning function of MindCluster Ascend FaultDiag to filter and extract the valid information, dump the original logs and cleaning results of all devices to the same general-purpose device, and use the diagnosis function of MindCluster Ascend FaultDiag to analyze the root cause of the fault. You can also customize fault entities or mask error logs in CANN App logs.
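The minimum requirement stated above (training or inference logs, plus CANN App logs, on every device) can be checked before diagnosis starts. The sketch below is a hypothetical helper, not part of MindCluster Ascend FaultDiag: the function name and the input shape (device name mapped to the set of categories collected on it) are assumptions made for this example.

```python
def devices_missing_minimum(collected):
    """Report devices that lack the minimum logs for the basic scenario.

    collected: mapping of device name -> set of log categories collected there.
    Returns a mapping of device name -> list of missing requirements.
    """
    job_logs = {"training job logs", "inference job logs"}
    problems = {}
    for device, categories in collected.items():
        categories = set(categories)
        missing = []
        # Either training or inference job logs satisfy the first requirement.
        if not (job_logs & categories):
            missing.append("training or inference job logs")
        if "CANN App logs" not in categories:
            missing.append("CANN App logs")
        if missing:
            problems[device] = missing
    return problems


if __name__ == "__main__":
    print(devices_missing_minimum({
        "worker-0": {"training job logs", "CANN App logs"},
        "worker-1": {"CANN App logs"},
    }))
```

An empty result means every device meets the minimum and its logs can be dumped to the general-purpose device.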

The following table lists the data sources and usage of metrics and logs to be collected.
For details about how to collect all logs and metrics, see Collecting Logs.

| Data Type | Log Description | Data Source | Data Usage |
|---|---|---|---|
| Training job logs | Logs generated by a training job. | Training job | Used for fault event analysis |
| CANN App logs | Run logs generated by CANN. | Training and inference job | Used for root cause node analysis and fault event analysis |
| Inference job logs | Logs generated in the inference process. | Inference job | Used for fault event analysis |
