Application Scenarios and Solutions

The intelligent fault diagnosis feature addresses the difficulty of locating and demarcating faults in cluster training and inference jobs. Pinpointing such faults is challenging and time-consuming because cluster logs are voluminous, full-stack AI log analysis is complex, and the analysis often spans the computing, networking, and storage domains. It also requires expertise in multiple domains.

The intelligent fault diagnosis feature is therefore designed to make fault locating for training and inference jobs more effective, which in turn makes the feature easier to adopt and helps expand the product ecosystem.

Specifically, this feature provides log cleaning and fault diagnosis functions for each device in a training or inference cluster. It helps you collect and clean logs, dump the cleaned information files to a specified path for diagnosis, and quickly locate faults from the analysis results. You can also customize fault entities or mask error logs in CANN App logs.

Two scenarios are supported, depending on the actual service requirements.

Scenario	User	Job Type	Description
Full-process application scenario	Enterprises, governments, and public institutions (with AI cluster O&M platform capabilities)	Training and inference job	Requires data related to training, inference, CANN, host resources, and hardware. The data to be collected is complex and can be used by AI cluster O&M platform users to diagnose complex tasks.
Basic application scenario	Individual	Training and inference job	Requires training or inference logs and CANN logs. The content to collect and the collection method are simple, which suits basic task diagnosis for individual users.

Full-Process Application Scenario

In the training scenario, multiple types of logs and metrics, including training logs, host resource logs, NPU logs, and hardware logs, are required.

In the inference scenario, inference job logs, CANN App logs, device logs, and MindIE component logs are required.

Because some metrics must be collected additionally, this scenario is recommended for AI cluster O&M platforms.

Figure 1 shows the full-process application solution. Install MindCluster Ascend FaultDiag on all training or inference devices. After a training or inference job is complete, each device collects all of the aforementioned logs and metrics and then uses the cleaning function of MindCluster Ascend FaultDiag to filter the data and extract valid information. Finally, the original logs, metric information, and cleaning results of all devices are dumped to the AI cluster O&M platform, which uses the diagnosis function of MindCluster Ascend FaultDiag to analyze the root cause of the fault. You can also customize fault entities or mask error logs in CANN App logs.

Figure 1 Full-process application solution

Table 1 describes the data sources and usage of metrics and logs to be collected in the full-process application scenario.

**Table 1** Training and inference job logs and metrics
Data Type	Log Description	Data Source	Data Usage
Training job logs	Logs generated in the model training process.	Training job	Used for fault event analysis
NPU network port check files before and after training	Before and after a training job is executed, use hccn_tool to check the network port information of each NPU.	Training job	Used for fault event analysis
Host resource information	NPU status monitoring metrics, including the CPU usage (%CPU) of the main training process of each NPU and the physical memory it uses (RES).	Training job	Used for device resource analysis
NPU network port resource information	Statistics about packets sent and received by the NPU network port.	Training job	Used for network congestion analysis
OS logs	Linux system logs.	Training job	Used for fault event analysis
MindCluster component logs	SuperPoD logs, AI server logs, and component logs collected by Ascend Device Plugin, NodeD, Ascend Docker Runtime, NPU Exporter, and Volcano.	Training job	Used for fault event analysis
Inference job logs	Logs generated in the inference process.	Inference job	Used for fault event analysis
NPU device run logs	Logs and files on the device, including slog and hisi_logs.	Training and inference job	Used for fault event analysis
CANN App logs	Run logs generated by CANN.	Training and inference job	Used for root cause node analysis and fault event analysis
MindIE component logs	Logs generated by MindIE Server, MindIE LLM, MindIE SD, MindIE RT, MindIE Torch, MindIE MS, MindIE Benchmark, and MindIE Client.	Inference job	Used for fault event analysis
AMCT logs	Logs generated when AMCT compresses models.	Model compression	Used by AMCT for fault event analysis
MindIE pod console logs	MindIE pod console logs	Inference job	Used for root cause node analysis

For details about how to collect all logs and metric data, see Log Collection.
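As an illustration of the host resource information row in Table 1, the following minimal sketch extracts %CPU and RES for one process from a `top` batch-mode snapshot. The column names %CPU and RES match those printed by the Linux `top` command, but using `top` as the sampling source is an assumption here, and the function name `parse_top_sample`, the sample text, and the returned dictionary keys are hypothetical. This is not the file format that MindCluster Ascend FaultDiag expects; refer to the log collection instructions for that.

```python
from typing import Optional

# Hypothetical snapshot, shaped like the output of `top -b -n 1 -p <pid>`.
SAMPLE = """\
top - 10:00:00 up 1 day,  2:03,  1 user,  load average: 0.50, 0.40, 0.30
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   4242 root      20   0   12.3g   2.1g  10240 S  87.5   1.6  12:34.56 python
"""


def parse_top_sample(top_output: str, pid: int) -> Optional[dict]:
    """Return %CPU and RES for `pid`, or None if the snapshot lacks them."""
    lines = top_output.splitlines()
    header = None
    header_idx = -1
    # Locate the column header row instead of hard-coding column positions.
    for i, line in enumerate(lines):
        cols = line.split()
        if cols and cols[0] == "PID" and "%CPU" in cols and "RES" in cols:
            header, header_idx = cols, i
            break
    if header is None:
        return None
    cpu_col = header.index("%CPU")
    res_col = header.index("RES")
    for line in lines[header_idx + 1:]:
        fields = line.split()
        if len(fields) > max(cpu_col, res_col) and fields[0] == str(pid):
            return {
                "pid": pid,
                # Some locales print a decimal comma; normalize it.
                "cpu_percent": float(fields[cpu_col].replace(",", ".")),
                # RES stays a string because top may append a unit suffix.
                "res": fields[res_col],
            }
    return None


if __name__ == "__main__":
    print(parse_top_sample(SAMPLE, 4242))
```

Looking up the columns by header name keeps the sketch tolerant of `top` configurations that display fields in a different order.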

Basic Application Scenario

For users with simpler requirements, the basic application scenario diagnoses only training or inference logs and CANN App logs. These logs are generated by the training or inference jobs themselves, so nothing needs to be collected additionally.

Figure 2 shows the basic application solution. Install MindCluster Ascend FaultDiag on all training or inference devices. After a training or inference job is complete, each device collects at least the training or inference logs and the CANN App logs. Then use the cleaning function of MindCluster Ascend FaultDiag to filter the logs and extract valid information, dump the original logs and cleaning results of all devices to the same general-purpose device, and use the diagnosis function of MindCluster Ascend FaultDiag to analyze the root cause of the fault. You can also customize fault entities or mask error logs in CANN App logs.

Figure 2 Basic application solution

Table 2 lists the data sources and usage of the metrics and logs to be collected in the basic application scenario.

For details about how to collect all logs and metrics, see Collecting Logs.

**Table 2** Training and inference job logs and metrics
Data Type	Log Description	Data Source	Data Usage
Training job logs	Logs generated by a training job.	Training job	Used for fault event analysis
CANN App logs	Run logs generated by CANN.	Training and inference job	Used for root cause node analysis and fault event analysis
Inference job logs	Logs generated in the inference process.	Inference job	Used for fault event analysis
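
The minimum data set in Table 2 can be checked before cleaning and diagnosis. The sketch below is illustrative only: it assumes each device's logs were copied into one directory with a subdirectory per log category, and the subdirectory names (`train_log`, `infer_log`, `cann_app_log`) and the function `missing_basic_logs` are hypothetical rather than the layout or API that MindCluster Ascend FaultDiag uses.

```python
from pathlib import Path

# Hypothetical subdirectory names, chosen for this sketch only.
JOB_LOG_DIRS = ("train_log", "infer_log")
CANN_LOG_DIR = "cann_app_log"


def has_files(directory: Path) -> bool:
    """True if the directory exists and contains at least one file."""
    return directory.is_dir() and any(p.is_file() for p in directory.rglob("*"))


def missing_basic_logs(device_dir: Path) -> list:
    """List the basic-scenario log categories that a device directory lacks."""
    missing = []
    # Either training logs or inference logs satisfy the job-log requirement.
    if not any(has_files(device_dir / name) for name in JOB_LOG_DIRS):
        missing.append("training or inference job logs")
    if not has_files(device_dir / CANN_LOG_DIR):
        missing.append("CANN App logs")
    return missing
```

An empty list means the device directory holds the minimum logs that the basic application scenario requires; any returned entry names a category that still has to be collected.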