Application Scenarios and Solutions

The intelligent fault diagnosis feature addresses the difficulty of locating and demarcating faults in cluster training and inference jobs. Pinpointing such faults is challenging and time-consuming because of the large volume of cluster logs, the complexity of full-stack AI log analysis, and the need to analyze problems across the computing, networking, and storage domains. It also requires expertise in multiple domains.

The intelligent fault diagnosis feature is therefore designed to improve fault locating for training and inference jobs. You are encouraged to try this new feature, which also helps expand the product ecosystem.

Specifically, this feature provides log cleaning and fault diagnosis functions for each device in a training or inference cluster. It helps you collect and clean logs, dump the cleaned information files to a specified path for diagnosis, and quickly locate faults from the analysis results. You can also customize fault entities or mask error logs in CANN App logs.

Two application scenarios are available, depending on the actual service.

| Scenario | User | Job Type | Description |
| --- | --- | --- | --- |
| Full-process application scenario | Enterprises, governments, and public institutions (with AI cluster O&M platform capabilities) | Training and inference job | Requires data related to training, inference, CANN, host resources, and hardware. The data to be collected is complex and can be used by AI cluster O&M platform users to diagnose complex tasks. |
| Basic application scenario | Individual | Training and inference job | Requires training or inference logs and CANN logs. The content to collect and the collection method are simple, which suits basic task diagnosis by individual users. |

Full-Process Application Scenario

In the training scenario, multiple types of logs and metrics, including training logs, host resource logs, NPU logs, and hardware logs, are required.

In the inference scenario, inference job logs, CANN App logs, device logs, and MindIE component logs are required.

Because some metrics must be collected in addition to the logs, this scenario is recommended for the AI cluster O&M platform.

The following figure shows the full-process application solution. Install MindCluster Ascend FaultDiag on all training or inference devices. After a training or inference job is complete, each device collects all the logs and metrics listed above and then uses the cleaning function of MindCluster Ascend FaultDiag to filter and extract valid information. Finally, the original logs, metric information, and cleaning results of all devices are dumped to the AI cluster O&M platform, which uses the diagnosis function of MindCluster Ascend FaultDiag to analyze the root cause of the fault. You can also customize fault entities or mask error logs in CANN App logs.

Figure 1 Full-process application solution
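The ordering of the steps in this solution (collect and clean on each device, dump to the platform, diagnose once on the platform) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: DeviceBundle, run_full_process, and the fake_* stand-in stages are hypothetical names, and the stand-ins do not reproduce the cleaning or diagnosis logic of MindCluster Ascend FaultDiag.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DeviceBundle:
    """Data one device hands to the O&M platform (hypothetical shape)."""
    device: str
    raw_logs: List[str]
    metrics: Dict[str, float]
    cleaned: List[str] = field(default_factory=list)


def run_full_process(
    devices: List[str],
    collect: Callable[[str], DeviceBundle],
    clean: Callable[[DeviceBundle], List[str]],
    diagnose: Callable[[List[DeviceBundle]], str],
) -> str:
    """Run the per-device stages first, then a single platform-side diagnosis."""
    platform_store: List[DeviceBundle] = []
    for device in devices:
        bundle = collect(device)        # 1. collect logs and metrics on the device
        bundle.cleaned = clean(bundle)  # 2. clean on the same device
        platform_store.append(bundle)   # 3. dump raw data and cleaning results
    return diagnose(platform_store)     # 4. diagnose on the platform


# Stand-in stages so the sketch runs. They are NOT FaultDiag's real logic.
def fake_collect(device: str) -> DeviceBundle:
    logs = [f"{device}: step 1 ok"]
    if device == "node-1":
        logs.append(f"{device}: ERROR link down")
    return DeviceBundle(device, logs, {"cpu_percent": 12.5})


def fake_clean(bundle: DeviceBundle) -> List[str]:
    return [line for line in bundle.raw_logs if "ERROR" in line]


def fake_diagnose(bundles: List[DeviceBundle]) -> str:
    suspects = [b.device for b in bundles if b.cleaned]
    return ", ".join(suspects) if suspects else "no fault found"


result = run_full_process(["node-0", "node-1"], fake_collect, fake_clean, fake_diagnose)
print(result)
```

Passing the stages in as callables keeps the ordering visible and leaves room to replace each stand-in with a call to the real collection, cleaning, or diagnosis step.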

Table 1 describes the data sources and usage of metrics and logs to be collected in the full-process application scenario.

Table 1 Training and inference job logs and metrics

| Data Type | Description | Data Source | Data Usage |
| --- | --- | --- | --- |
| Training job logs | Logs generated in the model training process. | Training job | Used for fault event analysis |
| NPU network port check files before and after training | Before and after a training job is executed, use hccn_tool to check the network port information of each NPU. | Training job | Used for fault event analysis |
| Host resource information | NPU status monitoring metrics, including the CPU usage (%CPU) used by the main training process of each NPU and the used physical memory (RES). | Training job | Used for device resource analysis |
| NPU network port resource information | Statistics about packets sent and received by the NPU network port. | Training job | Used for network congestion analysis |
| OS logs | Linux system logs. | Training job | Used for fault event analysis |
| MindCluster component logs | SuperPoD logs, AI server logs, and component logs collected by Ascend Device Plugin, NodeD, Ascend Docker Runtime, NPU Exporter, and Volcano. | Training job | Used for fault event analysis |
| Inference job logs | Logs generated in the inference process. | Inference job | Used for fault event analysis |
| NPU device run logs | Logs and files on the device, including slog and hisi_logs. | Training and inference job | Used for fault event analysis |
| CANN App logs | Run logs generated by CANN. | Training and inference job | Used for root cause node analysis and fault event analysis |
| MindIE component logs | Logs generated by MindIE Server, MindIE LLM, MindIE SD, MindIE RT, MindIE Torch, MindIE MS, MindIE Benchmark, and MindIE Client. | Inference job | Used for fault event analysis |
| AMCT logs | Logs generated when AMCT compresses models. | Model compression | Used by AMCT for fault event analysis |
| MindIE pod console logs | MindIE pod console logs. | Inference job | Used for root cause node analysis |

For details about how to collect all logs and metric data, see Log Collection.
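As a planning aid, the Data Type and Data Source columns of Table 1 can be encoded as data and queried by job type. The sketch below is only an illustration: DataSource, TABLE_1, and data_types_for are hypothetical names, and the lowercase job type strings are an assumed encoding of the Data Source column, not identifiers used by MindCluster Ascend FaultDiag.

```python
from typing import FrozenSet, List, NamedTuple


class DataSource(NamedTuple):
    data_type: str
    job_types: FrozenSet[str]  # assumed encoding of the "Data Source" column
    usage: str


TRAIN = frozenset({"training"})
INFER = frozenset({"inference"})
BOTH = TRAIN | INFER
COMPRESS = frozenset({"model compression"})

# Rows transcribed from Table 1.
TABLE_1 = [
    DataSource("Training job logs", TRAIN, "fault event analysis"),
    DataSource("NPU network port check files before and after training", TRAIN,
               "fault event analysis"),
    DataSource("Host resource information", TRAIN, "device resource analysis"),
    DataSource("NPU network port resource information", TRAIN,
               "network congestion analysis"),
    DataSource("OS logs", TRAIN, "fault event analysis"),
    DataSource("MindCluster component logs", TRAIN, "fault event analysis"),
    DataSource("Inference job logs", INFER, "fault event analysis"),
    DataSource("NPU device run logs", BOTH, "fault event analysis"),
    DataSource("CANN App logs", BOTH,
               "root cause node analysis and fault event analysis"),
    DataSource("MindIE component logs", INFER, "fault event analysis"),
    DataSource("AMCT logs", COMPRESS, "fault event analysis (by AMCT)"),
    DataSource("MindIE pod console logs", INFER, "root cause node analysis"),
]


def data_types_for(job_type: str) -> List[str]:
    """Return the Table 1 data types whose source includes the given job type."""
    return [row.data_type for row in TABLE_1 if job_type in row.job_types]


for name in data_types_for("inference"):
    print(name)
```

Calling data_types_for("training") instead lists the eight data types whose source is a training job or a training and inference job.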

Basic Application Scenario

Depending on user requirements, the basic application scenario diagnoses only training or inference logs and CANN App logs. These logs are generated by the training or inference jobs themselves and do not need to be collected additionally.

The following figure shows the basic application solution. Install MindCluster Ascend FaultDiag on all training or inference devices. After a training or inference job is complete, each device collects at least the training or inference logs and the CANN App logs. Then use the cleaning function of MindCluster Ascend FaultDiag to filter and extract valid information, dump the original logs and cleaning results of all devices to the same general-purpose device, and use the diagnosis function of MindCluster Ascend FaultDiag on that device to analyze the root cause of the fault. You can also customize fault entities or mask error logs in CANN App logs.

Figure 2 Basic application solution
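The dump step in this solution (gathering the original logs and cleaning results of every device on one general-purpose device) can be sketched as below. This is a minimal illustration under stated assumptions: stage_for_diagnosis is a hypothetical helper, the one-subdirectory-per-device layout is an assumed convention rather than a layout required by MindCluster Ascend FaultDiag, and the device directories are assumed to be reachable as local or mounted paths.

```python
import shutil
from pathlib import Path
from typing import Dict, List


def stage_for_diagnosis(device_dirs: Dict[str, Path], dest: Path) -> List[Path]:
    """Copy each device's output directory into dest/<device name>.

    device_dirs maps a device name to the directory that holds that device's
    original logs and cleaning results (hypothetical layout).
    """
    dest.mkdir(parents=True, exist_ok=True)
    staged: List[Path] = []
    for device, src in sorted(device_dirs.items()):
        if not src.is_dir():
            # Fail early so diagnosis never runs on an incomplete set of devices.
            raise FileNotFoundError(f"no collected output for {device}: {src}")
        target = dest / device
        shutil.copytree(src, target, dirs_exist_ok=True)  # requires Python 3.8+
        staged.append(target)
    return staged
```

Keeping one subdirectory per device avoids file name collisions between devices and makes a missing device obvious before diagnosis starts.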

The following table lists the data sources and usage of metrics and logs to be collected.

For details about how to collect all logs and metrics, see Collecting Logs.

Table 2 Training and inference job logs and metrics

| Data Type | Description | Data Source | Data Usage |
| --- | --- | --- | --- |
| Training job logs | Logs generated by a training job. | Training job | Used for fault event analysis |
| CANN App logs | Run logs generated by CANN. | Training and inference job | Used for root cause node analysis and fault event analysis |
| Inference job logs | Logs generated in the inference process. | Inference job | Used for fault event analysis |