Overview

MindCluster Ascend FaultDiag is a fault diagnosis tool. It cleans the logs produced by training and inference processes, extracts the key information from them, and then uses the cleaned key information of all nodes in a cluster to identify the root cause node and the fault events.

Key Features

MindCluster Ascend FaultDiag provides the following functions:

Log cleaning

If a training or inference job fails, MindCluster Ascend FaultDiag cleans the original logs and monitoring metrics. The cleaning result is dumped to the same path as the original information to provide data support for diagnosis tasks.

Currently, cleaning supports only the original training and inference logs and monitoring metrics: user training and inference logs, CANN App logs, host resource information, and NPU network port resource information.
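
To make the idea of cleaning concrete, the following minimal Python sketch reduces a raw log to the lines that look relevant to a fault. It is not the tool's implementation: the function name clean_log_lines and the keyword pattern are assumptions chosen purely for illustration.

```python
import re

# Hypothetical keywords for illustration only; the rules the real tool
# applies are not described in this document.
KEY_PATTERN = re.compile(r"\b(ERROR|Traceback|timeout)\b", re.IGNORECASE)


def clean_log_lines(lines):
    """Return only the lines that match the illustrative key pattern."""
    return [line.rstrip("\n") for line in lines if KEY_PATTERN.search(line)]


raw = [
    "[INFO] step 100 loss=0.52\n",
    "[ERROR] HCCL connection timeout on rank 3\n",
    "[INFO] step 101 loss=0.51\n",
]
print(clean_log_lines(raw))
```

Only the second line of the sample survives the filter, which mirrors the goal of keeping the information that a later diagnosis step needs.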

Fault diagnosis

MindCluster Ascend FaultDiag offers diagnostic capabilities, including root cause node analysis, fault event analysis, device resource analysis, and network congestion analysis for two types of problems.

Fault type: Abnormal exit of training and inference jobs

Diagnosis content:

  • Root cause node analysis: Locate the root cause node based on the HCCL error information of a cluster.
  • Fault event analysis: Analyze the error of the device where the root cause node resides based on the fault mode contained in the fault knowledge graph.

Fault type: Performance deterioration during training and inference

Diagnosis content:

  • Device resource analysis: Analyze the resource status of devices. Specifically, by analyzing the collected metric files, you can locate problems such as frequency reduction during computing and CPU resource preemption.
  • Network congestion analysis: Analyze the network status between nodes. It is usually used to locate network problems in the Spine + Leaf dual-layer networking scenario. By analyzing the collected metric files of NPU network ports, you can determine whether network congestion occurs on node links.
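
As a rough illustration of root cause node analysis, the Python sketch below picks the node whose first recorded error is the earliest, on the assumption that later errors on other nodes are knock-on effects. This earliest-error heuristic, the NodeError record, and its fields are all hypothetical; this document does not describe how MindCluster Ascend FaultDiag actually ranks nodes.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class NodeError:
    node: str         # hypothetical node identifier
    timestamp: float  # time of the node's first error, in seconds
    message: str      # error text, e.g. an HCCL error line


def earliest_error_node(errors: Iterable[NodeError]) -> Optional[str]:
    """Return the node with the earliest first error, or None if there are none."""
    first = min(errors, key=lambda e: e.timestamp, default=None)
    return first.node if first is not None else None
```

A real diagnosis would need more than timestamps (for example, the error type), so treat this only as a way to picture the input and output of the analysis.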

Note:

Performance deterioration is diagnosed only when a training or inference job does not exit abnormally.
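
Device resource analysis works on collected metric files. As a simplified picture of how frequency reduction might be spotted in such metrics, the Python sketch below flags samples that fall below a fraction of the rated frequency. The function name find_frequency_drops, the 0.9 ratio, and the idea of a flat list of samples are assumptions for illustration, not the tool's actual method or file format.

```python
def find_frequency_drops(samples_mhz, rated_mhz, ratio=0.9):
    """Return the indices of samples that fall below ratio * rated_mhz."""
    threshold = rated_mhz * ratio
    return [i for i, freq in enumerate(samples_mhz) if freq < threshold]


# Example with made-up values: one sample dips well below the rated frequency.
print(find_frequency_drops([1000, 1000, 800, 1000], rated_mhz=1000))
```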

Usage Process

The process of using MindCluster Ascend FaultDiag consists of the following steps.

Procedure: Log collection

Description: If a training or inference job fails or runs abnormally, collect the logs of each training or inference device and store them in the preset structure. For details about the logs to be collected, see Table 1.

Reference: Collecting Logs

Procedure: Log cleaning

Description: After the logs are collected, enable the cleaning function of MindCluster Ascend FaultDiag on each training or inference device to clean the collected original logs and metric data and to extract the valid information.

Reference: Cleaning and Dumping Logs

Procedure: Cleaning result dump

Description: After log cleaning is complete, dump the cleaning results of each training or inference device to a single training device or general-purpose device, and store them there in the preset structure.

Reference: Cleaning and Dumping Logs

Procedure: Fault diagnosis

Description: Based on the dumped cleaning results, enable the diagnosis function of MindCluster Ascend FaultDiag to analyze the root causes of failures or exceptions in training or inference jobs.

Reference: Diagnosing Faults

Note:

In the preceding process, MindCluster Ascend FaultDiag itself does not provide the log collection or cleaning result dump functions. This document offers only reference guidance for these two operations.
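
Because the cleaning result dump is left to the user, one possible approach is to copy each device's cleaning output into a single directory on the machine that will run the diagnosis. The Python sketch below does this with the standard library only. The function name gather_cleaned_results and the one-subdirectory-per-device layout are assumptions; they are not the preset structure the tool expects, for which Cleaning and Dumping Logs is the reference.

```python
import shutil
from pathlib import Path


def gather_cleaned_results(device_dirs, dest):
    """Copy each device's cleaned-output directory into dest/<directory name>."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, device_dirs):
        target = dest / src.name
        # dirs_exist_ok allows the copy to be re-run safely (Python 3.8+).
        shutil.copytree(src, target, dirs_exist_ok=True)
        copied.append(target)
    return copied
```

In a real cluster the source directories would first have to be made reachable from the destination machine, for example through shared storage; that transfer step is outside this sketch.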