Overview
MindCluster Ascend FaultDiag is a fault diagnosis tool. It cleans logs by extracting key information from the logs related to training and inference processes, and then analyzes the root cause node and fault events based on the cleaned key information from all nodes in a cluster.
Key Features
MindCluster Ascend FaultDiag provides the following functions:
Log cleaning
If a training or inference job fails, MindCluster Ascend FaultDiag cleans the original logs and monitoring metrics. The cleaning result is dumped to the same path as the original information to provide data support for diagnosis tasks.
Currently, only original training and inference logs and monitoring metrics can be cleaned, including user training and inference logs, CANN App logs, host resource information, and NPU network port resource information.
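As a rough illustration of what "cleaning" means here (keeping only the lines that carry diagnostic value), the sketch below filters raw log lines with a keyword pattern. It is not the algorithm used by MindCluster Ascend FaultDiag, whose cleaning rules are not described in this document. The pattern `KEY_LINE`, the function `extract_key_lines`, and the sample log lines are all hypothetical and exist only for this example.

```python
import re
from typing import Iterable, List

# Hypothetical pattern for illustration only; the real tool's
# cleaning rules are not described in this document.
KEY_LINE = re.compile(r"\b(ERROR|Traceback|Exception)\b")


def extract_key_lines(lines: Iterable[str]) -> List[str]:
    """Return only the lines that match the illustrative key-line pattern."""
    return [line.rstrip("\n") for line in lines if KEY_LINE.search(line)]


# Made-up sample lines, not real CANN or training log output.
sample = [
    "2024-01-01 10:00:00 INFO step 1 loss=0.9\n",
    "2024-01-01 10:00:05 ERROR device 3 link down\n",
    "Traceback (most recent call last):\n",
]

cleaned = extract_key_lines(sample)
for line in cleaned:
    print(line)
```

The point of the sketch is only that cleaning reduces a large raw log to a small set of lines that a later diagnosis step can work with.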
Fault diagnosis
MindCluster Ascend FaultDiag diagnoses two types of problems. Its diagnostic capabilities include root cause node analysis, fault event analysis, device resource analysis, and network congestion analysis.
| Fault Type | Diagnosis Content |
|---|---|
| Abnormal exit of training and inference jobs | |
| Performance deterioration during training and inference | |

Note: Performance deterioration is diagnosed only when a training or inference job does not exit abnormally.

Usage Process
The following table describes the process of using MindCluster Ascend FaultDiag.
| Procedure | Description | Reference |
|---|---|---|
| Log collection | If a training or inference job fails or runs abnormally, collect the logs of each training or inference device and store them in the preset structure. For details about the logs to be collected, see Table 1. | |
| Log cleaning | After the logs are collected, enable the cleaning function of MindCluster Ascend FaultDiag on each training or inference device to clean the collected original logs and metric data, filtering and extracting the valid information. | |
| Cleaning result dump | After log cleaning is complete, dump and consolidate the cleaning results of each training or inference device onto a single training device or general-purpose device, and store the results in the preset structure. | |
| Fault diagnosis | Based on the dumped cleaning results, enable the diagnosis function of MindCluster Ascend FaultDiag to analyze the root causes of failures or exceptions in training or inference jobs. | |

Note: In the preceding process, MindCluster Ascend FaultDiag does not provide the log collection and cleaning result dump functions. This document provides only reference guidance for these operations.
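
Because the cleaning result dump step is left to the user, a small script can help. The following is a minimal sketch of that step. It assumes that each device's cleaning output is already reachable as a local directory (for example through shared storage) and that one subdirectory per device is an acceptable layout. The actual preset structure is defined elsewhere in the product documentation and is not reproduced here. The function name `gather_cleaning_results`, the device names, and the file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List


def gather_cleaning_results(per_device_dirs: Dict[str, Path], dest_root: Path) -> Path:
    """Copy each device's cleaned output into dest_root/<device_name>/.

    The one-subdirectory-per-device layout is an assumption for this
    sketch, not the tool's documented preset structure.
    """
    dest_root.mkdir(parents=True, exist_ok=True)
    for name, src in per_device_dirs.items():
        # dirs_exist_ok requires Python 3.8 or later.
        shutil.copytree(src, dest_root / name, dirs_exist_ok=True)
    return dest_root


def demo() -> List[str]:
    """Build two fake per-device result directories and gather them."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        sources: Dict[str, Path] = {}
        for name in ("worker-0", "worker-1"):  # hypothetical device names
            src = base / "src" / name
            src.mkdir(parents=True)
            (src / "cleaned.txt").write_text(f"cleaned output of {name}\n")
            sources[name] = src
        dest = gather_cleaning_results(sources, base / "gathered")
        return sorted(
            p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file()
        )


result = demo()
print(result)
```

Keeping one subdirectory per device preserves the origin of every cleaned file, so a later diagnosis step can still attribute each result to its node.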