Usage Process

This section describes the end-to-end fault diagnosis process, using the full-process application scenario as an example. Perform the operations in the order shown in Figure 1 and described in Table 1.

Figure 1 Usage process

Table 1 Operation instructions

Key Operation: Collect logs according to the log collection directory structure, including logs of training and inference jobs, CANN, the host, and NPU resources.
Chapter: Log Collection Directory Structure
Description: Collect cluster platform logs based on the actual situation. The directory structure and collection samples provided in the related chapters are for reference only.

Key Operation: Before training or inference, collect logs and prepare the NPU environment check file.
Chapter: Log Collection Before Training or Inference

Key Operation: During training or inference, collect logs, including information about NPU network ports, NPU status monitoring metrics, host resources, and MindIE Pod logs.
Chapter: Collection During Training or Inference

Key Operation: After training or inference, collect logs, including the NPU environment check file, user training and inference logs, CANN App logs, host OS logs, and device logs.
Chapter: Collection After Training or Inference

Key Operation: (Optional) Customize fault entities.
Chapter: (Optional) Customizing Fault Entities
Description: For details about the command APIs, see Fault Entity Customization.

Key Operation: (Optional) Mask error logs in CANN App logs.
Chapter: (Optional) Masking Fault Logs
Description: For details about the command APIs, see Fault Log Masking.
NOTE: For details about the types of CANN App logs, see CANN Log Reference.

Key Operation: Use the component to clean the collection directory and dump the cleaned logs of each node.
Chapter: Cleaning and Dumping Logs
Description:
  • The referenced chapter uses log cleaning on a single node as an example. In an actual cluster, repeat log cleaning according to the number of nodes.
  • For details about the command APIs, see Log Cleaning.

Key Operation: Use the component to diagnose the log directory produced by cleaning and dumping.
Chapter: Diagnosing Faults
Description: For details about the command APIs, see Fault Diagnosis.
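
The sequence in Table 1, and the note that cleaning must be repeated for every node in a real cluster, can be sketched roughly as follows. This is an illustrative outline only, not the component's API: the step names, the `plan` and `clean_all_nodes` helpers, and the assumption that the collection directory holds one subdirectory per node are all hypothetical. The actual cleaning command is passed in as a callable, so no command name or option is assumed here; see Log Cleaning for the real interface.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    """One row of Table 1 (names are illustrative, not component identifiers)."""
    name: str
    optional: bool = False


# Hypothetical labels mirroring the order of Table 1.
WORKFLOW = [
    Step("lay_out_collection_directory"),
    Step("collect_before_job"),
    Step("collect_during_job"),
    Step("collect_after_job"),
    Step("customize_fault_entities", optional=True),
    Step("mask_fault_logs", optional=True),
    Step("clean_and_dump_logs"),
    Step("diagnose_faults"),
]


def plan(skip_optional: bool = True) -> List[str]:
    """Return the step names in order, leaving out optional steps if requested."""
    return [s.name for s in WORKFLOW if not (skip_optional and s.optional)]


def clean_all_nodes(
    collect_root: Path,
    dump_root: Path,
    clean_one: Callable[[Path, Path], None],
) -> List[Path]:
    """Call clean_one(node_dir, out_dir) once per node directory.

    Assumes (hypothetically) one subdirectory per node under collect_root.
    clean_one stands in for whatever invokes the real cleaning command.
    """
    outputs = []
    for node_dir in sorted(p for p in collect_root.iterdir() if p.is_dir()):
        out_dir = dump_root / node_dir.name
        out_dir.mkdir(parents=True, exist_ok=True)
        clean_one(node_dir, out_dir)
        outputs.append(out_dir)
    return outputs
```

In this sketch, the directories returned by `clean_all_nodes` would be the input to the diagnosis step, which matches the order of the last two rows of Table 1.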