Usage Process
This section uses the full-process application scenario as an example to describe the overall fault diagnosis process. Follow Figure 1 and Table 1 when performing the operations.
| Key Operation | Chapter | Description |
|---|---|---|
| Collect logs, including logs of training and inference jobs, CANN, host, and NPU resources based on the log collection directory. | Collect cluster platform logs based on the actual situation. The directory structure and collection sample provided in related chapters are for reference only. | |
| Collect logs, and prepare the NPU environment check file before training and inference. | | |
| Collect logs, including information about NPU network ports, NPU status monitoring metrics, host resources, and MindIE Pod logs during training and inference. | | |
| Collect logs, including the NPU environment check file after training and inference, user training and inference logs, CANN App logs, host OS logs, and device logs. | | |
| (Optional) Customize fault entities. | For details about the command APIs, see Fault Entity Customization. | |
| (Optional) Mask error logs of CANN App logs. | For details about the command APIs, see Fault Log Masking. | |
| Use the component to clean the collection directory and dump the cleaned logs of each node. | | |
| Use the component to diagnose the log directory after cleaning and dumping. | For details about the command APIs, see Fault Diagnosis. | |
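
The order of operations in Table 1 can be sketched as a small Python model. This is an illustration only: the step names (`collect_logs_before_job`, `clean_and_dump_logs`, and so on) and the `plan` helper are hypothetical labels invented for this example, not commands or APIs of the component. The real command APIs are described in the chapters that the table references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str               # hypothetical label, not a real command name
    optional: bool = False  # True for the rows marked "(Optional)" in Table 1


# Steps in the order they appear in Table 1.
WORKFLOW = (
    Step("collect_logs_before_job"),
    Step("collect_logs_during_job"),
    Step("collect_logs_after_job"),
    Step("customize_fault_entities", optional=True),
    Step("mask_cann_app_error_logs", optional=True),
    Step("clean_and_dump_logs"),
    Step("diagnose_cleaned_logs"),
)


def plan(enabled_optional=frozenset()):
    """Return the ordered step names, skipping optional steps that are not enabled."""
    known_optional = {s.name for s in WORKFLOW if s.optional}
    unknown = set(enabled_optional) - known_optional
    if unknown:
        raise ValueError(f"unknown optional steps: {sorted(unknown)}")
    return [s.name for s in WORKFLOW if not s.optional or s.name in enabled_optional]


if __name__ == "__main__":
    # Example: run the mandatory steps plus log masking.
    for number, name in enumerate(plan({"mask_cann_app_error_logs"}), start=1):
        print(f"{number}. {name}")
```

The model reflects the one ordering constraint the table makes explicit: diagnosis runs on the log directory only after it has been cleaned and dumped, and the optional customization and masking steps are placed before cleaning, as they are listed in the table.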
