Instructions

Suggestions

  • To use the diagnosis function, it is recommended that a cluster contain no more than 128 servers (1024 cards). This recommendation stems from the Linux limit on the number of open file descriptors (1024 by default). If the number of servers exceeds this limit, run the ulimit -n command to raise the upper limit of file descriptors.
  • Do not use pipes when running MindCluster Ascend FaultDiag commands. Otherwise, acquisition of the user IP address and log auditing may be affected.
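
The file descriptor limit mentioned above can also be inspected from Python before a diagnosis run. The following is a minimal sketch using the standard resource module (Unix only). The function names nofile_limits and raise_soft_limit are illustrative and not part of MindCluster Ascend FaultDiag, and the target value in the usage line is an assumption, not a documented requirement.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Raise the soft descriptor limit toward `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards. The limit is never lowered.
    """
    soft, hard = nofile_limits()
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return nofile_limits()[0]


if __name__ == "__main__":
    print("before:", nofile_limits())
    # 4096 is a hypothetical target chosen for illustration only.
    print("soft limit now:", raise_soft_limit(4096))
```

Unlike ulimit -n, which changes the limit for the current shell and the commands started from it, this sketch changes the limit only for the Python process that calls it and for its child processes.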

Applicable Scenarios

  • MindCluster Ascend FaultDiag provides the fault diagnosis capability only for training and inference jobs on servers with a full card configuration. In other scenarios, root cause locating may fail or return an incorrect result.
  • MindCluster Ascend FaultDiag supports only IPv4 addresses.
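
Because only IPv4 addresses are supported, it can help to screen server addresses before collecting logs. The sketch below uses the standard ipaddress module. The helper name is_ipv4 and the sample addresses are hypothetical and are not taken from the tool.

```python
import ipaddress


def is_ipv4(address):
    """Return True if `address` is a syntactically valid IPv4 address string."""
    try:
        ipaddress.IPv4Address(address)
    except ValueError:
        # AddressValueError is a subclass of ValueError.
        return False
    return True


if __name__ == "__main__":
    # Hypothetical addresses used only to illustrate the check.
    for candidate in ["192.168.0.10", "fe80::1", "256.0.0.1"]:
        print(candidate, is_ipv4(candidate))
```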

System Time Description

  • Synchronize the system time of each training or inference server. If the system time is inconsistent, the analysis result may be inaccurate.
  • Synchronize the system time of the host on each training or inference server with that of the device. If the system time is inconsistent, the analysis result may be inaccurate.
  • If a container is used to execute training or inference jobs, synchronize the system time of the host machine with that of the training or inference container. If the system time is inconsistent, the analysis result may be inaccurate.
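
One way to sanity-check the synchronization requirements above is to compare clock readings sampled at roughly the same moment on the host, the device, and the container. The sketch below assumes those readings have already been collected by some other means. The function names, the one-second tolerance, and the sample readings are assumptions made for illustration; this document does not state an acceptable skew.

```python
from datetime import datetime, timedelta


def max_clock_skew(readings):
    """Return the largest difference between any two clock readings.

    `readings` maps a label (for example "host") to a datetime sampled
    at approximately the same instant on that system.
    """
    times = list(readings.values())
    return max(times) - min(times)


def is_synchronized(readings, tolerance=timedelta(seconds=1)):
    """Return True if all readings fall within `tolerance` of each other.

    The default tolerance is a hypothetical value, not a documented threshold.
    """
    return max_clock_skew(readings) <= tolerance


if __name__ == "__main__":
    # Hypothetical readings used only to illustrate the check.
    sample = {
        "host": datetime(2024, 1, 1, 12, 0, 0),
        "device": datetime(2024, 1, 1, 12, 0, 0, 400000),
        "container": datetime(2024, 1, 1, 12, 0, 3),
    }
    print("skew:", max_clock_skew(sample))
    print("synchronized:", is_synchronized(sample))
```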

Version Mapping of Fault Diagnosis Logs

Table 1 Software versions corresponding to logs

| Log File | Software | Software Version | Description |
| --- | --- | --- | --- |
| CANN App logs | CANN | 7.0.RC1 or later | Host App logs and device App logs printed by CANN. For more information, see "Viewing Logs (Ascend EP)" in CANN Log Reference. |
| Training and inference logs of the PyTorch framework | PyTorch 1.11.0 Adapter plugin | 5.0.RC3 or later | - |
| Training logs of the MindSpore framework | MindSpore | 2.1.0 or later | The description of some fault types contains the description of the corresponding MindSpore version. Refer to the actual fault diagnosis. |
| Training logs of the TensorFlow framework | TensorFlow | - | Only user-defined TensorFlow faults are supported. |
| Host OS logs | - | - | Host OS logs of CentOS 7.6, Debian 10.0, EulerOS 2.10, EulerOS 2.12, CTyunOS 22.06, and other systems can be detected. The keywords in logs may vary according to the operating system. It is recommended that the host OS log size be less than 512 MB. |
| Device logs | Ascend HDK | 23.0.RC3 or later | - |
| MindCluster component logs | Ascend Device Plugin, NodeD, Ascend Docker Runtime, NPU Exporter, Volcano | 6.0.RC3 or later | - |
| MindIE component logs | MindIE Server, MindIE LLM, MindIE SD, MindIE RT, MindIE Torch, MindIE MS, MindIE Benchmark, MindIE Client | 6.0.0 or later | - |
| AMCT logs | AMCT | 7.0.RC1 or later | AMCT is integrated into the CANN package for release. For more information, see AMCT User Guide. |
| MindIE Pod console logs | MindIE Pod console logs | - | - |
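
The "or later" requirements in Table 1 can be checked programmatically once the installed versions are known. The sketch below compares version strings of the form used in the table, such as 7.0.RC1 or 2.1.0. It assumes that a release candidate component (RCn) sorts before a numeric component in the same position, so that 7.0.RC1 precedes 7.0.0; that ordering is an assumption and is not stated in this document. The names version_key, meets_minimum, and MINIMUM_VERSIONS are illustrative, and the sketch does not cover how the installed versions are obtained.

```python
import re

# Minimum versions copied from Table 1.
MINIMUM_VERSIONS = {
    "CANN": "7.0.RC1",
    "MindSpore": "2.1.0",
    "Ascend HDK": "23.0.RC3",
    "AMCT": "7.0.RC1",
}


def version_key(version):
    """Turn a version string such as "7.0.RC1" into a sortable tuple.

    Assumption: an "RCn" component sorts before any numeric component
    in the same position.
    """
    key = []
    for part in version.strip().split("."):
        rc = re.fullmatch(r"RC([0-9]+)", part, flags=re.IGNORECASE)
        if rc:
            key.append((0, int(rc.group(1))))
        elif re.fullmatch(r"[0-9]+", part):
            key.append((1, int(part)))
        else:
            raise ValueError(f"unsupported version component: {part!r}")
    return tuple(key)


def meets_minimum(installed, minimum):
    """Return True if `installed` is the same as or later than `minimum`."""
    return version_key(installed) >= version_key(minimum)


if __name__ == "__main__":
    # Hypothetical installed version used only to illustrate the check.
    print(meets_minimum("7.0.RC2", MINIMUM_VERSIONS["CANN"]))
```

Version strings with other suffixes raise ValueError in this sketch rather than being compared by guesswork.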