Instruction
Suggestions
- To use the diagnosis function, keep the cluster at 128 servers (1024 cards) or fewer, because Linux limits the number of open file descriptors per process (1024 by default). If the cluster is larger than this, run the ulimit -n command to raise the file descriptor limit.
- Do not use pipes (|) when running MindCluster Ascend FaultDiag commands. Otherwise, user IP address acquisition and log auditing may be affected.
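The file descriptor limit mentioned above can be checked before running a diagnosis. The following is a minimal sketch, not part of MindCluster Ascend FaultDiag: the helper names are hypothetical, and the estimate of one descriptor per card (8 cards per server, derived from 128 servers = 1024 cards) is an assumption for illustration only.

```python
import resource  # Unix-only standard library module


def required_descriptors(servers, cards_per_server=8):
    """Rough estimate: one descriptor per card (assumption, not a documented rule)."""
    return servers * cards_per_server


def fd_limit_ok(required, soft_limit):
    """Return True if the soft limit covers the required descriptor count."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= required


# Read the current per-process limits, i.e. what `ulimit -n` reports.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
need = required_descriptors(servers=128)
if fd_limit_ok(need, soft):
    print(f"soft limit {soft} is sufficient for about {need} descriptors")
else:
    print(f"soft limit {soft} is below {need}; raise it with ulimit -n")
```

If the check fails, raising the limit with ulimit -n in the same shell session, before starting the diagnosis, is the adjustment this section describes.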
Applicable Scenarios
- MindCluster Ascend FaultDiag provides fault diagnosis only for training and inference jobs on servers with a full card configuration. In other scenarios, root cause localization may be incorrect or may fail.
- MindCluster Ascend FaultDiag supports only IPv4 addresses.
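Because only IPv4 addresses are supported, it can help to screen node addresses before a diagnosis run. The sketch below uses Python's standard ipaddress module; the function name is_ipv4 and the sample addresses are hypothetical and are not part of the tool.

```python
import ipaddress


def is_ipv4(address: str) -> bool:
    """Return True only if the string parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(address)
        return True
    except ipaddress.AddressValueError:
        return False


# Hypothetical node addresses, for illustration only.
nodes = ["192.168.1.10", "fe80::1", "256.0.0.1"]
unsupported = [addr for addr in nodes if not is_ipv4(addr)]
print("addresses that are not valid IPv4:", unsupported)
```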
System Time Description
- Synchronize the system time across all training or inference servers. If the system times differ, the analysis result may be inaccurate.
- On each training or inference server, synchronize the system time of the host with that of the device. If the system times differ, the analysis result may be inaccurate.
- If training or inference jobs run in a container, synchronize the system time of the host with that of the training or inference container. If the system times differ, the analysis result may be inaccurate.
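The three synchronization requirements above all reduce to comparing clocks sampled at about the same moment. The following sketch flags clocks that drift from a reference. It assumes the timestamps (epoch seconds) have already been collected by some other means; the function names, the sample values, and the 1-second tolerance are illustrative assumptions, not values documented for the tool.

```python
def clock_offsets(samples, reference):
    """Return each clock's offset in seconds relative to the reference clock."""
    ref = samples[reference]
    return {name: ts - ref for name, ts in samples.items()}


def out_of_sync(samples, reference, tolerance_s=1.0):
    """List the clocks whose offset from the reference exceeds the tolerance."""
    offsets = clock_offsets(samples, reference)
    return sorted(name for name, off in offsets.items() if abs(off) > tolerance_s)


# Hypothetical readings taken at roughly the same instant.
samples = {"host": 1000.0, "device": 1000.2, "container": 1003.5}
print(out_of_sync(samples, reference="host"))
```

Any name the function returns identifies a clock that should be resynchronized before the logs are analyzed.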
Version Mapping of Fault Diagnosis Logs