Service Process

In most cases, troubleshooting consists of three stages: information collection, fault locating, and fault rectification. When an alarm occurs, collect the fault information, identify the root cause of the fault, and then rectify it.

Troubleshooting is the process of using appropriate methods to find the cause of a fault and rectify it. The approach is to progressively narrow down the scope of possible causes, which reduces troubleshooting complexity, until the root cause is identified and the fault can be rectified.

Figure 1 shows the troubleshooting process.

Figure 1 Troubleshooting process

As shown in the preceding figure, if the problem is caused by a cluster scheduling component, you need to identify which component is at fault. The following symptoms indicate the faulty component:

If the training vcjob exists but the corresponding pod does not, Volcano is faulty. In this case, check the Volcano logs.
If the resource allocation is incorrect (no corresponding resource is displayed in the pod details, or the NPU resource does not have a timestamp), Volcano is faulty. In this case, check the Volcano logs.
Check the pod details. If NPU resource allocation information is displayed but the corresponding NPU details are not, or if the NPU resource registration is incorrect in the Kubernetes node details, the Ascend Device Plugin is faulty. In this case, further analysis is required.
Check the content of the ConfigMap (CM). If the value of status is initializing, the HCCL-Controller is faulty and requires further analysis.
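
The symptom checks above can be read as a small decision procedure. The following Python sketch is illustrative only: it does not query the cluster, and the Observations class, its field names, and the suspect_component function are hypothetical names introduced here. It assumes you have already gathered the facts by hand (for example, from the pod details, the node details, and the CM) and that the checks are evaluated in the order listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observations:
    """Facts collected manually from the cluster (all field names are hypothetical)."""
    vcjob_exists: bool              # the training vcjob is present
    pod_exists: bool                # the pod that belongs to the vcjob is present
    pod_shows_npu_allocation: bool  # NPU resource, with a timestamp, appears in the pod details
    pod_shows_npu_details: bool     # the corresponding NPU details appear in the pod details
    node_npu_registration_ok: bool  # NPU resource is registered correctly in the node details
    cm_status: Optional[str]        # value of "status" in the CM, or None if not read


def suspect_component(obs: Observations) -> Optional[str]:
    """Return the component to investigate first, or None if no symptom matches."""
    # vcjob without a pod: Volcano did not create the pod.
    if obs.vcjob_exists and not obs.pod_exists:
        return "Volcano"
    # Missing resource or missing timestamp in the pod details: allocation went wrong.
    if not obs.pod_shows_npu_allocation:
        return "Volcano"
    # Allocation is shown, but NPU details are missing or node registration is wrong.
    if not obs.pod_shows_npu_details or not obs.node_npu_registration_ok:
        return "Ascend Device Plugin"
    # The CM is still marked as initializing.
    if obs.cm_status == "initializing":
        return "HCCL-Controller"
    return None


if __name__ == "__main__":
    obs = Observations(
        vcjob_exists=True,
        pod_exists=True,
        pod_shows_npu_allocation=True,
        pod_shows_npu_details=False,
        node_npu_registration_ok=True,
        cm_status=None,
    )
    print(suspect_component(obs))  # prints "Ascend Device Plugin"
```

The function returns only the first matching component. This keeps the sketch aligned with the idea of narrowing down the scope step by step, but a real fault may involve more than one component, so treat the result as a starting point for log analysis rather than a verdict.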
