Training Job Is in the Pending State Because "nodes are unavailable"

Symptom

After a vcjob training job is delivered, it does not start running.

  1. Run the kubectl get pod --all-namespaces command to check whether the pod to which the training job belongs is in the Pending state, as shown in the following figure.

  2. Run the kubectl describe pod sasa-resnet1-acc-default-test-0 -n vcjob command to view the pod details. The Events field reports the following error: all nodes are unavailable: 1 node annotations(7) not same node idle(8).
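The two checks above can be sketched as a small shell helper. This is a minimal sketch, not part of the product: the function name extract_npu_mismatch is hypothetical, and it assumes the scheduler message has exactly the wording shown in step 2. It reads kubectl describe pod output on standard input and prints the two counts from the message.

```shell
# Hypothetical helper (not part of Volcano or MindX DL): reads the output of
# `kubectl describe pod` on stdin and prints the two counts from the message
# "node annotations(N) not same node idle(M)" as "N M".
# Assumption: the message wording matches the example in step 2.
extract_npu_mismatch() {
  sed -n 's/.*node annotations(\([0-9][0-9]*\)) not same node idle(\([0-9][0-9]*\)).*/\1 \2/p' | head -n 1
}

# Usage on a cluster (pod name and namespace are the ones from the example):
#   kubectl get pod --all-namespaces | grep Pending
#   kubectl describe pod sasa-resnet1-acc-default-test-0 -n vcjob | extract_npu_mismatch
```

If the helper prints nothing, the pod is pending for a reason other than the one described in this section.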

Cause Analysis

The number of unused NPUs on the node differs from the number of unused NPUs recorded in the node's Annotations (7 and 8 in the message above). Volcano treats this mismatch as a sign that the system is unstable and does not allocate NPU resources.

To confirm the mismatch, run the kubectl describe nodes command and compare the huawei.com/Ascend910: field in Allocated resources with the same field in Annotations of the node.
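Because kubectl describe nodes prints a long report, it can help to keep only the lines that mention the NPU resource. The following is a minimal sketch; the function name filter_ascend910 is hypothetical, and <node-name> is a placeholder for the node that hosts the pending pod.

```shell
# Hypothetical helper: keeps only the lines of `kubectl describe nodes`
# output that mention the Ascend910 resource, prefixed with their line
# numbers so the Annotations and Allocated resources entries can be located.
filter_ascend910() {
  grep -n 'huawei.com/Ascend910'
}

# Usage on a cluster (<node-name> is a placeholder):
#   kubectl describe nodes <node-name> | filter_ascend910
```

The filter only narrows the output; the values in Annotations and Allocated resources still need to be compared by eye.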

Possible causes are that Kubernetes is running slowly because of a large number of tasks, or that Ascend Device Plugin was started in an incorrect mode.

Solution

Reinstall Ascend Device Plugin. For details, see Ascend Device Plugin.