Training Job Is in the Pending State Because "nodes are unavailable"

Symptom

After a vcjob training job is delivered, it does not start running.

  1. Run the kubectl get pod --all-namespaces command to check whether the pod that belongs to the training job is in the Pending state, as shown in the following figure.

  2. Run the kubectl describe pod sasa-resnet1-acc-default-test-0 -n vcjob command to view the pod details. The Events field reports the following error: all nodes are unavailable: 1 node annotations(7) not same node idle(8).
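The two numbers in the error message are the values being compared, so it can help to extract them when scanning many events. The following is a minimal sketch, assuming the message keeps the exact wording quoted in step 2; the function name parse_mismatch and the regular expression are illustrative and are not part of Volcano or MindX DL.

```python
import re

# Pattern for the scheduler message quoted in step 2, for example
# "1 node annotations(7) not same node idle(8)".
# Assumption: the wording of the message is stable across versions.
MISMATCH_RE = re.compile(r"node annotations\((\d+)\) not same node idle\((\d+)\)")


def parse_mismatch(event_message):
    """Return (annotation_count, idle_count), or None if the text does not match."""
    match = MISMATCH_RE.search(event_message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


message = "all nodes are unavailable: 1 node annotations(7) not same node idle(8)."
print(parse_mismatch(message))  # (7, 8)
```

In this example, the Annotations record 7 NPUs while the scheduler counts 8 idle NPUs on the node.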

Causes

The number of unused NPUs on the node differs from the number of unused NPUs recorded in the Annotations of the node. Volcano therefore considers the system unstable and does not allocate NPU resources.

To confirm this, run the kubectl describe nodes command and compare the huawei.com/Ascend910: field under Allocated resources with the same field under Annotations of the node.

According to the command output, Ascend Device Plugin was started in an incorrect mode, and Kubernetes runs slowly when the number of jobs is large.
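The comparison described above can be sketched as follows. This is an illustration only, under two assumptions that are not confirmed by this document: that the idle count equals the allocatable huawei.com/Ascend910 count minus the allocated count, and that the annotation value is a comma-separated list of device names. The function names and the sample values are hypothetical; the values are chosen to reproduce the 7 versus 8 mismatch from the error message.

```python
def count_annotation_devices(annotation_value):
    """Count device names in an annotation value.

    Assumption: the value is a comma-separated list of device names.
    """
    return sum(1 for name in annotation_value.split(",") if name.strip())


def check_node(allocatable, allocated, annotation_value):
    """Return (idle, annotated, consistent) for one node.

    Assumption: idle NPUs = allocatable NPUs - allocated NPUs.
    """
    idle = allocatable - allocated
    annotated = count_annotation_devices(annotation_value)
    return idle, annotated, idle == annotated


# Hypothetical node: 8 allocatable NPUs, none allocated, 7 listed in the annotation.
annotation = ",".join(f"Ascend910-{i}" for i in range(7))
print(check_node(8, 0, annotation))  # (8, 7, False)
```

A result whose last element is False corresponds to the state in which Volcano refuses to schedule onto the node.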

Solution

Reinstall Ascend Device Plugin. For details, see the installation-related content in the MindX DL Cluster Scheduling User Guide.