Checking the Status of a vcjob Training Job

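A training job created through Volcano may not meet the requirements for resources, labels, or models. When that happens, the vcjob or its pods stay in the Pending or Failed state. The following steps show how to locate the cause.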

  1. List the vcjobs.

    kubectl get vcjob --all-namespaces

    root@ubuntu:~# kubectl get vcjob --all-namespaces
    NAMESPACE   NAME                AGE
    xxx-test    mindx-dls-npu-16p   42s
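
    The default listing above shows only the name and age of each vcjob. As a minimal sketch, the phase can be added as an extra column with kubectl's custom-columns output. The helper name list_vcjob_phases is hypothetical, and the field path .status.state.phase is an assumption based on the Status > State > Phase layout printed by kubectl describe in the next step.

```shell
#!/bin/sh
# Hypothetical helper: list every vcjob together with its current phase.
# Assumption: the phase is stored at .status.state.phase in the Job object.
list_vcjob_phases() {
    kubectl get vcjob --all-namespaces \
        -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:.status.state.phase'
}

# Usage (requires access to the cluster):
#   list_vcjob_phases
```

    A job that reports Pending or Failed in the PHASE column is a candidate for the describe step that follows.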

  2. View the details of a vcjob.

    kubectl describe vcjob -n xxx-test mindx-dls-npu-16p

    In this command:

    • xxx-test: replace with the actual namespace.
    • mindx-dls-npu-16p: replace with the name of the vcjob to inspect.

    root@ubuntu:~# kubectl describe vcjob -n xxx-test mindx-dls-npu-16p
    Name:         mindx-dls-npu-16p
    Namespace:    xxx-test
    Labels:       ring-controller.atlas=ascend-910
    Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                    {"apiVersion":"batch.volcano.sh/v1alpha1","kind":"Job","metadata":{"annotations":{},"labels":{"ring-controller.atlas":"ascend-910"},"name"...
    API Version:  batch.volcano.sh/v1alpha1
    Kind:         Job
    ...
    Status:
      Min Available:  2
      State:
        Last Transition Time:  2021-03-09T02:43:04Z
        Phase:                 Failed
        Reason:                all nodes are unavailable: 1 node(s) resource fit failed, 1 task(mindx-dls-npu-16p-default-1p-1) in node(ubuntu-65-157):no matching label on this node key[host-arch] : task(huawei-x86) node(huawei-arm) conf(huawei-arm|huawei-x86).
      Version:                 3
    Events:                    <none>

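    As shown above, the vcjob (mindx-dls-npu-16p) is in the Failed phase, and the Reason field describes the cause. Here it reports that no node carries a matching host-arch label: the task requests huawei-x86, while node ubuntu-65-157 is labeled huawei-arm.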
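
    Because the describe output is long, it can help to print only the Phase and Reason lines. The following is a minimal sketch that filters the output with awk; the function name vcjob_phase_reason is hypothetical, and the filter assumes the "Phase:" and "Reason:" field labels shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl describe vcjob` output on stdin and
# print only the Phase and Reason lines, with leading indentation removed.
vcjob_phase_reason() {
    awk '/^[ \t]*(Phase|Reason):/ { sub(/^[ \t]+/, ""); print }'
}

# Usage (requires access to the cluster):
#   kubectl describe vcjob -n xxx-test mindx-dls-npu-16p | vcjob_phase_reason
# To compare against the labels actually set on each node:
#   kubectl get nodes -L host-arch
```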

  3. List the pods.

    kubectl get pod --all-namespaces

    root@ubuntu:~# kubectl get pod --all-namespaces
    NAMESPACE        NAME                                       READY   STATUS      RESTARTS   AGE
    xxx-test         mindx-dls-npu-1-1p-default-1p-0            0/1     Pending     0          2m18s
    xxx-test         mindx-dls-npu-1p-default-1p-0              1/1     Running     0          2m40s
    xxx-test         mindx-dls-npu-2p-default-1p-0              1/1     Running     0          2m42s
    xxx-test         mindx-dls-npu-4p-default-1p-0              1/1     Running     0          2m45s
    xxx-test         mindx-dls-npu-8p-default-1p-0              1/1     Running     0          2m49s
    ...
    npu-exporter     npu-exporter-jwq5l                         1/1     Running     0          9h
    vcjob            mindx-dls-test-default-test-0              1/1     Running     0          5m3s
    volcano-system   volcano-controllers-68769b787f-2wgw7       1/1     Running     0          15h
    volcano-system   volcano-scheduler-768ddcd774-f9w5w         1/1     Running     0          15h
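
    On a busy cluster most pods are healthy, so the listing can be narrowed to the ones that need attention. The sketch below keeps the header row and drops pods whose status is Running or Completed. The function name not_running_pods is hypothetical, and the filter assumes the --all-namespaces column layout shown above, where STATUS is the fourth column.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pod --all-namespaces` output on stdin
# and keep the header plus every pod that is neither Running nor Completed.
# Assumption: STATUS is the 4th whitespace-separated column.
not_running_pods() {
    awk 'NR == 1 || ($4 != "Running" && $4 != "Completed")'
}

# Usage (requires access to the cluster):
#   kubectl get pod --all-namespaces | not_running_pods
```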

  4. View the details of a pod.

    kubectl describe pod -n xxx-test mindx-dls-npu-1-1p-default-1p-0

    In this command:

    • xxx-test: replace with the actual namespace.
    • mindx-dls-npu-1-1p-default-1p-0: replace with the name of the pod to inspect.

    root@ubuntu:~# kubectl describe pod -n xxx-test mindx-dls-npu-1-1p-default-1p-0 
    Name:           mindx-dls-npu-1-1p-default-1p-0
    Namespace:      xxx-test
    Priority:       0
    Node:           <none>
    Labels:         app=tf
                    ring-controller.atlas=ascend-910
                    volcano.sh/job-name=mindx-dls-npu-1-1p
                    volcano.sh/job-namespace=xxx-test
    Annotations:    scheduling.k8s.io/group-name: mindx-dls-npu-1-1p
                    volcano.sh/job-name: mindx-dls-npu-1-1p
                    volcano.sh/job-version: 0
                    volcano.sh/task-spec: default-1p
    Status:         Pending
    ...
    Node-Selectors:  host-arch=huawei-x86
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                     node.kubernetes.io/unreachable:NoExecute for 300s
    Events:
      Type     Reason            Age   From     Message
      ----     ------            ----  ----     -------
      Warning  FailedScheduling  103s  volcano  all nodes are unavailable: 2 node(s) resource fit failed.

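    As shown above, the pod (mindx-dls-npu-1-1p-default-1p-0) is in the Pending state, and the Events section describes the cause. Here Volcano reports FailedScheduling because neither of the two nodes passed the resource fit check.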
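
    The Events section sits at the end of the describe output, so it can be printed on its own. The following is a minimal sketch; the function name pod_events is hypothetical, and the filter assumes the section starts at a line beginning with "Events:" as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl describe pod` output on stdin and print
# everything from the "Events:" line to the end of the output.
pod_events() {
    awk '/^[ \t]*Events:/ { show = 1 } show'
}

# Usage (requires access to the cluster):
#   kubectl describe pod -n xxx-test mindx-dls-npu-1-1p-default-1p-0 | pod_events
```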