Checking the Running Status

Procedure

Check the pod status on the master node. Ensure that the STATUS column shows Running for every pod of the training job.

Example of a single-server single-processor training job

root@ubuntu:~# kubectl get pod --all-namespaces -o wide
NAMESPACE        NAME                                       READY   STATUS              RESTARTS   AGE     IP                NODE           NOMINATED NODE   READINESS GATES
...
vcjob            mindx-dls-test-default-test-0             1/1     Running            0          4m      192.168.243.198   ubuntu         <none>           <none>
...

Example of a 2 x 8P distributed training job running on two training nodes

root@ubuntu:~# kubectl get pod --all-namespaces -o wide
NAMESPACE        NAME                                       READY   STATUS              RESTARTS   AGE     IP                NODE           NOMINATED NODE   READINESS GATES
...
vcjob            mindx-dls-test-default-test-0             1/1     Running            0          3m      192.168.243.198   ubuntu         <none>           <none>
vcjob            mindx-dls-test-default-test-1             1/1     Running            0          3m      192.168.243.199   ubuntu         <none>           <none>
...
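The status check in this step can also be scripted. The following is a minimal sketch, not part of the product tooling. The function name pods_not_running is hypothetical, and the sketch assumes the column layout shown in the examples above (NAMESPACE, NAME, READY, and STATUS as the first four columns).

```shell
# Hypothetical helper: reads the output of
#   kubectl get pod --all-namespaces -o wide
# on stdin and prints "<namespace>/<name> <status>" for every pod in the
# namespace given as $1 whose STATUS is not Running.
# Returns 0 only if at least one pod was found and all of them are Running.
pods_not_running() {
  awk -v ns="$1" '
    NR == 1 { next }                       # skip the header row
    $1 == ns {
      seen = 1
      if ($4 != "Running") { print $1 "/" $2, $4; bad = 1 }
    }
    END { exit ((bad || !seen) ? 1 : 0) }
  '
}
```

For example, kubectl get pod --all-namespaces -o wide | pods_not_running vcjob prints nothing and returns 0 when every pod in the vcjob namespace is Running.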

Run the following command on the master node to check the NPU allocation of the worker node:

kubectl describe nodes {Name_of_the_node_where_the_job_is_running}

Example of a single-server single-processor training job

root@ubuntu:/home/test/yaml# kubectl describe nodes
Name:               ubuntu
Roles:              master,worker
Labels:             accelerator=huawei-Ascend910
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource              Requests        Limits
  --------              --------        ------
  cpu                   37250m (19%)    37500m (19%)
  memory                117536Mi (15%)  119236Mi (15%)
  ephemeral-storage     0 (0%)          0 (0%)
  huawei.com/Ascend910  1               1
Events:                 <none>

A value of 1 for huawei.com/Ascend910 under Allocated resources indicates that one processor is allocated to the training job.

Example of one of the two training nodes running a 2 x 8P distributed training job

root@ubuntu:/home/test/yaml# kubectl describe nodes
Name:               ubuntu
Roles:              master,worker
Labels:             accelerator=huawei-Ascend910
                    beta.kubernetes.io/arch=arm64
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource              Requests        Limits
  --------              --------        ------
  cpu                   37250m (19%)    37500m (19%)
  memory                117536Mi (15%)  119236Mi (15%)
  ephemeral-storage     0 (0%)          0 (0%)
  huawei.com/Ascend910  8               8
Events:                 <none>

A value of 8 for huawei.com/Ascend910 under Allocated resources indicates that all eight processors on the node are allocated to the distributed training job.
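To read the allocated NPU count without scanning the full output, the value can be extracted with a small filter. This is a sketch only. The function name allocated_npus is hypothetical, and it assumes the Allocated resources layout shown in the examples above, where the section ends at the Events line.

```shell
# Hypothetical helper: reads `kubectl describe nodes` output on stdin and
# prints the Requests value of huawei.com/Ascend910 from the
# "Allocated resources" section. If the input describes several nodes,
# one value is printed per node, in order.
allocated_npus() {
  awk '
    /^Allocated resources:/ { in_alloc = 1; next }   # section starts
    /^Events:/              { in_alloc = 0 }         # section ends
    in_alloc && $1 == "huawei.com/Ascend910" { print $2 }
  '
}
```

For example, kubectl describe nodes ubuntu | allocated_npus would print 1 for the single-processor job and 8 for the 2 x 8P job in the outputs above.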

View the NPU usage of a pod.

In this example, run the kubectl describe pod mindx-dls-test-default-test-0 -n vcjob command to view the NPUs assigned to the pod and the running status of the pod.

Example of a single-server single-processor training job. The job is normal if the huawei.com/Ascend910 annotation lists the allocated processor (Ascend910-3 in this output) and Status is Running.

root@ubuntu:/home/test/yaml# kubectl describe pod mindx-dls-test-default-test-0 -n vcjob
Name:         mindx-dls-test-default-test-0
Namespace:    vcjob
Priority:     0
Node:         ubuntu/XXX.XXX.XXX.XXX
Start Time:   Wed, 30 Sep 2020 15:38:22 +0800
Labels:       app=tf
              ring-controller.atlas=ascend-910
              volcano.sh/job-name=mindx-dls-test
              volcano.sh/job-namespace=vcjob
Annotations:  atlas.kubectl.kubernetes.io/ascend-910-configuration:
                {"pod_name":"0","server_id":"XXX.XXX.XXX.XXX","devices":[{"device_id":"3","device_ip":"192.168.20.102"}...
              cni.projectcalico.org/podIP: 192.168.243.195/32
              cni.projectcalico.org/podIPs: 192.168.243.195/32
              huawei.com/Ascend910: Ascend910-3
              huawei.com/AscendReal: Ascend910-3
              huawei.com/kltDev: Ascend910-3
              predicate-time: 18446744073709551615
              scheduling.k8s.io/group-name: mindx-dls-test
              volcano.sh/job-name: mindx-dls-test
              volcano.sh/job-version: 0
              volcano.sh/task-spec: default-test
Status:       Running

Example of a 2 x 8P distributed training job running on two training nodes. The job is normal if the huawei.com/Ascend910 annotation lists all eight processors (Ascend910-0 to Ascend910-7) and Status is Running.

root@ubuntu:/home/test/yaml# kubectl describe pod mindx-dls-test-default-test-0 -n vcjob
Name:         mindx-dls-test-default-test-0
Namespace:    vcjob
Priority:     0
Node:         ubuntu/XXX.XXX.XXX.XXX
Start Time:   Wed, 30 Sep 2020 15:38:22 +0800
Labels:       app=tf
              ring-controller.atlas=ascend-910
              volcano.sh/job-name=mindx-dls-test
              volcano.sh/job-namespace=vcjob
Annotations:  atlas.kubectl.kubernetes.io/ascend-910-configuration:
                {"pod_name":"0","server_id":"XXX.XXX.XXX.XXX","devices":[{"device_id":"0","device_ip":"192.168.20.100"}...
              cni.projectcalico.org/podIP: 192.168.243.195/32
              cni.projectcalico.org/podIPs: 192.168.243.195/32
              huawei.com/Ascend910: Ascend910-0,Ascend910-1,Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7
              huawei.com/AscendReal: Ascend910-1,Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7,Ascend910-0
              huawei.com/kltDev: Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7,Ascend910-0,Ascend910-1,Ascend910-2
              predicate-time: 18446744073709551615
              scheduling.k8s.io/group-name: mindx-dls-test
              volcano.sh/job-name: mindx-dls-test
              volcano.sh/job-version: 0
              volcano.sh/task-spec: default-test
Status:       Running
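The devices assigned to a pod can be listed one per line by filtering the describe output for the huawei.com/Ascend910 annotation. This is a sketch under assumptions: the function name pod_npu_devices is hypothetical, and it relies on the annotation format shown in the examples above (a comma-separated list of Ascend910-N names). Resource limit lines that use the same key but carry a number are ignored.

```shell
# Hypothetical helper: reads `kubectl describe pod` output on stdin and
# prints each device named in the huawei.com/Ascend910 annotation on its
# own line. Only values that start with "Ascend910-" are accepted, so a
# line such as "huawei.com/Ascend910: 1" in a limits section is skipped.
pod_npu_devices() {
  awk '
    {
      for (i = 1; i < NF; i++)
        if ($i == "huawei.com/Ascend910:" && $(i + 1) ~ /^Ascend910-/) {
          n = split($(i + 1), dev, ",")
          for (j = 1; j <= n; j++) print dev[j]
        }
    }
  '
}
```

For example, kubectl describe pod mindx-dls-test-default-test-0 -n vcjob | pod_npu_devices | wc -l would report the number of NPUs assigned to that pod.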
