Viewing Job Progress

Procedure

  1. On the management node, check the status of the job's pods and confirm that every pod is in the Running state.
    Run the following command:
    kubectl get pod --all-namespaces -o wide
    • Command output example of a single-server single-processor training job
      NAMESPACE        NAME                                       READY   STATUS              RESTARTS   AGE     IP                NODE           NOMINATED NODE   READINESS GATES
      ...
      vcjob            mindx-dls-test-default-test-0             1/1     Running            0          4m      192.168.243.198   ubuntu         <none>           <none>
      ...
      
    • Command output example of two training nodes running a 2 × 8P distributed training job
      NAMESPACE        NAME                                       READY   STATUS              RESTARTS   AGE     IP                NODE           NOMINATED NODE   READINESS GATES
      ...
      vcjob            mindx-dls-test-default-test-0             1/1     Running            0          3m      192.168.243.198   ubuntu         <none>           <none>
      vcjob            mindx-dls-test-default-test-1             1/1     Running            0          3m      192.168.243.199   ubuntu         <none>           <none>
      ...
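
The check in this step can be scripted. The following is a minimal sketch, not part of the product: the function name pods_not_running is hypothetical, and it assumes the column layout shown above (NAMESPACE, NAME, READY, STATUS), which is what --all-namespaces produces. It reads the command output on standard input and prints every pod whose status is not Running.

```shell
# Hypothetical helper: list pods whose STATUS column is not "Running".
# Expects `kubectl get pod --all-namespaces -o wide` output on stdin,
# with columns NAMESPACE NAME READY STATUS ... (header on line 1).
pods_not_running() {
    awk 'NR > 1 && NF >= 4 && $4 != "Running" { print $1 "/" $2 " " $4 }'
}

# Usage on the management node:
#   kubectl get pod --all-namespaces -o wide | pods_not_running
```

If the function prints nothing, all listed pods are in the Running state.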
      
  2. Run the following command on the management node to check the NPU allocation of the compute nodes:
    kubectl describe nodes {Name_of_each_node_where_the_job_is_running}
    • Command output example of a single-server single-processor training job using the full NPU scheduling feature.
      Name:               ubuntu
      Roles:              master,worker
      Labels:             accelerator=huawei-Ascend910
      ...
      Allocated resources:
        (Total limits may be over 100 percent, i.e., overcommitted.)
        Resource              Requests        Limits
        --------              --------        ------
        cpu                   37250m (19%)    37500m (19%)
        memory                117536Mi (15%)  119236Mi (15%)
        ephemeral-storage     0 (0%)          0 (0%)
        huawei.com/Ascend910  1               1
      Events:                 <none>
      

      A value of 1 for huawei.com/Ascend910 under Allocated resources indicates that one NPU is used for training.

    • Command output example of a single-server single-processor training job using the static vNPU scheduling feature.
      Name:               ubuntu
      Roles:              master,worker
      Labels:             accelerator=huawei-Ascend910
      ...
      Allocated resources:
        (Total limits may be over 100 percent, i.e., overcommitted.)
        Resource              Requests        Limits
        --------              --------        ------
        cpu                   37250m (19%)    37500m (19%)
        memory                117536Mi (15%)  119236Mi (15%)
        ephemeral-storage     0 (0%)          0 (0%)
        huawei.com/Ascend910-2c  1               1
      Events:                 <none>
      

      A value of 1 for huawei.com/Ascend910-2c under Allocated resources indicates that one vNPU containing two AI Cores is used for training.

    • Command output example for one of the two training nodes running a 2 × 8P distributed training job. (Static vNPU scheduling does not support distributed training jobs.)
      Name:               ubuntu
      Roles:              master,worker
      Labels:             accelerator=huawei-Ascend910
                          beta.kubernetes.io/arch=arm64
      ...
      Allocated resources:
        (Total limits may be over 100 percent, i.e., overcommitted.)
        Resource              Requests        Limits
        --------              --------        ------
        cpu                   37250m (19%)    37500m (19%)
        memory                117536Mi (15%)  119236Mi (15%)
        ephemeral-storage     0 (0%)          0 (0%)
        huawei.com/Ascend910  8               8
      Events:                 <none>

      A value of 8 for huawei.com/Ascend910 under Allocated resources indicates that all NPUs on the node are used for distributed training.
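
To read only the NPU rows from a long node description, the output can be filtered. The following is a minimal sketch under stated assumptions: the function name npu_allocation is hypothetical, and the filter relies on the "Allocated resources:" and "Events:" labels appearing as in the examples above. It prints the resource name, requests, and limits for each huawei.com/ resource in that section.

```shell
# Hypothetical helper: print the huawei.com/* rows from the
# "Allocated resources" section of `kubectl describe nodes` output (stdin).
npu_allocation() {
    awk '
        /Allocated resources:/ { sec = 1; next }   # section starts here
        $1 == "Events:"        { sec = 0 }         # section ends here
        sec && $1 ~ /^huawei\.com\// { print $1, $2, $3 }
    '
}

# Usage on the management node (replace the placeholder with a node name):
#   kubectl describe nodes {Name_of_each_node_where_the_job_is_running} | npu_allocation
```

Rows under Capacity or Allocatable are skipped because they appear before the Allocated resources section.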

  3. View the NPU usage of a pod.

    This example uses the kubectl describe pod mindx-dls-test-default-test-0 -n vcjob command to view the pod's details, including the NPUs assigned to it.

    • Example of a single-server single-processor training job. If the following information in bold is displayed, the job is normal.
      root@ubuntu:/home/test/yaml# kubectl describe pod mindx-dls-test-default-test-0 -n vcjob
      Name:         mindx-dls-test-default-test-0
      Namespace:    vcjob
      Priority:     0
      Node:         ubuntu/XXX.XXX.XXX.XXX
      Start Time:   Wed, 30 Sep 2020 15:38:22 +0800
      Labels:       app=tf
                    ring-controller.atlas=ascend-910
                    volcano.sh/job-name=mindx-dls-test
                    volcano.sh/job-namespace=vcjob
      Annotations:  ascend.kubectl.kubernetes.io/ascend-910-configuration:
                      {"pod_name":"0","server_id":"xx-xx-xx-xx","devices":[{"device_id":"3","device_ip":"192.168.20.102"}...
                    cni.projectcalico.org/podIP: 192.168.243.195/32
                    cni.projectcalico.org/podIPs: 192.168.243.195/32
                    huawei.com/Ascend910: Ascend910-3
                    huawei.com/AscendReal: Ascend910-3
                    huawei.com/kltDev: Ascend910-3
                    predicate-time: 18446744073709551615
                    scheduling.k8s.io/group-name: mindx-dls-test
                    volcano.sh/job-name: mindx-dls-test
                    volcano.sh/job-version: 0
                    volcano.sh/task-spec: default-test
      Status:       Running
    • Example of two training nodes running a 2 × 8P distributed training job. If the following information in bold is displayed, the job is normal.
      root@ubuntu:/home/test/yaml# kubectl describe pod mindx-dls-test-default-test-0 -n vcjob
      Name:         mindx-dls-test-default-test-0
      Namespace:    vcjob
      Priority:     0
      Node:         ubuntu/XXX.XXX.XXX.XXX
      Start Time:   Wed, 30 Sep 2020 15:38:22 +0800
      Labels:       app=tf
                    ring-controller.atlas=ascend-910
                    volcano.sh/job-name=mindx-dls-test
                    volcano.sh/job-namespace=vcjob
      Annotations:  ascend.kubectl.kubernetes.io/ascend-910-configuration:
                      {"pod_name":"0","server_id":"xx-xx-xx-xx","devices":[{"device_id":"0","device_ip":"192.168.20.100"}...
                    cni.projectcalico.org/podIP: 192.168.243.195/32
                    cni.projectcalico.org/podIPs: 192.168.243.195/32
                    huawei.com/Ascend910: Ascend910-0,Ascend910-1,Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7
                    huawei.com/AscendReal: Ascend910-1,Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7,Ascend910-0
                    huawei.com/kltDev: Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7,Ascend910-0,Ascend910-1,Ascend910-2
                    predicate-time: 18446744073709551615
                    scheduling.k8s.io/group-name: mindx-dls-test
                    volcano.sh/job-name: mindx-dls-test
                    volcano.sh/job-version: 0
                    volcano.sh/task-spec: default-test
      Status:       Running
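
The number of NPUs assigned to a pod can be counted from the huawei.com/Ascend910 annotation shown in the examples above. The following is a minimal sketch: the function name pod_npu_count is hypothetical, and it assumes the annotation value is a comma-separated device list on the same line as its key, as in the sample output.

```shell
# Hypothetical helper: count the devices listed in the
# huawei.com/Ascend910 annotation of `kubectl describe pod` output (stdin).
pod_npu_count() {
    awk '
        {
            # Scan every field so the key is found even when it shares
            # a line with the "Annotations:" label.
            for (i = 1; i < NF; i++) {
                if ($i == "huawei.com/Ascend910:") {
                    print split($(i + 1), devs, ",")
                    exit
                }
            }
        }
    '
}

# Usage on the management node:
#   kubectl describe pod mindx-dls-test-default-test-0 -n vcjob | pod_npu_count
```

For the two examples above, the expected counts are 1 for the single-processor job and 8 for each pod of the 2 × 8P job. The function prints nothing if the annotation is absent.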