Viewing the Task Process
Procedure
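- On the management node, check the status of the task Pods and make sure that every Pod is in the Running state. Run the following command to view the Pods:
  kubectl get pod --all-namespaces -o wide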
  - Example output for a single-node, single-chip training task:
      NAMESPACE   NAME                            READY   STATUS    RESTARTS   AGE   IP                NODE     NOMINATED NODE   READINESS GATES
      ...
      vcjob       mindx-dls-test-default-test-0   1/1     Running   0          4m    192.168.243.198   ubuntu   <none>           <none>
      ...
  - Example output for a distributed training task that runs on two training nodes with 2 x 8 chips:
      NAMESPACE   NAME                            READY   STATUS    RESTARTS   AGE   IP                NODE     NOMINATED NODE   READINESS GATES
      ...
      vcjob       mindx-dls-test-default-test-0   1/1     Running   0          3m    192.168.243.198   ubuntu   <none>           <none>
      vcjob       mindx-dls-test-default-test-1   1/1     Running   0          3m    192.168.243.199   ubuntu   <none>           <none>
      ...
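When many Pods are listed, scanning the STATUS column by eye is error-prone. The following is a minimal sketch of a filter for that check. The function name pods_not_running is hypothetical (it is not part of kubectl or MindX DL), and the column positions are assumed from the sample output above.

```shell
# Hypothetical helper, not part of kubectl or MindX DL.
# Reads the output of `kubectl get pod --all-namespaces -o wide` on stdin and
# prints "NAME STATUS" for every Pod in the given namespace that is not Running.
pods_not_running() {
  # Assumed column layout (taken from the sample output):
  #   $1 = NAMESPACE, $2 = NAME, $4 = STATUS. NR > 1 skips the header row.
  awk -v ns="$1" 'NR > 1 && $1 == ns && $4 != "Running" { print $2, $4 }'
}

# Usage on the management node (prints nothing when all Pods are Running):
#   kubectl get pod --all-namespaces -o wide | pods_not_running vcjob
```

An empty result means that every Pod of the namespace is in the Running state, which is the condition this step requires.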
- Check the NPU allocation on the compute node. Run the following command on the management node:
  kubectl describe nodes {name of the node running the task}
  - Example output for a single-node, single-chip training task that uses whole-card scheduling:
      Name:               ubuntu
      Roles:              master,worker
      Labels:             accelerator=huawei-Ascend910
      ...
      Allocated resources:
        (Total limits may be over 100 percent, i.e., overcommitted.)
        Resource              Requests          Limits
        --------              --------          ------
        cpu                   37250m (19%)      37500m (19%)
        memory                117536Mi (15%)    119236Mi (15%)
        ephemeral-storage     0 (0%)            0 (0%)
        huawei.com/Ascend910  1                 1
      Events:             <none>
    Under Allocated resources, the value of huawei.com/Ascend910 is 1, which indicates that the training task uses one NPU.
  - Example output for a single-node, single-chip training task that uses static vNPU scheduling:
      Name:               ubuntu
      Roles:              master,worker
      Labels:             accelerator=huawei-Ascend910
      ...
      Allocated resources:
        (Total limits may be over 100 percent, i.e., overcommitted.)
        Resource                 Requests          Limits
        --------                 --------          ------
        cpu                      37250m (19%)      37500m (19%)
        memory                   117536Mi (15%)    119236Mi (15%)
        ephemeral-storage        0 (0%)            0 (0%)
        huawei.com/Ascend910-2c  1                 1
      Events:             <none>
    Under Allocated resources, the value of huawei.com/Ascend910-2c is 1, which indicates that the training task uses one vNPU that contains two AI Cores.
  - Example output for one of the nodes when two training nodes run a 2 x 8-chip distributed training task. Static vNPU scheduling does not support distributed training tasks.
      Name:               ubuntu
      Roles:              master,worker
      Labels:             accelerator=huawei-Ascend910
                          beta.kubernetes.io/arch=arm64
      ...
      Allocated resources:
        (Total limits may be over 100 percent, i.e., overcommitted.)
        Resource              Requests          Limits
        --------              --------          ------
        cpu                   37250m (19%)      37500m (19%)
        memory                117536Mi (15%)    119236Mi (15%)
        ephemeral-storage     0 (0%)            0 (0%)
        huawei.com/Ascend910  8                 8
      Events:             <none>
    Under Allocated resources, the value of huawei.com/Ascend910 is 8, which indicates that the distributed training task uses all NPUs on the node.
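The node description is long, so it can help to pull out only the allocated count of one resource. The sketch below assumes the layout of the sample outputs above: the resource table is indented under the Allocated resources heading, and the next unindented field (Events in the samples) ends that section. The function name allocated_resource is hypothetical.

```shell
# Hypothetical helper, not part of kubectl or MindX DL.
# Reads `kubectl describe nodes <node>` output on stdin and prints the Requests
# value of the named resource from the "Allocated resources" section.
# Prints 0 when the resource is not listed in that section.
allocated_resource() {
  awk -v res="$1" '
    /^Allocated resources:/ { in_alloc = 1; next }      # section starts here
    in_alloc && /^[^ ]/     { in_alloc = 0 }            # next unindented field ends it
    in_alloc && $1 == res   { print $2; found = 1 }     # $2 is the Requests column
    END { if (!found) print 0 }
  '
}

# Usage on the management node; replace <node> with the node running the task:
#   kubectl describe nodes <node> | allocated_resource huawei.com/Ascend910
#   kubectl describe nodes <node> | allocated_resource huawei.com/Ascend910-2c
```

For the three sample outputs above, the expected values are 1 for huawei.com/Ascend910 with whole-card scheduling, 1 for huawei.com/Ascend910-2c with static vNPU scheduling, and 8 for huawei.com/Ascend910 on a node of the distributed task.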
- Check the NPU usage of the Pod.
  In this example, the kubectl describe pod mindx-dls-test-default-test-0 -n vcjob command is used to view the running Pod.
  - Example for a single-node, single-chip training task. The content in bold indicates that the task is running normally.
      root@ubuntu:/home/test/yaml# kubectl describe pod mindx-dls-test-default-test-0 -n vcjob
      Name:         mindx-dls-test-default-test-0
      Namespace:    vcjob
      Priority:     0
      Node:         ubuntu/XXX.XXX.XXX.XXX
      Start Time:   Wed, 30 Sep 2020 15:38:22 +0800
      Labels:       app=tf
                    ring-controller.atlas=ascend-910
                    volcano.sh/job-name=mindx-dls-test
                    volcano.sh/job-namespace=vcjob
      Annotations:  atlas.kubectl.kubernetes.io/ascend-910-configuration:
                      {"pod_name":"0","server_id":"xx.xx.xx.xx","devices":[{"device_id":"3","device_ip":"192.168.20.102"}...
                    cni.projectcalico.org/podIP: 192.168.243.195/32
                    cni.projectcalico.org/podIPs: 192.168.243.195/32
                    huawei.com/Ascend910: Ascend910-3
                    huawei.com/AscendReal: Ascend910-3
                    huawei.com/kltDev: Ascend910-3
                    predicate-time: 18446744073709551615
                    scheduling.k8s.io/group-name: mindx-dls-test
                    volcano.sh/job-name: mindx-dls-test
                    volcano.sh/job-version: 0
                    volcano.sh/task-spec: default-test
      Status:       Running
  - Example for a distributed training task that runs on two training nodes with 2 x 8 chips. The content in bold indicates that the task is running normally.
      root@ubuntu:/home/test/yaml# kubectl describe pod mindx-dls-test-default-test-0 -n vcjob
      Name:         mindx-dls-test-default-test-0
      Namespace:    vcjob
      Priority:     0
      Node:         ubuntu/XXX.XXX.XXX.XXX
      Start Time:   Wed, 30 Sep 2020 15:38:22 +0800
      Labels:       app=tf
                    ring-controller.atlas=ascend-910
                    volcano.sh/job-name=mindx-dls-test
                    volcano.sh/job-namespace=vcjob
      Annotations:  atlas.kubectl.kubernetes.io/ascend-910-configuration:
                      {"pod_name":"0","server_id":"XXX.XXX.XXX.XXX","devices":[{"device_id":"0","device_ip":"192.168.20.100"}...
                    cni.projectcalico.org/podIP: 192.168.243.195/32
                    cni.projectcalico.org/podIPs: 192.168.243.195/32
                    huawei.com/Ascend910: Ascend910-0,Ascend910-1,Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7
                    huawei.com/AscendReal: Ascend910-1,Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7,Ascend910-0
                    huawei.com/kltDev: Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7,Ascend910-0,Ascend910-1,Ascend910-2
                    predicate-time: 18446744073709551615
                    scheduling.k8s.io/group-name: mindx-dls-test
                    volcano.sh/job-name: mindx-dls-test
                    volcano.sh/job-version: 0
                    volcano.sh/task-spec: default-test
      Status:       Running
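To compare the NPUs assigned to a Pod with the number the task requested, the device list can be extracted from the Pod description. This is a sketch only: it assumes that the huawei.com/Ascend910 annotation holds a comma-separated device list, as in the two samples above, and the function name pod_npus is hypothetical.

```shell
# Hypothetical helper, not part of kubectl or MindX DL.
# Reads `kubectl describe pod <pod> -n <namespace>` output on stdin and prints
# the devices listed in the huawei.com/Ascend910 annotation, one per line.
pod_npus() {
  awk '
    # Match only values that look like device names (Ascend910-<n>), so that
    # other lines using the same key with a plain number are ignored.
    $1 == "huawei.com/Ascend910:" && $2 ~ /^Ascend910-/ {
      n = split($2, ids, ",")
      for (i = 1; i <= n; i++) print ids[i]
    }
  '
}

# Usage on the management node:
#   kubectl describe pod mindx-dls-test-default-test-0 -n vcjob | pod_npus
#   kubectl describe pod mindx-dls-test-default-test-0 -n vcjob | pod_npus | wc -l
```

For the samples above, the list contains one device (Ascend910-3) for the single-chip task and eight devices (Ascend910-0 to Ascend910-7) for each Pod of the distributed task.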