The overall troubleshooting approach is shown in the following figure:
The troubleshooting logic in the figure is as follows:
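1. kubectl describe node [nodeName]
Check that Labels contains the label for the corresponding NPU model and that the quantity listed under Allocatable is sufficient.
An example is shown below: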
root@ubuntu:~# kubectl describe node ubuntu-185
Name:               ubuntu-185
Roles:              worker
Labels:             accelerator=huawei-Ascend910
                    accelerator-type=card
                    ...
Annotations:        huawei.com/Ascend910: Ascend910-0,Ascend910-1
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    ...
Capacity:
  cpu:                    56
  ephemeral-storage:      431259672Ki
  huawei.com/Ascend910:   2
  ...
Allocatable:
  cpu:                    56
  ephemeral-storage:      397448913058
  huawei.com/Ascend910:   2
  ...
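The same check can be scripted. The following is a minimal sketch, not part of the product tooling: it assumes the node object has been exported with kubectl get node [nodeName] -o json and loaded into a Python dict. The function name check_node and the sample data are illustrative; the label and resource names are taken from the example output above.

```python
from typing import List


def check_node(node: dict, label_key: str, label_value: str,
               resource: str, needed: int) -> List[str]:
    """Return a list of problems found on the node (empty if none)."""
    problems = []

    # Labels live under metadata.labels in the node object.
    labels = node.get("metadata", {}).get("labels") or {}
    if labels.get(label_key) != label_value:
        problems.append(
            f"label {label_key}={label_value} not found "
            f"(actual value: {labels.get(label_key)!r})"
        )

    # Allocatable quantities are reported as strings, e.g. "2".
    allocatable = node.get("status", {}).get("allocatable") or {}
    available = int(allocatable.get(resource, "0"))
    if available < needed:
        problems.append(
            f"{resource}: {available} allocatable, {needed} needed"
        )
    return problems


# Sample data mirroring the describe output above (illustrative only).
sample_node = {
    "metadata": {
        "name": "ubuntu-185",
        "labels": {
            "accelerator": "huawei-Ascend910",
            "accelerator-type": "card",
        },
    },
    "status": {
        "allocatable": {"cpu": "56", "huawei.com/Ascend910": "2"},
    },
}

if __name__ == "__main__":
    for line in check_node(sample_node, "accelerator", "huawei-Ascend910",
                           "huawei.com/Ascend910", 2):
        print(line)
```

An empty result means both the label and the allocatable quantity look as expected for the requested number of NPUs.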
2. kubectl describe ds [dsName] -n kube-system
Check that the Node-Selector of the DaemonSet (ds) matches the node label found in step 1.
An example is shown below:
root@ubuntu:/home/yaml# kubectl describe ds ascend-device-plugin-daemonset -n kube-system
Name:           ascend-device-plugin-daemonset
Selector:       name=ascend-device-plugin-ds
Node-Selector:  accelerator=huawei-Ascend910
Labels:         <none>
Annotations:    deprecated.daemonset.template.generation: 1
                kubectl.kubernetes.io/last-applied-configuration:
                  {"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"name":"ascend-device-plugin-daemonset","namespace":"kube-system"}...
...
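Comparing the Node-Selector with the node labels by eye is error-prone when there are many labels. The sketch below shows one way to do the comparison, assuming the selector and labels have already been read into dicts (for example from spec.template.spec.nodeSelector of the DaemonSet and metadata.labels of the node in the -o json output). The function name selector_mismatches is illustrative.

```python
from typing import Dict, List


def selector_mismatches(node_selector: Dict[str, str],
                        node_labels: Dict[str, str]) -> List[str]:
    """List every selector entry that the node labels do not satisfy.

    A DaemonSet pod is only scheduled onto a node when every
    key/value pair in the node selector is present in the node labels.
    """
    mismatches = []
    for key, wanted in node_selector.items():
        actual = node_labels.get(key)
        if actual is None:
            mismatches.append(f"{key}: label missing on node")
        elif actual != wanted:
            mismatches.append(
                f"{key}: selector wants {wanted!r}, node has {actual!r}"
            )
    return mismatches


# Values taken from the two example outputs above.
ds_selector = {"accelerator": "huawei-Ascend910"}
node_labels = {
    "accelerator": "huawei-Ascend910",
    "accelerator-type": "card",
}

if __name__ == "__main__":
    result = selector_mismatches(ds_selector, node_labels)
    print("OK" if not result else "\n".join(result))
```

Extra labels on the node are ignored, because a node selector only requires that its own entries be present.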
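3. kubectl describe po [podName] -n kube-system
Replace kube-system with the namespace of the pod being checked; the example below uses vcjob.
An example is shown below: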
root@ubuntu:/home/yaml# kubectl describe po mindx-dls-2p-default-2p-0 -n vcjob
Name:         mindx-dls-2p-default-2p-0
Namespace:    vcjob
Priority:     0
Node:         ubuntu/xx.xx.xx.xx
Start Time:   Wed, 23 Dec 2020 22:14:27 -0500
Labels:       app=tf
              ring-controller.atlas=ascend-910
              volcano.sh/job-name=mindx-dls-2p
              volcano.sh/job-namespace=vcjob
Annotations:  ascend.kubectl.kubernetes.io/ascend-910-configuration:
                {"pod_name":"0","server_id":"xx.xx.xx.xx","devices":[{"device_id":"1","device_ip":"192.168.101.100"},{"device_id":"4","device_ip":"192.16...
              cni.projectcalico.org/podIP: 192.168.194.75/32
              cni.projectcalico.org/podIPs: 192.168.194.75/32
              huawei.com/Ascend910: Ascend910-0,Ascend910-1
              predicate-time: 18446744073709551615
              scheduling.k8s.io/group-name: mindx-dls-2p
              volcano.sh/job-name: mindx-dls-2p
              volcano.sh/job-version: 0
              volcano.sh/task-spec: default-2p
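In this example, the value of the Annotations field ascend.kubectl.kubernetes.io/ascend-910-configuration is complete and correct. If it is not, see "hccl.json File Is Not Generated" to locate and resolve the problem.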
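Because kubectl describe shortens long annotation values (the value above ends in "..."), a scripted check is easier against the full pod object from kubectl get po [podName] -n [namespace] -o json. The sketch below is an assumption about what "complete and correct" could mean in practice: the value parses as JSON and contains the fields visible in the example (pod_name, server_id, and a devices list whose entries have device_id and device_ip). The function name and the sample pod are illustrative, and the real requirements may be stricter.

```python
import json
from typing import List

ANNOTATION_KEY = "ascend.kubectl.kubernetes.io/ascend-910-configuration"


def check_ascend_annotation(pod: dict) -> List[str]:
    """Return problems found in the Ascend 910 configuration annotation."""
    annotations = pod.get("metadata", {}).get("annotations") or {}
    raw = annotations.get(ANNOTATION_KEY)
    if raw is None:
        return [f"annotation {ANNOTATION_KEY} is missing"]

    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"annotation is not valid JSON: {exc}"]
    if not isinstance(config, dict):
        return ["annotation is not a JSON object"]

    problems = []
    # Assumed required top-level fields, based on the example output.
    for field in ("pod_name", "server_id"):
        if not config.get(field):
            problems.append(f"field {field} is missing or empty")

    devices = config.get("devices")
    if not isinstance(devices, list) or not devices:
        problems.append("devices list is missing or empty")
        return problems

    for index, device in enumerate(devices):
        for field in ("device_id", "device_ip"):
            if not isinstance(device, dict) or not device.get(field):
                problems.append(
                    f"devices[{index}].{field} is missing or empty"
                )
    return problems


# Illustrative pod object; only the first device from the example is
# used, because the rest of the value is cut off in the output above.
sample_pod = {
    "metadata": {
        "annotations": {
            ANNOTATION_KEY: json.dumps({
                "pod_name": "0",
                "server_id": "xx.xx.xx.xx",
                "devices": [
                    {"device_id": "1", "device_ip": "192.168.101.100"},
                ],
            }),
        },
    },
}

if __name__ == "__main__":
    issues = check_ascend_annotation(sample_pod)
    print("OK" if not issues else "\n".join(issues))
```

A non-empty result indicates the annotation was not written correctly, which is the case covered by the hccl.json troubleshooting entry referenced above.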