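The device-plugin is mainly responsible for:
1. NPU device discovery
2. NPU device allocation
The overall troubleshooting approach is shown in the following figure: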

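The troubleshooting logic in the figure is as follows:
kubectl describe node [nodeName]
Check that Labels contains the label for the corresponding NPU model and that the device count under Allocatable is sufficient.
Example output: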
root@ubuntu:~# kubectl describe node ubuntu-185
Name: ubuntu-185
Roles: worker
Labels: accelerator=huawei-Ascend910
accelerator-type=card
...
Annotations: huawei.com/Ascend910: Ascend910-0,Ascend910-1
kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
...
Capacity:
cpu: 56
ephemeral-storage: 431259672Ki
huawei.com/Ascend910: 2
...
Allocatable:
cpu: 56
ephemeral-storage: 397448913058
huawei.com/Ascend910: 2
...
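The node check above can also be scripted. The following is a minimal sketch, not part of the device-plugin itself: it assumes the node object is read with kubectl get node [nodeName] -o json, and the names fetch_node, check_node and sample_node are hypothetical helpers introduced only for illustration. The label and resource name are taken from the example output above. Note that Allocatable is the total the node can offer, not the number of NPUs currently free.

```python
import json
import subprocess

# Values taken from the example output above; adjust for other NPU models.
RESOURCE = "huawei.com/Ascend910"
EXPECTED_LABEL = ("accelerator", "huawei-Ascend910")


def fetch_node(node_name):
    """Return the node object as a dict via `kubectl get node -o json`."""
    out = subprocess.run(
        ["kubectl", "get", "node", node_name, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def check_node(node, requested):
    """Return a list of problems found on the node; an empty list means OK."""
    problems = []
    labels = node.get("metadata", {}).get("labels", {})
    key, value = EXPECTED_LABEL
    if labels.get(key) != value:
        problems.append(f"label {key}={value} is missing (found {labels.get(key)!r})")
    allocatable = node.get("status", {}).get("allocatable", {})
    available = int(allocatable.get(RESOURCE, "0"))
    if available < requested:
        problems.append(f"{RESOURCE}: allocatable {available} < requested {requested}")
    return problems


# Sample data mirroring the describe output above (not read from a cluster).
sample_node = {
    "metadata": {"labels": {"accelerator": "huawei-Ascend910",
                            "accelerator-type": "card"}},
    "status": {"allocatable": {"cpu": "56", "huawei.com/Ascend910": "2"}},
}
print(check_node(sample_node, requested=2))  # label present, 2 NPUs are enough
print(check_node(sample_node, requested=4))  # reports insufficient NPUs
```

On a real cluster, check_node(fetch_node("ubuntu-185"), requested=2) would run the same check against live data.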
kubectl describe ds [dsName] -n kube-system
Check whether the Node-Selector of the DaemonSet matches the node label shown by kubectl describe node.
Example output:
root@ubuntu:/home/yaml# kubectl describe ds ascend-device-plugin-daemonset -n kube-system
Name: ascend-device-plugin-daemonset
Selector: name=ascend-device-plugin-ds
Node-Selector: accelerator=huawei-Ascend910
Labels: <none>
Annotations: deprecated.daemonset.template.generation: 1
kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"name":"ascend-device-plugin-daemonset","namespace":"kube-system"}...
...
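The comparison between the DaemonSet Node-Selector and the node labels can be sketched as follows. This is an illustrative example under assumptions: the DaemonSet is read with kubectl get ds [dsName] -n kube-system -o json, where the selector is stored at spec.template.spec.nodeSelector, and the function names and sample data are hypothetical.

```python
def daemonset_node_selector(ds):
    """Extract spec.template.spec.nodeSelector from a DaemonSet object (dict)."""
    pod_spec = ds.get("spec", {}).get("template", {}).get("spec", {})
    return pod_spec.get("nodeSelector") or {}


def unsatisfied_selector(node_selector, node_labels):
    """Return the selector entries that the node labels do not satisfy."""
    return {k: v for k, v in node_selector.items() if node_labels.get(k) != v}


# Sample data mirroring the describe outputs (not read from a cluster).
sample_ds = {
    "spec": {"template": {"spec": {
        "nodeSelector": {"accelerator": "huawei-Ascend910"},
    }}},
}
sample_labels = {"accelerator": "huawei-Ascend910", "accelerator-type": "card"}

selector = daemonset_node_selector(sample_ds)
# An empty dict means the DaemonSet can schedule its pod onto this node.
print(unsatisfied_selector(selector, sample_labels))
```

If the result is not empty, the device-plugin pod is not scheduled onto the node, so the node reports no NPU resources.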
kubectl describe po [podName] -n [namespace]
Check the Annotations of the job pod. In the following example the pod is in the vcjob namespace:
root@ubuntu:/home/yaml# kubectl describe po mindx-dls-2p-default-2p-0 -n vcjob
Name: mindx-dls-2p-default-2p-0
Namespace: vcjob
Priority: 0
Node: ubuntu/51.38.67.231
Start Time: Wed, 23 Dec 2020 22:14:27 -0500
Labels: app=tf
ring-controller.atlas=ascend-910
volcano.sh/job-name=mindx-dls-2p
volcano.sh/job-namespace=vcjob
Annotations: ascend.kubectl.kubernetes.io/ascend-910-configuration:
{"pod_name":"0","server_id":"51.38.67.231","devices":[{"device_id":"1","device_ip":"192.168.101.100"},{"device_id":"4","device_ip":"192.16...
cni.projectcalico.org/podIP: 192.168.194.75/32
cni.projectcalico.org/podIPs: 192.168.194.75/32
huawei.com/Ascend910: Ascend910-0,Ascend910-1
predicate-time: 18446744073709551615
scheduling.k8s.io/group-name: mindx-dls-2p
volcano.sh/job-name: mindx-dls-2p
volcano.sh/job-version: 0
volcano.sh/task-spec: default-2p
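In a healthy pod, the value of the Annotations field ascend.kubectl.kubernetes.io/ascend-910-configuration is complete and correct, as in the example. If it is not, see the troubleshooting topic "hccl.json file is not generated" to locate and resolve the problem.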
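A scripted version of this annotation check might look like the sketch below. The source does not define "complete and correct", so the rules here are assumptions: the value must be valid JSON, pod_name and server_id must be non-empty, and each entry in devices must carry a device_id and a parsable device_ip. The function name check_configuration and the sample values (addresses from the 192.0.2.0/24 documentation range) are hypothetical; the annotations dict is assumed to come from metadata.annotations in kubectl get po [podName] -n [namespace] -o json.

```python
import ipaddress
import json

ANNOTATION = "ascend.kubectl.kubernetes.io/ascend-910-configuration"


def check_configuration(annotations, expected_devices):
    """Return a list of problems with the configuration annotation."""
    raw = annotations.get(ANNOTATION)
    if not raw:
        return [f"annotation {ANNOTATION} is missing or empty"]
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"annotation is not valid JSON: {err}"]
    if not isinstance(config, dict):
        return ["annotation is not a JSON object"]

    problems = []
    for field in ("pod_name", "server_id"):
        if not config.get(field):
            problems.append(f"field {field!r} is missing or empty")
    devices = config.get("devices") or []
    if len(devices) != expected_devices:
        problems.append(f"expected {expected_devices} devices, found {len(devices)}")
    for index, device in enumerate(devices):
        if not device.get("device_id"):
            problems.append(f"devices[{index}]: device_id is missing")
        try:
            ipaddress.ip_address(str(device.get("device_ip", "")))
        except ValueError:
            problems.append(f"devices[{index}]: device_ip is not a valid address")
    return problems


# Hypothetical sample values, not copied from a real pod.
good = {ANNOTATION: json.dumps({
    "pod_name": "0",
    "server_id": "192.0.2.10",
    "devices": [{"device_id": "0", "device_ip": "192.0.2.100"},
                {"device_id": "1", "device_ip": "192.0.2.101"}],
})}
truncated = {ANNOTATION: '{"pod_name":"0","devices":['}

print(check_configuration(good, expected_devices=2))       # no problems
print(check_configuration(truncated, expected_devices=2))  # invalid JSON reported
```

The expected_devices argument would normally be the number of NPUs the pod requested, which in the example above is 2.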