Fault Locating
The Device Plugin is responsible for:
1. NPU device discovery
2. NPU device allocation
The following figure shows how to locate the fault.

In this figure, the problem handling logic is as follows:
- Check the node status.
kubectl describe node [nodeName]
Check whether Labels contains the label of the corresponding NPU model (accelerator=huawei-Ascend910 in the example below) and whether Allocatable reports enough NPUs (huawei.com/Ascend910) for the workload.
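As a rough sketch of how this check could be scripted, the Python snippet below inspects the JSON form of the node (as returned by `kubectl get node [nodeName] -o json`) instead of the describe text. The function name `check_node`, the `required` parameter, and the default label and resource names (copied from the example node in this section) are assumptions for illustration and are not part of Ascend Device Plugin.

```python
def check_node(node, label_key="accelerator", label_value="huawei-Ascend910",
               resource="huawei.com/Ascend910", required=1):
    """Return a list of problems found in a node object (parsed JSON)."""
    problems = []

    # The NPU model label must be present with the expected value.
    labels = node.get("metadata", {}).get("labels") or {}
    if labels.get(label_key) != label_value:
        problems.append(f"label {label_key}={label_value} not found")

    # Extended-resource quantities are strings in the API, for example "2".
    allocatable = (node.get("status", {}).get("allocatable") or {}).get(resource, "0")
    if int(allocatable) < required:
        problems.append(f"{resource}: allocatable {allocatable}, required {required}")

    return problems


# Sample values copied from the example node in this section.
sample_node = {
    "metadata": {"labels": {"accelerator": "huawei-Ascend910",
                            "accelerator-type": "card"}},
    "status": {"allocatable": {"huawei.com/Ascend910": "2"}},
}
print(check_node(sample_node))  # prints []
```

An empty list means both conditions hold. Pass a larger `required` value to test whether the node can satisfy a job that requests more NPUs.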
The following is an example of the command output:
    root@ubuntu:~# kubectl describe node ubuntu-185
    Name:               ubuntu-185
    Roles:              worker
    Labels:             accelerator=huawei-Ascend910
                        accelerator-type=card
                        ...
    Annotations:        huawei.com/Ascend910: Ascend910-0,Ascend910-1
                        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                        ...
    Capacity:
      cpu:                   56
      ephemeral-storage:     431259672Ki
      huawei.com/Ascend910:  2
      ...
    Allocatable:
      cpu:                   56
      ephemeral-storage:     397448913058
      huawei.com/Ascend910:  2
      ...
- Check the daemonset status of Ascend Device Plugin.
kubectl describe ds [dsName] -n kube-system
Check whether the Node-Selector of the daemonset matches the node label checked in the first step (Check the node status).
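The comparison can also be sketched in Python, assuming both objects have been fetched as JSON (`kubectl get ds [dsName] -n kube-system -o json` and `kubectl get node [nodeName] -o json`). The helper name `selector_mismatches` is hypothetical; the field paths `spec.template.spec.nodeSelector` and `metadata.labels` are the standard Kubernetes API locations.

```python
def selector_mismatches(daemonset, node):
    """Return the node-selector entries that the node's labels do not satisfy."""
    pod_spec = daemonset["spec"]["template"]["spec"]
    selector = pod_spec.get("nodeSelector") or {}
    labels = node["metadata"].get("labels") or {}
    # Every selector key must exist on the node with exactly the same value.
    return {key: value for key, value in selector.items()
            if labels.get(key) != value}


# Sample values copied from the example outputs in this section.
sample_ds = {"spec": {"template": {"spec": {
    "nodeSelector": {"accelerator": "huawei-Ascend910"}}}}}
sample_node = {"metadata": {"labels": {"accelerator": "huawei-Ascend910",
                                       "accelerator-type": "card"}}}
print(selector_mismatches(sample_ds, sample_node))  # prints {}
```

A non-empty result lists the selector entries that keep the daemonset pod from being scheduled on that node.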
The following is an example of the command output:
    root@ubuntu:/home/yaml# kubectl describe ds ascend-device-plugin-daemonset -n kube-system
    Name:           ascend-device-plugin-daemonset
    Selector:       name=ascend-device-plugin-ds
    Node-Selector:  accelerator=huawei-Ascend910
    Labels:         <none>
    Annotations:    deprecated.daemonset.template.generation: 1
                    kubectl.kubernetes.io/last-applied-configuration: {"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"name":"ascend-device-plugin-daemonset","namespace":"kube-system"}...
    ...
- Check the pod status.
kubectl describe po [podName] -n kube-system
The following is an example of the command output:
    root@ubuntu:/home/yaml# kubectl describe po mindx-dls-2p-default-2p-0 -n vcjob
    Name:         mindx-dls-2p-default-2p-0
    Namespace:    vcjob
    Priority:     0
    Node:         ubuntu/51.38.67.231
    Start Time:   Wed, 23 Dec 2020 22:14:27 -0500
    Labels:       app=tf
                  ring-controller.atlas=ascend-910
                  volcano.sh/job-name=mindx-dls-2p
                  volcano.sh/job-namespace=vcjob
    Annotations:  ascend.kubectl.kubernetes.io/ascend-910-configuration: {"pod_name":"0","server_id":"51.38.67.231","devices":[{"device_id":"1","device_ip":"192.168.101.100"},{"device_id":"4","device_ip":"192.16...
                  cni.projectcalico.org/podIP: 192.168.194.75/32
                  cni.projectcalico.org/podIPs: 192.168.194.75/32
                  huawei.com/Ascend910: Ascend910-0,Ascend910-1
                  predicate-time: 18446744073709551615
                  scheduling.k8s.io/group-name: mindx-dls-2p
                  volcano.sh/job-name: mindx-dls-2p
                  volcano.sh/job-version: 0
                  volcano.sh/task-spec: default-2p
Check whether the value of the ascend.kubectl.kubernetes.io/ascend-910-configuration field in Annotations is complete and correct. If the configuration is incorrect, locate and rectify the fault by referring to Failed to Generate the hccl.json File.
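Because the describe output abbreviates long annotation values with `...`, a completeness check is easier against the JSON form of the pod (`kubectl get po [podName] -n [namespace] -o json`). The following Python sketch shows one possible check. The function name `check_config_annotation` is hypothetical, and treating "complete" as "valid JSON containing non-empty `pod_name`, `server_id`, and `devices` entries with `device_id` and `device_ip`" is an assumption based only on the field names visible in the example above; it does not verify that the values themselves are correct.

```python
import json

CONFIG_KEY = "ascend.kubectl.kubernetes.io/ascend-910-configuration"


def check_config_annotation(pod):
    """Return a list of problems in the pod's ascend-910-configuration annotation."""
    annotations = pod.get("metadata", {}).get("annotations") or {}
    raw = annotations.get(CONFIG_KEY)
    if raw is None:
        return [f"annotation {CONFIG_KEY} is missing"]

    # The annotation value is a JSON document stored as a string.
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"annotation is not valid JSON: {err}"]
    if not isinstance(config, dict):
        return ["annotation is not a JSON object"]

    # Assumed required fields, taken from the example in this section.
    problems = [f"field {name!r} is missing or empty"
                for name in ("pod_name", "server_id") if not config.get(name)]

    devices = config.get("devices")
    if not isinstance(devices, list) or not devices:
        problems.append("field 'devices' is missing or empty")
        return problems

    for index, device in enumerate(devices):
        for name in ("device_id", "device_ip"):
            if not isinstance(device, dict) or not device.get(name):
                problems.append(f"devices[{index}]: {name!r} is missing or empty")
    return problems
```

The function takes the pod object already parsed with `json.load` or `json.loads` and returns an empty list when nothing is missing. Any reported problem is a reason to continue with Failed to Generate the hccl.json File.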