Fault Locating

The Device Plugin is responsible for:

1. NPU device discovery

2. NPU device allocation

The following figure shows how to locate the fault.

The problem handling logic shown in the figure is as follows:

  1. Check the node status.
    kubectl describe node [nodeName]

    Check whether Labels contains the label for the corresponding NPU model and whether the Allocatable NPU count is sufficient.

    The following is an example:

    root@ubuntu:~# kubectl describe node ubuntu-185
    Name:               ubuntu-185
    Roles:              worker
    Labels:             accelerator=huawei-Ascend910
                        accelerator-type=card
    ...
                        
    Annotations:        huawei.com/Ascend910: Ascend910-0,Ascend910-1
                        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
    ...
    Capacity:
      cpu:                   56
      ephemeral-storage:     431259672Ki
      huawei.com/Ascend910:  2
    ...
    Allocatable:
      cpu:                   56
      ephemeral-storage:     397448913058
      huawei.com/Ascend910:  2
    ...
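As a quick sanity check, the label and Allocatable count from this step can be extracted with standard shell tools. The sketch below parses a saved copy of the describe output (inlined here from the example above, abridged); on a live cluster you would pipe `kubectl describe node [nodeName]` in directly.

```shell
#!/bin/sh
# Sketch: pull the NPU label and allocatable NPU count out of
# `kubectl describe node` output. The heredoc stands in for the
# live command output and is abridged from the example above.
describe_output=$(cat <<'EOF'
Name:               ubuntu-185
Labels:             accelerator=huawei-Ascend910
                    accelerator-type=card
Allocatable:
  cpu:                   56
  huawei.com/Ascend910:  2
EOF
)

# The accelerator label identifies the NPU model on the node.
label=$(printf '%s\n' "$describe_output" | grep -o 'accelerator=[^ ]*')
# The huawei.com/Ascend910 entry is the allocatable NPU count.
count=$(printf '%s\n' "$describe_output" | awk '/huawei.com\/Ascend910:/ {print $2; exit}')

echo "node label: $label"
echo "allocatable NPUs: $count"
```

If the label is missing or the count is 0, the Device Plugin has not reported the devices for this node.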
     
  2. Check the daemonset status of Ascend Device Plugin.
    kubectl describe ds [dsName] -n kube-system

    Check whether the Node-Selector of the daemonset matches the node label checked in step 1.

    The following is an example:

    root@ubuntu:/home/yaml# kubectl describe ds ascend-device-plugin-daemonset -n kube-system
    Name:           ascend-device-plugin-daemonset
    Selector:       name=ascend-device-plugin-ds
    Node-Selector:  accelerator=huawei-Ascend910
    Labels:         <none>
    Annotations:    deprecated.daemonset.template.generation: 1
                    kubectl.kubernetes.io/last-applied-configuration:
                      {"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"name":"ascend-device-plugin-daemonset","namespace":"kube-system"}...
    ...
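The comparison in this step can be scripted. The sketch below extracts the Node-Selector from a saved copy of the daemonset describe output (inlined here from the example above, abridged) and compares it against the node label found in step 1; on a live cluster you would feed in `kubectl describe ds [dsName] -n kube-system` directly.

```shell
#!/bin/sh
# Sketch: confirm the daemonset's Node-Selector equals the node's
# accelerator label. The heredoc stands in for the live command
# output and is abridged from the example above.
ds_output=$(cat <<'EOF'
Name:           ascend-device-plugin-daemonset
Selector:       name=ascend-device-plugin-ds
Node-Selector:  accelerator=huawei-Ascend910
EOF
)

# Node label found on the node in step 1.
node_label="accelerator=huawei-Ascend910"

ds_selector=$(printf '%s\n' "$ds_output" | awk '/^Node-Selector:/ {print $2; exit}')

if [ "$ds_selector" = "$node_label" ]; then
    echo "OK: Node-Selector matches the node label"
else
    echo "Mismatch: daemonset selects '$ds_selector', node has '$node_label'" >&2
    exit 1
fi
```

If the values differ, the daemonset will not schedule the Device Plugin pod onto the node, and the NPU resources will never be reported.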
  3. Check the pod status.
    kubectl describe po [podName] -n kube-system

    The following is an example:

    root@ubuntu:/home/yaml# kubectl describe po mindx-dls-2p-default-2p-0 -n vcjob
    Name:         mindx-dls-2p-default-2p-0
    Namespace:    vcjob
    Priority:     0
    Node:         ubuntu/51.38.67.231
    Start Time:   Wed, 23 Dec 2020 22:14:27 -0500
    Labels:       app=tf
                  ring-controller.atlas=ascend-910
                  volcano.sh/job-name=mindx-dls-2p
                  volcano.sh/job-namespace=vcjob
    Annotations:  ascend.kubectl.kubernetes.io/ascend-910-configuration:
                    {"pod_name":"0","server_id":"51.38.67.231","devices":[{"device_id":"1","device_ip":"192.168.101.100"},{"device_id":"4","device_ip":"192.16...
                  cni.projectcalico.org/podIP: 192.168.194.75/32
                  cni.projectcalico.org/podIPs: 192.168.194.75/32
                  huawei.com/Ascend910: Ascend910-0,Ascend910-1
                  predicate-time: 18446744073709551615
                  scheduling.k8s.io/group-name: mindx-dls-2p
                  volcano.sh/job-name: mindx-dls-2p
                  volcano.sh/job-version: 0
                  volcano.sh/task-spec: default-2p

    Check whether the value of the ascend.kubectl.kubernetes.io/ascend-910-configuration field in Annotations is complete and correct. If the configuration is incorrect, locate and rectify the fault by referring to Failed to Generate the hccl.json File.
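A basic completeness check on the annotation can be automated. The sketch below validates that the annotation value parses as JSON and counts the device entries; the sample value is illustrative (the second device_ip is a made-up placeholder, since the example output above is truncated), and python3 is assumed to be available on the node.

```shell
#!/bin/sh
# Sketch: check that the ascend-910-configuration annotation is
# well-formed JSON and lists devices. The value below is an
# illustrative sample modeled on the example output above; the
# second device_ip is a placeholder, not a real value.
config='{"pod_name":"0","server_id":"51.38.67.231","devices":[{"device_id":"1","device_ip":"192.168.101.100"},{"device_id":"4","device_ip":"192.168.101.101"}]}'

# Validate the JSON (python3 assumed available).
printf '%s' "$config" | python3 -m json.tool > /dev/null \
    && echo "annotation is valid JSON" \
    || { echo "annotation is malformed JSON" >&2; exit 1; }

# Count the device entries listed in the annotation.
ndev=$(printf '%s' "$config" | grep -o '"device_id"' | wc -l | tr -d ' ')
echo "devices listed: $ndev"
```

A malformed or empty devices list here points to the hccl.json generation problem referenced above.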