Fault Locating

The Device Plugin is responsible for:

1. NPU device discovery

2. NPU device allocation

The following figure shows how to locate the fault.

The problem handling logic shown in the figure is as follows:

  1. Check the node status.
    kubectl describe node [nodeName]

    Check whether Labels contains the label for the corresponding NPU model and whether the Allocatable NPU count is sufficient.

    The following is an example:

    root@ubuntu:~# kubectl describe node ubuntu-185
    Name:               ubuntu-185
    Roles:              worker
    Labels:             accelerator=huawei-Ascend910
                        accelerator-type=card
    ...
                        
    Annotations:        huawei.com/Ascend910: Ascend910-0,Ascend910-1
                        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
    ...
    Capacity:
      cpu:                   56
      ephemeral-storage:     431259672Ki
      huawei.com/Ascend910:  2
    ...
    Allocatable:
      cpu:                   56
      ephemeral-storage:     397448913058
      huawei.com/Ascend910:  2
    ...
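As a quick sanity check, the label and Allocatable count from this step can be extracted with standard shell tools. The sketch below parses a saved copy of the describe output (inlined here from the example above, abridged); on a live cluster you would pipe `kubectl describe node [nodeName]` in directly.

```shell
#!/bin/sh
# Sketch: pull the NPU label and allocatable NPU count out of
# `kubectl describe node` output. The heredoc stands in for the
# live command output and is abridged from the example above.
describe_output=$(cat <<'EOF'
Name:               ubuntu-185
Labels:             accelerator=huawei-Ascend910
                    accelerator-type=card
Allocatable:
  cpu:                   56
  huawei.com/Ascend910:  2
EOF
)

# The accelerator label identifies the NPU model on the node.
label=$(printf '%s\n' "$describe_output" | grep -o 'accelerator=[^ ]*')
# The huawei.com/Ascend910 entry is the allocatable NPU count.
count=$(printf '%s\n' "$describe_output" | awk '/huawei.com\/Ascend910:/ {print $2; exit}')

echo "node label: $label"
echo "allocatable NPUs: $count"
```

If the label is missing or the count is 0, the Device Plugin has not reported the devices for this node.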
     
  2. Check the daemonset status of Ascend Device Plugin.
    kubectl describe ds [dsName] -n kube-system

    Check whether the Node-Selector of the daemonset matches the node label checked in step 1.

    The following is an example:

    root@ubuntu:/home/yaml# kubectl describe ds ascend-device-plugin-daemonset -n kube-system
    Name:           ascend-device-plugin-daemonset
    Selector:       name=ascend-device-plugin-ds
    Node-Selector:  accelerator=huawei-Ascend910
    Labels:         <none>
    Annotations:    deprecated.daemonset.template.generation: 1
                    kubectl.kubernetes.io/last-applied-configuration:
                      {"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"name":"ascend-device-plugin-daemonset","namespace":"kube-system"}...
    ...
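The comparison in this step can be scripted. The sketch below extracts the Node-Selector from a saved copy of the daemonset describe output (inlined here from the example above, abridged) and compares it against the node label found in step 1; on a live cluster you would feed in `kubectl describe ds [dsName] -n kube-system` directly.

```shell
#!/bin/sh
# Sketch: confirm the daemonset's Node-Selector equals the node's
# accelerator label. The heredoc stands in for the live command
# output and is abridged from the example above.
ds_output=$(cat <<'EOF'
Name:           ascend-device-plugin-daemonset
Selector:       name=ascend-device-plugin-ds
Node-Selector:  accelerator=huawei-Ascend910
EOF
)

# Node label found on the node in step 1.
node_label="accelerator=huawei-Ascend910"

ds_selector=$(printf '%s\n' "$ds_output" | awk '/^Node-Selector:/ {print $2; exit}')

if [ "$ds_selector" = "$node_label" ]; then
    echo "OK: Node-Selector matches the node label"
else
    echo "Mismatch: daemonset selects '$ds_selector', node has '$node_label'" >&2
    exit 1
fi
```

If the values differ, the daemonset will not schedule the Device Plugin pod onto the node, and the NPU resources will never be reported.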
  3. Check the pod status.
    kubectl describe po [podName] -n kube-system

    The following is an example:

    root@ubuntu:/home/yaml# kubectl describe po mindx-dls-2p-default-2p-0 -n vcjob
    Name:         mindx-dls-2p-default-2p-0
    Namespace:    vcjob
    Priority:     0
    Node:         ubuntu/51.38.67.231
    Start Time:   Wed, 23 Dec 2020 22:14:27 -0500
    Labels:       app=tf
                  ring-controller.atlas=ascend-910
                  volcano.sh/job-name=mindx-dls-2p
                  volcano.sh/job-namespace=vcjob
    Annotations:  ascend.kubectl.kubernetes.io/ascend-910-configuration:
                    {"pod_name":"0","server_id":"51.38.67.231","devices":[{"device_id":"1","device_ip":"192.168.101.100"},{"device_id":"4","device_ip":"192.16...
                  cni.projectcalico.org/podIP: 192.168.194.75/32
                  cni.projectcalico.org/podIPs: 192.168.194.75/32
                  huawei.com/Ascend910: Ascend910-0,Ascend910-1
                  predicate-time: 18446744073709551615
                  scheduling.k8s.io/group-name: mindx-dls-2p
                  volcano.sh/job-name: mindx-dls-2p
                  volcano.sh/job-version: 0
                  volcano.sh/task-spec: default-2p

    Check whether the value of the ascend.kubectl.kubernetes.io/ascend-910-configuration field in Annotations is complete and correct. If the configuration is incorrect, locate and rectify the fault by referring to Failed to Generate the hccl.json File.
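A basic completeness check on the annotation can be automated. The sketch below validates that the annotation value parses as JSON and counts the device entries; the sample value is illustrative (the second device_ip is a made-up placeholder, since the example output above is truncated), and python3 is assumed to be available on the node.

```shell
#!/bin/sh
# Sketch: check that the ascend-910-configuration annotation is
# well-formed JSON and lists devices. The value below is an
# illustrative sample modeled on the example output above; the
# second device_ip is a placeholder, not a real value.
config='{"pod_name":"0","server_id":"51.38.67.231","devices":[{"device_id":"1","device_ip":"192.168.101.100"},{"device_id":"4","device_ip":"192.168.101.101"}]}'

# Validate the JSON (python3 assumed available).
printf '%s' "$config" | python3 -m json.tool > /dev/null \
    && echo "annotation is valid JSON" \
    || { echo "annotation is malformed JSON" >&2; exit 1; }

# Count the device entries listed in the annotation.
ndev=$(printf '%s' "$config" | grep -o '"device_id"' | wc -l | tr -d ' ')
echo "devices listed: $ndev"
```

A malformed or empty devices list here points to the hccl.json generation problem referenced above.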