Troubleshooting Approach

The device-plugin is mainly responsible for:
  1. NPU device discovery
  2. NPU device allocation

The overall troubleshooting flow is shown in the following figure:

The figure reflects the following troubleshooting logic:

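  1. Check the node status.

    kubectl describe node [nodeName]

    Check that Labels contains the label for the corresponding NPU model and that the NPU count under Allocatable is sufficient.

    Example output: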

    root@ubuntu:~# kubectl describe node ubuntu-185
    Name:               ubuntu-185
    Roles:              worker
    Labels:             accelerator=huawei-Ascend910
                        accelerator-type=card
    ...
                        
    Annotations:        huawei.com/Ascend910: Ascend910-0,Ascend910-1
                        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
    ...
    Capacity:
      cpu:                   56
      ephemeral-storage:     431259672Ki
      huawei.com/Ascend910:  2
    ...
    Allocatable:
      cpu:                   56
      ephemeral-storage:     397448913058
      huawei.com/Ascend910:  2
    ...
     

  2. Check the status of the Ascend Device Plugin DaemonSet (ds).

    kubectl describe ds [dsName] -n kube-system

    Check that the Node-Selector of the ds matches the node label seen in step 1.

    Example output:

    root@ubuntu:/home/yaml# kubectl describe ds ascend-device-plugin-daemonset -n kube-system
    Name:           ascend-device-plugin-daemonset
    Selector:       name=ascend-device-plugin-ds
    Node-Selector:  accelerator=huawei-Ascend910
    Labels:         <none>
    Annotations:    deprecated.daemonset.template.generation: 1
                    kubectl.kubernetes.io/last-applied-configuration:
                      {"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"name":"ascend-device-plugin-daemonset","namespace":"kube-system"}...
    ...

  3. Check the Pod status.

    kubectl describe po [podName] -n [namespace]

    Example output:

    root@ubuntu:/home/yaml# kubectl describe po mindx-dls-2p-default-2p-0 -n vcjob
    Name:         mindx-dls-2p-default-2p-0
    Namespace:    vcjob
    Priority:     0
    Node:         ubuntu/xx.xx.xx.xx
    Start Time:   Wed, 23 Dec 2020 22:14:27 -0500
    Labels:       app=tf
                  ring-controller.atlas=ascend-910
                  volcano.sh/job-name=mindx-dls-2p
                  volcano.sh/job-namespace=vcjob
    Annotations:  ascend.kubectl.kubernetes.io/ascend-910-configuration:
                    {"pod_name":"0","server_id":"xx.xx.xx.xx","devices":[{"device_id":"1","device_ip":"192.168.101.100"},{"device_id":"4","device_ip":"192.16...
                  cni.projectcalico.org/podIP: 192.168.194.75/32
                  cni.projectcalico.org/podIPs: 192.168.194.75/32
                  huawei.com/Ascend910: Ascend910-0,Ascend910-1
                  predicate-time: 18446744073709551615
                  scheduling.k8s.io/group-name: mindx-dls-2p
                  volcano.sh/job-name: mindx-dls-2p
                  volcano.sh/job-version: 0
                  volcano.sh/task-spec: default-2p

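    In this example, the value of the Annotations field ascend.kubectl.kubernetes.io/ascend-910-configuration is complete and correct. If the value is incorrect, see the troubleshooting topic on the hccl.json file not being generated to locate and resolve the problem.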
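
The three checks above can also be run against the JSON output of kubectl (for example, kubectl get node [nodeName] -o json). The following Python sketch is one possible way to do that and is not part of Ascend Device Plugin. The function names are hypothetical, and the resource name huawei.com/Ascend910 and the annotation key are taken from the example outputs above, so they may differ for other NPU models.

```python
import json
import subprocess

# Assumed values, copied from the example outputs in this section.
NPU_RESOURCE = "huawei.com/Ascend910"
CONFIG_KEY = "ascend.kubectl.kubernetes.io/ascend-910-configuration"


def kubectl_json(*args):
    """Run kubectl with '-o json' appended and return the parsed object."""
    result = subprocess.run(
        ["kubectl", *args, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def allocatable_npus(node, resource=NPU_RESOURCE):
    """Step 1: number of NPUs the node reports under Allocatable."""
    return int(node["status"].get("allocatable", {}).get(resource, "0"))


def selector_matches(node, daemonset):
    """Steps 1 and 2: every nodeSelector pair of the ds must be a node label."""
    labels = node["metadata"].get("labels", {})
    pod_spec = daemonset["spec"]["template"]["spec"]
    selector = pod_spec.get("nodeSelector", {})
    return all(labels.get(k) == v for k, v in selector.items())


def configured_devices(pod, key=CONFIG_KEY):
    """Step 3: device list from the pod annotation, or None if it is
    missing or is not valid JSON."""
    raw = pod["metadata"].get("annotations", {}).get(key)
    if not raw:
        return None
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return data.get("devices") if isinstance(data, dict) else None


def check(node_name, ds_name, pod_name, pod_namespace):
    """Fetch the three objects with kubectl and run all checks."""
    node = kubectl_json("get", "node", node_name)
    ds = kubectl_json("get", "ds", ds_name, "-n", "kube-system")
    pod = kubectl_json("get", "po", pod_name, "-n", pod_namespace)
    return {
        "allocatable_npus": allocatable_npus(node),
        "selector_matches": selector_matches(node, ds),
        "configured_devices": configured_devices(pod),
    }
```

For the objects used in the examples above, the call would be check("ubuntu-185", "ascend-device-plugin-daemonset", "mindx-dls-2p-default-2p-0", "vcjob"). The check functions take already-parsed dictionaries, so they can also be applied to JSON saved from a cluster that is not directly reachable.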