Ascend Device Plugin

Deploying the Ascend Device Plugin in Binary Mode

  1. Log in to the node where the Ascend Device Plugin is deployed and run the following command to check the component status. Ensure that the component status is active (running).
    systemctl status device-plugin

    Information similar to the following is displayed.

    root@ubuntu:~# systemctl status device-plugin
    ● device-plugin.service - Ascend K8s device plugin
       Loaded: loaded (/etc/systemd/system/device-plugin.service; enabled; vendor preset: enabled)
       Active: active (running) since Mon 2022-11-21 11:20:04 CST; 8min ago
      Process: 26269 ExecStart=/bin/bash -c /usr/local/bin/device-plugin -volcanoType=true -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log>/dev/null  2>&1 & (code=exited, status=0/SUCCESS)
     Main PID: 26270 (device-plugin)
        Tasks: 10 (limit: 7372)
       CGroup: /system.slice/device-plugin.service
               └─26270 /usr/local/bin/device-plugin -volcanoType=true -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log
    
    
    Nov 21 11:20:04 ubuntu-155 systemd[1]: Starting Ascend K8s device plugin...
    Nov 21 11:20:04 ubuntu-155 systemd[1]: Started Ascend K8s device plugin.
    ...
  2. Run the following command to view the component logs:
    cat /var/log/mindx-dl/devicePlugin/devicePlugin.log

    If the following information is displayed, the component is running properly:

    [INFO]     2022/11/21 11:20:04.534992 1       hwlog@v0.0.0/api.go:96    devicePlugin.log's logger init success
    [INFO]     2022/11/21 11:20:04.535750 1       main.go:127    ascend device plugin starting and the version is v3.0.0_linux-x86_64
    [INFO]     2022/11/21 11:20:05.992823 1       K8stool@v0.0.0/self_K8s_client.go:116    start to decrypt cfg
    [INFO]     2022/11/21 11:20:06.002773 1       K8stool@v0.0.0/self_K8s_client.go:125    Config loaded from file: ****tc/mindx-dl/device-plugin/.config/config6
    [INFO]     2022/11/21 11:20:06.003751 1       main.go:153    init kube client success 
    [INFO]     2022/11/21 11:20:06.003923 1       device/ascendcommon.go:104    Found Huawei Ascend, deviceType: Ascend910, deviceName: Ascend910-4
    [INFO]     2022/11/21 11:20:06.003970 1       main.go:160    init device manager success
    [INFO]     2022/11/21 11:20:06.004157 21      device/manager.go:125    starting the listen device
    [INFO]     2022/11/21 11:20:06.004285 7       device/manager.go:206    Serve start
    [INFO]     2022/11/21 11:20:06.004970 7       server/server.go:88    device plugin (Ascend910) start serving.
    [INFO]     2022/11/21 11:20:06.007285 7       server/server.go:36    register Ascend910 to kubelet success.
    [INFO]     2022/11/21 11:20:06.007521 7       server/pod_resource.go:44    pod resource client init success.
    [INFO]     2022/11/21 11:20:06.007755 35      server/plugin.go:87    ListAndWatch resp devices: Ascend910-4 Healthy   # Chip reported to Kubernetes. The chip name in your environment may differ.
    [INFO]     2022/11/21 11:20:11.063218 21      kubeclient/client_server.go:123    reset annotation success
    ...
  3. Run the following command to view details about a node in the Kubernetes cluster. If the Capacity and Allocatable fields in the node details list an Ascend AI Processor resource, the Ascend Device Plugin has reported the chip information to Kubernetes and the component is running properly.
    kubectl describe node {Node_name_in_the_Kubernetes_cluster}
    • The following uses an Atlas 800 training server as an example. The node contains Ascend 910 AI Processors.
      root@ubuntu:~# kubectl describe node ubuntu
      Name:               ubuntu
      Roles:              worker
      Labels:             accelerator=huawei-Ascend910
                          beta.kubernetes.io/arch=amd64
      ...
      CreationTimestamp:  Wed, 22 Dec 2021 20:10:04 +0800
      Taints:             <none>
      Unschedulable:      false
      ...
      Capacity:
        cpu:                      72
        ephemeral-storage:        479567536Ki
        huawei.com/Ascend910:     8  # The Kubernetes cluster has detected that the node has eight NPUs.
      ...
      Allocatable:
        cpu:                      72
        ephemeral-storage:        441969440446
        huawei.com/Ascend910:     8 # The Kubernetes cluster has detected that a total of eight NPUs can be allocated on the node.
      ...
    • The following uses a server with an Atlas 300I inference card as an example. The node contains Ascend 310 AI Processors. The number of processors on the node depends on the actual configuration.
      root@ubuntu:~# kubectl describe node ubuntu
      Name:               ubuntu
      Roles:              worker
      Labels:             accelerator=huawei-Ascend310
                          beta.kubernetes.io/arch=amd64
      ...
      CreationTimestamp:  Wed, 22 Dec 2021 20:10:04 +0800
      Taints:             <none>
      Unschedulable:      false
      ...
      Capacity:
        cpu:                       72
        ephemeral-storage:         163760Mi
        huawei.com/Ascend310:      4
      ...
      Allocatable:
        cpu:                       72
        ephemeral-storage:         154543324929
        huawei.com/Ascend310:      4
      ...
    • The following uses a server with an Atlas 300I Pro inference card as an example. The node contains Ascend 310P AI Processors. The number of processors on the node depends on the actual configuration.
      root@ubuntu:~# kubectl describe node ubuntu
      Name:               ubuntu
      Roles:              worker
      Labels:             accelerator=huawei-Ascend310P
                          beta.kubernetes.io/arch=amd64
      ...
      CreationTimestamp:  Wed, 22 Dec 2021 20:10:04 +0800
      Taints:             <none>
      Unschedulable:      false
      ...
      Capacity:
        cpu:                      96
        ephemeral-storage:        95596964Ki
        huawei.com/Ascend310P:    3
      ...
      Allocatable:
        cpu:                      96
        ephemeral-storage:        88102161877
        huawei.com/Ascend310P:    3
      ...
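
The three checks above can also be scripted. The following is a minimal sketch, not part of the product: the function names are hypothetical, the service name and log path are taken from steps 1 and 2, and the log pattern is based on the sample log shown above, so it may need adjusting for other versions.

```shell
#!/bin/sh
# Sketch only: helper functions for the binary-mode checks.
# All function names here are illustrative assumptions.

# Step 1: returns 0 if the device-plugin systemd unit is active.
check_service() {
    systemctl is-active --quiet device-plugin
}

# Step 2: reads log text on stdin; returns 0 if it contains the
# "register ... to kubelet success" message seen in the sample log.
log_registered() {
    grep -Eq 'register Ascend[0-9A-Za-z]+ to kubelet success'
}

# Step 3: reads `kubectl describe node` output on stdin and prints the
# huawei.com/Ascend* resource lines (name and count) from Capacity and
# Allocatable, dropping any trailing comment.
ascend_resources() {
    awk '$1 ~ /^huawei\.com\/Ascend/ {print $1, $2}'
}

# Example usage (run on the node, replace <node> with the node name):
#   check_service && echo "device-plugin is active"
#   log_registered < /var/log/mindx-dl/devicePlugin/devicePlugin.log && echo "registered to kubelet"
#   kubectl describe node <node> | ascend_resources
```

If ascend_resources prints nothing, the node details contain no Ascend resource entries, which suggests that the plugin has not reported any chips for that node.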

Deploying the Ascend Device Plugin in Container Mode

  1. Run the following command to check the pod of the Ascend Device Plugin in the Kubernetes cluster. Ensure that STATUS of the pod is Running and READY is 1/1. If the Ascend Device Plugin is installed on multiple nodes in the cluster, check the pod on each node.
    kubectl get pods -n kube-system -o wide

    Information similar to the following is displayed.

    root@ubuntu:~# kubectl get pods -n kube-system  -o wide
    NAME                                       READY   STATUS    RESTARTS   AGE     IP                NODE         NOMINATED NODE   READINESS GATES
    ascend-device-plugin-daemonset-910-85p9v   1/1     Running   0          19h     192.168.185.251   ubuntu       <none>           <none>
    ...
  2. Run the following command to view the logs of the Ascend Device Plugin in the Kubernetes cluster:
    kubectl logs -n kube-system {Name_of_the_Ascend_Device_Plugin's_pod}

    If the following information is displayed, the component is running properly:

    root@ubuntu:~# kubectl logs -n kube-system ascend-device-plugin-daemonset-910-85p9v 
    [INFO]     2022/11/21 11:20:04.534992 1       hwlog@v0.0.0/api.go:96    devicePlugin.log's logger init success
    [INFO]     2022/11/21 11:20:04.535750 1       main.go:127    ascend device plugin starting and the version is v3.0.0_linux-x86_64
    [INFO]     2022/11/21 11:20:05.992823 1       K8stool@v0.0.0/self_K8s_client.go:116    start to decrypt cfg
    [INFO]     2022/11/21 11:20:06.002773 1       K8stool@v0.0.0/self_K8s_client.go:125    Config loaded from file: ****tc/mindx-dl/device-plugin/.config/config6
    [INFO]     2022/11/21 11:20:06.003751 1       main.go:153    init kube client success 
    [INFO]     2022/11/21 11:20:06.003923 1       device/ascendcommon.go:104    Found Huawei Ascend, deviceType: Ascend910, deviceName: Ascend910-4
    [INFO]     2022/11/21 11:20:06.003970 1       main.go:160    init device manager success
    [INFO]     2022/11/21 11:20:06.004157 21      device/manager.go:125    starting the listen device
    [INFO]     2022/11/21 11:20:06.004285 7       device/manager.go:206    Serve start
    [INFO]     2022/11/21 11:20:06.004970 7       server/server.go:88    device plugin (Ascend910) start serving.
    [INFO]     2022/11/21 11:20:06.007285 7       server/server.go:36    register Ascend910 to kubelet success.
    [INFO]     2022/11/21 11:20:06.007521 7       server/pod_resource.go:44    pod resource client init success.
    [INFO]     2022/11/21 11:20:06.007755 35      server/plugin.go:87    ListAndWatch resp devices: Ascend910-4 Healthy   # Chip reported to Kubernetes. The chip name in your environment may differ.
    [INFO]     2022/11/21 11:20:11.063218 21      kubeclient/client_server.go:123    reset annotation success
    ...
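  3. Check whether the Ascend AI Processor information is reported to Kubernetes. For details, see step 3 in Deploying the Ascend Device Plugin in Binary Mode.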
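
When the plugin runs on many nodes, the pod check in step 1 can be filtered automatically. The following is a minimal sketch under two assumptions: the pod names start with ascend-device-plugin, as in the sample output above, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: list Ascend Device Plugin pods that are not healthy.

# Reads `kubectl get pods -n kube-system -o wide` output on stdin and
# prints NAME, READY and STATUS for every pod whose name starts with
# "ascend-device-plugin" (assumed prefix) and that is not 1/1 Running.
unhealthy_plugin_pods() {
    awk '$1 ~ /^ascend-device-plugin/ && ($2 != "1/1" || $3 != "Running") {print $1, $2, $3}'
}

# Example usage:
#   kubectl get pods -n kube-system -o wide | unhealthy_plugin_pods
```

Empty output means that every matching pod is Running with READY 1/1. It does not confirm that a pod exists on each node, so compare the pod list against the expected nodes separately.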