Ascend Device Plugin
Deploying the Ascend Device Plugin in Binary Mode
- Log in to the node where the Ascend Device Plugin is deployed and run the following command to check the component status. Ensure that the component status is active (running).
systemctl status device-plugin
Information similar to the following is displayed.
root@ubuntu:~# systemctl status device-plugin
● device-plugin.service - Ascend K8s device plugin
   Loaded: loaded (/etc/systemd/system/device-plugin.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2022-11-21 11:20:04 CST; 8min ago
  Process: 26269 ExecStart=/bin/bash -c /usr/local/bin/device-plugin -volcanoType=true -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log>/dev/null 2>&1 & (code=exited, status=0/SUCCESS)
 Main PID: 26270 (device-plugin)
    Tasks: 10 (limit: 7372)
   CGroup: /system.slice/device-plugin.service
           └─26270 /usr/local/bin/device-plugin -volcanoType=true -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log
Nov 21 11:20:04 ubuntu-155 systemd[1]: Starting Ascend K8s device plugin...
Nov 21 11:20:04 ubuntu-155 systemd[1]: Started Ascend K8s device plugin.
...
- View component logs.
cat /var/log/mindx-dl/devicePlugin/devicePlugin.log
If the following information is displayed, the component is running properly:
[INFO] 2022/11/21 11:20:04.534992 1 hwlog@v0.0.0/api.go:96 devicePlugin.log's logger init success
[INFO] 2022/11/21 11:20:04.535750 1 main.go:127 ascend device plugin starting and the version is v3.0.0_linux-x86_64
[INFO] 2022/11/21 11:20:05.992823 1 K8stool@v0.0.0/self_K8s_client.go:116 start to decrypt cfg
[INFO] 2022/11/21 11:20:06.002773 1 K8stool@v0.0.0/self_K8s_client.go:125 Config loaded from file: ****tc/mindx-dl/device-plugin/.config/config6
[INFO] 2022/11/21 11:20:06.003751 1 main.go:153 init kube client success
[INFO] 2022/11/21 11:20:06.003923 1 device/ascendcommon.go:104 Found Huawei Ascend, deviceType: Ascend910, deviceName: Ascend910-4
[INFO] 2022/11/21 11:20:06.003970 1 main.go:160 init device manager success
[INFO] 2022/11/21 11:20:06.004157 21 device/manager.go:125 starting the listen device
[INFO] 2022/11/21 11:20:06.004285 7 device/manager.go:206 Serve start
[INFO] 2022/11/21 11:20:06.004970 7 server/server.go:88 device plugin (Ascend910) start serving.
[INFO] 2022/11/21 11:20:06.007285 7 server/server.go:36 register Ascend910 to kubelet success.
[INFO] 2022/11/21 11:20:06.007521 7 server/pod_resource.go:44 pod resource client init success.
[INFO] 2022/11/21 11:20:06.007755 35 server/plugin.go:87 ListAndWatch resp devices: Ascend910-4 Healthy    # Chip reported to Kubernetes. The actual chip prevails.
[INFO] 2022/11/21 11:20:11.063218 21 kubeclient/client_server.go:123 reset annotation success
...
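The log check above can also be scripted. The following is a minimal sketch, not part of the product: the function name check_plugin_log is hypothetical, and the marker strings are copied from the sample log, so they may differ in other versions of the component.

```shell
# Hypothetical helper: succeeds only if every expected marker appears in the log file.
# The markers are taken from the sample log above and are an assumption for other versions.
check_plugin_log() {
    log_file="$1"
    for marker in \
        "init kube client success" \
        "init device manager success" \
        "to kubelet success"
    do
        # -F treats the marker as a fixed string, -q suppresses the matching line.
        if ! grep -qF -- "$marker" "$log_file"; then
            echo "missing: $marker"
            return 1
        fi
    done
    echo "ok"
}
```

For example, `systemctl is-active --quiet device-plugin && check_plugin_log /var/log/mindx-dl/devicePlugin/devicePlugin.log` runs the log check only if the service is reported as active.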
- Run the following command to view details about the nodes in the Kubernetes cluster. If the Capacity and Allocatable fields in the node details contain information about the Ascend AI Processor, the Ascend Device Plugin is reporting chip information to Kubernetes and the component is running properly.
kubectl describe node {Node_name_in_the_Kubernetes_cluster}
- The following uses an Atlas 800 training server as an example. The node contains Ascend 910 AI Processors.
root@ubuntu:~# kubectl describe node ubuntu
Name:               ubuntu
Roles:              worker
Labels:             accelerator=huawei-Ascend910
                    beta.kubernetes.io/arch=amd64
...
CreationTimestamp:  Wed, 22 Dec 2021 20:10:04 +0800
Taints:             <none>
Unschedulable:      false
...
Capacity:
  cpu:                   72
  ephemeral-storage:     479567536Ki
  huawei.com/Ascend910:  8    # The Kubernetes cluster has detected that the node has eight NPUs.
...
Allocatable:
  cpu:                   72
  ephemeral-storage:     441969440446
  huawei.com/Ascend910:  8    # The Kubernetes cluster has detected that a total of eight NPUs can be allocated on the node.
...
- The following uses a server (with an Atlas 300I inference card) as an example. The node contains Ascend 310 AI Processors. The number of processors on the node varies according to the actual situation.
root@ubuntu:~# kubectl describe node ubuntu
Name:               ubuntu
Roles:              worker
Labels:             accelerator=huawei-Ascend310
                    beta.kubernetes.io/arch=amd64
...
CreationTimestamp:  Wed, 22 Dec 2021 20:10:04 +0800
Taints:             <none>
Unschedulable:      false
...
Capacity:
  cpu:                   72
  ephemeral-storage:     163760Mi
  huawei.com/Ascend310:  4
...
Allocatable:
  cpu:                   72
  ephemeral-storage:     154543324929
  huawei.com/Ascend310:  4
...
- The following uses a server (with an Atlas 300I Pro inference card) as an example. The node contains Ascend 310P AI Processors. The number of processors on the node varies according to the actual scenario.
root@ubuntu:~# kubectl describe node ubuntu
Name:               ubuntu
Roles:              worker
Labels:             accelerator=huawei-Ascend310
                    beta.kubernetes.io/arch=amd64
...
CreationTimestamp:  Wed, 22 Dec 2021 20:10:04 +0800
Taints:             <none>
Unschedulable:      false
...
Capacity:
  cpu:                    96
  ephemeral-storage:      95596964Ki
  huawei.com/Ascend310P:  3
...
Allocatable:
  cpu:                    96
  ephemeral-storage:      88102161877
  huawei.com/Ascend310P:  3
...
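To pick out only the Ascend entries from the node details, the output can be filtered. The following is a minimal sketch: the function name ascend_resources is hypothetical, and it assumes the layout shown in the examples above, where section names such as Capacity and Allocatable start in the first column and resource names start with huawei.com/Ascend.

```shell
# Hypothetical helper: reads `kubectl describe node` output on stdin and prints
# "<section> <resource> <count>" for each huawei.com/Ascend* entry found under
# the Capacity and Allocatable sections. Entries in other sections are ignored.
ascend_resources() {
    awk '
        /^Capacity:/    { section = "Capacity" }
        /^Allocatable:/ { section = "Allocatable" }
        # Any other top-level "Key:" line ends the current section.
        /^[A-Z][A-Za-z ]*:/ && !/^(Capacity|Allocatable):/ { section = "" }
        section != "" && $1 ~ /^huawei\.com\/Ascend/ {
            name = $1
            sub(/:$/, "", name)   # drop the trailing colon from the resource name
            print section, name, $2
        }
    '
}
```

For example, `kubectl describe node ubuntu | ascend_resources` prints one line per section. If nothing is printed, the node is not reporting Ascend AI Processors.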
Deploying the Ascend Device Plugin in Container Mode
- Run the following command to check the Ascend Device Plugin pod in the Kubernetes cluster. Ensure that the pod STATUS is Running and READY is 1/1. If the Ascend Device Plugin is installed on multiple nodes in the cluster, check the pod on each node.
kubectl get pods -n kube-system -o wide
Information similar to the following is displayed.
root@ubuntu:~# kubectl get pods -n kube-system -o wide
NAME                                       READY   STATUS    RESTARTS   AGE   IP                NODE     NOMINATED NODE   READINESS GATES
ascend-device-plugin-daemonset-910-85p9v   1/1     Running   0          19h   192.168.185.251   ubuntu   <none>           <none>
...
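When the plugin runs on many nodes, checking each row by eye is error-prone. The following is a minimal sketch that flags pods that are not ready. The function name unhealthy_plugin_pods is hypothetical, and it assumes that pod names start with ascend-device-plugin, as in the sample output above.

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints the
# NAME, READY and STATUS of every Ascend Device Plugin pod that is not
# READY 1/1 and STATUS Running. Returns non-zero if such a pod exists or if
# no Ascend Device Plugin pod is listed at all.
unhealthy_plugin_pods() {
    awk '
        $1 ~ /^ascend-device-plugin/ {
            seen++
            if ($2 != "1/1" || $3 != "Running") {
                print $1, $2, $3
                bad++
            }
        }
        END { if (seen == 0 || bad > 0) exit 1 }
    '
}
```

For example, `kubectl get pods -n kube-system -o wide | unhealthy_plugin_pods` prints nothing and returns 0 when every listed plugin pod is 1/1 and Running.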
- Run the following command to view the logs of the Ascend Device Plugin in the Kubernetes cluster:
kubectl logs -n kube-system {Name_of_the_Ascend_Device_Plugin's_pod}
If the following information is displayed, the component is running properly:
root@ubuntu:~# kubectl logs -n kube-system ascend-device-plugin-daemonset-910-85p9v
[INFO] 2022/11/21 11:20:04.534992 1 hwlog@v0.0.0/api.go:96 devicePlugin.log's logger init success
[INFO] 2022/11/21 11:20:04.535750 1 main.go:127 ascend device plugin starting and the version is v3.0.0_linux-x86_64
[INFO] 2022/11/21 11:20:05.992823 1 K8stool@v0.0.0/self_K8s_client.go:116 start to decrypt cfg
[INFO] 2022/11/21 11:20:06.002773 1 K8stool@v0.0.0/self_K8s_client.go:125 Config loaded from file: ****tc/mindx-dl/device-plugin/.config/config6
[INFO] 2022/11/21 11:20:06.003751 1 main.go:153 init kube client success
[INFO] 2022/11/21 11:20:06.003923 1 device/ascendcommon.go:104 Found Huawei Ascend, deviceType: Ascend910, deviceName: Ascend910-4
[INFO] 2022/11/21 11:20:06.003970 1 main.go:160 init device manager success
[INFO] 2022/11/21 11:20:06.004157 21 device/manager.go:125 starting the listen device
[INFO] 2022/11/21 11:20:06.004285 7 device/manager.go:206 Serve start
[INFO] 2022/11/21 11:20:06.004970 7 server/server.go:88 device plugin (Ascend910) start serving.
[INFO] 2022/11/21 11:20:06.007285 7 server/server.go:36 register Ascend910 to kubelet success.
[INFO] 2022/11/21 11:20:06.007521 7 server/pod_resource.go:44 pod resource client init success.
[INFO] 2022/11/21 11:20:06.007755 35 server/plugin.go:87 ListAndWatch resp devices: Ascend910-4 Healthy    # Chip reported to Kubernetes. The actual chip prevails.
[INFO] 2022/11/21 11:20:11.063218 21 kubeclient/client_server.go:123 reset annotation success
...
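Because each node runs its own Ascend Device Plugin pod, the log check may need to be repeated for every pod. The following is a minimal sketch that loops over the pods. The function name check_plugin_pod_logs is hypothetical, and it assumes that pod names contain ascend-device-plugin and that the registration message matches the sample log above.

```shell
# Hypothetical helper: prints one line per Ascend Device Plugin pod in the
# kube-system namespace, stating whether its log contains the kubelet
# registration message shown in the sample output.
check_plugin_pod_logs() {
    # `-o name` prints one "pod/<name>" entry per line.
    kubectl get pods -n kube-system -o name |
    grep 'ascend-device-plugin' |
    while read -r pod; do
        # </dev/null keeps kubectl from consuming the pod list on stdin.
        if kubectl logs -n kube-system "$pod" </dev/null | grep -qF 'to kubelet success'; then
            echo "$pod registered"
        else
            echo "$pod NOT registered"
        fi
    done
}
```

Calling `check_plugin_pod_logs` with no arguments prints one status line per pod. Any pod reported as NOT registered should have its full log inspected with the kubectl logs command above.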
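- View details about the nodes in the Kubernetes cluster and confirm that the Ascend AI Processor is reported. For details, see step 3 in Deploying the Ascend Device Plugin in Binary Mode.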