Ascend Device Plugin
Perform the following steps on any node to verify the installation status of Ascend Device Plugin:
Procedure
- Run the following command to check the Ascend Device Plugin pod in a Kubernetes cluster. Ensure that STATUS of the pod is Running and READY is 1/1. If Ascend Device Plugin is installed on multiple nodes in the cluster, check the pod on every node.
kubectl get pods -n kube-system -o wide | grep device-plugin
Information similar to the following is displayed.
ascend-device-plugin-daemonset-910-85p9v 1/1 Running 0 19h 192.168.185.251 ubuntu <none> <none>
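When the plugin runs on several nodes, checking each row by eye gets tedious. The following is a minimal sketch that scans the output of the command above and flags any device-plugin pod whose READY column is not 1/1 or whose STATUS is not Running. The function name `check_device_plugin_pods` is hypothetical and is not part of Ascend Device Plugin or kubectl; the sketch assumes the default column order (NAME, READY, STATUS) shown in the sample output.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Ascend Device Plugin).
# Reads `kubectl get pods -n kube-system -o wide` output on stdin.
# Assumes column 2 is READY and column 3 is STATUS, as in the sample above.
# Exit status: 0 if every device-plugin pod is 1/1 and Running, 1 otherwise
# (including the case where no device-plugin pod is listed at all).
check_device_plugin_pods() {
    awk '
        /device-plugin/ {
            found++
            if ($2 != "1/1" || $3 != "Running") {
                printf "NOT READY: %s (READY=%s, STATUS=%s)\n", $1, $2, $3
                bad++
            }
        }
        END {
            if (found == 0) { print "no device-plugin pod found"; exit 1 }
            if (bad > 0) exit 1
            printf "%d device-plugin pod(s) Running and 1/1\n", found
        }
    '
}
```

A possible invocation is `kubectl get pods -n kube-system -o wide | check_device_plugin_pods`. The number of pods it reports should match the number of nodes on which the plugin is installed.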
- View the logs of Ascend Device Plugin in a Kubernetes cluster.
kubectl logs -n kube-system {Name_of_the_Ascend Device Plugin's_pod}
If the following information is displayed, the component is normal.
root@ubuntu:~# kubectl logs -n kube-system ascend-device-plugin-daemonset-910-85p9v
[INFO] 2022/11/21 11:20:04.534992 1 hwlog@v0.0.0/api.go:96 devicePlugin.log's logger init success
[INFO] 2022/11/21 11:20:04.535750 1 main.go:127 ascend device plugin starting and the version is xxx_linux-x86_64
[INFO] 2022/11/21 11:20:05.992823 1 K8stool@v0.0.0/self_K8s_client.go:116 start to decrypt cfg
[INFO] 2022/11/21 11:20:06.002773 1 K8stool@v0.0.0/self_K8s_client.go:125 Config loaded from file: ****tc/mindx-dl/device-plugin/.config/config6
[INFO] 2022/11/21 11:20:06.003751 1 main.go:153 init kube client success
[INFO] 2022/11/21 11:20:06.003923 1 device/ascendcommon.go:104 Found Huawei Ascend, deviceType: Ascend910, deviceName: Ascend910-4
[INFO] 2022/11/21 11:20:06.003970 1 main.go:160 init device manager success
[INFO] 2022/11/21 11:20:06.004157 21 device/manager.go:125 starting the listen device
[INFO] 2022/11/21 11:20:06.004285 7 device/manager.go:206 Serve start
[INFO] 2022/11/21 11:20:06.004970 7 server/server.go:88 device plugin (Ascend910) start serving.
[INFO] 2022/11/21 11:20:06.007285 7 server/server.go:36 register Ascend910 to kubelet success.
[INFO] 2022/11/21 11:20:06.007521 7 server/pod_resource.go:44 pod resource client init success.
[INFO] 2022/11/21 11:20:06.007755 35 server/plugin.go:87 ListAndWatch resp devices: Ascend910-4 Healthy    # Processor reported to Kubernetes. The actual processor prevails.
[INFO] 2022/11/21 11:20:11.063218 21 kubeclient/client_server.go:123 reset annotation success
...
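The startup log can also be checked mechanically. The sketch below looks for four milestone messages taken from the sample log above (kube client initialized, device manager initialized, registration with kubelet, and the first ListAndWatch response). The function name `check_device_plugin_log` is hypothetical, and the choice of these four strings as health markers is an assumption based only on the sample output; message wording may differ between plugin versions.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Ascend Device Plugin).
# Reads `kubectl logs` output on stdin and reports which startup milestones
# appear. The marker strings are copied from the sample log in this section;
# treating them as health indicators is an assumption.
# Exit status: 0 if all markers are present, 1 if any is missing.
check_device_plugin_log() {
    log=$(cat)
    status=0
    for marker in \
        "init kube client success" \
        "init device manager success" \
        "to kubelet success" \
        "ListAndWatch resp devices"
    do
        # -F matches the marker as a fixed string, not as a regular expression.
        if printf '%s\n' "$log" | grep -qF -- "$marker"; then
            echo "ok:      $marker"
        else
            echo "missing: $marker"
            status=1
        fi
    done
    return "$status"
}
```

A possible invocation is `kubectl logs -n kube-system ascend-device-plugin-daemonset-910-85p9v | check_device_plugin_log`, with the pod name replaced by the one found in the previous step.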
- Run the following command to view details about nodes in a Kubernetes cluster. If the Capacity and Allocatable fields in the node details contain information about Ascend AI processors, Ascend Device Plugin reports processor information to Kubernetes and operates normally.
kubectl describe node {Node_name_in_a_Kubernetes_cluster}
- Take an Atlas 800 training server as an example. The command output is as follows:
root@ubuntu:~# kubectl describe node ubuntu
Name:               ubuntu
Roles:              worker
Labels:             accelerator=huawei-Ascend910
                    beta.kubernetes.io/arch=amd64
...
CreationTimestamp:  Wed, 22 Dec 2021 20:10:04 +0800
Taints:             <none>
Unschedulable:      false
...
Capacity:
  cpu:                   72
  ephemeral-storage:     479567536Ki
  huawei.com/Ascend910:  8    # Kubernetes has detected that the node has eight NPUs.
...
Allocatable:
  cpu:                   72
  ephemeral-storage:     441969440446
  huawei.com/Ascend910:  8    # Kubernetes has detected that a total of eight NPUs can be allocated on the node.
...
- The following uses a server (equipped with Atlas 300I inference cards) as an example. The number of processors on the node varies according to the actual situation.
root@ubuntu:~# kubectl describe node ubuntu
Name:               ubuntu
Roles:              worker
Labels:             accelerator=huawei-Ascend310
                    beta.kubernetes.io/arch=amd64
...
CreationTimestamp:  Wed, 22 Dec 2021 20:10:04 +0800
Taints:             <none>
Unschedulable:      false
...
Capacity:
  cpu:                   72
  ephemeral-storage:     163760Mi
  huawei.com/Ascend310:  4
...
Allocatable:
  cpu:                   72
  ephemeral-storage:     154543324929
  huawei.com/Ascend310:  4
...
- The following uses a server equipped with Atlas 300I Pro inference cards as an example. In non-mixed insertion mode, a node that contains Atlas inference products displays the following information. The number of processors on the node varies according to the actual situation.
root@ubuntu:~# kubectl describe node ubuntu
Name:               ubuntu
Roles:              worker
Labels:             accelerator=huawei-Ascend310
                    beta.kubernetes.io/arch=amd64
...
CreationTimestamp:  Wed, 22 Dec 2021 20:10:04 +0800
Taints:             <none>
Unschedulable:      false
...
Capacity:
  cpu:                    96
  ephemeral-storage:      95596964Ki
  huawei.com/Ascend310P:  3
...
Allocatable:
  cpu:                    96
  ephemeral-storage:      88102161877
  huawei.com/Ascend310P:  3
...
- The following uses a server equipped with Atlas 300I Pro inference cards in mixed insertion mode as an example. The node contains Atlas inference products. The number of processors on the node varies according to the actual situation.
root@ubuntu:~# kubectl describe node ubuntu
Name:               ubuntu
Roles:              worker
Labels:             accelerator=huawei-Ascend310
                    beta.kubernetes.io/arch=amd64
...
CreationTimestamp:  Wed, 22 Dec 2021 20:10:04 +0800
Taints:             <none>
Unschedulable:      false
...
Capacity:
  cpu:                         96
  ephemeral-storage:           95596964Ki
  huawei.com/Ascend310P-IPro:  1
  huawei.com/Ascend310P-V:     1
  huawei.com/Ascend310P-VPro:  1
...
Allocatable:
  cpu:                         96
  ephemeral-storage:           88102161877
  huawei.com/Ascend310P-IPro:  1
  huawei.com/Ascend310P-V:     1
  huawei.com/Ascend310P-VPro:  1
...
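To avoid scrolling through the full node description, the Ascend entries can be pulled out of the Capacity and Allocatable sections directly. The following is a minimal sketch: the function name `ascend_resources` is hypothetical, and it assumes the layout shown in the examples above, where Capacity and Allocatable start at the beginning of a line and their entries are indented with spaces and prefixed with `huawei.com/`.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Ascend Device Plugin).
# Reads `kubectl describe node` output on stdin and prints one line per
# huawei.com/* resource found under Capacity or Allocatable, in the form:
#   <section> <resource name> <count>
# Exit status: 0 if at least one such resource is found, 1 otherwise.
ascend_resources() {
    awk '
        /^Capacity:/    { section = "Capacity";    next }
        /^Allocatable:/ { section = "Allocatable"; next }
        /^[^ ]/         { section = "" }   # any other unindented field ends the section
        section != "" && $1 ~ /^huawei\.com\// {
            name = $1
            sub(/:$/, "", name)            # drop the trailing colon from the key
            print section, name, $2
            count++
        }
        END { exit (count > 0 ? 0 : 1) }
    '
}
```

A possible invocation is `kubectl describe node ubuntu | ascend_resources`. In mixed insertion mode it should print one line per resource type in each section, and a non-zero exit status suggests that the plugin has not reported any processors for that node.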