NPU-Exporter
NPU-Exporter Deployed Using Binary Mode
- Log in to the node where the NPU-Exporter is deployed and run the following command to check the component status. Ensure that the component status is active (running).
systemctl status npu-exporter
Information similar to the following is displayed.
root@ubuntu:~# systemctl status npu-exporter
● npu-exporter.service - Ascend npu exporter
   Loaded: loaded (/etc/systemd/system/npu-exporter.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2022-11-17 16:24:41 CST; 3 days ago
 Main PID: 25121 (npu-exporter)
    Tasks: 8 (limit: 7372)
   CGroup: /system.slice/npu-exporter.service
           └─25121 /usr/local/bin/npu-exporter -ip=127.0.0.1 -port=8082 -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log
...
- View component logs.
cat /var/log/mindx-dl/npu-exporter/npu-exporter.log
Information similar to the following is displayed.
root@ubuntu:/usr/local/bin# cat /var/log/mindx-dl/npu-exporter/npu-exporter.log
[INFO] 2022/10/25 17:01:18.610431 1 hwlog@v0.0.10/api.go:96 npu-exporter.log's logger init success
[INFO] 2022/10/25 17:01:18.610628 1 npu-exporter/main.go:275 listen on: 0.0.0.0
[INFO] 2022/10/25 17:01:18.610740 1 npu-exporter/main.go:112 npu exporter starting and the version is v3.0.RC3_linux-aarch64
...
[ERROR] 2022/10/25 17:01:24.191525 34 container/runtime_ops.go:91 failed to get OCI connection
[ERROR] 2022/10/25 17:01:24.191736 34 container/runtime_ops.go:93 try again
[INFO] 2022/10/25 17:01:24.193024 34 collector/npu_collector.go:166 Starting update cache every 5 seconds
[INFO] 2022/10/25 17:01:29.315194 34 collector/npu_collector.go:178 update cache,key is npu-exporter-npu-list
[INFO] 2022/10/25 17:01:29.315407 34 collector/npu_collector.go:183 update cache,key is npu-exporter-containers-devices
[INFO] 2022/10/25 17:01:34.302792 34 collector/npu_collector.go:178 update cache,key is npu-exporter-npu-list
[INFO] 2022/10/25 17:01:34.302983 34 collector/npu_collector.go:183 update cache,key is npu-exporter-containers-devices
...
If the following information is displayed continuously, the component is running properly.
...
[INFO] 2022/10/25 17:01:29.315194 34 collector/npu_collector.go:178 update cache,key is npu-exporter-npu-list
[INFO] 2022/10/25 17:01:29.315407 34 collector/npu_collector.go:183 update cache,key is npu-exporter-containers-devices
...
If the log output above contains the following errors, they can be ignored:
[ERROR] 2022/10/25 17:01:24.191525 34 container/runtime_ops.go:91 failed to get OCI connection
[ERROR] 2022/10/25 17:01:24.191736 34 container/runtime_ops.go:93 try again
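The two binary-mode checks above can be scripted. The following is a minimal sketch, not part of the product: the function names service_is_active and log_looks_healthy are hypothetical, the unit name and log path are taken from the examples above, and the 50-line window is arbitrary. Treating any [ERROR] line other than the two ignorable messages as a problem is an assumption of this sketch, not documented behavior.

```shell
#!/bin/sh
# Sketch: health check for an NPU-Exporter deployed in binary mode.
# Unit name and log path come from the examples above; adjust as needed.
UNIT="npu-exporter"
LOG_FILE="/var/log/mindx-dl/npu-exporter/npu-exporter.log"

# Succeeds only when systemd reports the unit state as "active".
service_is_active() {
    [ "$(systemctl is-active "$1" 2>/dev/null)" = "active" ]
}

# Succeeds when the last 50 lines of the log file contain at least one
# "update cache" entry and no [ERROR] line other than the two ignorable
# OCI connection messages (assumption: any other error is worth a look).
log_looks_healthy() {
    recent="$(tail -n 50 "$1")" || return 1
    printf '%s\n' "$recent" | grep -q 'update cache,key is' || return 1
    unexpected="$(printf '%s\n' "$recent" | grep -F '[ERROR]' \
        | grep -v -e 'failed to get OCI connection' \
                  -e 'runtime_ops.go:[0-9]* try again' || true)"
    [ -z "$unexpected" ]
}

# Example use:
#   service_is_active "$UNIT" && log_looks_healthy "$LOG_FILE" \
#       && echo "npu-exporter looks healthy"
```

Because both functions only return an exit status, they can be chained with && in a cron job or a node-level monitoring script.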
NPU-Exporter Deployed Using a Container
- Run the following command to check the NPU-Exporter pod in the Kubernetes cluster. Ensure that STATUS of the pod is Running and READY is 1/1. If the NPU-Exporter is installed on multiple nodes in the cluster, check the status of each pod.
kubectl get pods -n npu-exporter -o wide
Information similar to the following is displayed.
root@ubuntu:~# kubectl get pods -n npu-exporter -o wide
NAME                 READY   STATUS    RESTARTS   AGE   IP                NODE     NOMINATED NODE   READINESS GATES
npu-exporter-4ln8w   1/1     Running   0          36m   192.168.102.109   ubuntu   <none>           <none>
...
- Run the following command to view NPU-Exporter logs in the Kubernetes cluster:
kubectl logs -n npu-exporter {name_of_the_NPU-Exporter's pod}
Information similar to the following is displayed.
root@ubuntu:~# kubectl logs -n npu-exporter npu-exporter-dq24k
[INFO] 2022/10/25 17:01:18.610431 1 hwlog@v0.0.10/api.go:96 npu-exporter.log's logger init success
[INFO] 2022/10/25 17:01:18.610628 1 npu-exporter/main.go:275 listen on: 0.0.0.0
[INFO] 2022/10/25 17:01:18.610740 1 npu-exporter/main.go:112 npu exporter starting and the version is v3.0.RC3_linux-aarch64
...
[ERROR] 2022/10/25 17:01:24.191525 34 container/runtime_ops.go:91 failed to get OCI connection
[ERROR] 2022/10/25 17:01:24.191736 34 container/runtime_ops.go:93 try again
[INFO] 2022/10/25 17:01:24.193024 34 collector/npu_collector.go:166 Starting update cache every 5 seconds
[INFO] 2022/10/25 17:01:29.315194 34 collector/npu_collector.go:178 update cache,key is npu-exporter-npu-list
[INFO] 2022/10/25 17:01:29.315407 34 collector/npu_collector.go:183 update cache,key is npu-exporter-containers-devices
[INFO] 2022/10/25 17:01:34.302792 34 collector/npu_collector.go:178 update cache,key is npu-exporter-npu-list
[INFO] 2022/10/25 17:01:34.302983 34 collector/npu_collector.go:183 update cache,key is npu-exporter-containers-devices
...
If the following information is displayed continuously, the component is running properly.
...
[INFO] 2022/10/25 17:01:29.315194 34 collector/npu_collector.go:178 update cache,key is npu-exporter-npu-list
[INFO] 2022/10/25 17:01:29.315407 34 collector/npu_collector.go:183 update cache,key is npu-exporter-containers-devices
...
If the log output above contains the following errors, they can be ignored:
[ERROR] 2022/10/25 17:01:24.191525 34 container/runtime_ops.go:91 failed to get OCI connection
[ERROR] 2022/10/25 17:01:24.191736 34 container/runtime_ops.go:93 try again
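When the NPU-Exporter runs on many nodes, checking each pod by hand becomes tedious. The following is a minimal sketch of how the container-mode checks above might be automated. The function names pods_all_ready and pods_logs_healthy are hypothetical, the npu-exporter namespace is taken from the commands above, and the 50-line log window is an arbitrary choice. It relies only on standard kubectl options (--no-headers, -o name, --tail).

```shell
#!/bin/sh
# Sketch: health check for NPU-Exporter pods in a Kubernetes cluster.
NAMESPACE="npu-exporter"

# Reads `kubectl get pods --no-headers` output on stdin. Succeeds only if
# at least one pod is listed and every pod shows READY 1/1 and STATUS Running.
pods_all_ready() {
    awk '{ n++
           if ($2 != "1/1" || $3 != "Running") { print "not ready: " $1; bad = 1 } }
         END { exit (n == 0 || bad) ? 1 : 0 }'
}

# Succeeds when the most recent log lines of every pod in the namespace
# contain at least one "update cache" entry.
pods_logs_healthy() {
    pods="$(kubectl get pods -n "$NAMESPACE" -o name)" || return 1
    [ -n "$pods" ] || return 1
    for pod in $pods; do
        kubectl logs -n "$NAMESPACE" "$pod" --tail=50 \
            | grep -q 'update cache,key is' \
            || { echo "no cache updates: $pod"; return 1; }
    done
}

# Example use:
#   kubectl get pods -n "$NAMESPACE" --no-headers | pods_all_ready \
#       && pods_logs_healthy && echo "all npu-exporter pods look healthy"
```

Reading the pod list from stdin keeps pods_all_ready independent of the cluster, so it can be tried against saved command output before being run on a live system.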