npu-exporter npu-exporter-rtgpg 0/1 CrashLoopBackOff 2 39s
[INFO] 2023/10/24 09:55:04.454169 1 hwlog/api.go:108 npu-exporter.log's logger init success
[INFO] 2023/10/24 09:55:04.454389 1 npu-exporter/main.go:205 listen on: 0.0.0.0
[INFO] 2023/10/24 09:55:04.454607 1 npu-exporter/main.go:325 npu exporter starting and the version is v{version}_linux-aarch64
2023/10/24 09:55:04 command exec failed, &exec.ExitError{ProcessState:(*os.ProcessState)(0x4000495c80), Stderr:[]uint8(nil)}
[ERROR] 2023/10/24 09:55:04.458386 1 devmanager/devmanager.go:83 deviceManager init failed, prepare dcmi failed, err: &errors.errorString{s:"cannot found valid driver lib, fromEnv: lib path is invalid, [/usr/local: check uid or mode failed; /usr/local: check uid or mode failed;], fromLdCmd: can't find valid lib"}
[ERROR] 2023/10/24 09:55:04.458589 1 collector/npu_collector.go:136 new npu collector failed, error is auto init failed, err: get chip info failed, err: device Manager is nil, may encounter an exception during initialization. You can check the system log to confirm
[ERROR] 2023/10/24 09:55:04.458678 1 npu-exporter/main.go:329 register prometheus failed
The permissions of the "/usr/local" directory in the container image are incorrect.
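The log only reports "check uid or mode failed" and does not state the exact rule that is applied. The following is a minimal sketch, assuming that the check rejects a directory that group or others can write to (which is consistent with mode 777 failing and mode 755 being the fix). The function name describe_dir_mode is hypothetical and is not part of NPU Exporter or the driver.

```shell
# Hypothetical helper, not part of NPU Exporter or the driver.
# Assumption: a directory that group or others can write to fails the check.
describe_dir_mode() {
    # GNU/BusyBox stat: %a prints the permission bits in octal, for example 755.
    _mode=$(stat -c '%a' "$1") || return 1
    # A leading 0 makes the shell read the value as octal; 022 masks the
    # group-write and other-write bits.
    if [ $(( 0$_mode & 022 )) -ne 0 ]; then
        echo "loose: $1 has mode $_mode"
    else
        echo "ok: $1 has mode $_mode"
    fi
}
```

Inside the container, running describe_dir_mode /usr/local would be one way to confirm the directory mode before and after changing it.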
docker ps -a | grep npu-exporter
37a084a19207 15bca02e16e9 "/bin/bash -c -- 'um…" 25 seconds ago Exited (0) 24 seconds ago k8s_npu-exporter_npu-exporter-rtgpg_npu-exporter_2fa00320-fd40-4b1a-81d0-145a26a8f4e1_4
2dbb86d6619f k8s.gcr.io/pause:3.2 "/pause" About a minute ago Up About a minute k8s_POD_npu-exporter-rtgpg_npu-exporter_2fa00320-fd40-4b1a-81d0-145a26a8f4e1_0
docker images | grep 15bca02e16e9
Example output:
npu-exporter v{version} 15bca02e16e9 3 minutes ago 93.2MB
docker run -it 15bca02e16e9 bash
ll /usr/
total 44
drwxr-xr-x 1 root root 4096 Oct 19 2022 ./
drwxr-xr-x 1 root root 4096 Oct 24 09:58 ../
drwxr-xr-x 2 root root 4096 Oct 19 2022 bin/
drwxr-xr-x 2 root root 4096 Apr 24 2018 games/
drwxr-xr-x 2 root root 4096 Apr 24 2018 include/
drwxr-xr-x 10 root root 4096 Oct 19 2022 lib/
drwxrwxrwx 1 root root 4096 Oct 19 2022 local/
drwxr-xr-x 2 root root 4096 Oct 19 2022 sbin/
drwxr-xr-x 33 root root 4096 Oct 19 2022 share/
drwxr-xr-x 2 root root 4096 Apr 24 2018 src/
root@493a58982af9:/# chmod 755 /usr/local
root@493a58982af9:/# ll /usr/
493a58982af9 is the container ID.
total 44
drwxr-xr-x 1 root root 4096 Oct 19 2022 ./
drwxr-xr-x 1 root root 4096 Oct 24 09:58 ../
drwxr-xr-x 2 root root 4096 Oct 19 2022 bin/
drwxr-xr-x 2 root root 4096 Apr 24 2018 games/
drwxr-xr-x 2 root root 4096 Apr 24 2018 include/
drwxr-xr-x 10 root root 4096 Oct 19 2022 lib/
drwxr-xr-x 1 root root 4096 Oct 19 2022 local/
drwxr-xr-x 2 root root 4096 Oct 19 2022 sbin/
drwxr-xr-x 33 root root 4096 Oct 19 2022 share/
drwxr-xr-x 2 root root 4096 Apr 24 2018 src/
root@493a58982af9:/# exit
docker commit 493a58982af9 npu-exporter:v{version}
sha256:34a360670e213cc8817b352a055969e620ed15ac7d26dcb05e391f0a4ad2682a
kubectl get po -A | grep npu-exporter
Wait for the container to restart automatically, or force a restart manually, and then check the status of the container image.
Example output, indicating that the NPU Exporter container image is running normally:
npu-exporter npu-exporter-rtgpg 1/1 Running 7 10m
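The manual forced restart mentioned in this step is not spelled out. One possible approach, sketched here under the assumption that the Pod is managed by a controller (for example a DaemonSet) that recreates it after deletion, is to delete the Pod with kubectl delete pod. The helper name restart_cmd_for_row is hypothetical; it only prints the command built from a row of kubectl get po -A output, so that the command can be reviewed before it is run.

```shell
# Hypothetical helper: turn one row of `kubectl get po -A` output
# (columns: NAMESPACE NAME READY STATUS RESTARTS AGE) into a delete command.
# It only prints the command; it does not contact the cluster.
restart_cmd_for_row() {
    # Unquoted on purpose: split the row on whitespace into $1, $2, ...
    set -- $1
    echo "kubectl delete pod $2 -n $1"
}

# Row taken from the failing Pod shown at the start of this section.
restart_cmd_for_row "npu-exporter npu-exporter-rtgpg 0/1 CrashLoopBackOff 2 39s"
# prints: kubectl delete pod npu-exporter-rtgpg -n npu-exporter
```

Printing instead of executing is a deliberate choice: deleting a Pod that is not managed by a controller removes it permanently, so the generated command should be checked first.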
docker rm 493a58982af9
493a58982af9