NPU Exporter Fails to Check the Dynamic Path, and "check uid or mode failed" Is Recorded in the Log
Symptom
- Running the kubectl get pod -A | grep npu-exporter command shows that the NPU Exporter container fails to start.
npu-exporter npu-exporter-rtgpg 0/1 CrashLoopBackOff 2 39s
- Run the kubectl logs -f -n npu-exporter npu-exporter-rtgpg command to view the error information. The log information is as follows.
[INFO] 2023/10/24 09:55:04.454169 1 hwlog/api.go:108 npu-exporter.log's logger init success
[INFO] 2023/10/24 09:55:04.454389 1 npu-exporter/main.go:205 listen on: 0.0.0.0
[INFO] 2023/10/24 09:55:04.454607 1 npu-exporter/main.go:325 npu exporter starting and the version is v{version}_linux-aarch64
2023/10/24 09:55:04 command exec failed, &exec.ExitError{ProcessState:(*os.ProcessState)(0x4000495c80), Stderr:[]uint8(nil)}
[ERROR] 2023/10/24 09:55:04.458386 1 devmanager/devmanager.go:83 deviceManager init failed, prepare dcmi failed, err: &errors.errorString{s:"cannot found valid driver lib, fromEnv: lib path is invalid, [/usr/local: check uid or mode failed; /usr/local: check uid or mode failed;], fromLdCmd: can't find valid lib"}
[ERROR] 2023/10/24 09:55:04.458589 1 collector/npu_collector.go:136 new npu collector failed, error is auto init failed, err: get chip info failed, err: device Manager is nil, may encounter an exception during initialization. You can check the system log to confirm
[ERROR] 2023/10/24 09:55:04.458678 1 npu-exporter/main.go:329 register prometheus failed
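When the log is long, the rejected paths can be pulled out of it directly instead of being read by eye. The following is a minimal sketch, assuming a POSIX shell and a grep that supports -o; the function name failed_lib_paths is hypothetical and is not part of NPU Exporter.

```shell
# Hypothetical helper (not part of NPU Exporter): reads log text on stdin and
# prints each distinct path that was reported with "check uid or mode failed".
failed_lib_paths() {
    # grep -o prints only the matched text: a run of characters other than
    # ']', '[', space, or ';', followed by the fixed error string.
    grep -o '[^][ ;]*: check uid or mode failed' |
        sed 's/: check uid or mode failed$//' |
        sort -u
}
```

For example, kubectl logs -n npu-exporter npu-exporter-rtgpg | failed_lib_paths prints /usr/local for the log shown in this section.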
Cause Analysis
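The permission on the /usr/local directory in the container image is incorrect (drwxrwxrwx instead of drwxr-xr-x), so the check of the driver library path fails with "check uid or mode failed".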
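The log does not show the exact rule that NPU Exporter applies. Judging only from the fix used in the Solution (chmod 755), a directory that is writable by group or others is rejected. The following is a minimal sketch of that kind of mode check, assuming GNU coreutils stat (the -c option); the function name dir_mode_is_safe is hypothetical, and the owner (uid) part of the check is not covered.

```shell
# Hypothetical helper: succeeds (exit 0) when the directory is NOT writable by
# group or others, which is the state that chmod 755 produces.
# Assumption: GNU coreutils stat, which supports -c '%a' (octal mode bits).
dir_mode_is_safe() {
    mode=$(stat -c '%a' "$1") || return 2
    # The leading 0 makes the shell read the value as octal; the mask 022
    # selects the group-write and other-write bits.
    [ $(( 0$mode & 022 )) -eq 0 ]
}
```

Inside the container, dir_mode_is_safe /usr/local || echo "/usr/local is too permissive" gives a quick yes or no answer before and after the permission is changed.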
Solution
- Obtain the container image ID in the NPU Exporter installation path.
docker ps -a | grep npu-exporter
In the following command output, 15bca02e16e9 is the ID of the required container image.
37a084a19207 15bca02e16e9 "/bin/bash -c -- 'um…" 25 seconds ago Exited (0) 24 seconds ago k8s_npu-exporter_npu-exporter-rtgpg_npu-exporter_2fa00320-fd40-4b1a-81d0-145a26a8f4e1_4
2dbb86d6619f k8s.gcr.io/pause:3.2 "/pause" About a minute ago Up About a minute k8s_POD_npu-exporter-rtgpg_npu-exporter_2fa00320-fd40-4b1a-81d0-145a26a8f4e1_0
- View the image information.
docker images | grep 15bca02e16e9
Command output:
npu-exporter v{version} 15bca02e16e9 3 minutes ago 93.2MB
- Check the permission on the faulty directory.
docker run -it 15bca02e16e9 bash
ll /usr/
The following information is displayed. The local/ directory is the directory whose permission is incorrect.
total 44
drwxr-xr-x 1 root root 4096 Oct 19 2022 ./
drwxr-xr-x 1 root root 4096 Oct 24 09:58 ../
drwxr-xr-x 2 root root 4096 Oct 19 2022 bin/
drwxr-xr-x 2 root root 4096 Apr 24 2018 games/
drwxr-xr-x 2 root root 4096 Apr 24 2018 include/
drwxr-xr-x 10 root root 4096 Oct 19 2022 lib/
drwxrwxrwx 1 root root 4096 Oct 19 2022 local/
drwxr-xr-x 2 root root 4096 Oct 19 2022 sbin/
drwxr-xr-x 33 root root 4096 Oct 19 2022 share/
drwxr-xr-x 2 root root 4096 Apr 24 2018 src/
- Change the directory permission.
root@493a58982af9:/# chmod 755 /usr/local
root@493a58982af9:/# ll /usr/
In the preceding commands, 493a58982af9 indicates the container ID.
If the following information is displayed, the permission is correctly set.
total 44
drwxr-xr-x 1 root root 4096 Oct 19 2022 ./
drwxr-xr-x 1 root root 4096 Oct 24 09:58 ../
drwxr-xr-x 2 root root 4096 Oct 19 2022 bin/
drwxr-xr-x 2 root root 4096 Apr 24 2018 games/
drwxr-xr-x 2 root root 4096 Apr 24 2018 include/
drwxr-xr-x 10 root root 4096 Oct 19 2022 lib/
drwxr-xr-x 1 root root 4096 Oct 19 2022 local/
drwxr-xr-x 2 root root 4096 Oct 19 2022 sbin/
drwxr-xr-x 33 root root 4096 Oct 19 2022 share/
drwxr-xr-x 2 root root 4096 Apr 24 2018 src/
- Exit the container.
root@493a58982af9:/# exit
- Commit the modified container as an image, using the container ID and the original image name and tag.
docker commit 493a58982af9 npu-exporter:v{version}
Command output:
sha256:34a360670e213cc8817b352a055969e620ed15ac7d26dcb05e391f0a4ad2682a
- Check the container image status of NPU Exporter again.
kubectl get po -A | grep npu-exporter
Wait for the container to restart automatically, or forcibly restart it manually, and then check the status.
If the following information is displayed, the NPU Exporter container is running properly.
npu-exporter npu-exporter-rtgpg 1/1 Running 7 10m
- Delete the created container copy.
docker rm 493a58982af9
Example command output:
493a58982af9
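The status check performed after the image is committed can also be scripted, so that a restart loop is detected without reading the table by eye. The following is a minimal sketch, assuming the default column layout of kubectl get pod -A (NAMESPACE NAME READY STATUS RESTARTS AGE) and a POSIX awk; the function name pods_ready is hypothetical.

```shell
# Hypothetical helper: reads rows of `kubectl get pod -A` on stdin (columns:
# NAMESPACE NAME READY STATUS RESTARTS AGE) and succeeds only if at least one
# row was read and every row is Running with all of its containers ready.
pods_ready() {
    awk '
        {
            seen = 1
            split($3, ready, "/")      # READY looks like "1/1"
            if ($4 != "Running" || ready[1] != ready[2]) bad = 1
        }
        END { if (seen && !bad) exit 0; exit 1 }
    '
}
```

For example, kubectl get pod -A | grep npu-exporter | pods_ready && echo "NPU Exporter is running" prints the message only when every matching pod is ready. The header row must be filtered out first (grep does this here), because its STATUS field is not Running.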