NPU Exporter Fails to Check the Dynamic Path, and "check uid or mode failed" Is Recorded in the Log

Symptom

  1. Run the kubectl get pod -A | grep npu-exporter command. The command output shows that the NPU Exporter pod fails to start and is in the CrashLoopBackOff state.
    npu-exporter     npu-exporter-rtgpg                         0/1     CrashLoopBackOff   2          39s
  2. Run the kubectl logs -f -n npu-exporter npu-exporter-rtgpg command to view the error information. The log information is as follows.
    [INFO]     2023/10/24 09:55:04.454169 1       hwlog/api.go:108    npu-exporter.log's logger init success
    [INFO]     2023/10/24 09:55:04.454389 1       npu-exporter/main.go:205    listen on: 0.0.0.0
    [INFO]     2023/10/24 09:55:04.454607 1       npu-exporter/main.go:325    npu exporter starting and the version is v{version}_linux-aarch64
    2023/10/24 09:55:04 command exec failed, &exec.ExitError{ProcessState:(*os.ProcessState)(0x4000495c80), Stderr:[]uint8(nil)}
    [ERROR]    2023/10/24 09:55:04.458386 1       devmanager/devmanager.go:83    deviceManager init failed, prepare dcmi failed, err: &errors.errorString{s:"cannot found valid driver lib, fromEnv: lib path is invalid, [/usr/local: check uid or mode failed; /usr/local: check uid or mode failed;], fromLdCmd: can't find valid lib"}
    [ERROR]    2023/10/24 09:55:04.458589 1       collector/npu_collector.go:136    new npu collector failed, error is auto init failed, err: get chip info failed, err: device Manager is nil, may encounter an exception during initialization. You can check the system log to confirm
    [ERROR]    2023/10/24 09:55:04.458678 1       npu-exporter/main.go:329    register prometheus failed
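
When the log is long, it can help to pull out only the paths that failed the check. The following is a minimal sketch rather than an official tool: failed_paths is a hypothetical helper name, and it assumes the message format shown in the log above ("<path>: check uid or mode failed").

```shell
# Hypothetical helper: read log text on stdin and print each distinct path
# reported with "check uid or mode failed".
failed_paths() {
    # Match a run of characters (no spaces, no "[") directly before the message.
    grep -o '[^ []*: check uid or mode failed' |
        sed 's/: check uid or mode failed$//' |
        sort -u
}

# Usage (pod and namespace names are the ones from this example):
#   kubectl logs -n npu-exporter npu-exporter-rtgpg | failed_paths
```

For the log shown above, the only path this reports is /usr/local.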

Cause Analysis

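The permission on the /usr/local directory in the container image is incorrect. The directory is writable by all users (drwxrwxrwx), so it fails the owner and mode check that NPU Exporter performs on the driver library path ("/usr/local: check uid or mode failed").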
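
The mode of a directory can also be checked with a short script. This is a sketch under an assumption: the log does not state the exact rule that NPU Exporter applies, so the function below (dir_mode_ok, a hypothetical name) only flags directories that are writable by group or others, which covers the 777 case in this example. It does not check the owner, and it relies on GNU stat.

```shell
# Hypothetical helper: return 0 if the directory is not writable by group
# or others, 1 if it is, 2 if the mode cannot be read.
# Assumption: group/other write access is what triggers the failure.
dir_mode_ok() {
    dir=$1
    mode=$(stat -c '%a' "$dir") || return 2   # octal mode, e.g. 755 (GNU stat)
    # 022 selects the group-write (020) and other-write (002) bits.
    if [ $(( 0$mode & 022 )) -ne 0 ]; then
        echo "$dir: mode $mode is writable by group or others"
        return 1
    fi
    echo "$dir: mode $mode is not writable by group or others"
}

# Usage inside the container:
#   dir_mode_ok /usr/local
```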

Solution

  1. On the node where NPU Exporter is installed, obtain the ID of the NPU Exporter container image.
    docker ps -a | grep npu-exporter
    In the following command output, 15bca02e16e9 is the ID of the required container image.
    37a084a19207   15bca02e16e9                                                        "/bin/bash -c -- 'um…"   25 seconds ago       Exited (0) 24 seconds ago                                                                                          k8s_npu-exporter_npu-exporter-rtgpg_npu-exporter_2fa00320-fd40-4b1a-81d0-145a26a8f4e1_4
    2dbb86d6619f   k8s.gcr.io/pause:3.2                                                "/pause"                 About a minute ago   Up About a minute                                                                                                  k8s_POD_npu-exporter-rtgpg_npu-exporter_2fa00320-fd40-4b1a-81d0-145a26a8f4e1_0
  2. View the image information.
    docker images | grep 15bca02e16e9

    Command output:

    npu-exporter                                                      v{version}                      15bca02e16e9   3 minutes ago    93.2MB
  3. Start a container from the image and check the permissions of the faulty directory.
    docker run -it 15bca02e16e9 bash
    ll /usr/
    In the following output, the local/ directory has incorrect permissions (drwxrwxrwx).
    total 44
    drwxr-xr-x  1 root root 4096 Oct 19  2022 ./
    drwxr-xr-x  1 root root 4096 Oct 24 09:58 ../
    drwxr-xr-x  2 root root 4096 Oct 19  2022 bin/
    drwxr-xr-x  2 root root 4096 Apr 24  2018 games/
    drwxr-xr-x  2 root root 4096 Apr 24  2018 include/
    drwxr-xr-x 10 root root 4096 Oct 19  2022 lib/
    drwxrwxrwx  1 root root 4096 Oct 19  2022 local/
    drwxr-xr-x  2 root root 4096 Oct 19  2022 sbin/
    drwxr-xr-x 33 root root 4096 Oct 19  2022 share/
    drwxr-xr-x  2 root root 4096 Apr 24  2018 src/
  4. Change the directory permission.
    root@493a58982af9:/# chmod 755 /usr/local
    root@493a58982af9:/# ll /usr/

    In the preceding commands, 493a58982af9 indicates the container ID.

    If the following information is displayed, the permission is correctly set.
    total 44
    drwxr-xr-x  1 root root 4096 Oct 19  2022 ./
    drwxr-xr-x  1 root root 4096 Oct 24 09:58 ../
    drwxr-xr-x  2 root root 4096 Oct 19  2022 bin/
    drwxr-xr-x  2 root root 4096 Apr 24  2018 games/
    drwxr-xr-x  2 root root 4096 Apr 24  2018 include/
    drwxr-xr-x 10 root root 4096 Oct 19  2022 lib/
    drwxr-xr-x  1 root root 4096 Oct 19  2022 local/
    drwxr-xr-x  2 root root 4096 Oct 19  2022 sbin/
    drwxr-xr-x 33 root root 4096 Oct 19  2022 share/
    drwxr-xr-x  2 root root 4096 Apr 24  2018 src/
  5. Exit the container.
    root@493a58982af9:/# exit
  6. Commit the modified container as an image, using the container ID and the original image name and tag.
    docker commit 493a58982af9 npu-exporter:v{version}
    Command output:
    sha256:34a360670e213cc8817b352a055969e620ed15ac7d26dcb05e391f0a4ad2682a
  7. Check the status of the NPU Exporter pod again.
    kubectl get po -A | grep npu-exporter

    You can wait for the container to restart automatically, or restart it manually, and then check the status.

    If the following information is displayed, NPU Exporter is running properly.

    npu-exporter     npu-exporter-rtgpg                         1/1     Running   7         10m
  8. Delete the temporary container that was created in step 3.
    docker rm 493a58982af9
    Example command output:
    493a58982af9
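
Steps 3 to 6 and step 8 can also be run without an interactive shell. The following is a minimal sketch, not a verified procedure: fix_usr_local and the temporary container name npu-exporter-permfix are hypothetical, and the image ID and tag are the values obtained in steps 1 and 2. As with the interactive procedure, the committed image records the command of the temporary container, so the sketch assumes that the pod specification sets its own start command (as the docker ps output in step 1 suggests).

```shell
# Hypothetical helper: fix /usr/local in an image and commit the result.
#   $1  image ID from step 1, for example 15bca02e16e9
#   $2  image name and tag from step 2, for example npu-exporter:v{version}
fix_usr_local() {
    image_id=$1
    image_tag=$2
    tmp_name=npu-exporter-permfix   # hypothetical temporary container name

    # Run chmod in a temporary container created from the faulty image.
    docker run --name "$tmp_name" "$image_id" chmod 755 /usr/local || return 1
    # Save the modified container under the original image name and tag.
    docker commit "$tmp_name" "$image_tag" || return 1
    # Delete the temporary container.
    docker rm "$tmp_name"
}

# Usage:
#   fix_usr_local 15bca02e16e9 npu-exporter:v{version}
```

After the function completes, check the pod status as described in step 7.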