Pods of NPU Exporter Are in the CrashLoopBackOff State

Symptom

After you run the kubectl get pod -A -o wide command, the output shows that some NPU Exporter Pods are in the CrashLoopBackOff state.
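On a cluster with many Pods, it can help to filter the command output rather than read it by eye. The following is a minimal sketch, not part of NPU Exporter: the function name crashloop_pods and the npu-exporter name prefix are assumptions made for illustration. It expects the plain-text table printed by kubectl get pod -A -o wide, in which STATUS is the fourth column.

```python
from typing import List, Tuple


def crashloop_pods(kubectl_output: str,
                   name_prefix: str = "npu-exporter") -> List[Tuple[str, str]]:
    """Return (namespace, pod name) pairs whose STATUS is CrashLoopBackOff.

    kubectl_output is the text printed by `kubectl get pod -A -o wide`.
    name_prefix is an assumed naming convention; adjust it to your deployment.
    """
    matches = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blank or short lines and the header row.
        if len(fields) < 4 or fields[0] == "NAMESPACE":
            continue
        # The first four columns are NAMESPACE, NAME, READY, STATUS.
        namespace, name, _ready, status = fields[:4]
        if status == "CrashLoopBackOff" and name.startswith(name_prefix):
            matches.append((namespace, name))
    return matches
```

The text could be captured, for example, with subprocess.run(["kubectl", "get", "pod", "-A", "-o", "wide"], capture_output=True, text=True).stdout and passed to the function.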

Cause Analysis

If the preceding symptom occurs, perform the following steps to locate the cause:

  1. Run the following command to check the error log of the faulty Pod (replace npu-exporter-8l7w2 with the actual Pod name). Alternatively, check the /var/log/mindx-dl/npu-exporter/npu-exporter.log log file. If either contains the deviceManager initialization error, the NPU device cannot be found.
    kubectl logs -f -n npu-exporter npu-exporter-8l7w2
    Command output:
    [ERROR]    2024/10/28 08:50:46.650662 10      devmanager/devmanager.go:91    deviceManager init failed, prepare dcmi failed, err: dcmi init failed, error code: -8005
    [ERROR]    2024/10/28 08:50:46.652739 10      collector/npu_collector.go:467    new npu collector failed, error is auto init failed, err: get chip info failed, err: device Manager is nil, may encounter an exception during initialization. You can check the system log to confirm
  2. Run the npu-smi info command. If error code -8005 is returned, as in the following output, DCMI initialization failed. In this case, the server was not restarted after the NPU driver and firmware were upgraded.
    dcmi module initialize failed. ret is -8005
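The check in the two steps above can be sketched as a small script. This is only an illustrative sketch: the function name dcmi_error_code and the regular expression are assumptions based solely on the two message formats shown in this section ("dcmi init failed, error code: -8005" and "dcmi module initialize failed. ret is -8005"); other driver versions may word these messages differently.

```python
import re
from typing import Optional

# Matches the two DCMI failure messages quoted in this section and
# captures the negative error code that follows on the same line.
DCMI_ERROR = re.compile(r"dcmi (?:init|module initialize) failed\b.*?(-\d+)")


def dcmi_error_code(text: str) -> Optional[int]:
    """Return the first DCMI initialization error code found in text.

    text may be the output of `kubectl logs`, the contents of
    npu-exporter.log, or the output of `npu-smi info`.
    Returns None if no DCMI initialization failure is found.
    """
    for line in text.splitlines():
        match = DCMI_ERROR.search(line)
        if match:
            return int(match.group(1))
    return None
```

If the function returns -8005 for both the Pod log and the npu-smi info output, the situation matches the one described in this section.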

Solution

Check the /var/log/ascend_seclog/ascend_install.log file to confirm that the firmware was upgraded. If the firmware was upgraded, restart the server as prompted.

Whenever the CrashLoopBackOff status is displayed, you can perform Step 1 in Cause Analysis to obtain the logs for fault locating.