Pods of NPU Exporter Are in the CrashLoopBackOff State
Symptom
After the kubectl get pod -A -o wide command is executed, the command output shows that the status of some Pods of NPU Exporter is CrashLoopBackOff.
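To narrow the listing down to the affected Pods, the output can be filtered on the STATUS column. The following is a minimal sketch; the helper name filter_crashloop is hypothetical, and it assumes the default column order of kubectl get pod -A -o wide (NAMESPACE, NAME, READY, STATUS, ...).

```shell
#!/bin/sh
# Hypothetical helper: keep only rows whose 4th column (STATUS, when -A is used)
# is CrashLoopBackOff. Reads the kubectl listing from standard input.
filter_crashloop() {
    awk '$4 == "CrashLoopBackOff"'
}

# Usage on a node with kubectl access:
#   kubectl get pod -A -o wide | filter_crashloop
```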

Cause Analysis
If the preceding symptom occurs, perform the following steps to locate the cause:
- Run the following command to check the error log of the faulty Pod (replace npu-exporter-8l7w2 with the actual Pod name). Alternatively, check whether the /var/log/mindx-dl/npu-exporter/npu-exporter.log file contains a deviceManager initialization error. If it does, the NPU device cannot be found.
kubectl logs -f -n npu-exporter npu-exporter-8l7w2
Command output:
[ERROR] 2024/10/28 08:50:46.650662 10 devmanager/devmanager.go:91 deviceManager init failed, prepare dcmi failed, err: dcmi init failed, error code: -8005
[ERROR] 2024/10/28 08:50:46.652739 10 collector/npu_collector.go:467 new npu collector failed, error is auto init failed, err: get chip info failed, err: device Manager is nil, may encounter an exception during initialization. You can check the system log to confirm
- Run the npu-smi info command. If error code -8005 is returned as shown in the following output, dcmi initialization failed. This indicates that the server was not restarted after the NPU driver and firmware were upgraded.
dcmi module initialize failed. ret is -8005
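Both checks above look for the same -8005 initialization failure, so they can be scripted with a single text match. The following is a minimal sketch; the helper name has_dcmi_init_error is hypothetical, and the two strings it matches are copied from the log and npu-smi output shown above.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if standard input contains either
# -8005 message quoted in this topic. -F matches the strings literally.
has_dcmi_init_error() {
    grep -q -F \
        -e 'dcmi init failed, error code: -8005' \
        -e 'dcmi module initialize failed. ret is -8005'
}

# Usage (replace the Pod name with the faulty Pod):
#   kubectl logs -n npu-exporter npu-exporter-8l7w2 | has_dcmi_init_error && echo "dcmi init failed"
#   has_dcmi_init_error < /var/log/mindx-dl/npu-exporter/npu-exporter.log && echo "dcmi init failed"
#   npu-smi info 2>&1 | has_dcmi_init_error && echo "dcmi init failed"
```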
Solution
Check the /var/log/ascend_seclog/ascend_install.log file to confirm that the firmware was upgraded. If it was, restart the server as prompted after the upgrade.
If the Pods are still in the CrashLoopBackOff state after the restart, repeat the first step in Cause Analysis to check the logs and locate the fault.
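After the server is restarted, the Pod status can be checked again before returning to the logs. The following is a minimal sketch; the helper name all_running is hypothetical, and it assumes the NPU Exporter Pods run in the npu-exporter namespace used in the kubectl logs command above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if at least one Pod row is read and every
# row has STATUS (3rd column without -A) equal to Running.
all_running() {
    awk '$3 != "Running" { bad = 1 } END { exit (NR > 0 && !bad) ? 0 : 1 }'
}

# Usage on a node with kubectl access:
#   kubectl get pod -n npu-exporter --no-headers | all_running \
#       && echo "NPU Exporter Pods are Running" \
#       || echo "Some Pods are not Running; check the logs again"
```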