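Running the kubectl get pod -A -owide command shows that some Pods of the NPU Exporter component are in the CrashLoopBackOff state.
If this error occurs, follow the steps below to troubleshoot it.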
kubectl logs -fn npu-exporter npu-exporter-8l7w2
[ERROR] 2024/10/28 08:50:46.650662 10 devmanager/devmanager.go:91 deviceManager init failed, prepare dcmi failed, err: dcmi init failed, error code: -8005
[ERROR] 2024/10/28 08:50:46.652739 10 collector/npu_collector.go:467 new npu collector failed, error is auto init failed, err: get chip info failed, err: device Manager is nil, may encounter an exception during initialization. You can check the system log to confirm
dcmi module initialize failed. ret is -8005
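The same code, -8005, appears in both the exporter log and the dcmi message above. As a minimal sketch for confirming that, the hypothetical helper below (dcmi_error_codes is not part of any Ascend tool) extracts the numeric codes from log text on standard input. It assumes grep supports the -o option, as GNU and BSD grep do.

```shell
#!/bin/sh
# Hypothetical helper: print the unique dcmi error codes found in log text on stdin.
# It matches the two phrasings seen above: "error code: -8005" and "ret is -8005".
dcmi_error_codes() {
    grep -oE '(error code: |ret is )-?[0-9]+' |
        grep -oE -- '-?[0-9]+$' |
        sort -u
}

# Usage (Pod name taken from the example above):
#   kubectl logs -n npu-exporter npu-exporter-8l7w2 | dcmi_error_codes
```

If the helper prints a single code, every dcmi failure in the log comes from the same initialization error.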
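Check the /var/log/ascend_seclog/ascend_install.log log, which shows that a firmware upgrade was performed. After a firmware upgrade, you generally need to reboot as instructed by the relevant prompts.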
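One possible way to look for upgrade records in that log is sketched below. The function name recent_upgrade_entries is hypothetical, and the assumption that upgrade records contain the word "upgrade" is only a guess about the log format, so adjust the keyword to match the actual entries.

```shell
#!/bin/sh
# Hypothetical helper: show the most recent lines that mention "upgrade" in the install log.
# Assumption: upgrade records contain the word "upgrade" (case-insensitive match).
recent_upgrade_entries() {
    log_file="${1:-/var/log/ascend_seclog/ascend_install.log}"
    grep -i 'upgrade' "$log_file" | tail -n 20
}

# Usage:
#   recent_upgrade_entries
#   recent_upgrade_entries /path/to/another/install.log
```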
If a Pod enters the CrashLoopBackOff state for another reason, run step 1 to view its logs and locate the fault.
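When several Pods are affected, the log check can be repeated for each of them. The sketch below is one way to do that, not a documented procedure: the function names crashing_pods and dump_crash_logs are hypothetical, and it assumes the default column order of kubectl get pod -A (NAMESPACE, NAME, READY, STATUS).

```shell
#!/bin/sh
# Hypothetical helper: reduce `kubectl get pod -A` output on stdin to
# "namespace pod" pairs whose STATUS column is CrashLoopBackOff.
crashing_pods() {
    awk 'NR > 1 && $4 == "CrashLoopBackOff" { print $1, $2 }'
}

# Hypothetical helper: print the last 50 log lines of every Pod listed on stdin.
dump_crash_logs() {
    while read -r ns pod; do
        echo "=== $ns/$pod ==="
        # --previous shows the log of the last terminated container instance;
        # fall back to the current instance if no previous one exists.
        kubectl logs -n "$ns" "$pod" --previous --tail=50 </dev/null 2>/dev/null ||
            kubectl logs -n "$ns" "$pod" --tail=50 </dev/null
    done
}

# Usage:
#   kubectl get pod -A -owide | crashing_pods | dump_crash_logs
```

Reading the previous container instance first is a deliberate choice: a Pod in CrashLoopBackOff may currently be waiting to restart, so the error that caused the crash is usually in the terminated instance.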