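This section uses integration with Prometheus, with data reported to Prometheus, as an example to verify that the MindCluster NPU Exporter component is running properly.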
kubectl get pods -n npu-exporter -o wide
Example output:
NAME                 READY   STATUS    RESTARTS   AGE   IP                NODE     NOMINATED NODE   READINESS GATES
npu-exporter-4ln8w   1/1     Running   0          36m   192.168.102.109   ubuntu   <none>           <none>
...
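The Pod status check above can also be scripted. The following is a minimal sketch, not part of the product: the function name pods_all_ready is hypothetical, and it assumes that Pod names start with npu-exporter-, as in the example output. It reads the kubectl table on standard input and succeeds only if every matching Pod is in the Running state with all of its containers ready.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with MindCluster): reads the table printed by
# "kubectl get pods" on stdin and returns 0 only if at least one npu-exporter Pod
# is listed and every such Pod is Running with READY equal to n/n.
pods_all_ready() {
    awk '
        $1 !~ /^npu-exporter-/ { next }   # skip the header row and any other lines
        {
            seen++
            split($2, ready, "/")         # READY column, for example "1/1"
            if ($3 != "Running" || ready[1] != ready[2]) bad++
        }
        END {
            if (seen > 0 && bad == 0) exit 0
            exit 1
        }
    '
}

# Usage (requires access to the cluster):
#   kubectl get pods -n npu-exporter -o wide | pods_all_ready && echo "npu-exporter Pods are ready"
```

The function only inspects text, so it can be tried against saved command output before it is pointed at a live cluster.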
kubectl logs -n npu-exporter {name of the npu-exporter Pod}
[INFO] 2023/12/08 07:38:56.551173 1 hwlog/api.go:108 npu-exporter.log's logger init success
[INFO] 2023/12/08 07:38:56.551275 1 npu-exporter/main.go:205 listen on: 0.0.0.0
[INFO] 2023/12/08 07:38:56.551369 1 npu-exporter/main.go:325 npu exporter starting and the version is v5.0.0.2_linux-x86_64
[WARN] 2023/12/08 07:38:56.684424 1 npu-exporter/main.go:339 enable unsafe http server
[WARN] 2023/12/08 07:39:01.686205 98 container/runtime_ops.go:150 failed to get OCI connection: context deadline exceeded
[WARN] 2023/12/08 07:39:01.686311 98 container/runtime_ops.go:152 use backup address to try again
[INFO] 2023/12/08 07:39:01.687444 98 collector/npu_collector.go:418 Starting update cache every 5 seconds
[WARN] 2023/12/08 07:39:01.688039 157 collector/npu_collector.go:463 get info of npu-exporter-network-info failed: no value found, so use initial net info
[INFO] 2023/12/08 07:39:01.744739 157 collector/npu_collector.go:476 update cache,key is npu-exporter-network-info
[INFO] 2023/12/08 07:39:01.852413 158 collector/npu_collector.go:499 update cache,key is npu-exporter-containers-devices
[INFO] 2023/12/08 07:39:05.055247 148 collector/npu_collector.go:442 update cache,key is npu-exporter-npu-list
[INFO] 2023/12/08 07:39:06.688352 157 collector/npu_collector.go:476 update cache,key is npu-exporter-network-info
[INFO] 2023/12/08 07:39:06.750876 158 collector/npu_collector.go:499 update cache,key is npu-exporter-containers-devices
[INFO] 2023/12/08 07:39:09.843914 148 collector/npu_collector.go:442 update cache,key is npu-exporter-npu-list
[INFO] 2023/12/08 07:39:11.688505 157 collector/npu_collector.go:476 update cache,key is npu-exporter-network-info
[INFO] 2023/12/08 07:39:11.701081 158 collector/npu_collector.go:499 update cache,key is npu-exporter-containers-devices
[INFO] 2023/12/08 07:39:14.859243 148 collector/npu_collector.go:442 update cache,key is npu-exporter-npu-list
...
If the following messages keep appearing, the component is running properly.
...
[INFO] 2023/12/08 07:39:01.744739 157 collector/npu_collector.go:476 update cache,key is npu-exporter-network-info
[INFO] 2023/12/08 07:39:01.852413 158 collector/npu_collector.go:499 update cache,key is npu-exporter-containers-devices
[INFO] 2023/12/08 07:39:05.055247 148 collector/npu_collector.go:442 update cache,key is npu-exporter-npu-list
...
If the earlier part of the log contains the following entries, you can ignore them.
[WARN] 2023/12/08 07:38:56.684424 1 npu-exporter/main.go:339 enable unsafe http server
[WARN] 2023/12/08 07:39:01.686205 98 container/runtime_ops.go:150 failed to get OCI connection: context deadline exceeded
[WARN] 2023/12/08 07:39:01.686311 98 container/runtime_ops.go:152 use backup address to try again
...
[WARN] 2023/12/08 07:39:01.688039 157 collector/npu_collector.go:463 get info of npu-exporter-network-info failed: no value found, so use initial net info
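Checking by eye that the update cache entries keep appearing can be replaced with a small filter. The sketch below is illustrative only: the function name cache_updates_recurring is hypothetical, and the threshold of two updates per key is an assumption chosen to approximate "keeps appearing", not a value taken from the product. It reads log text on standard input and succeeds only if each of the three cache keys from the example log was updated at least twice.

```shell
#!/bin/sh
# Hypothetical helper: reads npu-exporter log text on stdin and returns 0 only if
# each of the three cache keys seen in the example log was updated at least twice.
# The threshold (2) is an assumption for illustration.
cache_updates_recurring() {
    awk '
        /update cache,key is npu-exporter-network-info/       { net++ }
        /update cache,key is npu-exporter-containers-devices/ { dev++ }
        /update cache,key is npu-exporter-npu-list/           { npu++ }
        END {
            if (net >= 2 && dev >= 2 && npu >= 2) exit 0
            exit 1
        }
    '
}

# Usage (requires access to the cluster):
#   kubectl logs -n npu-exporter {name of the npu-exporter Pod} | cache_updates_recurring \
#       && echo "cache updates are recurring"
```

Because the startup log states that the cache is updated every 5 seconds, the Pod needs to have been running for at least a few update cycles before this check can pass.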
systemctl status npu-exporter
Example output:
root@ubuntu:~# systemctl status npu-exporter
● npu-exporter.service - Ascend npu exporter
   Loaded: loaded (/etc/systemd/system/npu-exporter.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2022-11-17 16:24:41 CST; 3 days ago
 Main PID: 25121 (npu-exporter)
    Tasks: 8 (limit: 7372)
   CGroup: /system.slice/npu-exporter.service
           └─25121 /usr/local/bin/npu-exporter -ip=127.0.0.1 -port=8082 -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log
...
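The command line shown in the service status carries the address and port that the exporter listens on. As a rough sketch, the hypothetical function metrics_url_from_status below extracts the -ip and -port values from that output and prints a URL that Prometheus could scrape. The /metrics path is an assumption based on the usual Prometheus exporter convention and is not stated in this section; confirm it against your deployment.

```shell
#!/bin/sh
# Hypothetical helper: reads "systemctl status npu-exporter" output on stdin,
# picks up the -ip=... and -port=... arguments of the npu-exporter command line,
# and prints a scrape URL. The "/metrics" path is an assumption (Prometheus
# convention), not something confirmed by the output above.
metrics_url_from_status() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^-ip=/)   ip = substr($i, 5)     # text after "-ip="
                if ($i ~ /^-port=/) port = substr($i, 7)   # text after "-port="
            }
        }
        END {
            if (ip != "" && port != "") {
                print "http://" ip ":" port "/metrics"
                exit 0
            }
            exit 1
        }
    '
}

# Usage (run on the node where the service is installed):
#   url=$(systemctl status npu-exporter --no-pager --full | metrics_url_from_status) \
#       && curl -s "$url" | head
```

With the example output above, the exporter is bound to 127.0.0.1, so the request only works when it is issued from the same node.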
cat /var/log/mindx-dl/npu-exporter/npu-exporter.log
Example output:
[INFO] 2023/12/08 07:38:56.551173 1 hwlog/api.go:108 npu-exporter.log's logger init success
[INFO] 2023/12/08 07:38:56.551275 1 npu-exporter/main.go:205 listen on: 0.0.0.0
[INFO] 2023/12/08 07:38:56.551369 1 npu-exporter/main.go:325 npu exporter starting and the version is v5.0.0.2_linux-x86_64
[WARN] 2023/12/08 07:38:56.684424 1 npu-exporter/main.go:339 enable unsafe http server
[WARN] 2023/12/08 07:39:01.686205 98 container/runtime_ops.go:150 failed to get OCI connection: context deadline exceeded
[WARN] 2023/12/08 07:39:01.686311 98 container/runtime_ops.go:152 use backup address to try again
[INFO] 2023/12/08 07:39:01.687444 98 collector/npu_collector.go:418 Starting update cache every 5 seconds
[WARN] 2023/12/08 07:39:01.688039 157 collector/npu_collector.go:463 get info of npu-exporter-network-info failed: no value found, so use initial net info
[INFO] 2023/12/08 07:39:01.744739 157 collector/npu_collector.go:476 update cache,key is npu-exporter-network-info
[INFO] 2023/12/08 07:39:01.852413 158 collector/npu_collector.go:499 update cache,key is npu-exporter-containers-devices
[INFO] 2023/12/08 07:39:05.055247 148 collector/npu_collector.go:442 update cache,key is npu-exporter-npu-list
[INFO] 2023/12/08 07:39:06.688352 157 collector/npu_collector.go:476 update cache,key is npu-exporter-network-info
[INFO] 2023/12/08 07:39:06.750876 158 collector/npu_collector.go:499 update cache,key is npu-exporter-containers-devices
[INFO] 2023/12/08 07:39:09.843914 148 collector/npu_collector.go:442 update cache,key is npu-exporter-npu-list
[INFO] 2023/12/08 07:39:11.688505 157 collector/npu_collector.go:476 update cache,key is npu-exporter-network-info
[INFO] 2023/12/08 07:39:11.701081 158 collector/npu_collector.go:499 update cache,key is npu-exporter-containers-devices
[INFO] 2023/12/08 07:39:14.859243 148 collector/npu_collector.go:442 update cache,key is npu-exporter-npu-list
...
If the following messages keep appearing, the component is running properly.
...
[INFO] 2023/12/08 07:39:01.744739 157 collector/npu_collector.go:476 update cache,key is npu-exporter-network-info
[INFO] 2023/12/08 07:39:01.852413 158 collector/npu_collector.go:499 update cache,key is npu-exporter-containers-devices
[INFO] 2023/12/08 07:39:05.055247 148 collector/npu_collector.go:442 update cache,key is npu-exporter-npu-list
...
If the earlier part of the log contains the following entries, you can ignore them.
[WARN] 2023/12/08 07:38:56.684424 1 npu-exporter/main.go:339 enable unsafe http server
[WARN] 2023/12/08 07:39:01.686205 98 container/runtime_ops.go:150 failed to get OCI connection: context deadline exceeded
[WARN] 2023/12/08 07:39:01.686311 98 container/runtime_ops.go:152 use backup address to try again
...
[WARN] 2023/12/08 07:39:01.688039 157 collector/npu_collector.go:463 get info of npu-exporter-network-info failed: no value found, so use initial net info
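To separate the ignorable warnings from anything unexpected, the log can be filtered. The following sketch uses a hypothetical function, unexpected_warnings, that prints only those [WARN] entries whose message is not one of the four listed above. It assumes one log entry per line and matches on message text only, so it may need adjusting if the wording differs between versions.

```shell
#!/bin/sh
# Hypothetical helper: reads npu-exporter log text on stdin and prints every
# [WARN] entry that is NOT one of the four warnings described above as ignorable.
# Matching is done on the message text, which is an assumption about log stability.
unexpected_warnings() {
    awk '
        /^\[WARN\]/ &&
        !/enable unsafe http server/ &&
        !/failed to get OCI connection: context deadline exceeded/ &&
        !/use backup address to try again/ &&
        !/get info of npu-exporter-network-info failed: no value found/
    '
}

# Usage (binary deployment, log path taken from the service command line):
#   unexpected_warnings < /var/log/mindx-dl/npu-exporter/npu-exporter.log
```

Empty output means that every warning in the log is one of the known, ignorable ones; any line that is printed deserves a closer look.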