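This section uses the Prometheus integration scenario, in which NPU Exporter reports data to Prometheus, as an example to confirm that the NPU Exporter component is running properly.
Perform the following steps on any node to verify the installation status of NPU Exporter.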
kubectl get pods -n npu-exporter -o wide | grep npu-exporter
Example output:
npu-exporter-4ln8w 1/1 Running 0 36m 192.168.102.109 ubuntu <none> <none>
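The Pod status check above can also be scripted. The following is a minimal sketch, assuming the default `kubectl get pods` column layout shown in the example output (NAME, READY, STATUS, ...). The function name `all_pods_ready` is hypothetical and is not part of NPU Exporter or kubectl.

```shell
# Hypothetical helper: reads `kubectl get pods` rows from stdin and succeeds
# only if at least one npu-exporter Pod is listed and every listed Pod is
# in the Running state with all of its containers ready (READY column n/n).
all_pods_ready() {
  awk '
    $1 ~ /^npu-exporter/ {
      seen = 1
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) bad = 1
    }
    END {
      if (seen && !bad) exit 0
      exit 1
    }
  '
}

# Usage (assumes kubectl access to the cluster):
#   kubectl get pods -n npu-exporter -o wide | all_pods_ready && echo "ready"
```

The function only inspects text, so it can be tried against saved command output before being used on a live cluster.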
kubectl logs -n npu-exporter {name of the npu-exporter Pod}
Example output:
[INFO] 2023/12/08 07:38:56.551173 1 hwlog/api.go:108 npu-exporter.log's logger init success
[INFO] 2023/12/08 07:38:56.551275 1 npu-exporter/main.go:205 listen on: 0.0.0.0
[INFO] 2023/12/08 07:38:56.551369 1 npu-exporter/main.go:325 npu exporter starting and the version is v7.0.RC1_linux-x86_64
[WARN] 2023/12/08 07:38:56.684424 1 npu-exporter/main.go:339 enable unsafe http server
[WARN] 2023/12/08 07:39:01.686205 98 container/runtime_ops.go:150 failed to get OCI connection: context deadline exceeded
[WARN] 2023/12/08 07:39:01.686311 98 container/runtime_ops.go:152 use backup address to try again
[INFO] 2023/12/08 07:39:01.687444 98 collector/npu_collector.go:418 Starting update cache every 5 seconds
[WARN] 2023/12/08 07:39:01.688039 157 collector/npu_collector.go:463 get info of npu-exporter-network-info failed: no value found, so use initial net info
[INFO] 2023/12/08 07:39:01.744739 157 collector/npu_collector.go:476 update cache,key is npu-exporter-network-info
[INFO] 2023/12/08 07:39:01.852413 158 collector/npu_collector.go:499 update cache,key is npu-exporter-containers-devices
[INFO] 2023/12/08 07:39:05.055247 148 collector/npu_collector.go:442 update cache,key is npu-exporter-npu-list
[INFO] 2023/12/08 07:39:06.688352 157 collector/npu_collector.go:476 update cache,key is npu-exporter-network-info
[INFO] 2023/12/08 07:39:06.750876 158 collector/npu_collector.go:499 update cache,key is npu-exporter-containers-devices
[INFO] 2023/12/08 07:39:09.843914 148 collector/npu_collector.go:442 update cache,key is npu-exporter-npu-list
[INFO] 2023/12/08 07:39:11.688505 157 collector/npu_collector.go:476 update cache,key is npu-exporter-network-info
[INFO] 2023/12/08 07:39:11.701081 158 collector/npu_collector.go:499 update cache,key is npu-exporter-containers-devices
[INFO] 2023/12/08 07:39:14.859243 148 collector/npu_collector.go:442 update cache,key is npu-exporter-npu-list
...
If messages like the following keep appearing, the component is running properly.
...
[INFO] 2023/12/08 07:39:01.744739 157 collector/npu_collector.go:476 update cache,key is npu-exporter-network-info
[INFO] 2023/12/08 07:39:01.852413 158 collector/npu_collector.go:499 update cache,key is npu-exporter-containers-devices
[INFO] 2023/12/08 07:39:05.055247 148 collector/npu_collector.go:442 update cache,key is npu-exporter-npu-list
...
In that case, any of the following messages that appear earlier in the log can be ignored.
[WARN] 2023/12/08 07:38:56.684424 1 npu-exporter/main.go:339 enable unsafe http server
[WARN] 2023/12/08 07:39:01.686205 98 container/runtime_ops.go:150 failed to get OCI connection: context deadline exceeded
[WARN] 2023/12/08 07:39:01.686311 98 container/runtime_ops.go:152 use backup address to try again
...
[WARN] 2023/12/08 07:39:01.688039 157 collector/npu_collector.go:463 get info of npu-exporter-network-info failed: no value found, so use initial net info
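To judge whether the periodic `update cache` messages keep appearing, it can help to count them per cache key rather than read the log by eye. The following is a minimal sketch, assuming the log line format shown in the example output, where the cache key is the last field of the line. The function name `count_cache_updates` is hypothetical.

```shell
# Hypothetical helper: reads NPU Exporter log text from stdin and prints,
# for each cache key, how many "update cache" messages were seen.
# Output is one "key count" pair per line, sorted by key.
count_cache_updates() {
  awk '
    /update cache,key is / {
      count[$NF]++          # $NF is the cache key, the last field of the line
    }
    END {
      for (key in count) print key, count[key]
    }
  ' | sort
}

# Usage (Pod name taken from the example output; replace it with your own):
#   kubectl logs -n npu-exporter npu-exporter-4ln8w --tail=200 | count_cache_updates
```

Running the command twice a short time apart and seeing fresh entries for each key suggests that the cache is still being refreshed.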
Perform the following steps on the node where NPU Exporter is installed to verify the installation status of the component.
systemctl status npu-exporter
Example output:
root@ubuntu:~# systemctl status npu-exporter
● npu-exporter.service - Ascend npu exporter
   Loaded: loaded (/etc/systemd/system/npu-exporter.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2022-11-17 16:24:41 CST; 3 days ago
 Main PID: 25121 (npu-exporter)
    Tasks: 8 (limit: 7372)
   CGroup: /system.slice/npu-exporter.service
           └─25121 /usr/local/bin/npu-exporter -ip=127.0.0.1 -port=8082 -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log
...
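The key field in the output above is the `Active:` line. The following is a minimal sketch that extracts that state from `systemctl status` output, assuming the line layout shown in the example. The function name `active_state` is hypothetical.

```shell
# Hypothetical helper: reads `systemctl status` output from stdin and prints
# the state from the "Active:" line, for example "active (running)".
# Prints nothing if no such line is found.
active_state() {
  sed -n 's/^ *Active: \([a-z]* ([^)]*)\).*/\1/p'
}

# Usage (on the node where the service is installed):
#   systemctl status npu-exporter | active_state
```

For a plain yes/no answer in a script, `systemctl is-active --quiet npu-exporter` returns exit status 0 when the unit is active, which avoids parsing the status text altogether.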
cat /var/log/mindx-dl/npu-exporter/npu-exporter.log
Example output:
[INFO] 2023/12/08 07:38:56.551173 1 hwlog/api.go:108 npu-exporter.log's logger init success
[INFO] 2023/12/08 07:38:56.551275 1 npu-exporter/main.go:205 listen on: 0.0.0.0
[INFO] 2023/12/08 07:38:56.551369 1 npu-exporter/main.go:325 npu exporter starting and the version is v7.0.RC1_linux-x86_64
[WARN] 2023/12/08 07:38:56.684424 1 npu-exporter/main.go:339 enable unsafe http server
[WARN] 2023/12/08 07:39:01.686205 98 container/runtime_ops.go:150 failed to get OCI connection: context deadline exceeded
[WARN] 2023/12/08 07:39:01.686311 98 container/runtime_ops.go:152 use backup address to try again
[INFO] 2023/12/08 07:39:01.687444 98 collector/npu_collector.go:418 Starting update cache every 5 seconds
[WARN] 2023/12/08 07:39:01.688039 157 collector/npu_collector.go:463 get info of npu-exporter-network-info failed: no value found, so use initial net info
[INFO] 2023/12/08 07:39:01.744739 157 collector/npu_collector.go:476 update cache,key is npu-exporter-network-info
[INFO] 2023/12/08 07:39:01.852413 158 collector/npu_collector.go:499 update cache,key is npu-exporter-containers-devices
[INFO] 2023/12/08 07:39:05.055247 148 collector/npu_collector.go:442 update cache,key is npu-exporter-npu-list
[INFO] 2023/12/08 07:39:06.688352 157 collector/npu_collector.go:476 update cache,key is npu-exporter-network-info
[INFO] 2023/12/08 07:39:06.750876 158 collector/npu_collector.go:499 update cache,key is npu-exporter-containers-devices
[INFO] 2023/12/08 07:39:09.843914 148 collector/npu_collector.go:442 update cache,key is npu-exporter-npu-list
[INFO] 2023/12/08 07:39:11.688505 157 collector/npu_collector.go:476 update cache,key is npu-exporter-network-info
[INFO] 2023/12/08 07:39:11.701081 158 collector/npu_collector.go:499 update cache,key is npu-exporter-containers-devices
[INFO] 2023/12/08 07:39:14.859243 148 collector/npu_collector.go:442 update cache,key is npu-exporter-npu-list
...
If messages like the following keep appearing, the component is running properly.
...
[INFO] 2023/12/08 07:39:01.744739 157 collector/npu_collector.go:476 update cache,key is npu-exporter-network-info
[INFO] 2023/12/08 07:39:01.852413 158 collector/npu_collector.go:499 update cache,key is npu-exporter-containers-devices
[INFO] 2023/12/08 07:39:05.055247 148 collector/npu_collector.go:442 update cache,key is npu-exporter-npu-list
...
In that case, any of the following messages that appear earlier in the log can be ignored.
[WARN] 2023/12/08 07:38:56.684424 1 npu-exporter/main.go:339 enable unsafe http server
[WARN] 2023/12/08 07:39:01.686205 98 container/runtime_ops.go:150 failed to get OCI connection: context deadline exceeded
[WARN] 2023/12/08 07:39:01.686311 98 container/runtime_ops.go:152 use backup address to try again
...
[WARN] 2023/12/08 07:39:01.688039 157 collector/npu_collector.go:463 get info of npu-exporter-network-info failed: no value found, so use initial net info
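Because the warnings listed above can be ignored, it may be useful to filter them out and look only at the remaining `[WARN]` lines. The following is a minimal sketch that matches on the four message texts quoted in this section. The function name `unexpected_warnings` is hypothetical, and the list of ignorable messages is limited to what this section states.

```shell
# Hypothetical helper: reads NPU Exporter log text from stdin and prints only
# the [WARN] lines that are NOT among the ignorable messages in this section.
unexpected_warnings() {
  awk '
    index($0, "[WARN]") == 0 { next }   # keep only warning lines
    index($0, "enable unsafe http server") { next }
    index($0, "failed to get OCI connection: context deadline exceeded") { next }
    index($0, "use backup address to try again") { next }
    index($0, "get info of npu-exporter-network-info failed: no value found, so use initial net info") { next }
    { print }
  '
}

# Usage (log path taken from the -logFile option shown in the service output):
#   unexpected_warnings < /var/log/mindx-dl/npu-exporter/npu-exporter.log
```

An empty result means the log contains no warnings beyond the ignorable ones. The same filter can be applied to `kubectl logs` output for a containerized deployment.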