NPU-Exporter

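This section uses integration with Prometheus, with data reported to Prometheus, as the example scenario for confirming that the MindCluster NPU Exporter component is running properly.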

MindCluster NPU Exporter deployed as a container

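  1. Run the following command to list the MindCluster NPU Exporter Pods in the K8s cluster. Each Pod must show a STATUS of Running and a READY value of 1/1. If MindCluster NPU Exporter is installed on several nodes in the cluster, check the Pod on every node.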

    kubectl get pods -n npu-exporter -o wide

    Example output:

    NAME                 READY   STATUS    RESTARTS   AGE   IP                NODE         NOMINATED NODE   READINESS GATES
    npu-exporter-4ln8w   1/1     Running   0          36m   192.168.102.109   ubuntu       <none>           <none>
    ...

  2. Run the following command to view the MindCluster NPU Exporter log in the K8s cluster.

    kubectl logs -n npu-exporter {name of the npu-exporter Pod}

    Example output:

    [INFO]     2023/12/08 07:38:56.551173 1       hwlog/api.go:108    npu-exporter.log's logger init success
    [INFO]     2023/12/08 07:38:56.551275 1       npu-exporter/main.go:205    listen on: 0.0.0.0
    [INFO]     2023/12/08 07:38:56.551369 1       npu-exporter/main.go:325    npu exporter starting and the version is v5.0.0.2_linux-x86_64
    [WARN]     2023/12/08 07:38:56.684424 1       npu-exporter/main.go:339    enable unsafe http server
    [WARN]     2023/12/08 07:39:01.686205 98      container/runtime_ops.go:150    failed to get OCI connection: context deadline exceeded
    [WARN]     2023/12/08 07:39:01.686311 98      container/runtime_ops.go:152    use backup address to try again
    [INFO]     2023/12/08 07:39:01.687444 98      collector/npu_collector.go:418    Starting update cache every 5 seconds
    [WARN]     2023/12/08 07:39:01.688039 157     collector/npu_collector.go:463    get info of npu-exporter-network-info failed: no value found, so use initial net info
    [INFO]     2023/12/08 07:39:01.744739 157     collector/npu_collector.go:476    update cache,key is npu-exporter-network-info
    [INFO]     2023/12/08 07:39:01.852413 158     collector/npu_collector.go:499    update cache,key is npu-exporter-containers-devices
    [INFO]     2023/12/08 07:39:05.055247 148     collector/npu_collector.go:442    update cache,key is npu-exporter-npu-list
    [INFO]     2023/12/08 07:39:06.688352 157     collector/npu_collector.go:476    update cache,key is npu-exporter-network-info
    [INFO]     2023/12/08 07:39:06.750876 158     collector/npu_collector.go:499    update cache,key is npu-exporter-containers-devices
    [INFO]     2023/12/08 07:39:09.843914 148     collector/npu_collector.go:442    update cache,key is npu-exporter-npu-list
    [INFO]     2023/12/08 07:39:11.688505 157     collector/npu_collector.go:476    update cache,key is npu-exporter-network-info
    [INFO]     2023/12/08 07:39:11.701081 158     collector/npu_collector.go:499    update cache,key is npu-exporter-containers-devices
    [INFO]     2023/12/08 07:39:14.859243 148     collector/npu_collector.go:442    update cache,key is npu-exporter-npu-list
    ...

    If messages like the following keep appearing, the component is running properly.

    ...
    [INFO]     2023/12/08 07:39:01.744739 157     collector/npu_collector.go:476    update cache,key is npu-exporter-network-info
    [INFO]     2023/12/08 07:39:01.852413 158     collector/npu_collector.go:499    update cache,key is npu-exporter-containers-devices
    [INFO]     2023/12/08 07:39:05.055247 148     collector/npu_collector.go:442    update cache,key is npu-exporter-npu-list
    ...

    In that case, the following entries earlier in the log can be ignored.

    [WARN]     2023/12/08 07:38:56.684424 1       npu-exporter/main.go:339    enable unsafe http server
    [WARN]     2023/12/08 07:39:01.686205 98      container/runtime_ops.go:150    failed to get OCI connection: context deadline exceeded
    [WARN]     2023/12/08 07:39:01.686311 98      container/runtime_ops.go:152    use backup address to try again
    ...
    [WARN]     2023/12/08 07:39:01.688039 157     collector/npu_collector.go:463    get info of npu-exporter-network-info failed: no value found, so use initial net info
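
When many nodes run the exporter, the two checks above can be scripted rather than read by eye. The following is a minimal sketch, not part of MindCluster: the function names `pods_not_ready` and `count_cache_updates` are hypothetical helpers, and the script assumes the column layout and log wording shown in the example output above.

```shell
#!/bin/sh
# Hypothetical helpers; they only parse text in the formats shown above.

# Reads `kubectl get pods -n npu-exporter -o wide` output on stdin and prints
# the name of every Pod whose READY is not 1/1 or whose STATUS is not Running.
pods_not_ready() {
    awk 'NR > 1 && NF >= 3 && ($2 != "1/1" || $3 != "Running") { print $1 }'
}

# Reads an NPU Exporter log on stdin and prints the number of
# "update cache,key is ..." lines it contains.
count_cache_updates() {
    # grep -c exits non-zero when the count is 0; keep the exit status clean.
    grep -c 'update cache,key is' || true
}
```

A possible way to use the helpers (an empty result from the first command means every Pod is ready, and a count that keeps growing between runs of the second suggests the cache is being refreshed):

    kubectl get pods -n npu-exporter -o wide | pods_not_ready
    kubectl logs -n npu-exporter {name of the npu-exporter Pod} | count_cache_updates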

MindCluster NPU Exporter deployed as a binary

  1. Log in to the node where MindCluster NPU Exporter is deployed and run the following command to check the status of the component service. The status must be active (running).

    systemctl status npu-exporter

    Example output:

    root@ubuntu:~# systemctl status npu-exporter
    ● npu-exporter.service - Ascend npu exporter
       Loaded: loaded (/etc/systemd/system/npu-exporter.service; enabled; vendor preset: enabled)
       Active: active (running) since Thu 2022-11-17 16:24:41 CST; 3 days ago
     Main PID: 25121 (npu-exporter)
        Tasks: 8 (limit: 7372)
       CGroup: /system.slice/npu-exporter.service
               └─25121 /usr/local/bin/npu-exporter -ip=127.0.0.1 -port=8082 -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log
    ...

  2. View the component log.

    cat /var/log/mindx-dl/npu-exporter/npu-exporter.log

    Example output:

    [INFO]     2023/12/08 07:38:56.551173 1       hwlog/api.go:108    npu-exporter.log's logger init success
    [INFO]     2023/12/08 07:38:56.551275 1       npu-exporter/main.go:205    listen on: 0.0.0.0
    [INFO]     2023/12/08 07:38:56.551369 1       npu-exporter/main.go:325    npu exporter starting and the version is v5.0.0.2_linux-x86_64
    [WARN]     2023/12/08 07:38:56.684424 1       npu-exporter/main.go:339    enable unsafe http server
    [WARN]     2023/12/08 07:39:01.686205 98      container/runtime_ops.go:150    failed to get OCI connection: context deadline exceeded
    [WARN]     2023/12/08 07:39:01.686311 98      container/runtime_ops.go:152    use backup address to try again
    [INFO]     2023/12/08 07:39:01.687444 98      collector/npu_collector.go:418    Starting update cache every 5 seconds
    [WARN]     2023/12/08 07:39:01.688039 157     collector/npu_collector.go:463    get info of npu-exporter-network-info failed: no value found, so use initial net info
    [INFO]     2023/12/08 07:39:01.744739 157     collector/npu_collector.go:476    update cache,key is npu-exporter-network-info
    [INFO]     2023/12/08 07:39:01.852413 158     collector/npu_collector.go:499    update cache,key is npu-exporter-containers-devices
    [INFO]     2023/12/08 07:39:05.055247 148     collector/npu_collector.go:442    update cache,key is npu-exporter-npu-list
    [INFO]     2023/12/08 07:39:06.688352 157     collector/npu_collector.go:476    update cache,key is npu-exporter-network-info
    [INFO]     2023/12/08 07:39:06.750876 158     collector/npu_collector.go:499    update cache,key is npu-exporter-containers-devices
    [INFO]     2023/12/08 07:39:09.843914 148     collector/npu_collector.go:442    update cache,key is npu-exporter-npu-list
    [INFO]     2023/12/08 07:39:11.688505 157     collector/npu_collector.go:476    update cache,key is npu-exporter-network-info
    [INFO]     2023/12/08 07:39:11.701081 158     collector/npu_collector.go:499    update cache,key is npu-exporter-containers-devices
    [INFO]     2023/12/08 07:39:14.859243 148     collector/npu_collector.go:442    update cache,key is npu-exporter-npu-list
    ...

    If messages like the following keep appearing, the component is running properly.

    ...
    [INFO]     2023/12/08 07:39:01.744739 157     collector/npu_collector.go:476    update cache,key is npu-exporter-network-info
    [INFO]     2023/12/08 07:39:01.852413 158     collector/npu_collector.go:499    update cache,key is npu-exporter-containers-devices
    [INFO]     2023/12/08 07:39:05.055247 148     collector/npu_collector.go:442    update cache,key is npu-exporter-npu-list
    ...

    In that case, the following entries earlier in the log can be ignored.

    [WARN]     2023/12/08 07:38:56.684424 1       npu-exporter/main.go:339    enable unsafe http server
    [WARN]     2023/12/08 07:39:01.686205 98      container/runtime_ops.go:150    failed to get OCI connection: context deadline exceeded
    [WARN]     2023/12/08 07:39:01.686311 98      container/runtime_ops.go:152    use backup address to try again
    ...
    [WARN]     2023/12/08 07:39:01.688039 157     collector/npu_collector.go:463    get info of npu-exporter-network-info failed: no value found, so use initial net info
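
To separate the ignorable startup warnings from anything else worth a look, the log can be filtered. This is a minimal sketch under stated assumptions: `unexpected_warnings` is a hypothetical helper name, the four patterns are copied from the ignorable entries listed above, and the `[ERROR]` level tag is an assumption because the sample log shows only `[INFO]` and `[WARN]`.

```shell
#!/bin/sh
# Hypothetical helper: reads an NPU Exporter log on stdin and prints every
# WARN/ERROR line except the four startup warnings listed above as ignorable.
# The [ERROR] tag is assumed; the sample log contains only [INFO] and [WARN].
unexpected_warnings() {
    grep -E '^[[:space:]]*\[(WARN|ERROR)\]' \
      | grep -v -e 'enable unsafe http server' \
                -e 'failed to get OCI connection' \
                -e 'use backup address to try again' \
                -e 'get info of npu-exporter-network-info failed' \
      || true   # finding nothing is the healthy case, so do not fail
}
```

A possible way to use it on the node, together with a non-interactive form of the service check (`systemctl is-active` prints `active` for a running unit):

    systemctl is-active npu-exporter
    unexpected_warnings < /var/log/mindx-dl/npu-exporter/npu-exporter.log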