NPU-Exporter

Binary deployment of NPU-Exporter

  1. Log in to the node where NPU-Exporter is deployed and run the following command to check the service status. The status must be active (running).

    systemctl status npu-exporter

    Example output:

    root@ubuntu:~# systemctl status npu-exporter
    ● npu-exporter.service - Ascend npu exporter
       Loaded: loaded (/etc/systemd/system/npu-exporter.service; enabled; vendor preset: enabled)
       Active: active (running) since Thu 2022-11-17 16:24:41 CST; 3 days ago
     Main PID: 25121 (npu-exporter)
        Tasks: 8 (limit: 7372)
       CGroup: /system.slice/npu-exporter.service
               └─25121 /usr/local/bin/npu-exporter -ip=127.0.0.1 -port=8082 -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log
    ...

  2. View the component log.

    cat /var/log/mindx-dl/npu-exporter/npu-exporter.log

    Example output:

    root@ubuntu:/usr/local/bin# cat /var/log/mindx-dl/npu-exporter/npu-exporter.log
    [INFO]     2022/10/25 17:01:18.610431 1       hwlog@v0.0.10/api.go:96    npu-exporter.log's logger init success
    [INFO]     2022/10/25 17:01:18.610628 1       npu-exporter/main.go:275    listen on: 0.0.0.0
    [INFO]     2022/10/25 17:01:18.610740 1       npu-exporter/main.go:112    npu exporter starting and the version is v5.0.RC1_linux-aarch64
    ...
    [ERROR]    2022/10/25 17:01:24.191525 34      container/runtime_ops.go:91    failed to get OCI connection
    [ERROR]    2022/10/25 17:01:24.191736 34      container/runtime_ops.go:93    try again
    [INFO]     2022/10/25 17:01:24.193024 34      collector/npu_collector.go:166    Starting update cache every 5 seconds
    [INFO]     2022/10/25 17:01:29.315194 34      collector/npu_collector.go:178    update cache,key is npu-exporter-npu-list
    [INFO]     2022/10/25 17:01:29.315407 34      collector/npu_collector.go:183    update cache,key is npu-exporter-containers-devices
    [INFO]     2022/10/25 17:01:34.302792 34      collector/npu_collector.go:178    update cache,key is npu-exporter-npu-list
    [INFO]     2022/10/25 17:01:34.302983 34      collector/npu_collector.go:183    update cache,key is npu-exporter-containers-devices
    ...

    If messages like the following keep appearing, the component is running normally.

    ...
    [INFO]     2022/10/25 17:01:29.315194 34      collector/npu_collector.go:178    update cache,key is npu-exporter-npu-list
    [INFO]     2022/10/25 17:01:29.315407 34      collector/npu_collector.go:183    update cache,key is npu-exporter-containers-devices
    ...

    If so, any earlier occurrences of the following entries in the log can be ignored.

    [ERROR]    2022/10/25 17:01:24.191525 34      container/runtime_ops.go:91    failed to get OCI connection
    [ERROR]    2022/10/25 17:01:24.191736 34      container/runtime_ops.go:93    try again
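
The two checks above can be combined into a small script. The following is a minimal sketch under stated assumptions, not part of NPU-Exporter: the function names (`unit_is_running`, `log_is_healthy`, `check_binary_deployment`) and the 20-line tail window are hypothetical choices for illustration. The unit name, the log path, and the "update cache" message are taken from the example output above.

```shell
#!/bin/sh
# Sketch: health check for a binary-deployed NPU-Exporter.
# LOG_FILE defaults to the -logFile value shown in the example output;
# override it if your deployment logs elsewhere.
LOG_FILE="${LOG_FILE:-/var/log/mindx-dl/npu-exporter/npu-exporter.log}"

# unit_is_running reads `systemctl show` properties on stdin and succeeds
# only if they report ActiveState=active and SubState=running, which
# corresponds to "active (running)" in `systemctl status`.
unit_is_running() {
    props=$(cat)
    printf '%s\n' "$props" | grep -qx 'ActiveState=active' &&
        printf '%s\n' "$props" | grep -qx 'SubState=running'
}

# log_is_healthy reads log lines on stdin and succeeds if at least one
# of them is a periodic "update cache" message.
log_is_healthy() {
    grep -q 'update cache'
}

# check_binary_deployment (hypothetical helper) runs both checks in order.
check_binary_deployment() {
    if ! systemctl show -p ActiveState -p SubState npu-exporter | unit_is_running; then
        echo "npu-exporter service is not active (running)" >&2
        return 1
    fi
    # Inspect only the most recent lines (assumed window: 20) so that old
    # entries do not hide a collector that has stopped updating.
    if ! tail -n 20 "$LOG_FILE" | log_is_healthy; then
        echo "no recent 'update cache' entries in $LOG_FILE" >&2
        return 1
    fi
    echo "npu-exporter looks healthy"
}
```

After sourcing the script on the node, run `check_binary_deployment`; a non-zero exit status means one of the two checks failed. The early "failed to get OCI connection" and "try again" entries do not affect the result, because only the latest lines are inspected.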

Container deployment of NPU-Exporter

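  1. Run the following command to list the NPU-Exporter Pods in the K8s cluster. Each Pod must have STATUS Running and READY 1/1. If NPU-Exporter is installed on multiple nodes in the cluster, check the Pod on each node individually.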

    kubectl get pods -n npu-exporter -o wide

    Example output:

    root@ubuntu:~# kubectl get pods -n npu-exporter -o wide
    NAME                 READY   STATUS    RESTARTS   AGE   IP                NODE         NOMINATED NODE   READINESS GATES
    npu-exporter-4ln8w   1/1     Running   0          36m   192.168.102.109   ubuntu       <none>           <none>
    ...

  2. Run the following command to view the NPU-Exporter logs in the K8s cluster.

    kubectl logs -n npu-exporter {Pod name of the npu-exporter component}

    Example output:

    root@ubuntu:~# kubectl logs -n npu-exporter npu-exporter-dq24k 
    [INFO]     2022/10/25 17:01:18.610431 1       hwlog@v0.0.10/api.go:96    npu-exporter.log's logger init success
    [INFO]     2022/10/25 17:01:18.610628 1       npu-exporter/main.go:275    listen on: 0.0.0.0
    [INFO]     2022/10/25 17:01:18.610740 1       npu-exporter/main.go:112    npu exporter starting and the version is v5.0.RC1_linux-aarch64
    ...
    [ERROR]    2022/10/25 17:01:24.191525 34      container/runtime_ops.go:91    failed to get OCI connection
    [ERROR]    2022/10/25 17:01:24.191736 34      container/runtime_ops.go:93    try again
    [INFO]     2022/10/25 17:01:24.193024 34      collector/npu_collector.go:166    Starting update cache every 5 seconds
    [INFO]     2022/10/25 17:01:29.315194 34      collector/npu_collector.go:178    update cache,key is npu-exporter-npu-list
    [INFO]     2022/10/25 17:01:29.315407 34      collector/npu_collector.go:183    update cache,key is npu-exporter-containers-devices
    [INFO]     2022/10/25 17:01:34.302792 34      collector/npu_collector.go:178    update cache,key is npu-exporter-npu-list
    [INFO]     2022/10/25 17:01:34.302983 34      collector/npu_collector.go:183    update cache,key is npu-exporter-containers-devices
    ...

    If messages like the following keep appearing, the component is running normally.

    ...
    [INFO]     2022/10/25 17:01:29.315194 34      collector/npu_collector.go:178    update cache,key is npu-exporter-npu-list
    [INFO]     2022/10/25 17:01:29.315407 34      collector/npu_collector.go:183    update cache,key is npu-exporter-containers-devices
    ...

    If so, any earlier occurrences of the following entries in the log can be ignored.

    [ERROR]    2022/10/25 17:01:24.191525 34      container/runtime_ops.go:91    failed to get OCI connection
    [ERROR]    2022/10/25 17:01:24.191736 34      container/runtime_ops.go:93    try again
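
When several nodes run NPU-Exporter, checking each Pod by hand becomes tedious. The following is a minimal sketch of how the two steps above might be scripted, not an official tool: the function names (`pods_not_ready`, `check_container_deployment`) and the 20-line log window are hypothetical, and the sketch assumes the `npu-exporter` namespace contains only NPU-Exporter Pods.

```shell
#!/bin/sh
# Sketch: health check for a container-deployed NPU-Exporter.
# NAMESPACE is the namespace used in the commands above.
NAMESPACE="npu-exporter"

# pods_not_ready reads `kubectl get pods --no-headers` output on stdin and
# prints the name of every Pod whose READY column is not 1/1 or whose
# STATUS column is not Running.
pods_not_ready() {
    awk '$2 != "1/1" || $3 != "Running" { print $1 }'
}

# check_container_deployment (hypothetical helper) verifies the Pod states
# and then looks for recent "update cache" messages in each Pod's log.
check_container_deployment() {
    pods=$(kubectl get pods -n "$NAMESPACE" --no-headers) || return 1
    if [ -z "$pods" ]; then
        echo "no Pods found in namespace $NAMESPACE" >&2
        return 1
    fi

    bad=$(printf '%s\n' "$pods" | pods_not_ready)
    if [ -n "$bad" ]; then
        echo "Pods not Running with READY 1/1:" >&2
        printf '%s\n' "$bad" >&2
        return 1
    fi

    # Inspect only the most recent log lines of each Pod (assumed window: 20).
    for pod in $(printf '%s\n' "$pods" | awk '{ print $1 }'); do
        if ! kubectl logs -n "$NAMESPACE" "$pod" --tail=20 | grep -q 'update cache'; then
            echo "no recent 'update cache' entries in Pod $pod" >&2
            return 1
        fi
    done
    echo "all NPU-Exporter Pods look healthy"
}
```

After sourcing the script on a machine with `kubectl` access to the cluster, run `check_container_deployment`. It stops at the first failing Pod and returns a non-zero exit status.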