NPU-Exporter

NPU-Exporter Deployed Using Binary Mode

  1. Log in to the node where the NPU-Exporter is deployed and run the following command to check the component status. The component is healthy if the Active field shows active (running).
    systemctl status npu-exporter

    Information similar to the following is displayed.

    root@ubuntu:~# systemctl status npu-exporter
    ● npu-exporter.service - Ascend npu exporter
       Loaded: loaded (/etc/systemd/system/npu-exporter.service; enabled; vendor preset: enabled)
       Active: active (running) since Thu 2022-11-17 16:24:41 CST; 3 days ago
     Main PID: 25121 (npu-exporter)
        Tasks: 8 (limit: 7372)
       CGroup: /system.slice/npu-exporter.service
               └─25121 /usr/local/bin/npu-exporter -ip=127.0.0.1 -port=8082 -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log
    ...
  2. View component logs.
    cat /var/log/mindx-dl/npu-exporter/npu-exporter.log

    Information similar to the following is displayed.

    root@ubuntu:/usr/local/bin# cat /var/log/mindx-dl/npu-exporter/npu-exporter.log
    [INFO]     2022/10/25 17:01:18.610431 1       hwlog@v0.0.10/api.go:96    npu-exporter.log's logger init success
    [INFO]     2022/10/25 17:01:18.610628 1       npu-exporter/main.go:275    listen on: 0.0.0.0
    [INFO]     2022/10/25 17:01:18.610740 1       npu-exporter/main.go:112    npu exporter starting and the version is v3.0.RC3_linux-aarch64
    ...
    [ERROR]    2022/10/25 17:01:24.191525 34      container/runtime_ops.go:91    failed to get OCI connection
    [ERROR]    2022/10/25 17:01:24.191736 34      container/runtime_ops.go:93    try again
    [INFO]     2022/10/25 17:01:24.193024 34      collector/npu_collector.go:166    Starting update cache every 5 seconds
    [INFO]     2022/10/25 17:01:29.315194 34      collector/npu_collector.go:178    update cache,key is npu-exporter-npu-list
    [INFO]     2022/10/25 17:01:29.315407 34      collector/npu_collector.go:183    update cache,key is npu-exporter-containers-devices
    [INFO]     2022/10/25 17:01:34.302792 34      collector/npu_collector.go:178    update cache,key is npu-exporter-npu-list
    [INFO]     2022/10/25 17:01:34.302983 34      collector/npu_collector.go:183    update cache,key is npu-exporter-containers-devices
    ...

    If entries like the following are logged repeatedly (every 5 seconds in this example), the component is running properly.

    ...
    [INFO]     2022/10/25 17:01:29.315194 34      collector/npu_collector.go:178    update cache,key is npu-exporter-npu-list
    [INFO]     2022/10/25 17:01:29.315407 34      collector/npu_collector.go:183    update cache,key is npu-exporter-containers-devices
    ...

    If the log also contains the following error messages, you can ignore them:

    [ERROR]    2022/10/25 17:01:24.191525 34      container/runtime_ops.go:91    failed to get OCI connection
    [ERROR]    2022/10/25 17:01:24.191736 34      container/runtime_ops.go:93    try again
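
The log checks in step 2 can be scripted. The following is a minimal sketch, not part of NPU-Exporter: the function names count_cache_updates and unexpected_errors are hypothetical, and the patterns assume the log format shown in the sample output above.

```shell
#!/bin/sh
# Sketch only: the function names below are hypothetical helpers,
# not part of NPU-Exporter. Patterns assume the sample log format.

# Count "update cache" entries among the last N lines (default 50) of a log file.
# Usage: count_cache_updates LOG_FILE [LINES]
count_cache_updates() {
    tail -n "${2:-50}" "$1" | grep -c 'update cache,key is npu-exporter-'
}

# Print ERROR entries other than the two OCI connection messages
# that the procedure says can be ignored.
# Usage: unexpected_errors LOG_FILE
unexpected_errors() {
    grep -F '[ERROR]' "$1" |
        grep -v -e 'failed to get OCI connection' -e '[[:space:]]try again$'
}
```

For example, running count_cache_updates /var/log/mindx-dl/npu-exporter/npu-exporter.log should print a number greater than 0 while the component is updating its cache, and unexpected_errors with the same path should print nothing if the log contains only the ignorable errors.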

NPU-Exporter Deployed Using a Container

  1. Run the following command to check the NPU-Exporter pod in the Kubernetes cluster. Ensure that STATUS of the pod is Running and READY is 1/1. If the NPU-Exporter is installed on multiple nodes in the cluster, check the pod on each node.
    kubectl get pods -n npu-exporter -o wide

    Information similar to the following is displayed.

    root@ubuntu:~# kubectl get pods -n npu-exporter -o wide
    NAME                 READY   STATUS    RESTARTS   AGE   IP                NODE         NOMINATED NODE   READINESS GATES
    npu-exporter-4ln8w   1/1     Running   0          36m   192.168.102.109   ubuntu       <none>           <none>
    ...
  2. Run the following command to view NPU-Exporter logs in the Kubernetes cluster:
    kubectl logs -n npu-exporter {NPU-Exporter pod name}

    Information similar to the following is displayed.

    root@ubuntu:~# kubectl logs -n npu-exporter npu-exporter-dq24k
    [INFO]     2022/10/25 17:01:18.610431 1       hwlog@v0.0.10/api.go:96    npu-exporter.log's logger init success
    [INFO]     2022/10/25 17:01:18.610628 1       npu-exporter/main.go:275    listen on: 0.0.0.0
    [INFO]     2022/10/25 17:01:18.610740 1       npu-exporter/main.go:112    npu exporter starting and the version is v3.0.RC3_linux-aarch64
    ...
    [ERROR]    2022/10/25 17:01:24.191525 34      container/runtime_ops.go:91    failed to get OCI connection
    [ERROR]    2022/10/25 17:01:24.191736 34      container/runtime_ops.go:93    try again
    [INFO]     2022/10/25 17:01:24.193024 34      collector/npu_collector.go:166    Starting update cache every 5 seconds
    [INFO]     2022/10/25 17:01:29.315194 34      collector/npu_collector.go:178    update cache,key is npu-exporter-npu-list
    [INFO]     2022/10/25 17:01:29.315407 34      collector/npu_collector.go:183    update cache,key is npu-exporter-containers-devices
    [INFO]     2022/10/25 17:01:34.302792 34      collector/npu_collector.go:178    update cache,key is npu-exporter-npu-list
    [INFO]     2022/10/25 17:01:34.302983 34      collector/npu_collector.go:183    update cache,key is npu-exporter-containers-devices
    ...

    If entries like the following are logged repeatedly (every 5 seconds in this example), the component is running properly.

    ...
    [INFO]     2022/10/25 17:01:29.315194 34      collector/npu_collector.go:178    update cache,key is npu-exporter-npu-list
    [INFO]     2022/10/25 17:01:29.315407 34      collector/npu_collector.go:183    update cache,key is npu-exporter-containers-devices
    ...

    If the log also contains the following error messages, you can ignore them:

    [ERROR]    2022/10/25 17:01:24.191525 34      container/runtime_ops.go:91    failed to get OCI connection
    [ERROR]    2022/10/25 17:01:24.191736 34      container/runtime_ops.go:93    try again
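
When the NPU-Exporter runs on many nodes, the pod check in step 1 can be filtered instead of read line by line. The following is a minimal sketch under assumptions: not_ready_pods is a hypothetical helper name, and it assumes the default table layout of kubectl get pods, with READY in the second column and STATUS in the third.

```shell
#!/bin/sh
# Sketch only: not_ready_pods is a hypothetical helper name.
# Reads the table printed by `kubectl get pods` on stdin, skips the header
# line, and prints NAME, READY and STATUS for every pod that is not
# 1/1 Running.
not_ready_pods() {
    awk 'NR > 1 && ($2 != "1/1" || $3 != "Running") { print $1, $2, $3 }'
}
```

A possible usage is kubectl get pods -n npu-exporter -o wide | not_ready_pods, which prints nothing when every pod meets the condition in step 1.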