NPU Exporter

  • To use resource monitoring, you must install NPU Exporter, which can be interconnected with Prometheus or Telegraf.
  • If resource monitoring is not used, you do not need to install NPU Exporter. In this case, skip this section.

Restrictions

Before installing NPU Exporter, you need to understand related restrictions. For details, see Table 1.

Table 1 Restrictions

Scenario

Restrictions

NPU driver

NPU Exporter periodically calls NPU driver APIs to detect the NPU status. Before upgrading the driver, stop service tasks and then stop the NPU Exporter container service.

NOTE:
To ensure that NPU Exporter can be installed as a non-root user (for example, hwMindX) when its binary package is deployed, use the --install-for-all parameter during driver installation. Example:
./Ascend-hdk-<chip_type>-npu-driver_<version>_linux-<arch>.run --full --install-for-all

Kubernetes version

Before using NPU Exporter, confirm the Kubernetes version in the environment. If the Kubernetes version is 1.24.x or later, install cri-dockerd.

DCMI dynamic library

The permission requirements for the DCMI dynamic library directories are as follows:

The DCMI dynamic library invoked by NPU Exporter and all of its parent directories must be owned by root; otherwise, the program cannot run. In addition, group and other users must not have write permission on these files and directories.

The depth of the DCMI dynamic library path (the number of directory levels) must be less than 20.

If the dynamic library path is specified through LD_LIBRARY_PATH, the total length of LD_LIBRARY_PATH cannot exceed 1024 characters.

Atlas 200I SoC A1 core board

To use NPU Exporter on an Atlas 200I SoC A1 core board, ensure that the NPU driver version of the Atlas 200I SoC A1 core board is 23.0.RC2 or later.

To deploy NPU Exporter on an Atlas 200I SoC A1 core board in containerized mode, you need to configure the multi-container sharing mode.

VM

To deploy NPU Exporter on VMs, you need to install systemd in NPU Exporter's image. You are advised to add the RUN apt-get update && apt-get install -y systemd command to Dockerfile to install systemd.
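
The ownership and permission rules for the DCMI dynamic library in Table 1 can be checked before installation. The following is a minimal sketch, not part of NPU Exporter: the function names are hypothetical, the library location in the usage comment is an assumed example, and the stat -c and readlink -f options assume GNU coreutils.

```shell
#!/bin/sh
# Hypothetical pre-installation checks for the DCMI restrictions in Table 1.
# Assumes GNU coreutils (stat -c, readlink -f).

# mode_is_safe MODE: succeed if the octal MODE grants no write bit to group or other.
mode_is_safe() {
    [ $(( 0$1 & 022 )) -eq 0 ]
}

# check_root_owned_chain PATH: succeed only if PATH and every parent directory
# are owned by root (UID 0) and are not writable by group or other.
check_root_owned_chain() (
    path=$(readlink -f -- "$1") || exit 1
    while :; do
        info=$(stat -c '%u %a' -- "$path") || exit 1
        owner=${info% *}
        mode=${info#* }
        if [ "$owner" != "0" ]; then
            echo "not owned by root: $path" >&2
            exit 1
        fi
        if ! mode_is_safe "$mode"; then
            echo "writable by group or other: $path ($mode)" >&2
            exit 1
        fi
        [ "$path" = "/" ] && break
        path=$(dirname -- "$path")
    done
)

# ld_library_path_ok VALUE: succeed if VALUE is at most 1024 characters long.
ld_library_path_ok() {
    [ "${#1}" -le 1024 ]
}

# Example usage (the library location is an assumption; use your actual path):
#   check_root_owned_chain /usr/local/dcmi/libdcmi.so
#   ld_library_path_ok "$LD_LIBRARY_PATH"
```

A non-zero exit status from either function indicates that the corresponding restriction is probably violated; the message printed to standard error names the first offending path.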

Procedure

NPU Exporter supports two installation modes; select either as required. The component provides only an HTTP service. To use the more secure HTTPS, you must adapt the source code.

Containerized Installation

  1. Log in to each compute node as the root user.
  2. (Optional) Modify the metricConfiguration.json or pluginConfiguration.json file to configure the collection and reporting of the default or custom metric group.
    1. Go to the directory where the NPU Exporter package is decompressed.
    2. Open the metricConfiguration.json file.
      vi metricConfiguration.json
    3. Press i to enter the insert mode and configure the collection and reporting of the default metric group as required.

      Parameter

      Description

      metricsGroup

      Default metric group name.

      • ddr: DDR information
      • hccs: HCCS information
      • npu: NPU information
      • network: network information
      • pcie: PCIe information
      • roce: RoCE information
      • sio: SIO information
      • vnpu: vNPU information
      • version: version information
      • optical: optical module information
      • hbm: on-chip memory information

      state

      Switch for metric group collection and reporting. The default value is ON.

      • ON: enabled. After it is enabled, metrics of a metric group are collected and reported.
      • OFF: disabled. After it is disabled, metrics of a metric group are not collected and reported.
    4. Press Esc and enter :wq! to save the settings and exit.
    5. Modify the pluginConfiguration.json file by referring to 2.b to 2.d and configure the collection and reporting switch of the custom metric group as required.

      Parameter

      Description

      metricsGroup

      Name of the custom metric group registered with NPU Exporter. For details about how to customize metrics, see Custom Metric Development.

      state

      Switch for metric group collection and reporting. The default value is OFF.

      • ON: enabled. After it is enabled, metrics of a metric group are collected and reported.
      • OFF: disabled. After it is disabled, metrics of a metric group are not collected and reported.
    6. If custom metrics are developed using a plugin, rebuild the binary file.
    7. Create and distribute the image again by referring to Preparing an Image, and then go to 4.
  3. Check whether the NPU Exporter image and version are correct.
    • Docker scenario:
      docker images | grep npu-exporter
      Command output:
      npu-exporter                         v7.3.0              20185c45f1bc        About an hour ago         90.1MB
      
    • containerd scenario:
      ctr -n k8s.io images ls | grep npu-exporter

      Command output:

      docker.io/library/npu-exporter:v7.3.0                                                         application/vnd.docker.distribution.manifest.v2+json      sha256:38fd69ee9f5753e73a55a216d039f6ed4ea8a5de15c0e6b3bb503022db470c7b 91.5 MiB  linux/arm64 
    • If the image and version are correct, go to 4.
    • If they are incorrect, create and distribute the image by referring to Preparing an Image.
  4. Copy the YAML file in the directory where the NPU Exporter package is decompressed to any directory on the Kubernetes management node.
  5. Perform the following steps based on the containerized mode in use.
    • containerd scenario: Set containerMode to containerd and modify the following code in bold.

    If the default NPU Exporter startup parameter -containerMode=docker is used, skip this step.

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: npu-exporter
      namespace: npu-exporter
    spec:
      selector:
        matchLabels:
          app: npu-exporter
    ...
        spec:
    ...
          args: [ "umask 027;npu-exporter -port=8082 -ip=0.0.0.0  -updateTime=5
                     -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log -logLevel=0 -containerMode=containerd" ]
    ...
          volumeMounts:
    ...
            - name: docker-shim                                       
              mountPath: /var/run/dockershim.sock
              readOnly: true
            - name: docker                                       # Delete this configuration item only when containerd is used.
              mountPath: /var/run/docker
              readOnly: true
            - name: cri-dockerd                                 
              mountPath: /var/run/cri-dockerd.sock
              readOnly: true
            - name: containerd                             
              mountPath: /run/containerd
              readOnly: true
            - name: isulad                                
              mountPath: /run/isulad.sock
              readOnly: true
    ...
          volumes:
    ...
            - name: docker-shim                             
              hostPath:
                path: /var/run/dockershim.sock
            - name: docker                                # Delete this configuration item only when containerd is used.
              hostPath:
                path: /var/run/docker
            - name: cri-dockerd                           
              hostPath:
                path: /var/run/cri-dockerd.sock
            - name: containerd                            
              hostPath:
                path: /run/containerd
            - name: isulad                               
              hostPath:
                path: /run/isulad.sock
    
    ...
    • Docker scenario: Delete the socket file mounts of the original container runtime, mount the directory that contains the dockershim.sock file instead, and modify the following information in bold.
    If the NPU Exporter startup parameter -containerMode=containerd is used, skip this step.

    This step prevents NPU Exporter from losing data after kubelet is restarted. However, mounting the whole directory exposes more files to the container, such as docker.sock, which increases the risk of container escape.

    ...
            volumeMounts:
              - name: log-npu-exporter
    ...
              - name: sys
                mountPath: /sys
                readOnly: true
              - name: docker-shim                        # Delete the following fields in bold.
                mountPath: /var/run/dockershim.sock
                readOnly: true
              - name: docker 
                mountPath: /var/run/docker
                readOnly: true
              - name: cri-dockerd 
                mountPath: /var/run/cri-dockerd.sock
                readOnly: true
              - name: sock                  # Add the fields in bold.
                mountPath: /var/run        # Use the actual dockershim.sock file directory.
              - name: containerd  
                mountPath: /run/containerd
    ...
          volumes:
            - name: log-npu-exporter
    ...
            - name: sys
              hostPath:
                path: /sys
            - name: docker-shim                    # Delete the following fields in bold.
              hostPath:   
                path: /var/run/dockershim.sock
            - name: docker 
              hostPath:
                path: /var/run/docker
            - name: cri-dockerd 
              hostPath:
                path: /var/run/cri-dockerd.sock
            - name: sock                 # Add the fields in bold.
              hostPath:
                path: /var/run                    # Use the actual dockershim.sock file directory.
            - name: containerd  
              hostPath:
                path: /run/containerd
     ...
  6. If you do not need to modify other startup parameters of the component, skip this step. Otherwise, modify the NPU Exporter startup parameters in the YAML file based on your requirements. For details about the startup parameters, see Table 2. You can also run the ./npu-exporter -h command to view the parameter descriptions.
  7. On the management node, run the following command in the directory where the YAML file is stored to start NPU Exporter.
    • If Atlas 200I SoC A1 core boards are used in a Kubernetes cluster, run the following command:
      kubectl apply -f npu-exporter-310P-1usoc-v{version}.yaml
    • If nodes other than Atlas 200I SoC A1 core boards are used in a Kubernetes cluster, run the following command:
      kubectl apply -f npu-exporter-v{version}.yaml
    Startup example:
    namespace/npu-exporter created
    networkpolicy.networking.k8s.io/exporter-network-policy created
    daemonset.apps/npu-exporter created
    If the error message "Error from server (NotFound): error when creating "npu-exporter-x.x.x.yaml": namespaces "npu-exporter" not found" is displayed during NPU Exporter startup, the npu-exporter namespace was not created. Run the following command to create it manually:
    kubectl create ns npu-exporter
  8. Run the following command on any node to check whether the component is started:
    kubectl get pod -n npu-exporter

    If Running is displayed in the command output, the component is started successfully. If the status is CrashLoopBackOff, the directory permission may be incorrect. Rectify this fault by referring to NPU Exporter Fails to Check the Dynamic Path, and "check uid or mode failed" Is Recorded in the Log.

    NAME                            READY   STATUS    RESTARTS   AGE
    ...
    npu-exporter-hqpxl        1/1    Running   0        11s
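
The check in step 8 can also be scripted. The following minimal sketch (the function name is hypothetical and not part of NPU Exporter) reads the output of kubectl get pod -n npu-exporter from standard input and fails if any npu-exporter pod is not in the Running state. It assumes the default column layout shown above, in which STATUS is the third column.

```shell
#!/bin/sh
# all_exporter_pods_running: read `kubectl get pod` output on stdin and succeed
# only if at least one npu-exporter pod is listed and all of them are Running.
all_exporter_pods_running() {
    awk '
        $1 ~ /^npu-exporter-/ {
            seen++
            if ($3 != "Running") {
                bad++
                print "not running: " $1 " (" $3 ")"
            }
        }
        END { exit (seen > 0 && bad == 0) ? 0 : 1 }
    '
}

# Example usage on a node with kubectl access:
#   kubectl get pod -n npu-exporter | all_exporter_pods_running
```

An empty pod list is treated as a failure, because it usually means that the DaemonSet has not scheduled any pod yet.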
    

Binary-based Installation

When NPU Exporter runs in containerized mode, it requires a privileged container, the root user, and the mounted socket files of docker-shim or containerd. If the container is compromised, there is a risk of container escape. If high security is required, run the component on a physical machine in binary mode.

  • When NPU Exporter is deployed in binary mode, you can use a non-root user (for example, hwMindX) for deployment. Change the owner of the log directory to hwMindX, for example by running chown hwMindX:hwMindX /var/log/mindx-dl/npu-exporter. The command is for reference only.
  • The user hwMindX is used in the following steps.
  1. Log in to a server as the root user.
  2. Upload the NPU Exporter package to any directory (for example, /home/ascend-npu-exporter) on the server and decompress the package.
  3. Copy the metricConfiguration.json and pluginConfiguration.json files in the decompressed NPU Exporter package directory to the /usr/local directory.
  4. (Optional) Modify the metricConfiguration.json or pluginConfiguration.json file to configure the collection and reporting of the default or custom metric group.
    1. Go to the /usr/local directory.
    2. Open the metricConfiguration.json file.
      vi metricConfiguration.json
    3. Press i to enter the insert mode and configure the collection and reporting of the default metric group as required.

      Parameter

      Description

      metricsGroup

      Default metric group name.

      • ddr: DDR information
      • hccs: HCCS information
      • npu: NPU information
      • network: network information
      • pcie: PCIe information
      • roce: RoCE information
      • sio: SIO information
      • vnpu: vNPU information
      • version: version information
      • optical: optical module information
      • hbm: on-chip memory information

      state

      Switch for metric group collection and reporting. The default value is ON.

      • ON: enabled. After it is enabled, metrics of a metric group are collected and reported.
      • OFF: disabled. After it is disabled, metrics of a metric group are not collected and reported.
    4. Press Esc and enter :wq! to save the settings and exit.
    5. Modify the pluginConfiguration.json file by referring to 4.b to 4.d and configure the collection and reporting switch of the custom metric group as required.

      Parameter

      Description

      metricsGroup

      Name of the custom metric group registered with NPU Exporter. For details about how to customize metrics, see Custom Metric Development.

      state

      Switch for metric group collection and reporting. The default value is OFF.

      • ON: enabled. After it is enabled, metrics of a metric group are collected and reported.
      • OFF: disabled. After it is disabled, metrics of a metric group are not collected and reported.
    6. If custom metrics are developed using a plugin, rebuild the binary file.
  5. Create and edit the npu-exporter.service file.
    1. Create the npu-exporter.service file.
      vi /home/ascend-npu-exporter/npu-exporter.service
    2. Write the following information to the npu-exporter.service file.

      [Unit]
      Description=Ascend npu exporter
      Documentation=hiascend.com
      
      [Service]
      ExecStart=/bin/bash -c "/usr/local/bin/npu-exporter -ip=127.0.0.1 -port=8082 -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log>/dev/null  2>&1 &"
      Restart=always
      RestartSec=2
      KillMode=process
      Environment="GOGC=50"
      Environment="GOMAXPROCS=2"
      Environment="GODEBUG=madvdontneed=1"
      Type=forking
      User=hwMindX
      Group=hwMindX
      
      [Install]
      WantedBy=multi-user.target

      In this example, NPU Exporter listens only on 127.0.0.1. To listen on a different IP address, change the -ip startup parameter in the ExecStart field of the npu-exporter.service file.

    3. Press Esc and enter :wq to save the changes and exit.
  6. Create and edit the npu-exporter.timer file. Configuring a timer to start NPU Exporter after a delay can ensure that the NPU is ready when NPU Exporter is started.
    1. Create the npu-exporter.timer file.
       vi /home/ascend-npu-exporter/npu-exporter.timer
    2. Add the following information to the npu-exporter.timer file.
      [Unit]
      Description=Timer for NPU Exporter Service
      
      [Timer]
      # Set the delay for starting NPU Exporter. Adjust the value as required.
      OnBootSec=60s
      Unit=npu-exporter.service
      
      [Install]
      WantedBy=timers.target
    3. Press Esc and enter :wq to save the changes and exit.
  7. If the deployment node is an Atlas 200I SoC A1 core board, run the following commands in sequence to add the hwMindX user to the HwBaseUser and HwDmUser user groups on the node. Skip this step if the Atlas 200I SoC A1 core board is not used.
    usermod -a -G HwBaseUser hwMindX
    usermod -a -G HwDmUser hwMindX
  8. Start the NPU Exporter service.
    cd /home/ascend-npu-exporter
    cp npu-exporter /usr/local/bin
    cp npu-exporter.service /etc/systemd/system
    chattr +i /etc/systemd/system/npu-exporter.service
    cp npu-exporter.timer /etc/systemd/system     
    chattr +i /etc/systemd/system/npu-exporter.timer      
    chmod 500 /usr/local/bin/npu-exporter
    chown hwMindX:hwMindX /usr/local/bin/npu-exporter
    chattr +i /usr/local/bin/npu-exporter
    systemctl enable npu-exporter.timer 
    systemctl start npu-exporter
    systemctl start npu-exporter.timer
    To obtain container metrics, you need to temporarily escalate the NPU Exporter privilege so that it can establish connections with the sockets of CRI and OCI:
    chattr -i /usr/local/bin/npu-exporter
    setcap cap_setuid+ep /usr/local/bin/npu-exporter
    chattr +i /usr/local/bin/npu-exporter
    systemctl restart npu-exporter
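
After step 8, the file mode and owner set by the chmod and chown commands can be verified. The following is a minimal sketch with a hypothetical function name; it assumes GNU stat and is not part of NPU Exporter.

```shell
#!/bin/sh
# check_mode_owner FILE MODE OWNER: succeed if FILE has exactly the octal
# permission MODE and is owned by user OWNER. Assumes GNU stat (-c).
check_mode_owner() (
    actual=$(stat -c '%a %U' -- "$1") || exit 1
    if [ "$actual" != "$2 $3" ]; then
        echo "$1: expected '$2 $3', found '$actual'" >&2
        exit 1
    fi
)

# Example usage with the values from step 8:
#   check_mode_owner /usr/local/bin/npu-exporter 500 hwMindX
#   systemctl is-active npu-exporter.timer
```

The second usage line relies on the standard systemctl is-active command to confirm that the timer unit is active.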

Parameters

Table 2 NPU Exporter startup parameters

Parameter

Type

Default Value

Description

-port

Integer

8082

Listening port. The value ranges from 1025 to 40000.

-updateTime

Integer

5

Information update period. The value ranges from 1 to 60, in seconds. If this parameter is set to a large value, some containers whose lifetime is shorter than the update period may fail to be reported.

-ip

String

None

Listening IP address. This parameter has no default value and must be set.

You are advised not to set this parameter to 0.0.0.0 on a host with multiple NICs.

-version

Bool

false

Whether to query the NPU Exporter version number.

  • true: queries the version.
  • false: does not query the version.

-concurrency

Integer

5

Traffic limit of the HTTP service. The value ranges from 1 to 512 and defaults to 5.

-logLevel

Integer

0

Log level:

  • -1: debug
  • 0: info
  • 1: warning
  • 2: error
  • 3: critical

-maxAge

Integer

7

Number of days for which backed-up logs are retained. The value ranges from 7 to 700.

-logFile

String

/var/log/mindx-dl/npu-exporter/npu-exporter.log

Log file.

NOTE:

If a log file exceeds 20 MB, it is automatically dumped. The maximum size of a log file cannot be changed. Dumped files are named npu-exporter-<dump trigger time>.log, for example, npu-exporter-2023-10-07T03-38-24.402.log.

-maxBackups

Integer

30

Maximum number of dumped log files that can be retained. The value ranges from 1 to 30.

-containerMode

String

docker

Container runtime type.

  • If this parameter is set to docker, Docker is used as the container runtime in the current environment.
  • If this parameter is set to containerd, containerd is used as the container runtime in the current environment.
  • If this parameter is set to isula, iSula is used as the container runtime in the current environment.

-containerd

String

  • Docker

    unix:///run/docker/containerd/docker-containerd.sock

  • containerd

    unix:///run/containerd/containerd.sock

  • iSula

    unix:///run/isulad.sock

Endpoint of the containerd daemon process, which is used to communicate with containerd.

  • If containerMode is set to docker, the default value of this parameter is /run/docker/containerd/docker-containerd.sock. If the connection fails, the system automatically attempts to connect to unix:///run/containerd/containerd.sock and unix:///run/docker/containerd/containerd.sock.
  • If containerMode is set to containerd, the default value of this parameter is /run/containerd/containerd.sock.
  • If containerMode is set to isula, the default value is /run/isulad.sock.

Retain the default configuration, unless you change the path of the sock file of containerd.

You can run the ps aux | grep containerd command to check whether the sock file path of containerd is changed.

-endpoint

String

  • Docker

    unix:///var/run/dockershim.sock

  • containerd

    unix:///run/containerd/containerd.sock

  • iSula

    unix:///run/isulad.sock

Sock address of the CRI server.

  • If containerMode is set to docker, dockershim is connected to obtain the container list. The default value is /var/run/dockershim.sock.
  • If containerMode is set to containerd, the default value is /run/containerd/containerd.sock.
  • If containerMode is set to isula, the default value is /run/isulad.sock.

Generally, retain the default value unless you have changed the sock file path of dockershim or containerd.

If the connection fails, the system automatically attempts to connect to unix:///run/cri-dockerd.sock.

-limitIPConn

Integer

5

Maximum number of TCP connections for each IP address. The value ranges from 1 to 128.

-limitTotalConn

Integer

20

Maximum total number of TCP connections for the program. The value ranges from 1 to 512.

-limitIPReq

String

20/1

Request rate limit for each IP address. The value 20/1 indicates that a maximum of 20 requests are allowed per second. A maximum of three digits are supported on each side of the slash (/).

-cacheSize

Integer

102400

Maximum number of cache keys. The value ranges from 1 to 1024000.

-h or -help

None

None

Help information.

-platform

String

Prometheus

Interconnection platform.

  • Prometheus
  • Telegraf

-poll_interval

Duration (integer)

1

Interval for reporting Telegraf data, in seconds. This parameter takes effect only when -platform is set to Telegraf.

-profilingTime

Integer

200

PCIe bandwidth collection time. The value ranges from 1 to 2000, in milliseconds.

-hccsBWProfilingTime

Integer

200

Duration for sampling the HCCS link bandwidth. The value ranges from 1 to 1000, in milliseconds.

-deviceResetTimeout

Integer

60

Maximum wait time for the driver to report complete processor information if the detected processor count is insufficient at component startup. The value ranges from 10 to 600, in seconds.

  • For the Atlas A2 training product, Atlas 800I A2 inference server, and A200I A2 Box heterogeneous component, the recommended value is 150 seconds.
  • For the Atlas A3 training product, A200T A3 Box8 SuperPoD Server, and Atlas 800I A3 SuperPoD Server, the recommended value is 360 seconds.

-textMetricsFilePath

String

None

Path of the custom metric file. For details about the restrictions, see Restrictions.
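
The value ranges in Table 2 can be checked before the startup parameters are written to the YAML file or the service file. The following is a minimal sketch with hypothetical helper names; only the numeric ranges and the -limitIPReq format are taken from Table 2, and NPU Exporter itself may validate its parameters differently.

```shell
#!/bin/sh
# in_range VALUE MIN MAX: succeed if VALUE is a non-negative integer in [MIN, MAX].
in_range() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# valid_limit_ip_req VALUE: succeed if VALUE looks like "20/1", with one to
# three digits on each side of the slash.
valid_limit_ip_req() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]{1,3}/[0-9]{1,3}$'
}

# Ranges taken from Table 2.
valid_port()        { in_range "$1" 1025 40000; }
valid_update_time() { in_range "$1" 1 60; }
valid_concurrency() { in_range "$1" 1 512; }
valid_max_age()     { in_range "$1" 7 700; }

# Example usage:
#   valid_port 8082 && valid_update_time 5 && valid_limit_ip_req 20/1
```

Keeping each range in a one-line wrapper makes it straightforward to add the remaining parameters, such as -maxBackups (1 to 30) or -cacheSize (1 to 1024000), in the same way.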