NPU Exporter

To use resource monitoring, you must install NPU Exporter, which can be interconnected with Prometheus or Telegraf.
- When interconnecting with Prometheus, NPU Exporter supports containerized deployment and binary deployment. For details about the differences, see Differences Between Container and Binary Deployment.
- When interconnecting with Telegraf, install NPU Exporter and Telegraf by referring to Working with Telegraf.
If resource monitoring is not used, you do not need to install NPU Exporter. In this case, skip this section.

Restrictions

Before installing NPU Exporter, you need to understand related restrictions. For details, see Table 1.

**Table 1** Restrictions
Scenario	Restrictions
NPU driver	NPU Exporter periodically calls the related APIs of the NPU driver to detect the NPU status. To upgrade the driver, stop service tasks and then stop container services of NPU Exporter. NOTE: To ensure that NPU Exporter can be installed as a non-root user (for example, hwMindX) when its binary package is deployed, use the --install-for-all parameter during driver installation. Example: ./Ascend-hdk-<chip_type>-npu-driver_<version>_linux-<arch>.run --full --install-for-all
Kubernetes version	Before using NPU Exporter, confirm the Kubernetes version in the environment. If the Kubernetes version is 1.24.x or later, install cri-dockerd.
DCMI dynamic library	The permission requirements for the DCMI dynamic library directories are as follows: The DCMI dynamic library invoked by NPU Exporter and all of its parent directories must be owned by root; otherwise, the program cannot run. In addition, group and other users must not have write permission on these files and directories.
	The length of the DCMI dynamic library path must be less than 20.
	If the dynamic library path is set through LD_LIBRARY_PATH, the total length of LD_LIBRARY_PATH cannot exceed 1024.
Atlas 200I SoC A1 core board	To use NPU Exporter on an Atlas 200I SoC A1 core board, ensure that the NPU driver version of the Atlas 200I SoC A1 core board is 23.0.RC2 or later.
Atlas 200I SoC A1 core board	To deploy NPU Exporter on an Atlas 200I SoC A1 core board in containerized mode, you need to configure the multi-container sharing mode.
VM	To deploy NPU Exporter on VMs, you need to install systemd in NPU Exporter's image. You are advised to add the RUN apt-get update && apt-get install -y systemd command to Dockerfile to install systemd.
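
The DCMI library restrictions in Table 1 can be pre-checked before installation. The following is a minimal sketch, not part of NPU Exporter: the function names are hypothetical, the length limit is interpreted as a character count, and the library path in the usage comment is an assumption that you should replace with the actual path in your environment.

```shell
#!/bin/sh
# Sketch: pre-check the DCMI library restrictions listed in Table 1.

# Succeeds if the given value is at most 1024 characters long.
ld_path_len_ok() {
    [ "${#1}" -le 1024 ]
}

# Succeeds if the path is owned by root (uid 0).
owned_by_root() {
    [ "$(stat -c '%u' "$1")" = "0" ]
}

# Succeeds if neither group nor other has write permission on the path.
no_group_other_write() {
    mode=$(stat -c '%a' "$1") || return 2
    case "$mode" in
        # A write bit is set when the group or other digit is 2, 3, 6, or 7.
        *[2367]?|*[2367]) return 1 ;;
    esac
    return 0
}

# Checks a library file and each of its parent directories.
check_dcmi_path() {
    p="$1"
    while :; do
        if ! owned_by_root "$p" || ! no_group_other_write "$p"; then
            echo "owner or mode check failed: $p" >&2
            return 1
        fi
        case "$p" in /|.) return 0 ;; esac
        p=$(dirname "$p")
    done
}

# Usage (the library path is an assumption):
#   ld_path_len_ok "$LD_LIBRARY_PATH" || echo "LD_LIBRARY_PATH is too long"
#   check_dcmi_path /usr/local/dcmi/libdcmi.so
```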

Procedure

NPU Exporter supports two installation modes. Select one of them as required. The component provides only an HTTP service. To use the more secure HTTPS service, you must adapt the source code.

- (Recommended) Containerized installation. For details, see Containerized Installation.
- Binary-based installation on a physical machine (for high security requirements). For details, see Binary-based Installation.

Containerized Installation

(Optional) Modify the metricConfiguration.json or pluginConfiguration.json file to configure the collection and reporting of the default or custom metric group.

Go to the directory where the NPU Exporter package is decompressed.
Open the metricConfiguration.json file.
```
vi metricConfiguration.json
```

Press i to enter the insert mode and configure the collection and reporting of the default metric group as required.

Parameter	Description
metricsGroup	Default metric group name. ddr: DDR information hccs: HCCS information npu: NPU information network: network information pcie: PCIe information roce: RoCE information sio: SIO information vnpu: vNPU information version: version information optical: optical module information hbm: on-chip memory information
state	Switch for metric group collection and reporting. The default value is ON. ON: enabled. After it is enabled, metrics of a metric group are collected and reported. OFF: disabled. After it is disabled, metrics of a metric group are not collected and reported.

Press Esc and enter :wq! to save the settings and exit.
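
As an alternative to editing the file in vi, the switch of one metric group can be changed with a small script. The following is a sketch only. It assumes that metricConfiguration.json is a JSON list in which each entry contains a "metricsGroup" key followed by a "state" key, with at most one entry per line. This layout is an assumption that the section does not confirm, so check your file before using it. The function name set_group_state is hypothetical.

```shell
#!/bin/sh
# Sketch: set the state of one metric group to ON or OFF.
# ASSUMPTION: each entry holds "metricsGroup" before "state",
# and no line contains more than one entry.
set_group_state() {
    file="$1"; group="$2"; state="$3"
    tmp=$(mktemp) || return 1
    awk -v g="$group" -v s="$state" '
        # Remember whether the latest metricsGroup key names the target group.
        /"metricsGroup"/ { hit = (index($0, "\"" g "\"") > 0) }
        # Rewrite the first state value that follows the matching group name.
        hit && /"state"/ {
            sub(/"state"[ \t]*:[ \t]*"[A-Za-z]*"/, "\"state\": \"" s "\"")
            hit = 0
        }
        { print }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Usage: disable the collection and reporting of the ddr group.
#   set_group_state metricConfiguration.json ddr OFF
```

Back up the file before running the script, and review the result to confirm that only the intended entry changed.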

Modify the pluginConfiguration.json file by referring to 2.b to 2.d and configure the collection and reporting switch of the custom metric group as required.

Parameter	Description
metricsGroup	Name of the custom metric group registered with NPU Exporter. For details about how to customize metrics, see Custom Metric Development.
state	Switch for metric group collection and reporting. The default value is OFF. ON: enabled. After it is enabled, metrics of a metric group are collected and reported. OFF: disabled. After it is disabled, metrics of a metric group are not collected and reported.

If custom metrics are developed using a plugin, rebuild the binary file.
Create and distribute the image again by referring to Preparing an Image, and then go to 4.

Check whether the NPU Exporter image and version are correct.

Docker scenario:

```
docker images | grep npu-exporter
```

Command output:

```
npu-exporter                         v7.3.0              20185c45f1bc        About an hour ago         90.1MB
```

containerd scenario:

```
ctr -n k8s.io images ls | grep npu-exporter
```

Command output:

```
docker.io/library/npu-exporter:v7.3.0                                                         application/vnd.docker.distribution.manifest.v2+json      sha256:38fd69ee9f5753e73a55a216d039f6ed4ea8a5de15c0e6b3bb503022db470c7b 91.5 MiB  linux/arm64
```

- If the image and version are correct, go to 4.
- If they are not correct, create the image and distribute it by referring to Preparing an Image.
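
In the Docker scenario, the image check can also be scripted. The following sketch reads the output of docker images from standard input and succeeds only if the expected repository and tag appear together in one row. The function name has_image_tag is hypothetical, and the version v7.3.0 is taken from the example output above.

```shell
#!/bin/sh
# Sketch: succeed if `docker images` output (stdin) lists REPO with TAG.
# $1: repository name, $2: expected tag.
has_image_tag() {
    awk -v repo="$1" -v tag="$2" '
        $1 == repo && $2 == tag { found = 1 }
        END { if (found) exit 0; exit 1 }
    '
}

# Usage:
#   docker images | has_image_tag npu-exporter v7.3.0 && echo "image OK"
```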

Copy the YAML file in the directory where the NPU Exporter package is decompressed to any directory on the Kubernetes management node.

Perform the following steps based on the containerized mode in use.

containerd scenario: Set containerMode to containerd and modify the following code in bold.

If the default NPU Exporter startup parameter -containerMode=docker is used, skip this step.

```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: npu-exporter
  namespace: npu-exporter
spec:
  selector:
    matchLabels:
      app: npu-exporter
...
    spec:
...
      args: [ "umask 027;npu-exporter -port=8082 -ip=0.0.0.0  -updateTime=5
                 -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log -logLevel=0 -containerMode=containerd" ]
...
      volumeMounts:
...
        - name: docker-shim
          mountPath: /var/run/dockershim.sock
          readOnly: true
        - name: docker                                       # Delete this configuration item only when containerd is used.
          mountPath: /var/run/docker
          readOnly: true
        - name: cri-dockerd
          mountPath: /var/run/cri-dockerd.sock
          readOnly: true
        - name: containerd
          mountPath: /run/containerd
          readOnly: true
        - name: isulad
          mountPath: /run/isulad.sock
          readOnly: true
...
      volumes:
...
        - name: docker-shim
          hostPath:
            path: /var/run/dockershim.sock
        - name: docker                                # Delete this configuration item only when containerd is used.
          hostPath:
            path: /var/run/docker
        - name: cri-dockerd
          hostPath:
            path: /var/run/cri-dockerd.sock
        - name: containerd
          hostPath:
            path: /run/containerd
        - name: isulad
          hostPath:
            path: /run/isulad.sock
...
```

Docker scenario: Delete the mount file of the original container runtime, add the mount directory of the dockershim.sock file, and modify the following information in bold.

If the NPU Exporter startup parameter -containerMode=containerd is used, skip this step.

Mounting the directory instead of the socket file prevents NPU Exporter from losing data after kubelet is restarted. However, more files in that directory, such as docker.sock, become visible in the container, which increases the risk of container escape.

```
...
        volumeMounts:
          - name: log-npu-exporter
...
          - name: sys
            mountPath: /sys
            readOnly: true
          - name: docker-shim                        # Delete the following fields in bold.
            mountPath: /var/run/dockershim.sock
            readOnly: true
          - name: docker
            mountPath: /var/run/docker
            readOnly: true
          - name: cri-dockerd
            mountPath: /var/run/cri-dockerd.sock
            readOnly: true
          - name: sock                  # Add the fields in bold.
            mountPath: /var/run        # Use the actual dockershim.sock file directory.
          - name: containerd
            mountPath: /run/containerd
...
      volumes:
        - name: log-npu-exporter
...
        - name: sys
          hostPath:
            path: /sys
        - name: docker-shim                    # Delete the following fields in bold.
          hostPath:
            path: /var/run/dockershim.sock
        - name: docker
          hostPath:
            path: /var/run/docker
        - name: cri-dockerd
          hostPath:
            path: /var/run/cri-dockerd.sock
        - name: sock                 # Add the fields in bold.
          hostPath:
            path: /var/run                    # Use the actual dockershim.sock file directory.
        - name: containerd
          hostPath:
            path: /run/containerd
...
```

If you do not need to modify other startup parameters of the component, skip this step. Otherwise, modify the NPU Exporter startup parameters in the YAML file based on your requirements. For details about the startup parameters, see Table 2. You can also run the ./npu-exporter -h command to view the parameter descriptions.
Run the following command in the directory where the YAML file of the management node is stored to start NPU Exporter.
- If Atlas 200I SoC A1 core boards are used in a Kubernetes cluster, run the following command:
```
kubectl apply -f npu-exporter-310P-1usoc-v{version}.yaml
```
- If nodes except Atlas 200I SoC A1 core boards are used in a Kubernetes cluster, run the following command:
```
kubectl apply -f npu-exporter-v{version}.yaml
```
Startup example:
```
namespace/npu-exporter created
networkpolicy.networking.K8s.io/exporter-network-policy created
daemonset.apps/npu-exporter created
```
If the following error message is displayed during NPU Exporter startup, the namespace of NPU Exporter failed to be created: Error from server (NotFound): error when creating "npu-exporter-x.x.x.yaml": namespaces "npu-exporter" not found. Run the following command to manually create the namespace:
```
kubectl create ns npu-exporter
```
Run the following command on any node to check whether the component is started:
```
kubectl get pod -n npu-exporter
```
If Running is displayed in the command output, the component is started successfully. If the status is CrashLoopBackOff, the directory permission may be incorrect. Rectify this fault by referring to NPU Exporter Fails to Check the Dynamic Path, and "check uid or mode failed" Is Recorded in the Log.
```
NAME                 READY   STATUS    RESTARTS   AGE
...
npu-exporter-hqpxl   1/1     Running   0          11s
```
- NPU Exporter has requirements on the process environment. If it is running as a container, ensure that the /sys directory and the socket file for communication are mounted to the NPU Exporter container. If the NPU container information is not obtained by calling the Metrics interface of NPU Exporter, the possible cause is that the socket file path is incorrect. Rectify this fault by referring to "connecting to container runtime failed" Is Displayed in Logs.
- If the pod status of the installed component is not Running, refer to Component pods Are Not in the Running State.
- If the pod status of the installed component is ContainerCreating, refer to Cluster Scheduling Component Pods Are in the ContainerCreating State.
- If the component fails to be started, refer to Cluster Scheduling Components Fail to Start and "get sem errno =13" Is Displayed in Logs.
- If the component is started successfully, but the corresponding pod cannot be found, refer to YAML File for Starting a Component Is Successfully Executed, But the pod Corresponding to the Component Is Not Displayed.
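
The status check above can be automated, for example in a deployment script. The following sketch reads the output of kubectl get pod from standard input and succeeds only if at least one pod is listed and every listed pod is in the Running state. The function name all_pods_running is hypothetical.

```shell
#!/bin/sh
# Sketch: succeed if every pod in `kubectl get pod` output (stdin) is Running.
all_pods_running() {
    awk '
        $1 == "NAME" { next }            # skip the header row
        NF == 0 { next }                 # skip blank lines
        { seen = 1 }
        $3 != "Running" { bad = 1; print "not running: " $1 " (" $3 ")" }
        END { if (seen && !bad) exit 0; exit 1 }
    '
}

# Usage:
#   kubectl get pod -n npu-exporter | all_pods_running && echo "started"
```

The function also fails when no pod is listed, which covers the case where the YAML file is applied successfully but no pod is created.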

Binary-based Installation

When NPU Exporter runs in containerized mode, it requires a privileged container, the root user, and a mounted docker-shim or containerd socket file. If the container is used maliciously, there is a risk of container escape. If high security is required, run the component on a physical machine in binary mode.

When NPU Exporter is deployed in binary mode, you can deploy it as a non-root user (for example, hwMindX). In this case, change the owner of the log directory to that user, for example, by running chown hwMindX:hwMindX /var/log/mindx-dl/npu-exporter. The command is for reference only.
The following steps use the hwMindX user as an example.

Log in to a server as the root user.
Upload the NPU Exporter package to any directory (for example, /home/ascend-npu-exporter) on the server and decompress the package.
Copy the metricConfiguration.json and pluginConfiguration.json files in the decompressed NPU Exporter package directory to the /usr/local directory.

(Optional) Modify the metricConfiguration.json or pluginConfiguration.json file to configure the collection and reporting of the default or custom metric group.

Go to the /usr/local directory.
Open the metricConfiguration.json file.
```
vi metricConfiguration.json
```

Press i to enter the insert mode and configure the collection and reporting of the default metric group as required.

Parameter	Description
metricsGroup	Default metric group name. ddr: DDR information hccs: HCCS information npu: NPU information network: network information pcie: PCIe information roce: RoCE information sio: SIO information vnpu: vNPU information version: version information optical: optical module information hbm: on-chip memory information
state	Switch for metric group collection and reporting. The default value is ON. ON: enabled. After it is enabled, metrics of a metric group are collected and reported. OFF: disabled. After it is disabled, metrics of a metric group are not collected and reported.

Press Esc and enter :wq! to save the settings and exit.

Modify the pluginConfiguration.json file by referring to 4.b to 4.d and configure the collection and reporting switch of the custom metric group as required.

Parameter	Description
metricsGroup	Name of the custom metric group registered with NPU Exporter. For details about how to customize metrics, see Custom Metric Development.
state	Switch for metric group collection and reporting. The default value is OFF. ON: enabled. After it is enabled, metrics of a metric group are collected and reported. OFF: disabled. After it is disabled, metrics of a metric group are not collected and reported.

If custom metrics are developed using a plugin, rebuild the binary file.

Create and edit the npu-exporter.service file.

1. Create the npu-exporter.service file.
```
vi /home/ascend-npu-exporter/npu-exporter.service
```
2. Write the following information to the npu-exporter.service file.
```
[Unit]
Description=Ascend npu exporter
Documentation=hiascend.com

[Service]
ExecStart=/bin/bash -c "/usr/local/bin/npu-exporter -ip=127.0.0.1 -port=8082 -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log>/dev/null  2>&1 &"
Restart=always
RestartSec=2
KillMode=process
Environment="GOGC=50"
Environment="GOMAXPROCS=2"
Environment="GODEBUG=madvdontneed=1"
Type=forking
User=hwMindX
Group=hwMindX

[Install]
WantedBy=multi-user.target
```
With this configuration, NPU Exporter listens only on 127.0.0.1. To change the listening IP address, modify the -ip startup parameter in the ExecStart field of the npu-exporter.service file.
3. Press Esc and enter :wq to save the changes and exit.

Create and edit the npu-exporter.timer file. Configuring a timer to start NPU Exporter after a delay can ensure that the NPU is ready when NPU Exporter is started.
1. Create the npu-exporter.timer file.
```
 vi /home/ascend-npu-exporter/npu-exporter.timer
```
2. Add the following information to the npu-exporter.timer file.
```
[Unit]
Description=Timer for NPU Exporter Service

[Timer]
# Set the delay for starting NPU Exporter. Adjust the value as required.
OnBootSec=60s
Unit=npu-exporter.service

[Install]
WantedBy=timers.target
```
3. Press Esc and enter :wq to save the changes and exit.
If the deployment node is Atlas 200I SoC A1 core board, run the following commands in sequence to add the hwMindX user to the HwBaseUser and HwDmUser user groups on the node. Skip this step if the Atlas 200I SoC A1 core board is not used.
```
usermod -a -G HwBaseUser hwMindX
usermod -a -G HwDmUser hwMindX
```

Start the NPU Exporter service.

```
cd /home/ascend-npu-exporter
cp npu-exporter /usr/local/bin
cp npu-exporter.service /etc/systemd/system
chattr +i /etc/systemd/system/npu-exporter.service
cp npu-exporter.timer /etc/systemd/system
chattr +i /etc/systemd/system/npu-exporter.timer
chmod 500 /usr/local/bin/npu-exporter
chown hwMindX:hwMindX /usr/local/bin/npu-exporter
chattr +i /usr/local/bin/npu-exporter
systemctl enable npu-exporter.timer
systemctl start npu-exporter
systemctl start npu-exporter.timer
```

To obtain container metrics, you need to temporarily escalate the NPU Exporter privilege so that it can establish connections with the sockets of CRI and OCI:

```
chattr -i /usr/local/bin/npu-exporter
setcap cap_setuid+ep /usr/local/bin/npu-exporter
chattr +i /usr/local/bin/npu-exporter
systemctl restart npu-exporter
```
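
After the service starts, one way to confirm that it serves data is to query the Metrics interface and count the returned samples. The following is a sketch under stated assumptions: the address 127.0.0.1:8082 is taken from the npu-exporter.service example above, the /metrics path follows the Prometheus convention and is not stated in this section, and curl is assumed to be installed. The function name count_samples is hypothetical.

```shell
#!/bin/sh
# Sketch: count metric samples in Prometheus text format read from stdin.
# Comment lines (starting with #) and blank lines are not samples.
count_samples() {
    grep -v -e '^#' -e '^[[:space:]]*$' | wc -l | tr -d ' '
}

# Usage (assumed address and path; adjust to your -ip and -port values):
#   curl -s http://127.0.0.1:8082/metrics | count_samples
```

A result of 0 suggests that the service is unreachable or returned no metrics. In that case, check the log file specified by -logFile.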

Parameters

**Table 2** NPU Exporter startup parameters
Parameter	Type	Default Value	Description
-port	Integer	8082	Listening port. The value ranges from 1025 to 40000.
-updateTime	Integer	5	Information update period. The value ranges from 1 to 60, in seconds. If this parameter is set to a large value, some containers whose lifetime is shorter than the update period may fail to be reported.
-ip	String	None	Listening IP address. This parameter has no default value and must be set. You are advised not to set it to 0.0.0.0 on a host with multiple NICs.
-version	Bool	false	Whether to query the NPU Exporter version number. true: queries the version. false: does not query the version.
-concurrency	Integer	5	Traffic limit of the HTTP service. The value ranges from 1 to 512.
-logLevel	Integer	0	Log level: -1: debug 0: info 1: warning 2: error 3: critical
-maxAge	Integer	7	Time for backing up logs. The value ranges from 7 to 700, in days.
-logFile	String	/var/log/mindx-dl/npu-exporter/npu-exporter.log	Log file. NOTE: If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed. Dumped files are named in the format of "npu-exporter-dump triggering time.log", for example, npu-exporter-2023-10-07T03-38-24.402.log.
-maxBackups	Integer	30	Maximum number of dumped log files that can be retained. The value ranges from 1 to 30.
-containerMode	String	docker	Container runtime type. If this parameter is set to docker, Docker is used as the container runtime in the current environment. If this parameter is set to containerd, containerd is used as the container runtime in the current environment. If this parameter is set to isula, iSula is used as the container runtime in the current environment.
-containerd	String	Docker unix: /run/docker/containerd/docker-containerd.sock containerd unix: ///run/containerd/containerd.sock iSula unix: ///run/isulad.sock	Endpoint of the containerd daemon process, which is used to communicate with containerd. If containerMode is set to docker, the default value of this parameter is /run/docker/containerd/docker-containerd.sock. If the connection fails, the system automatically attempts to connect to unix:///run/containerd/containerd.sock and unix:///run/docker/containerd/containerd.sock. If containerMode is set to containerd, the default value of this parameter is /run/containerd/containerd.sock. If containerMode is set to isula, the default value is /run/isulad.sock. Retain the default configuration, unless you change the path of the sock file of containerd. You can run the ps aux | grep containerd command to check whether the sock file path of containerd is changed.
-endpoint	String	Docker unix: ///var/run/dockershim.sock containerd unix: ///run/containerd/containerd.sock iSula unix: ///run/isulad.sock	Sock address of the CRI server. If containerMode is set to docker, dockershim is connected to obtain the container list. The default value is /var/run/dockershim.sock. If containerMode is set to containerd, the default value is /run/containerd/containerd.sock. If containerMode is set to isula, the default value is /run/isulad.sock. Generally, retain the default value unless you have changed the sock file path of dockershim or containerd. If the connection fails, the system automatically attempts to connect to unix:///run/cri-dockerd.sock.
-limitIPConn	Integer	5	Number of TCP connections for each IP address. The value ranges from 1 to 128.
-limitTotalConn	Integer	20	Maximum total number of TCP connections of the program. The value ranges from 1 to 512.
-limitIPReq	String	20/1	Number of requests from each IP address. The value 20/1 indicates that a maximum of 20 requests are allowed per second. A maximum of three digits are supported on both sides of the slash (/).
-cacheSize	Integer	102400	Maximum number of cache keys. The value ranges from 1 to 1024000.
-h or -help	None	None	Help information.
-platform	String	Prometheus	Interconnection platform. The value can be Prometheus or Telegraf.
-poll_interval	Duration (integer)	1	Interval for reporting Telegraf data, in seconds. This parameter takes effect only when -platform is set to Telegraf.
-profilingTime	Integer	200	PCIe bandwidth collection time. The value ranges from 1 to 2000, in milliseconds.
-hccsBWProfilingTime	Integer	200	Duration for sampling the HCCS link bandwidth. The value ranges from 1 to 1000, in milliseconds.
-deviceResetTimeout	Integer	60	Maximum wait time for the driver to report complete processor information if the detected processor count is insufficient at component startup. The value ranges from 10 to 600, in seconds. For the Atlas A2 training product, Atlas 800I A2 inference server, and A200I A2 Box heterogeneous component, the recommended value is 150 seconds. For the Atlas A3 training product, A200T A3 Box8 SuperPoD Server, and Atlas 800I A3 SuperPoD Server, the recommended value is 360 seconds.
-textMetricsFilePath	String	None	Path of the custom metric file. For details about the restrictions, see Restrictions.
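
The value ranges in Table 2 can be checked before the startup parameters are written to the YAML file or the npu-exporter.service file. The following sketch covers the integer parameters with a documented range and the -limitIPReq format. The function names are hypothetical and the script is not part of NPU Exporter. To see the parameters that your version accepts, run ./npu-exporter -h.

```shell
#!/bin/sh
# Sketch: validate startup parameter values against the ranges in Table 2.

# in_range VALUE MIN MAX: succeeds if VALUE is an integer within [MIN, MAX].
in_range() {
    case "$1" in ''|*[!0-9]*) return 1 ;; esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Succeeds if the value has one to three digits on each side of the slash.
valid_limit_ip_req() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]{1,3}/[0-9]{1,3}$'
}

# check_param NAME VALUE: returns 0 if valid, 1 if invalid, 2 if unknown.
check_param() {
    case "$1" in
        -port)                in_range "$2" 1025 40000 ;;
        -updateTime)          in_range "$2" 1 60 ;;
        -concurrency)         in_range "$2" 1 512 ;;
        -maxAge)              in_range "$2" 7 700 ;;
        -maxBackups)          in_range "$2" 1 30 ;;
        -limitIPConn)         in_range "$2" 1 128 ;;
        -limitTotalConn)      in_range "$2" 1 512 ;;
        -cacheSize)           in_range "$2" 1 1024000 ;;
        -profilingTime)       in_range "$2" 1 2000 ;;
        -hccsBWProfilingTime) in_range "$2" 1 1000 ;;
        -deviceResetTimeout)  in_range "$2" 10 600 ;;
        -limitIPReq)          valid_limit_ip_req "$2" ;;
        *) echo "no range rule for $1" >&2; return 2 ;;
    esac
}

# Usage:
#   check_param -port 8082 || echo "invalid -port value"
```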
