NPU Exporter
- To use resource monitoring, you must install NPU Exporter, which can be interconnected with Prometheus or Telegraf.
- When interconnecting with Prometheus, NPU Exporter supports containerized deployment and binary deployment. For details about the differences, see Differences Between Container and Binary Deployment.
- When interconnecting with Telegraf, install NPU Exporter and Telegraf by referring to Working with Telegraf.
- If resource monitoring is not used, you do not need to install NPU Exporter. In this case, skip this section.
Restrictions
Before installing NPU Exporter, you need to understand related restrictions. For details, see Table 1.
Procedure
NPU Exporter supports two installation modes. You can select either of the following modes as required. This component provides only the HTTP service. To use the more secure HTTPS service, modify the source code for adaptation.
- (Recommended) Containerized installation. For details, see Containerized Installation.
- Binary-based installation on a physical machine (for high security requirements). For details, see Binary-based Installation.
Containerized Installation
- Log in to each compute node as the root user.
- (Optional) Modify the metricConfiguration.json or pluginConfiguration.json file to configure the collection and reporting of the default or custom metric group.
- Go to the directory where the NPU Exporter package is decompressed.
- Open the metricConfiguration.json file.
vi metricConfiguration.json
Press i to enter the insert mode and configure the collection and reporting of the default metric group as required.
| Parameter | Description |
|---|---|
| metricsGroup | Default metric group name: ddr (DDR information), hccs (HCCS information), npu (NPU information), network (network information), pcie (PCIe information), roce (RoCE information), sio (SIO information), vnpu (vNPU information), version (version information), optical (optical module information), hbm (on-chip memory information). |
| state | Switch for metric group collection and reporting. The default value is ON. ON: enabled; metrics of the metric group are collected and reported. OFF: disabled; metrics of the metric group are not collected or reported. |
- Press Esc and enter :wq! to save the settings and exit.
- Modify the pluginConfiguration.json file by referring to 2.b to 2.d and configure the collection and reporting switch of the custom metric group as required.
| Parameter | Description |
|---|---|
| metricsGroup | Name of the custom metric group registered with NPU Exporter. For details about how to customize metrics, see Custom Metric Development. |
| state | Switch for metric group collection and reporting. The default value is OFF. ON: enabled; metrics of the metric group are collected and reported. OFF: disabled; metrics of the metric group are not collected or reported. |
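The state switch can also be changed with a script instead of vi. The following is a minimal sketch only: it assumes, purely for illustration, that the configuration file is a JSON array of objects with metricsGroup and state keys. The actual file layout is not shown in this section, so check your copy before adapting it. The function name set_group_state and the sample content are hypothetical.

```python
import json


def set_group_state(config_text, group, state):
    """Return config_text with the given metric group's state switched.

    Assumption (not confirmed by this document): the file is a JSON array
    of objects shaped like {"metricsGroup": "hccs", "state": "ON"}.
    """
    if state not in ("ON", "OFF"):
        raise ValueError("state must be ON or OFF")
    entries = json.loads(config_text)
    matched = False
    for entry in entries:
        if entry.get("metricsGroup") == group:
            entry["state"] = state
            matched = True
    if not matched:
        raise KeyError(f"metric group not found: {group}")
    return json.dumps(entries, indent=2)


# Hypothetical sample content, used only to demonstrate the call.
sample = '[{"metricsGroup": "hccs", "state": "ON"}, {"metricsGroup": "npu", "state": "ON"}]'
updated = set_group_state(sample, "hccs", "OFF")
```

Raising an error for an unknown group name, rather than silently adding a new entry, keeps a typo from going unnoticed.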
- If custom metrics are developed using a plugin, rebuild the binary file.
- Create and distribute the image again by referring to Preparing an Image, and then go to 4.
- Check whether the NPU Exporter image and version are correct.
- Docker scenario:
docker images | grep npu-exporter
Command output:
npu-exporter v7.3.0 20185c45f1bc About an hour ago 90.1MB
- containerd scenario:
ctr -n k8s.io images ls | grep npu-exporter
Command output:
docker.io/library/npu-exporter:v7.3.0 application/vnd.docker.distribution.manifest.v2+json sha256:38fd69ee9f5753e73a55a216d039f6ed4ea8a5de15c0e6b3bb503022db470c7b 91.5 MiB linux/arm64
- If correct, go to 4.
- If not correct, create the image and distribute it by referring to Preparing an Image.
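When many nodes need checking, the image check can be scripted. The sketch below parses text in the format printed by docker images and collects the tags found for a repository. The helper name find_image_tags is hypothetical, and the sample line is the one shown in the Docker scenario above.

```python
def find_image_tags(listing, repository="npu-exporter"):
    """Collect the tags listed for `repository` in `docker images` style output."""
    tags = []
    for line in listing.splitlines():
        fields = line.split()
        # `docker images` prints REPOSITORY and TAG as the first two columns.
        if len(fields) >= 2 and fields[0] == repository:
            tags.append(fields[1])
    return tags


listing = "npu-exporter v7.3.0 20185c45f1bc About an hour ago 90.1MB"
tags = find_image_tags(listing)
version_ok = "v7.3.0" in tags
```

The listing would normally come from capturing the command output on each node; how that output is collected is left out of the sketch.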
- Copy the YAML file in the directory where the NPU Exporter package is decompressed to any directory on the Kubernetes management node.
- Perform the following steps based on the containerized mode in use.
- containerd scenario: Set containerMode to containerd and modify the following code in bold.
If the default NPU Exporter startup parameter -containerMode=docker is used, skip this step.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: npu-exporter
  namespace: npu-exporter
spec:
  selector:
    matchLabels:
      app: npu-exporter
  ...
    spec:
      ...
        args: [ "umask 027;npu-exporter -port=8082 -ip=0.0.0.0 -updateTime=5 -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log -logLevel=0 -containerMode=containerd" ]
        ...
        volumeMounts:
          ...
          - name: docker-shim
            mountPath: /var/run/dockershim.sock
            readOnly: true
          - name: docker  # Delete this configuration item only when containerd is used.
            mountPath: /var/run/docker
            readOnly: true
          - name: cri-dockerd
            mountPath: /var/run/cri-dockerd.sock
            readOnly: true
          - name: containerd
            mountPath: /run/containerd
            readOnly: true
          - name: isulad
            mountPath: /run/isulad.sock
            readOnly: true
      ...
      volumes:
        ...
        - name: docker-shim
          hostPath:
            path: /var/run/dockershim.sock
        - name: docker  # Delete this configuration item only when containerd is used.
          hostPath:
            path: /var/run/docker
        - name: cri-dockerd
          hostPath:
            path: /var/run/cri-dockerd.sock
        - name: containerd
          hostPath:
            path: /run/containerd
        - name: isulad
          hostPath:
            path: /run/isulad.sock
        ...
- Docker scenario: Delete the mount file of the original container runtime, add the mount directory of the dockershim.sock file, and modify the following information in bold.
If the NPU Exporter startup parameter -containerMode=containerd is used, skip this step.
This step resolves the loss of NPU Exporter data after kubelet is restarted. Note that mounting the whole directory exposes more files, such as docker.sock, to the container, which increases the risk of container escape.
...
        volumeMounts:
          - name: log-npu-exporter
          ...
          - name: sys
            mountPath: /sys
            readOnly: true
          - name: docker-shim  # Delete the following fields in bold.
            mountPath: /var/run/dockershim.sock
            readOnly: true
          - name: docker
            mountPath: /var/run/docker
            readOnly: true
          - name: cri-dockerd
            mountPath: /var/run/cri-dockerd.sock
            readOnly: true
          - name: sock  # Add the fields in bold.
            mountPath: /var/run  # Use the actual dockershim.sock file directory.
          - name: containerd
            mountPath: /run/containerd
...
      volumes:
        - name: log-npu-exporter
        ...
        - name: sys
          hostPath:
            path: /sys
        - name: docker-shim  # Delete the following fields in bold.
          hostPath:
            path: /var/run/dockershim.sock
        - name: docker
          hostPath:
            path: /var/run/docker
        - name: cri-dockerd
          hostPath:
            path: /var/run/cri-dockerd.sock
        - name: sock  # Add the fields in bold.
          hostPath:
            path: /var/run  # Use the actual dockershim.sock file directory.
        - name: containerd
          hostPath:
            path: /run/containerd
...
- If you do not need to modify other startup parameters of the component, skip this step. Otherwise, modify the NPU Exporter startup parameters in the YAML file based on your requirements. For details about the startup parameters, see Table 2. You can also run the ./npu-exporter -h command to view the parameter descriptions.
- Run the following command in the directory where the YAML file of the management node is stored to start NPU Exporter.
- If Atlas 200I SoC A1 core boards are used in a Kubernetes cluster, run the following command:
kubectl apply -f npu-exporter-310P-1usoc-v{version}.yaml
- If nodes other than Atlas 200I SoC A1 core boards are used in a Kubernetes cluster, run the following command:
kubectl apply -f npu-exporter-v{version}.yaml
Startup example:
namespace/npu-exporter created
networkpolicy.networking.k8s.io/exporter-network-policy created
daemonset.apps/npu-exporter created
If the error message "Error from server (NotFound): error when creating "npu-exporter-x.x.x.yaml": namespaces "npu-exporter" not found" is displayed during NPU Exporter startup, the NPU Exporter namespace failed to be created. Run the following command to create the namespace manually:
kubectl create ns npu-exporter
- Run the following command on any node to check whether the component is started:
kubectl get pod -n npu-exporter
If Running is displayed in the command output, the component is started successfully. If the status is CrashLoopBackOff, the directory permission may be incorrect. Rectify this fault by referring to NPU Exporter Fails to Check the Dynamic Path, and "check uid or mode failed" Is Recorded in the Log.
NAME READY STATUS RESTARTS AGE
...
npu-exporter-hqpxl 1/1 Running 0 11s
- NPU Exporter has requirements on the process environment. If it is running as a container, ensure that the /sys directory and the socket file for communication are mounted to the NPU Exporter container. If the NPU container information is not obtained by calling the Metrics interface of NPU Exporter, the possible cause is that the socket file path is incorrect. Rectify this fault by referring to "connecting to container runtime failed" Is Displayed in Logs.
- If the pod status of the installed component is not Running, refer to Component pods Are Not in the Running State.
- If the pod status of the installed component is ContainerCreating, refer to Cluster Scheduling Component Pods Are in the ContainerCreating State.
- If the component fails to be started, refer to Cluster Scheduling Components Fail to Start and "get sem errno =13" Is Displayed in Logs.
- If the component is started successfully, but the corresponding pod cannot be found, refer to YAML File for Starting a Component Is Successfully Executed, But the pod Corresponding to the Component Is Not Displayed.
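The status check above can be automated when the cluster has many nodes. The following sketch scans text in the format printed by kubectl get pod and returns every pod whose STATUS is not Running, so that the matching troubleshooting topic can be looked up. The function name pods_not_running and the second pod in the sample are hypothetical.

```python
def pods_not_running(kubectl_output):
    """Return (name, status) pairs for pods whose STATUS column is not Running."""
    lines = [ln for ln in kubectl_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_col = header.index("NAME")
    status_col = header.index("STATUS")
    problems = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= status_col:
            continue  # skip truncated or placeholder lines
        if fields[status_col] != "Running":
            problems.append((fields[name_col], fields[status_col]))
    return problems


# Sample text; the CrashLoopBackOff pod is invented for the demonstration.
sample = """NAME READY STATUS RESTARTS AGE
npu-exporter-hqpxl 1/1 Running 0 11s
npu-exporter-abcde 0/1 CrashLoopBackOff 3 50s
"""
problems = pods_not_running(sample)
```

An empty result means every listed pod is Running. Any other status, such as CrashLoopBackOff or ContainerCreating, points to the corresponding topic in the list above.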
Binary-based Installation
When NPU Exporter runs in containerized mode, it requires a privileged container, the root user, and a mounted docker-shim or containerd socket file. If the container is compromised, there is a risk of container escape. If high security is required, run the component on a physical machine in binary mode.
- When NPU Exporter is deployed in binary mode, you can use a non-root user (for example, hwMindX) for deployment. Change the owner of the log directory to hwMindX by running chown hwMindX:hwMindX /var/log/mindx-dl/npu-exporter. The command is for reference only.
- The user hwMindX is used in the following steps.
- Log in to a server as the root user.
- Upload the NPU Exporter package to any directory (for example, /home/ascend-npu-exporter) on the server and decompress the package.
- Copy the metricConfiguration.json and pluginConfiguration.json files in the decompressed NPU Exporter package directory to the /usr/local directory.
- (Optional) Modify the metricConfiguration.json or pluginConfiguration.json file to configure the collection and reporting of the default or custom metric group.
- Go to the /usr/local directory.
- Open the metricConfiguration.json file.
vi metricConfiguration.json
Press i to enter the insert mode and configure the collection and reporting of the default metric group as required.
| Parameter | Description |
|---|---|
| metricsGroup | Default metric group name: ddr (DDR information), hccs (HCCS information), npu (NPU information), network (network information), pcie (PCIe information), roce (RoCE information), sio (SIO information), vnpu (vNPU information), version (version information), optical (optical module information), hbm (on-chip memory information). |
| state | Switch for metric group collection and reporting. The default value is ON. ON: enabled; metrics of the metric group are collected and reported. OFF: disabled; metrics of the metric group are not collected or reported. |
- Press Esc and enter :wq! to save the settings and exit.
- Modify the pluginConfiguration.json file by referring to 4.b to 4.d and configure the collection and reporting switch of the custom metric group as required.
| Parameter | Description |
|---|---|
| metricsGroup | Name of the custom metric group registered with NPU Exporter. For details about how to customize metrics, see Custom Metric Development. |
| state | Switch for metric group collection and reporting. The default value is OFF. ON: enabled; metrics of the metric group are collected and reported. OFF: disabled; metrics of the metric group are not collected or reported. |
- If custom metrics are developed using a plugin, rebuild the binary file.
- Create and edit the npu-exporter.service file.
- Create the npu-exporter.service file.
vi /home/ascend-npu-exporter/npu-exporter.service
Write the following information to the npu-exporter.service file.
[Unit]
Description=Ascend npu exporter
Documentation=hiascend.com

[Service]
ExecStart=/bin/bash -c "/usr/local/bin/npu-exporter -ip=127.0.0.1 -port=8082 -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log>/dev/null 2>&1 &"
Restart=always
RestartSec=2
KillMode=process
Environment="GOGC=50"
Environment="GOMAXPROCS=2"
Environment="GODEBUG=madvdontneed=1"
Type=forking
User=hwMindX
Group=hwMindX

[Install]
WantedBy=multi-user.target
By default, NPU Exporter listens only on 127.0.0.1. To listen on a different IP address, change the -ip startup parameter in the ExecStart field of the npu-exporter.service file.
- Press Esc and enter :wq to save the changes and exit.
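If the service file is generated by a deployment script, the ExecStart line can be built with a small helper that rejects malformed addresses and out-of-range ports before the file is written. This is a sketch only: the helper name exec_start_line and the address 192.168.1.10 are hypothetical, while the command layout and the port range (1025 to 40000) follow this document.

```python
import ipaddress


def exec_start_line(ip, port=8082,
                    log_file="/var/log/mindx-dl/npu-exporter/npu-exporter.log"):
    """Build the ExecStart line of npu-exporter.service for a listening address."""
    ipaddress.ip_address(ip)  # raises ValueError for a malformed address
    if not 1025 <= port <= 40000:  # range documented for -port
        raise ValueError("port must be between 1025 and 40000")
    command = (f"/usr/local/bin/npu-exporter -ip={ip} -port={port} "
               f"-logFile={log_file}>/dev/null 2>&1 &")
    return f'ExecStart=/bin/bash -c "{command}"'


# 192.168.1.10 is an example address, not a recommended value.
line = exec_start_line("192.168.1.10")
```

Validating before writing matters here because the service file is later made immutable with chattr +i, so a mistake is more awkward to correct afterwards.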
- Create and edit the npu-exporter.timer file. Configuring a timer to start NPU Exporter after a delay can ensure that the NPU is ready when NPU Exporter is started.
- Create the npu-exporter.timer file.
vi /home/ascend-npu-exporter/npu-exporter.timer
- Add the following information to the npu-exporter.timer file.
[Unit]
Description=Timer for NPU Exporter Service

[Timer]
# Set the delay for starting NPU Exporter. Adjust the value as required.
OnBootSec=60s
Unit=npu-exporter.service

[Install]
WantedBy=timers.target
- Press Esc and enter :wq to save the changes and exit.
- If the deployment node is an Atlas 200I SoC A1 core board, run the following commands in sequence to add the hwMindX user to the HwBaseUser and HwDmUser user groups on the node. Skip this step if the Atlas 200I SoC A1 core board is not used.
usermod -a -G HwBaseUser hwMindX
usermod -a -G HwDmUser hwMindX
- Start the NPU Exporter service.
cd /home/ascend-npu-exporter
cp npu-exporter /usr/local/bin
cp npu-exporter.service /etc/systemd/system
chattr +i /etc/systemd/system/npu-exporter.service
cp npu-exporter.timer /etc/systemd/system
chattr +i /etc/systemd/system/npu-exporter.timer
chmod 500 /usr/local/bin/npu-exporter
chown hwMindX:hwMindX /usr/local/bin/npu-exporter
chattr +i /usr/local/bin/npu-exporter
systemctl enable npu-exporter.timer
systemctl start npu-exporter
systemctl start npu-exporter.timer
To obtain container metrics, you need to temporarily escalate the NPU Exporter privilege so that it can establish connections with the sockets of CRI and OCI:
chattr -i /usr/local/bin/npu-exporter
setcap cap_setuid+ep /usr/local/bin/npu-exporter
chattr +i /usr/local/bin/npu-exporter
systemctl restart npu-exporter
Parameters
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| -port | Integer | 8082 | Listening port. The value ranges from 1025 to 40000. |
| -updateTime | Integer | 5 | Information update period. The value ranges from 1 to 60, in seconds. If this parameter is set to a large value, containers whose lifetime is shorter than the update period may not be reported. |
| -ip | String | None | Listening IP address. This parameter has no default value and must be set. Setting it to 0.0.0.0 is not recommended on a host with multiple NICs. |
| -version | Bool | false | Whether to query the NPU Exporter version number. |
| -concurrency | Integer | 5 | Traffic limit of the HTTP service. The value ranges from 1 to 512. |
| -logLevel | Integer | 0 | Log level. |
| -maxAge | Integer | 7 | Number of days for which log backups are retained. The value ranges from 7 to 700. |
| -logFile | String | /var/log/mindx-dl/npu-exporter/npu-exporter.log | Log file. NOTE: If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed. Dumped files are named in the format npu-exporter-{dump triggering time}.log, for example, npu-exporter-2023-10-07T03-38-24.402.log. |
| -maxBackups | Integer | 30 | Maximum number of dumped log files that can be retained. The value ranges from 1 to 30. |
| -containerMode | String | docker | Container runtime type, for example, docker or containerd. |
| -containerd | String | | Endpoint of the containerd daemon process, which is used to communicate with containerd. Retain the default configuration unless you have changed the path of the containerd sock file. You can run the ps aux \| grep containerd command to check whether the sock file path of containerd has been changed. |
| -endpoint | String | | Sock address of the CRI server. Generally, retain the default value unless you have changed the sock file path of dockershim or containerd. If the connection fails, the system automatically attempts to connect to unix:///run/cri-dockerd.sock. |
| -limitIPConn | Integer | 5 | Maximum number of TCP connections for each IP address. The value ranges from 1 to 128. |
| -limitTotalConn | Integer | 20 | Maximum total number of TCP connections for the program. The value ranges from 1 to 512. |
| -limitIPReq | String | 20/1 | Request rate limit for each IP address. The value 20/1 indicates that a maximum of 20 requests are allowed per second. A maximum of three digits are supported on each side of the slash (/). |
| -cacheSize | Integer | 102400 | Maximum number of cache keys. The value ranges from 1 to 1024000. |
| -h or -help | None | None | Help information. |
| -platform | String | Prometheus | Interconnection platform: Prometheus or Telegraf. |
| -poll_interval | Duration (integer) | 1 | Interval for reporting Telegraf data, in seconds. This parameter takes effect only when -platform is set to Telegraf. |
| -profilingTime | Integer | 200 | PCIe bandwidth collection time. The value ranges from 1 to 2000, in milliseconds. |
| -hccsBWProfilingTime | Integer | 200 | Duration for sampling the HCCS link bandwidth. The value ranges from 1 to 1000, in milliseconds. |
| -deviceResetTimeout | Integer | 60 | Maximum wait time for the driver to report complete processor information if the detected processor count is insufficient at component startup. The value ranges from 10 to 600, in seconds. |
| -textMetricsFilePath | String | None | Path of the custom metric file. For details about the restrictions, see Restrictions. |
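Before editing the YAML file or the service file, a set of startup parameters can be checked against the ranges in the table. The sketch below is illustrative only: the function name check_params and the dictionary layout are hypothetical, and NPU Exporter performs its own validation at startup regardless. The ranges are copied from the table above.

```python
import re

# Inclusive ranges copied from the parameter table.
INT_RANGES = {
    "port": (1025, 40000),
    "updateTime": (1, 60),
    "concurrency": (1, 512),
    "maxAge": (7, 700),
    "maxBackups": (1, 30),
    "limitIPConn": (1, 128),
    "limitTotalConn": (1, 512),
    "cacheSize": (1, 1024000),
    "profilingTime": (1, 2000),
    "hccsBWProfilingTime": (1, 1000),
    "deviceResetTimeout": (10, 600),
}

# -limitIPReq: up to three digits on each side of the slash.
LIMIT_IP_REQ = re.compile(r"\d{1,3}/\d{1,3}")


def check_params(params):
    """Return a list of error messages for parameters outside the documented ranges."""
    errors = []
    for name, value in params.items():
        if name in INT_RANGES:
            low, high = INT_RANGES[name]
            if not (isinstance(value, int) and low <= value <= high):
                errors.append(f"-{name} must be an integer from {low} to {high}")
        elif name == "limitIPReq":
            if not LIMIT_IP_REQ.fullmatch(str(value)):
                errors.append("-limitIPReq must look like 20/1")
    if "ip" not in params:
        errors.append("-ip has no default value and must be set")
    return errors


good = check_params({"ip": "127.0.0.1", "port": 8082, "updateTime": 5, "limitIPReq": "20/1"})
bad = check_params({"port": 80})
```

Parameters that are not listed in INT_RANGES, such as -logFile or -containerMode, are passed through unchecked in this sketch.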