Starting the NPU-Exporter
Constraints
- The NPU-Exporter periodically calls the related APIs of the NPU driver to detect the NPU status. To upgrade the driver, stop the NPU-Exporter first.
- The DCMI dynamic library invoked by the NPU-Exporter and all of its parent directories must be owned by root; otherwise, the program cannot run. In addition, group and other users must not have write permission on these files and directories.
- The length of the DCMI dynamic library path must be less than 20.
- If the dynamic library path is set through LD_LIBRARY_PATH, the total length of LD_LIBRARY_PATH cannot exceed 1024 characters.
- To deploy the NPU-Exporter in a container on an Atlas 200I SoC A1 core board node, you need to configure the multi-container sharing mode. For details, see the Atlas 200I SoC A1 Core Board NPU Driver and Firmware Installation Guide.
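The ownership, permission, and length constraints above can be checked before the service is started. The following is a minimal sketch, assuming a Linux host with bash and GNU coreutils (stat -c, readlink -f). The function names are hypothetical, the library path passed in depends on your driver installation, and the 1024 limit is interpreted here as a character count.

```shell
#!/usr/bin/env bash
# Hypothetical pre-start checks for the constraints listed above.

# Succeeds if the given LD_LIBRARY_PATH value is at most 1024 characters long.
ld_path_len_ok() {
    local value="$1"
    [ "${#value}" -le 1024 ]
}

# Succeeds if the file and every parent directory up to / is owned by root
# (UID 0) and is not writable by group or other.
root_owned_chain_ok() {
    local p owner mode
    p=$(readlink -f "$1") || return 1
    while :; do
        owner=$(stat -c '%u' "$p") || return 1
        mode=$(stat -c '%a' "$p") || return 1
        if [ "$owner" != "0" ]; then
            echo "not owned by root: $p" >&2
            return 1
        fi
        # Octal 022 covers the group-write and other-write bits.
        if (( (8#$mode & 8#022) != 0 )); then
            echo "writable by group or other: $p" >&2
            return 1
        fi
        if [ "$p" = "/" ]; then
            break
        fi
        p=$(dirname "$p")
    done
    return 0
}
```

For example, with DCMI_LIB set to the location of the DCMI library on your system, run ld_path_len_ok "$LD_LIBRARY_PATH" && root_owned_chain_ok "$DCMI_LIB" and inspect the exit status.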
Procedure
The NPU-Exporter supports two installation modes. You can select either of the following modes as required.
- (Recommended) Binary-based installation on a physical machine (for high security requirements). For details, see Binary-based Installation.
- Container-based installation. For details, see Container-based Installation.
Binary-based Installation
When deployed in a container, the NPU-Exporter runs as the root user in a privileged container and mounts the socket file of docker-shim or containerd. If the container is compromised, container escape may occur. You are therefore advised to import the certificate and run the component as a binary service on the physical machine.
- Log in to the server as the root user, upload the software package to any directory (for example, /home/ascend-npu-exporter) on the server where the component is to be installed, and decompress it.
- Create the npu-exporter.service file with the following content and save it to the directory where the software package was decompressed (/home/ascend-npu-exporter in this example).
[Unit]
Description=Ascend npu exporter
Documentation=hiascend.com

[Service]
ExecStart=/bin/bash -c "/usr/local/bin/npu-exporter -ip=127.0.0.1 -port=8082 -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log>/dev/null 2>&1 &"
Restart=always
RestartSec=2
KillMode=process
Environment="LD_LIBRARY_PATH=/usr/local/kmc"
Environment="GOGC=50"
Environment="GOMAXPROCS=2"
Environment="GODEBUG=madvdontneed=1"
Type=forking
User=hwMindX
Group=hwMindX

[Install]
WantedBy=multi-user.target
- By default, the NPU-Exporter listens only on 127.0.0.1 and provides the HTTPS service. If the HTTPS certificate has not been imported, modify the startup parameters in the ExecStart field of the npu-exporter.service file so that the component starts in HTTP mode. For details about the startup parameters that can be modified, see Table 1.
...
[Service]
ExecStart=/bin/bash -c "/usr/local/bin/npu-exporter -ip=127.0.0.1 -port=8082 -enableHTTP=true -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log &"
Restart=always
RestartSec=2
...
- Skip this step if the deployment node is not an Atlas 200I SoC A1 core board node. Otherwise, add the hwMindX user to the HwBaseUser and HwDmUser user groups on all nodes of this type by running the following commands:
usermod -a -G HwBaseUser hwMindX
usermod -a -G HwDmUser hwMindX
- Run the following commands to install and start the NPU-Exporter service:
cd /home/ascend-npu-exporter
mkdir /usr/local/kmc
cp -r lib/* /usr/local/kmc
cp npu-exporter /usr/local/bin
cp npu-exporter.service /etc/systemd/system
chattr +i /etc/systemd/system/npu-exporter.service
chmod 444 /usr/local/kmc/*
chmod 755 /usr/local/kmc
chmod 500 /usr/local/bin/npu-exporter
chown hwMindX:hwMindX /usr/local/bin/npu-exporter
chattr +i /usr/local/bin/npu-exporter
systemctl enable npu-exporter
systemctl start npu-exporter
- To obtain container metrics, you need to temporarily escalate the NPU-Exporter privilege so that the NPU-Exporter can establish connections with the sockets of CRI and OCI:
chattr -i /usr/local/bin/npu-exporter
setcap cap_setuid+ep /usr/local/bin/npu-exporter
chattr +i /usr/local/bin/npu-exporter
tee /etc/ld.so.conf.d/ascend_dl_so.conf <<- EOF
/usr/local/kmc
EOF
ldconfig
systemctl restart npu-exporter
- The lib directory, which holds the dynamic libraries that the encryption component depends on, contains libcrypto.so. This library may conflict with the built-in system library in some environments. If an OpenSSL-related error occurs during the installation, rectify the fault by referring to Troubleshooting.
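The HTTP-mode change to the ExecStart field described in the steps above can also be scripted. The following is a minimal sketch, assuming bash, GNU sed, and a unit file laid out like the example in this section. The function name enable_http_in_unit is hypothetical; -enableHTTP is the parameter listed in Table 1. If an -enableHTTP setting is already present, the file is left untouched, so an explicit -enableHTTP=false must be changed by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add -enableHTTP=true to the ExecStart line of a unit file.
enable_http_in_unit() {
    local unit="$1"
    # Do nothing if an -enableHTTP setting is already on the ExecStart line.
    if grep -q '^ExecStart=.*-enableHTTP=' "$unit"; then
        return 0
    fi
    # Insert the flag right after the binary path, on the ExecStart line only.
    sed -i '/^ExecStart=/ s|/usr/local/bin/npu-exporter |&-enableHTTP=true |' "$unit"
}
```

Apply it to the copy in the decompression directory, for example enable_http_in_unit /home/ascend-npu-exporter/npu-exporter.service, before the file is copied to /etc/systemd/system and made immutable with chattr +i.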
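After the installation commands in this section have run, the resulting file modes and the service state can be spot-checked. The following is a minimal sketch, assuming bash and GNU stat. check_mode and verify_npu_exporter_install are hypothetical helper names; the expected modes are the ones set by the chmod commands above.

```shell
#!/usr/bin/env bash
# Succeeds if the octal permission bits of PATH equal WANT (for example 500).
check_mode() {
    local want="$1" path="$2" got
    got=$(stat -c '%a' "$path") || return 1
    if [ "$got" != "$want" ]; then
        echo "$path: mode $got, expected $want" >&2
        return 1
    fi
}

# Checks the paths used in this section and whether the service is active.
verify_npu_exporter_install() {
    check_mode 755 /usr/local/kmc &&
    check_mode 500 /usr/local/bin/npu-exporter &&
    systemctl is-active --quiet npu-exporter
}
```

Run verify_npu_exporter_install as root on the target node; a non-zero exit status means one of the checks failed, and the mismatching path is printed to standard error.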
Container-based Installation
- Log in to each worker node as the root user and run the following command to check whether the NPU-Exporter image and version are correct:
docker images | grep npu-exporter
Example:
root@ubuntu:~# docker images|grep npu-exporter
npu-exporter   v3.0.0   20185c45f1bc   About an hour ago   90.1MB
- If yes, go to the next step.
- If no, create an image and distribute it. For details, see Creating an Image.
- Copy the YAML file obtained from the software package to the Kubernetes master node.
- The NPU-Exporter provides the HTTPS service by default. If the component is to be started in HTTPS mode, skip this step. If the HTTPS certificate has not been imported, perform the following operations to start the component in HTTP mode.
- Atlas 200I SoC A1 core board node: Change the NPU-Exporter startup parameters in the startup script run_for_310P_1usoc.sh and set -enableHTTP to true.
...
/usr/local/bin/npu-exporter -port=8082 -ip=0.0.0.0 -updateTime=5 -enableHTTP=true -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log -logLevel=0 -containerMode=docker -endpoint=/run/dockershim.sock -containerd=/run/docker/containerd/containerd.sock
After the modification, either rebuild the image on every Atlas 200I SoC A1 core board node, or rebuild it on one such node and distribute it to the others.
In addition, delete the kmckeystore, kmckeybak, and kmc-exporter entries shown below from both volumeMounts and volumes in the npu-exporter-310P-1usoc-*.yaml file:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: npu-exporter-310p-1usoc
  namespace: npu-exporter
spec:
  selector:
    matchLabels:
      app: npu-exporter
...
    spec:
...
        volumeMounts:
...
          - name: kmckeystore          # Delete the following information in bold.
            mountPath: /etc/mindx-dl/kmc_primary_store
          - name: kmckeybak
            mountPath: /etc/mindx-dl/.config
          - name: kmc-exporter
            mountPath: /etc/mindx-dl/npu-exporter
...
      volumes:
...
        - name: kmckeystore
          hostPath:
            path: /etc/mindx-dl/kmc_primary_store
            type: Directory
        - name: kmckeybak
          hostPath:
            path: /etc/mindx-dl/.config
            type: Directory
        - name: kmc-exporter
          hostPath:
            path: /etc/mindx-dl/npu-exporter
            type: Directory
...
- Other nodes: Modify the NPU-Exporter startup parameters in the npu-exporter-*.yaml file and add -enableHTTP=true as follows.
...
command: [ "/bin/bash", "-c", "--"]
args: [ "umask 027;npu-exporter -port=8082 -ip=0.0.0.0 -updateTime=5 -enableHTTP=true -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log -logLevel=0 -containerMode=containerd" ]
...
In addition, delete the kmckeystore, kmckeybak, and kmc-exporter entries shown below from both volumeMounts and volumes in the npu-exporter-*.yaml file:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: npu-exporter
  namespace: npu-exporter
spec:
  selector:
    matchLabels:
      app: npu-exporter
...
    spec:
...
        volumeMounts:
...
          - name: kmckeystore          # Delete the following information in bold.
            mountPath: /etc/mindx-dl/kmc_primary_store
          - name: kmckeybak
            mountPath: /etc/mindx-dl/.config
          - name: kmc-exporter
            mountPath: /etc/mindx-dl/npu-exporter
...
      volumes:
...
        - name: kmckeystore
          hostPath:
            path: /etc/mindx-dl/kmc_primary_store
            type: Directory
        - name: kmckeybak
          hostPath:
            path: /etc/mindx-dl/.config
            type: Directory
        - name: kmc-exporter
          hostPath:
            path: /etc/mindx-dl/npu-exporter
            type: Directory
...
- If the default NPU-Exporter startup parameter -containerMode=docker is used, skip this step. To set -containerMode to containerd, delete the docker-shim and docker entries (marked "delete when only use containerd" below) from both volumeMounts and volumes in the YAML file.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: npu-exporter
  namespace: npu-exporter
spec:
  selector:
    matchLabels:
      app: npu-exporter
...
    spec:
...
        volumeMounts:
...
          - name: docker-shim          # delete when only use containerd   # Delete the following information in bold.
            mountPath: /var/run/dockershim.sock
            readOnly: true
          - name: docker               # delete when only use containerd
            mountPath: /var/run/docker
            readOnly: true
...
      volumes:
...
        - name: docker-shim            # delete when only use containerd
          hostPath:
            path: /var/run/dockershim.sock
        - name: docker                 # delete when only use containerd
          hostPath:
            path: /var/run/docker
...
- If you do not need to modify other startup parameters of the component, skip this step. Otherwise, modify the startup parameters of NPU-Exporter in the YAML file based on your requirements. For details about the startup parameters, see Table 1. You can also run the ./npu-exporter -h command to view the parameter descriptions.
- Run the following command to start the NPU-Exporter.
kubectl apply -f npu-exporter-*.yaml
- If the Atlas 200I SoC A1 core board is used in the Kubernetes cluster, run the following command:
kubectl apply -f npu-exporter-310P-1usoc-*.yaml
- If other types of nodes are used in the Kubernetes cluster, run the following command:
kubectl apply -f npu-exporter-*.yaml
If the Kubernetes cluster uses both Atlas 200I SoC A1 core boards and other types of nodes, run both commands.
The following is an example:
root@ubuntu:/home/ascend-npu-exporter# kubectl apply -f npu-exporter-v3.0.0.yaml
namespace/npu-exporter unchanged
networkpolicy.networking.k8s.io/exporter-network-policy unchanged
daemonset.apps/npu-exporter created
root@ubuntu:/home/ascend-npu-exporter# kubectl get pod -n npu-exporter
NAME                 READY   STATUS    RESTARTS   AGE
...
npu-exporter-hqpxl   1/1     Running   0          11s
...
The NPU-Exporter has requirements on the process environment. If the NPU-Exporter runs in a container, ensure that the /sys directory and the socket file used for communication are mounted to the NPU-Exporter container.
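The image check in the first step of this procedure can be automated on each worker node. The following is a minimal sketch, assuming bash and grep. has_image_tag is a hypothetical helper that reads repository:tag pairs on standard input, as printed by docker images --format '{{.Repository}}:{{.Tag}}'; the tag v3.0.0 is taken from the example output above.

```shell
#!/usr/bin/env bash
# Reads "repository:tag" lines on stdin; succeeds if REPO:TAG is among them.
has_image_tag() {
    local repo="$1" tag="$2"
    # -x matches whole lines only, -F treats the pattern as a fixed string.
    grep -qxF -- "${repo}:${tag}"
}

# Example call on a worker node (requires the docker CLI):
#   docker images --format '{{.Repository}}:{{.Tag}}' | has_image_tag npu-exporter v3.0.0
```

A non-zero exit status means that the expected image is not present on the node, in which case the image has to be created and distributed as described in Creating an Image.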
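After kubectl apply, the pod status shown in the example above can be checked in a script instead of by eye. The following is a minimal sketch, assuming bash and awk. pods_all_running is a hypothetical helper that reads the output of kubectl get pod -n npu-exporter --no-headers on standard input and relies on STATUS being the third column, as in the example output.

```shell
#!/usr/bin/env bash
# Prints the names of pods whose STATUS column is not "Running".
# Succeeds only if at least one pod is listed and all of them are Running.
pods_all_running() {
    awk '
        NF >= 3 { total++; if ($3 != "Running") { print $1; bad++ } }
        END { exit (total > 0 && bad == 0) ? 0 : 1 }
    '
}

# Example call on the master node (requires kubectl):
#   kubectl get pod -n npu-exporter --no-headers | pods_all_running
```

An empty pod list is treated as a failure, because it usually means that the DaemonSet did not schedule any pod on the NPU nodes.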
Parameters
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| -port | int | 8082 | Listening port. The value ranges from 1025 to 40000. |
| -updateTime | int | 5 | Information update period, in seconds. The value ranges from 1 to 60. If this parameter is set to a large value, containers whose lifetime is shorter than the update period may not be reported. |
| -ip | string | None | Listening IP address. This parameter has no default value and must be set. Setting it to 0.0.0.0 is not recommended on a host with multiple NICs. |
| -enableHTTP | bool | false | Whether to enable HTTP. NOTE: If this parameter is set to true, security risks may exist. You are advised to use the default value false. |
| -version | bool | false | Whether to print the program version number. |
| -tlsSuites | int | 1 | TLS encryption suite configuration. |
| -concurrency | int | 5 | Concurrency limit of the HTTP/HTTPS service. The value ranges from 1 to 512. |
| -logLevel | int | 0 | Log level. |
| -maxAge | int | 7 | Log backup time limit, in days. The value ranges from 7 to 700. |
| -logFile | string | /var/log/mindx-dl/npu-exporter/npu-exporter.log | Log file. NOTE: If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed. |
| -maxBackups | int | 30 | Maximum number of dumped log files that can be retained. The value range is (0, 30]. |
| -containerMode | string | docker | Container runtime type. |
| -containerd | string | unix:///var/run/docker/containerd/docker-containerd.sock or unix:///run/containerd/containerd.sock | Endpoint of the containerd daemon process, which is used to communicate with containerd. Generally, retain the default value. If you have changed the sock file path of containerd, you can run ps aux \| grep containerd to find the current path. |
| -endpoint | string | unix:///var/run/dockershim.sock or unix:///run/containerd/containerd.sock | Sock address of the CRI server. Generally, retain the default value unless you have changed the sock file path of dockershim or containerd. |
| -checkInterval | int | 1 | Interval for checking the certificate validity, in days. The value ranges from 1 to 7. |
| -warningDays | int | 100 | Number of days before certificate expiry at which a warning log is generated. The value ranges from 10 to 365. |
| -limitIPConn | int | 5 | Number of TCP connections for each IP address. The value ranges from 1 to 128. |
| -limitTotalConn | int | 20 | Number of TCP connections for the program. The value ranges from 1 to 512. |
| -limitIPReq | string | 20/1 | Number of requests from each IP address. The value 20/1 indicates that a maximum of 20 requests are allowed per second. A maximum of three digits are supported on each side of the slash (/). |
| -cacheSize | int | 102400 | Number of cache keys. The value ranges from 1 to 1024000. |
| -h | None | N/A | Help information. |
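The value ranges in the table can be checked before the parameters are written into a unit file or YAML file. The following is a minimal sketch, assuming bash. The helper names are hypothetical, and only -port, -updateTime, and -limitIPReq are covered; the NPU-Exporter itself may apply further checks that are not described here.

```shell
#!/usr/bin/env bash
# Succeeds if VALUE is a non-negative integer within [MIN, MAX].
in_range() {
    local value="$1" min="$2" max="$3"
    [[ "$value" =~ ^[0-9]+$ ]] || return 1
    # 10# forces base 10 so that leading zeros are not read as octal.
    (( 10#$value >= min && 10#$value <= max ))
}

# Succeeds if VALUE has the form N/M with one to three digits on each side.
limit_ip_req_ok() {
    [[ "$1" =~ ^[0-9]{1,3}/[0-9]{1,3}$ ]]
}

# Validates three parameters against the ranges given in the table.
validate_params() {
    local port="$1" update_time="$2" limit_ip_req="$3"
    in_range "$port" 1025 40000 || { echo "invalid -port: $port" >&2; return 1; }
    in_range "$update_time" 1 60 || { echo "invalid -updateTime: $update_time" >&2; return 1; }
    limit_ip_req_ok "$limit_ip_req" || { echo "invalid -limitIPReq: $limit_ip_req" >&2; return 1; }
}
```

For example, validate_params 8082 5 20/1 succeeds with the default values, whereas validate_params 80 5 20/1 fails because the port is below 1025.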