Starting the NPU-Exporter

Constraints

  • The NPU-Exporter periodically calls the related APIs of the NPU driver to detect the NPU status. To upgrade the driver, stop the NPU-Exporter first.
  • The DCMI dynamic library invoked by the NPU-Exporter and all of its parent directories must be owned by root; otherwise, the program cannot run. In addition, group and other users must not have write permission on these files and directories.
  • The depth of the DCMI dynamic library path must be less than 20 directory levels.
  • If the dynamic library path is specified through LD_LIBRARY_PATH, the total length of LD_LIBRARY_PATH cannot exceed 1024 characters.
  • To deploy the NPU-Exporter in a container on an Atlas 200I SoC A1 core board node, configure the multi-container sharing mode. For details, see the Atlas 200I SoC A1 Core Board NPU Driver and Firmware Installation Guide.
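
The ownership, permission, and length constraints above can be checked before the first start. The following is a minimal sketch rather than an official tool: the function names are hypothetical, stat -c and readlink -f assume GNU coreutils, and the library path in the usage line is a placeholder that must be replaced with the actual DCMI library location.

```shell
#!/usr/bin/env bash
# Sketch only: pre-flight checks for the constraints listed above.

# Succeeds if the path is owned by the given numeric UID (0 is root).
owned_by_uid() {
    [ "$(stat -c '%u' "$1" 2>/dev/null)" = "$2" ]
}

# Succeeds if neither group nor other has write permission on the path.
no_group_other_write() {
    local mode
    mode=$(stat -c '%a' "$1" 2>/dev/null) || return 1
    # Octal 020 is group write, octal 002 is other write.
    [ $(( 8#$mode & 8#022 )) -eq 0 ]
}

# Succeeds if the given LD_LIBRARY_PATH value is at most 1024 characters long.
ld_path_len_ok() {
    [ "${#1}" -le 1024 ]
}

# Walks from the library file up to / and reports every entry that breaks a rule.
check_dcmi_chain() {
    local p rc=0
    p=$(readlink -f "$1") || return 1
    while :; do
        owned_by_uid "$p" 0 || { echo "not owned by root: $p"; rc=1; }
        no_group_other_write "$p" || { echo "group/other writable: $p"; rc=1; }
        [ "$p" = "/" ] && break
        p=$(dirname "$p")
    done
    return "$rc"
}
```

A possible invocation, with the placeholder path replaced, is check_dcmi_chain /path/to/libdcmi.so && ld_path_len_ok "$LD_LIBRARY_PATH". The path depth constraint is not covered by this sketch.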

Procedure

The NPU-Exporter supports two installation modes: binary-based and container-based. Select either one as required.

Binary-based Installation

In the container-based mode, the NPU-Exporter runs in a privileged container as the root user and mounts the socket file of dockershim or containerd, so container escape can occur if the container is compromised. You are advised to import the certificate and start the component as a binary service on the physical machine.

  1. Log in to the server as the root user, upload the software package to any directory (for example, /home/ascend-npu-exporter) on the server where the component is to be installed, and decompress it.
  2. Create the npu-exporter.service file with the following content and save it to the /home/ascend-npu-exporter directory where the software package was decompressed.
    [Unit]
    Description=Ascend npu exporter
    Documentation=hiascend.com
    
    [Service]
    ExecStart=/bin/bash -c "/usr/local/bin/npu-exporter -ip=127.0.0.1 -port=8082 -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log>/dev/null  2>&1 &"
    Restart=always
    RestartSec=2
    KillMode=process
    Environment="LD_LIBRARY_PATH=/usr/local/kmc"
    Environment="GOGC=50"
    Environment="GOMAXPROCS=2"
    Environment="GODEBUG=madvdontneed=1"
    Type=forking
    User=hwMindX
    Group=hwMindX
    
    [Install]
    WantedBy=multi-user.target
  3. By default, the NPU-Exporter listens only on 127.0.0.1 and provides the HTTPS service. If the HTTPS certificate has not been imported, set -enableHTTP=true in the ExecStart field of the npu-exporter.service file so that the component starts in HTTP mode. For details about the startup parameters that can be modified, see Table 1.
    ...
    [Service]
    ExecStart=/bin/bash -c "/usr/local/bin/npu-exporter -ip=127.0.0.1 -port=8082 -enableHTTP=true -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log &"
    Restart=always
    RestartSec=2
    ...
  4. Skip this step if the deployment node is not an Atlas 200I SoC A1 core board node. Otherwise, add the hwMindX user to the HwBaseUser and HwDmUser user groups on all nodes of this type by running the following commands:
    usermod -a -G HwBaseUser hwMindX
    usermod -a -G HwDmUser hwMindX
  5. Run the following commands to install and start the NPU-Exporter service:
    cd /home/ascend-npu-exporter
    mkdir -p /usr/local/kmc
    cp -r lib/* /usr/local/kmc
    cp npu-exporter /usr/local/bin
    cp npu-exporter.service /etc/systemd/system
    chattr +i /etc/systemd/system/npu-exporter.service
    chmod 444 /usr/local/kmc/*
    chmod 755 /usr/local/kmc
    chmod 500 /usr/local/bin/npu-exporter
    chown hwMindX:hwMindX /usr/local/bin/npu-exporter
    chattr +i /usr/local/bin/npu-exporter
    systemctl enable npu-exporter
    systemctl start npu-exporter
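    • To obtain container metrics, you need to temporarily escalate the NPU-Exporter privilege so that the NPU-Exporter can connect to the CRI and OCI sockets. A binary that carries file capabilities ignores LD_LIBRARY_PATH, so the commands also register /usr/local/kmc with the dynamic linker through ldconfig: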
      chattr -i /usr/local/bin/npu-exporter
      setcap cap_setuid+ep /usr/local/bin/npu-exporter
      chattr +i /usr/local/bin/npu-exporter
      tee /etc/ld.so.conf.d/ascend_dl_so.conf <<- EOF
      /usr/local/kmc
      EOF
      ldconfig
      systemctl restart npu-exporter
    • The lib directory, which holds the dynamic libraries required by the encryption component, contains libcrypto.so. This library may conflict with the built-in system library in some environments. If an OpenSSL-related error occurs during the installation, rectify the fault by referring to Troubleshooting.
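
After step 5, the file modes and the service state can be checked against the values used in the commands above. The following is a minimal sketch: has_mode and verify_npu_exporter_install are hypothetical helper names, stat -c assumes GNU coreutils, and the paths are the ones used in this procedure.

```shell
#!/usr/bin/env bash
# Sketch only: post-install checks for the binary-based installation.

# Succeeds if the file or directory has exactly the expected octal mode.
has_mode() {
    [ "$(stat -c '%a' "$1" 2>/dev/null)" = "$2" ]
}

# Compares the installed files and the service state with the values from step 5.
verify_npu_exporter_install() {
    local rc=0 f
    has_mode /usr/local/kmc 755 || { echo "unexpected mode: /usr/local/kmc"; rc=1; }
    has_mode /usr/local/bin/npu-exporter 500 || { echo "unexpected mode: /usr/local/bin/npu-exporter"; rc=1; }
    for f in /usr/local/kmc/*; do
        has_mode "$f" 444 || { echo "unexpected mode: $f"; rc=1; }
    done
    systemctl is-enabled --quiet npu-exporter || { echo "service is not enabled"; rc=1; }
    systemctl is-active --quiet npu-exporter || { echo "service is not active"; rc=1; }
    return "$rc"
}
```

Run verify_npu_exporter_install as root. A nonzero exit status means that at least one check failed, and the printed lines name the failing items.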

Container-based Installation

  1. Log in to each worker node as the root user and run the following command to check whether the NPU-Exporter image exists and its version is correct:
    docker images | grep npu-exporter

    Example:

    root@ubuntu:~# docker images|grep npu-exporter 
    npu-exporter                         v3.0.0              20185c45f1bc        About an hour ago         90.1MB
    • If yes, go to step 2.
    • If no, create an image and distribute it. For details, see Creating an Image.
  2. Copy the YAML file obtained from the software package to the Kubernetes master node.
  3. The NPU-Exporter provides the HTTPS service by default. If you start the component in HTTPS mode, skip this step. If the HTTPS certificate has not been imported, perform the following operations to start the component in HTTP mode.
    • Atlas 200I SoC A1 core board node
      In the startup script run_for_310P_1usoc.sh, change the NPU-Exporter startup parameters by setting -enableHTTP to true:
      ...
      /usr/local/bin/npu-exporter -port=8082 -ip=0.0.0.0 -updateTime=5 -enableHTTP=true -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log -logLevel=0 -containerMode=docker -endpoint=/run/dockershim.sock -containerd=/run/docker/containerd/containerd.sock

      After the modification, rebuild the image on every Atlas 200I SoC A1 core board node, or build it on one such node and distribute it to the others.

      In addition, delete the following KMC-related entries (kmckeystore, kmckeybak, and kmc-exporter) from both the volumeMounts and volumes sections of the npu-exporter-310P-1usoc-*.yaml file:

      apiVersion: apps/v1
      kind: DaemonSet
      metadata:
        name: npu-exporter-310p-1usoc
        namespace: npu-exporter
      spec:
        selector:
          matchLabels:
            app: npu-exporter
      ...
          spec:
      ...
              volumeMounts:
      ...
                - name: kmckeystore                                    # Delete the kmckeystore, kmckeybak, and kmc-exporter entries.
                  mountPath: /etc/mindx-dl/kmc_primary_store
                - name: kmckeybak
                  mountPath: /etc/mindx-dl/.config
                - name: kmc-exporter
                  mountPath: /etc/mindx-dl/npu-exporter
      ...
            volumes:
      ...
              - name: kmckeystore
                hostPath:
                  path: /etc/mindx-dl/kmc_primary_store
                  type: Directory
              - name: kmckeybak
                hostPath:
                  path: /etc/mindx-dl/.config
                  type: Directory
              - name: kmc-exporter
                hostPath:
                  path: /etc/mindx-dl/npu-exporter
                  type: Directory
      ...
    • Other nodes
      Modify the NPU-Exporter startup parameters in the npu-exporter-*.yaml file by adding -enableHTTP=true as follows:
      ...
              command: [ "/bin/bash", "-c", "--"]
              args: [ "umask 027;npu-exporter -port=8082 -ip=0.0.0.0  -updateTime=5 -enableHTTP=true
                       -logFile=/var/log/mindx-dl/npu-exporter/npu-exporter.log -logLevel=0 -containerMode=containerd" ]
      ...

      In addition, delete the following KMC-related entries (kmckeystore, kmckeybak, and kmc-exporter) from both the volumeMounts and volumes sections of the npu-exporter-*.yaml file:

      apiVersion: apps/v1
      kind: DaemonSet
      metadata:
        name: npu-exporter
        namespace: npu-exporter
      spec:
        selector:
          matchLabels:
            app: npu-exporter
      ...
          spec:
      ...
              volumeMounts:
      ...
                - name: kmckeystore                                    # Delete the kmckeystore, kmckeybak, and kmc-exporter entries.
                  mountPath: /etc/mindx-dl/kmc_primary_store
                - name: kmckeybak
                  mountPath: /etc/mindx-dl/.config
                - name: kmc-exporter
                  mountPath: /etc/mindx-dl/npu-exporter
      ...
            volumes:
      ...
              - name: kmckeystore
                hostPath:
                  path: /etc/mindx-dl/kmc_primary_store
                  type: Directory
              - name: kmckeybak
                hostPath:
                  path: /etc/mindx-dl/.config
                  type: Directory
              - name: kmc-exporter
                hostPath:
                  path: /etc/mindx-dl/npu-exporter
                  type: Directory
      ...
  4. If the default NPU-Exporter startup parameter -containerMode=docker is used, skip this step. To set -containerMode to containerd, delete the docker-shim and docker entries (commented "delete when only use containerd") from the volumeMounts and volumes sections of the YAML file:
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: npu-exporter
      namespace: npu-exporter
    spec:
      selector:
        matchLabels:
          app: npu-exporter
    ...
        spec:
    ...
            volumeMounts:
    ...
              - name: docker-shim  # delete when only use containerd
                mountPath: /var/run/dockershim.sock
                readOnly: true
              - name: docker  # delete when only use containerd
                mountPath: /var/run/docker
                readOnly: true
    ...
          volumes:
    ...
            - name: docker-shim # delete when only use containerd
              hostPath:
                path: /var/run/dockershim.sock
            - name: docker  # delete when only use containerd
              hostPath:
                path: /var/run/docker
    ...
  5. If you do not need to modify other startup parameters of the component, skip this step. Otherwise, modify the startup parameters of NPU-Exporter in the YAML file based on your requirements. For details about the startup parameters, see Table 1. You can also run the ./npu-exporter -h command to view the parameter descriptions.
  6. Run the command that matches the node type to start the NPU-Exporter:
    • If the Atlas 200I SoC A1 core board is used in the Kubernetes cluster, run the following command:
      kubectl apply -f npu-exporter-310P-1usoc-*.yaml
    • If other types of nodes are used in the Kubernetes cluster, run the following command:
      kubectl apply -f npu-exporter-*.yaml

    If the Kubernetes cluster uses both Atlas 200I SoC A1 core boards and other types of nodes, run both commands.

    The following is an example:
    root@ubuntu:/home/ascend-npu-exporter# kubectl apply -f npu-exporter-v3.0.0.yaml 
    namespace/npu-exporter unchanged
    networkpolicy.networking.k8s.io/exporter-network-policy unchanged
    daemonset.apps/npu-exporter created
    root@ubuntu:/home/ascend-npu-exporter# kubectl get pod -n npu-exporter
    NAME                            READY   STATUS    RESTARTS   AGE
    ...
    npu-exporter-hqpxl              1/1     Running   0          11s
    ...

    The NPU-Exporter depends on its runtime environment. When it runs in a container, ensure that the /sys directory and the socket files used for communication are mounted into the NPU-Exporter container.
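
Two checks in this procedure lend themselves to scripting: confirming in step 1 that the image is present, and confirming after steps 3 and 4 that the edited manifest no longer contains the deleted entries. The following is a minimal sketch: image_present and manifest_lacks are hypothetical helper names, and the image tag and file name in the usage line are taken from the examples above.

```shell
#!/usr/bin/env bash
# Sketch only: checks to run before kubectl apply.

# Reads `docker images` output on stdin; succeeds if a row has repository $1 and tag $2.
image_present() {
    awk -v repo="$1" -v tag="$2" '$1 == repo && $2 == tag { found = 1 } END { exit !found }'
}

# Succeeds if file $1 is readable and contains no line matching the extended regex $2.
manifest_lacks() {
    [ -r "$1" ] || return 2
    ! grep -Eq -- "$2" "$1"
}
```

For example, docker images | image_present npu-exporter v3.0.0 checks the image, and manifest_lacks npu-exporter-v3.0.0.yaml 'kmckeystore|kmckeybak|kmc-exporter' checks a manifest prepared for HTTP mode. For containerd mode, the pattern 'dockershim\.sock' can be used in the same way.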

Parameters

Table 1 NPU-Exporter startup parameters

-port
    Type: int
    Default value: 8082
    Description: Listening port. The value ranges from 1025 to 40000.

-updateTime
    Type: int
    Default value: 5
    Description: Information update period. The value ranges from 1 to 60, in seconds. If this parameter is set to a large value, containers whose lifetime is shorter than the update period may fail to be reported.

-ip
    Type: string
    Default value: None. This parameter has no default value and must be set.
    Description: Listening IP address. You are advised not to set this parameter to 0.0.0.0 on a host with multiple NICs.

-enableHTTP
    Type: bool
    Default value: false
    Description: Whether to enable HTTP.
    NOTE: If this parameter is set to true, security risks may exist. You are advised to use the default value false.

-version
    Type: bool
    Default value: false
    Description: Whether to print the program version number.

-tlsSuites
    Type: int
    Default value: 1
    Description: TLS cipher suite configuration.
      • 0: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
      • 1: TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
    NOTE: Invalid values are reset to the default value. A suite takes effect only when a certificate in the corresponding format is used.

-concurrency
    Type: int
    Default value: 5
    Description: Traffic limit of the HTTP/HTTPS service, expressed as the number of concurrent requests. The value ranges from 1 to 512.

-logLevel
    Type: int
    Default value: 0
    Description: Log level.
      • -1: debug
      • 0: info
      • 1: warning
      • 2: error
      • 3: critical

-maxAge
    Type: int
    Default value: 7
    Description: Maximum retention period of log backups. The value ranges from 7 to 700, in days.

-logFile
    Type: string
    Default value: /var/log/mindx-dl/npu-exporter/npu-exporter.log
    Description: Log file.
    NOTE: When the size of a log file exceeds 20 MB, the file is automatically dumped. The maximum size of a log file cannot be changed.

-maxBackups
    Type: int
    Default value: 30
    Description: Maximum number of dumped log files that can be retained. The value ranges from 1 to 30.

-containerMode
    Type: string
    Default value: docker
    Description: Container runtime type.
      • docker: Docker is used as the container runtime in the current environment.
      • containerd: containerd is used as the container runtime in the current environment.

-containerd
    Type: string
    Default value: unix:///var/run/docker/containerd/docker-containerd.sock or unix:///run/containerd/containerd.sock
    Description: Endpoint of the containerd daemon process, which is used to communicate with containerd.
      • If containerMode is set to docker, the default value of this parameter is /var/run/docker/containerd/docker-containerd.sock. If the connection fails, the system automatically attempts to connect to unix:///run/containerd/containerd.sock.
      • If containerMode is set to containerd, the default value of this parameter is /run/containerd/containerd.sock.
    Generally, retain the default value. If you have changed the sock file path of containerd, run ps aux | grep containerd to query the path.

-endpoint
    Type: string
    Default value: unix:///var/run/dockershim.sock or unix:///run/containerd/containerd.sock
    Description: Sock address of the CRI server.
      • If containerMode is set to docker, the NPU-Exporter connects to dockershim to obtain the container list. The default value is /var/run/dockershim.sock.
      • If containerMode is set to containerd, the default value is /run/containerd/containerd.sock.
    Generally, retain the default value unless you have changed the sock file path of dockershim or containerd.

-checkInterval
    Type: int
    Default value: 1
    Description: Interval for checking the certificate validity, in days. The value ranges from 1 to 7.

-warningDays
    Type: int
    Default value: 100
    Description: Number of days before the certificate expires at which a warning log is generated. The value ranges from 10 to 365.

-limitIPConn
    Type: int
    Default value: 5
    Description: Maximum number of TCP connections for each IP address. The value ranges from 1 to 128.

-limitTotalConn
    Type: int
    Default value: 20
    Description: Maximum number of TCP connections for the program. The value ranges from 1 to 512.

-limitIPReq
    Type: string
    Default value: 20/1
    Description: Request rate limit for each IP address. The value 20/1 indicates that a maximum of 20 requests are allowed per second. At most three digits are supported on each side of the slash (/).

-cacheSize
    Type: int
    Default value: 102400
    Description: Number of cache keys. The value ranges from 1 to 1024000.

-h
    Type: None
    Default value: N/A
    Description: Displays help information.
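
The value ranges in Table 1 can be checked before the parameters are written to the service file or the YAML file. The following is a minimal sketch: in_range, valid_limit_ip_req, and check_param are hypothetical helper names, the ranges are copied from the table, and parameters without a numeric range are accepted unchecked.

```shell
#!/usr/bin/env bash
# Sketch only: validates name=value startup parameters against Table 1.

# Succeeds if $1 is an integer within [$2, $3].
in_range() {
    [[ $1 =~ ^-?[0-9]+$ ]] && [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Succeeds if $1 has at most three digits on each side of the slash.
valid_limit_ip_req() {
    [[ $1 =~ ^[0-9]{1,3}/[0-9]{1,3}$ ]]
}

# Checks one argument of the form -name=value.
check_param() {
    local name=${1%%=*} value=${1#*=}
    case $name in
        -port)           in_range "$value" 1025 40000 ;;
        -updateTime)     in_range "$value" 1 60 ;;
        -concurrency)    in_range "$value" 1 512 ;;
        -logLevel)       in_range "$value" -1 3 ;;
        -maxAge)         in_range "$value" 7 700 ;;
        -maxBackups)     in_range "$value" 1 30 ;;
        -checkInterval)  in_range "$value" 1 7 ;;
        -warningDays)    in_range "$value" 10 365 ;;
        -limitIPConn)    in_range "$value" 1 128 ;;
        -limitTotalConn) in_range "$value" 1 512 ;;
        -cacheSize)      in_range "$value" 1 1024000 ;;
        -limitIPReq)     valid_limit_ip_req "$value" ;;
        *)               return 0 ;;  # Not range-checked in this sketch.
    esac
}
```

For example, check_param -port=8082 succeeds, whereas check_param -port=80 returns a nonzero status because 80 is below the documented minimum of 1025.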