Starting the Ascend Device Plugin

Constraints

  • The Ascend Device Plugin periodically calls APIs related to the NPU driver. To upgrade the driver, stop the Ascend Device Plugin first.
  • The DCMI dynamic library invoked by the Ascend Device Plugin and all of its parent directories must be owned by root; otherwise, the program cannot run. In addition, group and other users must not have write permission on these files and directories.
  • The length of the DCMI dynamic library path must be less than 20.
  • If the dynamic library path is specified through LD_LIBRARY_PATH, the total length of LD_LIBRARY_PATH cannot exceed 1024.
  • To deploy the Ascend Device Plugin in a container on an Atlas 200I SoC A1 core board node, configure the multi-container sharing mode. For details, see the Atlas 200I SoC A1 Core Board NPU Driver and Firmware Installation Guide.
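
The ownership, permission, and length constraints above can be checked before the plugin is started. The following is a minimal sketch rather than an official tool: check_lib_path and ld_path_len_ok are hypothetical helper names, and the library path passed to check_lib_path is a placeholder that must be replaced with the actual location of the DCMI dynamic library on the node.

```shell
# Hypothetical helpers (not part of the Ascend Device Plugin package).

# Print each path, from the given file up to /, that is not owned by root
# or that is writable by group or other. No output means the check passed.
check_lib_path() {
    p=$1
    while :; do
        find "$p" -prune \( ! -user root -o -perm -020 -o -perm -002 \) -print
        case "$p" in
            /|.) break ;;
        esac
        p=$(dirname "$p")
    done
}

# Succeed only if the given LD_LIBRARY_PATH value is at most 1024 characters long.
ld_path_len_ok() {
    [ "${#1}" -le 1024 ]
}
```

For example, check_lib_path /path/to/libdcmi.so (placeholder path) lists the offending files and directories, and ld_path_len_ok "$LD_LIBRARY_PATH" returns a non-zero status when the variable is too long.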

Procedure

The Ascend Device Plugin supports two installation modes. You can select either of them as required.

Binary-based Installation

The Ascend Device Plugin component runs as the root user in a privileged container. If the container is maliciously exploited, container escape may occur. On nodes where the KubeConfig file has been imported, you are advised to start the component as a binary service on the physical machine.

  1. Log in to the server as the root user, upload the software package to any directory (for example, /home/ascend-device-plugin) on the server where the component is to be installed, and decompress it.
  2. Create the device-plugin.service file and modify the NODE_NAME environment variable.
    [Unit]
    Description=Ascend K8s device plugin 
    Documentation=hiascend.com
    
    [Service]
    ExecStart=/bin/bash -c "/usr/local/bin/device-plugin -volcanoType=true -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log>/dev/null  2>&1 &"
    Restart=always
    RestartSec=2
    KillMode=process
    Environment="LD_LIBRARY_PATH=/usr/local/kmc"
    Environment="GOGC=50"
    Environment="GOMAXPROCS=2"
    Environment="GODEBUG=madvdontneed=1"
    Environment="NODE_NAME=<Current Kubernetes node name>"
    Type=forking
    User=hwMindX
    Group=hwMindX
    
    [Install]
    WantedBy=multi-user.target
  3. To modify startup parameters, modify the ExecStart field in device-plugin.service. For details, see Table 2.
    ...
    [Service]
    ExecStart=/bin/bash -c "/usr/local/bin/device-plugin -volcanoType=true -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log &"
    Restart=always
    RestartSec=2
    ...

    If the deployment node is an Atlas 200I SoC A1 core board, set -useAscendDocker to false in the ExecStart field.

  4. Skip this step if the deployment node is not an Atlas 200I SoC A1 core board node. Otherwise, add the hwMindX user to the HwBaseUser and HwDmUser user groups on all nodes of this type by running the following commands:
    usermod -a -G HwBaseUser hwMindX
    usermod -a -G HwDmUser hwMindX
  5. Run the following commands to install the files, set their permissions, and enable and start the device-plugin service:
    cd /home/ascend-device-plugin
    mkdir -p /usr/local/kmc
    cp -r lib/* /usr/local/kmc
    cp device-plugin /usr/local/bin
    cp device-plugin.service /etc/systemd/system
    chmod 444 /usr/local/kmc/*
    chmod 755 /usr/local/kmc
    chmod 500 /usr/local/bin/device-plugin
    chown hwMindX:hwMindX /usr/local/bin/device-plugin
    setcap CAP_CHOWN,CAP_DAC_OVERRIDE+ep /usr/local/bin/device-plugin
    chattr +i /usr/local/bin/device-plugin
    tee /etc/ld.so.conf.d/ascend_dl_so.conf <<- EOF
    /usr/local/kmc
    EOF
    ldconfig
    chattr +i /etc/systemd/system/device-plugin.service
    systemctl enable device-plugin
    systemctl start device-plugin

    The Ascend Device Plugin needs to access the host path /var/lib/kubelet/device-plugins/ and must change the owner of the socket file it creates there to root. This is why the setcap command above grants the CAP_CHOWN capability.

    The lib directory, which holds the dynamic libraries that the encryption component depends on, contains the libcrypto.so dynamic library. This library may conflict with the built-in system library in some environments. If an OpenSSL-related error occurs during the installation, rectify the fault by referring to Troubleshooting.
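
The file modes, capabilities, and attributes set in step 5 can be spot-checked after the service is started. The following is a minimal sketch that assumes the paths used in the commands above. has_mode and verify_install are hypothetical helper names, and getcap, lsattr, and systemctl must be available on the node.

```shell
# Hypothetical helpers for checking the result of step 5.

# Succeed only if the permission bits of path $1 equal the octal mode $2.
has_mode() {
    [ -n "$(find "$1" -prune -perm "$2" -print 2>/dev/null)" ]
}

# Report any deviation from the modes set in step 5, then show the file
# capabilities, immutable attributes, and service state for manual review.
verify_install() {
    has_mode /usr/local/kmc 755 || echo "unexpected mode: /usr/local/kmc"
    has_mode /usr/local/bin/device-plugin 500 || echo "unexpected mode: /usr/local/bin/device-plugin"
    getcap /usr/local/bin/device-plugin   # should list cap_chown and cap_dac_override
    lsattr /usr/local/bin/device-plugin /etc/systemd/system/device-plugin.service
    systemctl is-active device-plugin
}
```

Run verify_install as the root user on the node where the binary service was installed.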

Container-based Installation

  1. Log in to each worker node as the root user and run the following command to check whether the image and version are correct:
    docker images | grep k8sdeviceplugin

    Example:

    root@ubuntu:~# docker images | grep k8sdeviceplugin
    ascend-k8sdeviceplugin               v3.0.0              29eec79eb693        About an hour ago   105MB
    • If the image and version are correct, go to step 2.
    • Otherwise, create an image and distribute it. For details, see Creating an Image.
  2. Copy the YAML file in the software package to the Kubernetes master node. Note that the YAML file must match the specific processor model. For details, see the following table.
    Table 1 YAML files of the Ascend Device Plugin

    YAML File                                        Description
    device-plugin-310-{version}.yaml                 Configuration file used when Volcano is not used on Ascend 310 devices
    device-plugin-310P-{version}.yaml                Configuration file used when Volcano is not used on Ascend 310P devices
    device-plugin-910-{version}.yaml                 Configuration file used when Volcano is not used on Ascend 910 devices
    device-plugin-310P-1usoc-{version}.yaml          Configuration file used when Volcano is not used on Atlas 200I SoC A1 core boards
    device-plugin-volcano-{version}.yaml             Configuration file used when Volcano is used on Ascend 910 devices
    device-plugin-310-volcano-{version}.yaml         Configuration file used when Volcano is used on Ascend 310 devices
    device-plugin-310P-volcano-{version}.yaml        Configuration file used when Volcano is used on Ascend 310P devices
    device-plugin-310P-1usoc-volcano-{version}.yaml  Configuration file used when Volcano is used on Atlas 200I SoC A1 core boards

  3. Skip this step if you do not need to modify the component startup parameters. Otherwise, modify the startup parameters of the Ascend Device Plugin based on your requirements. For details about the startup parameters, see Table 2. You can run the ./device-plugin -h command to view the parameter descriptions.
    • On an Atlas 200I SoC A1 core board node, modify the startup parameters of the Ascend Device Plugin in the startup script run_for_310P_1usoc.sh. After the modification, create images on all Atlas 200I SoC A1 core board nodes, or create an image on a local node and distribute the image to the other Atlas 200I SoC A1 core board nodes.

      If the deployment node is an Atlas 200I SoC A1 core board, set useAscendDocker to false in the startup parameters.

    • For other types of nodes, modify the startup parameters of the Ascend Device Plugin in the corresponding startup YAML file.
  4. If Volcano is not used as the scheduler (the startup parameter volcanoType of the Ascend Device Plugin is set to false), skip this step. If Volcano is used (volcanoType is set to true), the Ascend Device Plugin needs to operate on Kubernetes resources and must therefore be authorized in either of the following ways:
    • If you use the ServiceAccount created in the YAML file and do not import the KubeConfig file, no further configuration is required.
    • If you import the authorized KubeConfig file, add the mount configuration of the security hardening paths to the component startup YAML file.
      The following is a startup YAML file with security hardening enabled on an Ascend 910 AI Processor node (used together with Volcano and HCCL-Controller). The security hardening mount paths are configured in the same way in the YAML files for other processor types.
      ...
      apiVersion: apps/v1
      kind: DaemonSet
      metadata:
        name: ascend-device-plugin-daemonset-910
        namespace: kube-system
      spec:
        selector:
          matchLabels:
            name: ascend-device-plugin-ds
      ...
          spec:
      ...
            containers:
              - image: ascend-k8sdeviceplugin:v2.0.3  # Replace with the actual version.
      ...
                volumeMounts:
                  
      ...
                  - name: kmckeystore
                    mountPath: /etc/mindx-dl/kmc_primary_store
                  - name: kmckeybak
                    mountPath: /etc/mindx-dl/.config
                  - name: kmc-deviceplugin
                    mountPath: /etc/mindx-dl/device-plugin
      ...
            volumes:
      
      ...
              - name: kmckeystore
                hostPath:
                  path: /etc/mindx-dl/kmc_primary_store
              - name: kmckeybak
                hostPath:
                  path: /etc/mindx-dl/.config
              - name: kmc-deviceplugin
                hostPath:
                  path: /etc/mindx-dl/device-plugin
      ...
  5. Run the following commands on the Kubernetes master node to start the corresponding services:
    • Ascend 310 AI Processor nodes exist in the Kubernetes cluster. (The Ascend Device Plugin works independently without Volcano.)
      kubectl apply -f device-plugin-310-*.yaml
    • Ascend 310P AI Processor nodes exist in the Kubernetes cluster. (The Ascend Device Plugin works independently without Volcano.)
      kubectl apply -f device-plugin-310P-*.yaml
    • Ascend 910 AI Processor nodes exist in the Kubernetes cluster. (The Ascend Device Plugin works independently without Volcano and HCCL-Controller.)
      kubectl apply -f device-plugin-910-*.yaml
    • Ascend 310 AI Processor nodes exist in the Kubernetes cluster. (The Ascend Device Plugin works together with Volcano.)
      kubectl apply -f device-plugin-310-volcano-*.yaml
    • Ascend 310P AI Processor nodes exist in the Kubernetes cluster. (The Ascend Device Plugin works together with Volcano.)
      kubectl apply -f device-plugin-310P-volcano-*.yaml
    • Ascend 910 AI Processor nodes exist in the Kubernetes cluster. (The Ascend Device Plugin works together with Volcano and HCCL-Controller.)
      kubectl apply -f device-plugin-volcano-*.yaml
    • Atlas 200I SoC A1 core board nodes exist in the Kubernetes cluster. (The Ascend Device Plugin works together with Volcano.)
      kubectl apply -f device-plugin-310P-1usoc-volcano-*.yaml
    • Atlas 200I SoC A1 core board nodes exist in the Kubernetes cluster. (The Ascend Device Plugin works independently without Volcano.)
      kubectl apply -f device-plugin-310P-1usoc-*.yaml

    If the Kubernetes cluster uses multiple types of Ascend AI Processors, run the corresponding command for each type.

    The following is an example:

    root@ubuntu:/home/ascend-device-plugin# kubectl apply -f device-plugin-volcano-v3.0.0.yaml
    serviceaccount/ascend-device-plugin-sa created
    clusterrole.rbac.authorization.k8s.io/pods-node-ascend-device-plugin-role created
    clusterrolebinding.rbac.authorization.k8s.io/pods-node-ascend-device-plugin-rolebinding created
    daemonset.apps/ascend-device-plugin-daemonset created
    root@ubuntu:/home/ascend-device-plugin# kubectl get pod -n kube-system
    NAME                                       READY   STATUS    RESTARTS   AGE
    ...
    ascend-device-plugin-daemonset-d5ctz       1/1     Running   0          11s
    ...
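
The wildcard patterns in step 5 are expanded by the shell to every matching file in the current directory. For example, device-plugin-310P-*.yaml also matches the volcano and 1usoc variants if those files are stored in the same directory. The following is a minimal sketch that avoids this by building the exact file name listed in Table 1. yaml_name is a hypothetical helper, and its chip labels (310, 310P, 910, 310P-1usoc) are chosen here for illustration only.

```shell
# Hypothetical helper: print the YAML file name from Table 1.
#   $1  chip label: 310, 310P, 910, or 310P-1usoc (labels assumed for this sketch)
#   $2  true if Volcano is used, false otherwise
#   $3  version string, for example v3.0.0
yaml_name() {
    chip=$1
    volcano=$2
    version=$3
    case "$chip:$volcano" in
        310:false)        echo "device-plugin-310-$version.yaml" ;;
        310P:false)       echo "device-plugin-310P-$version.yaml" ;;
        910:false)        echo "device-plugin-910-$version.yaml" ;;
        310P-1usoc:false) echo "device-plugin-310P-1usoc-$version.yaml" ;;
        910:true)         echo "device-plugin-volcano-$version.yaml" ;;
        310:true)         echo "device-plugin-310-volcano-$version.yaml" ;;
        310P:true)        echo "device-plugin-310P-volcano-$version.yaml" ;;
        310P-1usoc:true)  echo "device-plugin-310P-1usoc-volcano-$version.yaml" ;;
        *)
            echo "unsupported combination: chip=$chip volcano=$volcano" >&2
            return 1
            ;;
    esac
}
```

The result can be passed directly to kubectl, for example: kubectl apply -f "$(yaml_name 910 true v3.0.0)".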

Parameters

Table 2 Ascend Device Plugin startup parameters

Parameter (type; default value)

Description

-mode (string; default: None)

Running mode of the Ascend Device Plugin. If this parameter is not specified, the running mode is automatically determined based on the NPU type. If automatic detection finds none of the following modes, the plugin fails to start.

  • ascend310: running in Ascend 310 AI Processor mode
  • ascend310P: running in Ascend 310P AI Processor mode
  • ascend910: running in Ascend 910 AI Processor mode

NOTE:

In MindX 3.0.0, the running mode can only be obtained automatically and cannot be specified. If the running mode cannot be obtained, the plugin fails to start. In versions later than MindX 3.0.0, the Ascend Device Plugin no longer provides this parameter. Check the software version before using this parameter.

-fdFlag (bool; default: false)

Edge scenario flag, indicating whether devices are managed with FusionDirector.

-edgeLogFile (string; default: /var/alog/AtlasEdge_log/devicePlugin.log)

Log file in the edge scenario. This parameter is valid only when fdFlag is set to true.

NOTE:

If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed.

-useAscendDocker (bool; default: true)

Whether to use Ascend Docker Runtime. If CPU binding of Kubernetes has been enabled, set useAscendDocker to false regardless of whether Ascend Docker Runtime is used.

NOTE:

For details about how to install Ascend Docker Runtime, see Installing the Ascend Docker Runtime.

-volcanoType (bool; default: false)

Whether to use Volcano for scheduling. Currently, Ascend 910, Ascend 310P, and Ascend 310 AI Processors are supported.

-kubeConfig (string; default: /etc/mindx-dl/device-plugin/.config/config6)

This parameter is valid only when volcanoType is set to true.

The default value is the path for storing the encrypted KubeConfig file. A KubeConfig file in a user-defined path is also supported. If the configuration file does not exist in the default path, InClusterConfig is used.

NOTE:

This file must be encrypted using the certificate import tool. A plaintext file is not supported.

-presetVirtualDevice (bool; default: true)

Whether to enable static computing power allocation. Currently, the Ascend 910 and Ascend 310P AI Processors are supported, and the value can only be true.

-version (bool; default: false)

Displays the version of the Ascend Device Plugin.

-listWatchPeriod (int; default: 5)

Health check period, in seconds. The value ranges from 3 to 60.

-autoStowing (bool; default: true)

Whether to automatically manage recovered devices. This parameter is valid only when volcanoType is set to true.

  • true: The recovered devices will be automatically managed.
  • false: The recovered devices will not be automatically managed.

NOTE:

If a device is faulty, it is automatically isolated from Kubernetes. If the device recovers, it is automatically added back to the Kubernetes cluster resource pool by default. If the device is unstable, set this parameter to false and manage the device manually.

  • Run the following command to add the processors whose health status is restored from unhealthy to healthy to the resource pool:
    kubectl label nodes node_name huawei.com/Ascend910-Recover-
  • Run the following command to add the processors whose parameter plane network health status is restored from unhealthy to healthy to the resource pool:
    kubectl label nodes node_name huawei.com/Ascend910-NetworkRecover-

-logLevel (int; default: 0)

Log level.

  • -1: debug
  • 0: info
  • 1: warning
  • 2: error
  • 3: critical

-maxAge (int; default: 7)

Log backup time limit, in days. The value ranges from 7 to 700.

-logFile (string; default: /var/log/mindx-dl/devicePlugin/devicePlugin.log)

Log file in non-edge scenarios. This parameter is valid only when fdFlag is set to false.

NOTE:

If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed.

-maxBackups (int; default: 30)

Maximum number of dumped log files that can be retained. The value range is (0, 30].

-h (type: None; default: N/A)

Help information.
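
The numeric ranges in Table 2 can be checked before a value is written into the ExecStart field or the startup YAML file. The following is a minimal sketch. in_range and check_param are hypothetical helper names. Only the parameters with a documented numeric range are covered, and all other parameters are accepted without checking.

```shell
# Hypothetical helpers (not part of the Ascend Device Plugin package).

# Succeed only if $1 is an integer with $2 <= $1 <= $3.
in_range() {
    [ "$1" -ge "$2" ] 2>/dev/null && [ "$1" -le "$3" ] 2>/dev/null
}

# Succeed only if value $2 lies in the range documented for parameter $1.
check_param() {
    case "$1" in
        -listWatchPeriod) in_range "$2" 3 60 ;;    # seconds
        -maxAge)          in_range "$2" 7 700 ;;   # days
        -maxBackups)      in_range "$2" 1 30 ;;    # documented range (0, 30]
        -logLevel)        in_range "$2" -1 3 ;;    # debug (-1) to critical (3)
        *)                return 0 ;;              # no documented numeric range
    esac
}
```

For example, check_param -maxAge 5 returns a non-zero status because the documented minimum is 7 days.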