Ascend Device Plugin

  • Ascend Device Plugin must be installed on the compute node if you need any of the following functions: full NPU scheduling, static vNPU scheduling, dynamic vNPU scheduling, resumable training, elastic training, recovery of inference card faults, or rescheduling upon inference card faults.
  • If you need only containerization and resource monitoring functions, you do not need to install Ascend Device Plugin. In this case, skip this section.

Restrictions

Before installing Ascend Device Plugin, you need to understand related restrictions. For details, see Table 1.

Table 1 Restrictions

Scenario

Restrictions

NPU driver

Ascend Device Plugin periodically calls NPU-related interfaces. Before upgrading the driver, stop service tasks first and then stop the Ascend Device Plugin container service.

Used together with Ascend Docker Runtime

The requirements for the component installation sequence are as follows:

When running in containerized mode, Ascend Device Plugin automatically detects whether Ascend Docker Runtime is installed. For the detection to work correctly, install Ascend Docker Runtime before starting Ascend Device Plugin.

If Ascend Device Plugin is deployed on an Atlas 200I SoC A1 core board, you do not need to install Ascend Docker Runtime.

The component version requirements are as follows:

Automatic identification requires that Ascend Docker Runtime and Ascend Device Plugin have the same version, which must be 5.0.RC1 or later. After installing or uninstalling Ascend Docker Runtime, restart the container engine so that Ascend Device Plugin can correctly identify the change.

Ascend Device Plugin and Ascend Docker Runtime cannot be used together in the following scenarios:
  • Mixed insertion
  • Atlas 200I SoC A1 core board

DCMI dynamic library

The permission requirements for the DCMI dynamic library directories are as follows:

The DCMI dynamic library invoked by Ascend Device Plugin and all of its parent directories must be owned by root; otherwise, the program cannot run. In addition, group and other users must not have write permission on these files and directories.

The DCMI dynamic library path depth must be less than 20.

If the dynamic library path is set through LD_LIBRARY_PATH, the total length of LD_LIBRARY_PATH cannot exceed 1024 characters.

Atlas 200I SoC A1 core board

To deploy Ascend Device Plugin on an Atlas 200I SoC A1 core board in containerized mode, you need to configure the multi-container sharing mode.

To use Ascend Device Plugin on an Atlas 200I SoC A1 core board, note the following version mapping:
  • Ascend Device Plugin 5.0.RC2 must be used with a 23.0.RC2 driver or later of an Atlas 200I SoC A1 core board.
  • Ascend Device Plugin of versions earlier than 5.0.RC2 can be used only with drivers earlier than 23.0.RC2 of an Atlas 200I SoC A1 core board.

VM scenario

To deploy Ascend Device Plugin on VMs, systemd must be installed in the Ascend Device Plugin image. You are advised to add the RUN apt-get update && apt-get install -y systemd instruction to the Dockerfile.

Restart scenario

After Ascend Device Plugin is installed, if the basic NPU information is modified, for example, the device IP address, you need to restart Ascend Device Plugin. Otherwise, Ascend Device Plugin cannot correctly identify the NPU information.
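The DCMI dynamic library restrictions in Table 1 can be checked before installation. The following is a minimal POSIX shell sketch, not part of Ascend Device Plugin. It assumes that "path depth" means the number of components in the absolute path and that the LD_LIBRARY_PATH limit is counted in characters. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for the DCMI restrictions in Table 1.

# Print the number of components in an absolute path by counting "/" characters.
# Assumption: this is what "path depth" means; the path has no trailing slash.
path_depth() {
    printf '%s' "$1" | tr -cd '/' | wc -c | tr -d ' '
}

# Print "ok" if the path depth is less than 20, otherwise "too deep".
check_dcmi_depth() {
    if [ "$(path_depth "$1")" -lt 20 ]; then
        echo "ok"
    else
        echo "too deep"
    fi
}

# Print "ok" if the given LD_LIBRARY_PATH value is at most 1024 characters long,
# otherwise "too long".
check_ld_library_path_len() {
    value=$1
    if [ "${#value}" -le 1024 ]; then
        echo "ok"
    else
        echo "too long"
    fi
}
```

For example, check_ld_library_path_len "$LD_LIBRARY_PATH" reports whether the current value fits the limit. Ownership and write permissions can be inspected by running ls -ld on the library file and on each of its parent directories.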

Procedure

  1. Log in to each compute node as the root user and check whether the image and version are correct.
    docker images | grep k8sdeviceplugin

    Command output:

    ascend-k8sdeviceplugin               v7.3.0              29eec79eb693        About an hour ago   105MB
    
  2. Copy the YAML file in the directory where the Ascend Device Plugin package is decompressed to any directory on the Kubernetes management node. Note that you need to use the YAML file that adapts to the specific processor model. To prevent exceptions in the automatic identification of Ascend Docker Runtime, do not modify the DaemonSet.metadata.name field in the YAML file. For details, see the following table.
    Table 2 YAML files of Ascend Device Plugin

    YAML File                                         Description
    device-plugin-310-v{version}.yaml                 Configuration file used when Volcano is not used on an inference server (equipped with Atlas 300I inference cards).
    device-plugin-310-volcano-v{version}.yaml         Configuration file used when Volcano is used on an inference server (equipped with Atlas 300I inference cards).
    device-plugin-310P-1usoc-v{version}.yaml          Configuration file used when Volcano is not used on Atlas 200I SoC A1 core boards.
    device-plugin-310P-1usoc-volcano-v{version}.yaml  Configuration file used when Volcano is used on Atlas 200I SoC A1 core boards.
    device-plugin-310P-v{version}.yaml                Configuration file used when Volcano is not used on Atlas inference product.
    device-plugin-310P-volcano-v{version}.yaml        Configuration file used when Volcano is used on Atlas inference product.
    device-plugin-910-v{version}.yaml                 Configuration file used when Volcano is not used on Atlas training product, Atlas A2 training product, Atlas A3 training product, Atlas 800I A2 inference server, or A200I A2 Box heterogeneous component.
    device-plugin-volcano-v{version}.yaml             Configuration file used when Volcano is used on Atlas training product, Atlas A2 training product, Atlas A3 training product, Atlas 800I A2 inference server, or A200I A2 Box heterogeneous component.
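
The mapping in Table 2 can be expressed as a small lookup. The sketch below is illustrative shell, not a tool shipped with Ascend Device Plugin. The function name and the short product keys (310, 310P, 310P-1usoc, 910) are hypothetical labels chosen for this example, and the version string passed in is assumed to be the part that follows "v" in the file name. The file name patterns themselves are taken from Table 2.

```shell
# Print the YAML file name listed in Table 2.
# $1: product key (310, 310P, 310P-1usoc, or 910; hypothetical labels)
# $2: "true" if Volcano is used, any other value if it is not
# $3: version string that replaces {version} in the file name
select_device_plugin_yaml() {
    product=$1
    volcano=$2
    version=$3
    case "$product" in
        310|310P|310P-1usoc)
            if [ "$volcano" = "true" ]; then
                echo "device-plugin-${product}-volcano-v${version}.yaml"
            else
                echo "device-plugin-${product}-v${version}.yaml"
            fi
            ;;
        910)
            # Table 2 lists the Volcano variant for these products without "910" in the name.
            if [ "$volcano" = "true" ]; then
                echo "device-plugin-volcano-v${version}.yaml"
            else
                echo "device-plugin-910-v${version}.yaml"
            fi
            ;;
        *)
            echo "unknown product key: $product" >&2
            return 1
            ;;
    esac
}
```

Keeping the lookup in one place makes the irregular name of the Volcano variant for training products explicit instead of relying on a naming pattern.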

  3. Skip this step if you do not need to modify the component startup parameters. Otherwise, modify the startup parameters of Ascend Device Plugin based on your requirements. For details about the startup parameters, see Table 3. You can run the ./device-plugin -h command to view the parameter descriptions.
    • On the Atlas 200I SoC A1 core board, modify the Ascend Device Plugin startup parameters in the startup script run_for_310P_1usoc.sh. After the modification, create images on all Atlas 200I SoC A1 core board nodes, or create an image on a local node and distribute the image to other Atlas 200I SoC A1 core board nodes.

      If Volcano is not used as the scheduler, set the -volcanoType startup parameter to false in the run_for_310P_1usoc.sh file before starting Ascend Device Plugin.

    • For other types of nodes, modify the Ascend Device Plugin's startup parameters in the corresponding startup YAML file.
  4. (Optional) When resumable training (including process-level recovery) or elastic training is used, modify the startup YAML file of Ascend Device Plugin based on the fault handling mode.
    ...
          containers:
          - image: ascend-k8sdeviceplugin:v7.3.0
            name: device-plugin-01
            resources:
              requests:
                memory: 500Mi
                cpu: 500m
              limits:
                memory: 500Mi
                cpu: 500m
            command: [ "/bin/bash", "-c", "--"]
            args: [ "device-plugin  
                     -useAscendDocker=true 
                     -volcanoType=true                    # Volcano must be used in the rescheduling scenario.
                     -autoStowing=true                    # Whether to enable automatic management. The default value is true. If this parameter is set to false, automatic management is disabled. In this case, after the processor health status changes from unhealthy to healthy, or the network fault on the processor parameter plane is recovered, the processor will not be automatically added to the schedulable resource pool. This parameter applies only to Atlas training product.
                     -listWatchPeriod=5                   # Set the health status check period. The value range is [3, 1800], in seconds.
                     -hotReset=2 # When process-level recovery is used, set hotReset to 2 to enable offline recovery.
                     -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log 
                     -logLevel=0" ]
            securityContext:
              privileged: true
              readOnlyRootFilesystem: true
    ...
  5. (Optional) Configure the hot reset function when recovery of inference card faults is enabled.
          containers:
          - image: ascend-k8sdeviceplugin:v7.3.0
            name: device-plugin-01
            resources:
              requests:
                memory: 500Mi
                cpu: 500m
              limits:
                memory: 500Mi
                cpu: 500m
            command: [ "/bin/bash", "-c", "--"]
            args: [ "device-plugin  
    ...
                     -hotReset=0 # Enable the hot reset function when recovery of inference card faults is enabled.
                     -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log 
                     -logLevel=0" ]
    ...
  6. (Optional) If you need to change the default port of kubelet, modify the startup YAML file of Ascend Device Plugin. Example:
      env:
         - name: NODE_NAME
           valueFrom:
             fieldRef:
               fieldPath: spec.nodeName
         - name: HOST_IP
           valueFrom:
             fieldRef:
               fieldPath: status.hostIP
         - name: KUBELET_PORT   # Tells Ascend Device Plugin the kubelet port on the current node. Omit this field if the default kubelet port has not been changed.
           value: "10251"      
    volumes:
       - name: device-plugin
         hostPath:
           path: /var/lib/kubelet/device-plugins
    ...
  7. On the Kubernetes management node, go to the directory that contains the YAML file and run the command that matches your node type to start Ascend Device Plugin.
    • Nodes of Atlas training product, Atlas A2 training product, Atlas A3 training product, Atlas 800I A2 inference server, or A200I A2 Box heterogeneous component exist in a Kubernetes cluster. (Volcano is used together to support virtual instances. By default, static virtualization is enabled in YAML.)
      kubectl apply -f device-plugin-volcano-v{version}.yaml
    • Nodes of Atlas training product, Atlas A2 training product, Atlas A3 training product, Atlas 800I A2 inference server, or A200I A2 Box heterogeneous component exist in a Kubernetes cluster. (Ascend Device Plugin works independently without Volcano.)
      kubectl apply -f device-plugin-910-v{version}.yaml
    • Nodes of inference servers (equipped with Atlas 300I inference cards) exist in a Kubernetes cluster. (Volcano is used together.)
      kubectl apply -f device-plugin-310-volcano-v{version}.yaml
    • Nodes of inference servers (equipped with Atlas 300I inference cards) exist in a Kubernetes cluster. (Ascend Device Plugin works independently without Volcano.)
      kubectl apply -f device-plugin-310-v{version}.yaml
    • Nodes of Atlas inference product exist in a Kubernetes cluster. (Volcano is used together to support virtual instances. By default, static virtualization is enabled in YAML.)
      kubectl apply -f device-plugin-310P-volcano-v{version}.yaml
    • Nodes of Atlas inference product exist in a Kubernetes cluster. (Ascend Device Plugin works independently without Volcano.)
      kubectl apply -f device-plugin-310P-v{version}.yaml
    • Nodes of Atlas 200I SoC A1 core boards exist in a Kubernetes cluster. (Volcano is used together.)
      kubectl apply -f device-plugin-310P-1usoc-volcano-v{version}.yaml
    • Nodes of Atlas 200I SoC A1 core boards exist in a Kubernetes cluster. (Ascend Device Plugin works independently without Volcano.)
      kubectl apply -f device-plugin-310P-1usoc-v{version}.yaml

    If multiple types of Ascend AI processors are used in a Kubernetes cluster, run the corresponding command of each type.

    Startup example:

    serviceaccount/ascend-device-plugin-sa created
    clusterrole.rbac.authorization.k8s.io/pods-node-ascend-device-plugin-role created
    clusterrolebinding.rbac.authorization.k8s.io/pods-node-ascend-device-plugin-rolebinding created
    daemonset.apps/ascend-device-plugin-daemonset created
  8. Run the following command on the Kubernetes management node to check whether the component is started:
    kubectl get pod -n kube-system

    If Running is displayed in the command output, the component is started successfully.

    NAME                                   READY   STATUS    RESTARTS   AGE
    ...
    ascend-device-plugin-daemonset-d5ctz   1/1     Running   0          11s
    ...
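
The check in step 8 can be scripted. The sketch below reads the output of kubectl get pod -n kube-system on standard input and succeeds only if at least one pod whose name starts with ascend-device-plugin-daemonset is listed and every such pod is in the Running state. The function name is hypothetical, and the column layout is assumed to match the sample output above (STATUS in the third column).

```shell
# Exit status 0 if all Ascend Device Plugin pods in the input are Running,
# and 1 if any of them is not Running or none is listed.
all_device_plugin_pods_running() {
    awk '
        $1 ~ /^ascend-device-plugin-daemonset/ {
            found = 1
            if ($3 != "Running") bad = 1
        }
        END {
            if (found && !bad) exit 0
            exit 1
        }
    '
}
```

A possible invocation is kubectl get pod -n kube-system | all_device_plugin_pods_running && echo "started". Treating "no matching pod" as a failure avoids reporting success when the DaemonSet has not created any pods yet.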
    

Parameters

Table 3 Ascend Device Plugin startup parameters

Parameter

Type

Default Value

Description

-fdFlag

Bool

false

Edge scenario flag, indicating whether to manage devices with FusionDirector.

  • true: FusionDirector is used.
  • false: FusionDirector is not used.

-shareDevCount

UINT

1

Whether to enable the device sharing function. The value ranges from 1 to 100.

The default value is 1, indicating that device sharing is disabled. If the value is an integer ranging from 2 to 100, it indicates the number of shared devices virtualized by a single processor.

The following devices are supported. This parameter is invalid for other devices and does not affect the component startup.

  • Atlas 500 A2 edge station
  • Atlas 200I A2 accelerator module
  • Atlas 200I DK A2 developer kit
  • Atlas 300I Pro inference card
  • Atlas 300V video analysis card
  • Atlas 300V Pro video analysis card
NOTE:

Note the following when using the Atlas inference products listed above:

  • The device sharing function cannot be used together with static vNPU scheduling, dynamic vNPU scheduling, recovery of inference card faults, or rescheduling upon inference card faults.
  • Each job must request exactly one resource. Multi-processor and cross-processor scenarios are not supported.
  • Enable the sharing mode for the driver by setting device-share to true.

-edgeLogFile

String

/var/alog/AtlasEdge_log/devicePlugin.log

Log file in the edge scenario. This parameter is valid only when fdFlag is set to true.

NOTE:

If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed.

-useAscendDocker

Bool

true

Whether the container engine uses Ascend Docker Runtime. The default value is true. To enable the CPU core binding function of Kubernetes, you need to uninstall Ascend Docker Runtime and restart the container engine. The options are as follows:

  • true: Ascend Docker Runtime is used.
  • false: Ascend Docker Runtime is not used.
NOTE:

MindCluster 5.0.RC1 and later versions support only automatic acquisition of the running mode.

-use310PMixedInsert

Bool

false

Whether to use the mixed-insertion mode.

  • true: Mixed-insertion mode is used.
  • false: Mixed-insertion mode is not used.
NOTE:
  • Only the Atlas 300I Pro inference card, Atlas 300V video analysis card, and Atlas 300V Pro video analysis card are allowed for mixed insertion on servers.
  • The Volcano scheduling mode is not supported by a server in mixed-insertion mode.
  • The virtual instance is not supported by a server in mixed-insertion mode.
  • Rescheduling upon faults is not supported by a server in mixed-insertion mode.
  • Ascend Docker Runtime is not supported by a server in mixed-insertion mode.
  • In non-mixed insertion mode, the resource name reported by Kubernetes remains unchanged.
    • In non-mixed insertion mode, the reported resource name is in the format of "huawei.com/Ascend310P".
    • In mixed-insertion mode, the reported resource name is in the format of "huawei.com/Ascend310P-V", "huawei.com/Ascend310P-VPro", or "huawei.com/Ascend310P-IPro".

-volcanoType

Bool

false

Whether to use Volcano for scheduling, which is supported by Atlas training product, Atlas A2 training product, Atlas inference product, and inference servers (equipped with Atlas 300I inference cards).

  • true: Volcano is used.
  • false: Volcano is not used.

-presetVirtualDevice

Bool

true

Virtualization function switch.
  • true: static virtualization; supported by Atlas training product and Atlas inference product.
  • false: dynamic virtualization. Currently, dynamic virtualization is supported only on Atlas inference product. You must also enable Volcano, that is, set -volcanoType to true.

-version

Bool

false

Whether to query the Ascend Device Plugin version number.

  • true: queries the version.
  • false: does not query the version.

-listWatchPeriod

Integer

5

Health check period. The value range is [3, 1800], in seconds.

NOTE:

A check is performed in each period, and the results are written to the ConfigMap according to the following rules:

  • If the device information has not changed and less than 5 minutes have passed since the last ConfigMap update, the ConfigMap is not updated.
  • If more than 5 minutes have passed since the last ConfigMap update, the ConfigMap is updated again regardless of whether the device information has changed.

-autoStowing

Bool

true

Whether to automatically manage recovered devices. This parameter is valid only when volcanoType is set to true.

  • true: The recovered devices are automatically managed.
  • false: The recovered devices are not automatically managed.
NOTE:

If a device is faulty, it is automatically isolated from Kubernetes. If the device recovers, it is automatically added to the Kubernetes cluster resource pool by default. If the device is unstable, set this parameter to false. In this case, you need to manually manage it.

  • Run the following command to add the processors whose health status is restored from unhealthy to healthy to the resource pool:
    kubectl label nodes node_name huawei.com/Ascend910-Recover-
  • Run the following command to add the processors whose parameter plane network health status is restored from unhealthy to healthy to the resource pool:
    kubectl label nodes node_name huawei.com/Ascend910-NetworkRecover-

-logLevel

Integer

0

Log level:

  • -1: debug
  • 0: info
  • 1: warning
  • 2: error
  • 3: critical

-maxAge

Integer

7

Retention period for backup logs. The value ranges from 7 to 700, in days.

-logFile

String

/var/log/mindx-dl/devicePlugin/devicePlugin.log

Log file in non-edge scenarios. This parameter is valid only when fdFlag is set to false.

NOTE:

If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed. Dumped files are named in the format of "devicePlugin-dump triggering time.log", for example, devicePlugin-2023-10-07T03-38-24.402.log.

-hotReset

Integer

-1

Whether to enable device hot reset. After this function is enabled, if a processor is faulty, Ascend Device Plugin conducts a hot reset to restore it.
  • -1: disables processor reset.
  • 0: resets inference devices.
  • 1: resets training devices online.
  • 2: resets training/inference devices offline.
NOTE:

The value 1 can no longer be used because the corresponding function is unavailable. Set this parameter to another value.

This parameter supports the following training devices:
  • Atlas 800 training server (model 9000) (fully populated with NPUs)
  • Atlas 800 training server (model 9010) (fully populated with NPUs)
  • Atlas 900T PoD Lite
  • Atlas 900 PoD (model 9000)
  • Atlas 800T A2 training server
  • Atlas 900 A2 PoD cluster basic unit
  • Atlas 900 A3 SuperPoD
  • Atlas 800T A3 SuperPoD Server
This parameter supports the following inference devices:
  • Atlas 300I Pro inference card
  • Atlas 300V video analysis card
  • Atlas 300V Pro video analysis card
  • Atlas 300I Duo inference card

  • Atlas 300I inference card (model 3000) (entire card)
  • Atlas 300I inference card (model 3010)
  • Atlas 800I A2 inference server
  • A200I A2 Box heterogeneous component
  • Atlas 800I A3 SuperPoD Server
NOTE:
  • For the Atlas 300I Duo inference card, only card-based reset is supported. That is, the card's two processors are reset at the same time.
  • There are two hot reset modes for the Atlas 800I A2 inference server, and only one hot reset mode can be used on a single Atlas 800I A2 inference server. The cluster scheduling components automatically identify the hot reset mode to be used.
    • Mode 1: If no HCCS ring exists on the server, when an NPU is faulty during inference, Ascend Device Plugin waits until the NPU is idle and resets it.
    • Mode 2: If an HCCS ring exists on the server, when one or more NPUs are faulty during inference, Ascend Device Plugin waits until all NPUs on the ring are idle and resets them all at once.

-linkdownTimeout

Integer

30

Network linkdown timeout interval. The value ranges from 1 to 30, in seconds.

NOTE:

You are advised to set this parameter to the value of HCCL_RDMA_TIMEOUT configured in the training script. For multiple tasks, you are advised to set this parameter to the minimum value of HCCL_RDMA_TIMEOUT in the multi-task scenario.

-enableSlowNode

Bool

false

Whether to enable slow node detection (deterioration diagnosis).

  • true: enabled.
  • false: disabled.
    NOTE:

    For details about deterioration diagnosis, see "Deterioration Diagnosis" in iMaster CCAE Product Documentation.

-dealWatchHandler

Bool

false

Whether to update local pod informer cache when the informer link ends due to an exception.

  • true: updates pod informer cache.
  • false: does not update pod informer cache.

-checkCachedPods

Bool

true

Whether to periodically check pods in the cache. The default value is true. If a pod in the cache has not been updated for more than one hour, Ascend Device Plugin queries the API server for the pod status.

  • true: The check is performed.
  • false: The check is not performed.

-maxBackups

Integer

30

Maximum number of dumped log files that can be retained. The value ranges from 1 to 30.

-thirdPartyScanDelay

Integer

300

Scanning delay after Ascend Device Plugin is started, in seconds.

After Ascend Device Plugin fails to automatically reset a processor, it writes the failure information to the node annotation. A third-party platform can reset the processor based on this information. Ascend Device Plugin then waits for the period specified by this parameter before scanning devices again.

This parameter is supported only by Atlas 800T A3 SuperPoD Server.

-deviceResetTimeout

Integer

60

Maximum wait time for the driver to report complete processor information if the detected processor count is insufficient at component startup. The value ranges from 10 to 600, in seconds.

  • For the Atlas A2 training product, Atlas 800I A2 inference server, and A200I A2 Box heterogeneous component, the recommended value is 150 seconds.
  • For the Atlas A3 training product, A200T A3 Box8 SuperPoD Server, and Atlas 800I A3 SuperPoD Server, the recommended value is 360 seconds.

-h or -help

None

None

Help information.
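
The value ranges in Table 3 lend themselves to a quick pre-check before the startup YAML file is edited. The following shell sketch is an illustration only and does not reflect how Ascend Device Plugin validates its own parameters. The function names are hypothetical, and only parameters for which Table 3 states a range or a fixed set of values are covered.

```shell
# Succeed if $1 is an integer within [$2, $3].
in_range() {
    case "$1" in
        ''|-|*[!0-9-]*|?*-*) return 1 ;;   # reject empty or non-integer input
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Print "valid", "invalid", or "unchecked" for a parameter name and value.
check_param() {
    name=$1
    value=$2
    case "$name" in
        shareDevCount)      in_range "$value" 1 100 ;;
        listWatchPeriod)    in_range "$value" 3 1800 ;;
        logLevel)           in_range "$value" -1 3 ;;
        maxAge)             in_range "$value" 7 700 ;;
        linkdownTimeout)    in_range "$value" 1 30 ;;
        maxBackups)         in_range "$value" 1 30 ;;
        deviceResetTimeout) in_range "$value" 10 600 ;;
        # Table 3 notes that the value 1 is unavailable, so it is treated as invalid here.
        hotReset)           [ "$value" = "-1" ] || [ "$value" = "0" ] || [ "$value" = "2" ] ;;
        *)                  echo "unchecked"; return 0 ;;
    esac
    if [ $? -eq 0 ]; then
        echo "valid"
    else
        echo "invalid"
    fi
}
```

For example, check_param listWatchPeriod 5 prints valid, and check_param hotReset 1 prints invalid. Parameters that are not listed in the case statement are reported as unchecked rather than rejected.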