Ascend Device Plugin
- Ascend Device Plugin must be installed on the compute nodes if you need any of the following functions: full NPU scheduling, static vNPU scheduling, dynamic vNPU scheduling, resumable training, elastic training, recovery of inference card faults, or rescheduling upon inference card faults.
- If you need only containerization and resource monitoring functions, you do not need to install Ascend Device Plugin. In this case, skip this section.
Restrictions
Before installing Ascend Device Plugin, you need to understand related restrictions. For details, see Table 1.
| Scenario | Restrictions |
|---|---|
| NPU driver | Ascend Device Plugin periodically calls NPU-related interfaces. To upgrade the driver, stop service tasks and then stop the container services of Ascend Device Plugin. |
| Used together with Ascend Docker Runtime | The requirements for the component installation sequence are as follows: When Ascend Device Plugin is running in containerized mode, it automatically identifies whether Ascend Docker Runtime is installed. Ascend Device Plugin can correctly identify the Ascend Docker Runtime installation status only after Ascend Docker Runtime is installed. If Ascend Device Plugin is deployed on an Atlas 200I SoC A1 core board, you do not need to install Ascend Docker Runtime. |
| | The component version requirements are as follows: This function requires that the versions of Ascend Docker Runtime and Ascend Device Plugin be the same and be 5.0.RC1 or later. After installing or uninstalling Ascend Docker Runtime, you need to restart the container engine so that Ascend Device Plugin can correctly identify it. |
| | Ascend Device Plugin and Ascend Docker Runtime cannot be used together in the following scenarios: |
| DCMI dynamic library | The permission requirements for the DCMI dynamic library directories are as follows: The owner of the DCMI dynamic library invoked by Ascend Device Plugin and of its parent directories must be root; otherwise, the program cannot run. In addition, group and other users must not have the write permission on these files and directories. |
| | The DCMI dynamic library path depth must be less than 20. |
| | If the dynamic library path is set by setting LD_LIBRARY_PATH, the total length of LD_LIBRARY_PATH cannot exceed 1024. |
| Atlas 200I SoC A1 core board | To deploy Ascend Device Plugin on an Atlas 200I SoC A1 core board in containerized mode, you need to configure the multi-container sharing mode. |
| | To use Ascend Device Plugin on an Atlas 200I SoC A1 core board, note the following version mapping: |
| VM scenario | To deploy Ascend Device Plugin on VMs, you need to install systemd in the Ascend Device Plugin image. You are advised to add the RUN apt-get update && apt-get install -y systemd command to the Dockerfile to install systemd. |
| Restart scenario | After Ascend Device Plugin is installed, if the basic NPU information (for example, the device IP address) is modified, you need to restart Ascend Device Plugin. Otherwise, Ascend Device Plugin cannot correctly identify the NPU information. |
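
The DCMI dynamic library restrictions in Table 1 can be checked before deployment. The following is a minimal Python sketch, not part of Ascend Device Plugin. The function names are hypothetical, and "path depth" is assumed to mean the number of path components below the root directory, which the table does not define.

```python
import os
import stat
from pathlib import PurePosixPath
from typing import List

MAX_PATH_DEPTH = 20             # Table 1: path depth must be less than 20
MAX_LD_LIBRARY_PATH_LEN = 1024  # Table 1: LD_LIBRARY_PATH length cannot exceed 1024


def path_depth(path: str) -> int:
    """Number of components below "/" (assumed meaning of "path depth")."""
    return len([part for part in PurePosixPath(path).parts if part != "/"])


def mode_is_safe(mode: int) -> bool:
    """True if neither group nor other has the write permission."""
    return mode & (stat.S_IWGRP | stat.S_IWOTH) == 0


def check_dcmi_library(path: str, ld_library_path: str = "") -> List[str]:
    """Return the list of violated restrictions (empty if none were found)."""
    problems = []
    depth = path_depth(path)
    if depth >= MAX_PATH_DEPTH:
        problems.append(f"path depth {depth} is not less than {MAX_PATH_DEPTH}")
    if len(ld_library_path) > MAX_LD_LIBRARY_PATH_LEN:
        problems.append("LD_LIBRARY_PATH is longer than 1024")
    library = PurePosixPath(path)
    # Check the library itself and every parent directory up to "/".
    for entry in [library, *library.parents]:
        try:
            info = os.stat(entry)
        except OSError:
            problems.append(f"cannot stat {entry}")
            continue
        if info.st_uid != 0:
            problems.append(f"{entry} is not owned by root")
        if not mode_is_safe(info.st_mode):
            problems.append(f"{entry} is writable by group or other")
    return problems
```

For example, check_dcmi_library(lib_path, os.environ.get("LD_LIBRARY_PATH", "")) returns an empty list when none of the checked restrictions is violated, where lib_path is the location of the DCMI library on your node.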
Procedure
- Log in to each compute node as the root user and check whether the image and version are correct.

      docker images | grep k8sdeviceplugin

  Command output:

      ascend-k8sdeviceplugin v7.3.0 29eec79eb693 About an hour ago 105MB

  - If correct, go to Step 2.
  - If not correct, create the image and distribute it by referring to Preparing an Image.
- Copy the YAML file in the directory where the Ascend Device Plugin package is decompressed to any directory on the Kubernetes management node. Note that you need to use the YAML file that adapts to the specific processor model. To prevent exceptions in the automatic identification of Ascend Docker Runtime, do not modify the DaemonSet.metadata.name field in the YAML file. For details, see the following table.
  Table 2 YAML files of Ascend Device Plugin

  | YAML File | Description |
  |---|---|
  | device-plugin-310-v{version}.yaml | Configuration file used when Volcano is not used on an inference server (equipped with Atlas 300I inference cards). |
  | device-plugin-310-volcano-v{version}.yaml | Configuration file used when Volcano is used on an inference server (equipped with Atlas 300I inference cards). |
  | device-plugin-310P-1usoc-v{version}.yaml | Configuration file used when Volcano is not used on Atlas 200I SoC A1 core boards. |
  | device-plugin-310P-1usoc-volcano-v{version}.yaml | Configuration file used when Volcano is used on Atlas 200I SoC A1 core boards. |
  | device-plugin-310P-v{version}.yaml | Configuration file used when Volcano is not used on Atlas inference product. |
  | device-plugin-310P-volcano-v{version}.yaml | Configuration file used when Volcano is used on Atlas inference product. |
  | device-plugin-910-v{version}.yaml | Configuration file used when Volcano is not used on Atlas training product, Atlas A2 training product, Atlas A3 training product, Atlas 800I A2 inference server, or A200I A2 Box heterogeneous component. |
  | device-plugin-volcano-v{version}.yaml | Configuration file used when Volcano is used on Atlas training product, Atlas A2 training product, Atlas A3 training product, Atlas 800I A2 inference server, or A200I A2 Box heterogeneous component. |

- Skip this step if you do not need to modify the component startup parameters. Otherwise, modify the startup parameters of Ascend Device Plugin based on your requirements. For details about the startup parameters, see Table 3. You can run the ./device-plugin -h command to view the parameter descriptions.
  - On the Atlas 200I SoC A1 core board, modify the Ascend Device Plugin startup parameters in the startup script run_for_310P_1usoc.sh. After the modification, create images on all Atlas 200I SoC A1 core board nodes, or create an image on a local node and distribute the image to the other Atlas 200I SoC A1 core board nodes.
    If Volcano is not used as the scheduler, set -volcanoType to false in the run_for_310P_1usoc.sh file before starting Ascend Device Plugin.
  - For other types of nodes, modify the startup parameters of Ascend Device Plugin in the corresponding startup YAML file.
- (Optional) When resumable training (including process-level recovery) or elastic training is used, modify the startup YAML file of Ascend Device Plugin based on the fault handling mode.

      ...
      containers:
      - image: ascend-k8sdeviceplugin:v7.3.0
        name: device-plugin-01
        resources:
          requests:
            memory: 500Mi
            cpu: 500m
          limits:
            memory: 500Mi
            cpu: 500m
        command: [ "/bin/bash", "-c", "--"]
        args: [ "device-plugin -useAscendDocker=true
                 -volcanoType=true    # Volcano must be used in the rescheduling scenario.
                 -autoStowing=true    # Whether to enable automatic management. The default value is true. If this parameter is set to false, automatic management is disabled. In this case, after the processor health status changes from unhealthy to healthy, or the network fault on the processor parameter plane is recovered, the processor will not be automatically added to the schedulable resource pool. This parameter applies only to Atlas training product.
                 -listWatchPeriod=5   # Set the health status check period. The value range is [3, 1800], in seconds.
                 -hotReset=2          # When process-level recovery is used, set hotReset to 2 to enable offline recovery.
                 -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log
                 -logLevel=0" ]
        securityContext:
          privileged: true
          readOnlyRootFilesystem: true
      ...

- (Optional) Configure the hot reset function when recovery of inference card faults is enabled.

      containers:
      - image: ascend-k8sdeviceplugin:v7.3.0
        name: device-plugin-01
        resources:
          requests:
            memory: 500Mi
            cpu: 500m
          limits:
            memory: 500Mi
            cpu: 500m
        command: [ "/bin/bash", "-c", "--"]
        args: [ "device-plugin
                 ...
                 -hotReset=0    # Enable the hot reset function when recovery of inference card faults is enabled.
                 -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log
                 -logLevel=0" ]
      ...

- (Optional) If you need to change the default port of kubelet, modify the startup YAML file of Ascend Device Plugin. Example:

        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: KUBELET_PORT    # Notify Ascend Device Plugin of the default kubelet port number on the current node. If the default kubelet port number is not customized, this field does not need to be passed.
          value: "10251"
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      ...

- Run the following command, respectively, in the corresponding YAML directory on the Kubernetes management node to start Ascend Device Plugin.
  - Nodes of Atlas training product, Atlas A2 training product, Atlas A3 training product, Atlas 800I A2 inference server, or A200I A2 Box heterogeneous component exist in a Kubernetes cluster. (Volcano is used together to support virtual instances. By default, static virtualization is enabled in YAML.)

        kubectl apply -f device-plugin-volcano-v{version}.yaml

  - Nodes of Atlas training product, Atlas A2 training product, Atlas A3 training product, Atlas 800I A2 inference server, or A200I A2 Box heterogeneous component exist in a Kubernetes cluster. (Ascend Device Plugin works independently without Volcano.)

        kubectl apply -f device-plugin-910-v{version}.yaml

  - Nodes of inference servers (equipped with Atlas 300I inference cards) exist in a Kubernetes cluster. (Volcano is used together.)

        kubectl apply -f device-plugin-310-volcano-v{version}.yaml

  - Nodes of inference servers (equipped with Atlas 300I inference cards) exist in a Kubernetes cluster. (Ascend Device Plugin works independently without Volcano.)

        kubectl apply -f device-plugin-310-v{version}.yaml

  - Nodes of Atlas inference product exist in a Kubernetes cluster. (Volcano is used together to support virtual instances. By default, static virtualization is enabled in YAML.)

        kubectl apply -f device-plugin-310P-volcano-v{version}.yaml

  - Nodes of Atlas inference product exist in a Kubernetes cluster. (Ascend Device Plugin works independently without Volcano.)

        kubectl apply -f device-plugin-310P-v{version}.yaml

  - Nodes of Atlas 200I SoC A1 core boards exist in a Kubernetes cluster. (Volcano is used together.)

        kubectl apply -f device-plugin-310P-1usoc-volcano-v{version}.yaml

  - Nodes of Atlas 200I SoC A1 core boards exist in a Kubernetes cluster. (Ascend Device Plugin works independently without Volcano.)

        kubectl apply -f device-plugin-310P-1usoc-v{version}.yaml

  If multiple types of Ascend AI processors are used in a Kubernetes cluster, run the corresponding command of each type.

  Startup example:

      serviceaccount/ascend-device-plugin-sa created
      clusterrole.rbac.authorization.k8s.io/pods-node-ascend-device-plugin-role created
      clusterrolebinding.rbac.authorization.k8s.io/pods-node-ascend-device-plugin-rolebinding created
      daemonset.apps/ascend-device-plugin-daemonset created
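
The choice of YAML file in this step depends only on the node type and on whether Volcano is used, so it can be expressed as a lookup. The following Python sketch is illustrative only: the short hardware keys ("310", "310P", "310P-1usoc", "910") and the function names are hypothetical labels introduced here, while the file name patterns are taken from Table 2. The version string passed in is whatever appears in the file names of your decompressed package.

```python
from typing import List

# (hypothetical hardware key, Volcano used?) -> file name pattern from Table 2
YAML_TEMPLATES = {
    ("310", False): "device-plugin-310-v{version}.yaml",
    ("310", True): "device-plugin-310-volcano-v{version}.yaml",
    ("310P-1usoc", False): "device-plugin-310P-1usoc-v{version}.yaml",
    ("310P-1usoc", True): "device-plugin-310P-1usoc-volcano-v{version}.yaml",
    ("310P", False): "device-plugin-310P-v{version}.yaml",
    ("310P", True): "device-plugin-310P-volcano-v{version}.yaml",
    ("910", False): "device-plugin-910-v{version}.yaml",
    ("910", True): "device-plugin-volcano-v{version}.yaml",
}


def yaml_file_name(hardware: str, use_volcano: bool, version: str) -> str:
    """Return the YAML file name for a node type and scheduler choice."""
    try:
        template = YAML_TEMPLATES[(hardware, use_volcano)]
    except KeyError:
        raise ValueError(f"unknown hardware key: {hardware!r}") from None
    return template.format(version=version)


def apply_command(hardware: str, use_volcano: bool, version: str) -> List[str]:
    """Build the kubectl command line as an argument list (it is not executed here)."""
    return ["kubectl", "apply", "-f", yaml_file_name(hardware, use_volcano, version)]
```

In a cluster with several processor types, calling apply_command once per type yields one command per YAML file, which mirrors the instruction to run the corresponding command of each type.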
- Run the following command on the Kubernetes management node to check whether the component is started:

      kubectl get pod -n kube-system

  If Running is displayed in the command output, the component is started successfully.

      NAME                                   READY   STATUS    RESTARTS   AGE
      ...
      ascend-device-plugin-daemonset-d5ctz   1/1     Running   0          11s
      ...

  - After the component is installed, if the pod status of the component is not Running, refer to Component pods Are Not in the Running State.
  - After the component is installed, if the pod status of the component is ContainerCreating, refer to Cluster Scheduling Component Pods Are in the ContainerCreating State.
  - If the component fails to be started, refer to Cluster Scheduling Components Fail to Start and "get sem errno =13" Is Displayed in Logs.
  - If the component is started successfully, but the corresponding pod cannot be found, refer to YAML File for Starting a Component Is Successfully Executed, But the pod Corresponding to the Component Is Not Displayed.
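
The status check in the last step can be automated by parsing the text printed by kubectl get pod -n kube-system. The sketch below is a hedged illustration: the function names are hypothetical, the pod name prefix is taken from the example output in this section, and the parser assumes the default column order NAME, READY, STATUS, RESTARTS, AGE.

```python
from typing import Dict


def device_plugin_pod_statuses(
    output: str, name_prefix: str = "ascend-device-plugin-daemonset"
) -> Dict[str, str]:
    """Map each Ascend Device Plugin pod name to its STATUS column."""
    statuses = {}
    for line in output.splitlines():
        fields = line.split()
        # Assumed columns: NAME READY STATUS RESTARTS AGE
        if len(fields) >= 3 and fields[0].startswith(name_prefix):
            statuses[fields[0]] = fields[2]
    return statuses


def all_running(statuses: Dict[str, str]) -> bool:
    """True only if at least one pod was found and every pod is Running."""
    return bool(statuses) and all(s == "Running" for s in statuses.values())


# Sample text modeled on the example output shown in this section.
SAMPLE_OUTPUT = """\
NAME                                   READY   STATUS    RESTARTS   AGE
ascend-device-plugin-daemonset-d5ctz   1/1     Running   0          11s
"""

if __name__ == "__main__":
    print(all_running(device_plugin_pod_statuses(SAMPLE_OUTPUT)))
```

An empty result is treated as a failure so that the case in which the pod cannot be found is not mistaken for a successful start.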
Parameters
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| -fdFlag | Bool | false | Edge scenario flag, indicating whether to manage devices with FusionDirector. |
| -shareDevCount | UINT | 1 | Whether to enable the device sharing function. The value ranges from 1 to 100. The default value is 1, indicating that device sharing is disabled. If the value is an integer ranging from 2 to 100, it indicates the number of shared devices virtualized by a single processor. The following devices are supported. This parameter is invalid for other devices and does not affect the component startup. NOTE: Pay attention to the following points if you use the aforesaid Atlas inference product: |
| -edgeLogFile | String | /var/alog/AtlasEdge_log/devicePlugin.log | Log file in the edge scenario. This parameter is valid only when fdFlag is set to true. NOTE: If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed. |
| -useAscendDocker | Bool | true | Whether the container engine uses Ascend Docker Runtime. The default value is true. To enable the CPU core binding function of Kubernetes, you need to uninstall Ascend Docker Runtime and restart the container engine. The options are as follows: NOTE: MindCluster 5.0.RC1 and later versions support only automatic acquisition of the running mode. |
| -use310PMixedInsert | Bool | false | Whether to use the mixed-insertion mode. |
| -volcanoType | Bool | false | Whether to use Volcano for scheduling, which is supported by Atlas training product, |
| -presetVirtualDevice | Bool | true | Virtualization function switch. |
| -version | Bool | false | Whether to query the Ascend Device Plugin version number. |
| -listWatchPeriod | Integer | 5 | Health check period. The value range is [3, 1800], in seconds. NOTE: The following items are checked in each period, and the check results are written into the ConfigMap. |
| -autoStowing | Bool | true | Whether to automatically manage recovered devices. This parameter is valid only when volcanoType is set to true. NOTE: If a device is faulty, it is automatically isolated from Kubernetes. If the device recovers, it is automatically added to the Kubernetes cluster resource pool by default. If the device is unstable, set this parameter to false. In this case, you need to manually manage it. |
| -logLevel | Integer | 0 | Log level: |
| -maxAge | Integer | 7 | Time limit for backing up logs. The value ranges from 7 to 700, in days. |
| -logFile | String | /var/log/mindx-dl/devicePlugin/devicePlugin.log | Log file in non-edge scenarios. This parameter is valid only when fdFlag is set to false. NOTE: If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed. Dumped files are named in the format of "devicePlugin-dump triggering time.log", for example, devicePlugin-2023-10-07T03-38-24.402.log. |
| -hotReset | Integer | -1 | Whether to enable device hot reset. After this function is enabled, if a processor is faulty, Ascend Device Plugin conducts a hot reset to restore it. NOTE: The value 1 cannot be used because the function has become unavailable. Set this parameter to other values. This parameter supports the following training devices: This parameter supports the following inference devices: |
| -linkdownTimeout | Integer | 30 | Network linkdown timeout interval. The value ranges from 1 to 30, in seconds. NOTE: You are advised to set this parameter to the value of HCCL_RDMA_TIMEOUT configured in the training script. For multiple tasks, you are advised to set this parameter to the minimum value of HCCL_RDMA_TIMEOUT in the multi-task scenario. |
| -enableSlowNode | Bool | false | Whether to enable slow node detection (deterioration diagnosis). |
| -dealWatchHandler | Bool | false | Whether to update the local pod informer cache when the informer link ends due to an exception. |
| -checkCachedPods | Bool | true | Whether to periodically check pods in the cache. The default value is true. If a pod in the cache is not updated for more than one hour, Ascend Device Plugin queries the api-server for the pod status. |
| -maxBackups | Integer | 30 | Maximum number of dumped log files that can be retained. The value ranges from 1 to 30. |
| -thirdPartyScanDelay | Integer | 300 | Scanning delay after Ascend Device Plugin is started, in seconds. After Ascend Device Plugin fails to automatically reset a processor, it writes the failure information to the node annotation. The third-party platform can reset the processor based on this information. Then, Ascend Device Plugin waits for the period of time specified by this parameter before scanning devices again. This parameter is supported only by Atlas 800T A3 SuperPoD Server. |
| -deviceResetTimeout | Integer | 60 | Maximum wait time for the driver to report complete processor information if the detected processor count is insufficient at component startup. The value ranges from 10 to 600, in seconds. |
| -h or -help | None | None | Help information. |
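
Several parameters above have documented value ranges, so a startup configuration can be sanity-checked before it is written into the YAML file or startup script. The following Python sketch is an assumption-laden illustration rather than the component's own validation: the function names are hypothetical, and only the ranges and the hotReset restriction stated in the table are encoded.

```python
from typing import Dict, List, Union

Value = Union[bool, int, str]

# Inclusive ranges copied from the parameter table above.
RANGES = {
    "shareDevCount": (1, 100),
    "listWatchPeriod": (3, 1800),
    "maxAge": (7, 700),
    "linkdownTimeout": (1, 30),
    "maxBackups": (1, 30),
    "deviceResetTimeout": (10, 600),
}


def validate(params: Dict[str, Value]) -> List[str]:
    """Return a list of problems found in the given startup parameters."""
    errors = []
    for name, value in params.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                errors.append(f"-{name}={value} is outside [{low}, {high}]")
    # The table states that the value 1 can no longer be used for hotReset.
    if params.get("hotReset") == 1:
        errors.append("-hotReset=1 is unavailable; use another value")
    return errors


def render_args(params: Dict[str, Value]) -> str:
    """Render parameters in the -name=value form used in the YAML args string."""
    parts = ["device-plugin"]
    for name, value in params.items():
        if isinstance(value, bool):
            value = str(value).lower()  # Bool flags are written as true/false
        parts.append(f"-{name}={value}")
    return " ".join(parts)
```

For instance, validate({"listWatchPeriod": 2}) reports one out-of-range problem, and render_args({"volcanoType": True, "listWatchPeriod": 5}) produces a string in the same form as the args examples in the Procedure section.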