Failed to Run an Inference Service As a Common User When Dynamic Virtualization Is Enabled
Symptom
Ascend Device Plugin is deployed for an Atlas inference product together with Volcano, and dynamic virtualization is enabled. After a job is submitted, the virtual device (vNPU) is created successfully, but the inference job fails to run.
Cause Analysis
When the inference job container runs as a common (non-root) user, the following two conditions prevent that user from accessing the vNPU device, which belongs to the root group. As a result, the inference service container fails to run.
- The vNPU is created on the physical machine through an interface called from the privileged container, and it belongs to the root group. The new vNPU is not visible in the /dev directory of the privileged container.
- A vNPU created through the driver interface is owned by the root group by default. From inside the privileged container, the owner group of the newly created vNPU cannot be changed to a non-root group.
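The ownership described above can be checked directly. The following is a minimal sketch, not part of the product tooling, that lists device nodes in a directory together with their owner, group, and mode. The function name `describe_nodes` and the `vdavinci*` name pattern are assumptions made for illustration; adjust the pattern to the device names your driver actually creates.

```python
import fnmatch
import grp
import os
import pwd
import stat


def _user_name(uid):
    """Resolve a UID to a user name, falling back to the numeric ID."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


def _group_name(gid):
    """Resolve a GID to a group name, falling back to the numeric ID."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)


def describe_nodes(dev_dir="/dev", pattern="vdavinci*"):
    """Return (name, owner, group, mode) for entries in dev_dir matching pattern.

    The default pattern is an assumption about vNPU device naming.
    """
    rows = []
    for name in sorted(os.listdir(dev_dir)):
        if not fnmatch.fnmatch(name, pattern):
            continue
        # lstat reports on the node itself rather than following symlinks.
        st = os.lstat(os.path.join(dev_dir, name))
        rows.append(
            (name, _user_name(st.st_uid), _group_name(st.st_gid),
             stat.filemode(st.st_mode))
        )
    return rows
```

Running `describe_nodes()` on the host and again inside the privileged container shows whether the vNPU node is visible in each place and which group owns it. An empty list inside the container matches the first condition above.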
Solution
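Mount the host /dev directory into the Ascend Device Plugin container by adding a dev volume and the corresponding volume mount to the startup YAML file of Ascend Device Plugin. The following is an example: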
        command: [ "/bin/bash", "-c", "--"]
        args: [ "device-plugin -useAscendDocker=true -volcanoType=true
                -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log -logLevel=0" ]
        securityContext:
          privileged: true
          readOnlyRootFilesystem: true
        imagePullPolicy: Never
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
          ... # Several fields are omitted here.
          - name: tmp
            mountPath: /tmp
          - name: dev
            mountPath: /dev # Mount /dev.
        env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        ... # Several fields are omitted here.
        - name: tmp
          hostPath:
            path: /tmp
        - name: dev # Mount /dev.
          hostPath:
            path: /dev
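Before redeploying, it can help to confirm that both halves of the change are present: the hostPath volume and the matching volume mount. The following is a minimal sketch that checks an already parsed pod spec (a plain dictionary). The function name `mounts_host_dev`, the container name `device-plugin-01`, and the example dictionary are hypothetical and only illustrate the structure shown above.

```python
def mounts_host_dev(pod_spec, container_name):
    """Return True if container_name mounts a hostPath volume for /dev at /dev."""
    # Names of all volumes whose hostPath points at the host /dev directory.
    dev_volumes = {
        v.get("name")
        for v in pod_spec.get("volumes", [])
        if (v.get("hostPath") or {}).get("path") == "/dev"
    }
    for container in pod_spec.get("containers", []):
        if container.get("name") != container_name:
            continue
        return any(
            m.get("name") in dev_volumes and m.get("mountPath") == "/dev"
            for m in container.get("volumeMounts", [])
        )
    return False


# Hypothetical, trimmed pod spec mirroring the example above.
example_spec = {
    "containers": [
        {
            "name": "device-plugin-01",  # assumed container name
            "volumeMounts": [
                {"name": "tmp", "mountPath": "/tmp"},
                {"name": "dev", "mountPath": "/dev"},
            ],
        }
    ],
    "volumes": [
        {"name": "tmp", "hostPath": {"path": "/tmp"}},
        {"name": "dev", "hostPath": {"path": "/dev"}},
    ],
}
```

With this example, `mounts_host_dev(example_spec, "device-plugin-01")` returns True. Removing either the `dev` volume or the `dev` volume mount makes it return False, which corresponds to a manifest where only one of the two entries was added.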