Failed to Run an Inference Service As a Common User When Dynamic Virtualization Is Enabled

Symptom

An Atlas inference product and Ascend Device Plugin are deployed with Volcano, and dynamic virtualization is enabled. After a job is delivered, the virtual device (vNPU) is created successfully, but the inference job fails to run.

Cause Analysis

If the inference job container is run by a common (non-root) user, the user cannot access the vNPU device, which belongs to the root group, and the inference service container fails to run. This happens for the following reasons:

  • The vNPU is created on the physical machine through an interface called from the privileged container, and it belongs to the root group. The vNPU device is not visible in the /dev directory of the privileged container.
  • After the driver interface creates a vNPU, the owner group of the vNPU is root by default. The owner group of the newly created vNPU cannot be changed to a non-root group from inside the privileged container.
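
The ownership problem described above can be checked by hand. The following is a minimal Python sketch, not part of the product, that reports the owner, group, and mode of a device node and tells whether a given user could read and write it. The function names rw_allowed and describe are hypothetical, and the check looks only at classic owner/group/other permission bits; ACLs, capabilities, and container device rules are ignored.

```python
import os
import stat
import sys


def rw_allowed(path, uid, gids):
    """Return True if a user with this uid and these group IDs may read and
    write path, judged only from owner, group, and mode bits (assumption:
    no ACLs or other access controls are considered)."""
    if uid == 0:
        return True  # root bypasses the permission bits
    st = os.stat(path)
    if st.st_uid == uid:
        needed = stat.S_IRUSR | stat.S_IWUSR
    elif st.st_gid in gids:
        needed = stat.S_IRGRP | stat.S_IWGRP
    else:
        needed = stat.S_IROTH | stat.S_IWOTH
    return (st.st_mode & needed) == needed


def describe(path):
    """Return a one-line summary of the owner, group, and mode of path."""
    st = os.stat(path)
    return f"{path}: uid={st.st_uid} gid={st.st_gid} mode={stat.filemode(st.st_mode)}"


if __name__ == "__main__":
    # Device paths are passed on the command line; none are hard-coded here.
    groups = set(os.getgroups()) | {os.getgid()}
    for dev in sys.argv[1:]:
        print(describe(dev), "rw_allowed:", rw_allowed(dev, os.getuid(), groups))
```

Run the script as the common user and pass the path of the vNPU device node as an argument. A device node owned by root:root with mode crw-rw---- would be reported as not accessible to a user outside the root group, which matches the failure described in this section.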

Solution

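Mount the host /dev directory into the Ascend Device Plugin container by adding a dev volume and a matching volume mount to the startup YAML file of Ascend Device Plugin. The following is an example: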
         command: [ "/bin/bash", "-c", "--"]
         args: [ "device-plugin  -useAscendDocker=true -volcanoType=true
                  -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log -logLevel=0" ]
         securityContext:
           privileged: true
           readOnlyRootFilesystem: true
         imagePullPolicy: Never
         volumeMounts:
           - name: device-plugin
             mountPath: /var/lib/kubelet/device-plugins
...                                         # Several fields are omitted here.
           - name: tmp
             mountPath: /tmp
           - name: dev
             mountPath: /dev            # Mount /dev.
         env:
           - name: NODE_NAME
             valueFrom:
               fieldRef:
                 fieldPath: spec.nodeName
       volumes:
         - name: device-plugin
           hostPath:
             path: /var/lib/kubelet/device-plugins
...                                     # Several fields are omitted here.
         - name: tmp
           hostPath:
             path: /tmp
         - name: dev                   # Mount /dev.
           hostPath:
             path: /dev