Failed to Use npu-smi info After CPU Core Binding Is Configured for Kubernetes

Symptom

On an Atlas 800 inference server (model 3000) (ARM) running CentOS 7.6, Kubernetes 1.12 is used to schedule NPU-related services. The kubelet is configured with the parameters --kube-reserved=cpu=2,memory=250Mi, --cpu-manager-policy=static, and --feature-gates=CPUManager=true. After CPU core binding is enabled in this way, running the npu-smi info command reports an error.

Cause Analysis

By default, pods created by kubelet use the CPU resources of the physical machine based on CFS quotas. When the CPU manager starts, it returns immediately if the CPU manager policy is none. If the policy is static, it starts a goroutine that periodically reconciles the CPU assignment of running containers.

In the reconcileState method, the CPU manager repeatedly obtains the CPU set of each container through the GetCPUSetOrDefault method and applies it to the container.
The GetCPUSetOrDefault method looks up the CPU set by container ID. If the container belongs to a Guaranteed pod with exclusively allocated cores, the CPU set assigned to that container is returned. Otherwise, the default CPU set is returned, which consists of all CPU cores except those allocated to Guaranteed pods. As a result, every container, including those without exclusive cores, has its CPU set updated periodically while it is running.
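
The lookup and reconcile behavior described above can be illustrated with a small model. This is a simplified sketch in Python, not the kubelet's Go implementation: the class CPUState, the function reconcile, and the container names are hypothetical, and only the method name get_cpuset_or_default mirrors GetCPUSetOrDefault.

```python
from typing import Callable, Dict, FrozenSet, Iterable


class CPUState:
    """Toy stand-in for the CPU manager state (hypothetical, not the kubelet type)."""

    def __init__(self, all_cpus: Iterable[int]) -> None:
        self.all_cpus: FrozenSet[int] = frozenset(all_cpus)
        # container ID -> exclusively allocated cores (Guaranteed pods only)
        self.assignments: Dict[str, FrozenSet[int]] = {}

    def default_cpuset(self) -> FrozenSet[int]:
        """All cores except those exclusively allocated to some container."""
        used = frozenset().union(*self.assignments.values())
        return self.all_cpus - used

    def assign_exclusive(self, container_id: str, cpus: Iterable[int]) -> None:
        cpus = frozenset(cpus)
        if not cpus <= self.default_cpuset():
            raise ValueError("requested cores are not free in the shared pool")
        self.assignments[container_id] = cpus

    def get_cpuset_or_default(self, container_id: str) -> FrozenSet[int]:
        """Return the exclusive set if one exists, else the shared default set."""
        return self.assignments.get(container_id, self.default_cpuset())


def reconcile(state: CPUState,
              container_ids: Iterable[str],
              update_container: Callable[[str, FrozenSet[int]], None]) -> None:
    """One reconcile pass: every container is updated, not only Guaranteed ones."""
    for cid in container_ids:
        update_container(cid, state.get_cpuset_or_default(cid))


state = CPUState(range(8))
state.assign_exclusive("guaranteed-a", {2, 3})
applied: Dict[str, FrozenSet[int]] = {}
reconcile(state, ["guaranteed-a", "burstable-b"], applied.__setitem__)
```

The point to note is that reconcile issues an update for burstable-b as well, even though that container never requested exclusive cores. Each such update is what later triggers the container runtime to re-apply the container's cgroup configuration.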

The Ascend Docker Runtime mounts the .so files of the Ascend driver library and the NPU device information into the container as file mounts, and adds the corresponding device cgroup access permission. This information is not synchronized to the Docker Engine during mounting, so it is not recorded in the container's cgroup configuration. When runC calls the devices Set method during an update, it rewrites the device-related cgroup files based on the content of the container's cgroup configuration. Because the permission added during mounting is not part of that configuration, it is lost in this process.
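
The loss of the permission can be pictured with a toy model. The sketch below is an illustration under stated assumptions, not runC or Ascend Docker Runtime code: the class ToyContainer and its methods are hypothetical, and the rule "c 240:0 rwm" uses a made-up device number standing in for an NPU device node.

```python
from typing import Iterable, List, Set


class ToyContainer:
    """Hypothetical model of a container's device cgroup and its recorded config."""

    def __init__(self, config_devices: Iterable[str]) -> None:
        # Device rules recorded in the container configuration known to the engine.
        self.config_devices: List[str] = list(config_devices)
        # Device rules currently in effect in the (simulated) devices cgroup.
        self.cgroup_devices: Set[str] = set(self.config_devices)
        self.cpuset: str = ""

    def hook_allow(self, rule: str) -> None:
        """Out-of-band grant: changes the live cgroup only, not the recorded config."""
        self.cgroup_devices.add(rule)

    def update(self, cpuset: str) -> None:
        """Simulated update: applies the new cpuset and re-applies device rules
        from the recorded config, discarding anything added out of band."""
        self.cpuset = cpuset
        self.cgroup_devices = set(self.config_devices)


NPU_RULE = "c 240:0 rwm"  # hypothetical major:minor, for illustration only

container = ToyContainer(["c 1:3 rwm"])  # 1:3 is /dev/null on Linux
container.hook_allow(NPU_RULE)
allowed_before = NPU_RULE in container.cgroup_devices

container.update("0-1,4-7")  # e.g. triggered by a CPU manager reconcile pass
allowed_after = NPU_RULE in container.cgroup_devices
```

In this model, allowed_before is true and allowed_after is false: the rule survives only until the first update. A rule that is present in config_devices from the start is re-applied on every update, which is the effect the solution below relies on.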

Solution

The Ascend Device Plugin can independently mount /dev/davinciX, /dev/davinci_manager, and /dev/devmm_svm to containers. Use Ascend Device Plugin mounting in place of Ascend Docker Runtime mounting so that the cgroup access permission is not lost. To do so, set the -useAscendDocker parameter of the Ascend Device Plugin to false and reinstall the Ascend Device Plugin.