npu-smi info Fails to Run After CPU Core Binding Is Configured for Kubernetes

Symptom

The Atlas 800 3000 server (ARM) runs CentOS 7.6, and Kubernetes 1.12 is used to schedule NPU-related services. The kubelet is started with the parameters --kube-reserved=cpu=2,memory=250Mi, --cpu-manager-policy=static, and --feature-gates=CPUManager=true. After CPU core binding is enabled for Kubernetes, the following error message is displayed when the npu-smi info command is run:
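For reference, the kubelet settings described above might be passed like the following. This is a sketch only: the drop-in file path and the KUBELET_EXTRA_ARGS variable name are assumptions that depend on how the kubelet is deployed; the flag values themselves are from the symptom description.

```shell
# Assumed systemd drop-in, e.g. /etc/systemd/system/kubelet.service.d/10-cpumanager.conf.
# These flags reserve 2 CPUs and 250Mi for system daemons and enable the
# static CPU manager policy (the configuration under which the issue occurs).
KUBELET_EXTRA_ARGS="--kube-reserved=cpu=2,memory=250Mi \
  --cpu-manager-policy=static \
  --feature-gates=CPUManager=true"
```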

Causes

By default, containers created by the kubelet share the machine's CPUs under CFS quotas. When the CPU manager starts with no policy configured (the none policy), it does nothing and returns immediately. With the static policy, the kubelet starts a reconciliation goroutine. In its reconcileState method, the kubelet periodically looks up each container's CPU set in its internal state through the GetCPUSetOrDefault method and writes that CPU set back to the container. GetCPUSetOrDefault resolves the CPU set by container ID: a container of a Guaranteed pod gets the CPU set exclusively assigned to it, while any other container gets the default (shared) set, which contains all CPU cores except those exclusively assigned to Guaranteed pods. As a result, every container's CPU set is periodically and dynamically rewritten while it runs.
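This behavior can be observed on a live node. A hedged inspection sketch, assuming a running cluster; the pod name, pod UID, container ID, and cgroup path below are placeholders, and the real cgroup path depends on the cgroup driver in use:

```shell
# Only Guaranteed pods (integer CPU requests == limits) receive exclusive CPUs
# under the static policy; check a pod's QoS class first.
kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'

# On the node, inspect the cpuset that the kubelet's reconcile loop keeps
# rewriting for a container (illustrative cgroup v1 path).
cat /sys/fs/cgroup/cpuset/kubepods/pod<pod-uid>/<container-id>/cpuset.cpus
```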

Ascend Docker Runtime bind-mounts the Ascend driver library (.so) files and the NPU device files into the container and adds the corresponding device entries to the container's devices cgroup. This information is not synchronized to the Docker engine. When the kubelet's periodic update triggers runc, runc's device Set method rewrites the container's device cgroup entries from the cgroup configuration stored for the container, so the devices cgroup permissions that Ascend Docker Runtime added at mount time are lost.
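The permission loss can be confirmed by comparing a container's devices cgroup whitelist before and after a reconcile cycle. A sketch under assumptions: the container name is a placeholder, and the cgroup path shown is the typical Docker cgroup v1 layout, which may differ on your node.

```shell
# Resolve the full container ID from a placeholder container name.
CID=$(docker inspect -f '{{.Id}}' <container-name>)

# List the container's device cgroup whitelist. Right after startup it includes
# the NPU character-device rules added by Ascend Docker Runtime (e.g. for
# /dev/davinci0); after the kubelet's cpuset reconcile causes runc to re-apply
# the device Set, those rules are gone and npu-smi info fails in the container.
cat /sys/fs/cgroup/devices/docker/"$CID"/devices.list
```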

Solution

Ascend Device Plugin can itself mount /dev/davinciX, /dev/davinci_manager, and /dev/devmm_svm into containers. Use Ascend Device Plugin mounting instead of Ascend Docker Runtime mounting so that the devices cgroup permissions are not lost during reconciliation. Set the Ascend Device Plugin startup parameter -useAscendDocker=false and reinstall Ascend Device Plugin.
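A hedged sketch of applying the change, assuming the device plugin is deployed as a DaemonSet from a YAML manifest; the manifest file name and namespace are assumptions, so follow the installation guide for your Ascend Device Plugin version. Only the -useAscendDocker=false parameter itself is from the solution above.

```shell
# Remove the current deployment (manifest name and namespace are assumptions).
kubectl -n kube-system delete -f ascend-device-plugin.yaml

# Edit the manifest: add "-useAscendDocker=false" to the device plugin
# container's startup arguments, then redeploy.
kubectl -n kube-system apply -f ascend-device-plugin.yaml

# Verify inside an NPU container that the devices are accessible again.
npu-smi info
```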