"npu is busy, check again" Is Displayed in Pod Logs

Symptom

After you run the kubectl logs -n vcjob default-test-pytorch-master-0 -f command to view the pod logs, the error message "npu is busy, check again" is displayed.

Cause Analysis

  • Cause 1: NPUs are occupied by another container.
  • Cause 2: NPU devices are mounted to another container. The error is reported even if that container does not actually use them.

Solution

  • (Recommended) Solution 1: Stop the container with NPUs mounted.
  1. Query all running containers.
    docker ps
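  2. For each container in the output, check whether NPU devices (davinci) are mounted to it. Replace Container ID with an ID from the docker ps output.
    docker inspect Container ID | grep davinci
  3. If the command returns any davinci entries, NPUs are mounted to that container. Stop the container.
    docker stop Container ID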
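
Steps 1 and 2 above can be combined into a small helper so that every running container is checked in one pass. The following is a minimal sketch, not part of the original procedure: the function name list_npu_containers is hypothetical, and it uses the same plain "davinci" string match as the grep in step 2, so it may also report containers whose inspect output mentions davinci for other reasons (for example, in an image name). It only lists candidates and does not stop anything.

```shell
# Hypothetical helper: print "<ID> <name>" for every running container
# whose `docker inspect` output mentions "davinci" (same match as step 2).
list_npu_containers() {
    # One line per running container: short ID followed by its name.
    docker ps --format '{{.ID}} {{.Names}}' | while read -r id name; do
        # Capture the inspect output; stdin is redirected so the loop's
        # input stream cannot be consumed by the inner command.
        info="$(docker inspect "$id" </dev/null)"
        case "$info" in
            *davinci*) echo "$id $name" ;;  # NPU device appears to be mounted
        esac
    done
}
```

Run list_npu_containers on the node, review the listed containers, and then stop the ones that should release their NPUs with docker stop as described in step 3.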
  • Solution 2: Specify the privileged mode in the YAML file. This method applies only when other containers have NPUs mounted but do not use them.

Add the following fields to the job YAML file, for example: