No Training Job Is Executed on the Worker Node and New Training Jobs Cannot Be Delivered

Symptom

No training job is running in the current environment, but new training pods cannot be scheduled and remain in the Pending state.
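
One way to confirm the symptom is to list the pods that are stuck in the Pending phase across all namespaces. The following is a minimal sketch: the function name list_pending_pods is hypothetical, and the command assumes that kubectl is already configured for the affected cluster.

```shell
# List pods in the Pending phase across all namespaces.
# list_pending_pods is an illustrative helper name, not part of kubectl.
list_pending_pods() {
    kubectl get pods -A --field-selector=status.phase=Pending
}
```

Calling list_pending_pods prints the pending pods. Running kubectl describe pod on one of them then shows its scheduling events.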

Cause Analysis

During an earlier use of the environment, the container of a training job was not cleaned up after the job was deleted. The residual container still occupies the cards, so new training jobs are suspended and cannot be started.

Solution

  1. View the logs of Ascend Device Plugin. Example:
    kubectl logs -f -n kube-system ascend-device-plugin-daemonset-910-njj49
  2. Search the logs for the keyword "containerd used chips".
  3. Identify the ID of the container that occupies the cards from the keys of the map printed in that log entry, and then check whether the container exists in Docker or Containerd.
    • Docker
      The following uses container 2d758ae3968b as an example to describe how to delete a container from Docker.
      1. Check whether the container exists on the node.
        docker ps
      2. Stop the container if it is running.
        docker stop 2d758ae3968b
      3. Run the docker ps command to check whether the container is stopped.
      4. Check whether there is any residual container.
        docker ps -a | grep 2d758ae3968b
      5. (Optional) Delete the residual container if any.
        docker rm 2d758ae3968b
      6. (Optional) Check whether the container is deleted.
        docker ps -a | grep 2d758ae3968b
    • Containerd
      The following uses the container test-containerd-1 as an example to describe how to delete a container from Containerd.
      1. Check whether the container exists on the node.
        ctr tasks list | grep test-containerd-1
      2. Stop the container if it is running.
        ctr tasks kill test-containerd-1
      3. Check whether there is any residual container.
        ctr containers list | grep test-containerd-1
      4. (Optional) Delete the residual container if any.
        ctr containers delete test-containerd-1
      5. (Optional) Check whether the residual container is deleted.
        ctr containers list | grep test-containerd-1
  4. Search the Ascend Device Plugin logs for the keyword "containerd used chips" again. If the printed information is empty, the residual container that occupied the cards has been deleted.
  5. Run the kubectl get pod -A command to check the status of the job pods and confirm that the job is running properly.
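
The log search and the Docker cleanup steps above can be sketched as two small shell functions. This is an illustration under stated assumptions, not an official tool: the function names show_used_chips_log and remove_residual_docker_container are hypothetical, the container ID must still be read from the log by hand, and only the Docker case is covered.

```shell
# Illustrative helpers. The function names are hypothetical and are not
# part of Docker, kubectl, or Ascend Device Plugin.

# Print the device plugin log lines that mention occupied chips.
# $1: name of the Ascend Device Plugin pod in the kube-system namespace.
show_used_chips_log() {
    local pod="$1"
    # -f is omitted on purpose so that kubectl exits and grep can finish.
    kubectl logs -n kube-system "$pod" | grep "containerd used chips"
}

# Stop and delete a residual Docker container.
# $1: container ID taken from the device plugin log.
remove_residual_docker_container() {
    local cid="$1"
    if [ -z "$cid" ]; then
        echo "usage: remove_residual_docker_container <container-id>" >&2
        return 2
    fi
    # Stop the container if it is still running.
    if docker ps | grep -q -- "$cid"; then
        docker stop "$cid" || return 1
    fi
    # Delete the container if a stopped copy remains.
    if docker ps -a | grep -q -- "$cid"; then
        docker rm "$cid" || return 1
    fi
    # Confirm that nothing is left.
    if docker ps -a | grep -q -- "$cid"; then
        echo "container $cid is still present" >&2
        return 1
    fi
    echo "container $cid removed"
}
```

For example, remove_residual_docker_container 2d758ae3968b stops and deletes the container used in the Docker steps above. show_used_chips_log returns a nonzero exit status when no log line matches, which corresponds to the empty result expected in step 4.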