Pod Status Is Inconsistent After a Job Is Rescheduled

Symptom

In a multi-node cluster, a distributed training job is delivered with sufficient resources, job rescheduling enabled, and fault-scheduling set to grace. If a fault then triggers rescheduling, the job may end up with one pod in the Running state and another pod in the Pending state. The Pending pod remains in that state even after the fault is rectified.

Cause Analysis

  1. The open-source code determines the number of jobs incorrectly.

    When a pod is terminated and the script in the container returns 0 instead of a non-zero value, Kubernetes sets the pod status to Succeeded. While the pod is still releasing its resources, Volcano treats it as a restarted pod whose status is Succeeded, although it has not actually restarted. This triggers Gang scheduling, and the pod that restarts first enters the creation phase. When another pod restarts after its termination, it enters the Pending state because the resources are insufficient.

  2. The resumable training script is not used.
  3. Cluster resources are insufficient.
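
To check whether cause 1 applies, it can help to look at the phase of a terminated pod and the exit code of its container. The following is a minimal sketch, not an official diagnostic tool: the function name show_pod_exit_state is hypothetical, and the namespace and pod name arguments are placeholders. It relies only on the standard kubectl jsonpath output.

```shell
# Print the phase and the terminated-container exit codes of one pod.
# Arguments: $1 = namespace, $2 = pod name (both are placeholders).
show_pod_exit_state() {
    ns="$1"
    pod="$2"
    # .status.phase is one of Pending, Running, Succeeded, Failed, Unknown.
    phase=$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.status.phase}')
    # Exit codes of containers that have terminated; empty if none has.
    codes=$(kubectl get pod "$pod" -n "$ns" \
        -o jsonpath='{.status.containerStatuses[*].state.terminated.exitCode}')
    echo "pod=$pod phase=$phase exit_codes=${codes:-none}"
}

# Usage (placeholders): show_pod_exit_state pod_namespace pod_name
```

A pod that reports phase=Succeeded with exit code 0 right after the fault matches the behavior described in cause 1.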

Solution

  • Run the following command to manually delete the pod that is in the Running state, replacing pod_namespace and pod_name with the namespace and name of that pod:
    kubectl delete pod -n pod_namespace pod_name
  • Delete the job and deliver it again.
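
If the job has several pods, the first option can be scripted. The sketch below is one possible way to do it under stated assumptions: the function names list_running_pods and delete_running_pods are hypothetical, and the label selector that identifies the pods of the job is passed in by the caller. Volcano jobs usually label their pods with volcano.sh/job-name, but confirm the actual labels with kubectl get pods --show-labels before relying on that.

```shell
# List the names of Running pods that match a label selector.
# Arguments: $1 = namespace, $2 = label selector (both are placeholders).
list_running_pods() {
    kubectl get pods -n "$1" -l "$2" \
        --field-selector=status.phase=Running \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
}

# Delete every Running pod returned by list_running_pods.
delete_running_pods() {
    list_running_pods "$1" "$2" | while read -r pod; do
        # Skip empty lines so kubectl is never called without a pod name.
        if [ -n "$pod" ]; then
            kubectl delete pod -n "$1" "$pod"
        fi
    done
}

# Usage (placeholders):
#   list_running_pods pod_namespace volcano.sh/job-name=job_name
#   delete_running_pods pod_namespace volcano.sh/job-name=job_name
```

Run list_running_pods first and review its output before deleting anything, because the selector decides which pods are removed.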