Pod Status Is Inconsistent After a Job Is Rescheduled in the MindX DL Scenario
Symptom
In a multi-node cluster, a distributed training job is delivered with sufficient resources, job rescheduling enabled, and fault-scheduling set to grace. If a fault occurs and triggers rescheduling, one pod may end up in the Running state while another remains in the Pending state after the job is rescheduled. The Pending pod stays in that state even after the fault is rectified.
Causes
- The open-source code determines the number of jobs incorrectly.
- When a pod is terminated, if the script in the container returns 0 (rather than a non-zero value), Kubernetes marks the pod as Succeeded. While this pod is still releasing its resources, the Volcano scheduler treats it as already restarted because its status is Succeeded, which triggers the Volcano gang scheduling plugin: the other pod is recreated first. By the time the terminated pod finishes cleanup and is restarted, resources are insufficient, so it enters the Pending state.
- A resumable (checkpoint-based) training script is not used.
- Cluster resources are insufficient.
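The stuck state described above is easy to recognize from the phases of a job's pods: after rescheduling, some pods are Running while at least one stays Pending. A minimal sketch of that check follows; the function name and the example phase lists are hypothetical, not part of MindX DL or Volcano.

```python
def job_pods_inconsistent(phases):
    """Return True if a gang-scheduled job's pods are split between
    Running and Pending, i.e. the stuck state after rescheduling.

    `phases` is a list of pod phase strings as reported by Kubernetes
    (e.g. the STATUS column of `kubectl get pods`).
    """
    return "Running" in phases and "Pending" in phases

# Example: a 4-replica job where one pod never left Pending.
print(job_pods_inconsistent(["Running", "Running", "Running", "Pending"]))  # True
# A healthy job: all replicas Running.
print(job_pods_inconsistent(["Running"] * 4))  # False
```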
Solution
- Manually delete the pod in the Running state: `kubectl delete pod -n <namespace> <pod-name>`.
- Alternatively, delete the job and deliver it again.
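The first workaround can also be scripted with the official Kubernetes Python client, which is equivalent to the `kubectl delete pod` command above. This is a minimal sketch, assuming the `kubernetes` package is installed; the namespace and pod name shown in the usage comment are hypothetical placeholders.

```python
def delete_running_pod(api, namespace, pod_name):
    """Delete one pod via a CoreV1Api-like client so that, once it is
    gone, the Volcano gang scheduling plugin can place all of the job's
    pods together again.

    With the real `kubernetes` package, pass kubernetes.client.CoreV1Api();
    delete_namespaced_pod is its method corresponding to `kubectl delete pod`.
    """
    return api.delete_namespaced_pod(name=pod_name, namespace=namespace)

# Usage against a real cluster (requires the `kubernetes` package):
#   from kubernetes import client, config
#   config.load_kube_config()
#   delete_running_pod(client.CoreV1Api(), "my-namespace", "my-job-worker-0")
```

Passing the API client in as a parameter keeps the helper testable without a live cluster.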
Parent topic: Troubleshooting