Rescheduling Fails Because Different Pods of the Same Job Are Configured with Different nodeSelectors

Symptom

Different pods of the same job are configured with different nodeSelectors. For example, the nodeSelector of the master pod is as follows:
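
As a minimal sketch, the master pod's selector can be reconstructed from the key-value pair quoted under Cause Analysis. Only that pair comes from this document; the surrounding fields are assumed to follow the standard Kubernetes pod spec layout.

```yaml
# Pod template fragment for the master pod (layout assumed; only the
# key-value pair below is taken from this document).
spec:
  nodeSelector:
    masterselector: dls-master-node
```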

The nodeSelector of the worker pod is as follows:

The labels of the nodes in the cluster are as follows. During the first scheduling, the master pod is scheduled to the node-69-77 node, and the worker pod is scheduled to the worker-69-87 node.

The worker-69-87 node then becomes faulty, and pod-level rescheduling is triggered.

Pod-level rescheduling cannot be completed because resources are insufficient, so job-level rescheduling is performed. After the fault is rectified, resources are sufficient, but the pods of the job still cannot be scheduled.

Cause Analysis

During rescheduling, Volcano schedules pods in creation order: the pod that is created first is scheduled first. In this scenario, the worker pod is deleted and re-created first, so it is scheduled first. At that point, both nodes satisfy the worker pod's nodeSelector. In addition, the rescheduling logic lowers the score of a node on which a fault has occurred. Therefore, the worker pod is scheduled to the node-69-77 node, where the master pod originally ran, instead of the previously faulty worker-69-87 node. The master pod carries the nodeSelector "masterselector: dls-master-node", so it cannot select the remaining worker-69-87 node and has no node to run on. According to the consistency scheduling principle of Volcano, a job whose pods cannot all be placed is not scheduled, so the entire job cannot be scheduled.

Solution

  1. In the logical SuperPoD affinity scheduling scenario, the scheduling parameters (mainly nodeSelector) of all pods in the same job must be identical.
  2. In the non-logical SuperPoD affinity scheduling scenario, you are advised to keep the scheduling parameters consistent. If pods genuinely require different nodeSelectors, the selectors must be mutually exclusive. That is, each type of pod can be scheduled only to its own node resource pool. For example, in the preceding scenario, you can add a nodeSelector to the worker pod to ensure that the worker pod cannot be scheduled to the node required by the master pod.
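
The second option can be sketched as the pair of pod template fragments below. The master selector is the one from the preceding scenario. The worker label key and value are hypothetical placeholders, not names defined by the product; the sketch assumes that this label is applied only to nodes in the worker resource pool (for example, worker-69-87) and not to the node reserved for the master pod.

```yaml
# Master pod template fragment: selector taken from the scenario above.
spec:
  nodeSelector:
    masterselector: dls-master-node
---
# Worker pod template fragment. The key and value are hypothetical
# placeholders; use a label that exists only on worker nodes and is
# absent from the node required by the master pod.
spec:
  nodeSelector:
    workerselector: dls-worker-node
```

With selectors like these, the worker pod can no longer occupy the master pod's node, regardless of which pod is re-created and scheduled first during rescheduling.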