When Volcano and Ascend Operator Are Used, Status of All Pods of a Faulty Job on the Service Plane Becomes Failed and the Job Cannot Trigger Unconditional Retry-Based Rescheduling

Symptom

When the Volcano and Ascend Operator components are used, all pods of a faulty job on the service plane enter the Failed state, and the job is not rescheduled through unconditional retries.

Cause Analysis

If the status of all pods of a faulty job on the service plane becomes Failed, Volcano sets the job status to Failed. By default, a job in the Failed state does not trigger unconditional retry-based rescheduling.

Solution

You can modify the Volcano source code and job YAML file to enable unconditional retry-based rescheduling even when the status of all pods becomes Failed.

  1. In the open-source Volcano source file pkg/controllers/job/state/running.go, add a case for IgnoreAction to the Execute method of runningState.
    func (ps *runningState) Execute(action v1alpha1.Action) error {
        switch action {
        case v1alpha1.RestartJobAction:
            return KillJob(ps.job, PodRetainPhaseNone, func(status *vcbatch.JobStatus) bool {
                status.State.Phase = vcbatch.Restarting
                status.RetryCount++
                return true
            })
        case v1alpha1.AbortJobAction:
            return KillJob(ps.job, PodRetainPhaseSoft, func(status *vcbatch.JobStatus) bool {
                status.State.Phase = vcbatch.Aborting
                return true
            })
        case v1alpha1.TerminateJobAction:
            return KillJob(ps.job, PodRetainPhaseSoft, func(status *vcbatch.JobStatus) bool {
                status.State.Phase = vcbatch.Terminating
                return true
            })
        case v1alpha1.CompleteJobAction:
            return KillJob(ps.job, PodRetainPhaseSoft, func(status *vcbatch.JobStatus) bool {
                status.State.Phase = vcbatch.Completing
                return true
            })
        case v1alpha1.IgnoreAction: // Added: return without changing the job state.
            return nil
        default:
  2. In the open-source Volcano source file vendor/volcano.sh/apis/pkg/apis/bus/v1alpha1/actions.go, add the IgnoreAction constant to the existing Action constants.
    IgnoreAction Action = "Ignore"
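
The effect of the two changes above can be illustrated with a small standalone Go sketch. It is a simplified model and not Volcano code: the types Action, Phase, and jobStatus and the function execute are hypothetical stand-ins, and the default branch only assumes the behavior described in Cause Analysis (the job is marked Failed once every pod has failed). The sketch shows that an early return for the Ignore action leaves the job phase untouched.

```go
package main

import "fmt"

// Action and Phase are simplified stand-ins for Volcano's types (hypothetical).
type Action string
type Phase string

const (
	RestartJobAction Action = "RestartJob"
	IgnoreAction     Action = "Ignore" // same string value as the constant added in step 2
	SyncJobAction    Action = "SyncJob"

	Running    Phase = "Running"
	Restarting Phase = "Restarting"
	Failed     Phase = "Failed"
)

// jobStatus is a toy model of a job's status, not Volcano's JobStatus.
type jobStatus struct {
	Phase      Phase
	RetryCount int
	FailedPods int
	TotalPods  int
}

// execute mirrors the shape of the switch statement in step 1.
func execute(s *jobStatus, action Action) error {
	switch action {
	case RestartJobAction:
		s.Phase = Restarting
		s.RetryCount++
		return nil
	case IgnoreAction:
		// Return early so the default branch never evaluates the pod counts.
		return nil
	default:
		// Assumed default behavior: mark the job Failed once every pod has failed.
		if s.TotalPods > 0 && s.FailedPods == s.TotalPods {
			s.Phase = Failed
		}
		return nil
	}
}

func main() {
	ignored := &jobStatus{Phase: Running, FailedPods: 2, TotalPods: 2}
	_ = execute(ignored, IgnoreAction)
	fmt.Println("after Ignore:", ignored.Phase)

	synced := &jobStatus{Phase: Running, FailedPods: 2, TotalPods: 2}
	_ = execute(synced, SyncJobAction)
	fmt.Println("after default:", synced.Phase)
}
```

In this model, the job that receives the Ignore action stays in the Running phase even though all of its pods have failed, while the same job handled by the default branch moves to Failed.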