When Volcano and Ascend Operator Are Used, Status of All Pods of a Faulty Job on the Service Plane Becomes Failed and the Job Cannot Trigger Unconditional Retry-Based Rescheduling
Symptom
When the Volcano and Ascend Operator components are used and all pods of a faulty job on the service plane enter the Failed state, the job is not rescheduled through unconditional retries.
Cause Analysis
If the status of all pods of a faulty job on the service plane becomes Failed, Volcano sets the job status to Failed. By default, the job will not trigger unconditional retry-based rescheduling.
Solution
You can modify the Volcano source code and job YAML file to enable unconditional retry-based rescheduling even when the status of all pods becomes Failed.
- Modify the source code pkg/controllers/job/state/running.go of the open-source Volcano and add IgnoreAction.
    func (ps *runningState) Execute(action v1alpha1.Action) error {
        switch action {
        case v1alpha1.RestartJobAction:
            return KillJob(ps.job, PodRetainPhaseNone, func(status *vcbatch.JobStatus) bool {
                status.State.Phase = vcbatch.Restarting
                status.RetryCount++
                return true
            })
        case v1alpha1.AbortJobAction:
            return KillJob(ps.job, PodRetainPhaseSoft, func(status *vcbatch.JobStatus) bool {
                status.State.Phase = vcbatch.Aborting
                return true
            })
        case v1alpha1.TerminateJobAction:
            return KillJob(ps.job, PodRetainPhaseSoft, func(status *vcbatch.JobStatus) bool {
                status.State.Phase = vcbatch.Terminating
                return true
            })
        case v1alpha1.CompleteJobAction:
            return KillJob(ps.job, PodRetainPhaseSoft, func(status *vcbatch.JobStatus) bool {
                status.State.Phase = vcbatch.Completing
                return true
            })
        case v1alpha1.IgnoreAction: // Add this case for v1alpha1.IgnoreAction.
            return nil
        default:
            // ...
- Modify the source code vendor/volcano.sh/apis/pkg/apis/bus/v1alpha1/actions.go of the open-source Volcano and add IgnoreAction.
    IgnoreAction Action = "Ignore"
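The effect of the added case can be illustrated with a small standalone sketch. It is not Volcano code: toyJob, Phase, and the patched flag are hypothetical names introduced only for this illustration, and the default branch is an assumed simplification in which a job whose pods have all failed is marked Failed. Only the IgnoreAction constant and its value "Ignore" come from the steps above.

```go
package main

import "fmt"

// Action is a string-typed action, in the same style as the constant added to actions.go.
type Action string

const (
	RestartJobAction Action = "RestartJob"
	AbortJobAction   Action = "AbortJob"
	IgnoreAction     Action = "Ignore" // the constant added in actions.go
)

// Phase is a simplified, hypothetical stand-in for the job phase.
type Phase string

const (
	Running    Phase = "Running"
	Restarting Phase = "Restarting"
	Aborting   Phase = "Aborting"
	Failed     Phase = "Failed"
)

// toyJob is a hypothetical minimal job model, not a Volcano type.
type toyJob struct {
	Phase      Phase
	RetryCount int
	Pods       int
	FailedPods int
}

// execute imitates the shape of the switch in running.go.
// patched selects whether the IgnoreAction case is present.
func execute(j *toyJob, action Action, patched bool) {
	if patched && action == IgnoreAction {
		return // the added case: leave the job status untouched
	}
	switch action {
	case RestartJobAction:
		j.Phase = Restarting
		j.RetryCount++
	case AbortJobAction:
		j.Phase = Aborting
	default:
		// Assumed simplification: when every pod has failed, the job is marked Failed.
		if j.Pods > 0 && j.FailedPods == j.Pods {
			j.Phase = Failed
		}
	}
}

func main() {
	unpatched := &toyJob{Phase: Running, Pods: 2, FailedPods: 2}
	execute(unpatched, IgnoreAction, false)

	patched := &toyJob{Phase: Running, Pods: 2, FailedPods: 2}
	execute(patched, IgnoreAction, true)

	fmt.Println("without IgnoreAction case:", unpatched.Phase)
	fmt.Println("with IgnoreAction case:", patched.Phase)
}
```

In this sketch, the unpatched path lets the action fall through to the default branch, which moves the job to Failed, while the patched path returns early and keeps the job in Running so that a later retry can still act on it.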