Service Plane Faults

Resumable training allows the Volcano scheduler to detect and process job failures caused by service plane faults. A service plane fault occurs when the container exits abnormally after all training processes in the container exit abnormally. As a result, the pod status changes to Failed. In scenarios where Ascend Operator is used, only some pods of a job can be faulty when a service plane fault occurs. If the status of all pods of a job changes to Failed in a few seconds, the job will not be rescheduled and will be deemed as failed.

Figure 1 shows the principle of detecting a service plane fault.

Figure 1 Detection principle

The scheduler continuously queries the pod status of each job in polling mode to detect service plane faults and report them. You can handle service plane faults based on specific service requirements. For a service plane fault detected by resumable training, Volcano checks whether to enable unconditional retry. If this function is enabled, the system reschedules the training job to the node that does not trigger rescheduling and re-executes it. Then the number of retry times decreases by 1. If the number of retry times is 0 or the unconditional retry function is disabled, the system does not handle the service container fault.

To use unconditional retry, you need to configure the following parameters in the job YAML file: fault-retry-times, restartPolicy, and policies. For details about the parameters, see YAML Parameters.
When Ascend Operator is used, if you want to reschedule a job after the status of all job pods changes to Failed, see When Volcano and Ascend Operator Are Used, Status of All Pods of a Faulty Job on the Service Plane Becomes Failed and the Job Cannot Trigger Unconditional Retry-Based Rescheduling.

Watchdog Fault Detection

If an NPU job is executed abnormally (the service plane is faulty), the normal NPU may fail to communicate with the faulty NPU. Consequently, the collective communication of all normal NPUs enters a timeout state. The collective communication for the job exits only after a timeout exception occurs, which is 30 minutes by default. If watchdog (with unconditional retry upon service plane faults enabled) is enabled, the faulty NPU can be isolated after the exception occurs, and the job can be rescheduled to a healthy NPU. In this way, the job can exit within 6 minutes.

Watchdog can be used only by the PyTorch framework of the Atlas A2 training product when job execution exceptions occur on NPUs.

Required Components

To ensure the normal use of service plane fault detection, install Volcano and Ascend Operator.

Supported Fault Handling Types

Include job-level rescheduling, pod-level rescheduling, process-level rescheduling, and graceful fault tolerance.

Parent topic: Fault Detection