Training Process Exits with an Error but the Pod Status Is Not Error, So Service Plane Rescheduling Is Not Triggered

Symptom

After a training job is started, the training process exits with an error, but the pod status is Completed instead of Error.

Cause Analysis

The pod status is determined by the exit code: an exit code of 0 results in the Completed status, and any nonzero exit code results in the Error status. The training process is started by the train_start.sh script, which captures the return code of the Python program incorrectly. As a result, the script exits with code 0 even though the Python program failed, and the pod status is Completed.

Service plane rescheduling can be triggered only when the exit code is not 0.

Solution

The content of the train_start.sh script varies from case to case. The following is only an example. Modify your own script based on its actual content.

  1. Check the train_start.sh script and locate the code that captures the return code of the Python program.

    In the figure, the output of the Python program is piped (|) to the tee command, which prints the output and saves it to a log. Because $? holds the exit code of the last command in a pipeline, it receives the result of tee, not of the Python program. tee runs without error, so $? is 0 and the script exits with code 0. As a result, even when the training process exits with an error, the pod status is Completed instead of Error, and service plane rescheduling cannot be triggered.

  2. Modify the code that captures the exit code of the Python program: use PIPESTATUS[0] to obtain the exit code of the command before the pipe (|), that is, the exit code of the Python program. After modification:
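
A minimal sketch of the difference between the two approaches, assuming the script runs under bash (PIPESTATUS is a bash array and is not available in plain sh). The train_cmd function is a hypothetical stand-in for the real Python command, and the log path is a temporary file created only for illustration; neither is taken from an actual train_start.sh.

```shell
#!/bin/bash
# Sketch only. train_cmd is a hypothetical stand-in for the real
# "python ..." training command; it prints a message and fails with code 3.
train_cmd() {
    echo "training failed"
    return 3
}

# Hypothetical log file used only for this illustration.
log_file="$(mktemp)"

# Before: $? holds the exit code of tee, the last command in the pipeline.
train_cmd 2>&1 | tee "${log_file}"
status_before=$?

# After: PIPESTATUS[0] holds the exit code of the first command in the
# pipeline. Read it immediately, because the next command overwrites it.
train_cmd 2>&1 | tee "${log_file}"
status_after=${PIPESTATUS[0]}

echo "before: ${status_before}, after: ${status_after}"
rm -f "${log_file}"

# In train_start.sh, the script would then end with: exit "${status_after}"
```

With the stand-in command above, status_before is 0 and status_after is 3. Only the second value, passed to exit, makes the container terminate with a nonzero code so that the pod status becomes Error.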