Training Logs Are Overwritten After Task Rescheduling

Symptom

If a fault occurs during a training job and the faulty pod is rescheduled (resumable training), the rescheduled task overwrites the existing training logs. As a result, the training logs from the run before the fault are lost.

Cause Analysis

Log generation logic varies across training frameworks. Some frameworks overwrite existing logs instead of preserving them, so the logs written before rescheduling are lost.
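As an illustration of how an overwrite can happen, the sketch below contrasts truncating and appending redirection in a shell script. It is not taken from any particular framework; demo_dir, run.log, and run_append.log are hypothetical names used only for this demonstration.

```shell
#!/bin/bash
# Illustration only: demo_dir, run.log, and run_append.log are hypothetical names.
demo_dir=$(mktemp -d)

# Truncating redirection (>): the second "run" replaces the first run's log.
echo "first run: epoch 1 done"  > "$demo_dir/run.log"
echo "second run: epoch 1 done" > "$demo_dir/run.log"
overwritten_lines=$(grep -c 'run:' "$demo_dir/run.log")

# Appending redirection (>>): both runs remain in the log.
echo "first run: epoch 1 done"  >  "$demo_dir/run_append.log"
echo "second run: epoch 1 done" >> "$demo_dir/run_append.log"
appended_lines=$(grep -c 'run:' "$demo_dir/run_append.log")

echo "lines after overwrite: $overwritten_lines, lines after append: $appended_lines"
```

Appending avoids the loss but mixes several runs in one file, which is why a separate path per run is often easier to work with.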

Solution

Modify the log creation section of the training script (train_start.sh) so that it uses the date command to obtain a timestamp and includes that timestamp in the log path. Each rescheduling then generates a new log path, and the logs from earlier runs are retained. The following example uses the training log path TRAIN_LOG_PATH.

timestamp=$(date +"%Y%m%d%H%M%S")    # Obtain the current timestamp, for example 20240101120000.
mkdir -p "/job/code/alllogs/$MINDX_TASK_ID/trainlogs/$XDL_IP-$RANK-$timestamp"    # Create a path for storing training logs (-p also creates missing parent directories).
export TRAIN_LOG_PATH="/job/code/alllogs/$MINDX_TASK_ID/trainlogs/$XDL_IP-$RANK-$timestamp"   # Set the path for storing training logs.
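The same idea can be wrapped in a small helper, shown here as a minimal sketch rather than the script's actual implementation. LOG_ROOT, make_train_log_dir, and train.log are hypothetical names, and the default values exist only so that the snippet can run outside the cluster; in a real job, MINDX_TASK_ID, XDL_IP, and RANK are expected to be set already.

```shell
#!/bin/bash
# Sketch only: LOG_ROOT, make_train_log_dir, and train.log are hypothetical names.
# The defaults below are assumptions for running the snippet outside the cluster.
LOG_ROOT="${LOG_ROOT:-/tmp/alllogs}"
MINDX_TASK_ID="${MINDX_TASK_ID:-demo-task}"
XDL_IP="${XDL_IP:-127.0.0.1}"
RANK="${RANK:-0}"

# Create a timestamped log directory and print its path.
make_train_log_dir() {
    local timestamp dir
    timestamp=$(date +"%Y%m%d%H%M%S")
    dir="$LOG_ROOT/$MINDX_TASK_ID/trainlogs/$XDL_IP-$RANK-$timestamp"
    mkdir -p "$dir" || return 1
    printf '%s\n' "$dir"
}

TRAIN_LOG_PATH=$(make_train_log_dir) || exit 1
export TRAIN_LOG_PATH

# Append (>>) instead of truncating (>), so that a restart within the same
# second cannot clobber a file that already exists in this directory.
echo "training started at $(date)" >> "$TRAIN_LOG_PATH/train.log"
```

Because the directory name changes with every restart, each rescheduled run writes to its own location and the earlier directories stay untouched.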
