Training Logs Are Overwritten After Task Rescheduling
Symptom
If a fault occurs while a training job is running, the faulty pod task is rescheduled (resumable training) and the training logs are overwritten. As a result, the logs written before the rescheduling are lost.
Cause Analysis
The log generation logic varies with the training framework. Some frameworks overwrite existing logs, so the logs written before the rescheduling are lost.
Solution
Modify the log creation part of the training script (train_start.sh) to obtain a timestamp with the date command and include it in the log path, so that each rescheduling generates a new log path. The following uses the training log path TRAIN_LOG_PATH as an example.
timestamp=$(date +"%Y%m%d%H%M%S")
mkdir -p /job/code/alllogs/$MINDX_TASK_ID/trainlogs/$XDL_IP-$RANK-$timestamp # Create a path for storing training logs (-p also creates missing parent directories).
export TRAIN_LOG_PATH=/job/code/alllogs/$MINDX_TASK_ID/trainlogs/$XDL_IP-$RANK-$timestamp # Set the path for storing training logs.
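The same idea can be wrapped in a small helper so that the path is built in one place. This is a minimal sketch, not part of the product: the function name make_train_log_dir, the optional log root argument, and the fallback values used when MINDX_TASK_ID, XDL_IP, or RANK are unset are all assumptions made for illustration.

```shell
# Hypothetical helper (not part of train_start.sh as shipped).
# Creates a timestamped log directory and prints its path.
# $1: log root directory; defaults to /job/code/alllogs as in the example above.
make_train_log_dir() (
    log_root="${1:-/job/code/alllogs}"
    timestamp=$(date +"%Y%m%d%H%M%S")
    # The fallback values after ":-" are assumptions for runs outside the cluster.
    dir="${log_root}/${MINDX_TASK_ID:-unknown-task}/trainlogs/${XDL_IP:-unknown-ip}-${RANK:-0}-${timestamp}"
    # -p creates missing parent directories and does not fail if the path exists.
    mkdir -p "$dir" || exit 1
    printf '%s\n' "$dir"
)

# Possible use in the training script (train.py is a placeholder name):
#   TRAIN_LOG_PATH=$(make_train_log_dir) || exit 1
#   export TRAIN_LOG_PATH
#   python train.py 2>&1 | tee -a "$TRAIN_LOG_PATH/train.log"
```

The timestamp has one-second resolution, so two restarts within the same second would share a directory. If that matters, append a further distinguishing value such as the process ID ($$).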