Adaptation Example for PyTorch (MindSpeed-LLM)
- Set up the training environment and start the training. For details, see Adaptation Example for PyTorch (MindSpeed-LLM).
- Enable process-level online recovery. For details, see Configuring Process-Level Online Recovery.
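- To inject a fault, add the fault-injection code shown below (the os and ast imports, the GLB_CNT counter, and the block that reads RAISE_UCE_ERROR_STEP_AND_RANK) to QWEN3_for_PyTorch_2.7_code/mindspeed_llm/training/training.py. The new code reads the iterations at which to inject a fault, and the rank to fault at each of them, from the environment variable RAISE_UCE_ERROR_STEP_AND_RANK.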
    import os
    import ast
    ...
    GLB_CNT = 0

    def train(forward_step_func, model, optimizer, opt_param_scheduler,
              train_data_iterator, valid_data_iterator,
              process_non_loss_data_func, config):
        """Train the model function."""
        args = get_args()
        timers = get_timers()
        ...
        while iteration < args.train_iters:
            …
            num_microbatches = get_num_microbatches()
            update_num_microbatches(args.consumed_train_samples, consistency_check=True)
            global GLB_CNT
            cur_rank = torch.distributed.get_rank()
            uce_env = os.getenv("RAISE_UCE_ERROR_STEP_AND_RANK", "{}")
            uce_step_rank = ast.literal_eval(uce_env)
            if iteration in uce_step_rank and cur_rank == uce_step_rank[iteration] and GLB_CNT < iteration:
                GLB_CNT = iteration
                print(f"############# rank:{cur_rank} start UCE error #############")
                raise RuntimeError('UCE ERROR')
            args.curr_iteration = iteration
            …

- Modify the startup script QWEN3_for_PyTorch_2.7_code/scripts/train_start.sh.
    ...
    # Configure the iterations and ranks for fault injection:
    # inject a UCE fault on rank 1 at iteration 3 and on rank 2 at iteration 10.
    export RAISE_UCE_ERROR_STEP_AND_RANK="{3:1,10:2}"
    # Modify the return value of the PTA API so that the exception thrown by
    # the training code is identified as a UCE fault.
    sed -i 's/check_memory_result = torch_npu.npu.check_uce_in_memory(device)/check_memory_result = ha_constant.UCE_HIGH_LEVEL/g' /job/code/mindspeed_llm/core/high_availability/tft_stop_clean.py
    ...
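
The injection condition can be tried outside a training job. The following is a minimal standalone sketch of the same check, not code from MindSpeed-LLM: the class FaultPlan and its method names are hypothetical, and last_injected plays the role of GLB_CNT. It assumes the environment variable holds a Python dict literal that maps an iteration to a rank, as in "{3:1,10:2}".

```python
import ast
import os


class FaultPlan:
    """Hypothetical helper that mirrors the injection check in train()."""

    def __init__(self, raw):
        # ast.literal_eval only evaluates Python literals, so "{3:1,10:2}"
        # becomes the dict {3: 1, 10: 2} without executing arbitrary code.
        plan = ast.literal_eval(raw)
        if not isinstance(plan, dict):
            raise ValueError("expected a dict literal such as '{3:1,10:2}'")
        self.plan = plan
        self.last_injected = 0  # same role as GLB_CNT in training.py

    @classmethod
    def from_env(cls, default="{}"):
        # Same variable name and default as the snippet in training.py.
        return cls(os.getenv("RAISE_UCE_ERROR_STEP_AND_RANK", default))

    def should_inject(self, iteration, rank):
        """Return True once per planned iteration, and only on the planned rank."""
        if (iteration in self.plan
                and rank == self.plan[iteration]
                and self.last_injected < iteration):
            # Record the iteration so the same fault is not raised again
            # if this iteration is executed a second time.
            self.last_injected = iteration
            return True
        return False


if __name__ == "__main__":
    plan = FaultPlan("{3:1,10:2}")
    for iteration, rank in [(3, 2), (3, 1), (3, 1), (10, 2)]:
        print(iteration, rank, plan.should_inject(iteration, rank))
```

Two properties follow directly from the condition. Because the counter starts at 0 and the comparison is strict, an entry for iteration 0 never fires. Because the counter is updated before the exception is raised, a process that stays alive and runs the same iteration again does not raise the fault a second time, which is presumably why the counter is kept in a module-level variable.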