Adaptation Example for PyTorch (MindSpeed-LLM)

Set up the training environment and start the training. For details, see Adaptation Example for PyTorch (MindSpeed-LLM).
Enable process-level online recovery. For details, see Configuring Process-Level Online Recovery.

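Add the fault injection code shown below to QWEN3_for_PyTorch_2.7_code/mindspeed_llm/training/training.py: the os and ast imports, the GLB_CNT global variable, and the block in train() that begins with global GLB_CNT. The new code reads the environment variable RAISE_UCE_ERROR_STEP_AND_RANK to determine the iteration at which to inject the fault and the rank on which to inject it.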

import os
import ast
...
GLB_CNT = 0
 
def train(forward_step_func, model, optimizer, opt_param_scheduler,
          train_data_iterator, valid_data_iterator,
          process_non_loss_data_func, config):
    """Train the model function."""
    args = get_args()
    timers = get_timers()
    ...
    while iteration < args.train_iters:
        ...
        num_microbatches = get_num_microbatches()
        update_num_microbatches(args.consumed_train_samples, consistency_check=True)
 
        global GLB_CNT
        cur_rank = torch.distributed.get_rank()
        uce_env = os.getenv("RAISE_UCE_ERROR_STEP_AND_RANK", "{}")
        uce_step_rank = ast.literal_eval(uce_env)
        if iteration in uce_step_rank and cur_rank == uce_step_rank[iteration] and GLB_CNT < iteration:
            GLB_CNT = iteration
            print(f"############# rank:{cur_rank} start UCE error #############")        
            raise RuntimeError('UCE ERROR')
 
        args.curr_iteration = iteration
        ...
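
The injection condition above can be tried outside of training. The following is a minimal sketch, not part of MindSpeed-LLM: the function name should_inject and the variable _last_injected_step are hypothetical stand-ins for the inline check and for GLB_CNT. It assumes that the training process stays alive across recovery, so the module-level counter persists and the same iteration does not trigger a second fault when it is replayed.

```python
import ast

# Hypothetical stand-in for GLB_CNT: the last iteration at which this
# process injected a fault.
_last_injected_step = 0


def should_inject(iteration, rank, env_value):
    """Return True if a fault should be raised at this iteration on this rank.

    env_value is the raw string of RAISE_UCE_ERROR_STEP_AND_RANK,
    a dict literal that maps iteration -> rank, for example "{3:1,10:2}".
    """
    global _last_injected_step
    step_to_rank = ast.literal_eval(env_value)
    if (iteration in step_to_rank
            and rank == step_to_rank[iteration]
            and _last_injected_step < iteration):
        # Remember the iteration so that replaying it does not inject again.
        _last_injected_step = iteration
        return True
    return False


if __name__ == "__main__":
    env = "{3:1,10:2}"
    print(should_inject(3, 1, env))   # first pass through iteration 3 on rank 1
    print(should_inject(3, 1, env))   # iteration 3 replayed on the same process
    print(should_inject(10, 2, env))  # iteration 10 on rank 2
```

Because the counter is compared with the current iteration, each configured iteration injects at most one fault per process, which lets training continue past that iteration once recovery completes.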

Modify the startup script QWEN3_for_PyTorch_2.7_code/scripts/train_start.sh.

...
export RAISE_UCE_ERROR_STEP_AND_RANK="{3:1,10:2}" # Map each fault injection iteration to the rank that raises the fault. This value injects a UCE fault on rank 1 at iteration 3 and on rank 2 at iteration 10.
sed -i 's/check_memory_result = torch_npu.npu.check_uce_in_memory(device)/check_memory_result = ha_constant.UCE_HIGH_LEVEL/g' /job/code/mindspeed_llm/core/high_availability/tft_stop_clean.py # Modify the return value of the PTA API to identify the exception thrown by the training code as a UCE fault.
...
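
The effect of the sed command can be previewed on a single line of text. The sketch below is illustrative only: it assumes that tft_stop_clean.py contains the assignment exactly as written in the sed pattern, and the leading indentation of the sample line is made up. A plain string replacement is used in place of sed because the pattern matches this line literally.

```python
# Sample of the line that the sed command targets (indentation is illustrative).
before = "    check_memory_result = torch_npu.npu.check_uce_in_memory(device)\n"

# Pattern and replacement copied from the sed command in train_start.sh.
pattern = "check_memory_result = torch_npu.npu.check_uce_in_memory(device)"
replacement = "check_memory_result = ha_constant.UCE_HIGH_LEVEL"

after = before.replace(pattern, replacement)
print(after, end="")
```

After the substitution, the memory check result is a fixed constant instead of the value returned by the device query, so the RuntimeError raised by the injected code is handled as a UCE fault.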
