Adaptation Example for PyTorch (MindSpeed-LLM)

  1. Set up the training environment and start training. For details, see Adaptation Example for PyTorch (MindSpeed-LLM).
  2. Enable process-level online recovery. For details, see Configuring Process-Level Online Recovery.
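  3. To inject a fault, add the following code to QWEN3_for_PyTorch_2.7_code/mindspeed_llm/training/training.py. The new lines are the os and ast imports, the GLB_CNT global variable, and the fault-injection block between update_num_microbatches(...) and args.curr_iteration = iteration. The new code reads the target iteration and fault rank from the environment variable RAISE_UCE_ERROR_STEP_AND_RANK.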
    import os
    import ast
    ...
    GLB_CNT = 0
     
    def train(forward_step_func, model, optimizer, opt_param_scheduler,
              train_data_iterator, valid_data_iterator,
              process_non_loss_data_func, config):
        """Train the model function."""
        args = get_args()
        timers = get_timers()
        ...
        while iteration < args.train_iters:
            ...
            num_microbatches = get_num_microbatches()
            update_num_microbatches(args.consumed_train_samples, consistency_check=True)
     
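            # Fault injection: read the {iteration: rank} mapping from the environment.
            global GLB_CNT
            cur_rank = torch.distributed.get_rank()
            uce_env = os.getenv("RAISE_UCE_ERROR_STEP_AND_RANK", "{}")
            uce_step_rank = ast.literal_eval(uce_env)
            # Raise only on the configured rank. GLB_CNT records the last iteration at which
            # this process raised a fault, so the fault is not raised again if that iteration is re-run.
            if iteration in uce_step_rank and cur_rank == uce_step_rank[iteration] and GLB_CNT < iteration:
                GLB_CNT = iteration
                print(f"############# rank:{cur_rank} start UCE error #############")
                raise RuntimeError('UCE ERROR')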
     
            args.curr_iteration = iteration
            ...
  4. Modify the startup script QWEN3_for_PyTorch_2.7_code/scripts/train_start.sh.
    ...
    export RAISE_UCE_ERROR_STEP_AND_RANK="{3:1,10:2}" # Map each fault-injection iteration to a rank. Here, a UCE fault is injected on rank 1 at iteration 3 and on rank 2 at iteration 10.
    sed -i 's/check_memory_result = torch_npu.npu.check_uce_in_memory(device)/check_memory_result = ha_constant.UCE_HIGH_LEVEL/g' /job/code/mindspeed_llm/core/high_availability/tft_stop_clean.py # Replace the return value of the PTA API torch_npu.npu.check_uce_in_memory with ha_constant.UCE_HIGH_LEVEL so that the exception thrown by the training code is identified as a UCE fault.
    ...
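
The injection logic added in step 3 can be checked locally before it is pasted into training.py. The following is a minimal sketch and is not part of MindSpeed-LLM: parse_fault_plan, FaultInjector, and simulate are hypothetical names introduced here for illustration, the type validation is an addition that the original snippet does not perform, and the retry loop only assumes that a faulted iteration is re-run in the same process after recovery. It shows how a value such as "{3:1,10:2}" is parsed and how a GLB_CNT-style guard keeps the fault from firing again when the iteration is retried.

```python
import ast
import os

ENV_NAME = "RAISE_UCE_ERROR_STEP_AND_RANK"


def parse_fault_plan(raw):
    """Parse a string such as "{3:1,10:2}" into a dict {iteration: rank}.

    The validation below is an illustrative addition; the snippet in step 3
    calls ast.literal_eval without checking the result.
    """
    plan = ast.literal_eval(raw)
    if not isinstance(plan, dict):
        raise ValueError(f"{ENV_NAME} must be a dict literal, got {raw!r}")
    for step, rank in plan.items():
        if not (isinstance(step, int) and isinstance(rank, int)):
            raise ValueError(f"{ENV_NAME} keys and values must be integers")
    return plan


def load_fault_plan_from_env():
    """Read the plan from the environment; an unset variable means no faults."""
    return parse_fault_plan(os.getenv(ENV_NAME, "{}"))


class FaultInjector:
    """Hypothetical helper that mirrors the condition used in step 3."""

    def __init__(self, plan):
        self.plan = plan
        self.last_fired = 0  # plays the role of GLB_CNT

    def should_fire(self, iteration, rank):
        # Fire only on the configured rank, and only if this process has not
        # already fired at this (or a later) iteration.
        if self.plan.get(iteration) == rank and self.last_fired < iteration:
            self.last_fired = iteration
            return True
        return False


def simulate(plan, rank, train_iters):
    """Simplified single-rank loop: a faulted iteration is retried in place."""
    injector = FaultInjector(plan)
    events = []
    iteration = 1
    while iteration <= train_iters:
        if injector.should_fire(iteration, rank):
            events.append(("fault", iteration))
            continue  # assumed recovery behavior: re-run the same iteration
        events.append(("ok", iteration))
        iteration += 1
    return events


if __name__ == "__main__":
    for event in simulate(parse_fault_plan("{3:1,10:2}"), rank=1, train_iters=4):
        print(event)
```

In this sketch, rank 1 records a single fault at iteration 3 and then completes that iteration on the retry, while a rank that does not appear in the mapping never records a fault. The iteration numbering in simulate is simplified and may not match how training.py counts iterations.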