Adaptation Example of MindSpore (MindFormers)

Set up the training environment and start the training. For details, see Adaptation Example of MindSpore (MindFormers).
Enable process-level online recovery. For details, see Configuring Process-Level Online Recovery.

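To inject a fault, add the following code to QWEN3_for_MS_code/mindformers/core/callback/callback.py: the ast import, the GLB_CNT and EPOCH_CNT globals, and the block in print_output_info between the logger.info call and the tensor_writer check.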

import json
import os
...
import ast
GLB_CNT = 0
EPOCH_CNT = 0
...
    def print_output_info(self, cb_params, cur_epoch_num, origin_epochs, throughput,
                          cur_step_num, steps_per_epoch, loss, per_step_seconds,
                          overflow, scaling_sens, time_remain, percent, global_norm):
        """print output information."""
        ...
        logger.info("  %4.1f%% %s %.5f samples/s/p  %s }", percent, show_str, throughput,
                    datetime.timedelta(seconds=int(time_remain)))
        global GLB_CNT
        global EPOCH_CNT
        # Reset the per-epoch injection counter when a new epoch starts.
        if EPOCH_CNT < cur_epoch_num:
            GLB_CNT = 0
            EPOCH_CNT = cur_epoch_num
        # Parse the {step: rank} mapping, for example "{3:1,10:2}".
        uce_env = os.getenv("RAISE_UCE_ERROR_STEP_AND_RANK", "{}")
        uce_step_rank = ast.literal_eval(uce_env)
        # Raise the fault only on the configured rank, and only once per step.
        if cur_step_num in uce_step_rank and get_rank() == uce_step_rank[cur_step_num] and GLB_CNT < cur_step_num:
            GLB_CNT = cur_step_num
            print(f"############# rank:{get_rank()} start UCE error #############")
            raise RuntimeError('UCEError occured.')
        if self.tensor_writer is not None:
            ...
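
The injection condition above can be exercised outside the training job. The following is a minimal sketch, not part of callback.py: should_inject is a hypothetical helper name, and last_injected_step stands in for GLB_CNT. It assumes the training process stays alive across recovery, so the module-level counter persists and a replayed step does not raise the fault a second time.

```python
import ast


def should_inject(env_value, cur_step_num, rank, last_injected_step):
    """Return True if the fault would be raised at this step on this rank.

    Hypothetical helper that mirrors the condition added to print_output_info.
    """
    # ast.literal_eval safely parses the dict literal, e.g. "{3:1,10:2}".
    step_to_rank = ast.literal_eval(env_value)
    return (
        cur_step_num in step_to_rank
        and rank == step_to_rank[cur_step_num]
        and last_injected_step < cur_step_num
    )


env_value = "{3:1,10:2}"
parsed = ast.literal_eval(env_value)  # {3: 1, 10: 2}

# First pass through step 3 on rank 1: the fault fires.
first = should_inject(env_value, 3, 1, last_injected_step=0)
# Step 3 replayed with the counter already at 3: no second fault.
replay = should_inject(env_value, 3, 1, last_injected_step=3)
# Step 3 on a rank that is not configured for it: no fault.
other = should_inject(env_value, 3, 2, last_injected_step=0)

print(parsed, first, replay, other)
```

Because the keys and values are parsed as integers, the environment variable must be a valid Python dict literal; an empty or unset value falls back to "{}" and disables injection.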

Modify the startup script QWEN3_for_MS_code/scripts/msrun_launcher.sh.

...
export RAISE_UCE_ERROR_STEP_AND_RANK="{3:1,10:2}"  # Map each fault-injection iteration to a rank: inject a UCE fault on rank 1 at iteration 3 and on rank 2 at iteration 10.
sed -i 's/err_strategy = _get_uce_process_strategy()/err_strategy = "RS_UCE_LOWLEVEL"/g' $(pip3 show mindspore | grep Location | awk -F ' ' '{print $2}')/mindspore/train/callback/_train_fault_tolerance.py # Modify the UCE processing policy.
...
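
The effect of the sed substitution can be illustrated in Python. This is only a sketch: the sample line below is hypothetical, and the actual indentation and surrounding code in _train_fault_tolerance.py may differ. It shows that the call to _get_uce_process_strategy() is replaced with the fixed string "RS_UCE_LOWLEVEL".

```python
# Hypothetical sample of the line that the sed command targets.
sample = "        err_strategy = _get_uce_process_strategy()\n"

# Same literal substitution as the sed expression in the startup script.
patched = sample.replace(
    "err_strategy = _get_uce_process_strategy()",
    'err_strategy = "RS_UCE_LOWLEVEL"',
)

print(patched, end="")
```

Note that sed -i edits the installed MindSpore package in place, so the change stays in effect for later runs in the same environment until the file is restored.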
