MindSpore Scenario Adaptation Example (Based on MindFormers)

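  1. Set up the training environment and launch training. For details, see MindSpore Scenario Adaptation Example (Based on MindFormers).
  2. Enable process-level online recovery. For details, see Configuring Process-Level Online Recovery.
  3. In "LLAMA2_for_MS_code/mindformers/core/callback/callback.py", add the fault-injection stub shown below: the "import ast" line, the GLB_CNT and EPOCH_CNT globals, and the block from "global GLB_CNT" through the "raise" statement (bold in the original page).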
    import json
    import os
    ...
    import ast
    GLB_CNT = 0
    EPOCH_CNT = 0
    ...
        def print_output_info(self, cb_params, cur_epoch_num, origin_epochs, throughput,
                              cur_step_num, steps_per_epoch, loss, per_step_seconds,
                              overflow, scaling_sens, time_remain, percent, global_norm):
            """print output information."""
            ...
            logger.info("  %4.1f%% %s %.5f samples/s/p  %s }", percent, show_str, throughput,
                        datetime.timedelta(seconds=int(time_remain)))
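            # Fault-injection stub: reset the guard counter when a new epoch starts.
            global GLB_CNT
            global EPOCH_CNT
            if EPOCH_CNT < cur_epoch_num:
                GLB_CNT = 0
                EPOCH_CNT = cur_epoch_num
            # Maps iteration number to rank, e.g. "{3:1,10:2}"; defaults to an empty dict.
            uce_env = os.getenv("RAISE_UCE_ERROR_STEP_AND_RANK", "{}")
            uce_step_rank = ast.literal_eval(uce_env)
            # Raise only on the configured rank at the configured iteration. GLB_CNT keeps
            # the same iteration from raising again if it is re-executed in this process.
            if cur_step_num in uce_step_rank and get_rank() == uce_step_rank[cur_step_num] and GLB_CNT < cur_step_num:
                GLB_CNT = cur_step_num
                print(f"############# rank:{get_rank()} start UCE error #############")
                raise RuntimeError('UCEError occured.')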
            if self.tensor_writer is not None:
                ...
  4. Modify the startup script "LLAMA2_for_MS_code/scripts/msrun_launcher.sh".
    …
    export RAISE_UCE_ERROR_STEP_AND_RANK="{3:1,10:2}"  # Iterations and ranks for fault injection: inject a UCE fault on rank 1 at iteration 3 and on rank 2 at iteration 10
    sed -i 's/err_strategy = _get_uce_process_strategy()/err_strategy = "RS_UCE_LOWLEVEL"/g' $(pip3 show mindspore | grep Location | awk -F ' ' '{print $2}')/mindspore/train/callback/_train_fault_tolerance.py  # Change the UCE handling strategy
    …
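
The injection logic in steps 3 and 4 can be tried without MindSpore or an NPU. The following is a minimal sketch, not the MindFormers implementation: `parse_injection_plan` and `InjectionGuard` are hypothetical names introduced here for illustration, and the rank is passed in as a plain argument instead of being read from `get_rank()`. It parses a value in the same format as `RAISE_UCE_ERROR_STEP_AND_RANK` and mirrors the `GLB_CNT` / `EPOCH_CNT` bookkeeping, so you can check which (iteration, rank) pairs would trigger a fault before editing callback.py.

```python
import ast


def parse_injection_plan(raw: str) -> dict:
    """Parse a '{iteration: rank}' string such as '{3:1,10:2}' into a dict."""
    plan = ast.literal_eval(raw)
    if not isinstance(plan, dict):
        raise ValueError(f"expected a dict literal, got: {raw!r}")
    return plan


class InjectionGuard:
    """Hypothetical stand-in for the GLB_CNT / EPOCH_CNT globals in the stub."""

    def __init__(self, plan: dict):
        self.plan = plan
        self.last_step = 0  # plays the role of GLB_CNT
        self.epoch = 0      # plays the role of EPOCH_CNT

    def should_inject(self, epoch: int, step: int, rank: int) -> bool:
        # A new epoch resets the guard, as in the stub.
        if self.epoch < epoch:
            self.last_step = 0
            self.epoch = epoch
        # Fire only on the configured rank, and only once per iteration.
        if step in self.plan and self.plan[step] == rank and self.last_step < step:
            self.last_step = step
            return True
        return False


if __name__ == "__main__":
    guard = InjectionGuard(parse_injection_plan("{3:1,10:2}"))
    # Iteration 3 appears twice to mimic a re-run after recovery.
    for step in (1, 2, 3, 3, 4):
        print(step, guard.should_inject(epoch=1, step=step, rank=1))
```

In the stub itself the guard state lives in module-level globals, so it only suppresses a repeated fault while the training process stays alive.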