PyTorch Scenario Adaptation Example (Based on MindSpeed-LLM)

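Set up the training environment and start training. For details, see PyTorch Scenario Adaptation Example (Based on MindSpeed-LLM).
Enable process-level online recovery. For details, see Configuring Process-Level Online Recovery.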

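In "LLAMA2_for_PyTorch_2.7_code/mindspeed_llm/training/training.py", add the fault-injection stub shown below: the two imports, the module-level GLB_CNT counter, and the block inside the training loop that starts at "global GLB_CNT" and ends at the "raise" statement. The added code reads the environment variable "RAISE_UCE_ERROR_STEP_AND_RANK" to determine the iteration at which the fault is injected and the rank on which it is injected.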

import os
import ast
…
GLB_CNT = 0
 
def train(forward_step_func, model, optimizer, opt_param_scheduler,
          train_data_iterator, valid_data_iterator,
          process_non_loss_data_func, config):
    """Train the model function."""
    args = get_args()
    timers = get_timers()
…
    while iteration < args.train_iters:
        …
        num_microbatches = get_num_microbatches()
        update_num_microbatches(args.consumed_train_samples, consistency_check=True)
 
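        # Fault-injection stub: RAISE_UCE_ERROR_STEP_AND_RANK maps iteration -> rank, e.g. "{3:1,10:2}".
        global GLB_CNT
        cur_rank = torch.distributed.get_rank()
        uce_env = os.getenv("RAISE_UCE_ERROR_STEP_AND_RANK", "{}")
        uce_step_rank = ast.literal_eval(uce_env)
        # GLB_CNT records the last injected iteration, so the same iteration is not injected twice.
        if iteration in uce_step_rank and cur_rank == uce_step_rank[iteration] and GLB_CNT < iteration:
            GLB_CNT = iteration
            print(f"############# rank:{cur_rank} start UCE error #############")
            raise RuntimeError('UCE ERROR')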
 
        args.curr_iteration = iteration
        …
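
The injection condition above can be checked outside the training job. The following is a minimal sketch, not part of MindSpeed-LLM: the helper names parse_injection_plan, should_inject, and simulate are hypothetical, and the simulation assumes that recovery re-enters the loop at the same iteration without restarting the process, which is the case the GLB_CNT guard appears to be written for.

```python
import ast


def parse_injection_plan(raw):
    """Parse a string such as "{3:1,10:2}" into a dict of {iteration: rank}."""
    plan = ast.literal_eval(raw)
    if not isinstance(plan, dict):
        raise ValueError(f"expected a dict literal, got {raw!r}")
    return plan


def should_inject(iteration, rank, plan, last_injected):
    """Same condition as the stub in train(): inject once per configured iteration."""
    return iteration in plan and rank == plan[iteration] and last_injected < iteration


def simulate(rank, plan, total_iters):
    """Return the iterations at which the given rank would raise the fault.

    Assumption: after a fault, the same iteration is retried in the same
    process, so the counter (GLB_CNT in the stub) keeps its value.
    """
    last_injected = 0  # plays the role of GLB_CNT
    events = []
    iteration = 0
    while iteration < total_iters:
        if should_inject(iteration, rank, plan, last_injected):
            last_injected = iteration
            events.append(iteration)
            continue  # retry the same iteration; the guard now blocks re-injection
        iteration += 1
    return events


plan = parse_injection_plan("{3:1,10:2}")
for rank in range(3):
    print(rank, simulate(rank, plan, total_iters=12))
```

Because the counter starts at 0 and the guard requires it to be strictly less than the current iteration, an entry for iteration 0 would never trigger a fault under this condition.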

Modify the launch script "LLAMA2_for_PyTorch_2.7_code/scripts/train_start.sh".

…
export RAISE_UCE_ERROR_STEP_AND_RANK="{3:1,10:2}"  # Iterations and ranks for fault injection: inject a UCE fault on rank 1 at iteration 3 and on rank 2 at iteration 10
sed -i 's/check_memory_result = torch_npu.npu.check_uce_in_memory(device)/check_memory_result = UCE_HIGH_LEVEL/g' $(pip3 show mindio_ttp | grep Location | awk -F ' ' '{print $2}')/mindio_ttp/adaptor/tft_stop_clean.py  # Override the PTA interface return value so that the exception raised by the training code is identified as a UCE fault
…
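
Note that sed -i makes no change and reports no error when the pattern does not match, for example if the installed mindio_ttp version words that line differently. The sketch below reproduces the substitution as a plain string replacement and counts the matches, so the effect can be verified. The function name patch_source and the sample line are hypothetical; the actual contents of tft_stop_clean.py are not shown in this document.

```python
ORIGINAL_CALL = "check_memory_result = torch_npu.npu.check_uce_in_memory(device)"
PATCHED_ASSIGNMENT = "check_memory_result = UCE_HIGH_LEVEL"


def patch_source(text):
    """Return (patched_text, number_of_replacements) for the given source text."""
    count = text.count(ORIGINAL_CALL)
    return text.replace(ORIGINAL_CALL, PATCHED_ASSIGNMENT), count


# Hypothetical sample line standing in for the real file contents.
sample = "    check_memory_result = torch_npu.npu.check_uce_in_memory(device)\n"
patched, count = patch_source(sample)
print(count)
print(patched, end="")
```

A count of 0 on the real file would mean the substitution did not apply and the raised RuntimeError would not be treated as a UCE fault.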
