Ascend Fault Case Details

Changed Hook Index Order Causes a Training Error

Updated: 2022/07/26

Issue Information

Issue source: Official
Product category: Model training
Product subcategory: TensorFlow
Keyword: Hook

Symptom

Training fails with the error: AttributeError: 'xxxxHook' object has no attribute 'xxxx'.

Cause Analysis

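The migration tool adds NPUBroadcastHook to Estimator scripts by default. In the failing scenario, appending the NPU hook changes the index order of the hooks list: the script's own hook is no longer the last element, so code that reads hooks[-1] for further operations gets a different hook object and fails with the AttributeError above.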

    training_hooks.append(LogTrainRunHook(global_batch_size, hvd_rank, FLAGS.save_checkpoints_steps, num_steps_ignore_xla=25))
    ...
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps, hooks=npu_hooks_append(hooks_list=training_hooks))
    train_time_elapsed = time.time() - train_start_time
    train_time_wo_overhead = training_hooks[-1].total_time
    avg_sentences_per_second = num_train_steps * global_batch_size * 1.0 / train_time_elapsed
    ss_sentences_per_second = (training_hooks[-1].count - training_hooks[-1].skipped) * global_batch_size * 1.0 / train_time_wo_overhead
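The failure can be reproduced without TensorFlow or the NPU packages. The following is a minimal sketch, not the real implementation: both hook classes are hypothetical stand-ins, and the `npu_hooks_append` shown here is an assumed simplification that appends to the list it receives in place, which is the behavior the cause analysis above implies.

```python
# Stand-in classes for illustration only; they are NOT the real
# TensorFlow or NPU hook implementations.
class LogTrainRunHook:
    def __init__(self):
        # Hypothetical values standing in for real training statistics.
        self.total_time = 12.5
        self.count = 100
        self.skipped = 25


class NPUBroadcastHook:
    """Stand-in with no timing attributes, like the hook added by the tool."""


def npu_hooks_append(hooks_list):
    # Assumption: the helper appends the NPU hook to the caller's list
    # in place and returns the same list.
    hooks_list.append(NPUBroadcastHook())
    return hooks_list


training_hooks = [LogTrainRunHook()]
npu_hooks_append(hooks_list=training_hooks)

# After the append, the last element is the NPU hook, not LogTrainRunHook.
last_hook_name = type(training_hooks[-1]).__name__

try:
    training_hooks[-1].total_time
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(last_hook_name)
print(error_message)
```

Under these assumptions, the last element of the list is the stand-in NPU hook, and reading `total_time` from it raises the same kind of AttributeError described in the symptom.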

Solution

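Update the hook list index used in the training script. For the script above, changing training_hooks[-1] to training_hooks[-2] as follows lets training run successfully.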

    training_hooks.append(LogTrainRunHook(global_batch_size, hvd_rank, FLAGS.save_checkpoints_steps, num_steps_ignore_xla=25))
    ...
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps, hooks=npu_hooks_append(hooks_list=training_hooks))
    train_time_elapsed = time.time() - train_start_time
    train_time_wo_overhead = training_hooks[-2].total_time
    avg_sentences_per_second = num_train_steps * global_batch_size * 1.0 / train_time_elapsed
    ss_sentences_per_second = (training_hooks[-2].count - training_hooks[-2].skipped) * global_batch_size * 1.0 / train_time_wo_overhead
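A hard-coded index such as [-2] breaks again if another hook is appended later. One possible alternative, not part of the original fix, is to keep a named reference to the logging hook and read the statistics from that reference. The sketch below uses hypothetical stand-in classes and an assumed in-place `npu_hooks_append`; the numeric values are made up for illustration.

```python
# Stand-in classes for illustration only; they are NOT the real
# TensorFlow or NPU hook implementations.
class LogTrainRunHook:
    def __init__(self):
        # Hypothetical values standing in for real training statistics.
        self.total_time = 12.5
        self.count = 100
        self.skipped = 25


class NPUBroadcastHook:
    """Stand-in for the hook added by the migration tool."""


def npu_hooks_append(hooks_list):
    # Assumption: appends in place and returns the same list.
    hooks_list.append(NPUBroadcastHook())
    return hooks_list


global_batch_size = 8  # hypothetical value

# Keep a named reference instead of relying on the hook's list position.
log_hook = LogTrainRunHook()
training_hooks = [log_hook]

hooks_for_training = npu_hooks_append(hooks_list=training_hooks)
# estimator.train(..., hooks=hooks_for_training) would run here.

# Read the statistics from the reference; the list order no longer matters.
train_time_wo_overhead = log_hook.total_time
ss_sentences_per_second = (
    (log_hook.count - log_hook.skipped) * global_batch_size * 1.0
    / train_time_wo_overhead
)
print(ss_sentences_per_second)
```

With this pattern, the throughput calculation stays correct regardless of how many hooks are appended after the logging hook.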
