How Do I Fix Training Errors Caused by Hook Index Changes?
Symptom
The following error is reported during training: AttributeError: 'xxxxHook' object has no attribute 'xxxx'.

Possible Cause
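The porting tool adds NPUBroadcastHook to an Estimator-based script. The newly added NPU hook becomes the last element of the hooks list and shifts the negative indexes of the existing hooks, so an expression such as training_hooks[-1] no longer refers to the original hook (LogTrainRunHook in the following script) and the attribute access fails: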
    training_hooks.append(LogTrainRunHook(global_batch_size, hvd_rank, FLAGS.save_checkpoints_steps, num_steps_ignore_xla=25))
    ...
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps, hooks=npu_hooks_append(hooks_list=training_hooks))
    train_time_elapsed = time.time() - train_start_time
    train_time_wo_overhead = training_hooks[-1].total_time
    avg_sentences_per_second = num_train_steps * global_batch_size * 1.0 / train_time_elapsed
    ss_sentences_per_second = (training_hooks[-1].count - training_hooks[-1].skipped) * global_batch_size * 1.0 / train_time_wo_overhead
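The index shift can be reproduced without TensorFlow. The following is a minimal sketch that uses stand-in classes, not the real LogTrainRunHook or NPUBroadcastHook, and a stand-in npu_hooks_append that is assumed to append the NPU hook to the list it receives. The real helper may differ in detail.

```python
class LogTrainRunHook:
    """Stand-in for the script's logging hook (not the real class)."""

    def __init__(self):
        self.total_time = 0.0
        self.count = 0
        self.skipped = 0


class NPUBroadcastHook:
    """Stand-in for the hook added by the porting tool; it has no timing attributes."""


def npu_hooks_append(hooks_list):
    """Assumed behavior: append the NPU hook to the caller's list and return that list."""
    hooks_list.append(NPUBroadcastHook())
    return hooks_list


training_hooks = [LogTrainRunHook()]
npu_hooks_append(hooks_list=training_hooks)

# The last element is now the NPU hook, not the logging hook.
last_hook_name = type(training_hooks[-1]).__name__

try:
    training_hooks[-1].total_time
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(last_hook_name)
print(error_message)
```

In this sketch the logging hook is still in the list, one position earlier, which is why shifting the index restores access to its attributes.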
Solution
Correct the hook index in the training script. In this example, change training_hooks[-1] to training_hooks[-2] so that it refers to LogTrainRunHook again:
    training_hooks.append(LogTrainRunHook(global_batch_size, hvd_rank, FLAGS.save_checkpoints_steps, num_steps_ignore_xla=25))
    ...
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps, hooks=npu_hooks_append(hooks_list=training_hooks))
    train_time_elapsed = time.time() - train_start_time
    train_time_wo_overhead = training_hooks[-2].total_time
    avg_sentences_per_second = num_train_steps * global_batch_size * 1.0 / train_time_elapsed
    ss_sentences_per_second = (training_hooks[-2].count - training_hooks[-2].skipped) * global_batch_size * 1.0 / train_time_wo_overhead
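A hard-coded index can break again if another hook is appended later. One possible alternative, sketched here with stand-in classes instead of the real TensorFlow and NPU classes, is to keep a reference to the logging hook or to look it up by type. The helper find_hook is hypothetical and is not part of any library.

```python
class LogTrainRunHook:
    """Stand-in for the script's logging hook (not the real class)."""

    def __init__(self, total_time=0.0):
        self.total_time = total_time


class NPUBroadcastHook:
    """Stand-in for the hook added by the porting tool."""


def find_hook(hooks, hook_type):
    """Hypothetical helper: return the first hook of the given type."""
    for hook in hooks:
        if isinstance(hook, hook_type):
            return hook
    raise LookupError("no hook of type " + hook_type.__name__)


# Option 1: keep a reference when the hook is created.
log_hook = LogTrainRunHook(total_time=12.5)
training_hooks = [log_hook]

# Simulates the porting tool's helper appending its hook (assumed behavior).
training_hooks.append(NPUBroadcastHook())

# Option 2: look the hook up by type, independent of its position.
found = find_hook(training_hooks, LogTrainRunHook)
train_time_wo_overhead = found.total_time
```

Either option reads total_time, count, and skipped from the intended hook regardless of how many hooks are appended afterward.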