Restrictions
iterations_per_loop specifies how many training iterations the device performs per sess.run call. The device runs that number of iterations and only then returns the result to the host. This reduces the number of host-device interactions and shortens the training time.
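As a rough illustration of the saving, the number of sess.run calls (host-device round trips) is the total number of training iterations divided by iterations_per_loop. The sketch below is plain Python arithmetic, not an NPU API; host_round_trips is a hypothetical helper name introduced only for this example.

```python
def host_round_trips(total_steps: int, iterations_per_loop: int = 1) -> int:
    """Return how many sess.run calls are needed to finish total_steps iterations.

    Hypothetical helper for illustration only; it is not part of any NPU library.
    """
    if iterations_per_loop < 1:
        raise ValueError("iterations_per_loop must be >= 1")
    if total_steps % iterations_per_loop != 0:
        # The total iteration count must be evenly divisible by iterations_per_loop.
        raise ValueError("total_steps must be divisible by iterations_per_loop")
    return total_steps // iterations_per_loop


print(host_round_trips(1000, 1))    # one call per iteration
print(host_round_trips(1000, 100))  # each call runs 100 iterations on the device
```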
iterations_per_loop defaults to 1. You can enable the iteration offload feature by setting this parameter to a value greater than 1. Note the following restrictions when using this feature:
- The training script must read data in TensorFlow's dataset mode and must not use a one-shot iterator for preprocessing initialization. Use an initializable iterator instead, for example tf.data.make_initializable_iterator(). Reading data in dataset mode is the prerequisite for both GetNext operator offload and training iteration offload. For details about how to use datasets, see the TensorFlow official website.
- Iteration offload takes effect only when GetNext operator offload is enabled, that is, when enable_data_pre_proc is set to True. Only then is the GetNext operator generated for execution on the device.
  - Example of enabling GetNext operator offload in sess.run:
    custom_op.parameter_map["enable_data_pre_proc"].b = True
  - Example of enabling GetNext operator offload in NPURunConfig:
    config = NPURunConfig(enable_data_pre_proc=True)
- The total number of training iterations must be evenly divisible by iterations_per_loop.
- When saving checkpoints in iteration offload mode, set save_checkpoints_steps to a positive integer multiple of iterations_per_loop so that checkpoints are saved exactly at the interval defined by save_checkpoints_steps. If iterations_per_loop is greater than 1, summaries and logs may not be recorded at the intervals defined by save_summary_steps and log_step_count_steps. To resolve this, see Log and Summary Operators.
- In mixed computing mode (with mix_compile_mode set to True), iteration offload must not be enabled. That is, iterations_per_loop must be set to 1.
- During network development, you are advised to set iterations_per_loop to 1 so that logs are printed every iteration. After the network runs correctly, you can set iterations_per_loop to a value greater than 1 to shorten the training time.
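The numeric and flag-based restrictions above can be checked before training starts. The following is a minimal sketch in plain Python, assuming the settings are collected in a hypothetical OffloadSettings container; neither OffloadSettings nor check_offload_settings belongs to any NPU library, and the dataset-mode requirement is not checked because it depends on the training script itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OffloadSettings:
    """Hypothetical container; field names mirror the parameters described above."""
    total_steps: int
    iterations_per_loop: int = 1
    enable_data_pre_proc: bool = False
    mix_compile_mode: bool = False
    save_checkpoints_steps: Optional[int] = None  # None: no step-based checkpoints


def check_offload_settings(s: OffloadSettings) -> List[str]:
    """Return a message for each violated restriction (empty list if none)."""
    if s.iterations_per_loop < 1:
        return ["iterations_per_loop must be >= 1"]
    if s.iterations_per_loop == 1:
        return []  # iteration offload is disabled, so no restriction applies

    n = s.iterations_per_loop
    problems = []
    if not s.enable_data_pre_proc:
        problems.append("enable_data_pre_proc must be True")
    if s.mix_compile_mode:
        problems.append("iterations_per_loop must be 1 in mixed computing mode")
    if s.total_steps % n != 0:
        problems.append("total_steps must be divisible by iterations_per_loop")
    ckpt = s.save_checkpoints_steps
    if ckpt is not None and (ckpt <= 0 or ckpt % n != 0):
        problems.append(
            "save_checkpoints_steps must be a positive multiple of iterations_per_loop"
        )
    return problems


ok = OffloadSettings(total_steps=1000, iterations_per_loop=10,
                     enable_data_pre_proc=True, save_checkpoints_steps=100)
bad = OffloadSettings(total_steps=1005, iterations_per_loop=10,
                      mix_compile_mode=True, save_checkpoints_steps=25)

print(check_offload_settings(ok))
for message in check_offload_settings(bad):
    print(message)
```

Returning a list of messages rather than raising on the first violation lets a script report every misconfigured parameter in one pass.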