Setting the Number of Iterations Offloaded to NPU
At the porting point, you only need to set the NPU loop size, that is, the number of iterations offloaded to the NPU in each training loop. There are two methods:
- Use the environment variable NPU_LOOP_SIZE to set this parameter:
export NPU_LOOP_SIZE=32
This variable must be set before npu_device is imported.
- Call npu.set_npu_loop_size in your training script. This method is simple, but you need to understand what the NPU loop size means. (A combined sketch of both methods follows this list.)
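Both methods configure the same setting. The sketch below combines them for illustration; it assumes that npu in this section refers to the npu_device module imported under that alias, and that assigning the variable via os.environ before the import has the same effect as exporting it in the shell:

import os

# Method 1: set the environment variable before npu_device is imported
# (equivalently, run `export NPU_LOOP_SIZE=32` in the shell before launching the script).
os.environ["NPU_LOOP_SIZE"] = "32"

import npu_device as npu  # assumed alias for the npu.* calls used in this section

# Method 2: set (or later change) the loop size by calling the API directly.
npu.set_npu_loop_size(32)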
Let's first take an overview of the performance defects in the native TF2 workflow. In typical GPU training, the script initiates a training job of, for example, 10 epochs. In each epoch, a training step is performed on the GPU. After the training step is complete, control returns to the Python side, which checks whether 10 epochs have been reached and, if not, starts the next training epoch, until all 10 epochs are complete. (A minimal sketch of this Python-driven loop follows the list below.)

As shown in the sequence diagram, both the CPU and GPU work intermittently, which brings the following performance defects:
- The Python interpreter adds extra overhead with unpredictable time consumption. The gaps between successive training steps leave the device idle and become performance black holes.
- The preprocessing pipeline can be accelerated with the TF2 dataset prefetching function. However, the time spent on host-to-device (H2D) data transfer and CPU scheduling in every training epoch is unavoidable.
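For reference, a minimal sketch of the Python-driven loop described above is shown below; the dataset and train_step are placeholders rather than part of any particular framework API:

import tensorflow as tf

# Placeholder input pipeline and training step; a real script would build a model,
# an optimizer, and a loss instead of just printing.
dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([320, 8])).batch(32)

def train_step(batch):
    tf.print("training on a batch of shape", tf.shape(batch))

# Python drives every step: after each train_step the interpreter regains control,
# checks the loop condition, and only then launches the next step on the device.
for epoch in range(10):
    for batch in dataset:
        train_step(batch)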
In TF2, to avoid the extra overhead of the Python interpreter, you are advised to implement the training loop with the While operator (this practice is not exclusive to the NPU). In this case, the While operator, instead of the Python interpreter, determines whether the specified number of steps has been reached. Organize your training script as follows.
@tf.function
def loop_train(iterator, steps):
    for i in tf.range(steps):
        train_step(next(iterator))
After the TF2 code is compiled, the training steps are nested in the While operator. The following figure shows the new execution time sequence.

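If you want to confirm that the Python loop was lowered into a While operator rather than unrolled, you can inspect the traced graph. The following self-contained sketch does this; train_step and the dataset are placeholders, and in current TF2 versions the loop typically appears as a StatelessWhile op:

import tensorflow as tf

def train_step(batch):
    tf.print("step on a batch of shape", tf.shape(batch))

@tf.function
def loop_train(iterator, steps):
    for i in tf.range(steps):
        train_step(next(iterator))

dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([1000, 8])).batch(32).repeat()
iterator = iter(dataset)

# Trace the function once and list the loop ops in the resulting graph.
concrete = loop_train.get_concrete_function(iterator, tf.constant(100))
print([op.type for op in concrete.graph.get_operations() if "While" in op.type])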
With the iteration offload policy, the time previously consumed by the Python interpreter is shifted to the TF CPU side, which is shorter and more predictable. However, this mode also introduces two extra overheads:
- Time spent in H2D data transfer in preprocessing
- Time spent by the While operator in determining whether the specified number of steps has been reached
To achieve better performance, the NPU employs the following two techniques to eliminate these two overheads:
- An asynchronous preprocessing H2D thread decouples the transfer of preprocessing output from NPU training, hiding the H2D transfer time within the NPU training time.
- The user specifies the number of offloaded iterations, which avoids the While operator's loop-condition overhead and also determines the number of asynchronous H2D data transfers.
Asynchronous data transfer means that the TF Adapter's preprocessing thread proactively sends training data to the NPU. The execution time sequence without iteration offload is as follows.
In this case, the time consumed by preprocessing H2D data transfer and CPU scheduling can be reduced to some extent, because data transfer is already in progress when a training step is delivered.

The NPU execution time sequence with iteration offload is as follows.

In this mode:
- After the script delivers a training job of 10 epochs to the NPU, there are no further interactions between the Python interpreter and the NPU until the training job is complete.
- Fluctuation in preprocessing time can be absorbed because preprocessing runs ahead of the NPU computation of the previous training step, increasing the tolerance to preprocessing performance jitter.
To minimize the training computation time and maximize the performance benefit of NPU iteration offload and asynchronous preprocessing data transfer, ensure that your training job meets the following requirement: because the preprocessing thread runs asynchronously with NPU computation, iteration offload needs a mechanism that notifies the NPU of the number of currently offloaded iterations. The simplest approach is to set the NPU loop size.
See the following example.
@tf.function
def loop_train(iterator, steps):
    for i in tf.range(steps):
        train_step(next(iterator))
If you wish to train 100 steps each time loop_train is called, you can set the NPU loop size in either of the following ways:
- Set the NPU_LOOP_SIZE environment variable before starting training.
export NPU_LOOP_SIZE=100
- Insert a call to npu.set_npu_loop_size (passing 100 as the loop size) before the loop_train call in your Python script.
npu.set_npu_loop_size(100)
loop_train(train_iter, tf.constant(100))
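In the snippet above, train_iter is assumed to be an iterator over the preprocessed training dataset, for example one created as follows (the input pipeline itself is a placeholder):

import tensorflow as tf

# Hypothetical input pipeline; a real script would decode, augment, and batch real data.
train_dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([3200, 8])).batch(32).repeat()
train_iter = iter(train_dataset)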
You might need to change the NPU loop size within a training epoch. For example, in a 100-step training job where 30 steps are offloaded to the NPU in every loop, the final steps (91 to 100) amount to fewer steps than the configured NPU loop size. In this case, call npu.set_npu_loop_size to adjust the NPU loop size once the first 90 steps are complete.
remaining_steps = 100  # Number of remaining steps
base_loop_size = 30    # Benchmark NPU loop size
npu.set_npu_loop_size(base_loop_size)
while remaining_steps >= base_loop_size:
    # Offload with the benchmark loop size until fewer steps than one full loop remain.
    loop_train(train_iterator, tf.constant(base_loop_size))
    remaining_steps -= base_loop_size
if remaining_steps > 0:
    # Process the remaining steps with a smaller NPU loop size.
    npu.set_npu_loop_size(remaining_steps)
    loop_train(train_iterator, tf.constant(remaining_steps))
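If this pattern repeats across epochs, it can be wrapped in a small helper. The following is a sketch under the same assumptions as the snippet above (loop_train, train_iterator, and npu.set_npu_loop_size as used in this section); the run_steps name is only an illustration:

def run_steps(train_iterator, total_steps, base_loop_size):
    # Offload total_steps to the NPU in chunks of base_loop_size, shrinking the
    # NPU loop size once for the final partial chunk if one remains.
    remaining = total_steps
    npu.set_npu_loop_size(base_loop_size)
    while remaining >= base_loop_size:
        loop_train(train_iterator, tf.constant(base_loop_size))
        remaining -= base_loop_size
    if remaining > 0:
        npu.set_npu_loop_size(remaining)
        loop_train(train_iterator, tf.constant(remaining))

# Example: a 100-step epoch offloaded in loops of 30 steps (30 + 30 + 30 + 10).
run_steps(train_iterator, total_steps=100, base_loop_size=30)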