Double Buffering

The instruction queues executed on the AI Core include the vector pipe (V), the matrix pipe (M), and the data movement pipes (MTE2 and MTE3). Different instruction queues can execute independently and in parallel, which is the cornerstone of the double buffering optimization mechanism.

Figure 1 shows a complete data movement and computation process: MTE2 moves data from the Global Memory to the Unified Buffer, the Vector Unit reads the data, performs the computation, and writes the result back to the Unified Buffer, and MTE3 moves the result back to the Global Memory.

Figure 1 Data movement in the Unified Buffer and compute process in Vector Unit

In this process, data movement and vector computation are performed serially, so the Vector Unit is idle during data movement. For example, if each of the three phases (MTE2, Vector, and MTE3) takes the same time t, the time utilization ratio of the Vector Unit is only 1/3.
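The serial-case arithmetic above can be written out as a tiny helper. This is only an illustrative sketch; the function name and the assumption that phase durations are known constants are ours, not part of the TIK API.

```python
# Serial schedule: each iteration runs MTE2 -> Vector -> MTE3 back to back,
# so the Vector Unit is busy only during its own phase.
def serial_vector_utilization(t_mte2, t_vector, t_mte3):
    total = t_mte2 + t_vector + t_mte3
    return t_vector / total

# With all three phases taking the same time t, the Vector Unit
# is busy for only one third of each iteration.
print(serial_vector_utilization(1.0, 1.0, 1.0))  # 1/3 ≈ 0.33
```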

To reduce the wait time between Vector instructions, the double buffering mechanism divides the Unified Buffer into two parts: UB_A and UB_B. As shown in Figure 2, while the Vector Unit reads and computes data in UB_A, MTE2 moves data into UB_B. While the Vector Unit reads and computes UB_B, MTE3 moves the result in UB_A out, and MTE2 moves the next batch of data into UB_A. In this way, data movement and vector computation proceed in parallel, and the utilization ratio of the Vector Unit is greatly improved.
Figure 2 Double buffering mechanism

In conclusion, based on the independence and parallelism of the MTE and Vector pipes, double buffering hides the data movement time and reduces the wait time between Vector instructions by performing data movement and vector computation in parallel, thereby improving the utilization ratio of the Vector Unit. You can enable this parallelism by setting the thread_num parameter of for_range. A simple code example is as follows.

with tik_instance.for_range(0, 10, thread_num=2) as i:
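The effect of the ping-pong schedule can be modeled with a back-of-the-envelope slot count. The sketch below is ours, not TIK code: it assumes all three phases take one equal time slot, that the input and output regions of each buffer are distinct (so MTE2 and MTE3 can touch the same buffer in the same slot), and it ignores synchronization overhead.

```python
# Toy timing model: compare a serial schedule against the
# double-buffered (ping-pong) schedule for n loop iterations.
def serial_slots(n):
    # MTE2, Vector, and MTE3 run back to back in every iteration.
    return 3 * n

def double_buffered_slots(n):
    # Pipeline fill takes 2 extra slots; afterwards the three pipes
    # overlap on UB_A and UB_B, so one iteration retires per slot.
    return n + 2 if n > 0 else 0

n = 10
print(serial_slots(n), double_buffered_slots(n))  # 30 12
```

Note that for n = 1 both schedules take 3 slots, which matches the observation below that double buffering brings no gain when all data fits in a single batch.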

Notes:

In most cases, the double buffering mechanism effectively improves the utilization ratio of vector compute time and reduces the operator execution time. However, it does not always result in higher overall performance. Consider the following cases:

  • If the data movement time accounts for only a small proportion of the total time, double buffering brings a relatively small performance gain.
  • If the volume of data to be processed is small enough that the Vector Unit can compute all of it at once, double buffering brings no gain; instead, it reduces the utilization of the Vector Unit.
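Both cases follow from a simple steady-state estimate: with double buffering, throughput is bounded by the slowest pipe, so the achievable speedup is roughly the serial time divided by the longest single phase. The helper below is an illustrative assumption of ours (it ignores pipeline fill/drain and synchronization cost), not a TIK facility.

```python
# Rough double-buffering speedup estimate: in steady state the loop is
# bound by the slowest of the three pipes (MTE2, Vector, MTE3).
def estimated_speedup(t_move_in, t_compute, t_move_out):
    serial = t_move_in + t_compute + t_move_out
    pipelined = max(t_move_in, t_compute, t_move_out)
    return serial / pipelined

# Balanced phases: movement fully hidden, large gain.
print(estimated_speedup(1.0, 1.0, 1.0))  # 3.0
# Compute-dominated loop: movement is already a small share, modest gain.
print(estimated_speedup(0.1, 1.0, 0.1))  # ≈ 1.2
```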

Therefore, before deciding whether to use double buffering, consider the compute power of the Vector Unit, the data volume, and the ratio of data movement time to compute time.

Within a single for loop, AI Core (multi-core) parallelism and double buffering are mutually exclusive. To enable both, split the work into multiple loops.