Double Buffering

The instruction queues on the AI Core include the Vector, Cube, and MTE instruction queues. Different instruction queues execute independently and in parallel, and this independence is the cornerstone of the double buffering optimization.

The CopyIn and CopyOut processes use the MTE instruction queues (MTE2 and MTE3), and the Compute process uses the Vector instruction queue (V), which means that the CopyIn and CopyOut processes and the Compute process can be performed in parallel.

Figure 1 shows a complete CopyIn, Compute, and CopyOut process. During the CopyIn phase, data is transferred from the Global Memory to the Local Memory. After the Vector Unit completes the computation, the result is transferred back to the Global Memory through the CopyOut phase.

Figure 1 Data transfer and vector compute

In this process, data transfer and vector compute are performed serially, so the Vector Unit sits idle during data transfer. For example, if each of the three phases (CopyIn, Compute, and CopyOut) takes time t, the total time is 3t while the Vector Unit works for only t, so the time utilization ratio of the Vector Unit is only 1/3.

To reduce the waiting time of the Vector Unit, the double buffering mechanism divides the data to be processed into two parts, for example, Tensor1 and Tensor2. As shown in Figure 2, while the Vector Unit computes on Tensor1, the CopyIn of Tensor2 can proceed; while the Vector Unit computes on Tensor2, the CopyOut of Tensor1 can proceed. In this way, data transfer and vector compute are performed in parallel, and the utilization ratio of the Vector Unit is greatly improved.
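The timing effect can be illustrated with a small scheduling model. This is only a sketch under simplifying assumptions (not the actual hardware scheduler): MTE2, Vector, and MTE3 are treated as three independent serial engines, and the buffer count is modeled as a reuse constraint on each tile's memory block.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy schedule model: MTE2 (CopyIn), Vector (Compute), and MTE3 (CopyOut)
// are three independent serial engines; a tile's buffer can be reused only
// after that tile's CopyOut finishes. Returns the makespan for n tiles.
double Makespan(int n, int buffers, double tIn, double tV, double tOut) {
    double mte2 = 0, vec = 0, mte3 = 0;          // time each engine becomes free
    std::vector<double> bufFree(buffers, 0.0);   // time each buffer becomes free
    double end = 0;
    for (int i = 0; i < n; ++i) {
        double &buf = bufFree[i % buffers];
        double ciEnd = std::max(mte2, buf) + tIn;    // CopyIn on MTE2
        double cEnd  = std::max(vec, ciEnd) + tV;    // Compute on Vector
        double coEnd = std::max(mte3, cEnd) + tOut;  // CopyOut on MTE3
        mte2 = ciEnd; vec = cEnd; mte3 = coEnd; buf = coEnd;
        end = coEnd;
    }
    return end;
}
```

In this toy model, processing 6 tiles with t = 1 per phase takes a makespan of 18 with one buffer (fully serial, Vector busy 6/18 = 1/3) but only 10 with two buffers (Vector busy 6/10), since the three engines overlap across tiles.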

In conclusion, based on the independence and parallelism of the MTE and Vector instruction queues, double buffering hides the data transfer time and reduces the waiting time between Vector instructions by performing data transfer and vector compute in parallel, thereby improving the utilization ratio of the Vector Unit. You enable this parallelism through the number of memory blocks specified when allocating memory for the queues. A simple code example is as follows.

pipe.InitBuffer(inQueueX, 2, 256); // allocate 2 memory blocks of 256 bytes each; a block count of 2 enables double buffering
Figure 2 Double buffering mechanism

Notes:

In most cases, the double buffering mechanism can effectively improve the utilization ratio of the Vector Unit and reduce the operator execution time. However, the double buffering mechanism does not always result in higher overall performance. See the following examples:

  • If the data transfer time accounts for only a small proportion of the total time, the double buffering mechanism brings only a small performance gain.
  • If the volume of data to be processed is small enough that the Vector Unit can compute all data at once, double buffering will bring no gain. Instead, it will reduce the utilization ratio of the Vector Unit.
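The first caveat can be made concrete with a rough steady-state estimate. This is a heuristic sketch derived from the two-buffer reuse constraint in an idealized pipeline, not a formula from the hardware documentation; tIn, tV, and tOut are assumed per-tile phase times.

```cpp
#include <algorithm>
#include <cassert>

// Rough steady-state model with two buffers: each buffer is occupied for
// tIn + tV + tOut per tile (so two buffers sustain at most one tile per
// half that interval), and the Vector Unit serializes all Compute phases.
// The per-tile interval is bounded below by both constraints.
double PipelinedPerTile(double tIn, double tV, double tOut) {
    return std::max({tIn, tV, tOut, (tIn + tV + tOut) / 2.0});
}

// Without double buffering the three phases simply run back to back.
double SerialPerTile(double tIn, double tV, double tOut) {
    return tIn + tV + tOut;
}
```

With equal phase times (1, 1, 1), the estimate gives 3 serial vs 1.5 pipelined per tile, a 2x gain. With compute-dominated times (0.5, 8, 0.5), it gives 9 vs 8, only about a 1.1x gain, illustrating why a small transfer proportion yields a small benefit.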

Therefore, before determining whether to use double buffering, consider the Vector Unit compute capability, the data volume, and the proportion of the data transfer time to the vector compute time.