Enabling Double Buffering
[Priority] Medium
[Description] The instruction queues executed on the AI Core include the following types: the vector instruction queue (V), the cube instruction queue (M), the scalar instruction queue (S), and the data transfer instruction queues (MTE1/MTE2/MTE3). Different instruction queues execute independently and in parallel, which is the cornerstone of double buffering optimization.
The CopyIn process before vector computation uses the MTE2 queue, the CopyOut process after it uses the MTE3 queue, and the Compute process uses the vector instruction queue (V), so the CopyIn/CopyOut processes and the Compute process can be executed concurrently. Figure 1 shows a complete CopyIn, Compute, and CopyOut pass. During the CopyIn process, data is transferred from the Global Memory to the local memory. After the Vector unit completes the computation, the result is transferred back to the Global Memory through the CopyOut process.
In this flow, CopyIn/CopyOut and Vector Compute are executed serially, leaving the Vector unit idle. If the CopyIn, Compute, and CopyOut phases each take the same time t, the Vector utilization is only 1/3: the unit spends two thirds of every pass waiting.
To reduce the Vector waiting time, enable double buffering, which splits the data to be processed into two parts, for example, Tensor1 and Tensor2. As shown in Figure 3, while the Vector unit computes on Tensor1, Tensor2 can execute its CopyIn process; while the Vector unit computes on Tensor2, Tensor1 can execute its CopyOut process. In this way, CopyIn/CopyOut and Vector Compute are executed in parallel, and the Vector utilization is improved.
To sum up, double buffering overlaps data transfer with Vector computation, reducing the wait time of Vector instructions and improving the utilization of the Vector unit. When allocating memory to a queue, set the number of buffers to 2 to enable double buffering. The following is a simple code example:
pipe.InitBuffer(inQueueX, 2, 256);
Notes:
In most cases, double buffering can effectively improve the Vector utilization and reduce the operator execution time. However, it does not always result in higher overall performance. For example:
- If the data transfer time is short compared with the total execution time, the performance gain from double buffering is small.
- If the data is small enough that the Vector unit can compute all of it at once, double buffering brings no gain; splitting the data in two only reduces the amount of work per Compute and thus the Vector utilization.
Therefore, before using double buffering, consider factors such as Vector compute power, data size, and ratio of data transfer time to compute time.
[Negative Example]
__aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
{
    src0Global.SetGlobalBuffer((__gm__ half*)src0Gm);
    src1Global.SetGlobalBuffer((__gm__ half*)src1Gm);
    dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
    // With double buffering disabled (buffer count 1), inQueueSrc0 occupies 1 * sizeSrc0 * sizeof(half) of physical space.
    // After the three InitBuffer calls, the total space is 1 * (sizeSrc0 * sizeof(half) + sizeSrc1 * sizeof(half) + sizeDst0 * sizeof(half)).
    pipe.InitBuffer(inQueueSrc0, 1, sizeSrc0 * sizeof(half));
    pipe.InitBuffer(inQueueSrc1, 1, sizeSrc1 * sizeof(half));
    pipe.InitBuffer(outQueueDst, 1, sizeDst0 * sizeof(half));
}
__aicore__ inline void Process()
{
    // All data is processed only after round * 2 iterations.
    for (uint32_t index = 0; index < round * 2; ++index) {
        CopyIn(index);
        Compute();
        CopyOut(index);
    }
}
[Positive Example]
__aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
{
    src0Global.SetGlobalBuffer((__gm__ half*)src0Gm);
    src1Global.SetGlobalBuffer((__gm__ half*)src1Gm);
    dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
    // The value 2 in InitBuffer enables double buffering: inQueueSrc0 occupies 2 * sizeSrc0 * sizeof(half) of physical space.
    // After the three InitBuffer calls, the total space is 2 * (sizeSrc0 * sizeof(half) + sizeSrc1 * sizeof(half) + sizeDst0 * sizeof(half)).
    pipe.InitBuffer(inQueueSrc0, 2, sizeSrc0 * sizeof(half));
    pipe.InitBuffer(inQueueSrc1, 2, sizeSrc1 * sizeof(half));
    pipe.InitBuffer(outQueueDst, 2, sizeDst0 * sizeof(half));
}
__aicore__ inline void Process()
{
    // Prerequisite for double buffering to take effect: the number of loop iterations (round) must be at least 2.
    for (uint32_t index = 0; index < round; ++index) {
        CopyIn(index);
        Compute();
        CopyOut(index);
    }
}
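For completeness, the CopyIn/Compute/CopyOut stages referenced by Process() could look roughly like the following sketch. This is an assumption, not part of the original sample: it presumes an element-wise Add, a per-iteration tile of tileLength elements (hypothetical name), and the standard Ascend C queue APIs (AllocTensor/EnQue/DeQue/FreeTensor). The queue hand-offs between stages are what allow consecutive iterations to overlap once two buffers are available:

```cpp
// Sketch only: assumes the member queues and global tensors from Init() above.
__aicore__ inline void CopyIn(uint32_t index)
{
    AscendC::LocalTensor<half> src0 = inQueueSrc0.AllocTensor<half>();
    AscendC::LocalTensor<half> src1 = inQueueSrc1.AllocTensor<half>();
    AscendC::DataCopy(src0, src0Global[index * tileLength], tileLength);  // MTE2
    AscendC::DataCopy(src1, src1Global[index * tileLength], tileLength);
    inQueueSrc0.EnQue(src0);  // hand off to the Compute stage
    inQueueSrc1.EnQue(src1);
}
__aicore__ inline void Compute()
{
    AscendC::LocalTensor<half> src0 = inQueueSrc0.DeQue<half>();
    AscendC::LocalTensor<half> src1 = inQueueSrc1.DeQue<half>();
    AscendC::LocalTensor<half> dst = outQueueDst.AllocTensor<half>();
    AscendC::Add(dst, src0, src1, tileLength);  // example element-wise op (V)
    outQueueDst.EnQue(dst);
    inQueueSrc0.FreeTensor(src0);  // release the buffer for the next CopyIn
    inQueueSrc1.FreeTensor(src1);
}
__aicore__ inline void CopyOut(uint32_t index)
{
    AscendC::LocalTensor<half> dst = outQueueDst.DeQue<half>();
    AscendC::DataCopy(dstGlobal[index * tileLength], dst, tileLength);  // MTE3
    outQueueDst.FreeTensor(dst);
}
```

With two buffers per queue, AllocTensor in iteration index+1 can succeed while iteration index is still in flight, which is exactly the overlap shown in Figure 3.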

