Enabling DoubleBuffer

[Priority] Medium

[Description] The AI Core executes several types of instruction queues: the vector instruction queue (V), the cube instruction queue (M), the scalar instruction queue (S), and the transfer instruction queues (MTE1/MTE2/MTE3). These queues execute independently and in parallel, which is the basis of DoubleBuffer optimization.

Take pure vector compute as an example. The CopyIn and CopyOut processes before and after vector compute use the MTE2 and MTE3 transfer instruction queues, respectively, and the Compute process uses the vector instruction queue. The CopyIn/CopyOut processes and the Compute process can therefore be executed concurrently. Figure 1 shows a complete CopyIn, Compute, and CopyOut process. During the CopyIn process, data is transferred from the Global Memory to the local memory. After the Vector Unit completes the compute, the result is transferred back to the Global Memory through the CopyOut process.

Figure 1 Data movement and vector compute

Figure 2 Pipeline when DoubleBuffer is disabled

In this process, CopyIn, vector compute, and CopyOut run serially, so the Vector Unit sits idle while data is being moved. If the CopyIn, Compute, and CopyOut phases each take the same time t, the Vector Unit is busy for only 1/3 of the total time and spends the rest waiting.
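
The 1/3 figure follows directly from the serial timing. The following is a minimal C++ sketch of that arithmetic, assuming an idealized model in which the three phases never overlap. PhaseTimes, SerialTotalTime, and SerialVectorUtilization are illustrative names and not part of the Ascend C API.

```cpp
#include <cassert>

// Per-tile phase durations in any consistent time unit (illustrative type).
struct PhaseTimes {
    double copyIn;
    double compute;
    double copyOut;
};

// Total run time when every tile runs CopyIn, Compute, and CopyOut back to
// back and nothing overlaps.
double SerialTotalTime(int tileCount, const PhaseTimes& t) {
    assert(tileCount > 0);
    return tileCount * (t.copyIn + t.compute + t.copyOut);
}

// Fraction of the total run time during which the Vector Unit is busy.
double SerialVectorUtilization(int tileCount, const PhaseTimes& t) {
    return tileCount * t.compute / SerialTotalTime(tileCount, t);
}
```

With equal phase times, SerialVectorUtilization evaluates to 1/3 regardless of the tile count. If data movement is slower than compute, the ratio drops further.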

To reduce the waiting time of the Vector Unit, enable the DoubleBuffer mechanism, which divides the data to be processed into two parts, for example, Tensor1 and Tensor2. As shown in Figure 3, when the Vector Unit computes the data in Tensor1, the Tensor2 data flow can execute the CopyIn process. When the Vector Unit computes Tensor2, the Tensor1 data flow can execute the CopyOut process. In this way, CopyIn/CopyOut and vector compute are executed in parallel, and vector utilization is improved.

To sum up, DoubleBuffer overlaps data transfer with vector compute, which reduces the wait time of vector instructions and improves the utilization of the Vector Unit. To enable DoubleBuffer, set the number of buffers to 2 when allocating memory to a queue. The following simple example allocates two 256-byte buffers to inQueueX:

pipe.InitBuffer(inQueueX, 2, 256);

Figure 3 DoubleBuffer mechanism

Figure 4 Pipeline when DoubleBuffer is enabled
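
The effect of the second buffer on the timeline can be sketched with a small host-side simulation. This is an idealized model under stated assumptions, not the real scheduler: the three phases run on three independent engines, and each buffer slot is held by one tile for its whole CopyIn, Compute, and CopyOut flow, as in the Tensor1/Tensor2 description above. Real queues may overlap more than this model allows. SimulateMakespan and VectorUtilization are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Per-tile phase durations in any consistent time unit (illustrative type).
struct PhaseTimes {
    double copyIn;
    double compute;
    double copyOut;
};

// Returns the time at which the last tile finishes CopyOut.
double SimulateMakespan(int tileCount, int bufferNum, const PhaseTimes& t) {
    assert(tileCount > 0 && bufferNum > 0);
    std::vector<double> copyOutEnd(tileCount, 0.0);
    double mteInFree = 0.0;   // when the CopyIn engine becomes idle
    double vectorFree = 0.0;  // when the Vector Unit becomes idle
    double mteOutFree = 0.0;  // when the CopyOut engine becomes idle
    for (int i = 0; i < tileCount; ++i) {
        // A slot is reusable once the tile that held it has been copied out.
        double slotFree = (i >= bufferNum) ? copyOutEnd[i - bufferNum] : 0.0;
        double inEnd = std::max(mteInFree, slotFree) + t.copyIn;
        double computeEnd = std::max(inEnd, vectorFree) + t.compute;
        double outEnd = std::max(computeEnd, mteOutFree) + t.copyOut;
        mteInFree = inEnd;
        vectorFree = computeEnd;
        mteOutFree = outEnd;
        copyOutEnd[i] = outEnd;
    }
    return copyOutEnd[tileCount - 1];
}

// Vector Unit busy time divided by the total run time.
double VectorUtilization(int tileCount, int bufferNum, const PhaseTimes& t) {
    return tileCount * t.compute / SimulateMakespan(tileCount, bufferNum, t);
}
```

For four tiles with equal phase times of 1, this model gives a total time of 12 with bufferNum = 1 and 7 with bufferNum = 2, so vector utilization rises from 1/3 to 4/7.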

Notes:

In most cases, the DoubleBuffer mechanism can effectively improve the utilization ratio of the Vector Unit and reduce the operator execution time. However, it does not always result in higher overall performance. For example:

If the data movement time accounts for only a small proportion of the total time, the DoubleBuffer mechanism brings only a small performance gain.
If the original data is small enough that the Vector Unit can compute all of it at once, DoubleBuffer brings no gain and instead reduces vector utilization.

Therefore, before using DoubleBuffer, consider factors such as vector compute power, data size, and ratio of data movement time to compute time.
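
One way to weigh these factors is to bound the possible gain before changing the kernel. The sketch below is an estimate under assumptions, not a measurement: the per-tile phase times are hypothetical inputs (for example, taken from profiling), and MaxOverlapSpeedup is an illustrative name. It relies only on the fact that each engine still has to process every tile, so the run can never be shorter than the total work of the busiest engine.

```cpp
#include <algorithm>
#include <cassert>

// Per-tile phase durations in any consistent time unit (illustrative type).
struct PhaseTimes {
    double copyIn;
    double compute;
    double copyOut;
};

// Upper bound on the speedup that overlapping the phases can deliver,
// relative to running CopyIn, Compute, and CopyOut serially.
double MaxOverlapSpeedup(int tileCount, const PhaseTimes& t) {
    assert(tileCount > 0);
    if (tileCount < 2) {
        return 1.0;  // a single tile leaves nothing to overlap with
    }
    double serial = t.copyIn + t.compute + t.copyOut;
    double busiest = std::max({t.copyIn, t.compute, t.copyOut});
    assert(busiest > 0.0);
    return serial / busiest;
}
```

With equal phase times the bound is 3. With CopyIn and CopyOut each at 5% of the compute time it is about 1.1, which matches the first case above. With a single tile it is 1, which matches the second.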

[Negative Example]

__aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
{
    src0Global.SetGlobalBuffer((__gm__ half*)src0Gm);
    src1Global.SetGlobalBuffer((__gm__ half*)src1Gm);
    dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
    // A buffer count of 1 disables DoubleBuffer; inQueueSrc0 then occupies 1 * sizeSrc0 * sizeof(half) bytes.
    // After the three InitBuffer calls, the total space is 1 * (sizeSrc0 * sizeof(half) + sizeSrc1 * sizeof(half) + sizeDst0 * sizeof(half)).
    pipe.InitBuffer(inQueueSrc0, 1, sizeSrc0 * sizeof(half));
    pipe.InitBuffer(inQueueSrc1, 1, sizeSrc1 * sizeof(half));
    pipe.InitBuffer(outQueueDst, 1, sizeDst0 * sizeof(half));
}
__aicore__ inline void Process()
{
    // Without DoubleBuffer, round * 2 loop iterations are needed to process all the data.
    for (uint32_t index = 0; index < round * 2; ++index) {
        CopyIn(index);
        Compute();
        CopyOut(index);
    }
}

[Positive Example]

__aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
{
    src0Global.SetGlobalBuffer((__gm__ half*)src0Gm);
    src1Global.SetGlobalBuffer((__gm__ half*)src1Gm);
    dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
    // A buffer count of 2 enables DoubleBuffer; inQueueSrc0 then occupies 2 * sizeSrc0 * sizeof(half) bytes.
    // After the three InitBuffer calls, the total space is 2 * (sizeSrc0 * sizeof(half) + sizeSrc1 * sizeof(half) + sizeDst0 * sizeof(half)).
    pipe.InitBuffer(inQueueSrc0, 2, sizeSrc0 * sizeof(half));
    pipe.InitBuffer(inQueueSrc1, 2, sizeSrc1 * sizeof(half));
    pipe.InitBuffer(outQueueDst, 2, sizeDst0 * sizeof(half));
}
__aicore__ inline void Process()
{
    // Prerequisite for enabling DoubleBuffer: the loop must run at least 2 iterations (round >= 2).
    for (uint32_t index = 0; index < round; ++index) {
        CopyIn(index);
        Compute();
        CopyOut(index);
    }
}
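
The comments in both examples show that enabling DoubleBuffer doubles the local memory reserved for the queues. A minimal host-side sketch of that bookkeeping follows. QueueBytes, TotalQueueBytes, and FitsInLocalMemory are hypothetical helpers, the local memory capacity is a caller-supplied assumption because it differs between chips, and any alignment padding the runtime may add is ignored.

```cpp
#include <cstddef>

// Bytes reserved by one InitBuffer(queue, bufferNum, bytesPerBuffer) call,
// per the comments in the examples: bufferNum * bytesPerBuffer.
constexpr std::size_t QueueBytes(std::size_t bufferNum, std::size_t bytesPerBuffer) {
    return bufferNum * bytesPerBuffer;
}

// Total for the two input queues and the output queue of the examples.
// The size arguments are element counts; elemBytes is 2 for half.
constexpr std::size_t TotalQueueBytes(std::size_t bufferNum, std::size_t sizeSrc0,
                                      std::size_t sizeSrc1, std::size_t sizeDst0,
                                      std::size_t elemBytes) {
    return QueueBytes(bufferNum, sizeSrc0 * elemBytes) +
           QueueBytes(bufferNum, sizeSrc1 * elemBytes) +
           QueueBytes(bufferNum, sizeDst0 * elemBytes);
}

// True if the queues fit into localMemBytes of local memory.
constexpr bool FitsInLocalMemory(std::size_t totalBytes, std::size_t localMemBytes) {
    return totalBytes <= localMemBytes;
}

// Three queues of 128 half elements each: 768 bytes with one buffer per
// queue, 1536 bytes with two.
static_assert(TotalQueueBytes(1, 128, 128, 128, 2) == 768, "single buffer");
static_assert(TotalQueueBytes(2, 128, 128, 128, 2) == 1536, "double buffer");
```

If the doubled total does not fit, the per-buffer size has to shrink, which is one reason the data size appears among the factors to consider.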
