L2 Cache Tiling

[Priority] High

[Description] Assume that the L2 cache of the AI Processor is 192 MB, that its mixed read/write bandwidth is about 7 TB/s, and that the bandwidth of global memory (GM) is about 1.6 TB/s. Because of this large gap, transferring the same amount of data through the L2 cache is much faster than through the GM. If an access misses the L2 cache, that is, the requested data is not resident in the cache, the data must be read from or written to the GM. Bandwidth utilization then drops, and data transfer becomes the performance bottleneck of the whole operator. It is therefore recommended to enable the L2 cache tiling strategy whenever the combined size of the input and output data exceeds the L2 cache size.
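
To make the recommendation concrete, the sketch below shows one way the tile count could be derived from the data size. The names (L2_CACHE_BYTES, CeilDiv, ChooseTileNum) are hypothetical helpers used only for illustration, assuming the 192 MB cache size given above; they are not part of the Ascend C API.

// Hypothetical sizing helpers; the names and the 192 MB constant are assumptions.
constexpr int64_t L2_CACHE_BYTES = 192LL * 1024 * 1024;

constexpr int64_t CeilDiv(int64_t a, int64_t b)
{
    return (a + b - 1) / b;
}

// Use one tile if everything fits in the cache; otherwise split the data into
// enough tiles that a single tile fits in the L2 cache.
constexpr int32_t ChooseTileNum(int64_t totalBytes)
{
    return totalBytes <= L2_CACHE_BYTES ? 1
                                        : static_cast<int32_t>(CeilDiv(totalBytes, L2_CACHE_BYTES));
}

// Example: 384 MB of input yields ChooseTileNum(384LL * 1024 * 1024) == 2,
// matching TILE_NUM in the positive example below.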

[Negative Example]

Assume that the input data size is InputTotalSize = 384 MB, the L2 cache size is 192 MB, and 20 cores are used; the computation is performed in a single pass without data tiling. Since the 20 cores can process a total of only 192 MB at a time, each core must read its input data in at least two pieces. By the time a piece is read again, it has already been evicted from the L2 cache, so every read is served from the GM.

Figure 1 L2 cache tiling disabled
constexpr int32_t TOTAL_LENGTH = 384 * 1024 * 1024 / sizeof(half);
constexpr int32_t USE_CORE_NUM = 20;
constexpr int32_t TILE_NUM = 2;
constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;
constexpr int32_t TILE_LENGTH = BLOCK_LENGTH / TILE_NUM;

class KernelSample {
public:
    __aicore__ inline KernelSample() {}
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y)
    {
        // Each core handles a contiguous BLOCK_LENGTH slice of the input and output.
        xGm.SetGlobalBuffer((__gm__ half*)x + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
        yGm.SetGlobalBuffer((__gm__ half*)y + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
        pipe.InitBuffer(inQueueX, 1, TILE_LENGTH * sizeof(half));
        pipe.InitBuffer(inQueueY, 1, TILE_LENGTH * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        // Two passes over the input; each pass reads the data from the GM and adds 1 to it.
        constexpr int32_t loopCount = 2;
        for (int32_t i = 0; i < loopCount; i++) {
            for (int32_t j = 0; j < TILE_NUM; j++) {
                // The inner loop processes data blocks 0 and 1 of this core.
                CopyIn(j);
                Compute();
                CopyOut(j);
            }
        }
    }
private:
    __aicore__ inline void CopyIn(int32_t process)
    {
        LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
        // After the first pass, whenever data block 0 is read the L2 cache holds data block 1,
        // and whenever data block 1 is read the cache holds data block 0, so no read ever hits.
        // Each core therefore reads from the GM four times (2 passes x 2 blocks).
        DataCopy(xLocal, xGm[process * TILE_LENGTH], TILE_LENGTH);
        inQueueX.EnQue(xLocal);
    }
    __aicore__ inline void Compute()
    {
        LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();
        LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        Adds(yLocal, xLocal, (half)1, TILE_LENGTH);
        inQueueY.EnQue<half>(yLocal);
        inQueueX.FreeTensor(xLocal);
    }
    __aicore__ inline void CopyOut(int32_t process)
    {
        LocalTensor<half> yLocal = inQueueY.DeQue<half>();
        DataCopy(yGm[process * TILE_LENGTH], yLocal, TILE_LENGTH);
        inQueueY.FreeTensor(yLocal);
    }
    TPipe pipe;
    TQue<TPosition::VECIN, 1> inQueueX;
    TQue<TPosition::VECOUT, 1> inQueueY;
    GlobalTensor<half> xGm;
    GlobalTensor<half> yGm;
};
...

[Positive Example]

Assume that the input data size InputTotalSize is 384 MB, the L2 cache size is 192 MB, and a total of 20 cores are used. The data is evenly divided into two 192 MB tiles, which the 20 cores compute in two separate passes. Because each tile fits in the L2 cache, the data read for each repeated computation hits the cache, improving operator performance.

Figure 2 L2 cache tiling enabled
constexpr int32_t TOTAL_LENGTH = 384 * 1024 * 1024 / sizeof(half);
constexpr int32_t TILE_NUM = 2;
constexpr int32_t USE_CORE_NUM = 20;
constexpr int32_t TILE_LENGTH = TOTAL_LENGTH / TILE_NUM;
constexpr int32_t BLOCK_LENGTH = TILE_LENGTH / USE_CORE_NUM;

class KernelSample {
public:
    __aicore__ inline KernelSample() {}
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, int32_t index)
    {
        // index selects which 192 MB tile to process; each core takes a BLOCK_LENGTH slice of it.
        xGm.SetGlobalBuffer((__gm__ half*)x + BLOCK_LENGTH * GetBlockIdx() + index * TILE_LENGTH, BLOCK_LENGTH);
        yGm.SetGlobalBuffer((__gm__ half*)y + BLOCK_LENGTH * GetBlockIdx() + index * TILE_LENGTH, BLOCK_LENGTH);
        pipe.InitBuffer(inQueueX, 1, BLOCK_LENGTH * sizeof(half));
        pipe.InitBuffer(inQueueY, 1, BLOCK_LENGTH * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        // Two passes over this tile; each pass reads the data from the GM and adds 1 to it.
        constexpr int32_t loopCount = 2;
        for (int32_t i = 0; i < loopCount; i++) {
            // Each iteration re-reads the same BLOCK_LENGTH of data.
            CopyIn();
            Compute();
            CopyOut();
        }
    }
private:
    __aicore__ inline void CopyIn()
    {
        LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
        // The first read of each tile misses and is served from the GM; because the
        // 192 MB tile fits in the L2 cache, the second read hits the cache.
        // Across both tiles, each core reads from the GM twice and from the L2 cache twice.
        DataCopy(xLocal, xGm, BLOCK_LENGTH);
        inQueueX.EnQue(xLocal);
    }
    __aicore__ inline void Compute()
    {
        LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();
        LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        Adds(yLocal, xLocal, (half)1, BLOCK_LENGTH);
        inQueueY.EnQue<half>(yLocal);
        inQueueX.FreeTensor(xLocal);
    }
    __aicore__ inline void CopyOut()
    {
        LocalTensor<half> yLocal = inQueueY.DeQue<half>();
        DataCopy(yGm, yLocal, BLOCK_LENGTH);
        inQueueY.FreeTensor(yLocal);
    }
    TPipe pipe;
    TQue<TPosition::VECIN, 1> inQueueX;
    TQue<TPosition::VECOUT, 1> inQueueY;
    GlobalTensor<half> xGm;
    GlobalTensor<half> yGm;
};
...

extern "C" __global__ __aicore__ void simple_kernel(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm)
{
    KernelSample op;
    // The input data is evenly divided into TILE_NUM tiles, processed one after another.
    for (int32_t i = 0; i < TILE_NUM; i++) {
        op.Init(srcGm, dstGm, i);
        op.Process();
    }
}
...
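
A note on the structure of the sketch above: the tile loop lives in the kernel entry function, and each Init call rebases xGm and yGm to the next 192 MB tile. One pass over a tile therefore completes, with its data resident in the L2 cache, before the next tile is touched, which is what allows the repeated read in Process to hit the cache.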