How to Improve Operator Performance Through Inplace Tensor Operations

Inplace tensor operations (the inplace API) are an optimization technique that allocates LocalTensor memory globally and retains it, which avoids frequently creating and destroying LocalTensor objects. In this mode, the AllocTensor, FreeTensor, EnQue, and DeQue APIs do not generate any new LocalTensor objects. Instead, they repeatedly allocate, free, enqueue, and dequeue on the same global LocalTensor. The following figure shows the implementation principle.

Figure 1 Implementation principle of inplace tensor operations
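
The principle can also be illustrated with a small toy model in plain C++. This is only a sketch: MockTensor, StandardQueue, InplaceQueue, and the counter are hypothetical stand-ins invented for illustration and are not Ascend C types. It contrasts a queue that builds a new tensor handle on every allocation with one that rebinds a single handle owned by the caller.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counter: how many tensor handles have been constructed.
static int g_handlesBuilt = 0;

// Hypothetical stand-in for a tensor handle (not an Ascend C LocalTensor).
struct MockTensor {
    uint32_t addr;
    explicit MockTensor(uint32_t a = 0) : addr(a) { ++g_handlesBuilt; }
};

// Conventional style: every AllocTensor() builds and returns a new handle.
struct StandardQueue {
    uint32_t base;
    MockTensor AllocTensor() { return MockTensor(base); }
};

// Inplace style: the caller owns one long-lived handle; the queue only rebinds it.
struct InplaceQueue {
    uint32_t base;
    void AllocTensor(MockTensor& t) { t.addr = base; }
};

// Returns the number of handles built over `iterations` loop rounds.
int CountStandard(int iterations) {
    g_handlesBuilt = 0;
    StandardQueue q{0x100};
    for (int i = 0; i < iterations; ++i) {
        MockTensor t = q.AllocTensor();  // new handle each round
        assert(t.addr == 0x100);
    }
    return g_handlesBuilt;
}

int CountInplace(int iterations) {
    g_handlesBuilt = 0;
    InplaceQueue q{0x100};
    MockTensor t;  // built once, outside the loop
    for (int i = 0; i < iterations; ++i) {
        q.AllocTensor(t);  // same handle, rebound in place
        assert(t.addr == 0x100);
    }
    return g_handlesBuilt;
}
```

In this model the standard style builds one handle per iteration, while the inplace style builds exactly one handle no matter how many iterations run.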

Advantages of Inplace Tensor Operations

- Reduced stack switching: Compared with constructing a new tensor on each call, the inplace API reduces LocalTensor stack switching because the same tensor is reused.
- Fewer enqueue and dequeue operations: When EnQue and DeQue are called, the TQue object does not store the buffer address corresponding to the tensor, so no actual enqueue or dequeue takes place. This removes the scalar instructions that repeated enqueue and dequeue operations would otherwise require.

Reasons for Retaining EnQue and DeQue

Although no actual enqueue or dequeue occurs during inplace tensor operations, there are valid reasons to retain the EnQue and DeQue APIs.

Programming compatibility: Code that uses the inplace API still calls EnQue and DeQue so that it follows the same programming model as non-inplace code. This keeps the code structure consistent and maintainable.
Memory synchronization: EnQue and DeQue also perform memory read/write synchronization, which guarantees data consistency and correctness. This synchronization is still required even when no real queue operation takes place.
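
The synchronization role can be pictured with a toy model in plain C++. This is a sketch under stated assumptions: SyncOnlyQueue and ProduceThenConsume are hypothetical names, and a boolean flag stands in for whatever hardware synchronization the real EnQue and DeQue perform. The queue stores no tensor address at all, yet skipping EnQue still makes the consumer's read invalid.

```cpp
#include <cassert>

// Hypothetical depth-0 queue: stores no address, only a "data ready" flag
// that stands in for the real synchronization mechanism.
class SyncOnlyQueue {
public:
    // Producer side: signal that the write into the buffer has finished.
    void EnQue() { ready_ = true; }

    // Consumer side: succeed only if the producer has signalled, then
    // consume the signal so the next round must synchronize again.
    bool DeQue() {
        if (!ready_) return false;  // reading now would race with the write
        ready_ = false;
        return true;
    }

private:
    bool ready_ = false;
};

// Writes `value` into a buffer, then reads it back through the queue protocol.
// Returns the value read, or -1 if the read was attempted without EnQue.
int ProduceThenConsume(int value, bool callEnQue) {
    assert(value >= 0);  // -1 is reserved as the failure sentinel
    int buffer = 0;
    SyncOnlyQueue q;
    buffer = value;             // producer stage writes the buffer
    if (callEnQue) q.EnQue();   // publish: the write is complete
    if (!q.DeQue()) return -1;  // consumer stage must not read yet
    return buffer;
}
```

With EnQue in place the consumer reads the value that was written. Without it, the model refuses the read, which mirrors why the calls are kept even though nothing is actually queued.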

Use Cases

Scenarios requiring multiple computation iterations: As shown in Figure 1, the inplace API increases the initialization (InitBuffer) overhead of the TQue object, but significantly reduces the number of LocalTensor and event operations that AllocTensor, EnQue, DeQue, and FreeTensor perform in each iteration. It is therefore best suited to computations that need many iterations to complete.
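
A rough way to reason about this trade-off is a simple linear cost model. It is a back-of-the-envelope sketch, not a measured result, and the symbols are introduced here only for illustration: $C_{\text{init}}$ is the extra one-time InitBuffer cost of the inplace mode, $c_{\text{std}}$ and $c_{\text{inp}}$ are the per-iteration overheads of the queue calls in non-inplace and inplace mode, and $N$ is the number of iterations.

```latex
T_{\text{std}}(N) = N\,c_{\text{std}}, \qquad
T_{\text{inp}}(N) = C_{\text{init}} + N\,c_{\text{inp}}

T_{\text{inp}}(N) < T_{\text{std}}(N)
\iff
N > \frac{C_{\text{init}}}{c_{\text{std}} - c_{\text{inp}}}
\qquad (c_{\text{std}} > c_{\text{inp}})
```

Under this model the inplace mode pays off once the iteration count exceeds the break-even point on the right-hand side, and a kernel that runs only a few iterations may not recover the extra initialization cost.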

How to Use

Configuring the TQue object: When creating a TQue object, set depth to 0 to enable the inplace operation mode.
Calling the inplace operation API: Call the inplace forms of the APIs to operate directly on the LocalTensor.
- AllocTensor and DeQue have separate non-inplace and inplace forms. For details, see AllocTensor and DeQue.
- FreeTensor and EnQue have a single form that is used in both modes.
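
The difference between the two call forms can be sketched in plain C++ with mock types. Everything here is hypothetical: TensorSketch, QueSketch, and BothFormsBindSameBuffer are illustration-only names, not the Ascend C classes. The sketch also assumes that the inplace form is meaningful only at depth 0 and enforces that with a static_assert, which is a modelling choice rather than a statement about the real API. DeQue would follow the same pattern as AllocTensor.

```cpp
#include <cassert>

// Hypothetical tensor handle: just a typed pointer into a buffer.
template <typename T>
struct TensorSketch {
    T* data = nullptr;
};

// Hypothetical queue, parameterized on depth like the TQue in the sample.
template <int Depth>
class QueSketch {
public:
    explicit QueSketch(void* buf) : buf_(buf) { assert(buf != nullptr); }

    // Non-inplace form: returns a new handle by value.
    template <typename T>
    TensorSketch<T> AllocTensor() {
        TensorSketch<T> t;
        t.data = static_cast<T*>(buf_);
        return t;
    }

    // Inplace form: fills in a handle that the caller already owns.
    template <typename T>
    void AllocTensor(TensorSketch<T>& t) {
        static_assert(Depth == 0, "inplace form is modelled for depth 0 only");
        t.data = static_cast<T*>(buf_);
    }

    // FreeTensor has a single form in both modes.
    template <typename T>
    void FreeTensor(TensorSketch<T>& t) { t.data = nullptr; }

private:
    void* buf_;
};

// Exercises both call forms and reports whether they bind the same buffer.
bool BothFormsBindSameBuffer() {
    float storage[8] = {};
    QueSketch<1> normal(storage);
    QueSketch<0> inplace(storage);

    TensorSketch<float> a = normal.AllocTensor<float>();  // value-returning form
    TensorSketch<float> b;                                // long-lived handle
    inplace.AllocTensor<float>(b);                        // reference-taking form

    bool same = (a.data == storage) && (b.data == storage);
    normal.FreeTensor(a);
    inplace.FreeTensor(b);
    return same && a.data == nullptr && b.data == nullptr;
}
```

The only visible difference at the call site is whether the tensor comes back as a return value or is passed in by reference, which matches how the sample code below passes its member tensors to AllocTensor and DeQue.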

Sample Code

// ...
namespace AscendC {
class MyKernel {
public:
    __aicore__ inline MyKernel() {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ half*)src0Gm);
        src1Global.SetGlobalBuffer((__gm__ half*)src1Gm);
        dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
        pipe.InitBuffer(srcQue0, 1, BLOCK_SIZE * sizeof(half));
        pipe.InitBuffer(srcQue1, 1, BLOCK_SIZE * sizeof(half));
        pipe.InitBuffer(dstQue0, 1, BLOCK_SIZE * sizeof(half));
    }

    __aicore__ inline void Process()
    {
        for (int i = 0; i < REPTIMES; i++) {
            CopyIn(i);
            Compute(i);
            CopyOut(i);
        }
    }

private:
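    __aicore__ inline void CopyIn(int32_t i)
    {
        // Inplace form: the member tensors are passed in and reused,
        // so no new LocalTensor object is created here.
        srcQue0.AllocTensor<half>(src0Local);
        srcQue1.AllocTensor<half>(src1Local);
        DataCopy(src0Local, src0Global[i*BLOCK_SIZE], BLOCK_SIZE);
        DataCopy(src1Local, src1Global[i*BLOCK_SIZE], BLOCK_SIZE);
        // No buffer address is stored in the queue, but EnQue is still
        // called for memory synchronization.
        srcQue0.EnQue(src0Local);
        srcQue1.EnQue(src1Local);
    }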
    __aicore__ inline void Compute(int32_t i)
    {
        srcQue0.DeQue<half>(src0Local);
        srcQue1.DeQue<half>(src1Local);
        dstQue0.AllocTensor<half>(dstLocal);
        Add(dstLocal, src0Local, src1Local, BLOCK_SIZE);
        dstQue0.EnQue<half>(dstLocal);
        srcQue0.FreeTensor(src0Local);
        srcQue1.FreeTensor(src1Local);
    }
    __aicore__ inline void CopyOut(int32_t i)
    {
        dstQue0.DeQue<half>(dstLocal);
        DataCopy(dstGlobal[i*BLOCK_SIZE], dstLocal, BLOCK_SIZE);
        dstQue0.FreeTensor(dstLocal);
    }

private:
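    TPipe pipe;
    // A depth of 0 enables the inplace operation mode.
    TQue<QuePosition::VECIN, 0> srcQue0, srcQue1;
    TQue<QuePosition::VECOUT, 0> dstQue0;
    GlobalTensor<half> src0Global, src1Global, dstGlobal;
    // The LocalTensor objects are class members and are reused in every iteration.
    LocalTensor<half> src0Local;
    LocalTensor<half> src1Local;
    LocalTensor<half> dstLocal;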
};
} // namespace AscendC

// ...

