How to Use Tensor In-Place Operations to Improve Operator Performance

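Tensor in-place operation (the inplace interfaces) is an optimization technique that allocates LocalTensor memory once at global scope and retains it, which avoids frequently creating and destroying LocalTensor objects. The AllocTensor, FreeTensor, EnQue, and DeQue interfaces do not produce a new LocalTensor. Instead, they repeatedly allocate, free, enqueue, and dequeue on that global LocalTensor. The figure below shows how this works: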

Figure 1 How Tensor in-place operations work

Advantages of Tensor in-place operations

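Fewer stack changes: compared with constructing a new Tensor each time, the inplace interfaces reduce stack changes for LocalTensor objects and allow the same Tensor to be reused repeatedly.
Fewer enqueue/dequeue operations: during EnQue and DeQue calls, the TQue object does not store the Buffer address of the Tensor, so nothing is actually enqueued or dequeued. This removes the Scalar instructions that repeated enqueueing and dequeuing would otherwise issue.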

Why EnQue and DeQue are retained

If Tensor in-place operations do not perform a real enqueue or dequeue, why are the EnQue and DeQue interfaces still needed?

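Programming compatibility: to keep the programming interface consistent, code that uses the inplace interfaces still calls EnQue and DeQue, which keeps the code structure uniform and maintainable.
Memory synchronization: EnQue and DeQue implement memory read/write synchronization, which guarantees data consistency and correctness. Even when no real queue operation takes place, this synchronization must be retained.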
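The synchronization role described above can be pictured with a small single-threaded analogy. This is only a sketch: `ToyInplaceQue` and its `ready` flag are hypothetical and are not the Ascend C TQue, and a plain boolean is assumed here as a stand-in for whatever synchronization mechanism the real interfaces use. The point is that EnQue and DeQue can carry a "data is ready" signal even though no tensor address is stored in a queue.

```cpp
#include <cassert>

// Hypothetical toy queue for illustration only; not the Ascend C TQue.
class ToyInplaceQue {
public:
    // No tensor address is stored; the call only raises the "data ready" flag.
    void EnQue() {
        ready = true;
        ++syncOps;
    }

    // Nothing is popped; the call only checks and clears the flag.
    // A real pipeline would wait here instead of returning false.
    bool DeQue() {
        if (!ready) {
            return false;
        }
        ready = false;
        ++syncOps;
        return true;
    }

    int syncOps = 0;  // number of synchronization steps performed

private:
    bool ready = false;
};
```

In this analogy, a consumer that calls `DeQue()` before the producer has called `EnQue()` does not get to read the buffer, which is the ordering guarantee that would be lost if the two calls were dropped.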

Applicable scenarios

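Suited to computations with many loop iterations: as Figure 1 shows, the inplace interfaces add InitBuffer initialization overhead for the TQue object, but they significantly reduce the number of LocalTensor and event operations performed inside AllocTensor, EnQue, DeQue, and FreeTensor in every iteration. They are therefore especially suited to computations that need many loop iterations to complete.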
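The trade-off above is a break-even calculation: a one-time extra cost is paid back by a saving in every iteration. The sketch below assumes both costs can be expressed in the same abstract unit (for example, cycles). The function name and the sample numbers in the usage note are hypothetical and are not measurements.

```cpp
#include <cassert>

// Returns the smallest iteration count n for which
//     n * savedPerIter > extraInitCost,
// i.e. the point from which the in-place variant is cheaper overall.
// Both arguments are in the same abstract cost unit; savedPerIter must be > 0.
long BreakEvenIterations(long extraInitCost, long savedPerIter) {
    return extraInitCost / savedPerIter + 1;
}
```

For instance, with an assumed extra initialization cost of 100 units and an assumed saving of 8 units per iteration, the in-place variant starts to pay off from iteration 13 onward.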

Usage

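Configure the TQue object: when creating the TQue object, set its depth to 0 to enable in-place mode.
Call the in-place interfaces: use the inplace interfaces to operate on the LocalTensor directly.
- AllocTensor and DeQue have separate non-inplace and inplace interfaces. For details, see AllocTensor and DeQue.
- FreeTensor and EnQue do not distinguish between non-inplace and inplace interfaces.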
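The difference between the two call forms can be illustrated with a toy model in plain C++. `ToyTensor` and `ToyQue` are hypothetical stand-ins, not the Ascend C LocalTensor and TQue. The sketch assumes that the non-inplace form returns a freshly built handle while the inplace form rebinds a handle the caller keeps alive, and it counts how many handles get built.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a tensor handle: it only records which buffer it views.
struct ToyTensor {
    float* data;
    std::size_t size;
};

// Hypothetical stand-in for a queue that owns one fixed buffer.
class ToyQue {
public:
    // Non-inplace form: builds and returns a new handle on every call.
    ToyTensor AllocTensor() {
        ++handlesBuilt;
        return ToyTensor{buffer.data(), buffer.size()};
    }

    // Inplace form: rebinds a handle that the caller already owns.
    void AllocTensor(ToyTensor& t) {
        t.data = buffer.data();
        t.size = buffer.size();
    }

    int handlesBuilt = 0;  // handles constructed by the non-inplace form

private:
    std::array<float, 8> buffer{};
};

// Runs the loop with the non-inplace form and reports how many handles were built.
int CountHandlesNonInplace(int iterations) {
    ToyQue que;
    for (int i = 0; i < iterations; ++i) {
        ToyTensor t = que.AllocTensor();
        t.data[0] = static_cast<float>(i);
    }
    return que.handlesBuilt;
}

// Runs the same loop with one persistent handle and the inplace form.
int CountHandlesInplace(int iterations) {
    ToyQue que;
    ToyTensor t{};  // created once, outside the loop
    for (int i = 0; i < iterations; ++i) {
        que.AllocTensor(t);
        t.data[0] = static_cast<float>(i);
    }
    return que.handlesBuilt;
}
```

In this model the non-inplace loop builds one handle per iteration, while the inplace loop builds none inside the loop, which mirrors the reuse of the member LocalTensor objects in the sample code.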

Sample code

// ...
namespace AscendC {
class MyKernel {
public:
    __aicore__ inline MyKernel() {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ half*)src0Gm);
        src1Global.SetGlobalBuffer((__gm__ half*)src1Gm);
        dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
        pipe.InitBuffer(srcQue0, 1, BLOCK_SIZE * sizeof(half));
        pipe.InitBuffer(srcQue1, 1, BLOCK_SIZE * sizeof(half));
        pipe.InitBuffer(dstQue0, 1, BLOCK_SIZE * sizeof(half));
    }

    __aicore__ inline void Process()
    {
        for (int i = 0; i < REPTIMES; i++) {
            CopyIn(i);
            Compute(i);
            CopyOut(i);
        }
    }

private:
    __aicore__ inline void CopyIn(int32_t i)
    {
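        // Inplace form: AllocTensor works on the existing member LocalTensor
        // instead of producing a new one.
        srcQue0.AllocTensor<half>(src0Local);
        srcQue1.AllocTensor<half>(src1Local);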
        DataCopy(src0Local, src0Global[i*BLOCK_SIZE], BLOCK_SIZE);
        DataCopy(src1Local, src1Global[i*BLOCK_SIZE], BLOCK_SIZE);
        srcQue0.EnQue(src0Local);
        srcQue1.EnQue(src1Local);
    }
    __aicore__ inline void Compute(int32_t i)
    {
        srcQue0.DeQue<half>(src0Local);
        srcQue1.DeQue<half>(src1Local);
        dstQue0.AllocTensor<half>(dstLocal);
        Add(dstLocal, src0Local, src1Local, BLOCK_SIZE);
        dstQue0.EnQue<half>(dstLocal);
        srcQue0.FreeTensor(src0Local);
        srcQue1.FreeTensor(src1Local);
    }
    __aicore__ inline void CopyOut(int32_t i)
    {
        dstQue0.DeQue<half>(dstLocal);
        DataCopy(dstGlobal[i*BLOCK_SIZE], dstLocal, BLOCK_SIZE);
        dstQue0.FreeTensor(dstLocal);
    }

private:
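    TPipe pipe;
    // A depth of 0 puts these queues in in-place mode.
    TQue<QuePosition::VECIN, 0> srcQue0, srcQue1;
    TQue<QuePosition::VECOUT, 0> dstQue0;
    GlobalTensor<half> src0Global, src1Global, dstGlobal;
    // The LocalTensor objects are class members, so every loop iteration reuses them.
    LocalTensor<half> src0Local;
    LocalTensor<half> src1Local;
    LocalTensor<half> dstLocal;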
};
} // namespace AscendC

// ...