Avoid creating and initializing TPipe inside an object

【Priority】Medium

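【Description】TPipe is the framework that manages global memory and synchronization; its interfaces are called to allocate memory for TQue/TBuf. Scalar constant folding is the compile-time process of evaluating and simplifying constant expressions, which shrinks the generated code and improves runtime performance. As in ordinary C++ compilation, when the scalar pipeline becomes the performance bottleneck (scalar bound), streamlining how the TPipe object is created and initialized reduces the memory addresses it occupies and therefore the time spent on scalar instructions.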
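
To make the term concrete, here is a minimal sketch of constant folding in plain host-side C++. It is not Ascend C code: the names `FoldedBufferBytes` and `RuntimeBufferBytes` and the sizes are hypothetical, and a 4-byte `float` is assumed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sizes chosen for illustration only.
constexpr uint32_t BUFFER_NUM = 2;
constexpr uint32_t TILE_LENGTH = 256;

// Every operand is a compile-time constant, so the compiler can fold the
// whole expression into a single literal (2048 with a 4-byte float).
constexpr uint32_t FoldedBufferBytes()
{
    return static_cast<uint32_t>(BUFFER_NUM * TILE_LENGTH * sizeof(float));
}

// tileLength is only known at run time, so in general the multiplication
// has to be carried out by scalar instructions when the function is called.
inline uint32_t RuntimeBufferBytes(uint32_t tileLength)
{
    return static_cast<uint32_t>(BUFFER_NUM * tileLength * sizeof(float));
}

// Proves that the folded value is available at compile time.
static_assert(FoldedBufferBytes() == 2048, "evaluated at compile time");
```

Both functions return the same value for the same tile length; the difference is only in when the arithmetic happens, which is what matters in a scalar-bound kernel.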

【Counterexample】

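In this code the TPipe object is created and initialized inside the KernelExample class, which hinders the compiler's scalar constant folding and adds unnecessary scalar instructions on the NPU side.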

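template <typename ComputeT> class KernelExample {
public:
    __aicore__ inline KernelExample() {}

    __aicore__ inline void Init(...)
    {
        ...
        // Buffers are allocated through the TPipe member owned by this object.
        pipe.InitBuffer(xxxBuf, BUFFER_NUM, xxxSize);
        ...
    }

private:
    ...
    TPipe pipe; // Created and initialized inside the class: avoid this.
    ...
};

extern "C" __global__ __aicore__ void example_kernel(...)
{
    ...
    KernelExample<float> op; // Constructing op also constructs its TPipe member.
    op.Init(...);
    ...
}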

【Correct example】

Create the TPipe object in the kernel entry function instead, and have the KernelExample class store and use a pointer to it.

template <typename ComputeT> class KernelExample {
public:
    __aicore__ inline KernelExample() {}

    __aicore__ inline void Init(..., TPipe* pipeIn)
    {
        ...
        pipe = pipeIn; // Keep a pointer to the TPipe owned by the kernel entry function.
        pipe->InitBuffer(xxxBuf, BUFFER_NUM, xxxSize);
        ...
    }

private:
    ...
    TPipe* pipe; // Pointer only; the class no longer owns a TPipe object.
    ...
};

extern "C" __global__ __aicore__ void example_kernel(...)
{
    ...
    TPipe pipe; // Created once in the kernel entry function.
    KernelExample<float> op;
    op.Init(..., &pipe);
    ...
}
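
The ownership pattern above can also be tried out on a host compiler. The following is only a sketch under stated assumptions: `MockPipe`, `KernelSketch`, and `RunSketch` are hypothetical stand-ins invented for illustration, not part of the Ascend C API, and `MockPipe::InitBuffer` merely adds up byte counts instead of allocating on-chip memory.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for TPipe (NOT the Ascend C class): it only
// tracks how many bytes have been handed out.
struct MockPipe {
    uint32_t allocatedBytes = 0;

    // Reserves num * len bytes and returns the offset of the new region.
    uint32_t InitBuffer(uint32_t num, uint32_t len)
    {
        uint32_t offset = allocatedBytes;
        allocatedBytes += num * len;
        return offset;
    }
};

template <typename ComputeT> class KernelSketch {
public:
    // The pipe is injected by the caller; the class stores only a pointer.
    void Init(uint32_t elemCount, MockPipe* pipeIn)
    {
        pipe = pipeIn;
        uint32_t bytes = static_cast<uint32_t>(elemCount * sizeof(ComputeT));
        inOffset = pipe->InitBuffer(BUFFER_NUM, bytes);
        outOffset = pipe->InitBuffer(BUFFER_NUM, bytes);
    }

    uint32_t InOffset() const { return inOffset; }
    uint32_t OutOffset() const { return outOffset; }

private:
    static constexpr uint32_t BUFFER_NUM = 2; // assumed double buffering
    MockPipe* pipe = nullptr;
    uint32_t inOffset = 0;
    uint32_t outOffset = 0;
};

// Plays the role of the kernel entry function: it owns the pipe and
// passes its address to the operator object.
inline uint32_t RunSketch(uint32_t elemCount)
{
    MockPipe pipe;
    KernelSketch<float> op;
    op.Init(elemCount, &pipe);
    return pipe.allocatedBytes;
}
```

Because the pipe outlives `op.Init` and is owned by the entry function, the operator class stays a thin object that holds a pointer and a few offsets, which mirrors the structure of the correct example.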

【Performance comparison】

Figure 1 aiv_scalar_time before and after optimization

Figure 2 aiv_scalar_ratio before and after optimization

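The performance data show a clear improvement in scalar time: the average drops from 281 us to 236 us, a reduction of 17%, and the average share of time spent in scalar (aiv_scalar_ratio) falls from 21% to 17%. This optimization is therefore worth applying in scalar-bound scenarios.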
