Avoid Creating and Initializing TPipe Inside the Class Object

[Priority] Medium

[Compiler background] When a class object is created, memory is allocated to hold its member variables. When a member variable is needed in a computation, its value is loaded from memory into a register; after the computation completes, the result is stored from the register back to memory. Constant folding and constant propagation of scalars are compiler optimizations that can reduce these memory accesses. Before applying them, the compiler checks whether a variable is initialized or assigned a value only once. If that precondition holds, the value can stay in a register, so later uses of the variable require fewer memory reads, which improves runtime performance.
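The following host-side C++ sketch illustrates why an escaped address forces the compiler to be conservative. It is not Ascend C code: Params, g_escaped, Bump, TotalLocal, and TotalEscaped are hypothetical names introduced only for illustration. In the first function the object never leaves the function, so a compiler is free to fold the product into a constant. In the second, the object's address is stored in a global pointer and another function writes through it, so the member must be reloaded from memory.

```cpp
#include <cstdint>

// Hypothetical parameter block; both fields are assigned once at construction.
struct Params {
    int32_t blockLen;
    int32_t loopCount;
};

// Hypothetical global pointer. Once an object's address is stored here,
// any function may read or write the object through it.
Params* g_escaped = nullptr;

// Writes through the global pointer, if one is registered.
void Bump() {
    if (g_escaped != nullptr) {
        g_escaped->loopCount += 1;
    }
}

// The object never escapes, so the compiler may fold 64 * 8 at compile time.
int32_t TotalLocal() {
    Params p{64, 8};
    return p.blockLen * p.loopCount;
}

// The object's address escapes before the call to Bump(), so the compiler
// cannot assume loopCount is still 8 and must reload it after the call.
int32_t TotalEscaped() {
    Params p{64, 8};
    g_escaped = &p;
    Bump();                                   // modifies p.loopCount to 9
    int32_t total = p.blockLen * p.loopCount; // 64 * 9
    g_escaped = nullptr;                      // do not leave a dangling pointer
    return total;
}
```

TotalLocal() returns 512 and TotalEscaped() returns 576. The different results show that the write through the global pointer is real, which is exactly why a compiler cannot keep the member in a register across such a call.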

[Description] TPipe is a framework that manages global memory and synchronization; it is called to allocate memory for TQue and TBuf. Ascend C operators are often written around a class, here named KernelExample, that stores the variables required for computation. If the TPipe object is defined and initialized as a member of KernelExample, its memory lies inside the memory of the KernelExample object. When a TPipe object is created, its initialization stores a pointer to it in a global variable. The KernelExample object's memory is then reachable through that global pointer and may be polluted, so the compiler adopts a conservative policy and does not perform constant folding and propagation on the scalar variables of the KernelExample object. It is therefore recommended, in all scenarios, to create the TPipe object outside the KernelExample class so that its memory is independent of the KernelExample object. The compiler can then optimize the scalars in the KernelExample class, which reduces the time the operator spends on scalar instructions.
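The memory-layout argument above can be checked with a small host-side C++ mock. This is a sketch under stated assumptions, not the real TPipe: MockPipe, g_pipe, KernelWithMember, KernelWithPointer, and PointsInside are hypothetical names, and the only behavior modeled is that the constructor stores the object's address in a global pointer. The sketch tests whether that global pointer ends up pointing into the kernel object's own storage.

```cpp
#include <cstddef>
#include <cstdint>

struct MockPipe;             // hypothetical stand-in, NOT the real TPipe
MockPipe* g_pipe = nullptr;  // hypothetical global pointer set on construction

struct MockPipe {
    MockPipe() { g_pipe = this; }  // models "initialization sets a global pointer"
    int32_t bufCount = 0;
};

// Pipe embedded as a member: its storage is part of the kernel object.
struct KernelWithMember {
    int32_t tileLen = 128;
    MockPipe pipe;
};

// Only a pointer is stored: the pipe's storage lives elsewhere.
struct KernelWithPointer {
    int32_t tileLen = 128;
    MockPipe* pipe = nullptr;
};

// Returns true if address p lies within the storage occupied by obj.
template <typename T>
bool PointsInside(const T& obj, const void* p) {
    auto begin = reinterpret_cast<std::uintptr_t>(&obj);
    auto addr = reinterpret_cast<std::uintptr_t>(p);
    return addr >= begin && addr < begin + sizeof(T);
}

// Member case: the global pointer refers to memory inside the kernel object.
bool MemberCaseEscapes() {
    KernelWithMember op;
    return PointsInside(op, g_pipe);
}

// Pointer case: the pipe is a separate object, so the global pointer
// does not refer to memory inside the kernel object.
bool PointerCaseEscapes() {
    MockPipe pipe;
    KernelWithPointer op;
    op.pipe = &pipe;
    return PointsInside(op, g_pipe);
}
```

MemberCaseEscapes() returns true and PointerCaseEscapes() returns false. In the second layout nothing global refers to the kernel object's storage, which is the property the recommendation relies on.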

[Negative Example]

In this code, the TPipe object is created and initialized inside the KernelExample class, which inhibits the compiler's scalar folding optimization. As a result, scalar time on the NPU increases.

template <typename ComputeT> class KernelExample {
public:
    __aicore__ inline KernelExample() {}

    __aicore__ inline void Init(...)
    {
        ...
        pipe.InitBuffer(xxxBuf, BUFFER_NUM, xxxSize);
        ...
    }

private:
    ...
    TPipe pipe;
    ...
};

extern "C" __global__ __aicore__ void example_kernel(...)
{
    ...
    KernelExample<float> op;
    op.Init(...);
    ...
}

[Positive Example]

The TPipe object is created in the kernel entry function, and only a pointer to it is stored in the KernelExample class.

template <typename ComputeT> class KernelExample {
public:
    __aicore__ inline KernelExample() {}

    __aicore__ inline void Init(..., TPipe* pipeIn)
    {
        ...
        pipe = pipeIn;
        pipe->InitBuffer(xxxBuf, BUFFER_NUM, xxxSize);
        ...
    }

private:
    ...
    TPipe* pipe;
    ...
};

extern "C" __global__ __aicore__ void example_kernel(...)
{
    ...
    TPipe pipe;
    KernelExample<float> op;
    op.Init(..., &pipe);
    ...
}

[Performance Comparison]

Figure 1 Comparison of aiv_scalar_time before and after optimization
Figure 2 Comparison of aiv_scalar_ratio before and after optimization

According to the profiling data, scalar time improves significantly: the average aiv_scalar_time decreases by about 16%, from 281 µs to 236 µs, and the average aiv_scalar_ratio decreases from 21% to 17%. This optimization is therefore useful in scalar-bound scenarios.