Reusing VECIN and VECOUT for Transfer Operators

[Priority] High

[Description] Transfer operators only move data and perform no actual vector computation during execution, so any redundant vector operations merely prolong the operator's overall execution time. For this scenario, Ascend C provides the TQueBind API for transfer operators. It binds the VECIN and VECOUT queue positions together so that the data copy between them can be skipped, avoiding unnecessary consumption of the vector pipeline.
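
The typical pattern is sketched below: one TQueBind queue replaces the separate VECIN and VECOUT queues, and the tensor allocated from it serves both the copy-in and the copy-out. This is only a minimal sketch; local, srcGm, and size are illustrative placeholders rather than names taken from the examples that follow.

    TQueBind<QuePosition::VECIN, QuePosition::VECOUT, BUFFER_NUM> queBind;
    ...
    auto local = queBind.AllocTensor<ComputeT>();
    DataCopy(local, srcGm, size);       // Copy-in: GM -> VECIN
    queBind.EnQue(local);
    local = queBind.DeQue<ComputeT>();  // The same buffer is now visible at VECOUT; no extra copy
    DataCopyPad(outGm, local, ...);     // Copy-out: VECOUT -> GM
    queBind.FreeTensor(local);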

[Negative Example]

The following code snippet uses a DataCopy instruction from one LocalTensor to another LocalTensor to move data from the VECIN queue to the VECOUT queue and keep the two pipeline stages synchronized.
template <typename ComputeT> class KernelExample {
 public:
     ...
     __aicore__ inline void Process(...)
     {
         for (int i = 0; i < iLen; ++i) {
             ... 
             auto iLocal = QueI.AllocTensor<ComputeT>();
             DataCopy(iLocal, inGm[i * 32], size);
             QueI.EnQue(iLocal);
             iLocal = QueI.DeQue<ComputeT>(); // Dequeue from VECIN once the copy-in has completed
             for (int j = 0; j < jLen; ++j) { 
                 ...
                 auto oLocal = QueO.AllocTensor<ComputeT>();
                 DataCopy(oLocal, iLocal, size); // Data copy instruction from a LocalTensor to another LocalTensor, used to transfer data from VECIN to VECOUT
                 QueO.EnQue(oLocal);

                 oLocal = QueO.DeQue<ComputeT>(); // Dequeue from VECOUT for the copy-out
                 DataCopyPad(outGm[j], oLocal, ...);
                 QueO.FreeTensor(oLocal);
             }
             QueI.FreeTensor(iLocal);
         }
     }

 private:
     ... 
     TQue<QuePosition::VECIN, BUFFER_NUM> QueI;
     TQue<QuePosition::VECOUT, BUFFER_NUM> QueO;
     ...
 };

 extern "C" __global__ __aicore__ void example_kernel(...)
 {
     ...
     op.Process(...);
 }

[Positive Example]

The LocalTensor-to-LocalTensor DataCopy instruction is removed: the separate VECIN and VECOUT queues are replaced with a single TQueBind, so the copy from VECIN to VECOUT is skipped and the redundant vector computation is avoided.

template <typename ComputeT> class KernelExample {
 public:
     ...
     __aicore__ inline void Process(...)
     {
         for (int i = 0; i < iLen; ++i) {
             ... 
             auto bindLocal = queBind.AllocTensor<ComputeT>();
             DataCopy(bindLocal, inGm[i * 32], size);
             queBind.EnQue(bindLocal);
             bindLocal = queBind.DeQue<ComputeT>(); // Dequeue once the copy-in has completed; the buffer is visible at VECOUT
             for (int j = 0; j < jLen; ++j) {
                 ...
                 DataCopyPad(outGm[j], bindLocal, ...);
             }
             queBind.FreeTensor(bindLocal);
         }
     }

 private:
     ... 
     TQueBind<QuePosition::VECIN, QuePosition::VECOUT, BUFFER_NUM> queBind; // Use TQueBind to replace QueI and QueO.
     ...
 };

 extern "C" __global__ __aicore__ void example_kernel(...)
 {
     ...
     op.Process(...);
 }
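
Note that both examples elide the operator's Init stage. Assuming a member TPipe named pipe (not shown above), the bound queue would be given its buffer there in the same way as an ordinary TQue, for example:

     pipe.InitBuffer(queBind, BUFFER_NUM, bufferSize); // bufferSize is a hypothetical tile size in bytes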

[Performance Comparison]

Figure 1 Comparison of aiv_vec_time before and after optimization

As shown in the preceding figure, replacing the DataCopy instruction from the negative example with TQueBind brings an obvious optimization: because the copy from VECIN to VECOUT is omitted, aiv_vec_time drops to almost 0.