Reusing VECIN and VECOUT for Movement Operators
[Priority] High
[Description] Movement operators perform no actual vector computation during execution, so any redundant vector instruction only prolongs the operator's overall execution time. For such operators, use the TQueBind API that Ascend C provides. TQueBind binds VECIN to VECOUT so that data does not have to be copied between them, which avoids unnecessary use of the vector unit.
[Negative Example]
    template <typename ComputeT> class KernelExample {
    public:
        ...
        __aicore__ inline void Process(...)
        {
            for (int i = 0; i < iLen; ++i) {
                ...
                // Copy in: GM -> VECIN
                auto iLocal = QueI.AllocTensor<ComputeT>();
                DataCopy(iLocal, inGm[i * 32], size);
                QueI.EnQue(iLocal);
                iLocal = QueI.DeQue<ComputeT>();
                for (int j = 0; j < jLen; ++j) {
                    ...
                    auto oLocal = QueO.AllocTensor<ComputeT>();
                    // Redundant LocalTensor-to-LocalTensor copy, used only to move data from VECIN to VECOUT
                    DataCopy(oLocal, iLocal, size);
                    QueO.EnQue(oLocal);
                    // Reuse the existing variable; redeclaring it with auto would not compile
                    oLocal = QueO.DeQue<ComputeT>();
                    // Copy out: VECOUT -> GM
                    DataCopyPad(outGm[j], oLocal, ...);
                    QueO.FreeTensor(oLocal);
                }
                QueI.FreeTensor(iLocal);
            }
        }
    private:
        ...
        TQue<TPosition::VECIN, BUFFER_NUM> QueI;
        TQue<TPosition::VECOUT, BUFFER_NUM> QueO;
        ...
    };

    extern "C" __global__ __aicore__ void example_kernel(...)
    {
        ...
        op.Process(...);
    }
[Positive Example]
Replacing the LocalTensor-to-LocalTensor DataCopy instruction with TQueBind removes the redundant copy from VECIN to VECOUT.
    template <typename ComputeT> class KernelExample {
    public:
        ...
        __aicore__ inline void Process(...)
        {
            for (int i = 0; i < iLen; ++i) {
                ...
                // Copy in: GM -> bound buffer
                auto bindLocal = queBind.AllocTensor<ComputeT>();
                DataCopy(bindLocal, inGm[i * 32], size);
                queBind.EnQue(bindLocal);
                bindLocal = queBind.DeQue<ComputeT>();
                for (int j = 0; j < jLen; ++j) {
                    ...
                    // Copy out directly from the bound buffer; no VECIN -> VECOUT copy is needed
                    DataCopyPad(outGm[j], bindLocal, ...);
                }
                queBind.FreeTensor(bindLocal);
            }
        }
    private:
        ...
        TQueBind<TPosition::VECIN, TPosition::VECOUT, BUFFER_NUM> queBind; // Use TQueBind to replace QueI and QueO.
        ...
    };

    extern "C" __global__ __aicore__ void example_kernel(...)
    {
        ...
        op.Process(...);
    }
[Performance Comparison]

As shown in the preceding figure, replacing the DataCopy instruction of the negative example with TQueBind yields a clear improvement. Because the data copy from VECIN to VECOUT is omitted, aiv_vec_time drops to almost 0.
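
The reason the vector time disappears can be illustrated with a small host-side toy model. This is only a sketch in standard C++ and does not use or imitate the Ascend C API: MoveStats, MoveWithSeparateBuffers, and MoveWithBoundBuffer are hypothetical names introduced here for illustration, and the nested output loop of the kernel examples is simplified to one output tile per input tile. The model counts how many buffer-to-buffer copies each approach performs for the same data movement.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side toy model only. All names here are hypothetical and are NOT Ascend C APIs.
struct MoveStats {
    std::size_t gmToLocal = 0;     // copy-in steps (global memory -> local buffer)
    std::size_t localToLocal = 0;  // local -> local copies (stands in for vector-unit work)
    std::size_t localToGm = 0;     // copy-out steps (local buffer -> global memory)
};

// Models separate VECIN and VECOUT buffers: every tile needs one extra local copy.
MoveStats MoveWithSeparateBuffers(const std::vector<int>& src, std::vector<int>& dst,
                                  std::size_t tile)
{
    assert(tile > 0 && src.size() % tile == 0);
    MoveStats stats;
    dst.assign(src.size(), 0);
    std::vector<int> inBuf(tile);
    std::vector<int> outBuf(tile);
    for (std::size_t off = 0; off < src.size(); off += tile) {
        for (std::size_t k = 0; k < tile; ++k) { inBuf[k] = src[off + k]; }
        ++stats.gmToLocal;
        for (std::size_t k = 0; k < tile; ++k) { outBuf[k] = inBuf[k]; }  // redundant copy
        ++stats.localToLocal;
        for (std::size_t k = 0; k < tile; ++k) { dst[off + k] = outBuf[k]; }
        ++stats.localToGm;
    }
    return stats;
}

// Models a single buffer that serves as both input and output position:
// the tile is copied in once and copied out from the same storage.
MoveStats MoveWithBoundBuffer(const std::vector<int>& src, std::vector<int>& dst,
                              std::size_t tile)
{
    assert(tile > 0 && src.size() % tile == 0);
    MoveStats stats;
    dst.assign(src.size(), 0);
    std::vector<int> bindBuf(tile);
    for (std::size_t off = 0; off < src.size(); off += tile) {
        for (std::size_t k = 0; k < tile; ++k) { bindBuf[k] = src[off + k]; }
        ++stats.gmToLocal;
        for (std::size_t k = 0; k < tile; ++k) { dst[off + k] = bindBuf[k]; }
        ++stats.localToGm;
    }
    return stats;
}
```

Moving 8 elements with a tile size of 4, both functions produce the same output and the same number of copy-in and copy-out steps, but MoveWithSeparateBuffers reports 2 local-to-local copies while MoveWithBoundBuffer reports 0. In this model, that eliminated step corresponds to the vector-side work the bound queue removes.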