Reusing VECIN and VECOUT for Transfer Operators
[Priority] High
[Description] Transfer operators do not involve actual vector computation during execution. Redundant vector computations would prolong the overall execution time of operators. In this scenario, TQueBind provided by Ascend C for transfer operators can be used. This API can bind VECIN to VECOUT and skip copying data between them, avoiding unnecessary vector consumption.
[Negative Example]
template <typename ComputeT> class KernelExample {
public:
...
__aicore__ inline void Process(...)
{
for (int i = 0; i < iLen; ++i) {
...
auto iLocal = QueI.AllocTensor<ComputeT>();
DataCopy(iLocal, inGm[i * 32], size);
QueI.EnQue(iLocal);
auto iLocal = QueI.DeQue<ComputeT>();
for (int j = 0; j < jLen; ++j) {
...
auto oLocal = QueO.AllocTensor<ComputeT>();
DataCopy(oLocal, iLocal, size); // Data copy instruction from a LocalTensor to another LocalTensor, used to transfer data from VECIN to VECOUT
QueO.EnQue(oLocal);
auto oLocal = QueO.DeQue<ComputeT>();
DataCopyPad(outGm[j], oLocal, ...);
QueO.FreeTensor(oLocal);
}
QueI.FreeTensor(iLocal);
}
}
private:
...
TQue<QuePosition::VECIN, BUFFER_NUM> QueI;
TQue<QuePosition::VECOUT, BUFFER_NUM> QueO;
...
};
extern "C" __global__ __aicore__ void example_kernel(...)
{
...
op.Process(...);
}
[Positive Example]
The DataCopy instruction from a LocalTensor to another LocalTensor is replaced with TQueBind to skip copying VECIN to VECOUT, thereby avoiding redundant vector computation.
template <typename ComputeT> class KernelExample {
public:
...
__aicore__ inline void Process(...)
{
for (int i = 0; i < iLen; ++i) {
...
auto bindLocal = queBind.AllocTensor<ComputeT>();
DataCopy(bindLocal, inGm[i * 32], size);
queBind.EnQue(bindLocal);
auto bindLocal = queBind.DeQue<ComputeT>();
for (int j = 0; j < len; ++j) {
...
DataCopyPad(outGm[j], bindLocal, ...);
}
queBind.FreeTensor(bindLocal);
}
}
private:
...
TQueBind<QuePosition::VECIN, QuePosition::VECOUT, BUFFER_NUM> queBind; // Use TQueBind to replace QueI and QueO.
...
};
extern "C" __global__ __aicore__ void example_kernel(...)
{
...
op.Process(...);
}
[Performance Comparison]

As shown in the preceding figure, the DataCopy instruction in the incorrect example is replaced with TQueBind, achieving obvious optimization. Because data copy from VECIN to VECOUT is omitted, aiv_vec_time is almost reduced to 0.