How Do I Use Tensor In-place Operations to Improve Operator Performance?

Tensor in-place operation (the inplace API) is an optimization technique. The LocalTensor memory is allocated and reserved globally, which avoids frequent creation and destruction of LocalTensor objects. The AllocTensor, FreeTensor, EnQue, and DeQue APIs do not generate a new LocalTensor. Instead, they repeatedly allocate, free, enqueue, and dequeue the same global LocalTensor. Figure 1 shows the implementation principle.

Figure 1 Implementation principle of in-place tensor operations

Advantages of in-place tensor operations

  • Reduced stack changes: Compared with constructing a new tensor on every call, the inplace API reduces the stack changes caused by local tensors, because the same tensor object is used repeatedly.
  • Reduced enqueue and dequeue operations: When EnQue and DeQue are called, the TQue object does not store the buffer address corresponding to the tensor. No actual enqueue or dequeue takes place, which removes the scalar instructions that repeated enqueue and dequeue operations would otherwise execute.
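
The two advantages can be illustrated with a toy model in plain C++. This is only a sketch under stated assumptions: ToyTensor, ToyQueue, and ToyInplaceQueue are hypothetical stand-ins written for this example, not the Ascend C implementation, and the counter only tracks how often a buffer handle is stored in or removed from a queue.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stand-in for a LocalTensor: a view onto a buffer.
struct ToyTensor {
    float* data;
    std::size_t size;
};

// Conventional model: every EnQue/DeQue moves a buffer handle through a queue.
class ToyQueue {
public:
    explicit ToyQueue(std::size_t len) : storage_(len, 0.0f) {}

    // Returns a freshly constructed handle on every call.
    ToyTensor AllocTensor() { return ToyTensor{storage_.data(), storage_.size()}; }

    void EnQue(const ToyTensor& t) { pending_.push_back(t); ++queueOps; }

    ToyTensor DeQue() {
        ToyTensor t = pending_.front();
        pending_.pop_front();
        ++queueOps;
        return t;
    }

    int queueOps = 0;  // number of handle store/remove operations

private:
    std::vector<float> storage_;
    std::deque<ToyTensor> pending_;
};

// In-place model: the caller's handle is bound to the reserved buffer,
// and EnQue/DeQue store nothing (synchronization is left out of this sketch).
class ToyInplaceQueue {
public:
    explicit ToyInplaceQueue(std::size_t len) : storage_(len, 0.0f) {}

    void AllocTensor(ToyTensor& t) { t.data = storage_.data(); t.size = storage_.size(); }
    void EnQue(const ToyTensor&) {}  // no handle is stored
    void DeQue(ToyTensor&) {}        // the caller's handle is already valid

    int queueOps = 0;  // never incremented: no handle moves through a queue

private:
    std::vector<float> storage_;
};

inline int CountConventionalQueueOps(int iterations) {
    ToyQueue que(8);
    for (int i = 0; i < iterations; ++i) {
        ToyTensor t = que.AllocTensor();   // new handle each iteration
        t.data[0] = static_cast<float>(i);
        que.EnQue(t);
        ToyTensor out = que.DeQue();
        (void)out;
    }
    return que.queueOps;
}

inline int CountInplaceQueueOps(int iterations) {
    ToyInplaceQueue que(8);
    ToyTensor t{nullptr, 0};               // created once, reused every iteration
    for (int i = 0; i < iterations; ++i) {
        que.AllocTensor(t);
        t.data[0] = static_cast<float>(i);
        que.EnQue(t);
        que.DeQue(t);
    }
    return que.queueOps;
}
```

In this model the conventional queue performs two handle operations per iteration, while the in-place queue performs none. The real saving on hardware depends on the scalar instructions involved and is not quantified here.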

Reasons for retaining EnQue and DeQue

Since the tensor in-place operation does not perform the actual enqueue and dequeue operations, why do the EnQue and DeQue APIs need to be retained?

  • Programming compatibility: To keep the programming interface consistent, code that uses the inplace API still calls EnQue and DeQue. This keeps the code structure consistent and maintainable.
  • Memory synchronization: EnQue and DeQue implement memory read/write synchronization, which ensures data consistency and correctness. Even though no actual queue operation takes place, this synchronization mechanism must be retained.
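
The synchronization role can be sketched with a single-threaded toy model. ToySyncQueue and ProduceThenConsume are hypothetical names introduced for this illustration. The boolean flag stands in for whatever event mechanism the hardware uses, which this sketch does not attempt to reproduce.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-place queue: one reserved buffer plus a "data ready" flag.
// The flag stands in for the read/write synchronization that EnQue/DeQue retain.
class ToySyncQueue {
public:
    explicit ToySyncQueue(std::size_t len) : buffer_(len, 0.0f) {}

    float* Buffer() { return buffer_.data(); }

    // Producer side: nothing is pushed anywhere; only the flag is raised.
    void EnQue() { ready_ = true; }

    // Consumer side: nothing is popped; the flag is checked and cleared.
    // Returns false if the consumer ran before the producer signalled completion.
    bool DeQue() {
        if (!ready_) {
            return false;
        }
        ready_ = false;
        return true;
    }

private:
    std::vector<float> buffer_;
    bool ready_ = false;
};

// Returns true only if the consumer was allowed to read the producer's write.
inline bool ProduceThenConsume(bool callEnQue) {
    ToySyncQueue que(4);
    que.Buffer()[0] = 42.0f;   // producer writes into the reserved buffer
    if (callEnQue) {
        que.EnQue();           // signal "write complete"
    }
    return que.DeQue() && que.Buffer()[0] == 42.0f;
}
```

Skipping EnQue in this model leaves the consumer without a completion signal, which is the situation the retained calls are meant to prevent.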

Application Scenario

The inplace API suits scenarios with a large number of computation cycles. As shown in Figure 1, it increases the initialization overhead of InitBuffer for the TQue object, but significantly reduces the operations on LocalTensors and events performed by AllocTensor, EnQue, DeQue, and FreeTensor in each cycle. It is therefore especially suitable when a calculation takes many cycles to complete.
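
The trade-off between a one-time initialization cost and a per-cycle saving can be expressed as a simple break-even estimate. The helper below is a hypothetical sketch. Its inputs are abstract cost units that you would have to obtain by profiling, and no measured values are implied.

```cpp
#include <cassert>

// Smallest number of cycles n for which n * savedPerIteration >= extraInitCost,
// that is, the point from which the in-place mode is no slower overall.
// Returns -1 if nothing is saved per cycle, so the extra cost is never recovered.
inline long long BreakEvenIterations(long long extraInitCost, long long savedPerIteration) {
    if (savedPerIteration <= 0) {
        return -1;
    }
    if (extraInitCost <= 0) {
        return 0;
    }
    // Ceiling division on positive integers.
    return (extraInitCost + savedPerIteration - 1) / savedPerIteration;
}
```

For example, with an assumed extra initialization cost of 100 units and an assumed saving of 8 units per cycle, BreakEvenIterations(100, 8) yields 13, so a loop with fewer cycles than that would not recover the extra initialization cost.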

Usage

  • Configure the TQue object: When creating a TQue object, set the depth to 0 to enable the in-place operation mode.
  • Call the inplace API: Use the inplace API to operate on the LocalTensor directly.
    • The AllocTensor and DeQue APIs are classified into non-inplace and inplace APIs. For details, see AllocTensor and DeQue.
    • FreeTensor and EnQue do not distinguish between non-inplace and inplace APIs.

Sample Code

// ...
namespace AscendC {
class MyKernel {
public:
    __aicore__ inline MyKernel() {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ half*)src0Gm);
        src1Global.SetGlobalBuffer((__gm__ half*)src1Gm);
        dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
        pipe.InitBuffer(srcQue0, 1, BLOCK_SIZE * sizeof(half));
        pipe.InitBuffer(srcQue1, 1, BLOCK_SIZE * sizeof(half));
        pipe.InitBuffer(dstQue0, 1, BLOCK_SIZE * sizeof(half));
    }

    __aicore__ inline void Process()
    {
        for (int i = 0; i < REPTIMES; i++) {
            CopyIn(i);
            Compute(i);
            CopyOut(i);
        }
    }

private:
    __aicore__ inline void CopyIn(int32_t i)
    {
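        // In-place AllocTensor: binds the member LocalTensor passed in
        // instead of returning a new LocalTensor object.
        srcQue0.AllocTensor<half>(src0Local);
        srcQue1.AllocTensor<half>(src1Local);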
        DataCopy(src0Local, src0Global[i*BLOCK_SIZE], BLOCK_SIZE);
        DataCopy(src1Local, src1Global[i*BLOCK_SIZE], BLOCK_SIZE);
        srcQue0.EnQue(src0Local);
        srcQue1.EnQue(src1Local);
    }
    __aicore__ inline void Compute(int32_t i)
    {
        srcQue0.DeQue<half>(src0Local);
        srcQue1.DeQue<half>(src1Local);
        dstQue0.AllocTensor<half>(dstLocal);
        Add(dstLocal, src0Local, src1Local, BLOCK_SIZE);
        dstQue0.EnQue<half>(dstLocal);
        srcQue0.FreeTensor(src0Local);
        srcQue1.FreeTensor(src1Local);
    }
    __aicore__ inline void CopyOut(int32_t i)
    {
        dstQue0.DeQue<half>(dstLocal);
        DataCopy(dstGlobal[i*BLOCK_SIZE], dstLocal, BLOCK_SIZE);
        dstQue0.FreeTensor(dstLocal);
    }

private:
    TPipe pipe;
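    // A queue depth of 0 enables the in-place operation mode.
    TQue<QuePosition::VECIN, 0> srcQue0, srcQue1;
    TQue<QuePosition::VECOUT, 0> dstQue0;
    GlobalTensor<half> src0Global, src1Global, dstGlobal;
    // The LocalTensor objects are class members so that they are reused in every loop iteration.
    LocalTensor<half> src0Local;
    LocalTensor<half> src1Local;
    LocalTensor<half> dstLocal;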
};
} // namespace AscendC

// ...