Using TBuf

Most operators need temporary memory to hold the intermediate results of kernel function computation. These intermediate results are held in temporary variables, and the memory they occupy can be managed with the TBuf data structure. For details, see TBuf. This section uses a single-core Add operator with bfloat16_t input as an example to describe how to use TBuf. For the complete code of the operator described here, see the Add operator sample with temporary buffer.

On the Atlas A2 Training Series Product/Atlas 800I A2 Inference Product, the Add API does not support source operands of the bfloat16_t type. Therefore, the input data must be converted to a data type that the Add API supports before computation. To preserve computing precision, call the Cast API to convert the input from bfloat16_t to float, perform the Add computation in float, and then convert the result back from float to bfloat16_t.
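The conversion chain described above can be sketched on the host in plain C++. This is a minimal illustration, not Ascend C code: the names bf16_bits, Bf16ToFloat, FloatToBf16, and AddBf16 are hypothetical, bfloat16 is modeled as the upper 16 bits of an IEEE-754 float, and the narrowing step assumes round-to-nearest with ties to even.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical host-side stand-in for bfloat16_t:
// the upper 16 bits of an IEEE-754 single-precision float.
using bf16_bits = std::uint16_t;

// Widening is exact: the 16 bits become the top half of a 32-bit float.
float Bf16ToFloat(bf16_bits h) {
    std::uint32_t u = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

// Narrowing with round-to-nearest, ties to even (an assumption of this sketch).
bf16_bits FloatToBf16(float f) {
    assert(f == f && "NaN inputs are not handled in this sketch");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    // Add 0x7FFF plus the lowest kept bit, so exact ties round to an even result.
    std::uint32_t bias = 0x7FFFu + ((u >> 16) & 1u);
    return static_cast<bf16_bits>((u + bias) >> 16);
}

// bfloat16 -> float, add in float, float -> bfloat16.
bf16_bits AddBf16(bf16_bits x, bf16_bits y) {
    return FloatToBf16(Bf16ToFloat(x) + Bf16ToFloat(y));
}
```

For example, AddBf16(FloatToBf16(1.5f), FloatToBf16(2.25f)) converts back to 3.75f, because all three values are exactly representable in bfloat16. Only the final narrowing step can round; the widening step never loses information.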

Based on the preceding analysis, the design specifications of the Ascend C Add operator are as follows.

Table 1 Design specifications of the Ascend C Add operator

OpType: Add

Operator input and output:

| name       | shape     | data type  | format |
|------------|-----------|------------|--------|
| x (input)  | (1, 2048) | bfloat16_t | ND     |
| y (input)  | (1, 2048) | bfloat16_t | ND     |
| z (output) | (1, 2048) | bfloat16_t | ND     |

Kernel function name: add_custom

Main APIs:

  - DataCopy: data movement API
  - Cast: vector precision conversion API
  - Add: vector basic arithmetic API
  - EnQue, DeQue, and others: queue management APIs

Operator implementation file: add_custom.cpp

Operator Class Implementation

The CopyIn and CopyOut tasks in this sample are the same as those in Basic Vector Operators. The following figure shows the Compute task process.

Figure 1 Add computation process with bfloat16_t input

In the Compute task, the Cast conversion results and the Add computation result are temporary variables that must be stored in temporary memory. Compared with the KernelAdd operator class in the basic vector operator implementation, this sample adds two member variables of the TBuf type, tmpBuf0 and tmpBuf1, to manage the temporary memory used during computation. The code is as follows:
class KernelAdd {
public:
    __aicore__ inline KernelAdd() {}
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z){}
    __aicore__ inline void Process(){}
private:
    __aicore__ inline void CopyIn(){}
    __aicore__ inline void Compute(){}
    __aicore__ inline void CopyOut(){}
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueX, inQueueY;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueZ;
    AscendC::TBuf<AscendC::TPosition::VECCALC> tmpBuf0, tmpBuf1;  // temporary memory for the float intermediates
    AscendC::GlobalTensor<bfloat16_t> xGm;
    AscendC::GlobalTensor<bfloat16_t> yGm;
    AscendC::GlobalTensor<bfloat16_t> zGm;
};

In addition to the original steps, you need to call the InitBuffer API to allocate memory for the TBuf variables. The initialization function code is as follows:

__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    // Bind the global memory addresses of the inputs and the output.
    xGm.SetGlobalBuffer((__gm__ bfloat16_t *)x, TOTAL_LENGTH);
    yGm.SetGlobalBuffer((__gm__ bfloat16_t *)y, TOTAL_LENGTH);
    zGm.SetGlobalBuffer((__gm__ bfloat16_t *)z, TOTAL_LENGTH);

    // The queue buffers hold bfloat16_t data, one buffer per queue.
    pipe.InitBuffer(inQueueX, 1, TOTAL_LENGTH * sizeof(bfloat16_t));
    pipe.InitBuffer(inQueueY, 1, TOTAL_LENGTH * sizeof(bfloat16_t));
    pipe.InitBuffer(outQueueZ, 1, TOTAL_LENGTH * sizeof(bfloat16_t));

    // The TBuf buffers hold the float intermediates, so they are sized with sizeof(float).
    pipe.InitBuffer(tmpBuf0, TOTAL_LENGTH * sizeof(float));
    pipe.InitBuffer(tmpBuf1, TOTAL_LENGTH * sizeof(float));
}
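As a quick check of the sizes requested by the InitBuffer calls above, the byte counts can be worked out at compile time in plain host-side C++. This is a sketch under stated assumptions: the constant names and helper functions are hypothetical, TOTAL_LENGTH is taken as 2048 from the (1, 2048) shape in Table 1, and bfloat16_t is assumed to occupy 2 bytes.

```cpp
#include <cstddef>

// Assumptions of this sketch (not taken from the operator source):
constexpr std::size_t kTotalLength = 2048;  // elements, from the (1, 2048) shape
constexpr std::size_t kBf16Bytes = 2;       // assumed size of bfloat16_t
constexpr std::size_t kFloatBytes = 4;      // size of float on the targets considered
static_assert(sizeof(float) == kFloatBytes, "this sketch assumes a 4-byte float");

// Bytes requested for the queues: numQueues queues, each with `depth` buffers.
constexpr std::size_t QueueBytes(std::size_t numQueues, std::size_t depth) {
    return numQueues * depth * kTotalLength * kBf16Bytes;
}

// Bytes requested for the TBuf temporaries, which store float data.
constexpr std::size_t TmpBytes(std::size_t numBufs) {
    return numBufs * kTotalLength * kFloatBytes;
}

constexpr std::size_t kQueueBytes = QueueBytes(3, 1);           // inQueueX, inQueueY, outQueueZ
constexpr std::size_t kTmpBytes = TmpBytes(2);                  // tmpBuf0, tmpBuf1
constexpr std::size_t kTotalBytes = kQueueBytes + kTmpBytes;

static_assert(kQueueBytes == 12288, "3 queues x 2048 elements x 2 bytes");
static_assert(kTmpBytes == 16384, "2 buffers x 2048 elements x 4 bytes");
static_assert(kTotalBytes == 28672, "total requested through InitBuffer");
```

The two float temporaries account for more than half of the total, which is why reusing a temporary buffer for the Add result, instead of allocating a third one, is worthwhile.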

Based on the vector programming paradigm, the kernel function needs to implement three basic tasks: CopyIn, Compute, and CopyOut. As in the basic vector operator implementation, the Process function calls the CopyIn, Compute, and CopyOut functions in sequence. The CopyIn and CopyOut functions are implemented in the same way as the corresponding functions of the basic vector operator. The Compute function is implemented as follows:

  1. Call DeQue to obtain the input LocalTensor objects from the VECIN queues.
  2. Call TBuf.Get to obtain a tensor that spans the full length of each TBuf as the temporary memory.
  3. Call Cast to convert the input LocalTensor objects to float and store the results in the temporary memory.
  4. Call Add to perform the vector computation and store the result in the temporary memory.
  5. Call Cast to convert the computation result in the temporary memory to bfloat16_t.
  6. Call EnQue to place the bfloat16_t result LocalTensor in the VECOUT queue.
  7. Call FreeTensor to release the input LocalTensor objects that are no longer used.
__aicore__ inline void Compute()
{
    // Obtain the inputs from the VECIN queues and allocate the output tensor.
    AscendC::LocalTensor<bfloat16_t> xLocal = inQueueX.DeQue<bfloat16_t>();
    AscendC::LocalTensor<bfloat16_t> yLocal = inQueueY.DeQue<bfloat16_t>();
    AscendC::LocalTensor<bfloat16_t> zLocal = outQueueZ.AllocTensor<bfloat16_t>();

    // Obtain float tensors that cover the whole of each temporary buffer.
    AscendC::LocalTensor<float> tmpTensor0 = tmpBuf0.Get<float>();
    AscendC::LocalTensor<float> tmpTensor1 = tmpBuf1.Get<float>();

    // Convert both inputs from bfloat16_t to float.
    AscendC::Cast(tmpTensor0, xLocal, AscendC::RoundMode::CAST_NONE, TOTAL_LENGTH);
    AscendC::Cast(tmpTensor1, yLocal, AscendC::RoundMode::CAST_NONE, TOTAL_LENGTH);

    // Add in float, writing the result back to tmpTensor0, then convert it to bfloat16_t.
    AscendC::Add(tmpTensor0, tmpTensor0, tmpTensor1, TOTAL_LENGTH);
    AscendC::Cast(zLocal, tmpTensor0, AscendC::RoundMode::CAST_RINT, TOTAL_LENGTH);

    // Pass the result to the VECOUT queue and release the inputs.
    outQueueZ.EnQue<bfloat16_t>(zLocal);
    inQueueX.FreeTensor(xLocal);
    inQueueY.FreeTensor(yLocal);
}
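The data flow of the Compute function above can be mirrored on the host to see how the two temporary buffers are reused. The following is a minimal sketch in plain C++, not Ascend C code: HostAddSketch, Widen, and Narrow are hypothetical names, std::vector<float> stands in for the memory managed by tmpBuf0 and tmpBuf1, and the narrowing step assumes round-to-nearest with ties to even.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using bf16_bits = std::uint16_t;  // hypothetical stand-in for bfloat16_t

// bfloat16 -> float (exact).
static float Widen(bf16_bits h) {
    std::uint32_t u = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

// float -> bfloat16, round-to-nearest with ties to even (NaN not handled).
static bf16_bits Narrow(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return static_cast<bf16_bits>((u + 0x7FFFu + ((u >> 16) & 1u)) >> 16);
}

// The scratch buffers are allocated once, like the InitBuffer calls in Init,
// and reused by every Compute call, like the tensors returned by TBuf.Get.
struct HostAddSketch {
    std::vector<float> tmp0, tmp1;

    explicit HostAddSketch(std::size_t n) : tmp0(n), tmp1(n) {}

    std::vector<bf16_bits> Compute(const std::vector<bf16_bits>& x,
                                   const std::vector<bf16_bits>& y) {
        assert(x.size() == tmp0.size() && y.size() == tmp0.size());
        std::vector<bf16_bits> z(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) tmp0[i] = Widen(x[i]);   // Cast x to float
        for (std::size_t i = 0; i < y.size(); ++i) tmp1[i] = Widen(y[i]);   // Cast y to float
        for (std::size_t i = 0; i < x.size(); ++i) tmp0[i] += tmp1[i];      // Add, result in tmp0
        for (std::size_t i = 0; i < x.size(); ++i) z[i] = Narrow(tmp0[i]);  // Cast back to bfloat16
        return z;
    }
};
```

Because the sum overwrites tmp0, two scratch buffers are enough for the whole computation; a third buffer for the Add result would only increase the memory footprint.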