Using TBuf
Most operators need temporary memory to hold the intermediate results of kernel function computation. These intermediate results are represented by temporary variables, and the memory they occupy can be managed with the TBuf data structure. For details, see TBuf. The following uses a single-core Add operator with bfloat16_t input as an example to describe how to use TBuf. For the complete code of this sample, see the Add operator sample with temporary buffer.
On the
Based on the preceding analysis, the design specifications of the Ascend C Add operator are as follows.
| OpType | Add | | | |
|---|---|---|---|---|
| Operator input and output | name | shape | data type | format |
| | x (input) | (1, 2048) | bfloat16_t | ND |
| | y (input) | (1, 2048) | bfloat16_t | ND |
| | z (output) | (1, 2048) | bfloat16_t | ND |
| Kernel function name | add_custom | | | |
| Main APIs | DataCopy: data movement API | | | |
| | Cast: vector precision conversion API | | | |
| | Add: vector basic arithmetic API | | | |
| | EnQue, DeQue, and others: queue management APIs | | | |
| Operator implementation file | add_custom.cpp | | | |
Operator Class Implementation
The CopyIn and CopyOut tasks in this sample are the same as those in Basic Vector Operators. The following figure shows the Compute task process.
```cpp
class KernelAdd {
public:
    __aicore__ inline KernelAdd() {}
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z){}
    __aicore__ inline void Process(){}

private:
    __aicore__ inline void CopyIn(){}
    __aicore__ inline void Compute(){}
    __aicore__ inline void CopyOut(){}

private:
    AscendC::TPipe pipe;
    // Input and output queues
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueX, inQueueY;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueZ;
    // Temporary buffers for the float intermediate results
    AscendC::TBuf<AscendC::TPosition::VECCALC> tmpBuf0, tmpBuf1;
    AscendC::GlobalTensor<bfloat16_t> xGm;
    AscendC::GlobalTensor<bfloat16_t> yGm;
    AscendC::GlobalTensor<bfloat16_t> zGm;
};
```
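In addition to the initialization steps of the basic vector operator, you need to call the InitBuffer API to allocate memory for the TBuf variables. Each temporary buffer is sized to hold TOTAL_LENGTH float elements. The initialization function code is as follows: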
```cpp
__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    xGm.SetGlobalBuffer((__gm__ bfloat16_t *)x, TOTAL_LENGTH);
    yGm.SetGlobalBuffer((__gm__ bfloat16_t *)y, TOTAL_LENGTH);
    zGm.SetGlobalBuffer((__gm__ bfloat16_t *)z, TOTAL_LENGTH);

    // Queue buffers hold the bfloat16_t inputs and output
    pipe.InitBuffer(inQueueX, 1, TOTAL_LENGTH * sizeof(bfloat16_t));
    pipe.InitBuffer(inQueueY, 1, TOTAL_LENGTH * sizeof(bfloat16_t));
    pipe.InitBuffer(outQueueZ, 1, TOTAL_LENGTH * sizeof(bfloat16_t));

    // Temporary buffers hold the float intermediate results
    pipe.InitBuffer(tmpBuf0, TOTAL_LENGTH * sizeof(float));
    pipe.InitBuffer(tmpBuf1, TOTAL_LENGTH * sizeof(float));
}
```
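The sizes requested by the InitBuffer calls above can be checked with a little host-side arithmetic. The following is a minimal sketch in standard C++, not Ascend C code. It assumes that TOTAL_LENGTH equals the 2048 elements of the (1, 2048) tensors in the specification table, uses `std::uint16_t` as a stand-in for the 16-bit bfloat16_t type, and assumes a 4-byte float. The names `QUEUE_BYTES`, `TBUF_BYTES`, and `TotalLocalBytes` are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: TOTAL_LENGTH matches the (1, 2048) shape in the specification table.
constexpr std::size_t TOTAL_LENGTH = 2048;

// Host-side stand-in for bfloat16_t, which is a 16-bit type.
using bf16_storage_t = std::uint16_t;

// Bytes requested for each queue buffer (inQueueX, inQueueY, outQueueZ).
constexpr std::size_t QUEUE_BYTES = TOTAL_LENGTH * sizeof(bf16_storage_t);

// Bytes requested for each temporary buffer (tmpBuf0, tmpBuf1).
constexpr std::size_t TBUF_BYTES = TOTAL_LENGTH * sizeof(float);

// Total local memory requested: three queue buffers plus two TBuf buffers.
constexpr std::size_t TotalLocalBytes()
{
    return 3 * QUEUE_BYTES + 2 * TBUF_BYTES;
}

static_assert(sizeof(float) == 4, "this sketch assumes a 4-byte float");
```

Under these assumptions each queue buffer takes 4096 bytes, each temporary buffer takes 8192 bytes, and the five buffers together request 28672 bytes. Each TBuf is twice the size of a queue buffer because the intermediate results are stored as float.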
Based on the vector programming paradigm, the kernel function needs to implement three basic tasks: CopyIn, Compute, and CopyOut. As in the basic vector operator implementation, the Process function calls the CopyIn, Compute, and CopyOut functions in sequence, and the CopyIn and CopyOut functions are the same as those of the basic vector operator. The Compute function is implemented as follows:
- Call DeQue to obtain the input LocalTensor objects from the VECIN queues.
- Call TBuf.Get to obtain a tensor that spans the full length of each TBuf as the temporary memory.
- Call Cast to convert the input LocalTensor objects to float and store the results in the temporary memory.
- Call Add to perform vector computation and store the computation result in the temporary memory.
- Call Cast to convert the computation result in the temporary memory to bfloat16_t.
- Call EnQue to place the bfloat16_t result LocalTensor in the VECOUT queue.
- Call FreeTensor to release the input LocalTensor objects that are no longer used.
```cpp
__aicore__ inline void Compute()
{
    AscendC::LocalTensor<bfloat16_t> xLocal = inQueueX.DeQue<bfloat16_t>();
    AscendC::LocalTensor<bfloat16_t> yLocal = inQueueY.DeQue<bfloat16_t>();
    AscendC::LocalTensor<bfloat16_t> zLocal = outQueueZ.AllocTensor<bfloat16_t>();

    // Obtain the temporary float tensors from the TBuf objects
    AscendC::LocalTensor<float> tmpTensor0 = tmpBuf0.Get<float>();
    AscendC::LocalTensor<float> tmpTensor1 = tmpBuf1.Get<float>();

    // bfloat16_t -> float, add in float, then float -> bfloat16_t
    AscendC::Cast(tmpTensor0, xLocal, AscendC::RoundMode::CAST_NONE, TOTAL_LENGTH);
    AscendC::Cast(tmpTensor1, yLocal, AscendC::RoundMode::CAST_NONE, TOTAL_LENGTH);
    AscendC::Add(tmpTensor0, tmpTensor0, tmpTensor1, TOTAL_LENGTH);
    AscendC::Cast(zLocal, tmpTensor0, AscendC::RoundMode::CAST_RINT, TOTAL_LENGTH);

    outQueueZ.EnQue<bfloat16_t>(zLocal);
    inQueueX.FreeTensor(xLocal);
    inQueueY.FreeTensor(yLocal);
}
```
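The numerical effect of the Cast, Add, Cast sequence in Compute can be reproduced on the host for reference checking. The following is a minimal sketch in standard C++, not Ascend C code. It relies on bfloat16 being the upper 16 bits of an IEEE 754 binary32 value, so widening to float is exact. It assumes that CAST_RINT denotes round-to-nearest with ties to even. The names `bf16_bits`, `Bf16ToFloat`, `FloatToBf16Rint`, and `AddBf16ViaFloat` are hypothetical helpers introduced for this illustration, and NaN handling is omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Raw 16-bit pattern of a bfloat16 value (host-side stand-in for bfloat16_t).
using bf16_bits = std::uint16_t;

// Widen bfloat16 to float. The conversion is exact: the 16 bits become the
// upper half of the float bit pattern and the lower half is zero.
inline float Bf16ToFloat(bf16_bits v)
{
    std::uint32_t bits = static_cast<std::uint32_t>(v) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Narrow float to bfloat16 with round-to-nearest, ties to even.
// Assumption: this mirrors CAST_RINT. NaN inputs are not handled here.
inline bf16_bits FloatToBf16Rint(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    std::uint32_t lsb = (bits >> 16) & 1u;  // lowest bit that survives
    bits += 0x7FFFu + lsb;                  // ties round toward the even pattern
    return static_cast<bf16_bits>(bits >> 16);
}

// Element-wise add that follows the same three steps as Compute:
// widen both inputs to float, add in float, narrow the sum to bfloat16.
inline std::vector<bf16_bits> AddBf16ViaFloat(const std::vector<bf16_bits>& x,
                                              const std::vector<bf16_bits>& y)
{
    assert(x.size() == y.size());
    std::vector<bf16_bits> z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        float sum = Bf16ToFloat(x[i]) + Bf16ToFloat(y[i]);
        z[i] = FloatToBf16Rint(sum);
    }
    return z;
}
```

For example, adding the bit patterns `0x3F80` (1.0) and `0x4000` (2.0) yields `0x4040` (3.0). Adding `0x3F80` (1.0) and `0x3B80` (2^-8) produces a sum that lies exactly halfway between two bfloat16 values, and the tie resolves to the even pattern `0x3F80`.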