EXTERN_IMPL_BUFPOOL Macro
Function Usage
Developers can manually manage the physical memory of the Unified Buffer and L1 Buffer by using the TBufPool class.
The memory blocks split by the TBufPool class are continuous. Developers may have specific customized memory block allocation requirements, for example, discontinuous memory blocks and memory block sharing between different TQues. In this case, developers need to customize the implementation of a TBufPool.
To simplify customization, the EXTERN_IMPL_BUFPOOL macro is provided to help you customize TBufPool. When using the customized TBufPool function, pay attention to the following:
- Before customizing the TBufPool, initialize the TBufPool memory resource pool by calling the TPipe::InitBufPool API.
- To customize a TBufPool, developers need to allocate, initialize, and release TQue/TBuf memory blocks.
For details about the Reset, Init, GetBufHandle, SetCurAddr, GetCurAddr, SetCurBufSize, and GetCurBufSize APIs defined in the EXTERN_IMPL_BUFPOOL macro, see the subsequent sections. After this macro is used, the preceding APIs can be used to customize the TBufPool function.
The customized TBufPool APIs are for trial use and may be adjusted or improved in later versions. Compatibility is not guaranteed. Pay attention to later versions.
Prototype
1 2 | // The macro definition is omitted. #define EXTERN_IMPL_BUFPOOL(EXT_BUFPOOL, POSITION, BUFID_SIZE) ... |
Parameters
Parameter |
Input/Output |
Meaning |
|---|---|---|
EXT_BUFPOOL |
Input |
Customized TBufPool class name. |
POSITION |
Input |
Logical location of customized TBufPool. The value can be VECIN, VECOUT, VECCALC, A1, B1, or C1. For details about TPosition, see TPosition. |
BUFID_SIZE |
Input |
Number of buffers allocated by the customized TBufPool. A value less than or equal to 16 is recommended. |
Availability
Precautions
None
Returns
None
Example
In the following example, the buffer of 65536 x 3 is allocated to tbufPool0, and the InitBuffer function of MyBufPool is customized to allocate the TQue and Tbuf buffers.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 | #include "kernel_operator.h" class MyBufPool { public: __aicore__ inline MyBufPool() { Init(); } template<class T> __aicore__ inline bool InitBuffer(T& que, uint8_t num, uint32_t len) { len = (len + 32 - 1) / 32 * 32; // Ensure that the memory block length is 32-byte aligned. auto ptr = this->GetBufHandle(this->GetCurBufSize()); auto curPoolAddr = this->GetCurAddr(); // call internal func to initnitial bufhandle que.InitStartBufHandle(ptr, num, len); for (int32_t i = 0; i < num; i++) { que.InitBufHandle(this, i, ptr, curPoolAddr + i * len, len); } this->SetCurAddr(curPoolAddr + num * len); this->SetCurBufSize(this->GetCurBufSize() + num); return true; } template<AscendC::TPosition bufPos> __aicore__ inline bool InitBuffer(AscendC::TBuf<bufPos>& buf, uint32_t len) { len = (len + 32 - 1) / 32 * 32; // Ensure that the memory block length is 32-byte aligned. auto ptr = this->GetBufHandle(this->GetCurBufSize()); auto curPoolAddr = this->GetCurAddr(); // call internal func to initnitial bufhandle buf.InitStartBufHandle(ptr, 1, len); buf.InitBufHandle(this, 0, ptr, curPoolAddr, len); this->SetCurAddr(curPoolAddr + len); this->SetCurBufSize(this->GetCurBufSize() + 1); return true; } EXTERN_IMPL_BUFPOOL(MyBufPool, AscendC::TPosition::VECCALC, 16); }; class MyTBufPoolKernel { public: __aicore__ inline MyTBufPoolKernel() {} __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm) { src0Global.SetGlobalBuffer((__gm__ half*)src0Gm); src1Global.SetGlobalBuffer((__gm__ half*)src1Gm); dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm); pipe.InitBufPool(tbufPool0, 65536 * 3); tbufPool0.InitBuffer(srcQue0, 1, 65536); tbufPool0.InitBuffer(srcBuf1, 65536); tbufPool0.InitBuffer(dstQue0, 1, 65536); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); tbufPool0.Reset(); pipe.Reset(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<half> src0Local = srcQue0.AllocTensor<half>(); AscendC::LocalTensor<half> src1Local = srcBuf1.Get<half>(); AscendC::DataCopy(src0Local, src0Global, 32768); AscendC::DataCopy(src1Local, src1Global, 32768); srcQue0.EnQue(src0Local); } __aicore__ inline void Compute() { AscendC::LocalTensor<half> src0Local = srcQue0.DeQue<half>(); AscendC::LocalTensor<half> src1Local = srcBuf1.Get<half>(); AscendC::LocalTensor<half> dstLocal = dstQue0.AllocTensor<half>(); AscendC::Add(dstLocal, src0Local, src1Local, 32768); dstQue0.EnQue<half>(dstLocal); srcQue0.FreeTensor(src0Local); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<half> dstLocal = dstQue0.DeQue<half>(); AscendC::DataCopy(dstGlobal, dstLocal, 32768); dstQue0.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; MyBufPool tbufPool0; AscendC::TBuf<AscendC::QuePosition::VECIN> srcBuf1; AscendC::TQue<AscendC::QuePosition::VECIN, 1> srcQue0; AscendC::TQue<AscendC::QuePosition::VECOUT, 1> dstQue0; AscendC::GlobalTensor<half> src0Global, src1Global, dstGlobal; }; extern "C" __global__ __aicore__ void mytbufpool_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm) { MyTBufPoolKernel op; op.Init(src0Gm, src1Gm, dstGm); op.Process(); } |