Introduction to TQue
Tasks communicate and synchronize with each other through queues. TQue is a data structure used to perform queue operations and manage the associated buffer resources. TQue inherits from the parent class TQueBind.

Template Parameters
```cpp
template <TPosition pos, int32_t depth, auto mask = 0> class TQue {...};
```

| Parameter | Meaning |
|---|---|
| pos | Logical position of the queue. It can be VECIN, VECOUT, A1, A2, B1, B2, CO1, or CO2. For details, see TPosition. |
| depth | The depth of a queue is the number of consecutive enqueue operations that can be performed without an intervening dequeue. If the code performs n consecutive EnQues on a queue with no DeQue in between, the depth must be set to n. Note that queue depth is independent of double buffering: the queue mechanism implements pipeline parallelism, and double buffering further improves pipeline utilization on top of it, so double buffering can be enabled even when the depth is 1. When the depth is 1, the compiler can apply additional optimizations for better performance, so a depth of 1 is recommended. |
| mask | Optional; defaults to 0. In the example in this topic, a pointer to a constexpr TQueConfig (&conf) is passed here to enable compile-time calculation of bufferNumber. |
TQue Buffer Limit
Each buffer allocated by a TQue holds a synchronization event ID. Therefore, the number of QUE buffers at the same TPosition is limited by the number of hardware synchronization event IDs.
Depending on the hardware platform, the maximum number of QUE buffers is 8 or 4; that is, at most 8 or 4 synchronization events can be inserted. When TPipe::InitBuffer is used to allocate buffers for TQue objects, this limit caps the number of TQues that can be allocated at 8 or 4.
If the number of QUE buffers in use at the same time exceeds this limit, no further TQue can be allocated. To continue allocating, call the FreeAllEvent API on a TQue that is temporarily no longer needed: after that TQue has finished its work, FreeAllEvent releases all synchronization events in the queue, and new TQues can then be allocated. An example is as follows.
```cpp
// The VECIN position provides at most eight buffers. If this limit is exceeded,
// resource allocation may fail in AllocTensor or FreeTensor. Therefore, with
// double buffering disabled, a maximum of eight TQues can be allocated.
AscendC::TPipe pipe;
int len = 1024;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que0;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que1;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que2;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que3;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que4;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que5;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que6;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que7;
pipe.InitBuffer(que0, 1, len);
pipe.InitBuffer(que1, 1, len);
pipe.InitBuffer(que2, 1, len);
pipe.InitBuffer(que3, 1, len);
pipe.InitBuffer(que4, 1, len);
pipe.InitBuffer(que5, 1, len);
pipe.InitBuffer(que6, 1, len);
pipe.InitBuffer(que7, 1, len);
```

```cpp
// If double buffering is enabled, two memory blocks are allocated to each TQue.
// Therefore, a maximum of four TQues can be allocated.
AscendC::TPipe pipe;
int len = 1024;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que0;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que1;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que2;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que3;
pipe.InitBuffer(que0, 2, len);
pipe.InitBuffer(que1, 2, len);
pipe.InitBuffer(que2, 2, len);
pipe.InitBuffer(que3, 2, len);
```

```cpp
// If the number of TQues has reached the maximum, call the FreeAllEvent API
// to make room for more.
AscendC::TPipe pipe;
int len = 1024;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que0;
pipe.InitBuffer(que0, 1, len);
AscendC::LocalTensor<half> tensor1 = que0.AllocTensor<half>();
que0.EnQue(tensor1);
tensor1 = que0.DeQue<half>(); // Move the tensor out of the VECIN queue.
que0.FreeTensor<half>(tensor1);
que0.FreeAllEvent(); // Release all synchronization events of que0. After that, TQues can be allocated again.
AscendC::TQue<AscendC::TPosition::VECIN, 1> que1;
pipe.InitBuffer(que1, 1, len);
```
Example
In the following example, a TQueConfig is passed so that bufferNumber can be computed at compile time. The vector operator does not involve data format conversion, so nd2nz and nz2nd are both false.
```cpp
// User-defined meta function for constructing TQueConfig
__aicore__ constexpr AscendC::TQueConfig GetMyTQueConfig(bool nd2nzIn, bool nz2ndIn,
    bool scmBlockGroupIn, uint32_t bufferLenIn, uint32_t bufferNumberIn,
    uint32_t consumerSizeIn, const AscendC::TPosition consumerIn[])
{
    return {
        .nd2nz = nd2nzIn,
        .nz2nd = nz2ndIn,
        .scmBlockGroup = scmBlockGroupIn,
        .bufferLen = bufferLenIn,
        .bufferNumber = bufferNumberIn,
        .consumerSize = consumerSizeIn,
        .consumer = {consumerIn[0], consumerIn[1], consumerIn[2], consumerIn[3],
                     consumerIn[4], consumerIn[5], consumerIn[6], consumerIn[7]}
    };
}

static constexpr AscendC::TPosition tp[8] = {
    AscendC::TPosition::MAX, AscendC::TPosition::MAX, AscendC::TPosition::MAX,
    AscendC::TPosition::MAX, AscendC::TPosition::MAX, AscendC::TPosition::MAX,
    AscendC::TPosition::MAX, AscendC::TPosition::MAX};

static constexpr AscendC::TQueConfig conf = GetMyTQueConfig(false, false, false, 0, 1, 0, tp);

template <typename srcType>
class KernelAscendQuant {
public:
    __aicore__ inline KernelAscendQuant() {}
    __aicore__ inline void Init(GM_ADDR src_gm, GM_ADDR dst_gm, uint32_t inputSize)
    {
        dataSize = inputSize;
        src_global.SetGlobalBuffer(reinterpret_cast<__gm__ srcType*>(src_gm), dataSize);
        dst_global.SetGlobalBuffer(reinterpret_cast<__gm__ int8_t*>(dst_gm), dataSize);
        pipe.InitBuffer(inQueueX, 1, dataSize * sizeof(srcType));
        pipe.InitBuffer(outQueue, 1, dataSize * sizeof(int8_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        ...
    }
    __aicore__ inline void Compute()
    {
        ...
    }
    __aicore__ inline void CopyOut()
    {
        ...
    }

private:
    AscendC::GlobalTensor<srcType> src_global;
    AscendC::GlobalTensor<int8_t> dst_global;
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1, &conf> inQueueX;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1, &conf> outQueue;
    uint32_t dataSize = 0;
};

template <typename dataType>
__aicore__ void kernel_ascend_quant_operator(GM_ADDR src_gm, GM_ADDR dst_gm, uint32_t dataSize)
{
    KernelAscendQuant<dataType> op;
    op.Init(src_gm, dst_gm, dataSize);
    op.Process();
}
```