Introduction to TQue

Tasks communicate and synchronize with each other through queues. TQue is a data structure used to perform queue operations and manage the associated resources. TQue inherits from the TQueBind parent class.

Template Parameters

template <TPosition pos, int32_t depth, auto mask = 0> class TQue{...};
Table 1 Parameters in the TQue template

pos: Logical location of a queue. It can be VECIN, VECOUT, A1, A2, B1, B2, CO1, or CO2. For details, see TPosition.

depth: The depth of a queue is the number of consecutive enqueue operations that it can hold. If the code performs n consecutive EnQue calls on a queue (with no DeQue in between), the depth must be set to n.

Note that the queue depth is unrelated to double buffering. The queue mechanism implements pipeline parallelism, and double buffering further improves pipeline utilization on top of it; double buffering can be enabled even when the queue depth is 1.

When the queue depth is 1, the compiler applies additional optimizations that yield better performance. Therefore, setting the queue depth to 1 is recommended.

  • In the following example, EnQue and DeQue calls alternate (there are no consecutive EnQues), so the queue depth is set to 1.
    a1 = que.AllocTensor(); 
    que.EnQue(a1);
    a1 = que.DeQue();
    que.FreeTensor(a1);
    
  • In the following example, the queue is enqueued twice in a row, so the queue depth must be set to 2. This is needed only in a few preload scenarios (for example, two pieces of data are moved in consecutively; once one piece is computed, another piece is moved in, and then the previously moved-in piece is computed). In other cases, setting the depth to 2 or higher is not recommended.
    a1 = que.AllocTensor(); 
    a2 = que.AllocTensor();
    que.EnQue(a1);
    que.EnQue(a2);
    a1 = que.DeQue();
    a2 = que.DeQue(); 
    que.FreeTensor(a1);
    que.FreeTensor(a2);
    

mask: An optional extension parameter; the default value is 0.

  • If mask is of the const TQueConfig* type, the TQueConfig structure and its parameters are defined as follows. For a call example, see Example.
    struct TQueConfig {
        bool nd2nz = false;          // Whether to convert the data format from ND to NZ. The default value is false.
        bool nz2nd = false;          // Whether to convert the data format from NZ to ND. The default value is false.
        bool scmBlockGroup = false;  // TSCM parameter. Reserved. The default value is false.
        uint32_t bufferLen = 0;      // Must match the len passed to InitBuffer; enables compile-time performance optimization. The value 0 means resources are allocated during InitBuffer.
        uint32_t bufferNumber = 0;   // Must match the num passed to InitBuffer; enables compile-time performance optimization. The value 0 means resources are allocated during InitBuffer.
        uint32_t consumerSize = 0;   // Reserved
        TPosition consumer[8] = {};  // Reserved
        bool enableStaticEvtId = false; // Reserved
        bool enableLoopQueue = false;   // Reserved
    };
    

TQue Buffer Limit

Each buffer allocated by a TQue stores a synchronization event ID. Therefore, the number of queue buffers available in one TPosition is limited by the number of hardware synchronization event IDs.

For the Atlas Training Series Product, the number of event IDs is 4.

The maximum number of queue buffers in one TPosition is therefore 8 or 4 (depending on the hardware); that is, at most 8 or 4 synchronization events can be inserted. When the InitBuffer API of TPipe is used to allocate TQue buffers, this limit caps the number of TQue objects that can be allocated.

If the number of queue buffers in use at the same time exceeds the limit, no more TQue can be allocated. To continue allocating, call the FreeAllEvent API on a TQue that is temporarily unused: after the queue has finished its work, this API releases all synchronization events in the queue, and new TQue objects can then be allocated. An example is as follows.

// The maximum number of buffers in the VECIN position that can be allocated is eight. If this limit is exceeded, resource allocation may fail when AllocTensor or FreeTensor is used. Therefore, when the double buffer function is disabled, a maximum of eight TQues can be allocated.
AscendC::TPipe pipe;
int len = 1024;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que0;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que1;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que2;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que3;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que4;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que5;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que6;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que7;
 
pipe.InitBuffer(que0, 1, len);
pipe.InitBuffer(que1, 1, len);
pipe.InitBuffer(que2, 1, len);
pipe.InitBuffer(que3, 1, len);
pipe.InitBuffer(que4, 1, len);
pipe.InitBuffer(que5, 1, len);
pipe.InitBuffer(que6, 1, len);
pipe.InitBuffer(que7, 1, len);
 
// If double buffer is enabled, two memory blocks are allocated to each TQue. Therefore, a maximum of four TQues can be allocated.
AscendC::TPipe pipe;
int len = 1024;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que0;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que1;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que2;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que3;
 
pipe.InitBuffer(que0, 2, len);
pipe.InitBuffer(que1, 2, len);
pipe.InitBuffer(que2, 2, len);
pipe.InitBuffer(que3, 2, len);
 
// If the number of TQue reaches the maximum, call the FreeAllEvent API to release events so that more TQue can be allocated.
AscendC::TPipe pipe;
int len = 1024;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que0;
pipe.InitBuffer(que0, 1, len);
AscendC::LocalTensor<half> tensor1 = que0.AllocTensor<half>();
que0.EnQue(tensor1);
tensor1 = que0.DeQue<half>(); // Move the tensor out of the VECIN queue.
que0.FreeTensor<half>(tensor1);
que0.FreeAllEvent(); // Release all synchronization events of que0. After that, you can continue to allocate TQue.
AscendC::TQue<AscendC::TPosition::VECIN, 1> que1;
pipe.InitBuffer(que1, 1, len);

Example

In the following example, a TQueConfig is passed to enable compile-time evaluation of bufferNumber. The vector operator does not involve data format conversion, so nd2nz and nz2nd are set to false.

// User-defined meta function for constructing TQueConfig
__aicore__ constexpr AscendC::TQueConfig GetMyTQueConfig(bool nd2nzIn, bool nz2ndIn, bool scmBlockGroupIn,
    uint32_t bufferLenIn, uint32_t bufferNumberIn, uint32_t consumerSizeIn, const AscendC::TPosition consumerIn[])
{
    return {
        .nd2nz = nd2nzIn,
        .nz2nd = nz2ndIn,
        .scmBlockGroup = scmBlockGroupIn,
        .bufferLen = bufferLenIn,
        .bufferNumber = bufferNumberIn,
        .consumerSize = consumerSizeIn,
        .consumer = {consumerIn[0], consumerIn[1], consumerIn[2], consumerIn[3],
            consumerIn[4], consumerIn[5], consumerIn[6], consumerIn[7]}
    };
}
static constexpr AscendC::TPosition tp[8] = {AscendC::TPosition::MAX, AscendC::TPosition::MAX, AscendC::TPosition::MAX, AscendC::TPosition::MAX,
            AscendC::TPosition::MAX, AscendC::TPosition::MAX, AscendC::TPosition::MAX, AscendC::TPosition::MAX};
static constexpr AscendC::TQueConfig conf = GetMyTQueConfig(false, false, false, 0, 1, 0, tp);
template <typename srcType> class KernelAscendQuant {
public:
    __aicore__ inline KernelAscendQuant() {}
    __aicore__ inline void Init(GM_ADDR src_gm, GM_ADDR dst_gm, uint32_t inputSize)
    {
        dataSize = inputSize;
        src_global.SetGlobalBuffer(reinterpret_cast<__gm__ srcType*>(src_gm), dataSize);
        dst_global.SetGlobalBuffer(reinterpret_cast<__gm__ int8_t*>(dst_gm), dataSize);
        pipe.InitBuffer(inQueueX, 1, dataSize * sizeof(srcType));
        pipe.InitBuffer(outQueue, 1, dataSize * sizeof(int8_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        ...
    }
    __aicore__ inline void Compute()
    {
        ...
    }
    __aicore__ inline void CopyOut()
    {
        ...
    }
private:
    AscendC::GlobalTensor<srcType> src_global;
    AscendC::GlobalTensor<int8_t> dst_global;
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1, &conf> inQueueX;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1, &conf> outQueue;
    uint32_t dataSize = 0;
};
template <typename dataType> __aicore__ void kernel_ascend_quant_operator(GM_ADDR src_gm, GM_ADDR dst_gm, uint32_t dataSize)
{
    KernelAscendQuant<dataType> op;
    op.Init(src_gm, dst_gm, dataSize);
    op.Process();
}