Introduction to TQue

Tasks communicate and synchronize with each other through queues. TQue is the data structure that performs queue operations and manages the associated resources. TQue inherits from the TQueBind parent class. The inheritance relationship is as follows:

Template Parameters

template <TPosition pos, int32_t depth, auto mask = 0> class TQue{...};

Table 1 Parameters in the TQue template

Parameter

Meaning

pos

Logical location of a queue. It can be VECIN, VECOUT, A1, A2, B1, B2, CO1, or CO2. For details about TPosition, see TPosition.

depth

The depth of a queue is the number of consecutive enqueue or dequeue operations that the queue can accept. If, at run time, n EnQue calls are issued back to back on the queue (with no DeQue in between), the depth must be set to n.

Note that the queue depth is unrelated to double buffering. Queues implement pipeline parallelism, and double buffering builds on that to further improve pipeline utilization. Double buffering can still be enabled when the queue depth is 1.

When tensors are not operated on in place, a queue depth of 1 lets the compiler apply a special optimization, and performance is usually better. A depth of 1 is therefore recommended.

For in-place tensor operations, the depth must be set to 0.

  • In the following example, EnQue is never called twice in a row, so a queue depth of 1 is sufficient.
    a1 = que.AllocTensor(); 
    que.EnQue(a1);
    a1 = que.DeQue();
    que.FreeTensor(a1);
    
  • In the following example, EnQue is called twice in a row, so the queue depth must be set to 2. This pattern is needed only in a few preload scenarios (for example, two pieces of data are moved in consecutively; once one piece has been computed, another piece is moved in, and then the previously moved-in piece is computed). In other cases, a depth of 2 or higher is not recommended.
    a1 = que.AllocTensor(); 
    a2 = que.AllocTensor();
    que.EnQue(a1);
    que.EnQue(a2);
    a1 = que.DeQue();
    a2 = que.DeQue(); 
    que.FreeTensor(a1);
    que.FreeTensor(a2);
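
The relationship between depth and consecutive enqueues can be illustrated with a small host-side model. This is plain C++, not Ascend C: `ModelQueue`, `Pending`, and `AcceptedConsecutiveEnQues` are hypothetical names introduced only for this sketch, and the model assumes that an enqueue beyond the configured depth is simply rejected. It does not cover the depth 0 in-place case.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical host-side model of a fixed-depth queue (NOT the Ascend C TQue).
// Depth is the number of entries that may be enqueued before a dequeue is required.
template <std::size_t Depth>
class ModelQueue {
    static_assert(Depth >= 1, "this model needs a depth of at least 1");
public:
    // Returns false when Depth entries are already pending.
    bool EnQue(int id) {
        if (count_ == Depth) return false;
        slots_[(head_ + count_) % Depth] = id;
        ++count_;
        return true;
    }
    // Returns false when the queue is empty; otherwise pops the oldest entry into id.
    bool DeQue(int& id) {
        if (count_ == 0) return false;
        id = slots_[head_];
        head_ = (head_ + 1) % Depth;
        --count_;
        return true;
    }
    std::size_t Pending() const { return count_; }
private:
    int slots_[Depth] = {};
    std::size_t head_ = 0;
    std::size_t count_ = 0;
};

// Counts how many of n back-to-back enqueues a queue of the given depth accepts.
template <std::size_t Depth>
std::size_t AcceptedConsecutiveEnQues(std::size_t n) {
    ModelQueue<Depth> q;
    std::size_t accepted = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (q.EnQue(static_cast<int>(i))) ++accepted;
    }
    assert(q.Pending() <= Depth);  // the model never holds more than Depth entries
    return accepted;
}
```

In this model, two back-to-back enqueues on a depth-1 queue leave the second one rejected, whereas a depth-2 queue accepts both, which mirrors the rule that n consecutive EnQue calls need a depth of n.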
    

mask

  • If mask is of the int type, its bits carry the following information:
    • If bit 0 is 1, the data format is converted from ND to NZ. In this case, pos must be A1 or B1.
    • If bit 1 is 1, the data format is converted from NZ to ND. In this case, pos must be CO2.

    The following models are supported:

    Atlas inference product's AI Core

  • If mask is of the const TQueConfig* type, the TQueConfig structure and its members are defined as follows. For a call example, see Example.
    struct TQueConfig {
        bool nd2nz = false;  // true indicates that the data format is converted from ND to NZ. TPosition can only be A1 or B1. The default value is false.
        bool nz2nd = false;  // true indicates that the data format is converted from NZ to ND. TPosition can only be CO2. The default value is false.
        bool scmBlockGroup = false; // TSCM-related parameter, which is reserved. The default value is false.
        uint32_t bufferLen = 0; // Must be the same as the len value passed to InitBuffer, which allows optimization at compile time. The value 0 indicates that resources are allocated during InitBuffer.
        uint32_t bufferNumber = 0;  // Must be the same as the num value passed to InitBuffer, which allows optimization at compile time. The value 0 indicates that resources are allocated during InitBuffer.
        uint32_t consumerSize = 0;  // Reserved
        TPosition consumer[8] = {}; // Reserved
        bool enableStaticEvtId = false; // Reserved
        bool enableLoopQueue = false;   // Reserved
    };
    

    The following models support the parameters related to ND and NZ format conversion:

    Atlas inference product's AI Core
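
Both forms of mask carry the same two format-conversion switches. The host-side sketch below shows how bit 0 and bit 1 of an integer mask map onto the nd2nz and nz2nd switches, together with the position constraints stated above. It is plain C++ for illustration only: `Pos`, `FormatFlags`, `DecodeMask`, `IsValidFor`, and `CheckedDecode` are hypothetical names, not part of the Ascend C API.

```cpp
#include <cassert>

// Hypothetical stand-in for the positions named in this section.
enum class Pos { VECIN, VECOUT, A1, A2, B1, B2, CO1, CO2 };

// Hypothetical mirror of the two format-conversion switches in TQueConfig.
struct FormatFlags {
    bool nd2nz;
    bool nz2nd;
};

// Bit 0 set -> ND to NZ conversion; bit 1 set -> NZ to ND conversion.
constexpr FormatFlags DecodeMask(int mask) {
    return FormatFlags{(mask & 0x1) != 0, (mask & 0x2) != 0};
}

// ND to NZ is only allowed at A1 or B1; NZ to ND is only allowed at CO2.
constexpr bool IsValidFor(Pos pos, FormatFlags f) {
    return (!f.nd2nz || pos == Pos::A1 || pos == Pos::B1) &&
           (!f.nz2nd || pos == Pos::CO2);
}

// Decodes the mask and checks it against the position in debug builds.
inline FormatFlags CheckedDecode(Pos pos, int mask) {
    const FormatFlags f = DecodeMask(mask);
    assert(IsValidFor(pos, f) && "conversion flag not allowed at this position");
    return f;
}
```

A mask of 0, the default, requests no conversion and passes the check at every position in this model.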

TQue Buffer Limit

The buffer allocated by TQue stores the synchronization event ID. Therefore, the number of QUE buffers at the same TPosition is limited by the number of synchronization event IDs that the hardware provides:

  • Atlas training products: 4 event IDs
  • Atlas inference product's AI Core: 8 event IDs
  • Atlas inference product's Vector Core: 8 event IDs
  • Atlas A2 training products / Atlas A2 inference products: 8 event IDs
  • Atlas A3 training products / Atlas A3 inference products: 8 event IDs
  • Atlas 200I/500 A2 inference products: 8 event IDs

The maximum number of QUE buffers is therefore 8 or 4, depending on the product, which is also the number of synchronization events that can be inserted. Consequently, when the InitBuffer API of TPipe is used to allocate buffers for TQue, at most 8 or 4 TQue objects can be allocated.
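
The limit can be read as simple arithmetic: if each buffer takes one event ID, the number of TQue objects that fit at one TPosition is the event ID count divided by the number of buffers per queue. The following is a minimal sketch in plain C++, assuming every queue at that position uses the same buffer count. `MaxQueues`, `FitsBudget`, and `RemainingBuffers` are hypothetical helper names, not Ascend C APIs.

```cpp
#include <cassert>
#include <cstdint>

// Number of queues that fit when every queue uses buffersPerQueue buffers
// and each buffer takes one of the eventIds available at the position.
constexpr std::uint32_t MaxQueues(std::uint32_t eventIds, std::uint32_t buffersPerQueue) {
    return buffersPerQueue == 0 ? 0 : eventIds / buffersPerQueue;
}

// True when queueCount queues with buffersPerQueue buffers each stay within the budget.
constexpr bool FitsBudget(std::uint32_t eventIds, std::uint32_t queueCount,
                          std::uint32_t buffersPerQueue) {
    return queueCount * buffersPerQueue <= eventIds;
}

// Buffers still available after usedBuffers have been allocated.
inline std::uint32_t RemainingBuffers(std::uint32_t eventIds, std::uint32_t usedBuffers) {
    assert(usedBuffers <= eventIds);  // the budget must not already be exceeded
    return eventIds - usedBuffers;
}

// 8 event IDs: 8 single-buffer queues, or 4 double-buffered queues.
static_assert(MaxQueues(8, 1) == 8, "single buffer, 8 event IDs");
static_assert(MaxQueues(8, 2) == 4, "double buffer, 8 event IDs");
// 4 event IDs: 4 single-buffer queues, or 2 double-buffered queues.
static_assert(MaxQueues(4, 1) == 4, "single buffer, 4 event IDs");
static_assert(MaxQueues(4, 2) == 2, "double buffer, 4 event IDs");
```

Under this assumption, a ninth single-buffer queue on a product with 8 event IDs would exceed the budget, so `FitsBudget(8, 9, 1)` is false.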

If the number of QUE buffers in use at the same time exceeds the limit, no more TQue can be allocated. To continue allocating, call the FreeAllEvent API on a TQue that is temporarily not needed: once that TQue has finished its work, FreeAllEvent releases all events held by the queue, after which another TQue can be allocated. An example is as follows.

// At most 8 buffers can be allocated at the VECIN position. If this limit is exceeded, resource allocation may fail when AllocTensor or FreeTensor is called. Therefore, when double buffering is disabled, a maximum of 8 TQues can be allocated.
AscendC::TPipe pipe;
int len = 1024;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que0;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que1;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que2;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que3;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que4;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que5;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que6;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que7;
 
pipe.InitBuffer(que0, 1, len);
pipe.InitBuffer(que1, 1, len);
pipe.InitBuffer(que2, 1, len);
pipe.InitBuffer(que3, 1, len);
pipe.InitBuffer(que4, 1, len);
pipe.InitBuffer(que5, 1, len);
pipe.InitBuffer(que6, 1, len);
pipe.InitBuffer(que7, 1, len);
 
// If double buffer is enabled, two buffers are allocated to each TQue. Therefore, a maximum of four TQues can be allocated.
AscendC::TPipe pipe;
int len = 1024;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que0;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que1;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que2;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que3;
 
pipe.InitBuffer(que0, 2, len);
pipe.InitBuffer(que1, 2, len);
pipe.InitBuffer(que2, 2, len);
pipe.InitBuffer(que3, 2, len);
 
// If the number of TQues has reached the maximum, call FreeAllEvent on a queue that has finished its work before allocating another TQue.
AscendC::TPipe pipe;
int len = 1024;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que0;
pipe.InitBuffer(que0, 1, len);
AscendC::LocalTensor<half> tensor1 = que0.AllocTensor<half>();
que0.EnQue(tensor1);
tensor1 = que0.DeQue<half>(); // Dequeue the tensor from the VECIN queue.
que0.FreeTensor<half>(tensor1);
que0.FreeAllEvent(); // Release all synchronization events of que0. After that, you can continue to allocate TQue.
AscendC::TQue<AscendC::TPosition::VECIN, 1> que1;
pipe.InitBuffer(que1, 1, len);

Example

In the following example, a TQueConfig is passed as the mask template argument so that bufferNumber is known at compile time. The vector operator does not involve data format conversion, so nd2nz and nz2nd are both set to false.

// Custom constexpr helper that constructs a TQueConfig at compile time
__aicore__ constexpr AscendC::TQueConfig GetMyTQueConfig(bool nd2nzIn, bool nz2ndIn, bool scmBlockGroupIn,
    uint32_t bufferLenIn, uint32_t bufferNumberIn, uint32_t consumerSizeIn, const AscendC::TPosition consumerIn[])
{
    return {
        .nd2nz = nd2nzIn,
        .nz2nd = nz2ndIn,
        .scmBlockGroup = scmBlockGroupIn,
        .bufferLen = bufferLenIn,
        .bufferNumber = bufferNumberIn,
        .consumerSize = consumerSizeIn,
        .consumer = {consumerIn[0], consumerIn[1], consumerIn[2], consumerIn[3],
            consumerIn[4], consumerIn[5], consumerIn[6], consumerIn[7]}
    };
}
static constexpr AscendC::TPosition tp[8] = {AscendC::TPosition::MAX, AscendC::TPosition::MAX, AscendC::TPosition::MAX, AscendC::TPosition::MAX,
            AscendC::TPosition::MAX, AscendC::TPosition::MAX, AscendC::TPosition::MAX, AscendC::TPosition::MAX};
static constexpr AscendC::TQueConfig conf = GetMyTQueConfig(false, false, false, 0, 1, 0, tp);
template <typename srcType> class KernelAscendQuant {
public:
    __aicore__ inline KernelAscendQuant() {}
    __aicore__ inline void Init(GM_ADDR src_gm, GM_ADDR dst_gm, uint32_t inputSize)
    {
        dataSize = inputSize;
        src_global.SetGlobalBuffer(reinterpret_cast<__gm__ srcType*>(src_gm), dataSize);
        dst_global.SetGlobalBuffer(reinterpret_cast<__gm__ int8_t*>(dst_gm), dataSize);
        pipe.InitBuffer(inQueueX, 1, dataSize * sizeof(srcType));
        pipe.InitBuffer(outQueue, 1, dataSize * sizeof(int8_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        ...
    }
    __aicore__ inline void Compute()
    {
        ...
    }
    __aicore__ inline void CopyOut()
    {
        ...
    }
private:
    AscendC::GlobalTensor<srcType> src_global;
    AscendC::GlobalTensor<int8_t> dst_global;
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1, &conf> inQueueX;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1, &conf> outQueue;
    uint32_t dataSize = 0;
};
template <typename dataType> __aicore__ void kernel_ascend_quant_operator(GM_ADDR src_gm, GM_ADDR dst_gm, uint32_t dataSize)
{
    KernelAscendQuant<dataType> op;
    op.Init(src_gm, dst_gm, dataSize);
    op.Process();
}
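
A host-side analogy shows why passing &conf as a template argument allows work to be done at compile time. In the sketch below, a pointer to a constexpr configuration is a non-type template parameter, so its fields are constants inside the class: a nonzero bufferNumber is fixed at compile time, and 0 falls back to the value supplied at run time. `ModelConfig`, `ModelQue`, `IsStatic`, and `EffectiveBufferNum` are hypothetical names, and the sketch does not claim to reproduce how TQue is actually implemented.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, reduced mirror of the two sizing fields of TQueConfig.
struct ModelConfig {
    std::uint32_t bufferLen;
    std::uint32_t bufferNumber;
};

// Configurations with static storage, so their addresses can be template arguments.
static constexpr ModelConfig kFixedOne = {0, 1};  // bufferNumber known at compile time
static constexpr ModelConfig kRuntime = {0, 0};   // 0: decide when buffers are initialized

template <const ModelConfig* cfg>
class ModelQue {
public:
    // Evaluated at compile time because cfg points to a constexpr object.
    static constexpr bool IsStatic() { return cfg->bufferNumber != 0; }

    // Returns the compile-time count when one is configured, else the runtime count.
    std::uint32_t EffectiveBufferNum(std::uint32_t runtimeNum) const {
        // A configured count must match the count passed at initialization.
        assert(!IsStatic() || runtimeNum == cfg->bufferNumber);
        return IsStatic() ? cfg->bufferNumber : runtimeNum;
    }
};

static_assert(ModelQue<&kFixedOne>::IsStatic(), "count is fixed at compile time");
static_assert(!ModelQue<&kRuntime>::IsStatic(), "count is taken at run time");
```

The assertion in `EffectiveBufferNum` reflects the requirement that bufferNumber must match the num argument of InitBuffer, as in the example above where both are 1.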