WaitPreTaskEnd

This API is for trial use and may be adjusted or updated in later versions. Compatibility is not guaranteed. Stay tuned for future updates.

Supported Products

Product

Supported/Unsupported

Remarks

Atlas A3 training products/Atlas A3 inference products

This interface takes effect.

Atlas A2 training products/Atlas A2 inference products

Only compilation compatibility is ensured. The actual function does not take effect.

Atlas 200I/500 A2 inference products

Only compilation compatibility is ensured. The actual function does not take effect.

Atlas inference product's AI Core

Only compilation compatibility is ensured. The actual function does not take effect.

Atlas inference product's Vector Core

Only compilation compatibility is ensured. The actual function does not take effect.

Atlas training products

Only compilation compatibility is ensured. The actual function does not take effect.

Functions

is called in the sub-kernel of SuperKernel. The instructions before calling can be implemented in parallel with other sub-kernel to improve the overall performance. As shown in Figure 1, SuperKernel calls sub-kernel in sequence. To ensure that data of sub-kernel does not interfere with each other, operators are inserted between sub-kernel for order preserving. The instruction before the sub-kernel N+1 calls this interface is parallel with the previous sub-kernel N.

SuperKernel is an operator binary fusion technology. Different from source code fusion, SuperKernel focuses on the binary scheduling solution of kernel functions, performs in-depth optimization, and creates a super kernel function (SuperKernel) based on the compiled binary code, a plurality of other kernel functions, that is, sub-kernel, are called in a manner of calling a sub-function. Compared with single-operator delivery, the SuperKernel technology reduces the task scheduling waiting time and scheduling overhead, and uses task gap resources to further optimize the operator header overhead.

You need to ensure that the instruction before calling this API does not interfere with the previous operator, which may cause accuracy issues. You are advised to call this API before the first operator transfer instruction.

Figure 1 Parallel processing through WaitPreTaskEnd

Prototype

1
__aicore__ inline void WaitPreTaskEnd()

Command-Line Options

None

Return Value Description

None

Restrictions

  • This API is applicable to the TorchAir graph development scenario and takes effect only after the SuperKernel feature is enabled. For details, see section ""max-autotune Mode" > "Calibrating the SuperKernel Range in a Graph"" in PyTorch Graph Mode User Guide (TorchAir).
  • During operator execution, ensure that this API is called on each core and is called only once on each core.
  • If this API is called for a tiling key branch of a subkernel, you need to ensure that this API is called for all tiling keys that may be executed by the current operator. Otherwise, the execution will be suspended due to mismatched number of synchronization instructions.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
#include "kernel_operator.h"
class KernelEarlyStart {
public:
    __aicore__ inline KernelEarlyStart() {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ half*)src0Gm);
        src1Global.SetGlobalBuffer((__gm__ half*)src1Gm);
        dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, 512 * sizeof(half));
        pipe.InitBuffer(inQueueSrc1, 1, 512 * sizeof(half));
        pipe.InitBuffer(outQueueDst, 1, 512 * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> src0Local = inQueueSrc0.AllocTensor<half>();
        AscendC::LocalTensor<half> src1Local = inQueueSrc1.AllocTensor<half>();
        // Inserted before the first movement instruction of the operator and ensure that the operator is called only once.
        AscendC::WaitPreTaskEnd();
        AscendC::DataCopy(src0Local, src0Global, 512);
        AscendC::DataCopy(src1Local, src1Global, 512);
        inQueueSrc0.EnQue(src0Local);
        inQueueSrc1.EnQue(src1Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> src0Local = inQueueSrc0.DeQue<half>();
        AscendC::LocalTensor<half> src1Local = inQueueSrc1.DeQue<half>();
        AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
        AscendC::Add(dstLocal, src0Local, src1Local, 512);
        outQueueDst.EnQue<half>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
        inQueueSrc1.FreeTensor(src1Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
        AscendC::DataCopy(dstGlobal, dstLocal, 512);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc0, inQueueSrc1;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<half> src0Global, src1Global, dstGlobal;
};
extern "C" __global__ __aicore__ void early_start_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
{
    KernelEarlyStart op;
    op.Init(src0Gm, src1Gm, dstGm);
    op.Process();
}