WaitPreTaskEnd

This API is for trial use and may be adjusted or updated in later versions. Compatibility is not guaranteed. Stay tuned for future updates.

Product Support

Product	Supported	Remarks
Atlas A3 training products/Atlas A3 inference products	√	This API takes effect.
Atlas A2 training products/Atlas A2 inference products	√	Only compilation compatibility is guaranteed. The actual function does not take effect.
Atlas 200I/500 A2 inference products	√	Only compilation compatibility is guaranteed. The actual function does not take effect.
Atlas inference product's AI Core	√	Only compilation compatibility is guaranteed. The actual function does not take effect.
Atlas inference product's Vector Core	√	Only compilation compatibility is guaranteed. The actual function does not take effect.
Atlas training products	√	Only compilation compatibility is guaranteed. The actual function does not take effect.

Function

Call this API in a subkernel of a SuperKernel. Instructions that the subkernel issues before the call can execute in parallel with the preceding subkernel, which improves overall performance. As shown in Figure 1, a SuperKernel calls its subkernels in sequence. To keep the data of different subkernels from interfering with each other, inter-operator synchronization is inserted between subkernels to preserve their order. With this API, the instructions that subkernel_N+1 issues before calling WaitPreTaskEnd run in parallel with the preceding subkernel_N, and the synchronization takes effect at the call.
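The benefit of deferring the synchronization can be sketched with a simplified timing model in plain C++. This is only an illustration under stated assumptions, not a description of the hardware scheduler: it assumes that the instructions issued before WaitPreTaskEnd can start at the same time as subkernel_N, and the function names and cycle counts are hypothetical.

```cpp
#include <cstdint>

// Simplified timing model (illustrative only, not measured data).
// prevCycles:     run time of subkernel_N.
// prologueCycles: instructions of subkernel_N+1 issued before WaitPreTaskEnd.
// bodyCycles:     instructions of subkernel_N+1 issued after WaitPreTaskEnd.

// Synchronization at subkernel entry: subkernel_N+1 starts only after
// subkernel_N has finished, so all three phases run back to back.
inline uint64_t FinishWithEntrySync(uint64_t prevCycles, uint64_t prologueCycles, uint64_t bodyCycles)
{
    return prevCycles + prologueCycles + bodyCycles;
}

// Synchronization deferred to WaitPreTaskEnd: the prologue overlaps with
// subkernel_N (assumed to start together), and the body starts once both
// the prologue and subkernel_N are done.
inline uint64_t FinishWithDeferredSync(uint64_t prevCycles, uint64_t prologueCycles, uint64_t bodyCycles)
{
    const uint64_t bodyStart = (prevCycles > prologueCycles) ? prevCycles : prologueCycles;
    return bodyStart + bodyCycles;
}
```

In this model, with a 100-cycle subkernel_N, a 30-cycle prologue, and a 50-cycle body, entry synchronization finishes at cycle 180 and deferred synchronization at cycle 150. The gain is bounded by the length of the prologue, which is why only instructions that do not touch data of earlier operators should be placed before the call.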

SuperKernel is a binary fusion technology for operators. Unlike source code fusion, SuperKernel works on the compiled binary code and optimizes how kernel functions are scheduled: a super kernel function (SuperKernel) is created that invokes multiple other kernel functions (subkernels) as sub-function calls. Compared with delivering operators one by one, SuperKernel reduces task scheduling wait time and scheduling overhead, and uses the gaps between tasks to further reduce operator header overhead.

You must ensure that the instructions issued before the call to this API do not interfere with earlier operators. Otherwise, accuracy issues may occur. You are advised to call this API immediately before the first data transfer instruction of the operator.

Figure 1 Parallelism implemented by WaitPreTaskEnd

Prototype

__aicore__ inline void WaitPreTaskEnd()

Parameters

None

Returns

None

Restrictions

This API is applicable to the TorchAir graph development scenario and takes effect only after the SuperKernel feature is enabled. For details, see section "max-autotune Mode" > "Calibrating the SuperKernel Range in a Graph" in PyTorch Graph Mode User Guide (TorchAir).
During operator execution, this API must be called on every core, and exactly once on each core.
If this API is called in a TilingKey branch of a subkernel, it must be called for every TilingKey that the current operator may run. Otherwise, execution may hang because the number of synchronization instructions does not match.
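The TilingKey restriction can be illustrated with a host-side mock in plain C++. This is a minimal sketch, not Ascend C device code: WaitPreTaskEndMock, RunSubkernel, CountWaitCalls, and the TilingKey value 1 are hypothetical names and values introduced only for illustration. The mock counts calls so that the "exactly once, in every branch" rule can be checked.

```cpp
#include <cstdint>

// Hypothetical stand-in for AscendC::WaitPreTaskEnd(); it only counts calls.
static int g_waitCalls = 0;
inline void WaitPreTaskEndMock() { ++g_waitCalls; }

// Hypothetical subkernel body with two TilingKey branches.
// Every branch that can run issues the wait exactly once,
// before its first data transfer.
inline void RunSubkernel(uint32_t tilingKey)
{
    if (tilingKey == 1) {
        WaitPreTaskEndMock();
        // ... data transfer and compute for TilingKey 1 would go here ...
    } else {
        WaitPreTaskEndMock();
        // ... data transfer and compute for all other TilingKeys would go here ...
    }
}

// Returns how many times the wait was issued for one run with the given key.
inline int CountWaitCalls(uint32_t tilingKey)
{
    g_waitCalls = 0;
    RunSubkernel(tilingKey);
    return g_waitCalls;
}
```

If one branch omitted the call, CountWaitCalls would return 0 for that key, which corresponds to the synchronization count mismatch described above. Hoisting the call above the branch is another way to satisfy the rule, provided that nothing before it touches data of earlier operators.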

Example

#include "kernel_operator.h"
class KernelEarlyStart {
public:
    __aicore__ inline KernelEarlyStart() {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ half*)src0Gm);
        src1Global.SetGlobalBuffer((__gm__ half*)src1Gm);
        dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, 512 * sizeof(half));
        pipe.InitBuffer(inQueueSrc1, 1, 512 * sizeof(half));
        pipe.InitBuffer(outQueueDst, 1, 512 * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> src0Local = inQueueSrc0.AllocTensor<half>();
        AscendC::LocalTensor<half> src1Local = inQueueSrc1.AllocTensor<half>();
        // Call WaitPreTaskEnd before the first data transfer instruction of the operator, exactly once on each core.
        AscendC::WaitPreTaskEnd();
        AscendC::DataCopy(src0Local, src0Global, 512);
        AscendC::DataCopy(src1Local, src1Global, 512);
        inQueueSrc0.EnQue(src0Local);
        inQueueSrc1.EnQue(src1Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> src0Local = inQueueSrc0.DeQue<half>();
        AscendC::LocalTensor<half> src1Local = inQueueSrc1.DeQue<half>();
        AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
        AscendC::Add(dstLocal, src0Local, src1Local, 512);
        outQueueDst.EnQue<half>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
        inQueueSrc1.FreeTensor(src1Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
        AscendC::DataCopy(dstGlobal, dstLocal, 512);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc0, inQueueSrc1;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<half> src0Global, src1Global, dstGlobal;
};
extern "C" __global__ __aicore__ void early_start_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
{
    KernelEarlyStart op;
    op.Init(src0Gm, src1Gm, dstGm);
    op.Process();
}
