SetNextTaskStart

This API is for trial use and may be adjusted or updated in later versions. Compatibility is not guaranteed. Stay tuned for future updates.

Product Support

Product	Supported	Remarks
Atlas A3 training products/Atlas A3 inference products	√	This API takes effect.
Atlas A2 training products/Atlas A2 inference products	√	Only compilation compatibility is guaranteed. The actual function does not take effect.
Atlas 200I/500 A2 inference products	√	Only compilation compatibility is guaranteed. The actual function does not take effect.
Atlas inference product's AI Core	√	Only compilation compatibility is guaranteed. The actual function does not take effect.
Atlas inference product's Vector Core	√	Only compilation compatibility is guaranteed. The actual function does not take effect.
Atlas training products	√	Only compilation compatibility is guaranteed. The actual function does not take effect.

Function

Call this API in a subkernel of a SuperKernel. Instructions issued after the call can execute in parallel with the subsequent subkernel, which improves overall performance. As shown in Figure 1, a SuperKernel calls its subkernels in sequence. To keep the data of different subkernels from interfering with each other, inter-operator synchronization is inserted between subkernels to preserve their order. After subkernel_N-1 calls this API, its remaining instructions execute in parallel with subkernel_N.

SuperKernel is a binary-level fusion technology for operators. Unlike source code fusion, it optimizes how compiled kernel functions are scheduled: based on the compiled binary code, a super kernel function (SuperKernel) is created that calls multiple other kernel functions (subkernels) as sub-functions. Compared with single-operator delivery, SuperKernel reduces the task scheduling wait time and scheduling overhead, and uses the gaps between tasks to further reduce the operator header overhead.

You must ensure that the instructions issued after this API call do not interfere with subsequent operators, because such interference may cause accuracy issues. You are advised to call this API after the last data transfer instruction of the entire operator.

Figure 1 Parallelism implemented by SetNextTaskStart
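The overlap described above can be approximated with a small host-side timing model. This is only a sketch under stated assumptions: the `Timeline` struct and both functions are hypothetical names, not part of Ascend C. The time values are abstract units, and the model ignores synchronization cost and hardware resource contention.

```cpp
#include <algorithm>
#include <cassert>

// Idealized timing model for illustration only (assumption: not measured
// data, and not Ascend C device code).
struct Timeline {
    int prevTotal;   // duration of subkernel_N-1
    int callOffset;  // time from the start of subkernel_N-1 to its SetNextTaskStart call
    int nextTotal;   // duration of subkernel_N
};

// Without the call, subkernel_N starts only after subkernel_N-1 has finished.
inline int SerialFinish(const Timeline& t)
{
    return t.prevTotal + t.nextTotal;
}

// With the call, subkernel_N may start at callOffset while the tail of
// subkernel_N-1 is still running, so the two overlap.
inline int EarlyStartFinish(const Timeline& t)
{
    assert(t.callOffset >= 0 && t.callOffset <= t.prevTotal);
    return std::max(t.prevTotal, t.callOffset + t.nextTotal);
}
```

In this model, the time saved equals the length of the tail that follows the call (`prevTotal - callOffset`), capped by the duration of subkernel_N. Note that the recommended placement, after the last data transfer instruction, prioritizes correctness and may leave only a short tail to overlap.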

Prototype

This prototype is supported by the following product models:

Atlas A3 training products/Atlas A3 inference products

Atlas A2 training products/Atlas A2 inference products

Atlas 200I/500 A2 inference products

template<pipe_t AIV_PIPE = PIPE_MTE3, pipe_t AIC_PIPE = PIPE_FIX>
__aicore__ inline void SetNextTaskStart()

This prototype is supported by the following product models:

Atlas inference product's AI Core

Atlas training products

template<pipe_t AIV_PIPE = PIPE_MTE3, pipe_t AIC_PIPE = PIPE_MTE3>
__aicore__ inline void SetNextTaskStart()

Parameters

**Table 1** Template parameters
Parameter	Description
AIV_PIPE	Pipeline on the AIV. Instructions that are issued after SetNextTaskStart is called and that run in this pipeline can execute in parallel with subsequent operators. Valid values: PIPE_MTE2, PIPE_MTE3, PIPE_S, and PIPE_V. For details about the pipeline types, see Pipelines.
AIC_PIPE	Pipeline on the AIC. Instructions that are issued after SetNextTaskStart is called and that run in this pipeline can execute in parallel with subsequent operators. Valid values: PIPE_MTE1, PIPE_MTE2, PIPE_MTE3, PIPE_FIX, and PIPE_M. For details about the pipeline types, see Pipelines.

Returns

None

Restrictions

This API is applicable to the TorchAir graph development scenario and takes effect only after the SuperKernel feature is enabled. For details, see section "max-autotune Mode" > "Calibrating the SuperKernel Range in a Graph" in PyTorch Graph Mode User Guide (TorchAir).
During operator execution, this API must be called exactly once on every core.
If this API is called in a TilingKey branch of a subkernel, ensure that it is called in every TilingKey branch that the current operator may run. Otherwise, the number of synchronization instructions will not match and execution may hang.
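The last two restrictions can be expressed as a simple bookkeeping check. The sketch below is a host-side model, not Ascend C device code: `CallTable` and `FindViolatingKeys` are hypothetical names, and the per-core call counts are assumed to come from reviewing each TilingKey branch by hand.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Host-side bookkeeping model (assumption: hypothetical names, not part of
// Ascend C). callsPerCore[k][c] is the number of SetNextTaskStart calls that
// core c makes when TilingKey k is selected.
using CallTable = std::map<int, std::vector<int>>;

// Returns the TilingKeys that break the rule "exactly one call on every core".
// A key is reported once, even if several cores violate the rule.
inline std::vector<int> FindViolatingKeys(const CallTable& callsPerCore)
{
    std::vector<int> violating;
    for (const auto& entry : callsPerCore) {
        for (int calls : entry.second) {
            if (calls != 1) {
                violating.push_back(entry.first);
                break;
            }
        }
    }
    return violating;
}
```

For example, a table in which TilingKey 2 skips the call on one core reports key 2. According to the restriction above, that is a case in which the synchronization counts no longer match and execution may hang.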

Example

#include "kernel_operator.h"
class KernelEarlyStart {
public:
    __aicore__ inline KernelEarlyStart() {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ half*)src0Gm);
        src1Global.SetGlobalBuffer((__gm__ half*)src1Gm);
        dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, 512 * sizeof(half));
        pipe.InitBuffer(inQueueSrc1, 1, 512 * sizeof(half));
        pipe.InitBuffer(outQueueDst, 1, 512 * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> src0Local = inQueueSrc0.AllocTensor<half>();
        AscendC::LocalTensor<half> src1Local = inQueueSrc1.AllocTensor<half>();
        AscendC::DataCopy(src0Local, src0Global, 512);
        AscendC::DataCopy(src1Local, src1Global, 512);
        inQueueSrc0.EnQue(src0Local);
        inQueueSrc1.EnQue(src1Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> src0Local = inQueueSrc0.DeQue<half>();
        AscendC::LocalTensor<half> src1Local = inQueueSrc1.DeQue<half>();
        AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
        AscendC::Add(dstLocal, src0Local, src1Local, 512);
        outQueueDst.EnQue<half>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
        inQueueSrc1.FreeTensor(src1Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
        AscendC::DataCopy(dstGlobal, dstLocal, 512);
        // Inserted after the last transfer instruction of the operator. Ensure that the API is called only once.
        AscendC::SetNextTaskStart();
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc0, inQueueSrc1;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<half> src0Global, src1Global, dstGlobal;
};
extern "C" __global__ __aicore__ void early_start_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
{
    KernelEarlyStart op;
    op.Init(src0Gm, src1Gm, dstGm);
    op.Process();
}
