SetNextTaskStart
This API is for trial use and may be adjusted or updated in later versions. Compatibility is not guaranteed. Stay tuned for future updates.
Supported Products
Product |
Supported/Unsupported |
Remarks |
|---|---|---|
√ |
This interface takes effect. |
|
√ |
Only compilation compatibility is ensured. The actual function does not take effect. |
|
√ |
Only compilation compatibility is ensured. The actual function does not take effect. |
|
√ |
Only compilation compatibility is ensured. The actual function does not take effect. |
|
√ |
Only compilation compatibility is ensured. The actual function does not take effect. |
|
√ |
Only compilation compatibility is ensured. The actual function does not take effect. |
Functions
is called in the sub-kernel of SuperKernel. The called instruction can be implemented in parallel with other sub-kernel, improving the overall performance. As shown in Figure 1, SuperKernel calls sub-kernel in sequence. To ensure that data of sub-kernel does not interfere with each other, operators are inserted between sub-kernel for order preserving. After the sub-kernel N-1 calls this interface, subsequent instructions are parallel with the subsequent sub-kernel N.
SuperKernel is an operator binary fusion technology. Different from source code fusion, SuperKernel focuses on the binary scheduling solution of kernel functions, performs in-depth optimization, and creates a super kernel function (SuperKernel) based on the compiled binary code, a plurality of other kernel functions, that is, sub-kernel, are called in a manner of calling a sub-function. Compared with single-operator delivery, the SuperKernel technology reduces the task scheduling waiting time and scheduling overhead, and uses task gap resources to further optimize the operator header overhead.
You need to ensure that the instruction after this API is called does not interfere with the subsequent operators, which may cause accuracy issues. You are advised to call this API after the last transfer instruction of the entire operator.
Prototype
- This prototype is supported by the following product models:
Atlas A3 training products /Atlas A3 inference products Atlas A2 training products /Atlas A2 inference products Atlas 200I/500 A2 inference products 1 2
template<pipe_t AIV_PIPE = PIPE_MTE3, pipe_t AIC_PIPE = PIPE_FIX> __aicore__ inline void SetNextTaskStart()
- This prototype is supported by the following product models:
Atlas inference product 's AI CoreAtlas training products 1 2
template<pipe_t AIV_PIPE = PIPE_MTE3, pipe_t AIC_PIPE = PIPE_MTE3> __aicore__ inline void SetNextTaskStart()
Parameters
Parameter |
Description |
|---|---|
AIV_PIPE |
Instruction that runs after SetNextTaskStart. If the instruction is in the AIV_PIPE pipeline on the AIV, it can run in parallel with subsequent operators. The value of AIV_PIPE can be PIPE_MTE2, PIPE_MTE3, PIPE_S, or PIPE_V. For details about the pipeline types, see Pipelines. |
AIC_PIPE |
Instruction that runs after SetNextTaskStart. If the instruction is in the AIC_PIPE pipeline on the AIC, it can run in parallel with subsequent operators. The value of AIC_PIPE can be PIPE_MTE1, PIPE_MTE2, PIPE_MTE3, PIPE_FIX, or PIPE_M. For details about the pipeline types, see Pipelines. |
Returns
None
Restrictions
- This API is applicable to the TorchAir graph development scenario and takes effect only after the SuperKernel feature is enabled. For details, see section ""max-autotune Mode" > "Calibrating the SuperKernel Range in a Graph"" in PyTorch Graph Mode User Guide (TorchAir).
- During operator execution, ensure that this API is called on each core and is called only once on each core.
- If this API is called for a tiling key branch of a subkernel, you need to ensure that this API is called for all tiling keys that may be executed by the current operator. Otherwise, the execution will be suspended due to mismatched number of synchronization instructions.
Examples
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 | #include "kernel_operator.h" class KernelEarlyStart { public: __aicore__ inline KernelEarlyStart() {} __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm) { src0Global.SetGlobalBuffer((__gm__ half*)src0Gm); src1Global.SetGlobalBuffer((__gm__ half*)src1Gm); dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm); pipe.InitBuffer(inQueueSrc0, 1, 512 * sizeof(half)); pipe.InitBuffer(inQueueSrc1, 1, 512 * sizeof(half)); pipe.InitBuffer(outQueueDst, 1, 512 * sizeof(half)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<half> src0Local = inQueueSrc0.AllocTensor<half>(); AscendC::LocalTensor<half> src1Local = inQueueSrc1.AllocTensor<half>(); AscendC::DataCopy(src0Local, src0Global, 512); AscendC::DataCopy(src1Local, src1Global, 512); inQueueSrc0.EnQue(src0Local); inQueueSrc1.EnQue(src1Local); } __aicore__ inline void Compute() { AscendC::LocalTensor<half> src0Local = inQueueSrc0.DeQue<half>(); AscendC::LocalTensor<half> src1Local = inQueueSrc1.DeQue<half>(); AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>(); AscendC::Add(dstLocal, src0Local, src1Local, 512); outQueueDst.EnQue<half>(dstLocal); inQueueSrc0.FreeTensor(src0Local); inQueueSrc1.FreeTensor(src1Local); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>(); AscendC::DataCopy(dstGlobal, dstLocal, 512); // Inserted after the last movement instruction of the operator and ensure that the operator is called only once. AscendC::SetNextTaskStart(); outQueueDst.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc0, inQueueSrc1; AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst; AscendC::GlobalTensor<half> src0Global, src1Global, dstGlobal; }; extern "C" __global__ __aicore__ void early_start_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm) { KernelEarlyStart op; op.Init(src0Gm, src1Gm, dstGm); op.Process(); } |
