WaitPreTaskEnd
This API is for trial use and may be adjusted or updated in later versions. Compatibility is not guaranteed. Stay tuned for future updates.
Supported Products
Product |
Supported/Unsupported |
Remarks |
|---|---|---|
√ |
This interface takes effect. |
|
√ |
Only compilation compatibility is ensured. The actual function does not take effect. |
|
√ |
Only compilation compatibility is ensured. The actual function does not take effect. |
|
√ |
Only compilation compatibility is ensured. The actual function does not take effect. |
|
√ |
Only compilation compatibility is ensured. The actual function does not take effect. |
|
√ |
Only compilation compatibility is ensured. The actual function does not take effect. |
Functions
is called in the sub-kernel of SuperKernel. The instructions before calling can be implemented in parallel with other sub-kernel to improve the overall performance. As shown in Figure 1, SuperKernel calls sub-kernel in sequence. To ensure that data of sub-kernel does not interfere with each other, operators are inserted between sub-kernel for order preserving. The instruction before the sub-kernel N+1 calls this interface is parallel with the previous sub-kernel N.
SuperKernel is an operator binary fusion technology. Different from source code fusion, SuperKernel focuses on the binary scheduling solution of kernel functions, performs in-depth optimization, and creates a super kernel function (SuperKernel) based on the compiled binary code, a plurality of other kernel functions, that is, sub-kernel, are called in a manner of calling a sub-function. Compared with single-operator delivery, the SuperKernel technology reduces the task scheduling waiting time and scheduling overhead, and uses task gap resources to further optimize the operator header overhead.
You need to ensure that the instruction before calling this API does not interfere with the previous operator, which may cause accuracy issues. You are advised to call this API before the first operator transfer instruction.
Prototype
1 | __aicore__ inline void WaitPreTaskEnd() |
Command-Line Options
None
Return Value Description
None
Restrictions
- This API is applicable to the TorchAir graph development scenario and takes effect only after the SuperKernel feature is enabled. For details, see section ""max-autotune Mode" > "Calibrating the SuperKernel Range in a Graph"" in PyTorch Graph Mode User Guide (TorchAir).
- During operator execution, ensure that this API is called on each core and is called only once on each core.
- If this API is called for a tiling key branch of a subkernel, you need to ensure that this API is called for all tiling keys that may be executed by the current operator. Otherwise, the execution will be suspended due to mismatched number of synchronization instructions.
Examples
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 | #include "kernel_operator.h" class KernelEarlyStart { public: __aicore__ inline KernelEarlyStart() {} __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm) { src0Global.SetGlobalBuffer((__gm__ half*)src0Gm); src1Global.SetGlobalBuffer((__gm__ half*)src1Gm); dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm); pipe.InitBuffer(inQueueSrc0, 1, 512 * sizeof(half)); pipe.InitBuffer(inQueueSrc1, 1, 512 * sizeof(half)); pipe.InitBuffer(outQueueDst, 1, 512 * sizeof(half)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<half> src0Local = inQueueSrc0.AllocTensor<half>(); AscendC::LocalTensor<half> src1Local = inQueueSrc1.AllocTensor<half>(); // Inserted before the first movement instruction of the operator and ensure that the operator is called only once. AscendC::WaitPreTaskEnd(); AscendC::DataCopy(src0Local, src0Global, 512); AscendC::DataCopy(src1Local, src1Global, 512); inQueueSrc0.EnQue(src0Local); inQueueSrc1.EnQue(src1Local); } __aicore__ inline void Compute() { AscendC::LocalTensor<half> src0Local = inQueueSrc0.DeQue<half>(); AscendC::LocalTensor<half> src1Local = inQueueSrc1.DeQue<half>(); AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>(); AscendC::Add(dstLocal, src0Local, src1Local, 512); outQueueDst.EnQue<half>(dstLocal); inQueueSrc0.FreeTensor(src0Local); inQueueSrc1.FreeTensor(src1Local); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>(); AscendC::DataCopy(dstGlobal, dstLocal, 512); outQueueDst.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc0, inQueueSrc1; AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst; AscendC::GlobalTensor<half> src0Global, src1Global, dstGlobal; }; extern "C" __global__ __aicore__ void early_start_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm) { KernelEarlyStart op; op.Init(src0Gm, src1Gm, dstGm); op.Process(); } |
