SuperKernel Development

SuperKernel is a binary fusion technology for operators. Unlike source-code fusion, it works on compiled kernel binaries and optimizes how kernel functions are scheduled. A super kernel function (the SuperKernel) is built from the compiled binary code and invokes multiple other kernel functions (subkernels) as sub-function calls. Compared with delivering operators one at a time, SuperKernel reduces task scheduling wait time and scheduling overhead, and uses the gaps between tasks to further reduce operator header overhead.

SuperKernel applies only to static graphs.
SuperKernel is supported by the following models:

SuperKernel Supported by Custom Operators

Developing a custom operator that supports SuperKernel is similar to developing a common operator, subject to the specific restrictions described below. Currently, the SuperKernel feature can be used only in the PyTorch framework. Therefore, after integrating an operator into a GE graph, you also need to integrate it into a PyTorch graph by referring to "Integrating Custom Operators into a Graph" in PyTorch Graph Mode User Guide (TorchAir). In addition, TorchAir provides the capability of calibrating the scope of SuperKernel: you can annotate operators and configure optimizations within the fusion scope based on actual service requirements. For details, see section in .

The specific restrictions during development are as follows:

If full-core synchronization is implemented for a custom operator, ensure that the number of cores launched by the subkernel (that is, the operator) is the same as that launched by the SuperKernel. If the subkernel launches fewer cores than the SuperKernel, full-core synchronization will wait for all cores to complete, causing a stall and timeout.
Note: The number of cores launched by the SuperKernel is the maximum number of cores launched by any of its subkernels. For example, if the SuperKernel includes operator a (launching four cores) and operator b (launching two cores), the SuperKernel launches four cores.
- When SyncAll is used, you can enable feed-sync-all during the calibration of the SuperKernel scope to address this problem. In this case, the system inserts the SyncAll instruction into the remaining cores of the subkernel in the SuperKernel to prevent timeout.
- If hardware synchronization APIs CrossCoreSetFlag and CrossCoreWaitFlag are used to implement full-core synchronization, the number of cores launched by the subkernel must be the same as that launched by the SuperKernel.
If the kernel type of a custom operator is set to KERNEL_TYPE_MIX_AIC_1_1 and the operator uses the hardware synchronization APIs (CrossCoreSetFlag and CrossCoreWaitFlag) between the AIC and AIV, note that the SuperKernel may adjust its launch ratio based on the number of launched cores. In this case, the operator must also support the 1:2 launch ratio in the SuperKernel to ensure that the hardware synchronization between the AIC and AIV is correctly performed. For example, instead of specifying only certain AIV cores to call the hardware synchronization APIs, ensure that all AIV cores call them. This prevents mismatches in synchronization counts that could cause stalls and timeouts.
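
The core-count rule in the note above can be checked on the host before fusion. The following is a minimal sketch in plain C++, not CANN or TorchAir code: the SubKernel struct and both function names are hypothetical and exist only for illustration. It computes the SuperKernel core count as the maximum over the subkernels and lists the subkernels that use full-core synchronization while launching fewer cores than the SuperKernel.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical host-side description of one subkernel; not a CANN type.
struct SubKernel {
    std::string name;
    int launchedCores;      // cores the operator launches on its own
    bool usesFullCoreSync;  // true if the operator synchronizes across all cores
};

// Per the note: the SuperKernel launches as many cores as its largest subkernel.
int SuperKernelCores(const std::vector<SubKernel>& subs) {
    int cores = 0;
    for (const auto& s : subs) {
        assert(s.launchedCores > 0);
        cores = std::max(cores, s.launchedCores);
    }
    return cores;
}

// Names of subkernels that use full-core synchronization but launch fewer
// cores than the SuperKernel, which is the stall-and-timeout case described above.
std::vector<std::string> UnderLaunchedSyncKernels(const std::vector<SubKernel>& subs) {
    const int superCores = SuperKernelCores(subs);
    std::vector<std::string> names;
    for (const auto& s : subs) {
        if (s.usesFullCoreSync && s.launchedCores < superCores) {
            names.push_back(s.name);
        }
    }
    return names;
}
```

With operator a (four cores, no full-core synchronization) and operator b (two cores, full-core synchronization), SuperKernelCores returns 4 and UnderLaunchedSyncKernels reports b. Such an operator would need feed-sync-all if it uses SyncAll, or a matching core count if it uses CrossCoreSetFlag and CrossCoreWaitFlag.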

When developing a custom operator, you must ensure that the DataCacheCleanAndInvalid instruction is correctly inserted, as required, for all GM read and write operations performed by the Scalar Unit. In the single-operator compilation scenario, the BiSheng Compiler automatically appends the DataCacheCleanAndInvalid instruction at the end of the operator to refresh the entire DCache. In a SuperKernel, however, subkernels are processed as common functions, and the compiler does not automatically insert this instruction to ensure data cache consistency. Because this fallback is no longer present, make sure that your operator does not depend on it; otherwise, an operator that works when compiled alone may produce errors after fusion.
Custom operators with tiling offload enabled cannot be fused into a SuperKernel.
When the GetBlockNum API is called in a subkernel to obtain the number of cores, the returned value remains unchanged regardless of whether the subkernel is fused into a SuperKernel or how many cores the SuperKernel launches. Therefore, you can use this API in the same way as you would when developing a common operator, without needing to pay special attention to the number of cores launched by the SuperKernel.
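
Because the value returned by GetBlockNum is stated to be the same inside and outside a SuperKernel, any work split derived from it is also unchanged. The following host-side sketch in plain C++ illustrates one common way to split a buffer across cores. The Range struct and the SplitRange helper are hypothetical, and blockIdx stands for the core's index within the subkernel, which is an assumption of this sketch rather than something the text above specifies.

```cpp
#include <cassert>
#include <cstdint>

// Half-open element range [begin, end) handled by one core.
struct Range {
    int64_t begin;
    int64_t end;
};

// Hypothetical helper: split `total` elements across `blockNum` cores,
// giving the first (total % blockNum) cores one extra element.
// In a kernel, blockNum would be the value returned by GetBlockNum.
Range SplitRange(int64_t total, int64_t blockNum, int64_t blockIdx) {
    assert(blockNum > 0 && blockIdx >= 0 && blockIdx < blockNum);
    const int64_t base = total / blockNum;
    const int64_t extra = total % blockNum;
    const int64_t begin = blockIdx * base + (blockIdx < extra ? blockIdx : extra);
    const int64_t len = base + (blockIdx < extra ? 1 : 0);
    return {begin, begin + len};
}
```

For 10 elements and a block count of 4, the four cores get [0, 3), [3, 6), [6, 8), and [8, 10). The split depends only on the subkernel's own block count, so it does not change if the enclosing SuperKernel launches more cores.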

In addition, during programming on the kernel side, you can call SetNextTaskStart and WaitPreTaskEnd to further improve performance.

After a subkernel calls SetNextTaskStart, the instructions that follow the call can run in parallel with the subsequent subkernel, improving overall performance. As shown in Figure 1, the SuperKernel calls subkernels in sequence and inserts inter-operator synchronization between them to prevent data interference and preserve execution order. Once subkernel_N-1 calls this API, its remaining instructions execute in parallel with subkernel_N.
Figure 1 Parallelism implemented by SetNextTaskStart
When a subkernel calls WaitPreTaskEnd, the instructions that precede the call can run in parallel with the previous subkernel, improving overall performance. As shown in Figure 2, the SuperKernel calls subkernels in sequence and inserts inter-operator synchronization between them to prevent data interference and preserve execution order. When subkernel_N+1 calls this API, the instructions before the call execute in parallel with subkernel_N.
Figure 2 Parallelism implemented by WaitPreTaskEnd
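
The effect of the two APIs on total run time can be estimated with a toy timeline model. The sketch below is plain C++ and is not derived from the actual scheduler. The Phases struct, both function names, and the timing rules are assumptions made for illustration: every subkernel is assumed to call both APIs, the next subkernel is assumed to start as soon as the current one reaches SetNextTaskStart, and the part after WaitPreTaskEnd is assumed to start only when the previous subkernel has fully ended.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical phase durations of one subkernel, in arbitrary time units.
struct Phases {
    int head;  // instructions before WaitPreTaskEnd
    int body;  // instructions between WaitPreTaskEnd and SetNextTaskStart
    int tail;  // instructions after SetNextTaskStart
};

// Fully ordered execution: each subkernel starts after the previous one ends.
int SerialTime(const std::vector<Phases>& kernels) {
    int total = 0;
    for (const auto& k : kernels) {
        total += k.head + k.body + k.tail;
    }
    return total;
}

// Toy overlap model (assumption): the head of subkernel i+1 may start when
// subkernel i reaches SetNextTaskStart; its body starts only after
// subkernel i has fully ended and its own head is done.
int OverlappedTime(const std::vector<Phases>& kernels) {
    int headStart = 0;  // earliest start time of the current subkernel
    int prevEnd = 0;    // time at which the previous subkernel fully ended
    for (const auto& k : kernels) {
        assert(k.head >= 0 && k.body >= 0 && k.tail >= 0);
        const int bodyStart = std::max(headStart + k.head, prevEnd);
        const int bodyEnd = bodyStart + k.body;
        headStart = bodyEnd;         // the next subkernel may start here
        prevEnd = bodyEnd + k.tail;  // this subkernel ends after its tail
    }
    return prevEnd;
}
```

Under these assumptions, two subkernels with phases {1, 5, 2} and {3, 4, 1} take 16 units when fully ordered and 14 units with overlap, because the 2-unit tail of the first runs alongside the 3-unit head of the second. When all head and tail durations are zero, the two functions return the same value, which matches the case where neither API is used.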
