Setting a Proper Number of Cores and Operator Kernel Type

During operator execution, additional startup overhead, or header overhead, may arise for the following reasons:

Core startup: Each core needs to be initialized when it is started, and necessary configurations and resources need to be loaded.
Core addressing TLB miss: When a core accesses the memory, if the Translation Lookaside Buffer (TLB) does not contain the corresponding page table entry, the page table entry needs to be loaded from the memory, which causes additional latency.
Same-address access conflict: Due to hardware limitations, conflicts may occur when multiple cores access the same memory address at the same time, resulting in additional latency.
Variable resource initialization: Before operator execution, some variables and resources need to be initialized, which may also cause additional performance overhead.

The header overhead increases with the number of used cores. The following figure shows the change of the header overhead with the number of started cores.

Figure 1 Header overhead changes with the number of started cores

For operators whose overall execution duration is at the microsecond level and whose single-core compute time is relatively short, you can reduce the number of started cores and increase the amount of compute per core to improve performance. This optimization is essentially a trade-off between the time spent on header overhead and the time spent on single-core compute: fewer cores lower the header overhead but lengthen the compute time on each core. The optimal number of cores depends on the operator and the hardware, so you need to find it by measuring several candidate values.
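
The trade-off described above can be sketched with a toy cost model. This is a minimal illustration, not a measured result: it assumes that header overhead grows linearly with the number of started cores and that compute divides evenly across cores. The names EstimateTimeUs and PickCoreCount and all numeric values are hypothetical; in practice, profiling data should replace the model.

```cpp
#include <cstdint>

// Result of the search: the chosen core count and its estimated total time.
struct CoreChoice {
    uint32_t cores;
    double totalUs;
};

// Toy model (assumption): header overhead grows linearly with the number of
// started cores, and the compute work is split evenly across the cores.
double EstimateTimeUs(uint32_t cores, double headerUsPerCore, double totalComputeUs) {
    return headerUsPerCore * cores + totalComputeUs / cores;
}

// Try every core count from 1 to maxCores and keep the one with the
// smallest estimated time. Ties keep the smaller core count.
CoreChoice PickCoreCount(uint32_t maxCores, double headerUsPerCore, double totalComputeUs) {
    CoreChoice best{1, EstimateTimeUs(1, headerUsPerCore, totalComputeUs)};
    for (uint32_t c = 2; c <= maxCores; ++c) {
        const double t = EstimateTimeUs(c, headerUsPerCore, totalComputeUs);
        if (t < best.totalUs) {
            best = CoreChoice{c, t};
        }
    }
    return best;
}
```

With the illustrative inputs PickCoreCount(40, 0.25, 16.0), the model selects 8 cores rather than all 40, because beyond that point the added header overhead outweighs the saved compute time. For a much larger workload such as PickCoreCount(40, 0.25, 10000.0), the same model selects all 40 cores.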

For a custom operator project, you can call the SetBlockDim API in TilingFunc (the default function that the operator project provides for tiling computation on the host) to set the number of cores used by the operator. For details, see SetBlockDim. For a kernel launch project, you specify the number of cores used by the operator when calling the kernel through <<<>>>.
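
As a rough sketch of how a tiling function might derive the value it passes to SetBlockDim, the example below caps the core count so that each core receives at least a minimum amount of work. MockTilingContext is a hypothetical stand-in that models only a SetBlockDim call; the real context type comes from the operator project. ChooseBlockDim, TilingFuncSketch, and the constants are likewise assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for the tiling context. It only records the value
// passed to SetBlockDim so the sketch can run without the toolkit headers.
struct MockTilingContext {
    uint32_t blockDim = 0;
    void SetBlockDim(uint32_t n) { blockDim = n; }
};

// Pick a core count such that every core processes at least
// minElemsPerCore elements, without exceeding the available cores.
uint32_t ChooseBlockDim(uint64_t totalElems, uint64_t minElemsPerCore, uint32_t availableCores) {
    if (minElemsPerCore == 0) {
        minElemsPerCore = 1;  // avoid division by zero
    }
    uint64_t wanted = std::max<uint64_t>(totalElems / minElemsPerCore, 1);
    return static_cast<uint32_t>(std::min<uint64_t>(wanted, availableCores));
}

// Sketch of the host-side tiling step: compute the core count, then set it.
void TilingFuncSketch(MockTilingContext& ctx, uint64_t totalElems) {
    constexpr uint64_t kMinElemsPerCore = 4096;  // illustrative threshold
    constexpr uint32_t kAvailableCores = 40;     // illustrative core count
    ctx.SetBlockDim(ChooseBlockDim(totalElems, kMinElemsPerCore, kAvailableCores));
}
```

For a small input of 16384 elements, this sketch sets the core count to 4 instead of 40, which keeps the header overhead low while each core still has a meaningful amount of work. The threshold itself should be tuned by measurement.
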
In addition, the kernel type of the operator also affects the number of cores started for the operator. Take a pure vector operator as an example. If the operator is executed in hybrid startup mode, the scheduler starts both the AI Vector cores and the AI Cube cores. In this case, the AI Cube cores have no compute instructions to execute but still incur the overhead of core startup and initialization. Therefore, you are advised to set the kernel type so that only the cores the operator actually needs are started.
Generally, the kernel type is identified automatically based on the instructions used by the operator. However, automatic identification cannot determine the ratio of AICs to AIVs, so tasks are delivered in the default ratio of 1:2 (AICs to AIVs). Automatic identification may also fail, because it depends on the result of compilation optimization. Therefore, you are advised to manually set the kernel type of the operator. For details, see Setting Kernel Type.
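
To make the cost of hybrid startup concrete, the sketch below counts how many cores of each kind are started for a pure vector operator under different kernel types. It is a simplified accounting model under stated assumptions: KernelKind and its values are hypothetical labels rather than the toolkit's identifiers, and a hybrid task unit is assumed to start one AIC together with two AIVs, following the default 1:2 ratio mentioned above.

```cpp
#include <cstdint>

// Hypothetical labels for this sketch; not the toolkit's identifiers.
enum class KernelKind { VectorOnly, CubeOnly, Mix1To2 };

// Number of cores of each kind that are started.
struct StartedCores {
    uint32_t aic;  // AI Cube cores
    uint32_t aiv;  // AI Vector cores
};

// Assumption: in vector-only or cube-only mode, each task unit starts one
// core of that kind; in hybrid 1:2 mode, each unit starts 1 AIC and 2 AIVs.
StartedCores CountStartedCores(KernelKind kind, uint32_t units) {
    switch (kind) {
        case KernelKind::VectorOnly:
            return StartedCores{0, units};
        case KernelKind::CubeOnly:
            return StartedCores{units, 0};
        case KernelKind::Mix1To2:
            return StartedCores{units, 2 * units};
    }
    return StartedCores{0, 0};  // unreachable for valid enum values
}
```

Under this model, a pure vector operator that needs 8 AIVs starts 8 AIVs and no AICs in vector-only mode, whereas hybrid mode with 4 units starts the same 8 AIVs plus 4 AICs that execute no compute instructions yet still pay startup and initialization costs.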
