AI CPU Operator Elimination

Background

AI CPU operators cannot take advantage of the NPU's SIMD vector units, so their performance on the NPU is generally poor. Such operators should either be eliminated or switched to the CPU for execution.

Restrictions

Currently, AI CPU operators must be identified by manual inspection and the model code modified by hand, which requires the cooperation of the customer.

Cases

Case 1:

For the TopK operator, if the input values are of the int32/int64 type, the operator can be executed only on the AI CPU. If the values are of the fp32 type, the operator can be implemented on the AI Vector unit. If, in the actual scenario, the int32/int64 -> fp32 conversion can be done without precision loss (value < 2^24, the range in which fp32 represents integers exactly), you can write a custom pass that converts a TopK operator with int32/int64 input into cast(int32/int64 -> fp32) + TopK(fp32) + cast(fp32 -> int32/int64).
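The rewrite in Case 1 can be illustrated with a minimal pure-Python sketch. It is not the custom graph pass itself: the helper names (to_fp32, topk, topk_via_fp32) are hypothetical, and fp32 is emulated with the standard struct module only to show why the 2^24 bound matters.

```python
import struct

# Integers with magnitude below 2^24 are exactly representable in fp32.
FP32_EXACT_LIMIT = 1 << 24


def to_fp32(x):
    """Round a number to the nearest IEEE-754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]


def topk(values, k):
    """Reference top-k: the k largest values (descending) and their indices."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order


def topk_via_fp32(int_values, k):
    """Emulates cast(int -> fp32) + TopK(fp32) + cast(fp32 -> int)."""
    # Assumption: the pass is applied only when every value is in the exact range.
    if any(abs(v) >= FP32_EXACT_LIMIT for v in int_values):
        raise ValueError("value is outside the exactly representable fp32 range")
    as_fp32 = [to_fp32(v) for v in int_values]   # cast int -> fp32
    top_vals, top_idx = topk(as_fp32, k)         # TopK on fp32 data
    return [int(v) for v in top_vals], top_idx   # cast fp32 -> int


data = [5, 16777215, -3, 42, 16777214]
values, indices = topk_via_fp32(data, 2)
```

Within the bound, the result matches a direct integer top-k. Outside it, the cast is lossy: for example, to_fp32(16777217) no longer equals 16777217, which is why the sketch rejects such inputs instead of converting them.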

Case 2:

Some AI CPU operators that run on the NPU have no vector implementation. In addition, some operators cannot be switched to the CPU side as they are. In this case, you need to modify the graph so that the AI CPU operator no longer depends on other NPU operators.

As shown in the following figure, the Bucketize operator depends on a gather operator whose indices come from a large NPU subgraph. This subgraph cannot be switched to the CPU as a whole. To offload the Bucketize operator to the CPU, first confirm that it directly depends only on the gather operator. Because gather is an indexing operator that only selects elements, bucketization can be applied to the table of the gather operator instead of to its result. (This requires knowing in advance that the table is not much larger than the gather output.) After the Bucketize operator is moved up to the table branch of the gather operator, it turns out that the squeeze operator does not depend on the order in which it and the Bucketize operator are applied. Therefore, the Bucketize operator can be moved further up, above the input, and executed on the CPU.
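The equivalence that Case 2 relies on can be checked with a small pure-Python sketch. The bucketize and gather functions below are simplified, hypothetical stand-ins for the real operators; in particular, placing a value that equals a boundary into the upper bucket is an assumption made for illustration.

```python
from bisect import bisect_right


def bucketize(values, boundaries):
    """Map each value to its bucket index (boundaries sorted ascending)."""
    # Assumption: a value equal to a boundary falls into the upper bucket.
    return [bisect_right(boundaries, v) for v in values]


def gather(table, indices):
    """Select table entries by index, as an indexing operator does."""
    return [table[i] for i in indices]


boundaries = [0.0, 1.0, 5.0]
table = [-2.0, 0.5, 3.0, 7.5]
indices = [3, 0, 2, 2, 1]   # stands in for the output of the NPU subgraph

# Original graph: gather first, then bucketize the gathered values.
original = bucketize(gather(table, indices), boundaries)

# Rewritten graph: bucketize the table once, then gather the bucket ids.
rewritten = gather(bucketize(table, boundaries), indices)
```

Both orders give the same bucket ids because bucketize acts on each element independently. In the rewritten order, bucketize depends only on the table and the boundaries, not on the NPU-computed indices, so it can run ahead of the NPU subgraph. Its cost then scales with the table size rather than the gather output size, which is why the table should not be much larger than the output.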
