Custom Operator
Background
If the performance of an operator cannot be improved by optimizing the surrounding computing structure through a pass, the operator has to be optimized by hand. However, a general-purpose handwritten operator needs a separate tiling branch for each class of input shape, which requires a large amount of code, and the resulting development time is too long for model optimization work. Therefore, when you do need a custom operator, the best strategy is to use prior knowledge of the model and implement only the optimization branch for the specific network structure.
Cases
Case 1: TopK with valid data length
The TopK operator is used in recommendation inference, for example in the TWINS model. Even after the operator is moved to the vector core through the cast solution in section 1.5.4, TopK still performs poorly on long sequences. In addition, because of hardware differences, the TopK algorithm on the NPU cannot match its performance on the GPU. The custom TopK operator therefore has to be optimized from the model level. In the model, the TopK input comes from a padded sequence, so not all data needs to participate in TopK: on average, only about 10% of the data is valid. Based on this prior knowledge, a custom TopK operator with an additional valid-length input parameter is enough to reduce the average operator time by about 90%.
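The idea of the valid-length parameter can be sketched in NumPy as follows. This is only an illustration of the semantics, not the NPU kernel. The function name topk_with_valid_length and the parameter valid_len are hypothetical, and the sketch assumes that the valid data occupies the front of the padded sequence, which the text above does not state.

```python
import numpy as np

def topk_with_valid_length(values, k, valid_len):
    """Return the k largest of the first valid_len elements, in descending order.

    Assumption: valid data sits at the front of the padded sequence,
    so everything from position valid_len onward is padding and is never read.
    """
    values = np.asarray(values)
    # Clamp the inputs so that the slice and the partition index stay in range.
    valid_len = max(0, min(int(valid_len), values.shape[0]))
    k = max(0, min(int(k), valid_len))
    if k == 0:
        return values[:0], np.empty(0, dtype=np.int64)

    valid = values[:valid_len]            # only the valid prefix is processed
    split = valid_len - k
    # argpartition places the k largest elements at positions split..end.
    candidates = np.argpartition(valid, split)[split:]
    # Sort only those k candidates, largest first.
    order = candidates[np.argsort(valid[candidates])[::-1]]
    return valid[order], order

# The padding value 99 is larger than every valid value but is ignored.
padded = np.array([5.0, 1.0, 9.0, 3.0, 99.0, 99.0, 99.0, 99.0])
top_values, top_indices = topk_with_valid_length(padded, k=2, valid_len=4)
```

Because only the first valid_len elements are touched, the work scales with the valid length rather than with the padded length.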
Case 2: Custom floormod Operator
In the model, some features are bucketed with the floormod operator, as shown in the following figure. The logic is as follows: index 0 stays 0, and any other index is mapped to its remainder modulo the number of buckets, plus 1. The resulting index is then passed to the Gather operator.
Currently, the NPU hardware does not natively support int64 computation, so the floormod operator uses scalar instructions for the int64 type, and performance is poor when the input is large. Mathematically, each int64 value could be split into two int32 values, but the compute logic and sign-bit handling are complex. The structure of this model allows a simpler design. First, the index is never negative (otherwise Gather would report an error). Second, floormod is used only to bucket Gather indexes, so the divisor (the number of buckets) does not need the int64 range. The underlying fmod/division instructions support only fp32/fp16, and a divisor converted to fp32 loses no precision as long as it is within 2^24-1 (about 16.77 million), which meets user requirements. (In practice, the algorithm requires the number of buckets to be within 2^21-1 to ensure performance, which is sufficient in most scenarios.) Therefore, the operator can be designed as int64/fp32, that is, an int64 input with an fp32 divisor, and implemented entirely on the vector unit, which avoids the scalar bottleneck.
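The bucketing rule and the fp32 precision limit can be checked with a short NumPy sketch. It is not the NPU implementation. The function names are hypothetical, and the fp32 variant assumes that the index values as well as the divisor are non-negative integers no larger than 2^24. How the operator handles larger int64 inputs is not described in the text and is not shown here.

```python
import numpy as np

FP32_EXACT_INT_LIMIT = 2 ** 24  # fp32 has a 24-bit significand

def bucketize_reference(index, num_buckets):
    """Index 0 stays 0; any other index maps to floormod(index, num_buckets) + 1."""
    index = np.asarray(index, dtype=np.int64)
    return np.where(index == 0, 0, np.mod(index, num_buckets) + 1)

def bucketize_fp32(index, num_buckets):
    """Same mapping, with the remainder computed in fp32.

    Assumption: every index and num_buckets is a non-negative integer
    no larger than 2**24, so the conversion to fp32 is exact.
    """
    index = np.asarray(index, dtype=np.int64)
    if not 0 < num_buckets <= FP32_EXACT_INT_LIMIT:
        raise ValueError("num_buckets is outside the exact fp32 integer range")
    if index.size and (index.min() < 0 or index.max() > FP32_EXACT_INT_LIMIT):
        raise ValueError("index is outside the exact fp32 integer range")
    # For non-negative operands, fmod and floormod give the same result.
    rem = np.fmod(index.astype(np.float32), np.float32(num_buckets))
    return np.where(index == 0, 0, rem.astype(np.int64) + 1)

# Above 2**24, neighbouring integers collapse to the same fp32 value.
collapses = np.float32(2 ** 24 + 1) == np.float32(2 ** 24)

idx = np.array([0, 1, 7, 10, 2 ** 24])
same = np.array_equal(bucketize_reference(idx, 7), bucketize_fp32(idx, 7))
```

Within the stated range the two functions agree, which is the property the fp32 divisor relies on.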

Case 3: Custom Gather Operator
For some features with default values, the following structure is used to map an index less than 0 to the zero vector and an index greater than 0 to its embedding. In this structure, the Where operator runs as an AI CPU operator and the structure is executed as a dynamic subgraph, which results in poor performance. The solution is to integrate this logic into the Gather operator: the operator checks each index value and then moves either the zero vector or the embedding value to the target position. The overall time consumption is reduced by three orders of magnitude.
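The fused behavior can be sketched in NumPy as follows. This illustrates the semantics only, not the NPU kernel. The function name gather_with_default is hypothetical, and the sketch assumes that index 0 is treated as a normal lookup, which the text above does not specify.

```python
import numpy as np

def gather_with_default(table, indices):
    """Gather rows of table; a negative index yields the zero vector instead.

    Assumption: index 0 is an ordinary lookup into row 0 of the table.
    """
    table = np.asarray(table)
    indices = np.asarray(indices, dtype=np.int64)
    # Start from zero vectors, then overwrite the positions with a valid index.
    out = np.zeros(indices.shape + table.shape[1:], dtype=table.dtype)
    valid = indices >= 0
    out[valid] = table[indices[valid]]
    return out

# A table with three embedding rows of width 2.
embedding = np.arange(6, dtype=np.float32).reshape(3, 2)
rows = gather_with_default(embedding, [2, -1, 0])
```

Handling the check inside the gather removes the separate Where step, so the whole lookup is a single operator.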
