Case 2

Background

For the coarse-grained ranking model, the TP99 latency under low load must be reduced to within 5 ms while the number of cores stays limited. (To guarantee performance under high load, the model currently runs on 7 Cube cores and 10 Vector cores.)

Analysis Process

Profiling:

The main goal is to reduce latency under low load, so profiling data is collected in serial mode. Run the model for inference and collect dynamic profiling data with the following command:

msprof --dynamic=on --pid=9527 --output=/home/projects/output --model-execution=on --runtime-api=on --aicpu=on

For details, see msprof.

Out-of-the-box: With no optimization enabled (only the operator-mode configuration is applied: GatherV2 in high-performance mode and HF32 enabled for Cube), a single model execution takes 51 ms, about 10 times the 5 ms target.

Because the operators are executed serially, their execution time is stable. You can therefore analyze a single inference by deduplicating operator names in the profiling data.
The following figure shows that each inference takes about 44 ms of execution time. On top of that, about 7 ms is spent in task_wait (waiting for host scheduling) because of the large number of dynamic operators.
Analyze the shape and surrounding structure of the Gather operator. The dynamic Gather operator uses logical cores and cannot be restricted by the core restriction function of GE. In this case, performance under high load is not considered. In addition, the Gather operator in the figure in step 1 has only 20,000 indexes, which in theory should not take milliseconds. (In practice, the data repetition rate is high, which causes heavy inter-core conflicts.)
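The deduplication idea above can be sketched in a few lines of Python. This is a minimal illustration, not the msprof tooling itself: the (op_name, duration_us) row layout, the function names, and the durations in the sample trace are all hypothetical, and the sketch assumes that operator names are unique within one inference so that a repeated name marks the start of the next one.

```python
def split_inferences(rows):
    """Split a serial operator trace into per-inference groups.

    rows: list of (op_name, duration_us) tuples in execution order
    (hypothetical layout, not the real msprof export format).
    A repeated operator name is taken as the start of the next inference.
    """
    inferences = []
    current = {}
    for name, duration_us in rows:
        if name in current:  # name seen again, so a new inference begins
            inferences.append(current)
            current = {}
        current[name] = duration_us
    if current:
        inferences.append(current)
    return inferences


def total_ms(inference):
    """Sum the operator durations of one inference, in milliseconds."""
    return sum(inference.values()) / 1000.0


# Made-up durations, chosen only to illustrate the grouping.
trace = [
    ("Gather_0", 30000.0), ("Where_0", 9000.0), ("TopK_0", 5000.0),
    ("Gather_0", 30500.0), ("Where_0", 8800.0), ("TopK_0", 4700.0),
]
runs = split_inferences(trace)
for i, run in enumerate(runs):
    print(i, total_ms(run), max(run, key=run.get))
```

Sorting each group by duration then gives the per-inference hot spots without mixing data from different runs.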

According to the structure analysis, this Gather operator is the first Gather operator in the following structure.

Because the Where operator is introduced, the shapes of subsequent operators are not fixed, so a dynamic subgraph is introduced. Moreover, Where is a type 3 operator (its output shape is not fixed). After the dynamic graph executes Where, the shape information must be returned to the host before shape inference for the subsequent operators can run. This causes a large amount of task_wait time across the whole network.
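To see why the output shape of Where cannot be known before execution, consider the following plain Python stand-in. The function name where_indices is hypothetical and the code only mimics the 1-D semantics; it is not the device operator.

```python
def where_indices(mask):
    """Return the positions of truthy elements, like a 1-D Where."""
    return [i for i, flag in enumerate(mask) if flag]


# Two inputs with the same shape (4 elements) ...
a = where_indices([1, 0, 1, 1])
b = where_indices([0, 0, 1, 0])

# ... produce outputs of different lengths, so the shape of every
# downstream operator depends on the data, not only on the input shape.
print(len(a), len(b))  # 3 1
```

Because the length is only known after the operator has run, the host has to wait for the result before it can infer the shapes of the operators that follow.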

According to the execution logic, this substructure extends the original gather logic so that an all-zero vector is returned when an index is less than 0. In addition, the analysis of duplicate data shows that the input actually carries a long segment of padding 0s. A custom gather operator is therefore implemented, and some tables are fully loaded into the UB. This resolves three problems at once: the dynamic subgraph, core limiting not taking effect for the dynamic Gather, and the Gather performance itself.
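The semantics of the extended gather described above can be written down as a small reference function. This is only a sketch of the behavior (negative index gives a zero row); the name gather_with_zero_fill is hypothetical, and the actual custom operator, its UB full-load logic, and its handling of padding are not shown in the source.

```python
def gather_with_zero_fill(table, indices):
    """Row gather where a negative index yields an all-zero row.

    table: list of equally sized rows (list of lists of floats).
    indices: list of ints; any index < 0 selects a zero row.
    """
    width = len(table[0])
    out = []
    for idx in indices:
        if idx < 0:
            out.append([0.0] * width)   # extension: zero vector for idx < 0
        else:
            out.append(list(table[idx]))  # ordinary gather
    return out


table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = gather_with_zero_fill(table, [2, -1, 0, -1])
print(result)
```

Since the output shape depends only on the number of indexes and the row width, this formulation needs no Where operator and therefore no dynamic subgraph.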

Profiling analysis 2

After the preceding problems are solved with the custom operator, collect profiling data again for analysis. The serial execution time drops to about 9 ms, and the performance of default_gather is greatly improved compared with the original scenario.

The Gather operator that is not replaced by DefaultGather is not in the dynamic structure. However, a Gather operator whose sequence length is only 4000 should not, in theory, need 180 μs. (The problem is actually caused by the implementation of the GatherV2 high-performance mode in the full-load scenario.) In terms of operator design, for a table of shape (400, 50), performance improves greatly once the table is fully loaded.

In this case, the custom gather operator is used to cover this scenario as well (the optimization requirement for the built-in operator is raised in parallel).

Profiling analysis 3

After the full-load logic is implemented through CustomGather and the remaining Gather operators are replaced, the model serial time is about 6.1 ms. At this point, the model can be scanned comprehensively for the remaining optimization points.

Continue to analyze the bottleneck. The top operators, TopK and Equal, are implemented with complex vector algorithms because the underlying hardware does not support SIMT. As a result, their latency is long and the operators themselves leave little room for optimization. However, in the actual service, the input of the TopK operator also contains padding 0s. If the valid length is passed to the TopK operator, its average execution time can be greatly reduced (this does not change the TP99, but it does affect the maximum QPS). This optimization depends on the upper-layer framework passing the actual length, so it is not implemented in this example.
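The valid-length idea for TopK can be sketched as follows. This is not implemented in the example, so everything here is an assumption: the function name topk_valid, the valid_len parameter, and the premise that padding sits at the tail of the input and that valid scores are larger than the padding value 0.

```python
import heapq


def topk_valid(scores, k, valid_len):
    """Top-k over the first valid_len entries only; trailing padding is skipped.

    Returns a list of (index, score) pairs, largest score first.
    """
    valid = scores[:valid_len]
    # nlargest works on (score, index) pairs so the index travels with the score
    best = heapq.nlargest(k, ((s, i) for i, s in enumerate(valid)))
    return [(i, s) for s, i in best]


scores = [0.2, 0.9, 0.5, 0.0, 0.0, 0.0]  # three valid scores, three padding 0s
short = topk_valid(scores, 2, 3)          # scans only the valid prefix
full = topk_valid(scores, 2, len(scores)) # scans the padding as well
print(short, full)
```

Under the stated assumptions both calls return the same result, but the first one touches half as many elements. The saving grows with the share of padding, which is why it helps the average time rather than the worst case.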

For the Equal operator, the execution time for the input shapes (1, 4000) and (400, 1) is 230 μs or longer, far below what the Vector unit should deliver. This is caused by the Int64 implementation. According to the test, the Int32 Equal operator with the same shapes takes less than 30 μs, and Int64 should not take nearly an order of magnitude longer than Int32. For this dual-broadcast scenario, a dedicated Int64 Equal implementation is designed, which reduces the Equal operator latency to about 75 μs.
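The source does not describe how the dedicated Int64 Equal is implemented. The sketch below only illustrates the dual-broadcast semantics, together with one common way to express a 64-bit comparison in 32-bit terms (comparing the high and low words separately). Treat the word-splitting approach and all names here as assumptions, not as the operator's actual design.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def split_int64(x):
    """Split a signed 64-bit value into (high, low) unsigned 32-bit words."""
    u = x & MASK64  # two's-complement view of the value
    return (u >> 32) & MASK32, u & MASK32


def equal_dual_broadcast(row, col):
    """Equal with both inputs broadcast.

    row: N values, standing for shape (1, N).
    col: M values, standing for shape (M, 1).
    Returns an M x N nested list of booleans.
    """
    row_words = [split_int64(v) for v in row]
    col_words = [split_int64(v) for v in col]
    # Two values are equal exactly when both 32-bit words are equal.
    return [[c == r for r in row_words] for c in col_words]


row = [1, -1, 2**40]
col = [-1, 2**40 + 1]
out = equal_dual_broadcast(row, col)
print(out)
```

Each input is split once (N + M splits), while the M x N comparisons work purely on 32-bit words, which is the part that dominates for shapes such as (1, 4000) and (400, 1).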

In addition, there are AI CPU operators such as Unique and Where.

The Unique operator depends on the upper-layer graph splitting. (In the current scenario, the graph splitting logic is controlled by users and is not handled in this example.) The Where operator comes from the following structure. You can write a custom pass to eliminate the dynamic structure introduced by Where.

For BatchMatMul, performance also improves after the computation is rewritten as a (2, 400, 300) x (2, 300, 16) product.

After the Equal operator is customized, the Where operator is eliminated, and the BMM operator is replaced, the serial latency of the model is 4.6 ms, which basically meets the performance requirements.
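The latency figures quoted in this case can be put side by side with a short calculation. The stage labels are paraphrased for this sketch. The numbers are the approximate serial latencies reported above, so the comparison against the 5 ms TP99 target is indicative only.

```python
TARGET_MS = 5.0

# (stage, approximate serial latency in ms) as reported in this case
stages_ms = [
    ("out-of-the-box", 51.0),
    ("custom gather replaces dynamic Gather structure", 9.0),
    ("CustomGather full load for remaining Gather operators", 6.1),
    ("custom Equal, Where eliminated, BMM replaced", 4.6),
]

baseline_ms = stages_ms[0][1]
summary = [
    (name, ms, round(baseline_ms / ms, 1), ms <= TARGET_MS)
    for name, ms in stages_ms
]
for name, ms, speedup, within_target in summary:
    print(f"{name}: {ms} ms, {speedup}x vs baseline, within target: {within_target}")
```

The largest single gain comes from removing the dynamic Gather structure; the later steps each contribute a smaller share but are what finally bring the serial latency under the target.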
