Performance Degradation Caused by Time-Consuming Framework Scheduling

Symptom

In the same environment and with the same configurations, the serving performance of MindIE 2.0.RC1 is significantly worse than that of MindIE 2.0.T3.

Figure 1 shows the results for MindIE 2.0.T3. The upper part is the LM inference test result: with 300 concurrent requests, the average latency in the Decode phase is 36 ms (see the value of non_first_token_time in the red box). The lower part is the serving inference test result: with 300 concurrent requests, the average latency in the Decode phase is about 66 ms (see the value of DecodeTime in the red box).

Figure 1 Test result for the performance of MindIE 2.0.T3

Figure 2 shows the results for MindIE 2.0.RC1. The upper part is the LM inference test result: with 300 concurrent requests, the average latency in the Decode phase is 35.44 ms (see the value of non_first_token_time in the red box), which is close to that of MindIE 2.0.T3. The lower part is the serving inference test result: with 300 concurrent requests, the average latency in the Decode phase is about 95 ms (see the value of DecodeTime in the red box). Compared with MindIE 2.0.T3, the serving Decode latency rises from about 66 ms to about 95 ms, an increase of roughly 44%, even though the LM inference latency is essentially unchanged.

Figure 2 Test result for the performance of MindIE 2.0.RC1
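The size of the regression can be checked with a short calculation. The sketch below uses the approximate latency values quoted for Figures 1 and 2; the function and variable names are illustrative and do not come from any MindIE tool.

```python
def relative_increase(baseline_ms: float, current_ms: float) -> float:
    """Return the latency increase as a percentage of the baseline."""
    return (current_ms - baseline_ms) / baseline_ms * 100.0


# Approximate average Decode latencies with 300 concurrent requests.
lm_t3_ms, lm_rc1_ms = 36.0, 35.44          # non_first_token_time (LM inference)
serving_t3_ms, serving_rc1_ms = 66.0, 95.0  # DecodeTime (serving inference)

lm_change = relative_increase(lm_t3_ms, lm_rc1_ms)
serving_change = relative_increase(serving_t3_ms, serving_rc1_ms)

print(f"LM inference decode latency change:      {lm_change:+.1f}%")
print(f"Serving inference decode latency change: {serving_change:+.1f}%")
```

The LM inference latency barely moves while the serving latency grows substantially, which suggests that the extra time is spent outside the model forward pass.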

Solution

Use the pre-check tool to dump and compare the configurations, as shown in Figure 3. ms_performance_prechecker_dump_20250520_152124.json is the dump file from the MindIE 2.0.T3 environment, and ms_performance_prechecker_dump_20250520_152138.json is the dump file from the MindIE 2.0.RC1 environment. Apart from environment variables that do not affect performance, such as log settings, no obvious configuration difference is found.
Figure 3 Comparing configurations
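If the two dump files need to be compared outside the pre-check tool, a generic JSON diff is enough. The following is a minimal sketch that assumes each dump is a JSON object (possibly nested); the actual layout of the pre-check dump is not described here, and the keys in the demo data are hypothetical.

```python
import json
from typing import Any

MISSING = "<missing>"


def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys; other values are kept as leaves."""
    if not isinstance(obj, dict):
        return {prefix: obj}
    flat: dict[str, Any] = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def diff_configs(old: dict, new: dict) -> dict[str, tuple[Any, Any]]:
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    a, b = flatten(old), flatten(new)
    return {
        key: (a.get(key, MISSING), b.get(key, MISSING))
        for key in sorted(a.keys() | b.keys())
        if a.get(key, MISSING) != b.get(key, MISSING)
    }


def diff_dump_files(old_path: str, new_path: str) -> dict[str, tuple[Any, Any]]:
    """Load two JSON dump files and diff their contents."""
    with open(old_path, encoding="utf-8") as f_old, open(new_path, encoding="utf-8") as f_new:
        return diff_configs(json.load(f_old), json.load(f_new))


# Hypothetical in-memory data standing in for the two dumps.
t3_dump = {"env": {"LOG_LEVEL": "info"}, "config": {"maxBatchSize": 300}}
rc1_dump = {"env": {"LOG_LEVEL": "debug"}, "config": {"maxBatchSize": 300}}

differences = diff_configs(t3_dump, rc1_dump)
for key, (old_value, new_value) in differences.items():
    print(f"{key}: {old_value!r} -> {new_value!r}")
```

For the files in this case, the call would be diff_dump_files("ms_performance_prechecker_dump_20250520_152124.json", "ms_performance_prechecker_dump_20250520_152138.json"), provided both files are in the working directory.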
Collect the serving performance data of MindIE 2.0.RC1 and compare it with that of MindIE 2.0.T3. The interval between consecutive forward operations in the Decode phase of MindIE 2.0.RC1 is long, which indicates that pre-processing and post-processing on the CPU side take a long time, as shown in Figure 4.
Figure 4 Viewing the forward operations
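The interval between forward operations can also be quantified from exported timeline data. The sketch below assumes the profiling data has already been reduced to a list of (start_ms, end_ms) pairs, one per forward operation; that format and the sample timestamps are hypothetical and are not the output of any specific MindIE tool.

```python
from statistics import mean


def forward_gaps(spans: list[tuple[float, float]]) -> list[float]:
    """Return the idle time between each forward operation and the next one."""
    ordered = sorted(spans)
    return [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


# Made-up timestamps in milliseconds: (start, end) of three forward operations.
spans = [(0.0, 35.0), (95.0, 130.0), (190.0, 225.0)]

gaps = forward_gaps(spans)
forward_durations = [end - start for start, end in spans]

print(f"Mean forward duration: {mean(forward_durations):.1f} ms")
print(f"Mean gap between forwards: {mean(gaps):.1f} ms")
```

A mean gap that is comparable to or larger than the forward duration itself points to CPU-side scheduling, pre-processing, or post-processing as the bottleneck rather than the model computation.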
Enable asynchronous scheduling to shorten the interval between forward operations. The E2E output throughput of MindIE 2.0.RC1 then increases from 2900 tokens/s to 4500 tokens/s, which is 500 tokens/s higher than that of MindIE 2.0.T3. For details about how to enable asynchronous scheduling, see "Asynchronous Scheduling" in MindIE LLM Development Guide.
Figure 5 Enabling asynchronous scheduling
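To see why a shorter interval raises throughput, consider an idealized timing model. This is a simplified sketch and not a description of how MindIE implements asynchronous scheduling: it assumes that, with overlap, the CPU-side work for the next step is fully hidden behind the current forward operation. All numbers are hypothetical.

```python
def sync_step_ms(cpu_ms: float, forward_ms: float) -> float:
    """Serial schedule: CPU pre/post-processing and the forward pass run back to back."""
    return cpu_ms + forward_ms


def overlapped_step_ms(cpu_ms: float, forward_ms: float) -> float:
    """Idealized overlap: the step is bounded by the slower of the two stages."""
    return max(cpu_ms, forward_ms)


def tokens_per_second(batch_size: int, step_ms: float) -> float:
    """Each decode step emits one token per request in the batch."""
    return batch_size * 1000.0 / step_ms


# Hypothetical values, chosen only for illustration.
cpu_ms, forward_ms, batch_size = 30.0, 35.0, 300

serial = sync_step_ms(cpu_ms, forward_ms)
overlapped = overlapped_step_ms(cpu_ms, forward_ms)

print(f"Serial step:     {serial:.1f} ms, {tokens_per_second(batch_size, serial):.0f} tokens/s")
print(f"Overlapped step: {overlapped:.1f} ms, {tokens_per_second(batch_size, overlapped):.0f} tokens/s")
```

Real gains are smaller than this model suggests, because overlap is never perfect and other stages also contribute to E2E latency.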
