Clarifying the Optimization Objective

Users usually focus on performance metrics such as first token latency and throughput. The number of concurrent requests and the input and output lengths strongly affect both metrics, so guide the customer to test with the concurrency and sequence lengths of the actual workload.

Low-latency scenario: real-time interaction (such as a dialog system), focusing on the speed of generating the first token (optimization in the Prefill phase).
High-throughput scenario: offline batch processing (such as document generation), focusing on tokens per second (optimization in the Decode phase).

Sometimes, the performance objective differs greatly from the performance actually achieved. In that case, use the LLM inference performance to estimate the upper limit of latency or throughput, and then determine whether the objective can be achieved.

Evaluating the Upper Limit of the Serving Performance

Based on the scenario, the objectives, and the LLM inference test results, the following optimistic methods can be used to estimate the upper limit of the serving performance:

Single concurrency: The performance is close to that of the LLM inference in single-concurrency mode.
Average first token delay in the case of a large number of concurrent requests.
- Inference in the Prefill phase is compute-bound. Once Prefill Batchsize is large enough, the computing bottleneck is reached, and from that point on the Prefill delay increases linearly with Batchsize.
- Run an LLM inference test to find the Prefill Batchsize at which the computing bottleneck is reached, then divide the concurrent service requests into batches of that size to calculate the upper limit of the first token delay from the LLM inference result. Assume that the Prefill Batchsize at the bottleneck is B, the LLM inference calculation time of a single Prefill batch of that size is T, and the actual concurrency to be estimated is N. The total Prefill time of the concurrent requests in this round is then estimated to be (N/B) × T. Under the first-come first-served (FCFS) scheduling policy, the average first token delay is about half of that, (N/B) × T/2.
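
The first token delay estimate can be sketched in a few lines of Python. This is an illustration under stated assumptions, not part of MindIE: the function name and the sample numbers (256 concurrent requests, a bottleneck Prefill Batchsize of 32, 0.5 s per Prefill batch) are hypothetical, all requests are assumed to arrive at the same time, and every batch, including a partially filled last one, is assumed to take the full single-batch time.

```python
import math
from typing import Tuple


def estimate_first_token_delay(
    concurrency: int, bottleneck_batch: int, batch_time_s: float
) -> Tuple[float, float]:
    """Return (total Prefill time, average first token delay) in seconds."""
    if concurrency <= 0 or bottleneck_batch <= 0:
        raise ValueError("concurrency and bottleneck_batch must be positive")

    # Requests are split into Prefill batches of the bottleneck size.
    num_batches = math.ceil(concurrency / bottleneck_batch)
    total_prefill_s = num_batches * batch_time_s

    # Under FCFS, the k-th batch (1-based) gets its first tokens at
    # k * batch_time_s, so average that finish time over all requests.
    weighted_sum = 0.0
    remaining = concurrency
    for k in range(1, num_batches + 1):
        size = min(bottleneck_batch, remaining)
        weighted_sum += size * k * batch_time_s
        remaining -= size
    return total_prefill_s, weighted_sum / concurrency


# Hypothetical measurements, for illustration only.
total_s, avg_s = estimate_first_token_delay(256, 32, 0.5)
print(f"total Prefill time: {total_s:.2f} s, average first token delay: {avg_s:.2f} s")
```

With many batches, the average approaches half of the total Prefill time, which is why the FCFS average is usually quoted as roughly half of the total.
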
Throughput and non-first token latency in the case of high concurrency.
- The Decode phase is limited by memory bandwidth rather than by computing power. Theoretically, a larger Decode Batchsize yields a higher total throughput until the graphics processing unit (GPU) memory limit is reached.
- For the output throughput and non-first token latency of serving, use the LLM inference result at the same concurrency (that is, the same Decode Batchsize) as the theoretical upper limit. Because serving adds time overheads such as framework scheduling, an optimistic estimate of the serving throughput is 0.8 to 0.9 times the LLM inference throughput.
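
As a quick sketch of this rule of thumb, the snippet below applies the 0.8 to 0.9 factor to a measured LLM inference throughput. The function name and the sample values (1000 tokens/s, 40 ms per token) are hypothetical. Scaling the non-first token latency by the inverse of the same factor is an additional assumption that holds only if the Decode Batchsize stays fixed.

```python
from typing import Tuple


def estimate_serving_range(
    inference_tps: float,
    inference_latency_ms: float,
    low_factor: float = 0.8,
    high_factor: float = 0.9,
) -> Tuple[Tuple[float, float], Tuple[float, float]]:
    """Return ((min_tps, max_tps), (best_latency_ms, worst_latency_ms))."""
    # Serving throughput is estimated as 0.8x to 0.9x of pure inference.
    tps_range = (inference_tps * low_factor, inference_tps * high_factor)
    # Assumption: at a fixed batch size, per-token latency grows by the
    # inverse of the throughput factor.
    latency_range = (
        inference_latency_ms / high_factor,
        inference_latency_ms / low_factor,
    )
    return tps_range, latency_range


# Hypothetical inference results at the target Decode Batchsize.
tps_range, latency_range = estimate_serving_range(1000.0, 40.0)
print(f"serving throughput: {tps_range[0]:.0f} to {tps_range[1]:.0f} tokens/s")
print(f"non-first token latency: {latency_range[0]:.1f} to {latency_range[1]:.1f} ms")
```
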
Theoretical upper limit of the maximum concurrency (maxBatchSize) under the GPU memory bottleneck: In the Decode phase, the value of maxBatchSize is mainly limited by the GPU memory and can be calculated from the number of KV cache blocks and the sequence length.
- The KV Cache pool is allocated by block. The block size is specified by cacheBlockSize in the MindIE serving parameters. Generally, the default value is 128. That is, each block can store 128 tokens. The number of KV cache blocks occupied by each request is calculated based on the context length in the application scenario.
  - Maximum number of blocks = Ceil (Number of input tokens/cacheBlockSize) + Ceil (Maximum number of output tokens/cacheBlockSize)
  - Average number of blocks = Ceil (Average number of input tokens/cacheBlockSize) + Ceil (Average number of output tokens/cacheBlockSize)
- The total number of available KV cache blocks (Total Block Num) on the NPU of the MindIE server can be obtained from the test.
  - Clear all old log files in /root/mindie/log. Then enable the environment variables MINDIE_LLM_PYTHON_LOG_LEVEL and MINDIE_LLM_PYTHON_LOG_TO_FILE so that the Python component of MindIE LLM generates INFO logs at run time and writes them to a file.
  - After determining all serving configurations other than maxBatchSize, leave maxBatchSize at its default value and start the service. After the service is successfully started on the server, run the grep command to search for the keyword npuBlockNum in the MindIE LLM Python program logs. Multiple results are returned, depending on the number of cards used. The minimum value is the total number of available KV cache blocks on the NPU of the MindIE server. Take Figure 1 as an example. The total number of available KV cache blocks is 817.
    Figure 1 Example for obtaining the total number of available KV cache blocks on the NPU of the MindIE server
- To ensure optimal performance, the value of maxBatchSize should not exceed Floor [Total block number/Maximum number of blocks]. If the context length varies greatly between requests, the upper limit can be raised to Floor [Total block number/Average number of blocks].
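
The block arithmetic above can be checked with a short Python sketch. The helper names and the token lengths (maximum 2048 input and 512 output tokens, average 1000 input and 200 output tokens) are hypothetical. The block size of 128 and the total of 817 blocks are taken from the default value and the Figure 1 example mentioned above.

```python
import math


def blocks_per_request(
    input_tokens: int, output_tokens: int, cache_block_size: int = 128
) -> int:
    """KV cache blocks needed by one request (input and output rounded up separately)."""
    return math.ceil(input_tokens / cache_block_size) + math.ceil(
        output_tokens / cache_block_size
    )


def max_batch_size_limit(total_blocks: int, blocks_per_req: int) -> int:
    """Floor [Total block number / Number of blocks per request]."""
    return total_blocks // blocks_per_req


TOTAL_BLOCKS = 817  # minimum npuBlockNum found in the logs (Figure 1 example)

# Hypothetical context lengths for the application scenario.
max_blocks = blocks_per_request(2048, 512)  # 16 + 4 = 20 blocks
avg_blocks = blocks_per_request(1000, 200)  # 8 + 2 = 10 blocks

conservative_limit = max_batch_size_limit(TOTAL_BLOCKS, max_blocks)
relaxed_limit = max_batch_size_limit(TOTAL_BLOCKS, avg_blocks)
print(f"maxBatchSize limit (maximum lengths): {conservative_limit}")
print(f"maxBatchSize limit (average lengths): {relaxed_limit}")
```

The conservative limit guarantees that every request in a batch can reach its maximum length without running out of blocks. The relaxed limit trades that guarantee for higher concurrency when most requests are much shorter than the maximum.
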
