Performance testing uses the vLLM benchmark suite as an example. Before using it, fetch the vLLM repository:

git clone -b v0.7.3 https://github.com/vllm-project/vllm.git
Start the vLLM OpenAI-compatible API server:

python -m vllm.entrypoints.openai.api_server --model=/{model path}/Qwen2.5-7B-Instruct --enforce-eager -tp 4 --port 8000 --block-size=128
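Once the server is up, a quick sanity check can be sent to its OpenAI-compatible /v1/completions endpoint. The sketch below is a minimal illustration: it only builds the request payload and the equivalent curl invocation (the base URL, prompt, and token limit are assumptions); actually sending it requires the server started by the command above to be running.

```python
import json

# Assumed values for illustration; adjust to match the server started above.
base_url = "http://localhost:8000"   # matches --port 8000
model = "/{model path}/Qwen2.5-7B-Instruct"  # must match the --model argument

payload = {
    "model": model,
    "prompt": "Hello, vLLM!",
    "max_tokens": 32,
}

# Equivalent curl command for a quick manual check against the live server:
curl_cmd = (
    f"curl {base_url}/v1/completions "
    "-H 'Content-Type: application/json' "
    f"-d '{json.dumps(payload)}'"
)
print(curl_cmd)
```

If the server is healthy, the endpoint returns a JSON completion object for the prompt.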
Run the serving benchmark from the repository root:

cd {vllm repository path}/vllm
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /{model path}/Qwen2.5-7B-Instruct \
    --dataset-name random \
    --tokenizer /{model path}/Qwen2.5-7B-Instruct \
    --random-input-len 256 \
    --num-prompts 640 \
    --port 8000
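With --dataset-name random, the total input size is fixed by the flags above: each of the 640 prompts carries 256 input tokens. A quick check of that arithmetic, which should match the "Total input tokens" line in the report:

```python
random_input_len = 256   # --random-input-len
num_prompts = 640        # --num-prompts

# Every random prompt contributes random_input_len input tokens,
# so the benchmark should report this many total input tokens:
total_input_tokens = random_input_len * num_prompts
print(total_input_tokens)  # 163840
```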
The benchmark output looks similar to the following:
============ Serving Benchmark Result ============
Successful requests:                     640
Benchmark duration (s):                  22.74
Total input tokens:                      163840
Total generated tokens:                  80593
Request throughput (req/s):              28.14
Output token throughput (tok/s):         3543.65
Total Token throughput (tok/s):          10747.65
---------------Time to First Token----------------
Mean TTFT (ms):                          9263.34
Median TTFT (ms):                        10661.86
P99 TTFT (ms):                           19515.53
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          54.21
Median TPOT (ms):                        51.16
P99 TPOT (ms):                           71.11
---------------Inter-token Latency----------------
Mean ITL (ms):                           50.04
Median ITL (ms):                         41.61
P99 ITL (ms):                            624.80
==================================================
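The throughput figures in the report are simple ratios of the counters above: requests, generated tokens, and total tokens, each divided by the benchmark duration. Re-deriving them from the printed (rounded) values reproduces the report to within rounding error:

```python
import math

# Counters copied from the report above.
duration_s = 22.74
successful_requests = 640
total_input_tokens = 163840
total_generated_tokens = 80593

request_throughput = successful_requests / duration_s          # req/s
output_token_throughput = total_generated_tokens / duration_s  # tok/s
total_token_throughput = (total_input_tokens + total_generated_tokens) / duration_s

# The printed duration is rounded, so compare with a small relative tolerance.
assert math.isclose(request_throughput, 28.14, rel_tol=1e-3)
assert math.isclose(output_token_throughput, 3543.65, rel_tol=1e-3)
assert math.isclose(total_token_throughput, 10747.65, rel_tol=1e-3)
print(round(request_throughput, 2))  # 28.14
```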