Performance Testing

This section uses the vLLM benchmark as an example of performance testing. Before using the tool, run the following command to obtain the vllm code repository: git clone -b v0.7.3 https://github.com/vllm-project/vllm.git.

  1. Run the following command to start the service:
    python -m vllm.entrypoints.openai.api_server --model=/{model path}/Qwen2.5-7B-Instruct --enforce-eager -tp 4 --port 8000 --block-size=128
  2. Change into the following directory under the path where the vllm repository was cloned:
    cd {vllm repository download path}/vllm
  3. A performance test example is shown below:
    python benchmarks/benchmark_serving.py \
        --backend vllm  \
        --model /{model path}/Qwen2.5-7B-Instruct   \
        --dataset-name random  \
        --tokenizer /{model path}/Qwen2.5-7B-Instruct  \
        --random-input-len 256   \
        --num-prompts 640  \
        --port 8000

    The performance test results are as follows:

    ============ Serving Benchmark Result ============
    Successful requests:                     640       
    Benchmark duration (s):                  22.74     
    Total input tokens:                      163840    
    Total generated tokens:                  80593     
    Request throughput (req/s):              28.14     
    Output token throughput (tok/s):         3543.65   
    Total Token throughput (tok/s):          10747.65  
    ---------------Time to First Token----------------
    Mean TTFT (ms):                          9263.34   
    Median TTFT (ms):                        10661.86  
    P99 TTFT (ms):                           19515.53  
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms):                          54.21     
    Median TPOT (ms):                        51.16     
    P99 TPOT (ms):                           71.11     
    ---------------Inter-token Latency----------------
    Mean ITL (ms):                           50.04     
    Median ITL (ms):                         41.61     
    P99 ITL (ms):                            624.80    
    ==================================================
    
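Before running the benchmark, it helps to confirm that the server started in step 1 is reachable. Below is a minimal sketch, assuming the OpenAI-compatible /v1/completions route that vllm.entrypoints.openai.api_server exposes; the helper names here are illustrative, not part of vLLM:

```python
import json
import urllib.request

# The server started in step 1 listens on port 8000 (assumption: localhost).
URL = "http://localhost:8000/v1/completions"

def build_payload(model, prompt, max_tokens=32):
    """Assemble a minimal OpenAI-style completion request body."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def send(payload):
    """POST the payload to the completions endpoint and return the JSON reply."""
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_payload("/{model path}/Qwen2.5-7B-Instruct", "Hello")
# send(payload)  # uncomment once the server from step 1 is running
```

If the call returns a JSON body with a `choices` field, the service is up and the benchmark in step 3 can be run.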
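The summary block printed by benchmark_serving.py can also be captured and parsed programmatically, for example when tracking results across runs. A rough sketch using the figures above (parse_metrics is an illustrative helper, not part of vLLM):

```python
import re

# Excerpt of the benchmark summary shown above, used as sample input.
SAMPLE = """\
============ Serving Benchmark Result ============
Successful requests:                     640
Benchmark duration (s):                  22.74
Total input tokens:                      163840
Total generated tokens:                  80593
Request throughput (req/s):              28.14
Output token throughput (tok/s):         3543.65
Total Token throughput (tok/s):          10747.65
==================================================
"""

def parse_metrics(text):
    """Return {metric name: value} for every 'name: number' line."""
    metrics = {}
    for line in text.splitlines():
        m = re.match(r"^(.+?):\s+([\d.]+)\s*$", line)
        if m:
            metrics[m.group(1).strip()] = float(m.group(2))
    return metrics

metrics = parse_metrics(SAMPLE)
# Sanity check: request throughput = successful requests / duration
# (640 / 22.74 s ≈ 28.14 req/s, matching the reported value).
assert abs(metrics["Successful requests"] / metrics["Benchmark duration (s)"]
           - metrics["Request throughput (req/s)"]) < 0.01
```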

For performance tuning, see the "Performance Tuning" chapter of the MindIE Turbo Development Guide.