Performance Test

MindIE supports performance tests using tools such as AISBench. The following is an example.

AISBench

  1. Download and install AISBench.
    git clone https://gitee.com/aisbench/benchmark.git 
    cd benchmark/ 
    pip3 install -e ./ --use-pep517
    pip3 install -r requirements/api.txt 
    pip3 install -r requirements/extra.txt

    Install AISBench with pip when you need its latest features. This step is not required if MindIE was installed from an image, because AISBench is pre-installed in the MindIE image. To view the installation path of AISBench in the MindIE image, run the following command:

    pip show ais_bench_benchmark
  2. Prepare a dataset.

    This example uses gsm8k. Download the gsm8k dataset, decompress it, and place the resulting gsm8k folder in the ais_bench/datasets folder under the root path of the tool.

  3. Configure the ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py file. The following is an example:
    from ais_bench.benchmark.models import VLLMCustomAPIChatStream
    from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

    models = [
        dict(
            attr="service",
            type=VLLMCustomAPIChatStream,
            abbr="vllm-api-stream-chat",
            path="",                  # Absolute path of the model's vocabulary (tokenizer) files, that is, the path of the model weight folder
            model="DeepSeek-R1",      # Name of the model loaded on the server, as reported by the vLLM inference service. If set to an empty string, the model name is obtained automatically.
            request_rate=0,           # Request sending rate. One request is sent to the server every 1/request_rate seconds. If the value is less than 0.1, all requests are sent at once.
            retry=2,
            host_ip="localhost",      # IP address of the inference service
            host_port=8080,           # Port number of the inference service
            max_out_len=512,          # Maximum number of tokens output by the inference service
            batch_size=1,             # Maximum number of concurrent requests
            generation_kwargs=dict(
                temperature=0.5,
                top_k=10,
                top_p=0.95,
                seed=None,
                repetition_penalty=1.03,
                ignore_eos=True,      # Ignore EOS so that generation continues until the output length reaches max_out_len
            ),
        )
    ]
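
The request_rate comment in the configuration above can be illustrated with a small sketch. It assumes the behavior exactly as that comment describes it (one request every 1/request_rate seconds, and all requests sent at once when the value is below 0.1). The helper `send_offsets` is hypothetical and is not part of AISBench; the tool's actual scheduler may differ.

```python
from typing import List


def send_offsets(num_requests: int, request_rate: float) -> List[float]:
    """Return the planned send time (seconds from start) of each request.

    Hypothetical helper that mirrors the request_rate comment; not AISBench code.
    """
    if request_rate < 0.1:
        # Below the threshold, all requests are sent at once.
        return [0.0] * num_requests
    interval = 1.0 / request_rate  # seconds between consecutive requests
    return [i * interval for i in range(num_requests)]


print(send_offsets(4, 0))    # request_rate=0: every request is sent at t = 0
print(send_offsets(4, 2.0))  # request_rate=2: one request every 0.5 s
```

With request_rate = 0, as in the example configuration, the load on the server is therefore bounded only by batch_size.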
    
  4. Run the following command to start the serving performance test:
    ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --mode perf --debug

    If the command is executed successfully, output similar to the following is displayed:

    ╒════════════╤════╤════════╤═══════╤══════╤═══════╤══════╤═══════╤═══════╤═══╕
     Performance Parameters Stage    Average         Min           Max         Median        P75         P90           P99           N     
     E2EL                   total    2048.2945 ms    1729.7498 ms  3450.96 ms  2491.8789 ms  2750.85 ms  3184.9186 ms  3424.4354 ms  8    
     TTFT                   total    50.332 ms       50.6244 ms    52.0585 ms  50.3237 ms    50.5872 ms  50.7566 ms    50.0551 ms    8    
     TPOT                   total    10.6965 ms      10.061 ms     10.8805 ms  10.7495 ms    10.7818 ms  10.808 ms     10.8582 ms    8     
     ITL                    total    10.6965 ms      7.3583 ms     13.7707 ms  10.7513 ms    10.8009 ms  10.8358 ms    10.9322 ms    8     
     InputTokens            total    1512.5          1481.0        1566.0      1511.5        1520.25     1536.6        1563.06       8     
     OutputTokens           total    287.375         200.0         407.0       280.0         322.75      374.8         403.78        8     
     OutputTokenThroughput  total    115.9216        107.6555      116.5352    117.6448      118.2426    118.3765      118.6388      8    
    ╘════════════╧════╧════════╧═══════╧══════╧═══════╧══════╧═══════╧═══════╧═══╛
    ╒═════════════╤═════╤══════════╕
     Common Metric             Stage     Value               
     Benchmark Duration        total     19897.8505 ms       
     Total Requests            total     8                   
     Failed Requests           total     0                   
     Success Requests          total     8                   
     Concurrency               total     0.9972              
     Max Concurrency           total     1                   
     Request Throughput        total     0.4021 req/s        
     Total Input Tokens        total     12100               
     Prefill Token Throughput  total     17014.3123 token/s  
     Total generated tokens    total     2299                
     Input Token Throughput    total     608.7438 token/s    
     Output Token Throughput   total     115.7835 token/s    
     Total Token Throughput    total     723.5273 token/s    
    ╘═════════════╧═════╧══════════╛
    

    In the performance test result, focus on the TTFT, TPOT, Request Throughput, and Output Token Throughput metrics. For details about the parameters, see Table 2.
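
As a rough illustration of how the per-request latency metrics relate to each other, the following sketch derives them from token arrival times. It uses definitions that are common in LLM serving benchmarks and is an assumption, not AISBench's implementation; the definitions in Table 2 are authoritative. The function `latency_metrics` and its arguments are hypothetical.

```python
from typing import Dict, List


def latency_metrics(send_time: float, token_times: List[float]) -> Dict[str, object]:
    """Derive per-request latency metrics (in seconds) from token arrival times.

    Assumed definitions (hypothetical helper, not AISBench code):
      TTFT: time from sending the request to the first output token
      E2EL: time from sending the request to the last output token
      ITL:  gaps between consecutive output tokens
      TPOT: decode time averaged over the tokens after the first one
    """
    if not token_times:
        raise ValueError("at least one output token is required")
    ttft = token_times[0] - send_time
    e2el = token_times[-1] - send_time
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    decode_tokens = len(token_times) - 1
    tpot = (e2el - ttft) / decode_tokens if decode_tokens > 0 else 0.0
    return {"TTFT": ttft, "TPOT": tpot, "ITL": itl, "E2EL": e2el}


# A request sent at t = 0 whose four tokens arrive at 0.5 s, 0.75 s, 1.0 s, and 1.25 s.
print(latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25]))
```

Under these definitions, TTFT mainly reflects the prefill phase and TPOT the decode phase, which is why the two are usually read together.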

    Task execution details are written to the default output path. The output path is recorded in the run log, as shown below:

    08/28 15:13:26 - AISBench - INFO - Current exp folder: outputs/default/20250828_151326
    

    After the command is executed, outputs/default/20250828_151326 contains the following task execution details:

    20250828_151326           # Unique directory generated from the timestamp of each experiment
    ├── configs               # Automatically saved copies of all configuration files
    ├── logs                  # Logs generated during execution. If --debug is added to the command, process logs are printed directly and are not written to disk.
    │   └── performance/      # Log files for the inference phase
    └── performance           # Performance evaluation results
        └── vllm-api-stream-chat/          # Serving model configuration name, which corresponds to the abbr parameter of models in the model task configuration file
            ├── gsm8kdataset.csv          # Per-request performance output (CSV), corresponding to the Performance Parameters table in the printed results
            ├── gsm8kdataset.json         # E2E performance output (JSON), corresponding to the Common Metric table in the printed results
            ├── gsm8kdataset_details.json # Full instrumentation (event tracking) logs (JSON)
            └── gsm8kdataset_plot.html    # Visualized report of concurrent requests (HTML)
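
The directory layout above can be navigated with a short script such as the following sketch. It relies only on the folder and file names shown in the tree; the helpers `latest_experiment` and `performance_files` are hypothetical, and the internal structure of the CSV and JSON files is not assumed.

```python
from pathlib import Path
from typing import Dict, Optional


def latest_experiment(output_root: str = "outputs/default") -> Optional[Path]:
    """Return the newest timestamped experiment folder, or None if there is none."""
    root = Path(output_root)
    if not root.is_dir():
        return None
    # Folder names such as 20250828_151326 sort chronologically as strings.
    runs = sorted(p for p in root.iterdir() if p.is_dir())
    return runs[-1] if runs else None


def performance_files(exp_dir: Path, abbr: str = "vllm-api-stream-chat") -> Dict[str, Path]:
    """Map file name to path for the gsm8k result files of one model configuration."""
    perf_dir = exp_dir / "performance" / abbr
    return {p.name: p for p in sorted(perf_dir.glob("gsm8kdataset*")) if p.is_file()}


if __name__ == "__main__":
    exp = latest_experiment()
    if exp is None:
        print("No experiment folder found")
    else:
        for name, path in performance_files(exp).items():
            print(name, "->", path)
```

Run the script from the directory in which ais_bench was started, because the output path in the log is relative.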