Performance Test

MindIE supports performance tests using tools such as AISBench. The following is an example.

AISBench

Download and install AISBench.
```
git clone https://gitee.com/aisbench/benchmark.git 
cd benchmark/ 
pip3 install -e ./ --use-pep517
pip3 install -r requirements/api.txt 
pip3 install -r requirements/extra.txt
```
Install AISBench with pip as shown above when you need its latest features. This step is not required when MindIE is installed from an image, because AISBench is pre-installed in the MindIE image. To view the AISBench installation path in the MindIE image, run the following command:
```
pip show ais_bench_benchmark
```
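The same check can be made from a script with the standard-library `importlib.metadata` module. This is a minimal sketch: the helper name `installed_version` is hypothetical, and the distribution name is the one passed to `pip show` above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("ais_bench_benchmark")
    print("AISBench version:", version if version else "not installed")
```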
Prepare a dataset.
Take gsm8k as an example. Download the gsm8k dataset, decompress it, and place the resulting gsm8k folder in the ais_bench/datasets folder under the root path of the tool.

Configure the ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py file. The following is an example:

from ais_bench.benchmark.models import VLLMCustomAPIChatStream
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr="vllm-api-stream-chat",
        # Absolute path of the model vocabulary (tokenizer) files, that is, the model weight folder
        path="",
        # Name of the model loaded on the server. Set it to the model name reported by the vLLM
        # inference service. If it is an empty string, the model name is obtained automatically.
        model="DeepSeek-R1",
        # Request sending frequency. One request is sent to the server every 1/request_rate seconds.
        # If the value is less than 0.1, all requests are sent at once.
        request_rate=0,
        retry=2,
        host_ip="localhost",      # IP address of the inference service
        host_port=8080,           # Port number of the inference service
        max_out_len=512,          # Maximum number of tokens output by the inference service
        batch_size=1,             # Maximum number of concurrent requests
        generation_kwargs=dict(
            temperature=0.5,
            top_k=10,
            top_p=0.95,
            seed=None,
            repetition_penalty=1.03,
            ignore_eos=True,      # Ignore EOS so that the output length reaches max_out_len
        ),
    )
]
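
The request_rate rule described in the configuration comments can be read as the following pacing schedule. This only illustrates the documented behavior and is not AISBench's implementation; the function name `send_offsets` is hypothetical.

```python
def send_offsets(num_requests, request_rate):
    """Return the planned send time (seconds from start) of each request.

    Follows the rule stated in the configuration comments: one request every
    1/request_rate seconds, or all requests at once when request_rate < 0.1.
    """
    if request_rate < 0.1:
        return [0.0] * num_requests
    interval = 1.0 / request_rate
    return [i * interval for i in range(num_requests)]


if __name__ == "__main__":
    print(send_offsets(3, 0))  # request_rate=0, as in the example configuration
    print(send_offsets(3, 2))  # two requests per second
```

With request_rate = 0, as in the example configuration, every request is released immediately, so the load on the service is bounded by batch_size rather than by the send rate.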

Run the following command to start the serving performance test:

ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --mode perf --debug

If the command is executed successfully, output similar to the following is displayed:

╒════════════════════════╤═══════╤══════════════╤══════════════╤════════════╤══════════════╤════════════╤══════════════╤══════════════╤═══╕
│ Performance Parameters │ Stage │ Average      │ Min          │ Max        │ Median       │ P75        │ P90          │ P99          │ N │
╞════════════════════════╪═══════╪══════════════╪══════════════╪════════════╪══════════════╪════════════╪══════════════╪══════════════╪═══╡
│ E2EL                   │ total │ 2048.2945 ms │ 1729.7498 ms │ 3450.96 ms │ 2491.8789 ms │ 2750.85 ms │ 3184.9186 ms │ 3424.4354 ms │ 8 │
│ TTFT                   │ total │ 50.332 ms    │ 50.6244 ms   │ 52.0585 ms │ 50.3237 ms   │ 50.5872 ms │ 50.7566 ms   │ 50.0551 ms   │ 8 │
│ TPOT                   │ total │ 10.6965 ms   │ 10.061 ms    │ 10.8805 ms │ 10.7495 ms   │ 10.7818 ms │ 10.808 ms    │ 10.8582 ms   │ 8 │
│ ITL                    │ total │ 10.6965 ms   │ 7.3583 ms    │ 13.7707 ms │ 10.7513 ms   │ 10.8009 ms │ 10.8358 ms   │ 10.9322 ms   │ 8 │
│ InputTokens            │ total │ 1512.5       │ 1481.0       │ 1566.0     │ 1511.5       │ 1520.25    │ 1536.6       │ 1563.06      │ 8 │
│ OutputTokens           │ total │ 287.375      │ 200.0        │ 407.0      │ 280.0        │ 322.75     │ 374.8        │ 403.78       │ 8 │
│ OutputTokenThroughput  │ total │ 115.9216     │ 107.6555     │ 116.5352   │ 117.6448     │ 118.2426   │ 118.3765     │ 118.6388     │ 8 │
╘════════════════════════╧═══════╧══════════════╧══════════════╧════════════╧══════════════╧════════════╧══════════════╧══════════════╧═══╛
╒══════════════════════════╤═══════╤════════════════════╕
│ Common Metric            │ Stage │ Value              │
╞══════════════════════════╪═══════╪════════════════════╡
│ Benchmark Duration       │ total │ 19897.8505 ms      │
│ Total Requests           │ total │ 8                  │
│ Failed Requests          │ total │ 0                  │
│ Success Requests         │ total │ 8                  │
│ Concurrency              │ total │ 0.9972             │
│ Max Concurrency          │ total │ 1                  │
│ Request Throughput       │ total │ 0.4021 req/s       │
│ Total Input Tokens       │ total │ 12100              │
│ Prefill Token Throughput │ total │ 17014.3123 token/s │
│ Total generated tokens   │ total │ 2299               │
│ Input Token Throughput   │ total │ 608.7438 token/s   │
│ Output Token Throughput  │ total │ 115.7835 token/s   │
│ Total Token Throughput   │ total │ 723.5273 token/s   │
╘══════════════════════════╧═══════╧════════════════════╛

Pay attention to the output parameters TTFT, TPOT, Request Throughput, and Output Token Throughput in the performance test result. For details about the parameters, see Table 2.
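
To make the latency abbreviations concrete, the sketch below derives them for one streamed request from token arrival timestamps. It uses the definitions commonly applied to streaming inference (TTFT: time to first token, TPOT: time per output token, ITL: inter-token latency, E2EL: end-to-end latency). The function name `request_metrics` and the sample timestamps are hypothetical, and AISBench may compute these values differently.

```python
def request_metrics(send_time, token_times):
    """Compute latency metrics for one streamed request.

    send_time   -- time the request was sent, in seconds
    token_times -- arrival time of each output token, in seconds, ascending
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - send_time    # wait until the first token arrives
    e2el = token_times[-1] - send_time   # wait until the last token arrives
    # Gaps between consecutive tokens.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    # Average decode time per token after the first; 0.0 if only one token arrived.
    tpot = (e2el - ttft) / (len(token_times) - 1) if len(token_times) > 1 else 0.0
    return {"TTFT": ttft, "E2EL": e2el, "TPOT": tpot, "ITL": itl}


if __name__ == "__main__":
    # Hypothetical timestamps: request sent at t=0, four tokens received.
    print(request_metrics(0.0, [0.5, 0.75, 1.0, 1.25]))
```

Under these definitions, TPOT equals the mean of the ITL values for a single request, which is consistent with the identical TPOT and ITL averages in the sample output.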

Task execution details are written to the default output path, which is recorded in the run log as follows:

08/28 15:13:26 - AISBench - INFO - Current exp folder: outputs/default/20250828_151326
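
When the test is driven from a script, the output path can be extracted from captured log text. This is a minimal sketch that assumes the log line keeps the "Current exp folder:" wording shown above; the helper name `find_exp_folder` is hypothetical.

```python
import re

# Matches the path that follows "Current exp folder:" in a log line.
EXP_FOLDER_RE = re.compile(r"Current exp folder:\s*(\S+)")


def find_exp_folder(log_text):
    """Return the experiment folder named in the log text, or None if absent."""
    match = EXP_FOLDER_RE.search(log_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    line = "08/28 15:13:26 - AISBench - INFO - Current exp folder: outputs/default/20250828_151326"
    print(find_exp_folder(line))
```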

After the command is executed, the task execution details in outputs/default/20250828_151326 are as follows:

20250828_151326           # Unique directory generated based on the timestamp for each experiment
├── configs               # Dumped configuration files, stored automatically
├── logs                  # Logs generated during execution. If --debug is added to the command, process logs are printed directly and not written to disk.
│   └── performance/      # Log files for the inference phase
└── performance           # Performance evaluation results
    └── vllm-api-stream-chat/          # Serving model configuration name, which corresponds to the abbr parameter of models in the model task configuration file
        ├── gsm8kdataset.csv          # Per-request performance output (CSV), corresponding to the Performance Parameters table in the printed results
        ├── gsm8kdataset.json         # E2E performance output (JSON), corresponding to the Common Metric table in the printed results
        ├── gsm8kdataset_details.json # Full instrumentation logs (JSON)
        └── gsm8kdataset_plot.html    # Visualized report of concurrent requests (HTML)
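
The results can also be picked up programmatically. The sketch below locates the newest experiment directory and loads the JSON summary, assuming the directory layout shown above. The helper names are hypothetical, and no assumption is made about the fields inside the JSON file.

```python
import json
from pathlib import Path


def latest_experiment(root="outputs/default"):
    """Return the newest experiment directory under root, or None.

    Directory names are timestamps (YYYYMMDD_HHMMSS), so sorting them
    as text also orders them by time.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    runs = sorted(p for p in root_path.iterdir() if p.is_dir())
    return runs[-1] if runs else None


def load_summary(exp_dir, abbr="vllm-api-stream-chat", dataset="gsm8kdataset"):
    """Load performance/<abbr>/<dataset>.json from an experiment directory."""
    path = Path(exp_dir) / "performance" / abbr / f"{dataset}.json"
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    exp_dir = latest_experiment()
    if exp_dir is None:
        print("no experiment directory found")
    else:
        print(json.dumps(load_summary(exp_dir), indent=2))
```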
