Preparing the Environment
Procedure
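The Server can deploy service applications that are compatible with the interfaces of third-party frameworks such as Triton, OpenAI, TGI, and vLLM. You are advised to enable HTTPS communication and, as described in the referenced section, configure the certificate files it requires, such as the server certificate and private key.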
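The Server starts on a default IP address and port number. To change them, modify the "ipAddress" and "port" parameters in the config.json file. The Server provides service status queries, model information queries, and text and streaming inference.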
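If you prefer to script this change, the Python sketch below rewrites the "ipAddress" and "port" values in config.json. This section does not show where these keys sit inside the file, so the sketch searches the whole JSON tree and returns the paths it changed. Check that list: any other key that happens to be named "port" would also be overwritten. Keep a backup of the original file, because `json.dumps` does not preserve the original formatting. The helper names `set_listen_address` and `update_config_file` are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, List


def set_listen_address(node: Any, ip: str, port: int, path: str = "") -> List[str]:
    """Recursively overwrite every "ipAddress"/"port" key and return the paths that changed.

    Assumption: the keys may be nested anywhere in config.json, so the whole tree is searched.
    """
    changed: List[str] = []
    if isinstance(node, dict):
        for key in list(node):
            child_path = f"{path}.{key}" if path else key
            if key == "ipAddress":
                node[key] = ip
                changed.append(child_path)
            elif key == "port":
                node[key] = port
                changed.append(child_path)
            else:
                changed.extend(set_listen_address(node[key], ip, port, child_path))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            changed.extend(set_listen_address(item, ip, port, f"{path}[{i}]"))
    return changed


def update_config_file(config_path: str, ip: str, port: int) -> List[str]:
    """Load config.json, write back the updated listen address, and report what changed."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    changed = set_listen_address(config, ip, port)
    path.write_text(json.dumps(config, indent=4, ensure_ascii=False) + "\n", encoding="utf-8")
    return changed
```

Pass the path of the config.json file your deployment actually uses. If the returned list contains more than one "ipAddress" or "port" entry, inspect the file before starting the Server.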
The service can be started in either of the following two ways.
Run the startup command in the {MindIE installation directory} directory. Use the following command to view the installation path.
Method 1 (recommended): Start the service as a background process. A service started this way keeps running after the terminal window is closed.
If the following information is printed in the file that captures the standard output stream, the service has started successfully.
Method 2: Start the service directly.
If the following output is displayed, the service has started successfully.
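Method 1 can also be driven from a script. The Python sketch below starts a command in a new session, with standard output and standard error redirected to a log file. This roughly matches running it in the background with output redirection. It then polls that file for a success message.

This section does not reproduce the startup command or the exact success message, so both are parameters you must supply. `start_in_background` and `wait_for_marker` are hypothetical helper names.

```python
from __future__ import annotations

import subprocess
import time
from pathlib import Path


def start_in_background(cmd: list[str], log_path: str, cwd: str | None = None) -> subprocess.Popen:
    """Start `cmd` in its own session so it outlives the terminal; log stdout and stderr."""
    # "wb" truncates an old log so a stale success message cannot be matched.
    with open(log_path, "wb") as log:
        # The child inherits its own copy of the file descriptor, so closing ours is safe.
        return subprocess.Popen(
            cmd,
            cwd=cwd,  # e.g. the MindIE installation directory the command must run in
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,
        )


def wait_for_marker(log_path: str, marker: str, timeout: float = 300.0, interval: float = 1.0) -> bool:
    """Poll the log file until `marker` appears (True) or `timeout` seconds pass (False)."""
    path = Path(log_path)
    deadline = time.monotonic() + timeout
    while True:
        if path.exists() and marker in path.read_text(encoding="utf-8", errors="replace"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The timeout matters in practice: loading a large model can take minutes, so a short default would report failure for a service that is still starting.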
You can send HTTPS requests with any HTTPS client, such as the Linux curl command or the Postman tool. The Linux curl command is used as the example here.
Open a new window and send a request with the following command. For example, to list the current models:
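The same query can be sent from Python using only the standard library. This sketch assumes the OpenAI-compatible `GET /v1/models` endpoint and its `{"data": [{"id": ...}]}` response shape. The base URL, the certificate file paths, and the helper name `list_models` are placeholders to replace with your own values.

For HTTPS, the sketch verifies the server against `ca_file`. If `cert_file` is given, it also presents a client certificate for mutual TLS.

```python
import json
import ssl
import urllib.request
from typing import List, Optional


def list_models(base_url: str, ca_file: Optional[str] = None, cert_file: Optional[str] = None,
                key_file: Optional[str] = None, timeout: float = 10.0) -> List[str]:
    """Return the model IDs reported by an OpenAI-compatible GET /v1/models endpoint."""
    url = base_url.rstrip("/") + "/v1/models"
    context = None
    if url.startswith("https://"):
        # Verify the server certificate against the given CA file (or the system defaults).
        context = ssl.create_default_context(cafile=ca_file)
        if cert_file:
            # Present a client certificate when the server requires mutual TLS.
            context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    request = urllib.request.Request(url, headers={"Accept": "application/json"}, method="GET")
    with urllib.request.urlopen(request, timeout=timeout, context=context) as response:
        body = json.loads(response.read().decode("utf-8"))
    return [model["id"] for model in body.get("data", [])]
```

For example, call `list_models("https://<ip>:<port>", ca_file=..., cert_file=..., key_file=...)`, using the address from config.json and the certificate files configured for HTTPS.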
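This section uses the v1/chat streaming inference API and the v1/completions streaming inference API as examples. For how to call the other APIs, see the corresponding chapter.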
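A streaming request to either endpoint can be sketched in Python as follows. The sketch POSTs a JSON body with `"stream": true`, reads the response line by line, and yields each JSON chunk from the `data:` lines until `data: [DONE]`.

The request paths and body fields are assumptions taken from the OpenAI API, not confirmed by this section:

- `/v1/chat/completions` with `messages`
- `/v1/completions` with `prompt`
- `max_tokens`

`stream_chunks` is a hypothetical helper name.

```python
import json
import ssl
import urllib.request
from typing import Any, Dict, Iterator, Optional


def stream_chunks(base_url: str, path: str, payload: Dict[str, Any],
                  context: Optional[ssl.SSLContext] = None,
                  timeout: float = 60.0) -> Iterator[Dict[str, Any]]:
    """POST `payload` with stream=True and yield each JSON chunk sent on a `data:` line."""
    body = json.dumps({**payload, "stream": True}).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout, context=context) as response:
        for raw in response:  # the response object yields the body one line at a time
            line = raw.decode("utf-8").strip()
            if not line.startswith("data:"):
                continue  # skip blank event separators and other SSE fields
            data = line[len("data:"):].strip()
            if data == "[DONE]":
                return
            yield json.loads(data)
```

Usage examples under the same assumptions:

- Chat: `stream_chunks(url, "/v1/chat/completions", {"model": "llama", "messages": [{"role": "user", "content": "..."}], "max_tokens": 5})`
- Completions: `stream_chunks(url, "/v1/completions", {"model": "llama", "prompt": "...", "max_tokens": 5})`

For HTTPS, pass an `ssl.SSLContext` built the same way as for any other HTTPS request to the Server.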
1. v1/chat streaming inference API
data: {"id":"endpoint_common_10","object":"chat.completion.chunk","created":1744038509,"model":"llama","choices":[{"index":0,"delta":{"role":"assistant","content":" are"},"logprobs":null,"finish_reason":null}]}
data: {"id":"endpoint_common_10","object":"chat.completion.chunk","created":1744038509,"model":"llama","choices":[{"index":1,"delta":{"role":"assistant","content":" are"},"logprobs":null,"finish_reason":null}]}
data: {"id":"endpoint_common_10","object":"chat.completion.chunk","created":1744038509,"model":"llama","choices":[{"index":0,"delta":{"role":"assistant","content":" a"},"logprobs":null,"finish_reason":null}]}
data: {"id":"endpoint_common_10","object":"chat.completion.chunk","created":1744038509,"model":"llama","choices":[{"index":1,"delta":{"role":"assistant","content":" a"},"logprobs":null,"finish_reason":null}]}
data: {"id":"endpoint_common_10","object":"chat.completion.chunk","created":1744038509,"model":"llama","choices":[{"index":0,"delta":{"role":"assistant","content":" helpful"},"logprobs":null,"finish_reason":null}]}
data: {"id":"endpoint_common_10","object":"chat.completion.chunk","created":1744038509,"model":"llama","choices":[{"index":1,"delta":{"role":"assistant","content":" helpful"},"logprobs":null,"finish_reason":null}]}
data: {"id":"endpoint_common_10","object":"chat.completion.chunk","created":1744038509,"model":"llama","usage":{"prompt_tokens":24,"prompt_tokens_details": {"cached_tokens": 0},"completion_tokens":5,"total_tokens":29,"batch_size":[1,1,1,1,1],"queue_wait_time":[5318,117,82,72,196]},"choices":[{"index":0,"delta":{"role":"assistant","content":" assistant"},"logprobs":null,"finish_reason":"length"}]}
data: {"id":"endpoint_common_10","object":"chat.completion.chunk","created":1744038509,"model":"llama","usage":{"prompt_tokens":24,"prompt_tokens_details": {"cached_tokens": 0},"completion_tokens":5,"total_tokens":29,"batch_size":[1,1,1,1,1],"queue_wait_time":[5318,117,82,72,196]},"choices":[{"index":1,"delta":{"role":"assistant","content":" assistant"},"logprobs":null,"finish_reason":"length"}]}
data: [DONE]
2. v1/completions streaming inference API
data: [DONE]
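A captured stream like the v1/chat output above can be turned back into text with a short Python sketch. It reads saved `data:` lines, stops at `data: [DONE]`, and concatenates the pieces for each choice `index`. The sample output contains two choices, 0 and 1.

The sketch also keeps each choice's `finish_reason` and the last `usage` object it sees. It accepts both chat chunks (`choices[].delta.content`) and completions chunks (`choices[].text`). The completions shape is assumed from the OpenAI API rather than shown in this section, and `collect_stream` is a hypothetical helper name.

```python
import json
from typing import Dict, Iterable, List, Optional, Tuple


def collect_stream(lines: Iterable[str]) -> Tuple[Dict[int, str], Dict[int, str], Optional[dict]]:
    """Rebuild per-choice text, finish reasons and the last usage object from SSE lines.

    Handles chat chunks (choices[].delta.content) and completions chunks (choices[].text).
    """
    parts: Dict[int, List[str]] = {}
    finish: Dict[int, str] = {}
    usage: Optional[dict] = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        if chunk.get("usage") is not None:
            usage = chunk["usage"]
        for choice in chunk.get("choices", []):
            index = choice.get("index", 0)
            delta = choice.get("delta") or {}
            text = delta.get("content") if "delta" in choice else choice.get("text")
            if text:
                parts.setdefault(index, []).append(text)
            if choice.get("finish_reason") is not None:
                finish[index] = choice["finish_reason"]
    return {i: "".join(p) for i, p in parts.items()}, finish, usage
```

Applied to the complete v1/chat output above, each choice reassembles to " are a helpful assistant" with `finish_reason` "length". The returned `usage` reports 24 prompt tokens and 29 total tokens.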
This section uses the text inference API and the streaming inference API as examples. For how to call the other APIs, see the corresponding chapter.
1. Text inference API
2. Streaming inference API
data: {"prefill_time":null,"decode_time":128.32,"token":{"id":[263],"text":" a"}}
data: {"prefill_time":null,"decode_time":18.17,"token":{"id":[5176],"text":" French"}}
data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[17739],"text":" photograph"}}
data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[261],"text":"er"}}
data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[2729],"text":" based"}}
data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[297],"text":" in"}}
data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[3681],"text":" Paris"}}
data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[29889],"text":"."}}
data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[13],"text":"\n"}}
data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[29902],"text":"I"}}
data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[505],"text":" have"}}
data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[1063],"text":" been"}}
data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[27904],"text":" shooting"}}
data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[1951],"text":" since"}}
data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[306],"text":" I"}}
data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[471],"text":" was"}}
data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[29871],"text":" "}}
data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[29896],"text":"1"}}
data: {"prefill_time":null,"decode_time":16.80,"generated_text":"am a French photographer based in Paris.\nI have been shooting since I was 15","details":{"finish_reason":"length","generated_tokens":20,"seed":846930886},"token":{"id":[29945],"text":null}}
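The streaming output above uses a per-token format: each event carries a `token` object and a `decode_time`, and the final event adds `generated_text` and `details`. The Python sketch below collects the streamed token texts, the `decode_time` values, and the final `generated_text` and `details`.

Prefer `generated_text` as the complete result. In the sample, the last event's token text is null, so its content ("5") appears only in `generated_text`. This section does not state the unit of `decode_time`, so the sketch leaves the values as reported. `collect_token_stream` is a hypothetical helper name.

```python
import json
from typing import Iterable, List, Optional, Tuple


def collect_token_stream(lines: Iterable[str]) -> Tuple[str, List[float], Optional[str], Optional[dict]]:
    """Return (streamed text, decode times, final generated_text, details) from `data:` lines."""
    pieces: List[str] = []
    decode_times: List[float] = []
    generated_text: Optional[str] = None
    details: Optional[dict] = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # not present in the sample above; handled defensively
        event = json.loads(data)
        token = event.get("token") or {}
        if token.get("text") is not None:
            pieces.append(token["text"])  # the final event carries "text": null
        if event.get("decode_time") is not None:
            decode_times.append(event["decode_time"])
        if "generated_text" in event:
            generated_text = event["generated_text"]
            details = event.get("details")
    return "".join(pieces), decode_times, generated_text, details
```

In the sample, the first `decode_time` (128.32) is much larger than the later ones (about 16.80). Exclude the first value when estimating steady-state per-token latency.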
MindIE currently supports accuracy testing with the AISBench tool. An example is given below. For detailed usage, see the related documentation.
Procedure
Download and install the AISBench tool with the following commands.
Prepare the dataset.
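Using gsm8k as an example, download the dataset, extract it, and place the extracted gsm8k folder in the ais_bench/datasets folder under the tool's root directory.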
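The dataset step can be sketched in Python. The sketch assumes two things:

- The downloaded archive is in a format `shutil.unpack_archive` recognizes, such as .zip or .tar.gz.
- The archive contains a top-level gsm8k folder.

The `.jsonl` record count assumes the dataset ships as JSON Lines files, which may not hold for every download. `install_dataset` and `count_jsonl_records` are hypothetical helper names, not part of AISBench.

```python
import shutil
from pathlib import Path
from typing import Dict


def install_dataset(archive: str, tool_root: str, name: str = "gsm8k") -> Path:
    """Unpack `archive` into <tool_root>/ais_bench/datasets and check that <name>/ exists."""
    datasets_dir = Path(tool_root) / "ais_bench" / "datasets"
    datasets_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the archive's top-level folder is already named after the dataset.
    shutil.unpack_archive(archive, extract_dir=str(datasets_dir))
    target = datasets_dir / name
    if not target.is_dir():
        raise FileNotFoundError(f"{target} not found; check the archive's top-level folder name")
    return target


def count_jsonl_records(folder: Path) -> Dict[str, int]:
    """Count non-empty lines in each *.jsonl file as a quick sanity check of the download."""
    counts: Dict[str, int] = {}
    for path in sorted(folder.glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            counts[path.name] = sum(1 for line in handle if line.strip())
    return counts
```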
Configure the ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py file as shown in the following example.
Run the following command to start the service-oriented accuracy test.
If the following output is displayed, the command has run successfully:
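MindIE currently supports performance testing with the AISBench tool. An example is given below. For detailed usage, see the related documentation.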
Procedure
Download and install the AISBench tool with the following commands.
Prepare the dataset.
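Using gsm8k as an example, download the dataset, extract it, and place the extracted gsm8k folder in the ais_bench/datasets folder under the tool's root directory.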
Configure the ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py file as shown in the following example.
Run the following command to start the service-oriented performance test.
If the following output is displayed, the command has run successfully:
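In the performance test results, focus mainly on the following output parameters:

- TTFT (time to first token)
- TPOT (time per output token)
- Request Throughput
- Output Token Throughput

For details about these parameters, see the related documentation.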
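The Python sketch below computes the four metrics from per-request timestamps, using their common definitions:

- TTFT is the time from sending a request to receiving its first output token.
- TPOT is the average time between subsequent output tokens of a request.
- Request Throughput and Output Token Throughput divide completed requests or generated tokens by the wall-clock duration of the run.

AISBench's exact formulas and reporting (for example, means versus percentiles) may differ. The `RequestTiming` record and the `summarize` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestTiming:
    """Hypothetical per-request record; all times are in seconds on one monotonic clock."""
    sent: float           # request sent
    first_token: float    # first output token received
    finished: float       # last output token received
    output_tokens: int    # number of generated tokens


def summarize(requests: List[RequestTiming]) -> Dict[str, float]:
    """Mean TTFT and TPOT plus request and output-token throughput over the whole run."""
    if not requests:
        raise ValueError("no requests to summarize")
    ttft = sum(r.first_token - r.sent for r in requests) / len(requests)
    # TPOT is only defined for requests that produced more than one token.
    multi = [r for r in requests if r.output_tokens > 1]
    tpot = (sum((r.finished - r.first_token) / (r.output_tokens - 1) for r in multi) / len(multi)
            if multi else 0.0)
    duration = max(r.finished for r in requests) - min(r.sent for r in requests)
    return {
        "ttft_s": ttft,
        "tpot_s": tpot,
        "request_throughput_per_s": len(requests) / duration,
        "output_token_throughput_per_s": sum(r.output_tokens for r in requests) / duration,
    }
```

Consider two requests, with (sent, first_token, finished, output_tokens) of (0.0, 0.5, 2.5, 11) and (1.0, 1.25, 3.0, 8). Over the 3 s run they give:

- Mean TTFT: 0.375 s.
- Mean TPOT: 0.225 s (0.2 s and 0.25 s per request).
- Request Throughput: about 0.67 requests/s.
- Output Token Throughput: about 6.33 output tokens/s.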
Log in to the installation node as the installation user. The Server service can be stopped in either of the following two ways.
Method 1 (recommended): If the service was started as a background process, stop it in either of the following ways:
Stop the process with the kill command.
Alternatively, stop the process with the pkill command.
**Method 2:** If the service was started directly, press Ctrl+C to stop it.
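For a service started in the background, the stop step can also be scripted. The Python sketch below (Linux only) applies kill cautiously: it sends SIGTERM first, waits for the process to exit, and falls back to SIGKILL only if the grace period runs out.

The PID must come from your own records, such as the `pid` of a `subprocess.Popen` object or a process listing. The sketch does not replicate pkill's name matching. `stop_process` is a hypothetical helper name.

```python
import os
import signal
import time


def _alive(pid: int) -> bool:
    """Return True while `pid` still exists, reaping it first if it is our own child."""
    try:
        done, _ = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return False  # our child exited and has now been reaped
    except ChildProcessError:
        pass  # not our child: fall back to a signal-0 probe
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists but belongs to another user
    return True


def stop_process(pid: int, grace: float = 10.0, poll: float = 0.1) -> bool:
    """Send SIGTERM, wait up to `grace` seconds, then SIGKILL. True once the process is gone."""
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return True
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if not _alive(pid):
            return True
        time.sleep(poll)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return True
    for _ in range(50):  # allow up to about 5 s for the kernel to remove the process
        if not _alive(pid):
            return True
        time.sleep(poll)
    return False
```

Trying SIGTERM first gives the service a chance to release resources before SIGKILL ends it unconditionally.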