Inference API
- The Transformers version in the runtime environment must be 4.34.1 or later. Tokenizers in earlier versions do not support the chat_template method.
- The tokenizer_config.json file in the inference model weight path must contain the chat_template field and its implementation.
- Currently, the function call parameters tool_call_id, tool_calls, tools, and tool_choice are supported only by some models, and an error may be reported if they are used with an unsupported model. The supported models are ChatGLM3-6B, DeepSeek-R1, the Qwen2.5 series, and the Qwen3 series.
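The first two prerequisites above can be checked before the service is started. The following is a minimal sketch, not part of the product: the helper names are hypothetical, the version string is assumed to come from `transformers.__version__`, and the configuration text is assumed to be the content of tokenizer_config.json in the model weight path.

```python
import json

MIN_TRANSFORMERS = (4, 34, 1)  # minimum version stated in the prerequisites


def parse_version(version: str) -> tuple:
    """Turn '4.34.1' or '4.35.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def transformers_is_supported(version: str) -> bool:
    """Return True if the version is 4.34.1 or later (pre-release tags are ignored)."""
    return parse_version(version) >= MIN_TRANSFORMERS


def has_chat_template(tokenizer_config_text: str) -> bool:
    """Return True if the tokenizer configuration has a non-empty chat_template field."""
    config = json.loads(tokenizer_config_text)
    return bool(config.get("chat_template"))
```

A typical call would pass `transformers.__version__` to `transformers_is_supported` and the text of tokenizer_config.json to `has_chat_template`.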
Function
Processes text inference and streaming inference requests.
Format
Operation type: POST
URL: https://{ip}:{port}/v1/chat/completions
- Replace {ip} and {port} with the IP address and port number of the service plane, that is, ipAddress and port.
- This URL is the same as the URL in v1/chat. Use the openAiSupport parameter in the config.json file to distinguish between the two APIs.
- If the value is vllm or the configuration field is missing, the vLLM-compatible OpenAI API is used.
- If the field is set to any other value, this API is used.
For details, see the ServerConfig parameter description in "Core Concepts and Configurations" > "Configuration Parameters (Serving)" in MindIE LLM Development Guide.
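The selection rule above can be written as a small function. This is a sketch under assumptions: the dictionary passed in is assumed to be the section of config.json that holds openAiSupport, the comparison is assumed to be an exact string match, and the returned labels ("vllm-compatible" and "native") are illustrative names, not values defined by the service.

```python
def select_chat_api(server_config: dict) -> str:
    """Decide which API answers /v1/chat/completions for a given configuration."""
    value = server_config.get("openAiSupport")
    # A missing field or the value "vllm" selects the vLLM-compatible OpenAI API.
    if value is None or value == "vllm":
        return "vllm-compatible"
    # Any other value selects the API described in this section.
    return "native"


def chat_completions_url(ip_address: str, port: int) -> str:
    """Build the request URL from the service-plane ipAddress and port."""
    return f"https://{ip_address}:{port}/v1/chat/completions"
```

For example, `chat_completions_url("127.0.0.1", 1025)` produces `https://127.0.0.1:1025/v1/chat/completions`; the address and port here are placeholders.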
Request Parameters
Parameter |
Mandatory/Optional |
Description |
Value |
|||
|---|---|---|---|---|---|---|
model |
Mandatory |
Indicates the model name. |
The value must be the same as the value of modelName in the MindIE Server configuration file. |
|||
messages |
Mandatory |
Indicates the structure of the inference request message. |
The value is of the list type. The length is greater than 0 KB and less than or equal to 4 MB. Chinese and English are supported. The number of tokens after prompt tokenization is less than or equal to the minimum value among maxInputTokenLen, maxSeqLen-1, max_position_embeddings, and 1 MB. Obtain max_position_embeddings from the config.json file of the model weights, and the other parameters from the service configuration file. |
|||
- |
role |
Mandatory |
Indicates the role of the inference request message. |
Character string type. The available roles are as follows:
|
||
content |
Mandatory |
Inference request content. The value is of the string type or the list type for a single-modal text model and of the list type for a multi-modal model. |
|
|||
- |
type |
Optional |
Indicates the inference request content type. |
Instructions for using multimedia files:
NOTE:
Security warning:
|
||
text |
Optional |
Indicates that the inference request content is text. |
The value cannot be empty. Both Chinese and English are supported. |
|||
image_url |
Optional |
Indicates that the inference request content is an image. |
Local JPG, PNG, JPEG, and Base64-encoded JPG images can be imported in URL format. Both HTTP and HTTPS protocols are supported. Currently, the maximum size of an image is 40 MB. |
|||
video_url |
Optional |
Indicates that the inference request content is a video. |
Local MP4, AVI, and WMV videos can be imported in URL format. Both HTTP and HTTPS protocols are supported. Currently, the maximum size of a video file is 512 MB. |
|||
audio_url |
Optional |
Indicates that the inference request content is audio. |
Local MP3, WAV, and FLAC audio files can be imported in URL format. Both HTTP and HTTPS protocols are supported. Currently, the maximum size of an audio file is 40 MB. |
|||
tool_calls |
Optional |
Indicates the tool call by a model. |
The type is List[dict]. It describes the tool calls made by the model when role is assistant. |
|||
- |
function |
Mandatory |
Indicates the tool called by a model. |
Dict type.
|
||
id |
Mandatory |
Indicates the ID of a tool called by a model. |
Character string. |
|||
type |
Mandatory |
Indicates the type of the tool called. |
Character string. Only "function" is supported. |
|||
tool_call_id |
Mandatory when role is set to tool. Otherwise, it is optional. |
Indicates the ID of a tool called by a model. |
Character string. |
|||
stream |
Optional |
Indicates whether the returned result is text inference or streaming inference. |
The value is of the Boolean type. The default value is false.
|
|||
presence_penalty |
Optional |
A value between -2.0 and 2.0 that penalizes new tokens based on whether they have already appeared in the text so far. Positive values penalize tokens that have already been used, which increases the probability that the model talks about new topics. You are not advised to change this value together with repetition_penalty or frequency_penalty. |
The value is of the float type. The value range is [-2.0, 2.0]. The default value is 0.0. |
|||
frequency_penalty |
Optional |
A value between -2.0 and 2.0 that penalizes new tokens based on how frequently they have appeared in the text so far. Positive values penalize frequently used tokens, which reduces the probability that the model repeats the same content verbatim. You are not advised to change this value together with repetition_penalty or presence_penalty. |
The value is of the float type. The value range is [-2.0, 2.0]. The default value is 0.0. |
|||
repetition_penalty |
Optional |
Reduces the probability of duplicate fragments during text generation. It penalizes previously generated text, making the model more inclined to choose new, non-repeated content. You are not advised to change this value together with presence_penalty or frequency_penalty. |
The value is of the float type. The value range is (0.0, 2.0]. The default value is 1.0. |
|||
temperature |
Optional |
Controls the randomness of generation. Higher values produce more diversified outputs. |
The value is of the float type. The value range is [0.0, 2.0]. The default value is 1.0. A larger value makes the result more random and may prevent a function call from being triggered. You are advised to use a value greater than or equal to 0.001. If the value is less than 0.001, the text quality may be poor. |
|||
top_p |
Optional |
Controls the vocabulary range considered during model generation and selects candidate words using the cumulative probability until it exceeds a given threshold. This parameter can also control the diversity of generated results. |
The value is of the float type. The value range is (1e-6, 1.0]. The default value is 1.0. |
|||
top_k |
Optional |
Controls the vocabulary range considered during model generation. Only k candidate words with the highest probability are selected. |
The value is of the uint32_t type. The value range is (0, 2147483647]. If the field is not set, the default value is determined by the backend model.
If the value is greater than or equal to vocabSize, vocabSize is used instead. The value of vocabSize is the same as that of vocab_size or padded_vocab_size in the config.json file in the modelWeightPath directory. If neither vocab_size nor padded_vocab_size exists, vocabSize defaults to 0. You are advised to add vocab_size or padded_vocab_size to the config.json file. Otherwise, the inference may fail. |
|||
seed |
Optional |
Specifies the random seed of the inference process. The same seed value ensures the reproducibility of the inference result, and different seed values improve the randomness of the inference result. |
The value is of the uint64_t type. The value range is [0, 18446744073709551615]. If this parameter is not passed, the system generates a random seed value. When the value of seed is close to the maximum value, a warning is generated, which does not affect normal use. To avoid the warning, use a smaller seed value. |
|||
stop |
Optional |
Indicates the text for stopping inference. By default, the output result does not contain the stop word list text. |
The value is of the List[string] or string type. The default value is null.
This parameter cannot be used together with the function call feature, and does not support PD disaggregation and chain-of-thought content parsing. |
|||
stop_token_ids |
Optional |
Indicates the ID list of tokens for stopping inference. By default, the output does not contain the token ID in the list for stopping inference. |
The value is of the List[int32] type. Elements whose data type is not int32 will be ignored. The default value is null. This parameter is not supported in the PD disaggregation scenario, and cannot be used together with the function call feature. |
|||
include_stop_str_in_output |
Optional |
Determines whether to include the stop string in the generated inference text. |
The value is of the Boolean type. The default value is false.
If stop or stop_token_ids is not passed, this field will be ignored. This parameter cannot be used together with the function call feature, and does not support PD disaggregation and chain-of-thought content parsing. |
|||
skip_special_tokens |
Optional |
Indicates whether to skip special tokens in the text generated by inference. |
The value is of the Boolean type. The default value is true.
|
|||
ignore_eos |
Optional |
Indicates whether to ignore the eos_token terminator during inference text generation. |
The value is of the Boolean type. The default value is false.
|
|||
max_tokens |
Optional |
Indicates the maximum number of tokens that can be generated during inference. The number of generated tokens is also affected by the maxIterTimes parameter in the configuration file. The number of inference tokens is less than or equal to the value of min(maxIterTimes, max_tokens). |
The value is of the integer type. The value range is (0, 2147483647]. The default value is the value of maxIterTimes. |
|||
use_beam_search |
Optional |
Indicates whether to enable beam search. |
The value is of the Boolean type. The default value is false. This parameter cannot be used together with the stop and stop_token_ids parameters.
This parameter cannot be used together with the MTP, function call, parallel decoding, PD disaggregation, and asynchronous scheduling features. This parameter does not support DeepSeek series models. |
|||
best_of |
Optional |
Returns best_of sequences when beam search is disabled. |
This parameter will be removed in later versions. The value is of the integer type. The value range is [1, 128]. The default value is 1. Also, the value can be null. When best_of is set to a value greater than 1, the temperature value must be greater than 0.
This parameter cannot be used together with the MTP, function call, parallel decoding, PD disaggregation, and asynchronous scheduling features. This parameter does not support DeepSeek series models. |
|||
n |
Optional |
Returns n sequences when best_of is null or not set, or when beam search is enabled. |
The value is of the integer type. The value range is [1, 128]. The default value is 1. Also, the value can be null. When n is set to a value greater than 1, the temperature value must be greater than 0.
This parameter cannot be used together with the MTP, function call, parallel decoding, and PD disaggregation features. When n is set to a value greater than 1, this parameter cannot be used together with the asynchronous scheduling feature. This parameter does not support DeepSeek series models. |
|||
logprobs |
Optional |
Indicates whether the inference result contains logprobs information. |
The value is of the Boolean type. The default value is false. This parameter cannot be used together with the function call, SplitFuse, prefix cache, parallel decoding, and PD disaggregation features. |
|||
top_logprobs |
Optional |
Specifies the number of logprobs carried by each token in the inference result. |
The value is of the integer type. The value range is [0, 20]. The default value is 0. If top_logprobs is assigned a valid value and logprobs is not assigned a value, logprobs is set to true. This parameter cannot be used together with the function call, SplitFuse, prefix cache, parallel decoding, and PD disaggregation features. |
|||
chat_template_kwargs |
Optional |
Indicates the chat template parameter. |
Dict type. |
|||
- |
enable_thinking |
Optional |
Indicates whether to enable the model chain-of-thought function. |
Bool type. |
||
tools |
Optional |
Indicates the list of tools that may be used. |
List[dict] type. |
|||
- |
type |
Mandatory |
Indicates the tool type. |
Only the character string "function" is supported. |
||
function |
Mandatory |
Indicates the function description. |
Dict type. |
|||
- |
name |
Mandatory |
Indicates the function name. |
Character string. |
||
strict |
Optional |
Indicates whether the generated tool calls strictly comply with the schema format. |
The value is of the Boolean type. The default value is false. |
|||
description |
Optional |
Describes the functions and usage. |
Character string. |
|||
parameters |
Optional |
Indicates the parameters accepted by the function. |
JSON schema format. |
|||
- |
type |
Optional |
Indicates the property type of function parameters. |
Character string. |
||
properties |
Optional |
Indicates the properties of function parameters. Each key indicates a parameter name, which is user-defined. The value is of the dict type and indicates the parameter description, including the type and description parameters. |
Dict type. |
|||
required |
Optional |
Indicates the list of mandatory function parameters. |
List[string] type. |
|||
additionalProperties |
Optional |
Indicates whether to allow the use of other parameters that are not mentioned. |
The value is of the Boolean type. The default value is false.
|
|||
tool_choice |
Optional |
Controls the tool call by a model. |
The value is of the string or dict type and can be null. The default value is "auto".
|
|||
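Several value ranges and mutual-exclusion rules from the table above can be checked on the client side before a request is sent. The following is a minimal sketch under assumptions: the function name and the message texts are hypothetical, only a subset of the documented rules is covered, and the server remains the authority on what it accepts.

```python
def validate_request(request: dict) -> list:
    """Return a list of problems found in a request body; an empty list means none found."""
    problems = []

    def check_range(name, low, high, low_inclusive=True):
        value = request.get(name)
        if value is None:  # parameter not set (or null): nothing to check
            return
        above_low = value >= low if low_inclusive else value > low
        if not (above_low and value <= high):
            left = "[" if low_inclusive else "("
            problems.append(f"{name}={value} is outside {left}{low}, {high}]")

    # Ranges copied from the Value column of the parameter table.
    check_range("presence_penalty", -2.0, 2.0)
    check_range("frequency_penalty", -2.0, 2.0)
    check_range("repetition_penalty", 0.0, 2.0, low_inclusive=False)
    check_range("temperature", 0.0, 2.0)
    check_range("top_p", 1e-6, 1.0, low_inclusive=False)
    check_range("max_tokens", 0, 2147483647, low_inclusive=False)
    check_range("top_logprobs", 0, 20)
    check_range("n", 1, 128)
    check_range("best_of", 1, 128)

    # n > 1 or best_of > 1 requires a temperature greater than 0.
    temperature = request.get("temperature")
    if temperature is None:
        temperature = 1.0  # documented default
    for name in ("n", "best_of"):
        count = request.get(name)
        if count is not None and count > 1 and not temperature > 0:
            problems.append(f"{name} > 1 requires temperature > 0")

    # Beam search cannot be combined with stop or stop_token_ids.
    if request.get("use_beam_search") and (
        request.get("stop") or request.get("stop_token_ids")
    ):
        problems.append("use_beam_search cannot be used with stop or stop_token_ids")

    # tool_call_id is mandatory for messages whose role is tool.
    for message in request.get("messages", []):
        if message.get("role") == "tool" and not message.get("tool_call_id"):
            problems.append("a message with role 'tool' needs tool_call_id")

    return problems
```

Calling `validate_request` on a request body returns an empty list when none of these rules is violated, so it can be used as a guard before the POST is issued.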
Usage Example
Request example:
POST https://{ip}:{port}/v1/chat/completions
- Single-turn dialogue:
- Single-modal model:
{
    "model": "deepseek",
    "messages": [{
        "role": "user",
        "content": "You are a helpful assistant."
    }],
    "stream": false,
    "presence_penalty": 1.03,
    "frequency_penalty": 1.0,
    "repetition_penalty": 1.0,
    "temperature": 0.5,
    "top_p": 0.95,
    "top_k": 0,
    "seed": null,
    "stop": ["stop1", "stop2"],
    "stop_token_ids": [2, 13],
    "include_stop_str_in_output": false,
    "skip_special_tokens": true,
    "ignore_eos": false,
    "max_tokens": 20,
    "best_of": 1,
    "n": 1,
    "logprobs": false,
    "top_logprobs": null
}
- Multimodal model:
Change the value of image_url as needed.
{
    "model": "qwen2.5_vl",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "My name is Olivier and I"},
            {"type": "image_url", "image_url": "/xxxx/test.png"}
        ]
    }],
    "stream": false,
    "presence_penalty": 1.03,
    "frequency_penalty": 1.0,
    "repetition_penalty": 1.0,
    "temperature": 0.5,
    "top_p": 0.95,
    "top_k": 0,
    "seed": null,
    "stop": ["stop1", "stop2"],
    "stop_token_ids": [2, 13],
    "include_stop_str_in_output": false,
    "skip_special_tokens": true,
    "ignore_eos": false,
    "max_tokens": 20,
    "best_of": null,
    "n": null,
    "logprobs": false,
    "top_logprobs": null
}
- Multi-turn dialogue:
- First-turn request example:
{
    "model": "qwen",
    "messages": [
        {"role": "system", "content": "You are a helpful customer support assistant."},
        {"role": "user", "content": "Hi, can you tell me what is the best city in China? Just tell me the answer."}
    ],
    "stream": false,
    "presence_penalty": 1.03,
    "frequency_penalty": 1.0,
    "repetition_penalty": 1.0,
    "temperature": 0.5,
    "top_p": 0.95,
    "top_k": 1,
    "seed": null,
    "stop": ["stop1", "stop2"],
    "stop_token_ids": [2, 13],
    "include_stop_str_in_output": false,
    "skip_special_tokens": true,
    "ignore_eos": false,
    "max_tokens": 200,
    "chat_template_kwargs": {"enable_thinking": false}
}
- Second-turn request example:
{
    "model": "qwen",
    "messages": [
        {"role": "system", "content": "You are a helpful customer support assistant."},
        {"role": "user", "content": "Hi, can you tell me what is the best city in China? Just tell me the answer."},
        {"role": "assistant", "content": "The best city in China is subjective and depends on personal preferences, but **Shanghai** is often considered one of the most vibrant and dynamic cities in the country. It is a global financial hub, known for its modern skyline, cultural diversity, and international influence."},
        {"role": "user", "content": "What is the best thing in this city?"}
    ],
    "stream": false,
    "presence_penalty": 1.03,
    "frequency_penalty": 1.0,
    "repetition_penalty": 1.0,
    "temperature": 0.5,
    "top_p": 0.95,
    "top_k": 1,
    "seed": null,
    "stop": ["stop1", "stop2"],
    "stop_token_ids": [2, 13],
    "include_stop_str_in_output": false,
    "skip_special_tokens": true,
    "ignore_eos": false,
    "max_tokens": 20,
    "best_of": null,
    "chat_template_kwargs": {"enable_thinking": false}
}
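The second-turn request repeats the messages of the first turn, appends the assistant reply, and then adds the new user message. A minimal sketch of that bookkeeping follows; the function name is hypothetical, and the response is assumed to be a non-streaming chat.completion body whose reply text is at `choices[0].message.content`.

```python
def next_turn_messages(previous_messages: list, response: dict, user_text: str) -> list:
    """Build the messages list for the next turn of a dialogue."""
    reply = response["choices"][0]["message"]
    history = list(previous_messages)  # copy so the caller's list is not modified
    history.append({"role": "assistant", "content": reply["content"]})
    history.append({"role": "user", "content": user_text})
    return history
```

The returned list can be placed in the messages field of the next request while the other request parameters stay unchanged.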
Response example:
- Text inference (stream = false):
- Single-turn dialogue:
{
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1677652288,
    "model": "deepseek",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "\n\nHello there, how may I assist you today?"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 9,
        "prompt_tokens_details": {"cached_tokens": 0},
        "completion_tokens": 12,
        "total_tokens": 21,
        "batch_size": [1,1,1,1,1,1,1,1,1,1,1,1],
        "queue_wait_time": [5149,60,39,213,28,30,38,68,430,61,48,39]
    },
    "prefill_time": 200,
    "decode_time_arr": [56, 28, 28, 28, 32, 28, 28, 41, 28, 25, 28]
}
- Multi-turn dialogue:
- First-turn response example:
{
    "id": "endpoint_common_16",
    "object": "chat.completion",
    "created": 1762330491,
    "model": "qwen",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The best city in China is subjective and depends on personal preferences, but **Shanghai** is often considered one of the most vibrant and dynamic cities in the country.",
                "tool_calls": []
            },
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 45,
        "prompt_tokens_details": {"cached_tokens": 0},
        "completion_tokens": 54,
        "total_tokens": 78,
        "batch_size": [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
        "queue_wait_time": [5148,1045,923,737,764,723,796,832,809,798,814,790,683,665,808,765,568,646,744,780,773,790,780,735,785,715,722,726,750,760,791,755,791]
    },
    "prefill_time": 39,
    "decode_time_arr": [26,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27]
}
- Second-turn response example:
{
    "id": "endpoint_common_17",
    "object": "chat.completion",
    "created": 1762330612,
    "model": "qwen",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The best thing in Shanghai is its unique blend of **modern innovation** and **rich history**. offering",
                "tool_calls": []
            },
            "logprobs": null,
            "finish_reason": "length"
        }
    ],
    "usage": {
        "prompt_tokens": 117,
        "prompt_tokens_details": {"cached_tokens": 0},
        "completion_tokens": 20,
        "total_tokens": 137,
        "batch_size": [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
        "queue_wait_time": [5161,361,441,576,714,672,685,707,693,684,590,514,660,685,711,776,698,710,699,671]
    },
    "prefill_time": 47,
    "decode_time_arr": [28,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27]
}
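A non-streaming response carries the reply text, the finish reason, token usage, and timing arrays. The following sketch extracts a few of them; the function name and the keys of the returned dictionary are hypothetical, and because this section does not state the unit of prefill_time and decode_time_arr, the mean is reported in whatever unit the service uses.

```python
def summarize_response(response: dict) -> dict:
    """Extract the reply text, finish reason, and simple usage figures from a response."""
    choice = response["choices"][0]
    usage = response["usage"]
    decode_times = response.get("decode_time_arr", [])
    mean_decode = sum(decode_times) / len(decode_times) if decode_times else None
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        # "length" means generation stopped at the token limit, so the text may be cut off.
        "truncated": choice["finish_reason"] == "length",
        "total_tokens": usage["total_tokens"],
        "mean_decode_time": mean_decode,
    }
```

Checking the truncated flag is a simple way to detect replies that were cut off by max_tokens, as in the second-turn response above.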
- Streaming inference
- Streaming inference 1 (stream = true, returned in SSE format):
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"\t"},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"\t"},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","usage":{"prompt_tokens":54,"prompt_tokens_details": {"cached_tokens": 0},"completion_tokens":17,"total_tokens":71, "batch_size":[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],"queue_wait_time":[5318,117,82,72,196,64,60,55,53,54,56,67,61,64,53,52,49]},"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":"stop"}]}
data: [DONE]
- Streaming inference 2 (stream = true, fullTextEnabled = true, returned in SSE format):
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello!"},"finish_reason":null}]}
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How"},"finish_reason":null}]}
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can"},"finish_reason":null}]}
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I"},"finish_reason":null}]}
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist"},"finish_reason":null}]}
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you"},"finish_reason":null}]}
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you today"},"finish_reason":null}]}
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you today?"},"finish_reason":null}]}
data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","full_text":"Hello! How can I assist you today?","usage":{"prompt_tokens":31,"prompt_tokens_details": {"cached_tokens": 0},"completion_tokens":10,"total_tokens":41, "batch_size":[1,1,1,1,1,1,1,1,1,1],"queue_wait_time":[5318,117,82,72,196,64,60,55,53,54]},"choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you today?"},"finish_reason":"length"}]}
data: [DONE]
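The two streaming formats differ in how the client rebuilds the reply: in the first, each chunk carries only the newly generated text, while with fullTextEnabled each chunk carries the full text generated so far. A minimal sketch of a client-side reader follows; the function name is hypothetical, the input is assumed to be an iterable of already decoded SSE lines, and only the choice with index 0 is followed.

```python
import json


def collect_stream_text(sse_lines, full_text_enabled: bool = False) -> str:
    """Rebuild the reply text from the 'data:' lines of a streaming response."""
    text = ""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data field
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            if choice.get("index") != 0:
                continue
            content = choice.get("delta", {}).get("content") or ""
            if full_text_enabled:
                text = content  # each chunk already holds the text so far
            else:
                text += content  # each chunk holds only the new fragment
    return text
```

With an HTTP client that exposes the response body line by line, the decoded lines can be passed directly to this function.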