Inference API

Transformers 4.34.1 or later is required in the operating environment. Tokenizers in earlier versions do not support the chat_template method.
The tokenizer_config.json file in the inference model weight path must contain the chat_template field and its implementation.
The function-call parameters tool_call_id, tool_calls, tools, and tool_choice are supported only by some models, and an error may be reported if an unsupported model is used. Currently supported models: ChatGLM3-6B, DeepSeek-R1, Qwen2.5 series, and Qwen3 series.
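
The two prerequisites above can be checked before the service is started. The following is a minimal sketch, not part of the product: the helper names parse_version, transformers_version_ok, and has_chat_template are hypothetical, and the version string is assumed to be the one reported by transformers.__version__.

```python
import json
from pathlib import Path

# Minimum Transformers version stated in the prerequisites above.
MIN_TRANSFORMERS = (4, 34, 1)


def parse_version(version: str) -> tuple:
    """Turn '4.34.1' or '4.40.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def transformers_version_ok(version: str) -> bool:
    """Return True if the version is 4.34.1 or later."""
    return parse_version(version) >= MIN_TRANSFORMERS


def has_chat_template(weight_dir) -> bool:
    """Return True if tokenizer_config.json has a non-empty chat_template field."""
    config_path = Path(weight_dir) / "tokenizer_config.json"
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return bool(config.get("chat_template"))
```

In a real environment, pass transformers.__version__ to transformers_version_ok and the model weight path to has_chat_template.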

Function

Processes text/streaming inference.

Format

Operation type: POST

URL: https://{ip}:{port}/v1/chat/completions

Replace {ip} and {port} with the IP address and port number of the service plane, that is, ipAddress and port.
This URL is the same as the URL in v1/chat. The openAiSupport parameter in the config.json file determines which API handles the request:
- If the value is vllm or the configuration field is missing, the vLLM-compatible OpenAI API is used.
- If the value is anything else, this API is used.
For details, see the ServerConfig parameter description in "Core Concepts and Configurations" > "Configuration Parameters (Serving)" in MindIE LLM Development Guide.

Request Parameters

Parameter				Mandatory/Optional	Description	Value
model				Mandatory	Indicates the model name.	The value must be the same as the value of modelName in the MindIE Server configuration file.
messages				Mandatory	Indicates the structure of the inference request message.	The value is of the list type. The character length is greater than 0 KB and less than or equal to 4 MB. Chinese and English are supported. The number of tokens after prompt tokenization is less than or equal to the minimum value among maxInputTokenLen, maxSeqLen-1, max_position_embeddings, and 1 MB. Obtain max_position_embeddings from the weight file config.json, and the other parameters from the configuration file.
-	role			Mandatory	Indicates the role of the inference request message.	Character string type. The available roles are as follows: system: system role user: user role assistant: assistant role tool: tool role
	content			Mandatory	Inference request content. The value is of the string type or the list type for a single-modal text model and of the list type for a multi-modal model.	string: When role is set to assistant and tool_calls is not empty, content can be left empty. In all other cases, content cannot be empty. list: For details, see the multimodal model example in Usage Example.
	-	type		Optional	Indicates the inference request content type.	text: text image_url: image video_url: video audio_url: audio Instructions for using multimedia files: HTTP/HTTPS access mode: Configure the whitelist environment variable ALLOWED_MEDIA_DOMAINS_ENV first. The following is an example (replace xxx.xxx.xxx.xxx with the actual IP address of the resource): export ALLOWED_MEDIA_DOMAINS_ENV="upload.xxxmedia.org,cxxx.xxx.com,xxx.xxx.xxx.xxx" Local file mode: Place the multimedia file in the following directory: /data/multimodal_inputs/ NOTE: Security warning: Before using the multimedia file, ensure that its source is reliable and the content is secure to avoid potential risks. Prevent resolution to local or internal IP addresses, and do not use domain names (such as nip.io and sslip.io) that can resolve to any IP address. Before using the multimedia file, ensure that the disk space is sufficient for downloading it. The formula for calculating the reserved space is as follows: Maximum size of a single file × Maximum number of concurrent requests × 1.5 (reserved coefficient) For example, if the maximum size of a single file is 512 MB and the maximum number of concurrent requests is 1000, ensure that the remaining disk space is greater than 750 GB.
		text		Optional	Indicates that the inference request content is text.	The value cannot be empty. Both Chinese and English are supported.
		image_url		Optional	Indicates that the inference request content is an image.	Local JPG, PNG, JPEG, and Base64-encoded JPG images can be imported in URL format. Both HTTP and HTTPS protocols are supported. Currently, the maximum size of an image is 40 MB.
		video_url		Optional	Indicates that the inference request content is a video.	Local MP4, AVI, and WMV videos can be imported in URL format. Both HTTP and HTTPS protocols are supported. Currently, the maximum size of a video file is 512 MB.
		audio_url		Optional	Indicates that the inference request content is audio.	Local MP3, WAV, and FLAC audio files can be imported in URL format. Both HTTP and HTTPS protocols are supported. Currently, the maximum size of an audio file is 40 MB.
	tool_calls			Optional	Indicates the tool call by a model.	The type is List[dict]. It indicates the call of a model to the tool when role is assistant.
	-	function		Mandatory	Indicates the tool called by a model.	Dict type. arguments (mandatory): parameters to call a function. The value is a string in JSON format. name (mandatory): character string, which indicates the name of the function called.
		id		Mandatory	Indicates the ID of a tool called by a model.	Character string.
		type		Mandatory	Indicates the type of the tool called.	Character string. Only "function" is supported.
	tool_call_id			Mandatory when role is set to tool. Otherwise, it is optional.	Indicates the ID of a tool called by a model.	Character string.
stream				Optional	Indicates whether the returned result is text inference or streaming inference.	The value is of the Boolean type. The default value is false. true: streaming inference false: text inference
presence_penalty				Optional	Penalty in the range -2.0 to 2.0 that is applied to new tokens based on whether they already appear in the text so far. Positive values penalize tokens that have already been used, which increases the probability that the model talks about new topics. You are not advised to change this value together with repetition_penalty or frequency_penalty.	The value is of the float type. The value range is [-2.0, 2.0]. The default value is 0.0.
frequency_penalty				Optional	Penalty in the range -2.0 to 2.0 that is applied to new tokens based on their existing frequency in the text so far. Positive values penalize tokens that have been used frequently, which reduces the probability that the model repeats the same words. You are not advised to change this value together with repetition_penalty or presence_penalty.	The value is of the float type. The value range is [-2.0, 2.0]. The default value is 0.0.
repetition_penalty				Optional	Reduces the probability of duplicate fragments during text generation. It penalizes previously generated text, making the model more inclined to choose new, non-repeated content. You are not advised to change this value together with presence_penalty or frequency_penalty.	The value is of the float type. The value range is (0.0, 2.0]. The default value is 1.0.
temperature				Optional	Controls the randomness of generation. Higher values produce more diversified outputs.	The value is of the float type. The value range is [0.0, 2.0]. The default value is 1.0. A larger value indicates greater randomness of the result, and the function call may not be triggered. You are advised to use a value greater than or equal to 0.001. If the value is less than 0.001, the text quality may be poor.
top_p				Optional	Controls the vocabulary range considered during model generation and selects candidate words using the cumulative probability until it exceeds a given threshold. This parameter can also control the diversity of generated results.	The value is of the float type. The value range is (1e-6, 1.0]. The default value is 1.0.
top_k				Optional	Controls the vocabulary range considered during model generation. Only k candidate words with the highest probability are selected.	The uint32_t type. The value range is (0, 2147483647]. If the field is not set, the default value is determined by the backend model. atb (ATB Models): The configuration files are generation_config.json and config.json. generation_config.json has a higher priority. If top_k is specified neither in the request nor in the model weights, top_k is set to 1000 to balance performance and inference effect. ms (MindSpore): The configuration file is the file whose name ends with .yaml. If top_k is specified neither in the request nor in the model weights, top_k is set to 0. If the value is greater than or equal to vocabSize, vocabSize is used instead. The value of vocabSize is the same as that of vocab_size or padded_vocab_size in the config.json file in the modelWeightPath directory. If neither vocab_size nor padded_vocab_size exists, the default value 0 is used. You are advised to add vocab_size or padded_vocab_size to the config.json file. Otherwise, the inference may fail.
seed				Optional	Specifies the random seed of the inference process. The same seed value ensures the reproducibility of the inference result, and different seed values improve the randomness of the inference result.	The value is of the uint64_t type. The value range is [0, 18446744073709551615]. If this parameter is not passed, the system generates a random seed value. When the value of seed is close to the maximum value, a warning is generated, which does not affect normal use. To delete the warning, decrease the value of seed.
stop				Optional	Indicates the text for stopping inference. By default, the output result does not contain the stop word list text.	The value is of the List[string] or string type. The default value is null. List[string]: The list can contain a maximum of 1024 elements. The length of each element ranges from 1 to 1024 characters. The total length of the list elements cannot exceed 32768 (256 x 128) characters. If the list is empty, the value is equivalent to null. string: The length ranges from 1 to 1024 characters. This parameter cannot be used together with the function call feature, and does not support PD disaggregation and chain-of-thought content parsing.
stop_token_ids				Optional	Indicates the ID list of tokens for stopping inference. By default, the output does not contain the token ID in the list for stopping inference.	The value is of the List[int32] type. Elements whose data type is not int32 will be ignored. The default value is null. This parameter is not supported in the PD disaggregation scenario, and cannot be used together with the function call feature.
include_stop_str_in_output				Optional	Determines whether to include the stop string in the generated inference text.	The value is of the Boolean type. The default value is false. true: The stop string is included. false: The stop string is not included. If stop or stop_token_ids is not passed, this field will be ignored. This parameter cannot be used together with the function call feature, and does not support PD disaggregation and chain-of-thought content parsing.
skip_special_tokens				Optional	Indicates whether to skip special tokens in the text generated by inference.	The value is of the Boolean type. The default value is true. true: Special tokens are skipped. false: Special tokens are reserved.
ignore_eos				Optional	Indicates whether to ignore the eos_token terminator during inference text generation.	The value is of the Boolean type. The default value is false. true: Ignore the eos_token terminator. false: Do not ignore the eos_token terminator.
max_tokens				Optional	Indicates the maximum number of tokens that can be generated during inference. The number of generated tokens is also affected by the maxIterTimes parameter in the configuration file. The number of inference tokens is less than or equal to the value of min(maxIterTimes, max_tokens).	The value is of the integer type. The value range is (0, 2147483647]. The default value is the value of maxIterTimes.
use_beam_search				Optional	Indicates whether to enable beam search.	The value is of the Boolean type. The default value is false. This parameter cannot be used together with the stop and stop_token_ids parameters. Non-streaming: Returns n sequences. If n is null, one answer is returned. Streaming: Pseudo-streaming is used, that is, the inference result is returned all at once, so a large number of sequences increases the amount of data returned. Returns n sequences. If n is null, one answer is returned. This parameter cannot be used together with the MTP, function call, parallel decoding, PD disaggregation, and asynchronous scheduling features. This parameter does not support DeepSeek series models.
best_of				Optional	Returns best_of sequences when beam search is disabled.	This parameter will be removed in later versions. The value is of the integer type. The value range is [1, 128]. The default value is 1. Also, the value can be null. When best_of is set to a value greater than 1, the temperature value must be greater than 0. When use_beam_search is set to false, null, or not set: In non-streaming inference scenarios, if best_of and n are both set, the value of best_of must be greater than or equal to that of n. In streaming inference scenarios, best_of and n must be set to the same value, and best_of cannot be set separately. When use_beam_search is set to true, the value of best_of is not verified. This parameter cannot be used together with the MTP, function call, parallel decoding, PD disaggregation, and asynchronous scheduling features. This parameter does not support DeepSeek series models.
n				Optional	When best_of is set to null or not set, or beam search is enabled, n sequences are returned.	The value is of the integer type. The value range is [1, 128]. The default value is 1. Also, the value can be null. When n is set to a value greater than 1, the temperature value must be greater than 0. When use_beam_search is set to false, null, or not set: In non-streaming inference scenarios, if best_of and n are both set, the value of best_of must be greater than or equal to that of n. In streaming inference scenarios, best_of and n must be set to the same value, and best_of cannot be set separately. When use_beam_search is set to true, the value of best_of is not verified. This parameter cannot be used together with the MTP, function call, parallel decoding, and PD disaggregation features. When n is set to a value greater than 1, this parameter cannot be used together with the asynchronous scheduling feature. This parameter does not support DeepSeek series models.
logprobs				Optional	Indicates whether the inference result contains logprobs information.	The value is of the Boolean type. The default value is false. This parameter cannot be used together with the function call, SplitFuse, prefix cache, parallel decoding, and PD disaggregation features.
top_logprobs				Optional	Specifies the number of logprobs carried by each token in the inference result.	The value is of the integer type. The value range is [0, 20]. The default value is 0. If top_logprobs is assigned a valid value and logprobs is not assigned a value, logprobs is set to true. This parameter cannot be used together with the function call, SplitFuse, prefix cache, parallel decoding, and PD disaggregation features.
chat_template_kwargs				Optional	Indicates the chat template parameter.	Dict type.
-	enable_thinking			Optional	Indicates whether to enable the model chain-of-thought function.	Bool type.
tools				Optional	Indicates the list of tools that may be used.	List[dict] type.
-	type			Mandatory	Indicates the tool type.	Only the character string "function" is supported.
	function			Mandatory	Indicates the function description.	Dict type.
	-	name		Mandatory	Indicates the function name.	Character string.
		strict		Optional	Indicates whether the generated tool calls strictly comply with the schema format.	The value is of the Boolean type. The default value is false.
		description		Optional	Describes the functions and usage.	Character string.
		parameters		Optional	Indicates the parameters accepted by the function.	JSON schema format.
		-	type	Optional	Indicates the property type of function parameters.	Character string.
			properties	Optional	Indicates the properties of function parameters. Each key indicates a parameter name, which is user-defined. The value is of the dict type and indicates the parameter description, including the type and description parameters.	Dict type.
			required	Optional	Indicates the list of mandatory function parameters.	List[string] type.
			additionalProperties	Optional	Indicates whether to allow the use of other parameters that are not mentioned.	The value is of the Boolean type. The default value is false. true: Other parameters that are not mentioned can be used. false: Other parameters that are not mentioned cannot be used.
tool_choice				Optional	Controls the tool call by a model.	The value is of the string or dict type and can be null. The default value is "auto". "none": A model does not call any tool but generates a message. "auto": A model can generate messages or call one or more tools.
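
The reserved-space formula given for multimedia files in the type row (maximum size of a single file × maximum number of concurrent requests × 1.5) can be worked through as follows. This is a small sketch: the function names are hypothetical, and 1 GB is taken as 1024 MB, which is what reproduces the 750 GB figure in the table.

```python
import shutil


def reserved_disk_space_mb(max_file_mb: float, max_concurrent: int, factor: float = 1.5) -> float:
    """Reserved space = max single-file size x max concurrent requests x reserved coefficient."""
    return max_file_mb * max_concurrent * factor


def has_enough_space(path: str, required_mb: float) -> bool:
    """Compare the free space of the disk holding `path` with the required space."""
    free_mb = shutil.disk_usage(path).free / (1024 * 1024)
    return free_mb > required_mb


# Example from the table: 512 MB files, 1000 concurrent requests.
required_mb = reserved_disk_space_mb(512, 1000)
required_gb = required_mb / 1024
print(required_gb)  # 750.0
```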

Usage Example

Request example:

POST https://{ip}:{port}/v1/chat/completions

Request body:

Single-turn dialogue:

Single-modal model:

{
    "model": "deepseek",
    "messages": [{
        "role": "user",
        "content": "You are a helpful assistant."
    }],
    "stream": false,
    "presence_penalty": 1.03,
    "frequency_penalty": 1.0,
    "repetition_penalty": 1.0,
    "temperature": 0.5,
    "top_p": 0.95,
    "top_k": 0,
    "seed": null,
    "stop": ["stop1", "stop2"],
    "stop_token_ids": [2, 13],
    "include_stop_str_in_output": false,
    "skip_special_tokens": true,
    "ignore_eos": false,
    "max_tokens": 20,
    "best_of": 1,
    "n": 1,
    "logprobs": false,
    "top_logprobs":null
}
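
A request like the one above could be sent from Python roughly as follows. This is a sketch under assumptions: build_chat_request is a hypothetical helper, the address 127.0.0.1 and port 1025 are placeholders for ipAddress and port, and only a subset of the optional parameters is shown. The actual send is left as a comment because HTTPS deployments usually also need a certificate configuration that depends on the environment.

```python
import json
import urllib.request


def build_chat_request(ip: str, port: int, payload: dict) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    url = f"https://{ip}:{port}/v1/chat/completions"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "model": "deepseek",  # must match modelName in the MindIE Server configuration
    "messages": [{"role": "user", "content": "You are a helpful assistant."}],
    "stream": False,
    "temperature": 0.5,
    "top_p": 0.95,
    "max_tokens": 20,
}

request = build_chat_request("127.0.0.1", 1025, payload)  # placeholder address

# To actually send it (requires a reachable service and suitable TLS settings):
# with urllib.request.urlopen(request) as resp:
#     result = json.load(resp)
```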

Multimodal model:

Change the value of image_url as needed.

{
    "model": "qwen2.5_vl",
    "messages": [{
        "role": "user",
        "content": [
           {"type": "text", "text": "My name is Olivier and I"},
           {"type": "image_url", "image_url": "/xxxx/test.png"}
        ]
    }],
    "stream": false,
    "presence_penalty": 1.03,
    "frequency_penalty": 1.0,
    "repetition_penalty": 1.0,
    "temperature": 0.5,
    "top_p": 0.95,
    "top_k": 0,
    "seed": null,
    "stop": ["stop1", "stop2"],
    "stop_token_ids": [2, 13],
    "include_stop_str_in_output": false,
    "skip_special_tokens": true,
    "ignore_eos": false,
    "max_tokens": 20,
    "best_of": null,
    "n": null,
    "logprobs": false,
    "top_logprobs":null
}
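
The messages entry of a multimodal request can be assembled with a few client-side checks taken from the image_url row of the parameter table (JPG, PNG, or JPEG files, at most 40 MB). The sketch below is illustrative only: build_image_message is a hypothetical helper, 40 MB is assumed to mean 40 × 1024 × 1024 bytes, and the server performs its own validation regardless.

```python
from pathlib import PurePosixPath

MAX_IMAGE_BYTES = 40 * 1024 * 1024          # assumed meaning of "40 MB"
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # local image formats listed in the table


def build_image_message(text: str, image_path: str, size_bytes: int) -> dict:
    """Return a user message with one text part and one image part."""
    if not text:
        raise ValueError("text cannot be empty")
    suffix = PurePosixPath(image_path).suffix.lower()
    if suffix not in IMAGE_SUFFIXES:
        raise ValueError(f"unsupported image format: {suffix}")
    if size_bytes > MAX_IMAGE_BYTES:
        raise ValueError("image is larger than 40 MB")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": image_path},
        ],
    }


# The path below is the placeholder used in the request example above.
message = build_image_message("My name is Olivier and I", "/xxxx/test.png", 1024)
```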

Multi-turn dialogue:

First-turn request example:

{
    "model": "qwen",
    "messages": [
      {"role": "system", "content": "You are a helpful customer support assistant."},
      {"role": "user", "content": "Hi, can you tell me what is the best city in China? Just tell me the answer."}
    ],
    "stream": false,
    "presence_penalty": 1.03,
    "frequency_penalty": 1.0,
    "repetition_penalty": 1.0,
    "temperature": 0.5,
    "top_p": 0.95,
    "top_k": 1,
    "seed": null,
    "stop": ["stop1", "stop2"],
    "stop_token_ids": [2, 13],
    "include_stop_str_in_output": false,
    "skip_special_tokens": true,
    "ignore_eos": false,
    "max_tokens": 200,
    "chat_template_kwargs": {"enable_thinking": false}
}

Second-turn request example:

{
    "model": "qwen",
    "messages": [
      {"role": "system", "content": "You are a helpful customer support assistant."},
      {"role": "user", "content": "Hi, can you tell me what is the best city in China? Just tell me the answer."}, 
      {"role": "assistant", "content": "The best city in China is subjective and depends on personal preferences, but **Shanghai** is often considered one of the most vibrant and dynamic cities in the country. It is a global financial hub, known for its modern skyline, cultural diversity, and international influence."}, 
      {"role": "user", "content": "What is the best thing in this city?"}
    ],
    "stream": false,
    "presence_penalty": 1.03,
    "frequency_penalty": 1.0,
    "repetition_penalty": 1.0,
    "temperature": 0.5,
    "top_p": 0.95,
    "top_k": 1,
    "seed": null,
    "stop": ["stop1", "stop2"],
    "stop_token_ids": [2, 13],
    "include_stop_str_in_output": false,
    "skip_special_tokens": true,
    "ignore_eos": false,
    "max_tokens": 20,
    "best_of": null,
    "chat_template_kwargs": {"enable_thinking": false}
}
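
As the two requests above show, the service receives the whole conversation in messages on every turn, so the client has to carry the history forward. A minimal sketch of that bookkeeping follows; append_turn is a hypothetical helper, and the response dictionary only mimics the fields used here.

```python
def append_turn(messages: list, response: dict, next_user_text: str) -> list:
    """Return a new messages list extended with the assistant reply and the next user input."""
    reply = response["choices"][0]["message"]
    history = list(messages)  # copy so the caller's list is left untouched
    history.append({"role": reply["role"], "content": reply["content"]})
    history.append({"role": "user", "content": next_user_text})
    return history


first_turn = [
    {"role": "system", "content": "You are a helpful customer support assistant."},
    {"role": "user", "content": "Hi, can you tell me what is the best city in China? Just tell me the answer."},
]
# Trimmed-down stand-in for a first-turn response.
first_response = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Shanghai."}, "finish_reason": "stop"}
    ]
}
second_turn = append_turn(first_turn, first_response, "What is the best thing in this city?")
```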

Response example:

Text inference (stream = false):

Single-turn dialogue:

{
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1677652288,
    "model": "deepseek",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "\n\nHello there, how may I assist you today?"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 9,
        "prompt_tokens_details": {"cached_tokens": 0},
        "completion_tokens": 12,
        "total_tokens": 21,
        "batch_size": [1,1,1,1,1,1,1,1,1,1,1,1],
        "queue_wait_time": [5149,60,39,213,28,30,38,68,430,61,48,39]
    },
    "prefill_time": 200,
    "decode_time_arr": [56, 28, 28, 28, 32, 28, 28, 41, 28, 25, 28]
}

Multi-turn dialogue:

First-turn response example:

{
    "id":"endpoint_common_16",
    "object":"chat.completion",
    "created":1762330491,
    "model":"qwen",
    "choices":[
        {
            "index":0,
            "message":{
                "role":"assistant",
                "content":"The best city in China is subjective and depends on personal preferences, but **Shanghai** is often considered one of the most vibrant and dynamic cities in the country.",
                "tool_calls":[]
            },
            "logprobs":null,
            "finish_reason":"stop"
        }
    ],
    "usage": {
        "prompt_tokens":45,
        "prompt_tokens_details": {"cached_tokens": 0},
        "completion_tokens":33,
        "total_tokens":78,
        "batch_size":[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
        "queue_wait_time":[5148,1045,923,737,764,723,796,832,809,798,814,790,683,665,808,765,568,646,744,780,773,790,780,735,785,715,722,726,750,760,791,755,791]
    },
    "prefill_time":39,
    "decode_time_arr":[26,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27]
}

Second-turn response example:

{
    "id":"endpoint_common_17",
    "object":"chat.completion",
    "created":1762330612,
    "model":"qwen",
    "choices":[
        {
            "index":0,
            "message":{
                "role":"assistant",
                "content":"The best thing in Shanghai is its unique blend of **modern innovation** and **rich history**. offering",
                "tool_calls":[]
            },
            "logprobs":null,
            "finish_reason":"length"
        }
    ],
    "usage":{
        "prompt_tokens":117,
        "prompt_tokens_details": {"cached_tokens": 0},
        "completion_tokens":20,
        "total_tokens":137,
        "batch_size":[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
        "queue_wait_time":[5161,361,441,576,714,672,685,707,693,684,590,514,660,685,711,776,698,710,699,671]
    },
    "prefill_time":47,
    "decode_time_arr":[28,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27]
}

Streaming inference

Streaming inference 1 (stream = true, returned in SSE format):

data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"\t"},"finish_reason":null}]}

data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"\t"},"finish_reason":null}]}

data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","usage":{"prompt_tokens":54,"prompt_tokens_details": {"cached_tokens": 0},"completion_tokens":17,"total_tokens":71, "batch_size":[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],"queue_wait_time":[5318,117,82,72,196,64,60,55,53,54,56,67,61,64,53,52,49]},"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":"stop"}]}

data: [DONE]
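
A client can reassemble the text from an SSE stream like the one above by reading each data: line until [DONE]. The following sketch assumes a single choice (index 0) and incremental delta content, as in this example; collect_stream_text is a hypothetical helper and the sample lines are shortened stand-ins for real chunks.

```python
import json


def collect_stream_text(lines):
    """Concatenate delta content from SSE lines and return (text, usage)."""
    pieces = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        usage = chunk.get("usage", usage)  # usage arrives with the final chunk
        for choice in chunk.get("choices", []):
            pieces.append(choice.get("delta", {}).get("content") or "")
    return "".join(pieces), usage


sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"Hel"},"finish_reason":null}]}',
    "",
    'data: {"usage":{"completion_tokens":2},"choices":[{"index":0,"delta":{"role":"assistant","content":"lo"},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
text, usage = collect_stream_text(sample)
print(text)  # Hello
```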

Streaming inference 2 (stream = true, fullTextEnabled = true, returned in SSE format):

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello!"},"finish_reason":null}]}

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How"},"finish_reason":null}]}

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can"},"finish_reason":null}]}

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I"},"finish_reason":null}]}

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist"},"finish_reason":null}]}

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you"},"finish_reason":null}]}

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you today"},"finish_reason":null}]}

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you today?"},"finish_reason":null}]}

data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","full_text":"Hello! How can I assist you today?","usage":{"prompt_tokens":31,"prompt_tokens_details": {"cached_tokens": 0},"completion_tokens":10,"total_tokens":41, "batch_size":[1,1,1,1,1,1,1,1,1,1],"queue_wait_time":[5318,117,82,72,196,64,60,55,53,54]},"choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you today?"},"finish_reason":"length"}]}

data: [DONE]
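
In the fullTextEnabled example above, the content of each chunk appears to hold the whole text generated so far rather than only the newest piece. If a client wants only the newly added text per chunk, it can subtract the previous value. This is a sketch based on that observation; incremental_pieces is a hypothetical helper.

```python
def incremental_pieces(cumulative_texts):
    """Convert cumulative chunk contents into the newly added piece of each chunk."""
    previous = ""
    pieces = []
    for text in cumulative_texts:
        if text.startswith(previous):
            pieces.append(text[len(previous):])
        else:
            pieces.append(text)  # fallback if a chunk is not an extension of the last one
        previous = text
    return pieces


chunks = ["Hello", "Hello!", "Hello! How", "Hello! How can"]
print(incremental_pieces(chunks))  # ['Hello', '!', ' How', ' can']
```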

Output Description

**Table 1** Text inference result description
Parameter					Type	Description
id					String	Request ID.
object					String	Currently, all the returned result types are chat.completion.
created					Integer	Inference request timestamp, accurate to second.
model					String	Inference model used.
choices					List	Inference result list.
-	index				Integer	Index of the choices message. Currently, the value can only be 0.
	message				Object	Inference message.
	-	role			String	Role. Currently, "assistant" is returned.
		content			String	Inference text result.
		reasoning_content			String	Chain-of-thought content. When this parameter matches a stop word specified in the stop parameter, the request parameter include_stop_str_in_output will not take effect, and the matched stop word will be included in the reasoning_content output.
		tool_calls			List	Output of the tool calls performed by a model.
		-	function		Dict	Description of the function call.
			-	arguments	String	Parameters to call a function, which is a character string in JSON format.
			-	name	String	Name of the called function.
			id		String	ID of the tool called by the model.
			type		String	Tool type. Currently, only function is supported.
	logprobs				Object	Logprobs information.
	-	content			List	Logprobs information with a high probability.
		-	token		String	Word corresponding to the selected token.
			logprob		Float	Logprob of the selected token.
			bytes		List	UTF-8 encoding of the word corresponding to the selected token.
			top_logprobs		List	Logprobs of the candidate token.
			-	token	String	Word corresponding to the candidate token.
				logprob	Float	Logprob of the candidate token.
				bytes	List	UTF-8 code of the word corresponding to the candidate token.
	finish_reason				String	End cause. stop: (1) A request is canceled or stopped, and the response is discarded without the user being aware of it. (2) An error occurs during request execution. The response output is empty, and err_msg is not empty. (3) An error occurs during request input verification. The response output is empty, and err_msg is not empty. (4) The request ends normally when the EOS terminator is met. length: (1) A request ends because its maximum sequence length is reached, and the response is the output of the last iteration. (2) A request ends because its maximum output length (including the request parameter max_tokens and model parameters maxIterTimes, maxSeqLen, and max_position_embeddings) is reached, and the response is the output of the last iteration. tool_calls: The model calls the tool.
usage					Object	Inference result statistics.
-	prompt_tokens				Integer	Token length corresponding to the prompt text entered by a user.
	prompt_tokens_details				Object	Token details corresponding to the prompt text entered by a user.
	-	cached_tokens			Integer	Number of prompt tokens that hit the cache during inference. If the prefix cache feature is enabled, the actual value is returned. If it is disabled, the default value 0 is returned.
	completion_tokens				Integer	Number of tokens in the inference result, that is, the total number of tokens produced by the prefill and decode phases. When the maximum inference length of a request is the value of maxIterTimes, completion_tokens in the response of the decode node is maxIterTimes plus 1, because the first token produced by the prefill phase is also counted.
	completion_tokens_details				Object	Token details in the inference result.
	-	reasoning_tokens			Integer	Token length of the chain-of-thought content. This field is generated only when a model that supports deep thinking is called. For details about the supported models, see "Constraints" in "Feature Description" > "Interaction Features" > "Thinking Analysis" in MindIE LLM Development Guide.
	total_tokens				Integer	Total number of tokens in the request and the inference result.
	batch_size				List	Batch size at the time each token is generated during inference. The array length is the number of tokens in the generated sequence. When multiple sequences are generated at the same time, the values are shared by all sequences (each value is the batch size of all sequences in that round), and the array length is the number of tokens in the longest sequence.
	queue_wait_time				List	Queue waiting time at the time each token is generated during inference, in μs. The array length is the number of tokens in the generated sequence. When multiple sequences are generated at the same time, the values are shared by all sequences (each value is the queue waiting time of all sequences in that round), and the array length is the number of tokens in the longest sequence.
prefill_time					Float	Time to first token of the inference. When multiple sequences are generated, this parameter indicates the time to first token of all sequences.
decode_time_arr					List	Inference decode latency array. When multiple sequences are generated, this parameter indicates the common decode latency of all sequences. The length of the latency array is the number of decode tokens of the longest sequence.
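
The fields in the table above can be read with ordinary JSON handling. The following Python sketch is illustrative only: the sample response is hand-written rather than captured from a real service, the helper name summarize_completion is hypothetical, and the nesting of content, reasoning_content, and tool_calls under choices[0].message is assumed from the OpenAI-compatible response layout.

```python
import json

# Hand-written sample shaped like the fields in the table above (not real service output).
SAMPLE_RESPONSE = {
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "",
                "tool_calls": [
                    {
                        "id": "call_0",
                        "type": "function",
                        "function": {
                            "name": "get_weather",
                            "arguments": '{"city": "Paris"}',
                        },
                    }
                ],
            },
            "finish_reason": "tool_calls",
        }
    ],
    "usage": {
        "prompt_tokens": 12,
        "prompt_tokens_details": {"cached_tokens": 0},
        "completion_tokens": 8,
        "total_tokens": 20,
    },
}


def summarize_completion(response):
    """Collect commonly used fields of a non-streaming response into a flat dict."""
    choice = response["choices"][0]
    message = choice.get("message", {})  # assumed parent of content/tool_calls
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # arguments is a JSON-formatted string, so decode it before use.
        calls.append((function["name"], json.loads(function["arguments"])))
    usage = response.get("usage", {})
    return {
        "content": message.get("content"),
        "reasoning_content": message.get("reasoning_content"),
        "tool_calls": calls,
        "finish_reason": choice.get("finish_reason"),
        "cached_tokens": usage.get("prompt_tokens_details", {}).get("cached_tokens", 0),
        "total_tokens": usage.get("total_tokens"),
    }


summary = summarize_completion(SAMPLE_RESPONSE)
print(summary["finish_reason"], summary["tool_calls"])
```

Decoding arguments with json.loads can fail if a model emits malformed JSON, so production code would likely wrap that call in error handling.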

**Table 2** Streaming inference result description
Parameter						Type	Description
data						Object	Result returned by a single inference.
-	id					String	Request ID.
	object					String	Currently, "chat.completion.chunk" is returned.
	created					Integer	Inference request timestamp, accurate to the second.
	model					String	Inference model used.
	full_text					String	Full text result. This parameter is returned only when fullTextEnabled is set to true.
	usage					Object	Inference result statistics.
	-	prompt_tokens				Integer	Token length corresponding to the prompt text entered by a user.
		prompt_tokens_details				Object	Token details corresponding to the prompt text entered by a user.
		-	cached_tokens			Integer	Number of prompt tokens that hit the cache during inference. If the prefix cache feature is enabled, the actual value is returned. If it is disabled, the default value 0 is returned.
		completion_tokens				Integer	Number of tokens in the inference result, that is, the total number of tokens produced by the prefill and decode phases. When the maximum inference length of a request is the value of maxIterTimes, completion_tokens in the response of the decode node is maxIterTimes plus 1, because the first token produced by the prefill phase is also counted.
		completion_tokens_details				Object	Token details in the inference result.
		-	reasoning_tokens			Integer	Token length of the chain-of-thought content. This field is generated only when a model that supports deep thinking is called. For details about the supported models, see "Constraints" in "Feature Description" > "Interaction Features" > "Thinking Analysis" in MindIE LLM Development Guide.
		total_tokens				Integer	Total number of tokens in the request and the inference result.
		batch_size				List	Batch size at the time each token is generated during inference. The array length is the number of tokens in the generated sequence. When multiple sequences are generated at the same time, the values are shared by all sequences (each value is the batch size of all sequences in that round), and the array length is the number of tokens in the longest sequence.
		queue_wait_time				List	Queue waiting time at the time each token is generated during inference, in μs. The array length is the number of tokens in the generated sequence. When multiple sequences are generated at the same time, the values are shared by all sequences (each value is the queue waiting time of all sequences in that round), and the array length is the number of tokens in the longest sequence.
	choices					List	Streaming inference result.
	-	index				Integer	Index of the choices message. Currently, the value can only be 0.
		delta				Object	Inference result. The last response is empty.
		-	role			String	Role. Currently, "assistant" is returned.
			content			String	Inference text result.
			reasoning_content			String	Chain-of-thought content. When the chain-of-thought content matches a stop word specified in the stop parameter, the request parameter include_stop_str_in_output does not take effect, and the matched stop word is included in the reasoning_content output.
			tool_calls			List	Output of the tool calls performed by a model.
			-	function		Dict	Description of the function call.
				-	arguments	String	Parameters to call a function, which is a character string in JSON format.
				-	name	String	Name of the called function.
				id		String	ID of the tool called by the model.
				type		String	Tool type. Currently, only function is supported.
		logprobs				Object	Logprobs information.
		-	content			List	Logprobs information of the output tokens.
			-	token		String	Word corresponding to the selected token.
				logprob		Float	Logprob of the selected token.
				bytes		List	UTF-8 encoding of the word corresponding to the selected token.
				top_logprobs		List	Logprobs of the candidate tokens.
				-	token	String	Word corresponding to the candidate token.
					logprob	Float	Logprob of the candidate token.
					bytes	List	UTF-8 encoding of the word corresponding to the candidate token.
		finish_reason				String	End cause, which is returned only in the last inference result. stop: (1) the request is canceled or stopped, and the response is discarded without the user being aware of it; (2) an error occurs during request execution, the response output is empty, and err_msg is not empty; (3) an error occurs during request input verification, the response output is empty, and err_msg is not empty; (4) the request ends normally when the EOS token is encountered. length: (1) the request ends because its maximum sequence length is reached, and the response is the output of the last iteration; (2) the request ends because its maximum output length (including the request parameter max_tokens and the model parameters maxIterTimes, maxSeqLen, and max_position_embeddings) is reached, and the response is the output of the last iteration.
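
A client usually rebuilds the full answer by concatenating the delta fields of successive chunks. The following Python sketch shows one way to do that. It assumes that each chunk arrives as a server-sent-event line of the form data: {json} and that a data: [DONE] line may close the stream; both are taken from the OpenAI streaming convention and are not stated in the table. The sample lines are hand-written, and accumulate_stream is a hypothetical helper.

```python
import json

# Hand-written sample lines (not real service output).
SAMPLE_LINES = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "reasoning_content": "Add the numbers."}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "1 + 1"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": " = 2"}}]}',
    'data: {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}], '
    '"usage": {"prompt_tokens": 5, "completion_tokens": 7, "total_tokens": 12}}',
    "data: [DONE]",
]


def accumulate_stream(lines):
    """Merge streaming chunks into one result dict."""
    content, reasoning = [], []
    finish_reason, usage = None, None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and anything that is not a data line
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(payload)
        usage = chunk.get("usage") or usage
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}  # the last delta is empty
            content.append(delta.get("content") or "")
            reasoning.append(delta.get("reasoning_content") or "")
            # finish_reason is only present in the last inference result.
            finish_reason = choice.get("finish_reason") or finish_reason
    return {
        "content": "".join(content),
        "reasoning_content": "".join(reasoning),
        "finish_reason": finish_reason,
        "usage": usage,
    }


result = accumulate_stream(SAMPLE_LINES)
print(result["content"], result["finish_reason"])
```

Keeping the last non-empty finish_reason and usage means the helper does not depend on which chunk carries them.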

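The logprobs entries described in both tables can be converted into readable text and probabilities. The following Python sketch assumes that bytes is a list of integer byte values and that logprob is a natural logarithm, as in the OpenAI-compatible format; neither detail is stated in the tables. The sample data is hand-written, and decode_logprobs is a hypothetical helper.

```python
import math

# Hand-written sample shaped like the logprobs object (not real service output).
SAMPLE_LOGPROBS = {
    "content": [
        {
            "token": "Hi",
            "logprob": -0.25,
            "bytes": [72, 105],
            "top_logprobs": [
                {"token": "Hi", "logprob": -0.25, "bytes": [72, 105]},
                {"token": "Hey", "logprob": -1.75, "bytes": [72, 101, 121]},
            ],
        }
    ]
}


def decode_logprobs(logprobs):
    """Return one row per selected token with decoded text and probabilities."""
    rows = []
    for entry in logprobs.get("content", []):
        # Rebuild the token text from its UTF-8 bytes (assumed to be integers 0-255).
        text = bytes(entry["bytes"]).decode("utf-8", errors="replace")
        candidates = [
            (candidate["token"], math.exp(candidate["logprob"]))
            for candidate in entry.get("top_logprobs", [])
        ]
        rows.append(
            {
                "text": text,
                "probability": math.exp(entry["logprob"]),  # assumes natural log
                "candidates": candidates,
            }
        )
    return rows


rows = decode_logprobs(SAMPLE_LOGPROBS)
print(rows[0]["text"], round(rows[0]["probability"], 3))
```

A single token may hold only part of a multi-byte character, which is why the sketch decodes with errors="replace" instead of raising an exception.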