Inference API

  • The version of Transformers in the operating environment cannot be earlier than 4.34.1. The tokenizer of an earlier version does not support the chat_template method.
  • The tokenizer_config.json file in the inference model weight path must contain the chat_template field and its implementation.
  • The function call parameters tool_call_id, tool_calls, tools, and tool_choice are supported only by some models. An error may be reported if an unsupported model is used. Currently, the following models are supported: ChatGLM3-6B, DeepSeek-R1, Qwen2.5 series, and Qwen3 series.

Function

Processes text/streaming inference.

Format

Operation type: POST

URL: https://{ip}:{port}/v1/chat/completions

  • Replace {ip} and {port} with the IP address and port number of the service plane, that is, ipAddress and port.
  • This URL is the same as the URL in v1/chat. Use the openAiSupport parameter in the config.json file to distinguish them.
    • If the value is vllm or the configuration field is missing, the vLLM-compatible OpenAI API is used.
    • If the value contains any other characters, this API is used.

    For details, see the ServerConfig parameter description in "Core Concepts and Configurations" > "Configuration Parameters (Serving)" in MindIE LLM Development Guide.
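As a minimal sketch of the request format above, the following Python snippet builds (but does not send) a POST request for this endpoint using only the standard library. The helper name build_chat_request, the address 127.0.0.1:1025, and the model name qwen are illustrative assumptions; substitute the ipAddress, port, and modelName from your own configuration.

```python
import json
import urllib.request


def build_chat_request(ip: str, port: int, body: dict) -> urllib.request.Request:
    """Build a POST request for /v1/chat/completions without sending it."""
    url = f"https://{ip}:{port}/v1/chat/completions"
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical address and model name, for illustration only.
request = build_chat_request(
    "127.0.0.1",
    1025,
    {"model": "qwen", "messages": [{"role": "user", "content": "Hello"}]},
)
print(request.get_method(), request.full_url)
```

Actually sending the request, for example with urllib.request.urlopen, also requires TLS settings that match the certificate configuration of the server, which this sketch does not cover.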

Request Parameters

Parameter

Mandatory/Optional

Description

Value

model

Mandatory

Indicates the model name.

The value must be the same as the value of modelName in the MindIE Server configuration file.

messages

Mandatory

Indicates the structure of the inference request message.

The value is of the list type. The character length is greater than 0 KB and less than or equal to 4 MB. Chinese and English are supported. The number of tokens after prompt tokenization must be less than or equal to the minimum value among maxInputTokenLen, maxSeqLen-1, max_position_embeddings, and 1 MB. Obtain max_position_embeddings from the weight file config.json, and the other parameters from the configuration file.

-

role

Mandatory

Indicates the role of the inference request message.

Character string type. The available roles are as follows:

  • system: system role
  • user: user role
  • assistant: assistant role
  • tool: tool role

content

Mandatory

Inference request content.

The value is of the string type or the list type for a single-modal text model and of the list type for a multi-modal model.

  • string:
    • When role is set to assistant and tool_calls is not empty, content can be left empty. For all other roles, content cannot be empty.
    • In other cases, content cannot be empty.
  • list: For details, see the multimodal model example in Usage Example.

-

type

Optional

Indicates the inference request content type.

  • text: text
  • image_url: image
  • video_url: video
  • audio_url: audio
Instructions for using multimedia files:
  • HTTP/HTTPS access mode: Configure the whitelist environment variable ALLOWED_MEDIA_DOMAINS_ENV first. The following is an example (replace xxx.xxx.xxx.xxx with the actual IP address of the resource):
    export ALLOWED_MEDIA_DOMAINS_ENV="upload.xxxmedia.org,cxxx.xxx.com,xxx.xxx.xxx.xxx"
  • Local file mode: Place the multimedia file in the following directory:
    /data/multimodal_inputs/
NOTE:

Security warning:

  • Before using the multimedia file, ensure that its source is reliable and the content is secure to avoid potential risks.
  • Prevent resolution to local or internal IP addresses, and do not use domain names (such as nip.io and sslip.io) that can resolve to any IP address.
  • Before using the multimedia file, ensure that the disk space is sufficient for downloading it. The formula for calculating the reserved space is as follows:

    Maximum size of a single file × Maximum number of concurrent requests × 1.5 (reserved coefficient)

    For example, if the maximum size of a single file is 512 MB and the maximum number of concurrent requests is 1000, ensure that the remaining disk space is greater than 750 GB.
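The reserved-space formula above can be checked with a few lines of Python. This is only an arithmetic sketch: the function name reserved_disk_space_mb is made up for illustration, and 1 GB is taken as 1024 MB so that the result matches the 750 GB figure in the example.

```python
def reserved_disk_space_mb(max_file_mb: float, max_concurrent_requests: int,
                           reserve_coefficient: float = 1.5) -> float:
    """Maximum single-file size x maximum concurrent requests x reserved coefficient."""
    return max_file_mb * max_concurrent_requests * reserve_coefficient


# Values from the example: 512 MB per file, 1000 concurrent requests.
space_mb = reserved_disk_space_mb(512, 1000)
space_gb = space_mb / 1024  # assumes 1 GB = 1024 MB
print(f"{space_gb:.0f} GB")  # prints "750 GB"
```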

text

Optional

Indicates that the inference request content is text.

The value cannot be empty. Both Chinese and English are supported.

image_url

Optional

Indicates that the inference request content is an image.

Local JPG, PNG, JPEG, and Base64-encoded JPG images can be imported in URL format. Both HTTP and HTTPS protocols are supported. Currently, the maximum size of an image is 40 MB.

video_url

Optional

Indicates that the inference request content is a video.

Local MP4, AVI, and WMV videos can be imported in URL format. Both HTTP and HTTPS protocols are supported. Currently, the maximum size of a video file is 512 MB.

audio_url

Optional

Indicates that the inference request content is audio.

Local MP3, WAV, and FLAC audio files can be imported in URL format. Both HTTP and HTTPS protocols are supported. Currently, the maximum size of an audio file is 40 MB.

tool_calls

Optional

Indicates the tool call by a model.

The type is List[dict]. It describes the tool calls made by the model when role is assistant.

-

function

Mandatory

Indicates the tool called by a model.

Dict type.

  • arguments (mandatory): parameters to call a function. The value is a string in JSON format.
  • name (mandatory): character string, which indicates the name of the function called.

id

Mandatory

Indicates the ID of a tool called by a model.

Character string.

type

Mandatory

Indicates the type of the tool called.

Character string. Only "function" is supported.

tool_call_id

Mandatory when role is set to tool. Otherwise, it is optional.

Indicates the ID of a tool called by a model.

Character string.
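The rules for role, content, tool_calls, and tool_call_id described so far can be sketched as a small client-side check. This is an illustrative pre-check only, not the validation performed by the server; the function name check_message and the sample tool name get_weather and ID call_1 are hypothetical.

```python
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def check_message(message: dict) -> list:
    """Return a list of problems found in one entry of `messages`."""
    problems = []
    role = message.get("role")
    if role not in ALLOWED_ROLES:
        problems.append(f"unsupported role: {role!r}")
    content = message.get("content")
    has_tool_calls = bool(message.get("tool_calls"))
    # content may be empty only for an assistant message that carries tool_calls.
    if not content and not (role == "assistant" and has_tool_calls):
        problems.append("content must not be empty")
    # tool_call_id is mandatory when role is tool.
    if role == "tool" and not message.get("tool_call_id"):
        problems.append("tool_call_id is required when role is tool")
    # Each tool call needs id, type "function", and function.name/arguments.
    for call in message.get("tool_calls") or []:
        function = call.get("function") or {}
        if not call.get("id") or call.get("type") != "function":
            problems.append("tool call needs an id and type 'function'")
        if "name" not in function or "arguments" not in function:
            problems.append("tool call function needs name and arguments")
    return problems


# Hypothetical assistant message that calls a tool and leaves content empty.
assistant_call = {
    "role": "assistant",
    "content": "",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Beijing\"}"},
    }],
}
print(check_message(assistant_call))
```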

stream

Optional

Indicates whether the returned result is text inference or streaming inference.

The value is of the Boolean type. The default value is false.

  • true: streaming inference
  • false: text inference

presence_penalty

Optional

Presence penalty in the range of -2.0 to 2.0. It controls how the model penalizes new tokens based on whether they have already appeared in the text so far. Positive values penalize tokens that have already been used, which increases the probability that the model talks about new topics.

You are not advised to change this value together with repetition_penalty or frequency_penalty.

The value is of the float type. The value range is [-2.0, 2.0]. The default value is 0.0.

frequency_penalty

Optional

Frequency penalty in the range of -2.0 to 2.0. It controls how the model penalizes new tokens based on how frequently they have appeared in the text so far. Positive values penalize tokens that have been used frequently, which reduces the probability that the model repeats the same text verbatim.

You are not advised to change this value together with repetition_penalty or presence_penalty.

The value is of the float type. The value range is [-2.0, 2.0]. The default value is 0.0.

repetition_penalty

Optional

Reduces the probability of duplicate fragments during text generation. It penalizes previously generated text, making the model more inclined to choose new, non-repeated content.

You are not advised to change this value together with presence_penalty or frequency_penalty.

The value is of the float type. The value range is (0.0, 2.0]. The default value is 1.0.

temperature

Optional

Controls the randomness of generation. Higher values produce more diversified outputs.

The value is of the float type. The value range is [0.0, 2.0]. The default value is 1.0.

A larger value indicates greater randomness of the result, and the function call may not be triggered. You are advised to use a value greater than or equal to 0.001. If the value is less than 0.001, the text quality may be poor.

top_p

Optional

Controls the vocabulary range considered during model generation and selects candidate words using the cumulative probability until it exceeds a given threshold. This parameter can also control the diversity of generated results.

The value is of the float type. The value range is (1e-6, 1.0]. The default value is 1.0.

top_k

Optional

Controls the vocabulary range considered during model generation. Only k candidate words with the highest probability are selected.

The value is of the uint32_t type. The value range is (0, 2147483647].

If the field is not set, the default value is determined by the backend model.

  • atb (ATB Models): The configuration files are generation_config.json and config.json. generation_config.json has a higher priority. If top_k is specified neither in the request nor in the model weights, top_k is set to 1000 to balance performance and inference quality.
  • ms (MindSpore): The configuration file is the file whose name ends with .yaml. If top_k is specified neither in the request nor in the model weights, top_k is set to 0.

If the value is greater than or equal to vocabSize, vocabSize is used instead. The value of vocabSize is the same as that of vocab_size or padded_vocab_size in the config.json file in the modelWeightPath directory. If neither vocab_size nor padded_vocab_size exists, the default value 0 is used. You are advised to add vocab_size or padded_vocab_size to the config.json file. Otherwise, the inference may fail.

seed

Optional

Specifies the random seed of the inference process. The same seed value ensures the reproducibility of the inference result, and different seed values improve the randomness of the inference result.

The value is of the uint64_t type. The value range is [0, 18446744073709551615]. If this parameter is not passed, the system generates a random seed value.

When the value of seed is close to the maximum value, a warning is generated, which does not affect normal use. To eliminate the warning, decrease the value of seed.

stop

Optional

Indicates the text for stopping inference. By default, the output result does not contain the stop word list text.

The value is of the List[string] or string type. The default value is null.

  • List[string]: The list can contain a maximum of 1024 elements. The length of each element ranges from 1 to 1024 characters. The total length of the list elements cannot exceed 32768 (256 x 128) characters. If the list is empty, the value is equivalent to null.
  • string: The length ranges from 1 to 1024 characters.

This parameter cannot be used together with the function call feature, and does not support PD disaggregation and chain-of-thought content parsing.
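The length limits on stop can be sketched as a client-side check. The function name check_stop is hypothetical, and the sketch covers only the limits stated above, not the feature restrictions.

```python
def check_stop(stop) -> bool:
    """Return True if `stop` satisfies the documented type and length limits."""
    if stop is None:
        return True
    if isinstance(stop, str):
        return 1 <= len(stop) <= 1024
    if isinstance(stop, list):
        # An empty list is equivalent to null and is therefore accepted.
        if len(stop) > 1024:
            return False
        if not all(isinstance(s, str) and 1 <= len(s) <= 1024 for s in stop):
            return False
        # The total length of all elements cannot exceed 32768 characters.
        return sum(len(s) for s in stop) <= 32768
    return False


print(check_stop(["stop1", "stop2"]))
```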

stop_token_ids

Optional

Indicates the ID list of tokens for stopping inference. By default, the output does not contain the token ID in the list for stopping inference.

The value is of the List[int32] type. Elements whose data type is not int32 will be ignored. The default value is null.

This parameter is not supported in the PD disaggregation scenario, and cannot be used together with the function call feature.

include_stop_str_in_output

Optional

Determines whether to include the stop string in the generated inference text.

The value is of the Boolean type. The default value is false.

  • true: The stop string is included.
  • false: The stop string is not included.

If stop or stop_token_ids is not passed, this field will be ignored.

This parameter cannot be used together with the function call feature, and does not support PD disaggregation and chain-of-thought content parsing.

skip_special_tokens

Optional

Indicates whether to skip special tokens in the text generated by inference.

The value is of the Boolean type. The default value is true.

  • true: Special tokens are skipped.
  • false: Special tokens are reserved.

ignore_eos

Optional

Indicates whether to ignore the eos_token terminator during inference text generation.

The value is of the Boolean type. The default value is false.

  • true: Ignore the eos_token terminator.
  • false: Do not ignore the eos_token terminator.

max_tokens

Optional

Indicates the maximum number of tokens that can be generated during inference. The number of generated tokens is also affected by the maxIterTimes parameter in the configuration file. The number of inference tokens is less than or equal to the value of min(maxIterTimes, max_tokens).

The value is of the integer type. The value range is (0, 2147483647]. The default value is the value of maxIterTimes.
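The relationship between max_tokens and maxIterTimes can be sketched as follows. The function name effective_max_tokens is hypothetical, and the result is only the upper bound implied by these two parameters; other limits such as maxSeqLen may end generation earlier.

```python
def effective_max_tokens(max_iter_times: int, max_tokens=None) -> int:
    """Upper bound on generated tokens implied by maxIterTimes and max_tokens."""
    if max_tokens is None:
        # max_tokens defaults to the value of maxIterTimes.
        return max_iter_times
    if not 0 < max_tokens <= 2147483647:
        raise ValueError("max_tokens must be in (0, 2147483647]")
    return min(max_iter_times, max_tokens)


# Assumed maxIterTimes of 512 in the server configuration.
print(effective_max_tokens(512, 20))
```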

use_beam_search

Optional

Indicates whether to enable beam search.

The value is of the Boolean type. The default value is false. This parameter cannot be used together with the stop and stop_token_ids parameters.

  • Non-streaming

    Returns n sequences. If n is null, one answer is returned.

  • Streaming
    • Pseudo streaming: The complete inference result is returned at once. A large number of sequences increases the amount of data returned.
    • Returns n sequences. If n is null, one answer is returned.

This parameter cannot be used together with the MTP, function call, parallel decoding, PD disaggregation, and asynchronous scheduling features.

This parameter does not support DeepSeek series models.

best_of

Optional

Returns best_of sequences when beam search is disabled.

This parameter will be removed in later versions.

The value is of the integer type. The value range is [1, 128]. The default value is 1. Also, the value can be null. When best_of is set to a value greater than 1, the temperature value must be greater than 0.

  • When use_beam_search is set to false, null, or not set:
    • In non-streaming inference scenarios, if best_of and n are both set, the value of best_of must be greater than or equal to that of n.
    • In streaming inference scenarios, best_of and n must be set to the same value, and best_of cannot be set separately.
  • When use_beam_search is set to true, the value of best_of is not verified.

This parameter cannot be used together with the MTP, function call, parallel decoding, PD disaggregation, and asynchronous scheduling features.

This parameter does not support DeepSeek series models.

n

Optional

When best_of is set to null or not set, or beam search is enabled, n sequences are returned.

The value is of the integer type. The value range is [1, 128]. The default value is 1. Also, the value can be null. When n is set to a value greater than 1, the temperature value must be greater than 0.

  • When use_beam_search is set to false, null, or not set:
    • In non-streaming inference scenarios, if best_of and n are both set, the value of best_of must be greater than or equal to that of n.
    • In streaming inference scenarios, best_of and n must be set to the same value, and best_of cannot be set separately.
  • When use_beam_search is set to true, the value of best_of is not verified.

This parameter cannot be used together with the MTP, function call, parallel decoding, and PD disaggregation features.

When n is set to a value greater than 1, this parameter cannot be used together with the asynchronous scheduling feature.

This parameter does not support DeepSeek series models.
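The constraints between n, best_of, stream, use_beam_search, and temperature can be sketched as a client-side check. The function name check_n_and_best_of is hypothetical. The sketch interprets "the value of best_of is not verified" as skipping only the comparison between best_of and n, which is an assumption rather than documented behavior.

```python
def check_n_and_best_of(n=None, best_of=None, stream=False,
                        use_beam_search=False, temperature=1.0) -> list:
    """Return a list of violated constraints (empty if none were found)."""
    problems = []
    for name, value in (("n", n), ("best_of", best_of)):
        if value is None:
            continue
        if not 1 <= value <= 128:
            problems.append(f"{name} must be in [1, 128]")
        elif value > 1 and temperature <= 0:
            problems.append(f"temperature must be greater than 0 when {name} > 1")
    if use_beam_search:
        return problems  # assumption: best_of is not compared against n
    if stream:
        # Streaming: best_of must equal n and cannot be set on its own.
        if best_of is not None and best_of != n:
            problems.append("in streaming mode, best_of must equal n")
    elif n is not None and best_of is not None and best_of < n:
        problems.append("best_of must be greater than or equal to n")
    return problems


print(check_n_and_best_of(n=4, best_of=2))
```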

logprobs

Optional

Indicates whether the inference result contains logprobs information.

The value is of the Boolean type. The default value is false.

This parameter cannot be used together with the function call, SplitFuse, prefix cache, parallel decoding, and PD disaggregation features.

top_logprobs

Optional

Specifies the number of logprobs carried by each token in the inference result.

The value is of the integer type. The value range is [0, 20]. The default value is 0.

If top_logprobs is assigned a valid value and logprobs is not assigned a value, logprobs is set to true.

This parameter cannot be used together with the function call, SplitFuse, prefix cache, parallel decoding, and PD disaggregation features.

chat_template_kwargs

Optional

Indicates the chat template parameter.

Dict type.

-

enable_thinking

Optional

Indicates whether to enable the model chain-of-thought function.

Bool type.

tools

Optional

Indicates the list of tools that may be used.

List[dict] type.

-

type

Mandatory

Indicates the tool type.

Only the character string "function" is supported.

function

Mandatory

Indicates the function description.

Dict type.

-

name

Mandatory

Indicates the function name.

Character string.

strict

Optional

Indicates whether the generated tool calls strictly comply with the schema format.

The value is of the Boolean type. The default value is false.

description

Optional

Describes the functions and usage.

Character string.

parameters

Optional

Indicates the parameters accepted by the function.

JSON schema format.

-

type

Optional

Indicates the property type of function parameters.

Character string.

properties

Optional

Indicates the properties of function parameters. Each key indicates a parameter name, which is user-defined. The value is of the dict type and indicates the parameter description, including the type and description parameters.

Dict type.

required

Optional

Indicates the list of mandatory function parameters.

List[string] type.

additionalProperties

Optional

Indicates whether to allow the use of other parameters that are not mentioned.

The value is of the Boolean type. The default value is false.

  • true: Other parameters that are not mentioned can be used.
  • false: Other parameters that are not mentioned cannot be used.

tool_choice

Optional

Controls the tool call by a model.

The value is of the string or dict type and can be null. The default value is "auto".

  • "none": A model does not call any tool but generates a message.
  • "auto": A model can generate messages or call one or more tools.
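As an illustration of how the tools and tool_choice fields fit together, the following sketch assembles a request body with one function tool. The helper make_function_tool, the tool get_weather, its city parameter, and the model name qwen are hypothetical; only the field names come from the parameter descriptions above.

```python
import json


def make_function_tool(name, description, properties, required):
    """Build one entry of `tools` from a function name and its parameter schema."""
    return {
        "type": "function",  # only "function" is supported
        "function": {
            "name": name,
            "description": description,
            "strict": False,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
                "additionalProperties": False,
            },
        },
    }


# Hypothetical weather tool, for illustration only.
weather_tool = make_function_tool(
    "get_weather",
    "Query the current weather of a city.",
    {"city": {"type": "string", "description": "City name"}},
    ["city"],
)
body = {
    "model": "qwen",
    "messages": [{"role": "user", "content": "What is the weather in Beijing?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",
}
print(json.dumps(body, indent=2))
```

Whether a tool call is actually generated depends on the model and on sampling parameters such as temperature.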

Usage Example

Request example:

POST https://{ip}:{port}/v1/chat/completions
Request body:
  • Single-turn dialogue:
    • Single-modal model:
      {
          "model": "deepseek",
          "messages": [{
              "role": "user",
              "content": "You are a helpful assistant."
          }],
          "stream": false,
          "presence_penalty": 1.03,
          "frequency_penalty": 1.0,
          "repetition_penalty": 1.0,
          "temperature": 0.5,
          "top_p": 0.95,
          "top_k": 0,
          "seed": null,
          "stop": ["stop1", "stop2"],
          "stop_token_ids": [2, 13],
          "include_stop_str_in_output": false,
          "skip_special_tokens": true,
          "ignore_eos": false,
          "max_tokens": 20,
          "best_of": 1,
          "n": 1,
          "logprobs": false,
          "top_logprobs":null
      }
    • Multimodal model:

      Change the value of image_url as needed.

      {
          "model": "qwen2.5_vl",
          "messages": [{
              "role": "user",
              "content": [
                 {"type": "text", "text": "My name is Olivier and I"},
                 {"type": "image_url", "image_url": "/xxxx/test.png"}
              ]
          }],
          "stream": false,
          "presence_penalty": 1.03,
          "frequency_penalty": 1.0,
          "repetition_penalty": 1.0,
          "temperature": 0.5,
          "top_p": 0.95,
          "top_k": 0,
          "seed": null,
          "stop": ["stop1", "stop2"],
          "stop_token_ids": [2, 13],
          "include_stop_str_in_output": false,
          "skip_special_tokens": true,
          "ignore_eos": false,
          "max_tokens": 20,
          "best_of": null,
          "n": null,
          "logprobs": false,
          "top_logprobs":null
      }
  • Multi-turn dialogue:
    • First-turn request example:
      {
          "model": "qwen",
          "messages": [
            {"role": "system", "content": "You are a helpful customer support assistant."},
            {"role": "user", "content": "Hi, can you tell me what is the best city in China? Just tell me the answer."}
          ],
          "stream": false,
          "presence_penalty": 1.03,
          "frequency_penalty": 1.0,
          "repetition_penalty": 1.0,
          "temperature": 0.5,
          "top_p": 0.95,
          "top_k": 1,
          "seed": null,
          "stop": ["stop1", "stop2"],
          "stop_token_ids": [2, 13],
          "include_stop_str_in_output": false,
          "skip_special_tokens": true,
          "ignore_eos": false,
          "max_tokens": 200,
          "chat_template_kwargs": {"enable_thinking": false}
      }
    • Second-turn request example:
      {
          "model": "qwen",
          "messages": [
            {"role": "system", "content": "You are a helpful customer support assistant."},
            {"role": "user", "content": "Hi, can you tell me what is the best city in China? Just tell me the answer."}, 
            {"role": "assistant", "content": "The best city in China is subjective and depends on personal preferences, but **Shanghai** is often considered one of the most vibrant and dynamic cities in the country. It is a global financial hub, known for its modern skyline, cultural diversity, and international influence."}, 
            {"role": "user", "content": "What is the best thing in this city?"}
          ],
          "stream": false,
          "presence_penalty": 1.03,
          "frequency_penalty": 1.0,
          "repetition_penalty": 1.0,
          "temperature": 0.5,
          "top_p": 0.95,
          "top_k": 1,
          "seed": null,
          "stop": ["stop1", "stop2"],
          "stop_token_ids": [2, 13],
          "include_stop_str_in_output": false,
          "skip_special_tokens": true,
          "ignore_eos": false,
          "max_tokens": 20,
          "best_of": null,
          "chat_template_kwargs": {"enable_thinking": false}
      }
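The multi-turn examples above resend the whole conversation on every turn. A minimal sketch of how a client might build the messages list for the next turn is shown below; the function name next_turn_messages is hypothetical, and the assistant text is shortened for readability.

```python
def next_turn_messages(messages: list, assistant_reply: str, user_input: str) -> list:
    """Return a new messages list with the previous reply and the next question appended."""
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": user_input},
    ]


first_turn = [
    {"role": "system", "content": "You are a helpful customer support assistant."},
    {"role": "user", "content": "Hi, can you tell me what is the best city in China? Just tell me the answer."},
]
second_turn = next_turn_messages(
    first_turn,
    "The best city in China is subjective, but Shanghai is often considered one of the most vibrant.",
    "What is the best thing in this city?",
)
print([m["role"] for m in second_turn])
```

The helper returns a new list instead of modifying the original, so the first-turn messages can still be reused or logged unchanged.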
Response example:
  • Text inference (stream = false):
    • Single-turn dialogue:
      {
          "id": "chatcmpl-123",
          "object": "chat.completion",
          "created": 1677652288,
          "model": "deepseek",
          "choices": [
              {
                  "index": 0,
                  "message": {
                      "role": "assistant",
                      "content": "\n\nHello there, how may I assist you today?"
                  },
                  "finish_reason": "stop"
              }
          ],
          "usage": {
              "prompt_tokens": 9,
              "prompt_tokens_details": {"cached_tokens": 0},
              "completion_tokens": 12,
              "total_tokens": 21,
              "batch_size": [1,1,1,1,1,1,1,1,1,1,1,1],
              "queue_wait_time": [5149,60,39,213,28,30,38,68,430,61,48,39]
          },
          "prefill_time": 200,
          "decode_time_arr": [56, 28, 28, 28, 32, 28, 28, 41, 28, 25, 28]
      }
    • Multi-turn dialogue:
      • First-turn response example:
        {
            "id":"endpoint_common_16",
            "object":"chat.completion",
            "created":1762330491,
            "model":"qwen",
            "choices":[
                {
                    "index":0,
                    "message":{
                        "role":"assistant",
                        "content":"The best city in China is subjective and depends on personal preferences, but **Shanghai** is often considered one of the most vibrant and dynamic cities in the country.",
                        "tool_calls":[]
                    },
                    "logprobs":null,
                    "finish_reason":"stop"
                }
            ],
            "usage": {
                "prompt_tokens":45,
                "prompt_tokens_details": {"cached_tokens": 0},
                "completion_tokens":33,
                "total_tokens":78,
                "batch_size":[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
                "queue_wait_time":[5148,1045,923,737,764,723,796,832,809,798,814,790,683,665,808,765,568,646,744,780,773,790,780,735,785,715,722,726,750,760,791,755,791]
            },
            "prefill_time":39,
            "decode_time_arr":[26,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27]
        }
      • Second-turn response example:
        {
            "id":"endpoint_common_17",
            "object":"chat.completion",
            "created":1762330612,
            "model":"qwen",
            "choices":[
                {
                    "index":0,
                    "message":{
                        "role":"assistant",
                        "content":"The best thing in Shanghai is its unique blend of **modern innovation** and **rich history**. offering",
                        "tool_calls":[]
                    },
                    "logprobs":null,
                    "finish_reason":"length"
                }
            ],
            "usage":{
                "prompt_tokens":117,
                "prompt_tokens_details": {"cached_tokens": 0},
                "completion_tokens":20,
                "total_tokens":137,
                "batch_size":[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
                "queue_wait_time":[5161,361,441,576,714,672,685,707,693,684,590,514,660,685,711,776,698,710,699,671]
            },
            "prefill_time":47,
            "decode_time_arr":[28,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27]
        }
  • Streaming inference
    • Streaming inference 1 (stream = true, returned in SSE format):
      data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"\t"},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"\t"},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_8","object":"chat.completion.chunk","created":1729614610,"model":"deepseek","usage":{"prompt_tokens":54,"prompt_tokens_details": {"cached_tokens": 0},"completion_tokens":17,"total_tokens":71, "batch_size":[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],"queue_wait_time":[5318,117,82,72,196,64,60,55,53,54,56,67,61,64,53,52,49]},"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":"stop"}]}
      
      data: [DONE]
    • Streaming inference 2 (stream = true, fullTextEnabled = true, returned in SSE format):
      data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello!"},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How"},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can"},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I"},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist"},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you"},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you today"},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you today?"},"finish_reason":null}]}
      
      data: {"id":"endpoint_common_11","object":"chat.completion.chunk","created":1730184192,"model":"deepseek","full_text":"Hello! How can I assist you today?","usage":{"prompt_tokens":31,"prompt_tokens_details": {"cached_tokens": 0},"completion_tokens":10,"total_tokens":41, "batch_size":[1,1,1,1,1,1,1,1,1,1],"queue_wait_time":[5318,117,82,72,196,64,60,55,53,54]},"choices":[{"index":0,"delta":{"role":"assistant","content":"Hello! How can I assist you today?"},"finish_reason":"length"}]}
      
      data: [DONE]
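The two streaming formats above differ in how content should be combined: in the default mode each chunk carries a delta, while with fullTextEnabled each chunk repeats the text generated so far. The following sketch assembles the reply from SSE lines under that reading of the examples. The function name collect_stream_text is hypothetical, the sample chunks are shortened versions of the examples, and only the choice at index 0 is handled.

```python
import json


def collect_stream_text(lines, full_text_enabled=False):
    """Assemble the reply text and finish_reason from SSE `data:` lines."""
    text = ""
    finish_reason = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choice = json.loads(payload)["choices"][0]
        piece = choice["delta"].get("content") or ""
        if full_text_enabled:
            text = piece or text  # each chunk already holds the text so far
        else:
            text += piece  # each chunk holds only the new fragment
        finish_reason = choice.get("finish_reason") or finish_reason
    return text, finish_reason


# Shortened sample chunks modeled on the examples above.
delta_lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}',
    '',
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"!"},"finish_reason":"stop"}]}',
    '',
    'data: [DONE]',
]
print(collect_stream_text(delta_lines))
```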

Output Description

Table 1 Text inference result description

Parameter

Type

Description

id

String

Request ID.

object

String

Currently, all the returned result types are chat.completion.

created

Integer

Inference request timestamp, accurate to the second.

model

String

Inference model used.

choices

List

Inference result list.

-

index

Integer

Index of the choices message. Currently, the value can only be 0.

message

Object

Inference message.

-

role

String

Role. Currently, "assistant" is returned.

content

String

Inference text result.

reasoning_content

String

Chain-of-thought content.

When this parameter matches a stop word specified in the stop parameter, the request parameter include_stop_str_in_output will not take effect, and the matched stop word will be included in the reasoning_content output.

tool_calls

List

Output of the tool calls performed by a model.

-

function

Dict

Description of the function call.

-

arguments

String

Arguments for the function call, as a JSON-formatted string.

name

String

Name of the called function.

id

String

ID of the tool called by the model.

type

String

Tool type. Currently, only function is supported.

logprobs

Object

Logprobs information.

-

content

List

Logprobs information with a high probability.

-

token

String

Word corresponding to the selected token.

logprob

Float

Logprob of the selected token.

bytes

List

UTF-8 encoding of the word corresponding to the selected token.

top_logprobs

List

Logprobs of the candidate token.

-

token

String

Word corresponding to the candidate token.

logprob

Float

Logprob of the candidate token.

bytes

List

UTF-8 encoding of the word corresponding to the candidate token.

finish_reason

String

Reason why inference ended.

  • stop:
    • A request is canceled or stopped, and the response is discarded without the user being aware of it.
    • An error occurs during request execution. The response output is empty, and err_msg is not empty.
    • An error occurs during request input verification. The response output is empty, and err_msg is not empty.
    • The request ends normally when the EOS terminator is met.
  • length:
    • A request ends because its maximum sequence length is reached, and the response is the output of the last iteration.
    • A request ends because its maximum output length (including the request parameter max_tokens and model parameters maxIterTimes, maxSeqLen, and max_position_embeddings) is reached, and the response is the output of the last iteration.
  • tool_calls: The model calls the tool.

usage

Object

Inference result statistics.

-

prompt_tokens

Integer

Number of tokens in the prompt text entered by the user.

prompt_tokens_details

Object

Token details corresponding to the prompt text entered by a user.

-

cached_tokens

Integer

Number of tokens in the user's prompt text that hit the cache during inference.

If the prefix cache feature is enabled, the actual value is displayed. If it is disabled, the default value 0 is displayed.

completion_tokens

Integer

Number of tokens in the inference result, that is, the total number of tokens in the prefill and decode inference results. When the maximum inference length of a request is the value of maxIterTimes, completion_tokens in the response of the decode node is maxIterTimes plus 1, because the first token from the prefill inference result is also counted.

completion_tokens_details

Object

Token details in the inference result.

-

reasoning_tokens

Integer

Number of tokens in the chain-of-thought content.

This field is generated only when a model that supports deep thinking is called. For details about the supported models, see "Constraints" in "Feature Description" > "Interaction Features" > "Thinking Analysis" in MindIE LLM Development Guide.

total_tokens

Integer

Total number of tokens in the request and the inference result.

batch_size

List

Batch size when each token is generated during inference. The array length is the number of tokens in the generated sequence.

When multiple sequences are generated at the same time, this parameter indicates the common batch size of all sequences. The array length is the number of tokens of the longest sequence. (Each batch size represents the batch size of all sequences in the current round.)

queue_wait_time

List

Queue waiting time when each token is generated during inference, in μs. The array length is the number of tokens in the generated sequence.

When multiple sequences are generated at the same time, this parameter indicates the common queue waiting latency of all sequences. The array length is the number of tokens of the longest sequence. (Each queue waiting time represents the queue waiting time of all sequences in the current round.)

prefill_time

Float

Time to first token of the inference.

When multiple sequences are generated, this parameter indicates the time to first token of all sequences.

decode_time_arr

List

Inference decode latency array.

When multiple sequences are generated, this parameter indicates the common decode latency of all sequences. The length of the latency array is the number of decode tokens of the longest sequence.
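
As an illustration of how the fields in Table 1 fit together, the following Python sketch reads a text inference response, decodes the JSON string in each tool call's `arguments` field, and converts `queue_wait_time` from μs to ms. The function name `summarize_response`, the sample response, and the `get_weather` tool are hypothetical. The check that `total_tokens` equals `prompt_tokens` plus `completion_tokens` is an assumption drawn from the sample output (31 + 10 = 41), not a documented guarantee.

```python
import json


def summarize_response(resp: dict) -> dict:
    """Collect a few Table 1 fields from a text inference response."""
    choice = resp["choices"][0]  # index is currently always 0
    message = choice["message"]
    usage = resp.get("usage", {})

    # 'arguments' is a JSON-formatted string, so decode it into a dict.
    calls = [
        {
            "id": call.get("id"),
            "name": call["function"]["name"],
            "arguments": json.loads(call["function"]["arguments"]),
        }
        for call in (message.get("tool_calls") or [])
    ]

    waits_us = usage.get("queue_wait_time") or []
    mean_wait_ms = sum(waits_us) / len(waits_us) / 1000.0 if waits_us else None

    return {
        "finish_reason": choice.get("finish_reason"),
        "content": message.get("content"),
        "tool_calls": calls,
        # Assumption: total_tokens = prompt_tokens + completion_tokens.
        "tokens_consistent": usage.get("prompt_tokens", 0)
        + usage.get("completion_tokens", 0)
        == usage.get("total_tokens"),
        "mean_queue_wait_ms": mean_wait_ms,
    }


# Hypothetical response, hand-written for this sketch.
example = {
    "id": "endpoint_common_1",
    "object": "chat.completion",
    "created": 1730184192,
    "model": "deepseek",
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": "",
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": "{\"city\": \"Paris\"}",
                },
            }],
        },
        "finish_reason": "tool_calls",
    }],
    "usage": {
        "prompt_tokens": 31,
        "completion_tokens": 10,
        "total_tokens": 41,
        "queue_wait_time": [5000, 1000],
    },
}

summary = summarize_response(example)
print(summary)
```

Fields such as `tool_calls`, `reasoning_content`, and `logprobs` may be absent depending on the model and request parameters, which is why the sketch reads them with `.get()` rather than direct indexing.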

Table 2 Streaming inference result description

Parameter

Type

Description

data

Object

Result returned by a single inference.

-

id

String

Request ID.

object

String

Currently, "chat.completion.chunk" is returned.

created

Integer

Inference request timestamp, accurate to the second.

model

String

Inference model used.

full_text

String

Full text result. This parameter is returned only when fullTextEnabled is set to true.

usage

Object

Inference result statistics.

-

prompt_tokens

Integer

Number of tokens in the prompt text entered by the user.

prompt_tokens_details

Object

Token details corresponding to the prompt text entered by a user.

-

cached_tokens

Integer

Number of tokens in the user's prompt text that hit the cache during inference.

If the prefix cache feature is enabled, the actual value is displayed. If it is disabled, the default value 0 is displayed.

completion_tokens

Integer

Number of tokens in the inference result, that is, the total number of tokens in the prefill and decode inference results. When the maximum inference length of a request is the value of maxIterTimes, completion_tokens in the response of the decode node is maxIterTimes plus 1, because the first token from the prefill inference result is also counted.

completion_tokens_details

Object

Token details in the inference result.

-

reasoning_tokens

Integer

Number of tokens in the chain-of-thought content.

This field is generated only when a model that supports deep thinking is called. For details about the supported models, see "Constraints" in "Feature Description" > "Interaction Features" > "Thinking Analysis" in MindIE LLM Development Guide.

total_tokens

Integer

Total number of tokens in the request and the inference result.

batch_size

List

Batch size when each token is generated during inference. The array length is the number of tokens in the generated sequence.

When multiple sequences are generated at the same time, this parameter indicates the common batch size of all sequences. The array length is the number of tokens of the longest sequence. (Each batch size represents the batch size of all sequences in the current round.)

queue_wait_time

List

Queue waiting time when each token is generated during inference, in μs. The array length is the number of tokens in the generated sequence.

When multiple sequences are generated at the same time, this parameter indicates the common queue waiting latency of all sequences. The array length is the number of tokens of the longest sequence. (Each queue waiting time represents the queue waiting time of all sequences in the current round.)

choices

List

Streaming inference result.

-

index

Integer

Index of the choices message. Currently, the value can only be 0.

delta

Object

Inference result. The last response is empty.

-

role

String

Role. Currently, "assistant" is returned.

content

String

Inference text result.

reasoning_content

String

Chain-of-thought content.

When this parameter matches a stop word specified in the stop parameter, the request parameter include_stop_str_in_output will not take effect, and the matched stop word will be included in the reasoning_content output.

tool_calls

List

Output of the tool calls performed by a model.

-

function

Dict

Description of the function call.

-

arguments

String

Arguments for the function call, as a JSON-formatted string.

name

String

Name of the called function.

id

String

ID of the tool called by the model.

type

String

Tool type. Currently, only function is supported.

logprobs

Object

Logprobs information.

-

content

List

Logprobs information with a high probability.

-

token

String

Word corresponding to the selected token.

logprob

Float

Logprob of the selected token.

bytes

List

UTF-8 encoding of the word corresponding to the selected token.

top_logprobs

List

Logprobs of the candidate token.

-

token

String

Word corresponding to the candidate token.

logprob

Float

Logprob of the candidate token.

bytes

List

UTF-8 encoding of the word corresponding to the candidate token.

finish_reason

String

Reason why inference ended, which is returned only in the last inference result.

  • stop:
    • A request is canceled or stopped, and the response is discarded without the user being aware of it.
    • An error occurs during request execution. The response output is empty, and err_msg is not empty.
    • An error occurs during request input verification. The response output is empty, and err_msg is not empty.
    • The request ends normally when the EOS terminator is met.
  • length:
    • A request ends because its maximum sequence length is reached, and the response is the output of the last iteration.
    • A request ends because its maximum output length (including the request parameter max_tokens and model parameters maxIterTimes, maxSeqLen, and max_position_embeddings) is reached, and the response is the output of the last iteration.