Token Inference API

Function

Performs inference on a sequence of input token IDs and returns the generated token IDs.

This API is scheduled for deprecation. The OpenAI API is recommended.

Format

Operation type: POST

URL: https://{ip}:{port}/v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer

  • Replace {ip} and {port} with the IP address and port number of the service plane, that is, the values of ipAddress and port.
  • ${MODEL_NAME} specifies the name of the model used for inference.
  • [/versions/${MODEL_VERSION}] is currently not supported. Do not include it in the URL.
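The URL format above can be sketched in Python as follows. This is a minimal illustration, not part of the API: the helper names build_infer_url and build_infer_request, the address 127.0.0.1:1025, and the Content-Type header are assumptions made for the example. The request object is only constructed here, not sent.

```python
import json
import urllib.request


def build_infer_url(ip: str, port: int, model_name: str) -> str:
    """Return the token inference URL; the unsupported /versions/... segment is left out."""
    return f"https://{ip}:{port}/v2/models/{model_name}/infer"


def build_infer_request(ip: str, port: int, model_name: str, body: dict) -> urllib.request.Request:
    """Wrap a JSON request body in a POST request without sending it."""
    return urllib.request.Request(
        url=build_infer_url(ip, port, model_name),
        data=json.dumps(body).encode("utf-8"),
        # Assumption: the service accepts a JSON content type header.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical address; use the real ipAddress and port of the service plane.
body = {
    "inputs": [{"name": "input0", "shape": [1, 3], "datatype": "UINT32",
                "data": [396, 319, 13996]}],
    "outputs": [{"name": "output0"}],
}
request = build_infer_request("127.0.0.1", 1025, "llama3-70b", body)
print(request.get_method(), request.full_url)
```

Sending the request (for example with urllib.request.urlopen) additionally depends on the TLS setup of the deployment, which is outside the scope of this sketch.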

Request Parameters

Parameter

Mandatory/Optional

Description

Value Range

id

Optional

Inference request ID.

The value is a string of a maximum of 256 characters. Only underscores (_), hyphens (-), uppercase letters, lowercase letters, and digits are allowed.

inputs

Mandatory

An array with only one element.

The length is 1.

-

name

Mandatory

Input name, which is fixed to input0.

The value contains a maximum of 256 characters.

shape

Mandatory

Dimensions of the input data. A one-dimensional shape [n] gives the data length n. A two-dimensional shape [1, n] describes one row and n columns, where n is the data length.

The data length range is (0, min(1024 × 1024, maxInputTokenLen, maxSeqLen-1, max_position_embeddings)]. Obtain max_position_embeddings from the weight file config.json, and the other parameters from the configuration file.

datatype

Mandatory

Data type. Currently, only UINT32 is supported. The data consists of token IDs.

UINT32

data

Mandatory

Array, indicating the input token IDs.

The data length is the same as that passed in shape.

Each token ID must be within the range of the model vocabulary.

outputs

Mandatory

Output structure of the inference result.

The length of outputs must be the same as that of inputs.

-

name

Mandatory

Output name of the inference result.

Character string.

parameters

Optional

Parameters related to model inference postprocessing.

-

-

temperature

Optional

Controls the randomness of generation. Higher values produce more diversified outputs.

The value is of the float type and must be greater than 1e-6. The default value is 1.0.

A value of at least 0.001 is recommended. With smaller values, the quality of the generated text may be poor.

A maximum value of 2.0 is recommended, although the appropriate value depends on the model.

top_k

Optional

Controls the vocabulary range considered during model generation. Only the k candidate tokens with the highest probability are considered.

The value is of the int32_t type. The value range is (0, 2147483647].

If this parameter is not set, the default value is determined by the backend model.

  • atb (ATB Models): The configuration files are generation_config.json and config.json, of which generation_config.json has the higher priority. If top_k is specified neither in the request nor in the model weights, it is set to 1000 to balance performance and inference quality.
  • ms (MindSpore): The configuration file is the file whose name ends with .yaml. If top_k is specified neither in the request nor in the model weights, it is set to 0.

If the value is greater than or equal to vocabSize, vocabSize is used instead. vocabSize is taken from vocab_size or padded_vocab_size in the config.json file in the modelWeightPath directory. If neither field exists, the default value 0 is used. You are advised to add vocab_size or padded_vocab_size to the config.json file. Otherwise, inference may fail.

top_p

Optional

Controls the vocabulary range considered during model generation. Candidate tokens are selected until their cumulative probability exceeds the given threshold. This parameter also affects the diversity of the generated results.

The value is of the float type. The value range is (1e-6, 1.0]. The default value is 1.0.

do_sample

Optional

Indicates whether to perform sampling.

The value is of the Boolean type. If this parameter is not passed, other postprocessing parameters determine whether sampling should be performed.

  • true: Sampling is performed.
  • false: Sampling is not performed.

seed

Optional

Specifies the random seed of the inference process. The same seed value ensures the reproducibility of the inference result, and different seed values improve the randomness of the inference result.

The value is of the uint64_t type. The value range is (0, 18446744073709551615]. If this parameter is not passed, the system generates a random seed value.

When the value of seed is close to the maximum value, a warning is generated. The warning does not affect normal use. To avoid it, use a smaller seed value.

repetition_penalty

Optional

Uses repetition penalty to reduce the probability of duplicate fragments during text generation. It penalizes previously generated text, making the model more inclined to choose new, non-repeated content.

The value is of the float type. The default value is 1.0. The value must be greater than 0.0.

  • A value smaller than 1.0 indicates that repetition is rewarded.
  • The value 1.0 indicates that repetition penalty is not performed.
  • A value greater than 1.0 indicates that repetition penalty is performed.

It is recommended that the maximum value be set to 2.0. The value depends on the model.

max_new_tokens

Optional

Specifies the maximum number of tokens that can be generated during inference. The number of generated tokens is also limited by the maxIterTimes parameter in the configuration file: it does not exceed min(maxIterTimes, max_new_tokens).

The value is of the int type. The value range is (0, 2147483647]. The default value is 20.

watermark

Optional

Indicates whether to add a model watermark.

Currently, postprocessing does not support this parameter.

The value is of the Boolean type. The default value is false.

  • true: The model watermark is added.
  • false: The model watermark is not added.

details

Optional

Indicates whether to return the detailed inference output result.

This configuration item applies to Triton text inference and has no effect on Triton token inference, so passing it is not recommended.

The value is of the Boolean type. The default value is false.

batch_size

Optional

Batch size of an inference request.

This configuration item applies to Triton text inference and has no effect on Triton token inference, so passing it is not recommended.

The value is of the int type. The value range is (0, 2147483647]. The default value is 1.

priority

Optional

Sets the request priority.

The value is of the uint64_t type. The value range is [1, 5]. The default value is 5.

A smaller value indicates a higher priority. The highest priority is 1.

timeout

Optional

Sets the maximum waiting time for the request. If the request times out, it is disconnected.

The value is of the uint64_t type. The value range is (0, 3600] (unit: second). The default value is 600.
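Three limits in the table above are derived from configuration values rather than fixed: the upper bound of the input data length, the top_k value that is actually applied, and the upper bound on generated tokens. The following Python sketch restates those formulas. The function names and the sample configuration values are hypothetical and serve only as illustration; the real values come from the configuration file and from config.json.

```python
def max_input_length(max_input_token_len: int, max_seq_len: int,
                     max_position_embeddings: int) -> int:
    """Upper bound of the data length:
    min(1024 * 1024, maxInputTokenLen, maxSeqLen - 1, max_position_embeddings)."""
    return min(1024 * 1024, max_input_token_len, max_seq_len - 1, max_position_embeddings)


def effective_top_k(top_k: int, vocab_size: int) -> int:
    """A top_k greater than or equal to vocabSize is replaced by vocabSize."""
    return min(top_k, vocab_size)


def max_generated_tokens(max_iter_times: int, max_new_tokens: int = 20) -> int:
    """Upper bound on generated tokens: min(maxIterTimes, max_new_tokens).
    The default of 20 is the documented default of max_new_tokens."""
    return min(max_iter_times, max_new_tokens)


# Hypothetical configuration values, chosen only for the example.
print(max_input_length(max_input_token_len=2048, max_seq_len=2560,
                       max_position_embeddings=8192))
print(effective_top_k(top_k=1_000_000, vocab_size=128_256))
print(max_generated_tokens(max_iter_times=512))
```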

Usage Example

Request example:

POST https://{ip}:{port}/v2/models/llama3-70b/infer

Request body:

{
    "id": "42",
    "inputs": [{
        "name": "input0",
        "shape": [
            1,
            10
        ],
        "datatype": "UINT32",
        "data": [
            396, 319, 13996, 29877, 29901, 29907, 3333, 20718, 316, 23924
        ]
    }],
    "outputs": [{
        "name": "output0"
    }],
    "parameters": {
        "temperature": 0.5,
        "top_k": 10,
        "top_p": 0.95,
        "do_sample": true,
        "seed": null,
        "repetition_penalty": 1.03,
        "max_new_tokens": 20,
        "watermark": true,
        "priority": 5,
        "timeout": 10
    }
}
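Before sending a body like the one above, a client could check it against the constraints in the request parameter table. The sketch below is one possible client-side check, not the validation performed by the service. The names ID_PATTERN, RANGES, and validate_body are hypothetical. It also assumes that an id must be non-empty and that a null value (such as "seed": null in the example) is treated as "not passed". The upper bound of the data length is not checked because it depends on the server configuration.

```python
import re

# Allowed id characters from the table; non-empty is an assumption of this sketch.
ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,256}")

# name: (low, high, low_inclusive); None means no documented upper bound.
RANGES = {
    "temperature": (1e-6, None, False),
    "top_k": (0, 2147483647, False),
    "top_p": (1e-6, 1.0, False),
    "seed": (0, 18446744073709551615, False),
    "repetition_penalty": (0.0, None, False),
    "max_new_tokens": (0, 2147483647, False),
    "priority": (1, 5, True),
    "timeout": (0, 3600, False),
}


def validate_body(body: dict) -> list:
    """Return a list of human-readable problems; an empty list means no problem was found."""
    errors = []
    request_id = body.get("id")
    if request_id is not None and not ID_PATTERN.fullmatch(str(request_id)):
        errors.append("id: invalid characters or length")
    inputs = body.get("inputs", [])
    if len(inputs) != 1:
        errors.append("inputs: must contain exactly one element")
    for item in inputs:
        if item.get("name") != "input0":
            errors.append("inputs.name: must be input0")
        if item.get("datatype") != "UINT32":
            errors.append("inputs.datatype: must be UINT32")
        shape, data = item.get("shape", []), item.get("data", [])
        # shape is [n] or [1, n]; n must equal the number of token IDs in data.
        shape_ok = (len(shape) in (1, 2) and shape[-1] == len(data)
                    and (len(shape) == 1 or shape[0] == 1))
        if not shape_ok:
            errors.append("inputs.shape: does not match data length")
    if len(body.get("outputs", [])) != len(inputs):
        errors.append("outputs: length must equal that of inputs")
    for name, value in body.get("parameters", {}).items():
        if name not in RANGES or value is None:
            continue  # Booleans and unlisted parameters are not range-checked here.
        low, high, low_inclusive = RANGES[name]
        too_low = value < low if low_inclusive else value <= low
        if too_low or (high is not None and value > high):
            errors.append(f"parameters.{name}: {value} is out of range")
    return errors


example = {
    "id": "42",
    "inputs": [{"name": "input0", "shape": [1, 3], "datatype": "UINT32",
                "data": [396, 319, 13996]}],
    "outputs": [{"name": "output0"}],
    "parameters": {"temperature": 0.5, "top_k": 10, "top_p": 0.95, "seed": None,
                   "priority": 5, "timeout": 10},
}
print(validate_body(example))
```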

Response example:

{
    "id": "42",
    "outputs": [
        {
            "name": "output0",
            "shape": [
                1,
                20
            ],
            "datatype": "UINT32",
            "data": [
                1,
                396,
                319,
                13996,
                29877,
                29901,
                29907,
                3333,
                20718,
                316,
                23924,
                562,
                2142,
                1702,
                425,
                14015,
                16060,
                316,
                383,
                19498
            ]
        }
    ]
}

Output Description

Return Value

Type

Description

id

String

Request ID.

outputs

List

Inference result list.

-

name

String

Output name. The default value is output0.

shape

List

The structure is [1, n], where 1 indicates a single row and n is the number of token IDs in the data field.

datatype

String

UINT32

data

List

List of token IDs generated by inference.
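A response with the fields described above could be read on the client side as sketched below. The function name extract_token_ids is hypothetical, and the checks simply mirror the documented [1, n] UINT32 layout. How the returned token IDs map back to text depends on the tokenizer of the model and is not covered here.

```python
import json


def extract_token_ids(response_text: str, output_name: str = "output0") -> list:
    """Return the token IDs of the named output after checking the documented layout."""
    response = json.loads(response_text)
    for output in response["outputs"]:
        if output["name"] != output_name:
            continue
        rows, n = output["shape"]
        # The description specifies shape [1, n], datatype UINT32, and n entries in data.
        if output["datatype"] != "UINT32" or rows != 1 or n != len(output["data"]):
            raise ValueError("output does not match the documented [1, n] UINT32 layout")
        return output["data"]
    raise KeyError(output_name)


# Shortened sample response, made up for this example.
sample = json.dumps({
    "id": "42",
    "outputs": [{"name": "output0", "shape": [1, 4], "datatype": "UINT32",
                 "data": [1, 396, 319, 13996]}],
})
print(extract_token_ids(sample))
```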