Token Inference API

Function

Runs inference on a sequence of input token IDs and returns the resulting token IDs.

This API is scheduled for deprecation. Using the OpenAI API instead is recommended.

Format

Operation type: POST

URL: https://{ip}:{port}/v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer

Replace {ip} and {port} with the IP address and port number of the service plane, that is, ipAddress and port.
The ${MODEL_NAME} field specifies the name of the model to be queried.
The optional [/versions/${MODEL_VERSION}] segment is not currently supported; omit it from the URL.
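
As a minimal sketch of the URL format above, the following Python helper assembles the endpoint. The function name build_infer_url and the sample address and port are illustrative assumptions, not part of the API.

```python
from urllib.parse import quote


def build_infer_url(ip: str, port: int, model_name: str) -> str:
    """Assemble the token inference URL (hypothetical helper).

    The optional /versions/${MODEL_VERSION} segment is left out because
    the service does not currently support it.
    """
    # Percent-encode the model name so that it is safe as a path segment.
    return f"https://{ip}:{port}/v2/models/{quote(model_name, safe='')}/infer"


# The address and port below are placeholders, not defaults of the service.
print(build_infer_url("127.0.0.1", 1025, "llama3-70b"))
```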

Request Parameters

Parameter		Mandatory/Optional	Description	Value Range
id		Optional	Inference request ID.	The value is a string of a maximum of 256 characters. Only underscores (_), hyphens (-), uppercase letters, lowercase letters, and digits are allowed.
inputs		Mandatory	An array with only one element.	The length is 1.
-	name	Mandatory	Input name, which is fixed to input0.	The value contains a maximum of 256 characters.
	shape	Mandatory	Parameter dimension. In one-dimensional mode, this parameter indicates the data length. In two-dimensional mode, it indicates one row and n columns. n indicates the data length.	The data length range is (0, min(1024 × 1024, maxInputTokenLen, maxSeqLen-1, max_position_embeddings)]. Obtain the max_position_embeddings from the weight file config.json, and other related parameters from the configuration file.
	datatype	Mandatory	Data type of the input. Currently, only UINT32 is supported; the data consists of token IDs.	UINT32
	data	Mandatory	Array of input token IDs.	The data length must match the length given in shape. Each token ID must be within the range of the model vocabulary.
outputs		Mandatory	Output structure of the inference result.	The length of outputs must be the same as that of inputs.
-	name	Mandatory	Output name of the inference result.	Character string.
parameters		Optional	Parameters related to model inference postprocessing.	-
	temperature	Optional	Controls the randomness of generation. Higher values produce more diverse outputs.	The value is of the float type and must be greater than 1e-6. The default value is 1.0. A value greater than or equal to 0.001 is recommended; with smaller values, the text quality may be poor. A maximum value of 2.0 is recommended, but the suitable value depends on the model.
	top_k	Optional	Controls the vocabulary range considered during generation. Only the k candidate words with the highest probability are kept.	The value is of the int32_t type. The value range is (0, 2147483647]. If the field is not set, the default value is determined by the backend model. atb (ATB Models): the configuration files are generation_config.json and config.json, and generation_config.json takes priority. If top_k is set neither in the request nor in the model weights, it is set to 1000 to balance performance and inference quality. ms (MindSpore): the configuration file is the file ending in .yaml. If top_k is set neither in the request nor in the model weights, it is set to 0. If the value is greater than or equal to vocabSize, vocabSize is used instead. vocabSize is taken from vocab_size or padded_vocab_size in the config.json file in the modelWeightPath directory. If neither field exists, vocabSize defaults to 0. You are advised to add vocab_size or padded_vocab_size to config.json; otherwise, inference may fail.
	top_p	Optional	Controls the vocabulary range considered during model generation and selects candidate words using the cumulative probability until it exceeds a given threshold. This parameter can also control the diversity of generated results.	The value is of the float type. The value range is (1e-6, 1.0]. The default value is 1.0.
	do_sample	Optional	Indicates whether to perform sampling.	The value is of the Boolean type. If this parameter is not passed, other postprocessing parameters determine whether sampling should be performed. true: Sampling is performed. false: Sampling is not performed.
	seed	Optional	Specifies the random seed of the inference process. Using the same seed makes the inference result reproducible; using different seeds increases the randomness of the result.	The value is of the uint64_t type. The value range is (0, 18446744073709551615]. If this parameter is not passed, the system generates a random seed. A seed close to the maximum value triggers a warning, which does not affect normal use. To avoid the warning, use a smaller seed.
	repetition_penalty	Optional	Uses repetition penalty to reduce the probability of duplicate fragments during text generation. It penalizes previously generated text, making the model more inclined to choose new, non-repeated content.	The value is of the float type. The default value is 1.0. The value must be greater than 0.0. A value smaller than 1.0 indicates that repetition is rewarded. The value 1.0 indicates that repetition penalty is not performed. A value greater than 1.0 indicates that repetition penalty is performed. It is recommended that the maximum value be set to 2.0. The value depends on the model.
	max_new_tokens	Optional	Specifies the maximum number of tokens that can be generated during inference. The number of generated tokens is also limited by the maxIterTimes parameter in the configuration file: it is at most min(maxIterTimes, max_new_tokens).	The value is of the int type. The value range is (0, 2147483647]. The default value is 20.
	watermark	Optional	Indicates whether to add a model watermark. Currently, postprocessing is not supported.	The value is of the Boolean type. The default value is false. true: The model watermark is added. false: The model watermark is not added.
	details	Optional	Indicates whether to return the detailed inference output. This configuration item applies to Triton text inference and has no effect on Triton token inference, so passing it is not recommended.	The value is of the Boolean type. The default value is false.
	batch_size	Optional	Batch size of an inference request. This configuration item applies to Triton text inference and has no effect on Triton token inference, so passing it is not recommended.	The value is of the int type. The value range is (0, 2147483647]. The default value is 1.
	priority	Optional	Sets the request priority.	The value is of the uint64_t type. The value range is [1, 5]. The default value is 5. A smaller value indicates a higher priority. The highest priority is 1.
	timeout	Optional	Sets the maximum waiting time for the request. If the request times out, it is disconnected.	The value is of the uint64_t type. The value range is (0, 3600], in seconds. The default value is 600.
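
The constraints in the table above can be checked on the client side before a request is sent. The following Python sketch is one possible way to do this, not the service's own validation: the helper names (max_input_len, build_request_body), the RANGES table, and the sample configuration values are assumptions made for illustration, and rejecting an empty id is also an assumption.

```python
import json
import re

# Allowed characters and maximum length of the request ID, per the id row.
# Rejecting an empty string is an assumption of this sketch.
ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,256}")

# (lower bound, lower bound inclusive?, upper bound) from the table above.
RANGES = {
    "temperature": (1e-6, False, float("inf")),
    "top_k": (0, False, 2147483647),
    "top_p": (1e-6, False, 1.0),
    "repetition_penalty": (0.0, False, float("inf")),
    "max_new_tokens": (0, False, 2147483647),
    "priority": (1, True, 5),
    "timeout": (0, False, 3600),
}


def max_input_len(max_input_token_len, max_seq_len, max_position_embeddings):
    """Upper bound on the input length, as given in the shape row."""
    return min(1024 * 1024, max_input_token_len, max_seq_len - 1,
               max_position_embeddings)


def build_request_body(token_ids, limit, request_id=None, **parameters):
    """Build a token inference request body (hypothetical helper)."""
    n = len(token_ids)
    if not 0 < n <= limit:
        raise ValueError(f"input length {n} is outside (0, {limit}]")
    if any(not 0 <= t <= 0xFFFFFFFF for t in token_ids):
        raise ValueError("token IDs must fit in UINT32")
    for name, value in parameters.items():
        if name not in RANGES or value is None:
            continue  # not range-checked in this sketch
        low, inclusive, high = RANGES[name]
        above_low = low <= value if inclusive else low < value
        if not (above_low and value <= high):
            raise ValueError(f"{name}={value} is out of range")
    body = {
        "inputs": [{
            "name": "input0",       # fixed input name
            "shape": [1, n],        # one row, n columns
            "datatype": "UINT32",   # the only supported data type
            "data": list(token_ids),
        }],
        "outputs": [{"name": "output0"}],
    }
    if request_id is not None:
        if not ID_PATTERN.fullmatch(request_id):
            raise ValueError("invalid request id")
        body["id"] = request_id
    if parameters:
        body["parameters"] = parameters
    return body


# Placeholder configuration values; read the real ones from the service
# configuration file and from config.json in the weight directory.
limit = max_input_len(max_input_token_len=2048, max_seq_len=2560,
                      max_position_embeddings=8192)
body = build_request_body([396, 319, 13996], limit, request_id="42",
                          temperature=0.5, top_k=10)
print(json.dumps(body, indent=4))
```

The vocabulary range of each token ID is not checked here because it depends on the model.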

Usage Example

Request example:

POST https://{ip}:{port}/v2/models/llama3-70b/infer

Request body:

{
    "id": "42",
    "inputs": [{
        "name": "input0",
        "shape": [
            1,
            10
        ],
        "datatype": "UINT32",
        "data": [
            396, 319, 13996, 29877, 29901, 29907, 3333, 20718, 316, 23924
        ]
    }],
    "outputs": [{
        "name": "output0"
    }],
    "parameters": {
        "temperature": 0.5,
        "top_k": 10,
        "top_p": 0.95,
        "do_sample": true,
        "seed": null,
        "repetition_penalty": 1.03,
        "max_new_tokens": 20,
        "watermark": true,
        "priority": 5,
        "timeout": 10
    }
}
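
A request such as the one above could be sent from Python with the standard library alone. This is a sketch under assumptions: the helper names make_request and send are hypothetical, the Content-Type header is assumed to be expected by the service, and the address and port are placeholders. TLS settings (for example, a custom CA or client certificate) depend on the deployment and would be supplied through the context argument.

```python
import json
import urllib.request


def make_request(url, body):
    """Wrap a request body in an HTTP POST request (hypothetical helper)."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        # Assumption: the service expects a JSON content type.
        headers={"Content-Type": "application/json"},
    )


def send(request, timeout=600.0, context=None):
    """Send the request and decode the JSON response.

    context is an optional ssl.SSLContext for the HTTPS connection.
    """
    with urllib.request.urlopen(request, timeout=timeout,
                                context=context) as resp:
        return json.loads(resp.read().decode("utf-8"))


body = {
    "id": "42",
    "inputs": [{
        "name": "input0",
        "shape": [1, 3],
        "datatype": "UINT32",
        "data": [396, 319, 13996],
    }],
    "outputs": [{"name": "output0"}],
    "parameters": {"temperature": 0.5, "max_new_tokens": 20},
}

# Placeholder address and port; send(req) would perform the actual call.
req = make_request("https://127.0.0.1:1025/v2/models/llama3-70b/infer", body)
print(req.get_method(), req.full_url)
```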

Response example:

{
    "id": "42",
    "outputs": [
        {
            "name": "output0",
            "shape": [
                1,
                20
            ],
            "datatype": "UINT32",
            "data": [
                1,
                396,
                319,
                13996,
                29877,
                29901,
                29907,
                3333,
                20718,
                316,
                23924,
                562,
                2142,
                1702,
                425,
                14015,
                16060,
                316,
                383,
                19498
            ]
        }
    ]
}

Output Description

Return Value		Type	Description
id		String	Request ID.
outputs		List	Inference result list.
-	name	String	The default value is output0.
	shape	List	The structure is [1, n]: one row of n elements, where n is the length of the token result in the data field.
	datatype	String	UINT32
	data	List	Token ID set generated after inference.
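
On the client side, the fields described above can be checked for consistency before the token IDs are used. The following Python sketch assumes the response has already been decoded into a dictionary; the function name extract_token_ids and the shortened sample response are illustrative assumptions.

```python
def extract_token_ids(response):
    """Return the token IDs from a decoded response (hypothetical helper).

    Checks that the datatype is UINT32 and that the shape is [1, n]
    with n equal to the length of the data field.
    """
    output = response["outputs"][0]
    if output["datatype"] != "UINT32":
        raise ValueError(f"unexpected datatype: {output['datatype']}")
    shape, data = output["shape"], output["data"]
    if len(shape) != 2 or shape[0] != 1 or shape[1] != len(data):
        raise ValueError(f"shape {shape} does not match {len(data)} tokens")
    return list(data)


# Shortened, illustrative response; a real one comes from the service.
sample = {
    "id": "42",
    "outputs": [{
        "name": "output0",
        "shape": [1, 3],
        "datatype": "UINT32",
        "data": [1, 396, 319],
    }],
}
print(extract_token_ids(sample))
```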
