Token Inference API
Function
Runs inference on a sequence of input token IDs and returns the resulting token IDs.
This API is scheduled for deprecation. The OpenAI API is recommended.
Format
Operation type: POST
URL: https://{ip}:{port}/v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer
- Replace {ip} and {port} with the IP address and port number of the service plane, that is, ipAddress and port.
- The ${MODEL_NAME} field specifies the name of the model to be queried.
- The [/versions/${MODEL_VERSION}] field is currently not supported. Do not include it in the URL.
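The URL rules above can be sketched as a small helper. This is only an illustration: the function name build_infer_url and the sample IP address and port are hypothetical, and the real values come from your service configuration.

```python
from urllib.parse import quote


def build_infer_url(ip: str, port: int, model_name: str) -> str:
    """Return the token inference URL for a model.

    The optional /versions/${MODEL_VERSION} segment is left out on purpose.
    """
    if not model_name:
        raise ValueError("model_name must not be empty")
    # Percent-encode the model name so it stays a single path segment.
    return f"https://{ip}:{port}/v2/models/{quote(model_name, safe='')}/infer"


# Hypothetical address and port, for illustration only.
print(build_infer_url("127.0.0.1", 1025, "llama3-70b"))
# prints https://127.0.0.1:1025/v2/models/llama3-70b/infer
```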
Request Parameters
| Parameter | Mandatory/Optional | Description | Value Range |
|---|---|---|---|
| id | Optional | Inference request ID. | A string of at most 256 characters. Only underscores (_), hyphens (-), uppercase letters, lowercase letters, and digits are allowed. |
| inputs | Mandatory | An array with only one element. | The length is 1. |
| name (in inputs) | Mandatory | Input name, which is fixed to input0. | At most 256 characters. |
| shape (in inputs) | Mandatory | Dimensions of the input. In one-dimensional mode, the value is the data length. In two-dimensional mode, the value is one row and n columns, where n is the data length. | The data length range is (0, min(1024 × 1024, maxInputTokenLen, maxSeqLen-1, max_position_embeddings)]. Obtain max_position_embeddings from the weight file config.json and the other parameters from the configuration file. |
| datatype (in inputs) | Mandatory | Data type. Currently, only UINT32 is supported, and token IDs are passed. | UINT32 |
| data (in inputs) | Mandatory | Array of input token IDs. | The data length must match the length given in shape. Each token ID must be within the range of the model vocabulary. |
| outputs | Mandatory | Output structure of the inference result. | The length of outputs must be the same as that of inputs. |
| name (in outputs) | Mandatory | Output name of the inference result. | Character string. |
| parameters | Optional | Parameters related to model inference postprocessing. | - |
| temperature (in parameters) | Optional | Controls the randomness of generation. Higher values produce more diverse outputs. | Float, greater than 1e-6. The default value is 1.0. A value of at least 0.001 is recommended; with smaller values, the text quality may be poor. A maximum of 2.0 is recommended. The suitable value depends on the model. |
| top_k (in parameters) | Optional | Controls the vocabulary range considered during generation. Only the k candidate tokens with the highest probability are considered. | Type int32_t, range (0, 2147483647]. If the field is not set, the default value is determined by the backend model. A value greater than or equal to vocabSize is treated as vocabSize. vocabSize is taken from vocab_size or padded_vocab_size in the config.json file in the modelWeightPath directory. If neither exists, vocabSize defaults to 0. Add vocab_size or padded_vocab_size to config.json; otherwise, inference may fail. |
| top_p (in parameters) | Optional | Controls the vocabulary range considered during generation by selecting candidate tokens in order of probability until their cumulative probability exceeds the given threshold. This parameter also controls the diversity of generated results. | Float, range (1e-6, 1.0]. The default value is 1.0. |
| do_sample (in parameters) | Optional | Indicates whether to perform sampling. | Boolean. If this parameter is not passed, the other postprocessing parameters determine whether sampling is performed. |
| seed (in parameters) | Optional | Random seed of the inference process. The same seed value ensures that the inference result is reproducible; different seed values increase the randomness of the result. | Type uint64_t, range (0, 18446744073709551615]. If this parameter is not passed, the system generates a random seed. A seed close to the maximum value triggers a warning, which does not affect normal use. To avoid the warning, use a smaller seed. |
| repetition_penalty (in parameters) | Optional | Reduces the probability of repeated fragments during text generation. It penalizes previously generated text, making the model more inclined to choose new, non-repeated content. | Float, greater than 0.0. The default value is 1.0. A maximum of 2.0 is recommended. The suitable value depends on the model. |
| max_new_tokens (in parameters) | Optional | Maximum number of tokens that can be generated during inference. The number of generated tokens is also limited by the maxIterTimes parameter in the configuration file, so it is at most min(maxIterTimes, max_new_tokens). | Type int, range (0, 2147483647]. The default value is 20. |
| watermark (in parameters) | Optional | Indicates whether to add a model watermark. Currently, postprocessing is not supported. | Boolean. The default value is false. |
| details (in parameters) | Optional | Indicates whether to return the detailed inference output. This configuration item applies to Triton text inference and has no effect on Triton token inference, so passing it is not recommended. | Boolean. The default value is false. |
| batch_size (in parameters) | Optional | Batch size of an inference request. This configuration item applies to Triton text inference and has no effect on Triton token inference, so passing it is not recommended. | Type int, range (0, 2147483647]. The default value is 1. |
| priority (in parameters) | Optional | Request priority. | Type uint64_t, range [1, 5]. The default value is 5. A smaller value indicates a higher priority; 1 is the highest. |
| timeout (in parameters) | Optional | Waiting time. If the request times out, it is disconnected. | Type uint64_t, range (0, 3600], in seconds. The default value is 600. |
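The constraints in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the service: the function name build_token_infer_request and the max_len argument are hypothetical, and max_len stands in for min(1024 × 1024, maxInputTokenLen, maxSeqLen-1, max_position_embeddings), which the caller has to look up. The vocabulary range of token IDs is not checked here because it depends on the model.

```python
import re

# Allowed characters for "id"; requiring at least one character is an assumption.
ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,256}")

# Documented ranges for numeric postprocessing parameters.
RANGE_CHECKS = {
    "temperature": lambda v: v > 1e-6,
    "top_k": lambda v: 0 < v <= 2147483647,
    "top_p": lambda v: 1e-6 < v <= 1.0,
    "repetition_penalty": lambda v: v > 0.0,
    "max_new_tokens": lambda v: 0 < v <= 2147483647,
    "priority": lambda v: 1 <= v <= 5,
    "timeout": lambda v: 0 < v <= 3600,
}


def build_token_infer_request(token_ids, request_id=None,
                              max_len=1024 * 1024, **parameters):
    """Assemble a request body with one input named input0 (two-dimensional shape)."""
    if not 0 < len(token_ids) <= max_len:
        raise ValueError("data length must be in (0, max_len]")
    if any(not isinstance(t, int) or not 0 <= t <= 0xFFFFFFFF for t in token_ids):
        raise ValueError("token IDs must fit in UINT32")
    body = {
        "inputs": [{
            "name": "input0",
            "shape": [1, len(token_ids)],  # one row, n columns
            "datatype": "UINT32",
            "data": list(token_ids),
        }],
        "outputs": [{"name": "output0"}],
    }
    if request_id is not None:
        if not ID_PATTERN.fullmatch(request_id):
            raise ValueError("id contains invalid characters or is too long")
        body["id"] = request_id
    for name, value in parameters.items():
        check = RANGE_CHECKS.get(name)
        if check is not None and not check(value):
            raise ValueError(f"{name}={value!r} is outside the documented range")
    if parameters:
        body["parameters"] = dict(parameters)
    return body


body = build_token_infer_request([396, 319, 13996], request_id="42",
                                 temperature=0.5, top_k=10, max_new_tokens=20)
```

Parameters without a numeric range in the table (do_sample, seed, watermark, and so on) are passed through unchanged in this sketch.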
Usage Example
Request example:
POST https://{ip}:{port}/v2/models/llama3-70b/infer
Request body:
{
    "id": "42",
    "inputs": [{
        "name": "input0",
        "shape": [
            1,
            10
        ],
        "datatype": "UINT32",
        "data": [
            396, 319, 13996, 29877, 29901, 29907, 3333, 20718, 316, 23924
        ]
    }],
    "outputs": [{
        "name": "output0"
    }],
    "parameters": {
        "temperature": 0.5,
        "top_k": 10,
        "top_p": 0.95,
        "do_sample": true,
        "seed": null,
        "repetition_penalty": 1.03,
        "max_new_tokens": 20,
        "watermark": true,
        "priority": 5,
        "timeout": 10
    }
}
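A request like the one above could be sent from Python with the standard library alone. This is a sketch under assumptions: the helper names make_infer_request and post_infer are hypothetical, the Content-Type header is assumed rather than documented here, and the optional context argument is where an ssl.SSLContext for your certificates would go if the service requires one.

```python
import json
import urllib.request


def make_infer_request(url: str, body: dict) -> urllib.request.Request:
    """Wrap a request body in a POST request with a JSON payload."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/json"},  # assumed header
    )


def post_infer(url: str, body: dict, timeout_s: float = 600.0, context=None) -> dict:
    """Send the request and decode the JSON response.

    timeout_s is the client-side socket timeout; it is separate from the
    "timeout" field inside the request parameters.
    """
    req = make_infer_request(url, body)
    with urllib.request.urlopen(req, timeout=timeout_s, context=context) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling post_infer with the URL from the request example and the request body as a Python dict would return the parsed response, provided the service is reachable.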
Response example:
{
    "id": "42",
    "outputs": [
        {
            "name": "output0",
            "shape": [
                1,
                20
            ],
            "datatype": "UINT32",
            "data": [
                1, 396, 319, 13996, 29877, 29901, 29907, 3333, 20718, 316,
                23924, 562, 2142, 1702, 425, 14015, 16060, 316, 383, 19498
            ]
        }
    ]
}
Output Description
| Return Value | Type | Description |
|---|---|---|
| id | String | Request ID. |
| outputs | List | Inference result list. |
| name (in outputs) | String | The default value is output0. |
| shape (in outputs) | List | The structure is [1, n]. The value 1 indicates one row, and n is the number of token IDs in the data field. |
| datatype (in outputs) | String | UINT32 |
| data (in outputs) | List | Token IDs generated by inference. |
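The fields described above can be read from a parsed response as follows. This is a minimal sketch: extract_token_ids is a hypothetical helper name, and the sample response is shortened for illustration rather than taken from a real service.

```python
def extract_token_ids(response: dict, output_name: str = "output0") -> list:
    """Return the token IDs of the named output after basic consistency checks."""
    for output in response.get("outputs", []):
        if output.get("name") != output_name:
            continue
        if output.get("datatype") != "UINT32":
            raise ValueError("unexpected datatype: %r" % output.get("datatype"))
        shape, data = output["shape"], output["data"]
        # shape is documented as [1, n], where n is the length of data.
        if len(shape) != 2 or shape[0] != 1 or shape[1] != len(data):
            raise ValueError("shape %r does not match data length %d" % (shape, len(data)))
        return list(data)
    raise KeyError(output_name)


# Shortened, made-up response for illustration.
sample = {
    "id": "42",
    "outputs": [{
        "name": "output0",
        "shape": [1, 3],
        "datatype": "UINT32",
        "data": [1, 396, 319],
    }],
}
print(extract_token_ids(sample))  # prints [1, 396, 319]
```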