Token Inference Interface
Function
Performs text or streaming inference on token ID input.
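Interface format
Operation type: POST
URL: https://{ip}:{port}/infer_token
For {ip} and {port}, use the service-plane IP address and port number, that is, the values of "ipAddress" and "port".
Request parameters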
Usage example
Request sample:
POST https://{ip}:{port}/infer_token
Request body:
{
    "input_id": [5618, 19678, 701, 9072, 13],
    "stream": false,
    "parameters": {
        "temperature": 0.5,
        "top_k": 10,
        "top_p": 0.95,
        "max_new_tokens": 20,
        "do_sample": true,
        "seed": null,
        "repetition_penalty": 1.03,
        "details": true,
        "typical_p": 0.5,
        "watermark": false,
        "priority": 5,
        "timeout": 10
    }
}
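The request above can be assembled and sent from Python using only the standard library. This is a minimal sketch, not an official client: the helper names `build_payload`, `build_request`, and `send`, the placeholder address `127.0.0.1:1025`, the `Content-Type: application/json` header, and the `ca.pem` file name are all assumptions. Substitute the `ipAddress` and `port` of your own deployment and whatever TLS settings it requires.

```python
import json
import ssl
import urllib.request


def build_payload(input_ids, stream=False, **parameters):
    """Return the request body as a dict, mirroring the sample above."""
    return {"input_id": list(input_ids), "stream": stream, "parameters": parameters}


def build_request(ip, port, payload):
    """Wrap the payload in a POST request to /infer_token."""
    url = f"https://{ip}:{port}/infer_token"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


def send(request, cafile=None, timeout=30):
    """Send the request and decode the JSON reply (for stream=false only)."""
    context = ssl.create_default_context(cafile=cafile)
    with urllib.request.urlopen(request, context=context, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


payload = build_payload(
    [5618, 19678, 701, 9072, 13],
    temperature=0.5,
    top_k=10,
    top_p=0.95,
    max_new_tokens=20,
    do_sample=True,
    seed=None,
    repetition_penalty=1.03,
    details=True,
    typical_p=0.5,
    watermark=False,
    priority=5,
    timeout=10,
)
request = build_request("127.0.0.1", 1025, payload)  # placeholder address
# result = send(request, cafile="ca.pem")  # uncomment against a live server
```

Keeping payload construction separate from sending makes it possible to inspect or log the exact body before any network call is made.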
Response sample:
- Text inference ("stream"=false):
{
    "generated_text": "am a French native speaker. I am looking for a job in the hospitality industry. I",
    "details": {
        "finish_reason": "length",
        "generated_tokens": 20,
        "seed": 846930886
    }
}
- Streaming inference ("stream"=true, returned in SSE format):
data: {"prefill_time":45.54,"decode_time":null,"token":{"id":[626],"text":"am"}}

data: {"prefill_time":null,"decode_time":128.32,"token":{"id":[263],"text":" a"}}

data: {"prefill_time":null,"decode_time":18.17,"token":{"id":[5176],"text":" French"}}

data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[17739],"text":" photograph"}}

data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[261],"text":"er"}}

data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[2729],"text":" based"}}

data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[297],"text":" in"}}

data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[3681],"text":" Paris"}}

data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[29889],"text":"."}}

data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[13],"text":"\n"}}

data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[29902],"text":"I"}}

data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[505],"text":" have"}}

data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[1063],"text":" been"}}

data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[27904],"text":" shooting"}}

data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[1951],"text":" since"}}

data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[306],"text":" I"}}

data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[471],"text":" was"}}

data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[29871],"text":" "}}

data: {"prefill_time":null,"decode_time":16.80,"token":{"id":[29896],"text":"1"}}

data: {"prefill_time":null,"decode_time":16.80,"generated_text":"am a French photographer based in Paris.\nI have been shooting since I was 15","details":{"finish_reason":"length","generated_tokens":20,"seed":846930886},"token":{"id":[29945],"text":null}}
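A client consuming the streaming response needs to strip the `data:` prefix from each SSE line, decode the JSON payload, and watch for the final event that carries `generated_text`. The sketch below shows one way to do that, assuming the lines have already been read and decoded to `str`; the helper names `iter_sse_events` and `collect_stream` are hypothetical, and the input is an abbreviated copy of the sample (events 1, 2, and 20 only).

```python
import json


def iter_sse_events(lines):
    """Yield the decoded JSON payload of each `data:` line, skipping blanks."""
    for raw in lines:
        line = raw.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())


def collect_stream(lines):
    """Accumulate token text; return (pieces, final_text, details)."""
    pieces, final_text, details = [], None, None
    for event in iter_sse_events(lines):
        text = event["token"]["text"]
        if text is not None:  # the last event in the sample carries text=null
            pieces.append(text)
        if "generated_text" in event:  # present only on the final event
            final_text = event["generated_text"]
            details = event.get("details")
    return pieces, final_text, details


# Raw strings keep the JSON escape \n intact for json.loads.
sample_lines = [
    r'data: {"prefill_time":45.54,"decode_time":null,"token":{"id":[626],"text":"am"}}',
    "",
    r'data: {"prefill_time":null,"decode_time":128.32,"token":{"id":[263],"text":" a"}}',
    "",
    r'data: {"prefill_time":null,"decode_time":16.80,'
    r'"generated_text":"am a French photographer based in Paris.\nI have been shooting since I was 15",'
    r'"details":{"finish_reason":"length","generated_tokens":20,"seed":846930886},'
    r'"token":{"id":[29945],"text":null}}',
]

pieces, final_text, details = collect_stream(sample_lines)
print("".join(pieces))
print(details)
```

Against a live server, the same functions could be fed the decoded lines of the HTTP response body as they arrive, so that tokens are displayed incrementally rather than after the stream ends.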
Output description
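Text inference ("stream"=false) response fields:

| Return value | | Type | Description |
|---|---|---|---|
| generated_text | | string | Inference result text. |
| details | | object | Inference details. The following fields are currently defined; more may be added. |
| - | finish_reason | string | Reason inference ended. |
| - | generated_tokens | int | Number of tokens in the inference result. In PD (prefill/decode) scenarios, this is the total number of tokens produced by the P and D nodes. When a request's inference length limit takes the value of maxIterTimes, generated_tokens in the D node response is maxIterTimes+1, because it includes the first token produced by the P node. |
| - | seed | int | If the request specified a sampling seed, that seed value is returned. |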
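The fields described above can be read from a text-inference response as in the following sketch, which parses the sample response shown earlier. The `summarize` helper is hypothetical. Treating `finish_reason == "length"` as "stopped at `max_new_tokens`" is an inference from that sample (20 tokens requested, 20 generated), not a documented definition, and the fallback for a missing `details` object is likewise an assumption.

```python
import json

SAMPLE_RESPONSE = """
{
  "generated_text": "am a French native speaker. I am looking for a job in the hospitality industry. I",
  "details": {"finish_reason": "length", "generated_tokens": 20, "seed": 846930886}
}
"""


def summarize(response_text, max_new_tokens):
    """Extract the text and flag whether generation appears to have hit the limit."""
    response = json.loads(response_text)
    # Assumption: tolerate a missing "details" object rather than fail.
    details = response.get("details") or {}
    return {
        "text": response["generated_text"],
        "generated_tokens": details.get("generated_tokens"),
        "seed": details.get("seed"),
        # Assumption: "length" means the token limit was reached.
        "hit_limit": details.get("finish_reason") == "length"
        and details.get("generated_tokens") == max_new_tokens,
    }


summary = summarize(SAMPLE_RESPONSE, max_new_tokens=20)
print(summary["text"])
print(summary["hit_limit"])
```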
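Streaming inference ("stream"=true) response fields:

| Return value | | | Type | Description |
|---|---|---|---|---|
| data | | | object | Result returned by one inference step. |
| - | prefill_time | | float | First-token latency in streaming inference, in ms. |
| - | decode_time | | float | Latency of each token after the first in streaming inference, in ms. |
| - | generated_text | | string | Inference result text. Returned only in the final result. |
| - | details | | object | Inference details. Returned only in the final result; more fields may be added. |
| - | - | finish_reason | string | Reason inference ended. Returned only in the final result. |
| - | - | generated_tokens | int | Number of tokens in the inference result. In PD (prefill/decode) scenarios, this is the total number of tokens produced by the P and D nodes. When a request's inference length limit takes the value of maxIterTimes, generated_tokens in the D node response is maxIterTimes+1, because it includes the first token produced by the P node. |
| - | - | seed | int | If the request specified a sampling seed, that seed value is returned. |
| - | token | | List[token] | Token produced by each inference step. |
| - | - | id | list | List of generated token IDs. |
| - | - | text | string | Text corresponding to the token. |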
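Because `prefill_time` is populated only on the first event and `decode_time` only on later ones, the two fields can be combined into simple latency statistics. The following sketch is illustrative only: `latency_summary` is a hypothetical helper, the input is the first three events of the streaming sample after JSON decoding, and summing the per-token values as `total_ms` is an assumption about how the timings compose rather than a documented guarantee.

```python
from statistics import mean


def latency_summary(events):
    """Summarize prefill/decode latencies (ms) from decoded stream events."""
    prefill = [e["prefill_time"] for e in events if e.get("prefill_time") is not None]
    decode = [e["decode_time"] for e in events if e.get("decode_time") is not None]
    return {
        "first_token_ms": prefill[0] if prefill else None,
        "mean_decode_ms": mean(decode) if decode else None,
        # Assumption: per-token latencies add up to the generation time.
        "total_ms": sum(prefill) + sum(decode),
    }


# First three events of the streaming sample, already decoded from JSON.
events = [
    {"prefill_time": 45.54, "decode_time": None, "token": {"id": [626], "text": "am"}},
    {"prefill_time": None, "decode_time": 128.32, "token": {"id": [263], "text": " a"}},
    {"prefill_time": None, "decode_time": 18.17, "token": {"id": [5176], "text": " French"}},
]
stats = latency_summary(events)
print(stats)
```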