Inference API

Function

Provides text inference and streaming inference.

Format

Method: POST

URL: https://{ip}:{port}/v1/chat/completions

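Request parameters

role and content are fields of each element of messages.

| Parameter | Required | Description | Value requirements |
|---|---|---|---|
| model | Required | Model name. | Letters, digits, dots, hyphens, and underscores only; must not start or end with a dot, hyphen, or underscore; length at most 256 characters. |
| messages | Required | Inference request message structure. | list. The content of messages must contain more than 0 KB and at most 512 KB of characters; Chinese and English are supported. After tokenization, the token count must not exceed the smaller of (maxSeqLen - maxIterTimes) and max_position_embeddings (both values are read from the configuration file). |
| role | Required | Role of the request message. | One of: system (system role), user (user role), assistant (assistant role), tool (tool role). |
| content | Required | Text of the request message. | Non-empty. |
| max_tokens | Optional | Maximum number of tokens that inference may generate. Limited by the maxIterTimes parameter in the configuration file: the number of generated tokens is <= maxIterTimes. | uint_64, range (0, maxIterTimes], default maxIterTimes. |
| presence_penalty | Optional | Penalizes new tokens according to whether they have already appeared in the text so far. Positive values penalize tokens that have been used, making the model more likely to move on to new topics. | float, range [-2.0, 2.0], default 0.0. |
| frequency_penalty | Optional | Penalizes new tokens according to how often they already appear in the text. Positive values penalize frequently used tokens, making the model less likely to repeat the same line. | float, range [-2.0, 2.0], default 0.0. |
| seed | Optional | Random seed for inference. The same seed makes results reproducible; different seeds increase the variation between results. | uint_64, range (0, 18446744073709551615]. If the parameter is omitted, the system generates a random seed. |
| temperature | Optional | Controls the randomness of generation; higher values produce more diverse output. | float, range (0.0, 2.0], default 1.0. Values of 0.001 or higher are recommended; values below 0.001 may degrade text quality. |
| top_p | Optional | Limits the candidate tokens considered during generation: candidates are selected by cumulative probability until the cumulative probability exceeds the given threshold. It also affects the diversity of the output. | float, range (0.0, 1.0], default 1.0. |
| stream | Optional | Selects whether the result is returned as text inference (false) or streaming inference (true). | bool, default false. |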

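  • The transformers version in the runtime environment must be 4.34.0 or later; tokenizers in earlier versions do not support the "chat_template" method.
  • The tokenizer_config.json file under the model weight path must contain the "chat_template" field and its implementation.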
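
The parameter constraints above can be checked on the client before a request is sent. The Python sketch below is one possible way to do that and is not part of the service. The names validate_request and max_iter_times (standing in for the configured maxIterTimes) are hypothetical, and treating a null value as "parameter omitted" is an assumption. The size and token-count limits on messages are not checked here because they depend on the tokenizer and the configuration file.

```python
import re

# Letters, digits, dot, hyphen, underscore; first and last character must be
# a letter or digit (so the name cannot start or end with ".", "-", or "_").
_MODEL_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?")
_ROLES = {"system", "user", "assistant", "tool"}

# name -> (low, high, low_is_inclusive), copied from the parameter constraints
_FLOAT_RANGES = {
    "presence_penalty": (-2.0, 2.0, True),
    "frequency_penalty": (-2.0, 2.0, True),
    "temperature": (0.0, 2.0, False),
    "top_p": (0.0, 1.0, False),
}


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_request(body, max_iter_times):
    """Return the names of parameters that violate the documented constraints."""
    problems = []

    model = body.get("model")
    if not (isinstance(model, str) and len(model) <= 256 and _MODEL_RE.fullmatch(model)):
        problems.append("model")

    messages = body.get("messages")
    if not (isinstance(messages, list) and messages):
        problems.append("messages")
    else:
        for msg in messages:
            ok = isinstance(msg, dict) and msg.get("role") in _ROLES and msg.get("content")
            if not ok:
                problems.append("messages")
                break

    for name, (low, high, low_inclusive) in _FLOAT_RANGES.items():
        value = body.get(name)
        if value is None:  # assumption: null means "not provided"
            continue
        if not _is_number(value):
            problems.append(name)
            continue
        lower_ok = value >= low if low_inclusive else value > low
        if not (lower_ok and value <= high):
            problems.append(name)

    # Integer parameters with range (0, high]
    for name, high in (("max_tokens", max_iter_times), ("seed", 2**64 - 1)):
        value = body.get(name)
        if value is None:
            continue
        if isinstance(value, bool) or not isinstance(value, int) or not 0 < value <= high:
            problems.append(name)

    stream = body.get("stream")
    if stream is not None and not isinstance(stream, bool):
        problems.append("stream")

    return problems
```

An empty list means no violation was found; the service remains the authority on what it accepts.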

Examples

Request example:

POST https://{ip}:{port}/v1/chat/completions
Request body:
  • Single-turn conversation
    {
        "model": "gpt-3.5-turbo",
        "messages": [{
            "role": "user",
            "content": "You are a helpful assistant."
        }],
        "max_tokens": 20,
        "presence_penalty": 1.03,
        "frequency_penalty": 1.0,
        "seed": null,
        "temperature": 0.5,
        "top_p": 0.95,
        "stream": false
    }
  • Multi-turn conversation
    {
        "model": "gpt-3.5-turbo",
        "messages": [{
            "role": "system",
            "content": "You are a student who is good at math."
            },
            {
            "role": "user",
            "content": "what is your hobby?"
            }
        ],
        "max_tokens": 20,
        "presence_penalty": 1.03,
        "frequency_penalty": 1.0,
        "seed": null,
        "temperature": 0.5,
        "top_p": 0.95,
        "stream": false
    }
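
As an illustration of how the request bodies above might be sent, the following Python sketch uses only the standard library. The helper names, the address 127.0.0.1:1025, and the Content-Type header are assumptions made for this example; TLS and authentication settings depend on the deployment and are not covered.

```python
import json
import urllib.request


def build_chat_request(ip, port, body):
    """Build (but do not send) a POST request for the chat completions URL."""
    url = f"https://{ip}:{port}/v1/chat/completions"
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        # Assumption: the service expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and decode a non-streaming JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Single-turn body taken from the request example above.
body = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "You are a helpful assistant."}],
    "max_tokens": 20,
    "temperature": 0.5,
    "top_p": 0.95,
    "stream": False,
}
request = build_chat_request("127.0.0.1", 1025, body)  # hypothetical address
# result = send_chat_request(request)  # run only against a reachable service
```

Building the request separately from sending it keeps the payload easy to inspect or log before any network call is made.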
Response example:

Single-turn and multi-turn conversations return responses in the same format.

  • Text inference ("stream"=false)
    {
        "id": "chatcmpl-123",
        "object": "chat.completion",
        "created": 1677652288,
        "model": "gpt-3.5-turbo-0613",
        "choices": [
            {
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": "\n\nHello there, how may I assist you today?"
                },
                "finish_reason": "eos_token"
            }
        ],
        "usage": {
            "prompt_tokens": 9,
            "completion_tokens": 12,
            "total_tokens": 21
        }
    }
  • Streaming inference ("stream"=true, returned in SSE format)
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":"am"},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":" a"},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":" French"},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":"man"},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":" living"},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":" in"},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":" the"},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":" UK"},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":"."},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":" I"},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":" am"},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":" a"},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":" keen"},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":" photograph"},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":"er"},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":" and"},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":" I"},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":" have"},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":" been"},"finish_reason":null}]}
    
    data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{},"finish_reason":"length"}]}
    
    data: [DONE]
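
A client usually needs to reassemble the streamed fragments into one string. The Python sketch below shows one way to do this for lines shaped like the sample above; the function name collect_stream is hypothetical, and the parsing handles only the "data:" lines shown here rather than the full SSE format.

```python
import json


def collect_stream(lines):
    """Join delta.content across chunks and return (text, finish_reason)."""
    parts = []
    finish_reason = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # The last chunk carries an empty delta, so content may be absent.
            parts.append(choice.get("delta", {}).get("content") or "")
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason


# Three events abbreviated from the sample above, followed by the end marker.
sample = [
    'data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":"am"},"finish_reason":null}]}',
    "",
    'data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{"role":"assistant","content":" a"},"finish_reason":null}]}',
    "",
    'data: {"id":"1","object":"chat.completion.chunk","created":1707275960,"model":"baichuan","choices":[{"index":0,"delta":{},"finish_reason":"length"}]}',
    "",
    "data: [DONE]",
]
text, reason = collect_stream(sample)
print(text, reason)  # prints: am a length
```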
    

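Output description

Table 1 Text inference result

index, message, and finish_reason are fields of each element of choices; role and content are fields of message; prompt_tokens, completion_tokens, and total_tokens are fields of usage.

| Parameter | Type | Description |
|---|---|---|
| id | string | Request ID. |
| object | string | Result type; currently always "chat.completion". |
| created | integer | Timestamp of the inference request, in seconds. |
| model | string | Inference model used. |
| choices | list | List of inference results. |
| index | integer | Index of the choices entry; currently always 0. |
| message | object | Inference message. |
| role | string | Role; currently always "assistant". |
| content | string | Inference text result. |
| finish_reason | string | Reason the request ended; the possible values are listed after this table. |
| usage | object | Statistics for the inference result. |
| prompt_tokens | integer | Number of tokens in the prompt text entered by the user. |
| completion_tokens | integer | Number of tokens generated by inference. |
| total_tokens | integer | Total number of request and inference tokens. |

finish_reason values:

  • eos_token: the request ended normally.
  • stop_sequence:
    • The request was actively cancelled (CANCEL) or stopped (STOP); the user is not notified and the response is discarded.
    • An error occurred while the request was executing; the response output is empty and err_msg is non-empty.
    • The request input failed validation; the response output is empty and err_msg is non-empty.
  • length:
    • The request ended because it reached the maximum sequence length; the response is the output of the last iteration.
    • The request ended because it reached the maximum output length (at request or model granularity); the response is the output of the last iteration.
  • invalid flag: invalid marker.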
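
To show how the fields of a text inference result fit together, the Python sketch below reads the documented fields from a response that has already been parsed into a dict. The function name summarize_completion and the keys of its return value are hypothetical; the input mirrors the text inference response example.

```python
def summarize_completion(response):
    """Extract the generated text and basic status from a text inference result."""
    choice = response["choices"][0]  # index is currently always 0
    usage = response["usage"]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        # "length" means a sequence or output length limit ended the request.
        "truncated": choice["finish_reason"] == "length",
        # total_tokens is documented as request tokens plus inference tokens.
        "usage_consistent": usage["prompt_tokens"] + usage["completion_tokens"]
        == usage["total_tokens"],
    }


# Values copied from the text inference response example.
response = {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1677652288,
    "model": "gpt-3.5-turbo-0613",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "\n\nHello there, how may I assist you today?",
            },
            "finish_reason": "eos_token",
        }
    ],
    "usage": {"prompt_tokens": 9, "completion_tokens": 12, "total_tokens": 21},
}
summary = summarize_completion(response)
print(summary["finish_reason"], summary["truncated"], summary["usage_consistent"])
# prints: eos_token False True
```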

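Table 2 Streaming inference result

Each data event carries id, object, created, model, and choices. finish_reason, index, and delta are fields of each element of choices; role and content are fields of delta.

| Parameter | Type | Description |
|---|---|---|
| data | object | Result returned by one inference iteration. |
| id | string | Request ID. |
| object | string | Currently always "chat.completion.chunk". |
| created | integer | Timestamp of the inference request, in seconds. |
| model | string | Inference model used. |
| choices | list | Streaming inference result. |
| finish_reason | string | Reason the request ended; set only in the last inference result. The possible values are listed after this table. |
| index | integer | Index of the choices entry; currently always 0. |
| delta | object | Inference result of this iteration; empty in the last response. |
| role | string | Role; currently always "assistant". |
| content | string | Inference text result. |

finish_reason values:

  • eos_token: the request ended normally.
  • stop_sequence:
    • The request was actively cancelled (CANCEL) or stopped (STOP); the user is not notified and the response is discarded.
    • An error occurred while the request was executing; the response output is empty and err_msg is non-empty.
    • The request input failed validation; the response output is empty and err_msg is non-empty.
  • length:
    • The request ended because it reached the maximum sequence length; the response is the output of the last iteration.
    • The request ended because it reached the maximum output length (at request or model granularity); the response is the output of the last iteration.
  • invalid flag: invalid marker.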