Introduction to Servitization Interfaces
Scenario Description
Server provides EndPoint to encapsulate inference serving protocols and APIs. It is compatible with third-party framework APIs such as Triton, OpenAI, TGI, and vLLM. After Server is installed in single-server mode, you can use a client (such as the Linux curl command or Postman) to send HTTP/HTTPS requests that call the APIs provided by EndPoint.
HTTPS is recommended, as it is more secure than HTTP.
Description of EndPoint RESTful APIs
The IP address and port number of an HTTP/HTTPS request URL are configured in the config.json file. For details, see Parameters in ServerConfig.
- URL format of a generate request sent by Linux curl:
  - Operation type: POST
  - URL: http[s]://{ip}:{port}/generate
- Inference request sent with HTTPS disabled:
curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{ "inputs": "My name is Olivier and I", "parameters": { "details": true, "do_sample": true, "repetition_penalty": 1.1, "return_full_text": false, "seed": null, "temperature": 1, "top_p": 0.99 } }' http://{ip}:{port}/generate
- Request sent with HTTPS bidirectional authentication enabled:
curl --location --request POST 'https://{ip}:{port}/generate' \
  --header 'Content-Type: application/json' \
  --cacert /home/runs/static_conf/ca/ca.pem \
  --cert /home/runs/static_conf/cert/client.pem \
  --key /home/runs/static_conf/cert/client.key.pem \
  --data-raw '{ "inputs": "My name is Olivier and I", "parameters": { "best_of": 1, "decoder_input_details": false, "details": false, "do_sample": true, "max_new_tokens": 20, "repetition_penalty": 2, "return_full_text": false, "seed": 12, "temperature": 0.1, "top_k": 1, "top_p": 0.9, "truncate": 1024 } }'
- --cacert: path of the signature verification certificate file.
- ca.pem: signature verification certificate or root certificate of Server.
- --cert: path of the client certificate file.
- client.pem: client certificate.
- --key: path of the client private key file.
- client.key.pem: private key of the client certificate. (The private key is not encrypted. You are advised to use an encrypted key.)
Change the parameters as needed.
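The two curl requests above can also be issued from a script. The following is a minimal Python sketch using only the standard library, not an official client. The function names are illustrative, the address `127.0.0.1:1025` is a placeholder for the ip and port configured in config.json, and the assumption that the response body is JSON is not confirmed by this section.

```python
import json
import ssl
import urllib.request


def build_generate_request(base_url, prompt, parameters=None):
    """Build (but do not send) a POST request for the /generate endpoint."""
    body = {"inputs": prompt}
    if parameters:
        # Same "parameters" object as in the curl examples.
        body["parameters"] = parameters
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def make_mtls_context(ca_path, cert_path, key_path):
    """Create a TLS context that mirrors curl's --cacert, --cert, and --key options."""
    context = ssl.create_default_context(cafile=ca_path)
    context.load_cert_chain(certfile=cert_path, keyfile=key_path)
    return context


def send(request, context=None, timeout=30.0):
    """Send the request and decode the body (assumed here to be JSON)."""
    with urllib.request.urlopen(request, context=context, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Placeholder address: replace with the ip and port from config.json.
example_request = build_generate_request(
    "https://127.0.0.1:1025",
    "My name is Olivier and I",
    {"do_sample": True, "max_new_tokens": 20, "temperature": 0.1},
)
```

To send the request with bidirectional authentication, pass the result of `make_mtls_context(...)` with the three certificate paths as the `context` argument of `send`. For plain HTTP, use an `http://` base URL and leave `context` unset.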

| API | Type | URL | Description | Framework |
|---|---|---|---|---|
| Server Live | GET | /v2/health/live | Checks whether the server is online. | Triton |
| Server Ready | GET | /v2/health/ready | Checks whether the server is ready. | Triton |
| Model Ready | GET | /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready | Checks whether the model is ready. | Triton |
| Health | GET | /health | Performs service health checks. |  |
| TGI EndPoint information query | GET | /info | Queries the TGI EndPoint information. | TGI |
| Slot statistics | GET | /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/getSlotCount | Queries customized slot statistics based on the Triton format. | Native |
| Health probe | GET | /health/timed[-${TIMEOUT}] | Checks whether an inference process is normal. | Native |
| Graceful exit | GET | /stopService | Implements graceful exit of the entire service. When this API is called, the system stops the service only after all running and waiting requests have completed. While it waits, all inference APIs are unavailable. | Native |
| Collecting static configurations | GET | /v1/config | Collects static configurations. | Native |
| Collecting dynamic status | GET | /v1/status | Collects dynamic status. | Native |
| Specifying an instance role | POST | /v1/role/${role} | Specifies an instance role. | Native |
| Collecting dynamic status | GET | /v2/status | Collects dynamic status. | Native |
| Specifying an instance role | POST | /v2/role/${role} | Specifies an instance role. | Native |
| Service metric API (JSON format) | GET | /metrics-json | Obtains, for a running inference service, the dynamic average values of Time To First Token (TTFT) and Time Between Tokens (TBT) over the most recent 1,000 requests (by default), the number of requests being executed, the number of waiting requests, and the number of remaining NPU blocks. | Native |
| Querying service management and control metrics (Prometheus format) | GET | /metrics | Queries management and control metrics of inference servitization. | Native |
| Dynamically loading LoRA | POST | /v1/load_lora_adapter | Dynamically loads LoRA. | OpenAI |
| Dynamically unloading LoRA | POST | /v1/unload_lora_adapter | Dynamically unloads LoRA. | OpenAI |
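
A common use of the health APIs in the table above is to wait for the server before sending inference requests. The sketch below polls /v2/health/ready with the Python standard library. It assumes that an HTTP 200 status means "ready", following the usual Triton convention, which this section does not state explicitly. The function names, retry counts, and delays are illustrative choices.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET request, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 4xx or 5xx.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar errors.
        return None


def wait_until_ready(base_url, probe=http_status, attempts=30, delay=2.0, sleep=time.sleep):
    """Poll /v2/health/ready until it answers 200 (assumed to mean ready) or attempts run out."""
    url = base_url.rstrip("/") + "/v2/health/ready"
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

The `probe` and `sleep` parameters exist only so that the loop can be exercised without a running server. In normal use, call `wait_until_ready("http://{ip}:{port}")` with the address from config.json. For HTTPS with a private CA, `http_status` would need an SSL context, which this sketch leaves out.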

| API | Type | URL | Description | Framework |
|---|---|---|---|---|
| Model list | GET | /v1/models | Lists available models. | OpenAI |
| Model details | GET | /v1/models/{model} | Queries model information. | OpenAI |
| Service metadata query | GET | /v2 | Obtains service metadata. | Triton |
| Model metadata query | GET | /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}] | Queries model metadata. | Triton |
| Model configuration query | GET | /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/config | Queries model configurations. | Triton |
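
Several Triton-style URLs in this section share the prefix /v2/models/${MODEL_NAME}. The helper below is a small illustrative sketch for building them. It leaves out the optional versions segment, which this document describes as not currently supported. The helper name and the choice to percent-encode the model name are assumptions, not documented server behavior.

```python
from urllib.parse import quote


def model_url(base_url, model_name, action=""):
    """Return base_url + /v2/models/<model_name>[/<action>], with no versions segment."""
    if not model_name:
        raise ValueError("model_name must not be empty")
    # Percent-encode the name so that spaces or slashes cannot alter the path.
    path = "/v2/models/" + quote(model_name, safe="")
    if action:
        path += "/" + action.strip("/")
    return base_url.rstrip("/") + path
```

For a hypothetical model named `my_model`, `model_url(base, "my_model")` gives the model metadata URL, `model_url(base, "my_model", "config")` the configuration URL, and `model_url(base, "my_model", "ready")` the Model Ready URL.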

| API | Type | URL | Description | Framework |
|---|---|---|---|---|
| Inference job | POST | / | TGI inference API. When stream is false, returns the text inference result; when stream is true, returns the streaming inference result. | TGI |
|  | POST | /generate | Inference API of TGI and vLLM. The request parameters identify the service type. |  |
|  | POST | /generate_stream | TGI streaming inference API, which returns results in "Server-Sent Events" format. | TGI |
|  | POST | /v1/chat/completions | OpenAI text/streaming inference API. | OpenAI |
|  | POST | /v1/completions | vLLM-compatible OpenAI text/streaming inference API. | OpenAI |
|  | POST | /infer | Native inference API, which can return results in text or streaming mode. | Native |
|  | POST | /infer_token | Native inference API, which implements text or streaming inference based on input tokens. | Native |
|  | POST | /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer | Token inference API of Triton. | Triton |
|  | POST | /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/stopInfer | Terminates a request before it completes. Modeled on the Triton API definition. | Native |
|  | POST | /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/generate | Triton text inference API. | Triton |
|  | POST | /v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/generate_stream | Triton streaming inference API. | Triton |
|  | POST | /v1/tokenizer | Calculates the number of tokens. | Native |
|  | GET | /dresult | Maintains a persistent connection between the coordinator and the decode instance. Each time the decode instance generates an inference result, it returns the result to the coordinator over this connection. | Prefill-decode disaggregation |
- The ${MODEL_NAME} field specifies the name of the model to be queried.
- The [/versions/${MODEL_VERSION}] field is not currently supported and should be omitted from the request URL.
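
The /generate_stream entry in the table above returns results in "Server-Sent Events" format. The sketch below shows one way a client might extract the `data:` payloads from such a stream. It follows the generic SSE framing rules (events separated by blank lines, comment lines starting with a colon) and makes no assumption about the JSON fields inside each payload, which this section does not describe. The function name is illustrative.

```python
import json


def iter_sse_data(lines):
    """Yield the data payload of each Server-Sent Event from an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment or keep-alive line
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
    if buffer:
        # Lenient choice: also emit an event that was not closed by a blank line.
        yield "\n".join(buffer)
```

With `urllib.request`, the open response object can be iterated line by line, so `iter_sse_data(line.decode("utf-8") for line in resp)` yields one string per event. If each payload is JSON, `json.loads` can then parse it.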