Introduction to Servitization Interfaces

Scenario Description

Server provides EndPoint to encapsulate inference serving protocols and APIs. It is compatible with third-party framework APIs such as Triton, OpenAI, TGI, and vLLM. After Server is installed in single-server mode, you can use a client (such as the Linux curl command or Postman) to send HTTP/HTTPS requests that call the APIs provided by EndPoint.

HTTPS is recommended, as it is more secure than HTTP.

Description of EndPoint RESTful APIs

The IP address and port number of an HTTP/HTTPS request URL are configured in the config.json file. For details, see Parameters in ServerConfig.

URL format of a generate request sent by Linux curl:
- Operation type: POST
- URL: http[s]://{ip}:{port}/generate

Inference request sent with HTTPS disabled:

curl -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{
  "inputs": "My name is Olivier and I",
  "parameters": {
    "details": true,
    "do_sample": true,
    "repetition_penalty": 1.1,
    "return_full_text": false,
    "seed": null,
    "temperature": 1,
    "top_p": 0.99
  }
}' http://{ip}:{port}/generate
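
If you script the call instead of using curl, the same request can be sketched in Python with only the standard library. This is a minimal sketch rather than an official client: the function name build_generate_request is hypothetical, and the payload fields simply mirror the curl example above.

```python
import json
import urllib.request


def build_generate_request(ip, port, prompt, use_https=False):
    """Build (but do not send) a POST request for the /generate API."""
    scheme = "https" if use_https else "http"
    url = f"{scheme}://{ip}:{port}/generate"
    payload = {
        "inputs": prompt,
        "parameters": {
            "details": True,
            "do_sample": True,
            "repetition_penalty": 1.1,
            "return_full_text": False,
            "seed": None,  # serialized as JSON null
            "temperature": 1,
            "top_p": 0.99,
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )
```

To send the request, pass the returned object to urllib.request.urlopen, using the IP address and port configured in config.json. The response format is not described in this section, so the sketch stops at building the request.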

Request sent with HTTPS bidirectional (mutual) authentication enabled:

curl --location --request POST 'https://{ip}:{port}/generate' \
--header 'Content-Type: application/json' \
--cacert /home/runs/static_conf/ca/ca.pem \
--cert /home/runs/static_conf/cert/client.pem \
--key /home/runs/static_conf/cert/client.key.pem \
--data-raw '{
    "inputs": "My name is Olivier and I",
    "parameters": {
        "best_of": 1,
        "decoder_input_details": false,
        "details": false,
        "do_sample": true,
        "max_new_tokens": 20,
        "repetition_penalty": 2,
        "return_full_text": false,
        "seed": 12,
        "temperature": 0.1,
        "top_k": 1,
        "top_p": 0.9,
        "truncate": 1024
    }
}'

--cacert: path of the CA certificate file used to verify the certificate of Server.
ca.pem: signature verification certificate or root certificate of Server.
--cert: path of the client certificate file.
client.pem: client certificate.
--key: path of the client private key file.
client.key.pem: private key of the client certificate. (The private key in this example is not encrypted. An encrypted key is recommended.)

Change the parameters as needed.
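
For a Python client, the three curl options above map onto an SSL context. The following is a minimal sketch using the standard ssl module; the function name make_mutual_tls_context and its parameter names are hypothetical, and the certificate paths are whatever you use in your own deployment.

```python
import ssl


def make_mutual_tls_context(ca_file=None, cert_file=None, key_file=None, key_password=None):
    """Create a client-side SSL context for bidirectional (mutual) authentication."""
    # Verify the server certificate against the given CA file (like --cacert).
    # With ca_file=None, the system default trust store is used instead.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    if cert_file is not None:
        # Present the client certificate and private key (like --cert and --key).
        # key_password is only needed when the private key is encrypted.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file, password=key_password)
    return context
```

The returned context can be passed to urllib.request.urlopen through its context argument. Supplying key_password lets you keep the client private key encrypted on disk, which matches the recommendation above.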

The following tables list the RESTful APIs that EndPoint provides.

**Table 1** Service status query APIs (internal interface query APIs)
API	Type	URL	Description	Framework
Server Live	GET	/v2/health/live	Checks whether the server is online.	Triton
Server Ready	GET	/v2/health/ready	Checks whether the server is ready.	Triton
Model Ready	GET	/v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready	Checks whether the model is ready.	Triton
health	GET	/health	Performs service health checks.	TGI, vLLM
TGI EndPoint information query	GET	/info	Queries the TGI EndPoint information.	TGI
Slot statistics	GET	/v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/getSlotCount	Queries customized slot statistics based on the Triton format.	Native
Health probe	GET	/health/timed[-${TIMEOUT}]	Checks whether an inference process is normal.	Native
Graceful exit	GET	/stopService	Implements graceful exit of the entire service. When this API is called, the system stops the service only after all running and waiting requests are complete. During this wait, all inference APIs are unavailable.	Native
Collecting static configurations	GET	/v1/config	Collects static configurations.	Native
Collecting dynamic status	GET	/v1/status	Collects dynamic status.	Native
Specifying an instance role	POST	/v1/role/${role}	Specifies an instance role.	Native
Collecting dynamic status	GET	/v2/status	Collects dynamic status.	Native
Specifying an instance role	POST	/v2/role/${role}	Specifies an instance role.	Native
Service metric API (JSON format)	GET	/metrics-json	Obtains the dynamic average values of Time To First Token (TTFT) and Time Between Tokens (TBT) over the most recent 1,000 requests (by default), the number of requests being executed, the number of waiting requests, and the number of remaining NPU blocks in the inference service.	Native
Querying service management and control metrics (Prometheus format)	GET	/metrics	Queries management and control metrics of inference servitization.	Native
Dynamically loading LoRA	POST	/v1/load_lora_adapter	Dynamically loads LoRA.	OpenAI
Dynamically unloading LoRA	POST	/v1/unload_lora_adapter	Dynamically unloads LoRA.	OpenAI
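
The square brackets in /health/timed[-${TIMEOUT}] mark an optional suffix. As an illustration, the helper below builds a few of the GET URLs from Table 1. It is a sketch only: the function names are hypothetical, and because this section does not state the unit or valid range of TIMEOUT, the value is passed through unchanged.

```python
def timed_health_path(timeout=None):
    """Return the native health probe path, /health/timed[-${TIMEOUT}]."""
    if timeout is None:
        return "/health/timed"
    # The unit of the timeout is not specified here; it is inserted as given.
    return f"/health/timed-{timeout}"


def probe_urls(base_url, timeout=None):
    """Map probe names to full URLs for some GET endpoints from Table 1."""
    base = base_url.rstrip("/")
    return {
        "live": f"{base}/v2/health/live",
        "ready": f"{base}/v2/health/ready",
        "health": f"{base}/health",
        "timed": base + timed_health_path(timeout),
    }
```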

**Table 2** Model/Service query APIs (service plane)
API	Type	URL	Description	Framework
Model list	GET	/v1/models	Lists available models.	OpenAI
Model details	GET	/v1/models/{model}	Queries model information.	OpenAI
Service metadata query	GET	/v2	Obtains service metadata.	Triton
Model metadata query	GET	/v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]	Queries model metadata.	Triton
Model configuration query	GET	/v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/config	Queries model configurations.	Triton

**Table 3** Inference APIs (service plane)
API	Type	URL	Description	Framework
Inference job	POST	/	TGI inference API. stream==false returns the text inference result, and stream==true returns the streaming inference result.	TGI
	POST	/generate	Inference API shared by TGI and vLLM. The request parameters determine which of the two formats is used.	TGI, vLLM
	POST	/generate_stream	TGI streaming inference API, which returns results in "Server-Sent Events" format.	TGI
	POST	/v1/chat/completions	OpenAI text/streaming inference API.	OpenAI
	POST	/v1/completions	vLLM-compatible OpenAI text/streaming inference API.	OpenAI
	POST	/infer	Native inference API, which can return results in text or streaming mode.	Native
	POST	/infer_token	Native inference API, which implements text or streaming inference based on input tokens.	Native
	POST	/v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer	Token inference API of Triton.	Triton
	POST	/v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/stopInfer	Terminates a request before it completes. The API follows the Triton API definition.	Native
	POST	/v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/generate	Triton text inference API.	Triton
	POST	/v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/generate_stream	Triton streaming inference API.	Triton
	POST	/v1/tokenizer	Calculates the number of tokens.	Native
	GET	/dresult	There is a persistent connection between the coordinator and the decode instance. Each time the decode instance generates an inference result, the result is returned to the coordinator through the persistent connection.	Prefill-decode disaggregation
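
The streaming APIs above, such as /generate_stream, return results in Server-Sent Events format, in which each event consists of one or more "data:" lines followed by a blank line. A minimal parser for that framing might look like the sketch below. The function name iter_sse_data is hypothetical, and the structure of each event payload is not described in this section, so the payload is yielded as raw text.

```python
def iter_sse_data(lines):
    """Yield the data payload of each Server-Sent Event from an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips one optional space after the colon.
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
        # Other fields (event:, id:, retry:) and comment lines are ignored here.
    if buffer:
        # Flush a trailing event that lacks a final blank line
        # (more lenient than the SSE specification, which discards it).
        yield "\n".join(buffer)
```

The function accepts any iterable of decoded lines, for example the lines of an HTTP response body decoded as UTF-8.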

The ${MODEL_NAME} field specifies the name of the model to be queried.
The [/versions/${MODEL_VERSION}] field is not currently supported. Omit it from the URL.
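
As an illustration of the two notes above, the Triton-style paths can be assembled as follows. This is a sketch under stated assumptions: the function name triton_model_path is hypothetical, and percent-encoding the model name is a precaution added here, not a documented requirement.

```python
from urllib.parse import quote


def triton_model_path(model_name, action=""):
    """Build /v2/models/${MODEL_NAME}[/<action>] without the unsupported version segment."""
    # Percent-encode the model name so that special characters stay inside one path segment.
    path = f"/v2/models/{quote(model_name, safe='')}"
    if action:
        # action is one of the suffixes from the tables, e.g. "ready", "config", "infer".
        path += f"/{action}"
    return path
```

For example, triton_model_path("llama", "ready") gives the Model Ready path from Table 1 for a model named llama (a placeholder name).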
