Usage Differences

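To run TGI, set the environment variable that selects the visible devices: CUDA_VISIBLE_DEVICES on GPUs, or ASCEND_RT_VISIBLE_DEVICES in an Ascend environment. The following is a simple example of starting the TGI service and sending a request: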

Start the Nginx service (make sure the configuration file has been set up as described in Environment Preparation).
```
service nginx start
```

On the server, use the launch script to start the TGI online inference service.

```
# Fraction of device memory the framework may occupy
export CUDA_MEMORY_FRACTION=0.9
# IDs of the devices visible to the system
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# Local path to the model weights, or the model's location in a Hugging Face repository
model_path=/home/data/models/qwen2_7B_instruct
# The launch parameters below are the same as in native TGI
text-generation-launcher \
    --model-id "$model_path" \
    --port 12347 \
    --max-input-length 2048 \
    --max-total-tokens 2560 \
    --sharded true \
    --num-shard 8 \
    --max-batch-prefill-tokens 8192 \
    --max-waiting-tokens 20 \
    --max-concurrent-requests 256 \
    --waiting-served-ratio 1.2
```
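In the launch script above, --num-shard 8 matches the eight device IDs listed in ASCEND_RT_VISIBLE_DEVICES. As a minimal sketch, the shard count can be derived from the device list and the token limits sanity-checked before launching. The helper names count_devices and check_token_limits are hypothetical, and the checks assume that the launcher expects --max-input-length to be smaller than --max-total-tokens and no larger than --max-batch-prefill-tokens.

```shell
# Hypothetical helper: count the IDs in a comma-separated device list.
count_devices() {
    # Run in a subshell so the IFS and globbing changes do not leak out.
    ( IFS=','; set -f; set -- $1; echo "$#" )
}

# Hypothetical helper: succeed only if the token limits are consistent.
# $1 = max input length, $2 = max total tokens, $3 = max batch prefill tokens
check_token_limits() {
    [ "$1" -lt "$2" ] && [ "$1" -le "$3" ]
}

# Values taken from the launch script above.
devices="0,1,2,3,4,5,6,7"
num_shard="$(count_devices "$devices")"

if check_token_limits 2048 2560 8192; then
    echo "launch with --num-shard ${num_shard}"
else
    echo "inconsistent token limits" >&2
fi
```

With the values from the script, this prints "launch with --num-shard 8". Deriving the count from the device list avoids the two settings drifting apart when the set of visible devices changes.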

The client sends an HTTPS inference request to the server and receives the response.

```
curl https://127.0.0.1:12346/generate -X POST -d '{"inputs":"Please introduce yourself.","parameters":{"max_new_tokens":64,"repetition_penalty":1.2}}' -H 'Content-Type: application/json'
```

Native TGI serves inference over HTTP. The request command is as follows (not recommended, because HTTP is insecure):

```
curl http://127.0.0.1:12347/generate -X POST -d '{"inputs":"Please introduce yourself.","parameters":{"max_new_tokens":64,"repetition_penalty":1.2}}' -H 'Content-Type: application/json'
```
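The two requests above differ only in scheme and port; the path, headers, and JSON body are identical. A minimal sketch of a reusable wrapper follows. The function names build_generate_payload and send_generate are hypothetical, and the payload builder assumes the prompt contains no characters that need JSON escaping (such as double quotes or backslashes).

```shell
# Hypothetical helper: build the JSON body for the /generate endpoint.
# $1 = prompt (assumed to need no JSON escaping)
# $2 = max_new_tokens, $3 = repetition_penalty
build_generate_payload() {
    printf '{"inputs":"%s","parameters":{"max_new_tokens":%s,"repetition_penalty":%s}}' \
        "$1" "$2" "$3"
}

# Hypothetical helper: POST a payload to <base URL>/generate.
# $1 = base URL, $2 = JSON payload
send_generate() {
    curl "$1/generate" -X POST -d "$2" -H 'Content-Type: application/json'
}

payload="$(build_generate_payload "Please introduce yourself." 64 1.2)"

# HTTPS endpoint from the example above (recommended):
#   send_generate https://127.0.0.1:12346 "$payload"
# Native HTTP endpoint (not recommended):
#   send_generate http://127.0.0.1:12347 "$payload"
```

Keeping the base URL as a parameter makes it easy to switch between the HTTPS endpoint and the native HTTP port without duplicating the request body.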
