Usage Differences

The environment variable that controls which devices are visible is CUDA_VISIBLE_DEVICES on GPUs and ASCEND_RT_VISIBLE_DEVICES in the Ascend environment.

For example, to make devices 0 through 7 available:

export ASCEND_RT_VISIBLE_DEVICES=0,1,…,7
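Because the two platforms read different variables, a launch script shared between GPU and Ascend machines has to export the right one. As a minimal sketch, the helper below exports whichever variable matches the platform passed to it and builds the comma-separated list from a first and last device index. set_visible_devices is a hypothetical name (it is not part of vLLM or the Ascend toolkit), and the sketch assumes bash with the standard seq and paste utilities.

```shell
#!/bin/bash
# Hypothetical helper (not part of vLLM or CANN): export the visible-device
# variable for the given platform.
# Usage: set_visible_devices <gpu|ascend> <first-index> <last-index>
set_visible_devices() {
    local platform="$1" first="$2" last="$3"
    local ids
    # Join the indices first..last with commas, e.g. 0 7 -> 0,1,2,3,4,5,6,7
    ids=$(seq "$first" "$last" | paste -sd, -)
    case "$platform" in
        gpu)    export CUDA_VISIBLE_DEVICES="$ids" ;;
        ascend) export ASCEND_RT_VISIBLE_DEVICES="$ids" ;;
        *)      echo "unknown platform: $platform" >&2; return 1 ;;
    esac
}

set_visible_devices ascend 0 7
echo "ASCEND_RT_VISIBLE_DEVICES=$ASCEND_RT_VISIBLE_DEVICES"
# prints ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
```

Calling `set_visible_devices gpu 0 7` on a GPU machine sets CUDA_VISIBLE_DEVICES to the same list instead.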

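Because CUDA graphs are not supported in the Ascend environment, the service must be started with the --enforce-eager parameter. In multi-device mode, --worker-use-ray must also be enabled. For example: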

#!/bin/bash

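# Expose only device 3 to this process (Ascend counterpart of CUDA_VISIBLE_DEVICES).
# --enforce-eager is required on Ascend; --worker-use-ray is required in multi-device mode.
export ASCEND_RT_VISIBLE_DEVICES=3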
python -m vllm.entrypoints.openai.api_server  \
       --model=/home/data/models/LLaMA3-8B \
       --trust-remote-code \
       --enforce-eager \
       --max-model-len 4096 \
       --worker-use-ray \
       -tp 1 \
       --port 8006 \
       --block-size 128
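To show how the two Ascend-specific flags fit into single- and multi-device launches, the sketch below assembles the command from the example for a given device list and prints it rather than running it. build_launch_cmd is a hypothetical helper, and it assumes the tensor parallel size (-tp) equals the number of visible devices; the remaining flags and values are copied from the example.

```shell
#!/bin/bash
# Hypothetical dry-run helper (not part of vLLM): print the Ascend launch
# command for a comma-separated device list instead of executing it.
build_launch_cmd() {
    local devices="$1" model="$2" port="$3"
    local -a ids
    IFS=',' read -r -a ids <<< "$devices"
    # Assumption: one tensor-parallel rank per visible device.
    local tp=${#ids[@]}
    # --enforce-eager is always passed on Ascend (no CUDA graph support).
    local -a args=(python -m vllm.entrypoints.openai.api_server
        "--model=$model" --trust-remote-code
        --enforce-eager --max-model-len 4096 --block-size 128)
    # --worker-use-ray is added only in multi-device mode.
    if (( tp > 1 )); then
        args+=(--worker-use-ray)
    fi
    args+=(-tp "$tp" --port "$port")
    echo "ASCEND_RT_VISIBLE_DEVICES=$devices ${args[*]}"
}

build_launch_cmd 0,1 /home/data/models/LLaMA3-8B 8006
```

The single-device example above passes --worker-use-ray anyway; this sketch adds it only when more than one device is listed, which is the case the text says requires it. Adding it unconditionally, as the example does, is an equally valid choice.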
