.
├── cover
│   ├── vllm
│   │   └── __init__.py
│   ├── requirements.txt
│   └── setup.py
├── examples
│   ├── start_server.sh
│   ├── test_offline.py
│   └── test_offline.sh
├── install.sh
└── vllm_npu
    ├── requirements.txt
    ├── setup.py
    ├── tests
    │   ├── models
    │   │   ├── __init__.py
    │   │   └── test_models.py
    │   └── sampler
    │       └── test_sampler.py
    └── vllm_npu
        ├── config.py
        ├── core
        │   ├── __init__.py
        │   └── scheduler.py
        ├── engine
        │   ├── __init__.py
        │   ├── llm_engine.py
        │   └── ray_utils.py
        ├── __init__.py
        ├── model_executor
        │   ├── ascend_model_loader.py
        │   ├── __init__.py
        │   ├── layers
        │   │   ├── __init__.py
        │   │   └── sampler.py
        │   ├── models
        │   │   ├── ascend
        │   │   │   ├── __init__.py
        │   │   │   └── mindie_llm_wrapper.py
        │   │   └── __init__.py
        │   └── utils.py
        ├── npu_adaptor.py
        ├── utils.py
        └── worker
            ├── ascend_worker.py
            ├── cache_engine.py
            ├── __init__.py
            └── model_runner.py
Keep your network connection stable during installation; a dropped connection will cause the install to fail.
bash install.sh
After installation completes, verify that both packages are present:

pip show vllm
pip show vllm_npu
Name: vllm
Version: 0.3.3
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
Author-email:
License: Apache 2.0
Requires: fastapi, ninja, numpy, outlines, prometheus_client, psutil, pydantic, pynvml, ray, sentencepiece, transformers, uvicorn
Required-by:
Name: vllm-npu
Version: 0.3.3
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: UNKNOWN
Author: Huawei
Author-email:
License: Apache 2.0
Requires: absl-py, accelerate, attrs, cloudpickle, decorator, numpy, pandas, psutil, ray, scipy, tornado, transformers
Required-by:
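The same check can be done programmatically. This is an optional sketch, not part of the shipped scripts; it reads the installed versions through the standard importlib.metadata module:

import importlib.metadata

# Both distributions should report the matching 0.3.3 release.
for pkg in ("vllm", "vllm-npu"):
    print(pkg, importlib.metadata.version(pkg))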
The examples directory under the folder created in step 1 contains demo scripts for offline and online inference, test_offline.sh and start_server.sh respectively. Use them as follows:
For offline inference:

bash test_offline.sh
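test_offline.sh wraps the Python driver test_offline.py shown in the directory tree above. As a rough sketch of what such a driver looks like with the standard vLLM 0.3.x offline API (the model path below is a placeholder, not a value from this repository):

from vllm import LLM, SamplingParams

# Placeholder path; point this at your local model weights.
llm = LLM(model="/path/to/model", trust_remote_code=True)

sampling_params = SamplingParams(temperature=0, top_p=0.9, max_tokens=64)
outputs = llm.generate(["The future of AI is"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)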
For online inference, start the API server:

bash start_server.sh
Once the server is up, send it a completion request, for example:

curl http://localhost:8004/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "model_path",
        "max_tokens": 1,
        "temperature": 0,
        "top_p": 0.9,
        "prompt": "The future of AI is"
    }'
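The same request can also be issued from Python. A minimal sketch using the requests library, assuming start_server.sh exposed the OpenAI-compatible API on port 8004 as in the curl example above:

import requests

# Mirrors the curl request above; "model_path" is the placeholder model name.
response = requests.post(
    "http://localhost:8004/v1/completions",
    json={
        "model": "model_path",
        "max_tokens": 1,
        "temperature": 0,
        "top_p": 0.9,
        "prompt": "The future of AI is",
    },
)
print(response.json())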