.
├── cover
│   ├── vllm
│   │   └── __init__.py
│   ├── requirements.txt
│   └── setup.py
├── examples
│   ├── start_server.sh
│   ├── test_offline.py
│   └── test_offline.sh
├── install.sh
└── vllm_npu
    ├── requirements.txt
    ├── setup.py
    ├── tests
    │   ├── models
    │   │   ├── __init__.py
    │   │   └── test_models.py
    │   └── sampler
    │       └── test_sampler.py
    └── vllm_npu
        ├── config.py
        ├── core
        │   ├── __init__.py
        │   └── scheduler.py
        ├── engine
        │   ├── __init__.py
        │   ├── llm_engine.py
        │   └── ray_utils.py
        ├── __init__.py
        ├── model_executor
        │   ├── ascend_model_loader.py
        │   ├── __init__.py
        │   ├── layers
        │   │   ├── __init__.py
        │   │   └── sampler.py
        │   ├── models
        │   │   ├── ascend
        │   │   │   ├── __init__.py
        │   │   │   └── mindie_llm_wrapper.py
        │   │   └── __init__.py
        │   └── utils.py
        ├── npu_adaptor.py
        ├── utils.py
        └── worker
            ├── ascend_worker.py
            ├── cache_engine.py
            ├── __init__.py
            └── model_runner.py
Keep your network connection stable during installation; a dropped connection will cause the install to fail.
bash install.sh
After installation completes, verify that both packages are present:

pip show vllm
pip show vllm_npu
Name: vllm
Version: 0.3.3
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
Author-email:
License: Apache 2.0
Requires: fastapi, ninja, numpy, outlines, prometheus_client, psutil, pydantic, pynvml, ray, sentencepiece, transformers, uvicorn
Required-by:
Name: vllm-npu
Version: 0.3.3
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: UNKNOWN
Author: Huawei
Author-email:
License: Apache 2.0
Requires: absl-py, accelerate, attrs, cloudpickle, decorator, numpy, pandas, psutil, ray, scipy, tornado, transformers
Required-by:
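The same check can be done programmatically. This is an optional sketch, not part of the shipped scripts; it reads the installed versions through the standard importlib.metadata module:

import importlib.metadata

# Both distributions should report the matching 0.3.3 release.
for pkg in ("vllm", "vllm-npu"):
    print(pkg, importlib.metadata.version(pkg))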
The examples directory under the folder created in step 1 contains demo scripts for offline and online inference, test_offline.sh and start_server.sh respectively. Use them as follows:
For offline inference:

bash test_offline.sh
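test_offline.sh wraps the Python driver test_offline.py shown in the directory tree above. As a rough sketch of what such a driver looks like with the standard vLLM 0.3.x offline API (the model path below is a placeholder, not a value from this repository):

from vllm import LLM, SamplingParams

# Placeholder path; point this at your local model weights.
llm = LLM(model="/path/to/model", trust_remote_code=True)

sampling_params = SamplingParams(temperature=0, top_p=0.9, max_tokens=64)
outputs = llm.generate(["The future of AI is"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)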
For online inference, start the API server:

bash start_server.sh
Once the server is up, send it a completion request, for example:

curl http://localhost:8004/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "model_path",
        "max_tokens": 1,
        "temperature": 0,
        "top_p": 0.9,
        "prompt": "The future of AI is"
    }'
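The same request can also be issued from Python. A minimal sketch using the requests library, assuming start_server.sh exposed the OpenAI-compatible API on port 8004 as in the curl example above:

import requests

# Mirrors the curl request above; "model_path" is the placeholder model name.
response = requests.post(
    "http://localhost:8004/v1/completions",
    json={
        "model": "model_path",
        "max_tokens": 1,
        "temperature": 0,
        "top_p": 0.9,
        "prompt": "The future of AI is",
    },
)
print(response.json())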