.
├── cover
│   ├── vllm
│   │   └── __init__.py
│   ├── requirements.txt
│   └── setup.py
├── examples
│   ├── start_server.sh
│   ├── test_offline.py
│   └── test_offline.sh
├── install.sh
└── vllm_npu
    ├── requirements.txt
    ├── setup.py
    ├── tests
    │   ├── models
    │   │   ├── __init__.py
    │   │   └── test_models.py
    │   └── sampler
    │       └── test_sampler.py
    └── vllm_npu
        ├── config.py
        ├── core
        │   ├── __init__.py
        │   └── scheduler.py
        ├── engine
        │   ├── __init__.py
        │   ├── llm_engine.py
        │   └── ray_utils.py
        ├── __init__.py
        ├── model_executor
        │   ├── ascend_model_loader.py
        │   ├── __init__.py
        │   ├── layers
        │   │   ├── __init__.py
        │   │   └── sampler.py
        │   ├── models
        │   │   ├── ascend
        │   │   │   ├── __init__.py
        │   │   │   └── mindie_llm_wrapper.py
        │   │   └── __init__.py
        │   └── utils.py
        ├── npu_adaptor.py
        ├── utils.py
        └── worker
            ├── ascend_worker.py
            ├── cache_engine.py
            ├── __init__.py
            └── model_runner.py
Keep your network connection stable during installation to avoid failures caused by network issues.
bash install.sh
pip show vllm
pip show vllm_npu
Name: vllm
Version: 0.3.3
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
Author-email:
License: Apache 2.0
Requires: fastapi, ninja, numpy, outlines, prometheus_client, psutil, pydantic, pynvml, ray, sentencepiece, transformers, uvicorn
Required-by:
Name: vllm-npu
Version: 0.3.3
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: UNKNOWN
Author: Huawei
Author-email:
License: Apache 2.0
Requires: absl-py, accelerate, attrs, cloudpickle, decorator, numpy, pandas, psutil, ray, scipy, tornado, transformers
Required-by:
For the vLLM 0.6.2 adaptation, you only need to run pip show vllm to check whether the installation succeeded. The vLLM version reported by pip show may not be v0.6.2; this is a known issue in native vLLM and does not affect usage.
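If you prefer to verify the installation from Python instead of the shell, the following minimal sketch checks both distributions. The distribution names (vllm and vllm-npu) are taken from the pip show output above; adjust them if your build uses different package names.

# Minimal sketch: confirm that the vllm and vllm-npu distributions are
# installed and print their versions.
from importlib.metadata import version, PackageNotFoundError

for dist in ("vllm", "vllm-npu"):
    try:
        print(f"{dist}: {version(dist)}")
    except PackageNotFoundError:
        print(f"{dist}: NOT installed")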
The examples directory under the folder created in step 1 contains demo scripts for offline inference and online inference, test_offline.sh and start_server.sh respectively. They are used as follows:
bash test_offline.sh
The offline inference demo script contains the following example code:

import argparse

from vllm import LLM, SamplingParams
from vllm.logger import init_logger

logger = init_logger(__name__)

parser = argparse.ArgumentParser()
parser.add_argument("--model-path", type=str, default="facebook/opt-125m")

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(max_tokens=64, temperature=0.7)

args = parser.parse_args()
model_path = args.model_path

llm = LLM(
    model=model_path,
    max_model_len=4096,      # max length of prompt
    tensor_parallel_size=1,  # number of NPUs to be used
    max_num_seqs=256,        # max batch number
    enforce_eager=True,      # disable CUDA graph mode
    trust_remote_code=True,  # required if the model is a custom model not yet available in the HuggingFace transformers library
)

outputs = llm.generate(prompts, sampling_params)
for i, output in enumerate(outputs):
    prompt = output.prompt
    generated_text = output.outputs[0].text
    logger.info(f"req_num: {i}\nPrompt: {prompt!r}\nGenerated text: {generated_text!r}")
The example code of test_offline.sh is as follows:
export VLLM_NO_USAGE_STATS=1  # disable vLLM usage stats reporting to avoid errors
python3 offline_inference.py --model-path facebook/opt-125m
bash start_server.sh
The example code of start_server.sh is as follows:
export VLLM_NO_USAGE_STATS=1  # disable vLLM usage stats reporting to avoid errors
python -m vllm.entrypoints.openai.api_server --model=facebook/opt-125m -tp 8 --trust-remote-code --enforce-eager --worker-use-ray
curl http://localhost:8004/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "model_path",
        "max_tokens": 1,
        "temperature": 0,
        "top_p": 0.9,
        "prompt": "The future of AI is"
    }'
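Equivalently, the completions endpoint can be queried from Python. The sketch below uses the requests library; the host/port and the "model" value mirror the curl example above and are assumptions, so replace them with the values used by your deployment.

# Minimal sketch: send a completion request to the OpenAI-compatible server
# started by start_server.sh and print the generated text.
import requests

payload = {
    "model": "model_path",  # replace with the served model name or path
    "max_tokens": 1,
    "temperature": 0,
    "top_p": 0.9,
    "prompt": "The future of AI is",
}
resp = requests.post("http://localhost:8004/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])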