
Model Inference

Perform offline inference or online serving inference as needed.

Prerequisites

Set the environment variables first. If the components were installed in the default path, run the following commands to initialize each component's environment variables.

# Set up the CANN environment (installed under /usr/local by default)
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Set up the acceleration library (ATB) environment
source /usr/local/Ascend/nnal/atb/set_env.sh
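If the components were installed somewhere other than the default prefix, the scripts will not be at these paths. A quick sanity check before sourcing (the paths below are the defaults from above; adjust them for a custom install):

```shell
# Check that both set_env.sh scripts exist before sourcing them.
# Paths assume the default install prefix /usr/local/Ascend.
for f in /usr/local/Ascend/ascend-toolkit/set_env.sh \
         /usr/local/Ascend/nnal/atb/set_env.sh; do
  if [ -f "$f" ]; then
    echo "found:   $f"
  else
    echo "missing: $f (adjust the path to your install location)"
  fi
done
```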

Offline Inference

Refer to the vLLM offline inference example documentation for this procedure.

  1. Using the Qwen2.5-7B model as an example, enter the container and create a Python script (e.g. test.py) in the current directory, as shown below:
    from vllm import LLM, SamplingParams
    
    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    
    # Create an LLM.
    llm = LLM(model="/home/weight/qwen2.5-7b")
    
    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    

    "/home/weight/" is the model weight file path; change it to match your environment.

  2. Run the following command to start inference:
    python test.py
  3. Check the result. Output like the following indicates that inference succeeded:
    Prompt: 'Hello, my name is', Generated text: ' Daniel and I am an 8th grade student at York Middle School. I'
    Prompt: 'The president of the United States is', Generated text: ' the commander-in-chief of the armed forces. When a state of war exists,'
    Prompt: 'The capital of France is', Generated text: ' ____.\nA. Geneva\nB. Strasbourg\nC. Paris\nD'
    Prompt: 'The future of AI is', Generated text: " now\n\nThe future of AI is now\n\nIf you're like most tech professionals"
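
The `temperature` and `top_p` values passed to `SamplingParams` above control how each next token is drawn: temperature rescales the logits before the softmax, and top-p (nucleus) sampling keeps only the smallest set of top-ranked tokens whose cumulative probability reaches `top_p`. A minimal stdlib sketch of the idea (illustrative only, not vLLM's implementation):

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits scaled by 1/temperature.

    Lower temperature sharpens the distribution; higher flattens it.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest top-ranked set whose cumulative probability
    reaches top_p, then renormalize the kept probabilities to sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

# With a low top_p, the least likely token is dropped before sampling:
probs = dict(zip("abc", apply_temperature([2.0, 1.0, 0.1], temperature=0.8)))
print(top_p_filter(probs, top_p=0.8))   # 'c' is excluded
```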
    

Starting Online Serving Inference

  1. Start the service as described in the vLLM online serving example documentation. Taking Qwen2.5-7B as a simple example ("/home/weight/" is the model weight file path; change it to match your environment):
    vllm serve /home/weight/qwen2.5-7b --dtype auto --port 8000

    The following output indicates that the service started successfully:

    INFO:     Application startup complete.
    
  2. After the service starts, open a new terminal window and send a request with the following command:
    curl http://localhost:port/v1/completions \  # the IP and port the inference service listens on; the port must match the server
      -H "Content-Type: application/json" \
      -d '{
        "model": "/home/weight/qwen2.5-7b",
        "max_tokens": 25,
        "temperature": 0,
        "top_p": 0.9,
        "prompt": "The future of AI is"
      }'
    • The port number defaults to "8000".
    • "/home/weight/" is the model weight file path; change it to match your environment.
  3. Check the result. Output like the following indicates that the request was sent successfully:
    "choices":[{"index":0,"text":" here. It’s not just a buzzword or a concept anymore. It’s a reality that’s transforming the way we live,"logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":30,"completion_tokens":25,"prompt_tokens_details":null}}