
Usage Example

Limitations and Constraints

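  • Atlas 800I A2 inference servers and Atlas 300I Duo inference cards support this feature.
  • The Qwen2 series, DeepSeek-R1, and DeepSeek-V3 models can be used with this feature.
  • KV Cache for a common prefix is reused across sessions only when the number of common prefix tokens is greater than or equal to the block size used by Page Attention.
  • Prefix Cache supports W8A8 quantization and sparse quantization; other quantization features are not yet supported.
  • This feature cannot be used together with the Multi-LoRA, long-sequence, or multi-node inference features.
  • This feature can be used together with PD disaggregation, parallel decoding, MTP, asynchronous scheduling, and SplitFuse.
  • In PD disaggregation scenarios, this feature must be enabled on both the P nodes and the D nodes.
  • Enabling this feature is not recommended when the prefix reuse rate is low or there is no reuse at all.
  • When enabling this feature for the DeepSeek-R1 and DeepSeek-V3 models, set "export TASK_QUEUE_ENABLE=1".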
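
The block-size constraint in the list above can be illustrated with a small sketch. It assumes that reuse is decided on whole blocks of the common token prefix; the function names and token IDs are hypothetical and are not part of MindIE.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by two token-ID sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reusable_prefix_tokens(prev_tokens, new_tokens, block_size):
    """Tokens whose KV Cache could be reused, assuming whole-block matching."""
    shared = common_prefix_len(prev_tokens, new_tokens)
    # Fewer shared tokens than one block: nothing is reused.
    # Otherwise round down to a multiple of the block size.
    return (shared // block_size) * block_size


first = list(range(164))             # hypothetical 164-token first prompt
second = first + [9001, 9002, 9003]  # second prompt extends the first
print(reusable_prefix_tokens(first, second, block_size=128))        # 128
print(reusable_prefix_tokens(first[:100], second, block_size=128))  # 0
```

With a block size of 128, a 100-token common prefix yields no reuse, which is why short or rarely shared prefixes gain little from this feature.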

Procedure

This section uses a multi-turn conversation as an example to briefly show how to use Prefix Cache.

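  1. Configure the service parameters. For parameter descriptions, see the configuration parameter description section.
    cd {MindIE installation directory}/latest/mindie-service/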
    vi conf/config.json

    Additional parameters required by the Prefix Cache feature:

    • To use Prefix Cache on its own, add the following parameter under the ModelConfig field in ModelDeployConfig:
      "plugin_params": "{\"plugin_type\":\"prefix_cache\"}"
    • To combine features, for example Prefix Cache with MTP, separate the feature names with commas. Add the following parameter under the ModelConfig field in ModelDeployConfig:
      "plugin_params": "{\"plugin_type\":\"mtp,prefix_cache\",\"num_speculative_tokens\": 1}"
    • When a DeepSeek model is used with this feature, the KV Cache NZ format must also be enabled. Add the following parameters under the ModelConfig field in ModelDeployConfig:
      "models": {
          "deepseekv2": {
              "kv_cache_options": {"enable_nz": true}
          }
      }
    • (Optional) Add the following parameter in ScheduleConfig:
      "enablePrefixCache": true

    Save the modified configuration, then start the service with the following command:

    ./bin/mindieservice_daemon
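
The plugin_params value in step 1 is a JSON document stored as a string inside config.json, so its inner quotes must be escaped. The sketch below shows one way to generate that string instead of escaping it by hand. The helper name `build_plugin_params` is hypothetical; only the keys shown in step 1 (plugin_type, num_speculative_tokens) come from this guide.

```python
import json


def build_plugin_params(plugin_types, **extra):
    """Return the JSON-encoded string used as the plugin_params value."""
    # Feature names are joined with commas, as required when features are combined.
    params = {"plugin_type": ",".join(plugin_types)}
    params.update(extra)
    return json.dumps(params, separators=(",", ":"))


# Fragment to merge into ModelConfig; the outer dump escapes the inner quotes.
model_config_fragment = {
    "plugin_params": build_plugin_params(
        ["mtp", "prefix_cache"], num_speculative_tokens=1
    )
}
print(json.dumps(model_config_fragment, indent=4))
```

Generating the string this way avoids mismatched backslashes, which are easy to introduce when editing the nested JSON manually.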
  2. Send the first request with the following command. The prompt is the first-round question.

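    To benefit from Prefix Cache, the prompt of the second request must share a sufficiently long common prefix with the prompt of the first request. Typical scenarios include multi-turn conversations and few-shot learning.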

    curl https://127.0.0.1:1025/generate \
    -H "Content-Type: application/json" \
    --cacert ca.pem --cert client.pem  --key client.key.pem \
    -X POST \
    -d '{
    "inputs": "Question: Parents have complained to the principal about bullying during recess. The principal wants to quickly resolve this, instructing recess aides to be vigilant. Which situation should the aides report to the principal?\na) An unengaged girl is sitting alone on a bench, engrossed in a book and showing no interaction with her peers.\nb) Two boys engaged in a one-on-one basketball game are involved in a heated argument regarding the last scored basket.\nc) A group of four girls has surrounded another girl and appears to have taken possession of her backpack.\nd) Three boys are huddled over a handheld video game, which is against the rules and not permitted on school grounds.\nAnswer:",
    "parameters": {"max_new_tokens":512}
    }'
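  3. Send the second request, whose prompt is: first-round question + first-round answer + second-round question. The first-round question is now the reusable common prefix. The part actually reused may be shorter than the full first-round prompt: the cache is implemented in units of blocks, so Prefix Cache stores prefixes in multiples of the block size. For example, if the first-round prompt has 164 tokens and the block size is 128, only the first 128 tokens are reused.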
    curl https://127.0.0.1:1025/generate \
    -H "Content-Type: application/json" \
    --cacert ca.pem --cert client.pem  --key client.key.pem \
    -X POST \
    -d '{
    "inputs": "Question: Parents have complained to the principal about bullying during recess. The principal wants to quickly resolve this, instructing recess aides to be vigilant. Which situation should the aides report to the principal?\na) An unengaged girl is sitting alone on a bench, engrossed in a book and showing no interaction with her peers.\nb) Two boys engaged in a one-on-one basketball game are involved in a heated argument regarding the last scored basket.\nc) A group of four girls has surrounded another girl and appears to have taken possession of her backpack.\nd) Three boys are huddled over a handheld video game, which is against the rules and not permitted on school grounds.\nAnswer:c) A group of four girls has surrounded another girl and appears to have taken possession of her backpack.\nExplanation: The principal wants to quickly resolve this, instructing recess aides to be vigilant. The principal is concerned about bullying during recess. The principal wants the aides to report any bullying behavior to him. The principal is not concerned about the other situations.\nQuestion: If the aides confront the group of girls from situation (c) and they deny bullying, stating that they were merely playing a game, what specific evidence should the aides look for to determine if this is a likely truth or a cover-up for bullying?\nAnswer:",
    "parameters": {"max_new_tokens":512}
    }'
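
The two requests above can also be assembled programmatically. The following is a minimal sketch, assuming the same request body shape as the curl commands (inputs, parameters.max_new_tokens); the helper names and the shortened prompts are hypothetical. The key point is that the next prompt is built by appending to the previous one, never by editing it, so the earlier text remains an exact prefix.

```python
import json


def build_payload(prompt, max_new_tokens=512):
    """Request body in the shape used by the curl examples."""
    return json.dumps(
        {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    )


def next_round_prompt(previous_prompt, previous_answer, next_question):
    """Previous prompt + previous answer + new question, ending with 'Answer:'."""
    return f"{previous_prompt}{previous_answer}\nQuestion: {next_question}\nAnswer:"


round1 = "Question: Which situation should the aides report?\nAnswer:"
answer1 = "c) A group of four girls has surrounded another girl."
round2 = next_round_prompt(
    round1, answer1, "What evidence should the aides look for?"
)

# The first-round prompt must survive unchanged at the start of round two.
assert round2.startswith(round1)
print(build_payload(round2))
```

An identical text prefix is only a precondition here; how many tokens are actually reused still depends on tokenization and on the block size.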