W8A8

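This quantization method quantizes both weights and activations, converting higher-bit-width floating-point values to 8 bits, which reduces the size of the model weights. Computing on int8 data reduces the computational cost of the MatMul operator and so improves inference performance.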

Only the LLaMA3, LLaMA3.1, Qwen2, and Qwen2.5 model series are supported.

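This method can quantize original weights of type float16 or bfloat16. After quantization, a "quantization_config" field is added to config.json under the original weight path. Part of its content is shown below: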

"quantization_config": {
    ...
    "model_quant_type": "W8A8",
    "model.embed_tokens.weight": "FLOAT",
    "model.layers.0.self_attn.q_proj.weight": "W8A8",
    "model.layers.0.self_attn.q_proj.input_scale": "W8A8",
    "model.layers.0.self_attn.q_proj.input_offset": "W8A8",
    "model.layers.0.self_attn.q_proj.quant_bias": "W8A8",
    "model.layers.0.self_attn.q_proj.deq_scale": "W8A8",
    ...
}
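To check which tensors were quantized and which stayed in floating point, the per-tensor entries of "quantization_config" can be tallied by type. The sketch below is only an illustration: the helper names `load_config` and `summarize_quant_config` are hypothetical, and the sample dictionary contains only the entries visible in the excerpt above, with the elided ones left out.

```python
import json
from collections import Counter


def load_config(path):
    """Read a config.json file into a dict (path is supplied by the caller)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def summarize_quant_config(config):
    """Count tensors by quantization type in the quantization_config section."""
    quant_cfg = config.get("quantization_config", {})
    # Skip the global "model_quant_type" key; keep only per-tensor string entries.
    per_tensor = {
        name: qtype
        for name, qtype in quant_cfg.items()
        if name != "model_quant_type" and isinstance(qtype, str)
    }
    return Counter(per_tensor.values())


# Stand-in for the excerpt shown above (elided entries are not reproduced).
sample_config = {
    "quantization_config": {
        "model_quant_type": "W8A8",
        "model.embed_tokens.weight": "FLOAT",
        "model.layers.0.self_attn.q_proj.weight": "W8A8",
        "model.layers.0.self_attn.q_proj.input_scale": "W8A8",
        "model.layers.0.self_attn.q_proj.input_offset": "W8A8",
        "model.layers.0.self_attn.q_proj.quant_bias": "W8A8",
        "model.layers.0.self_attn.q_proj.deq_scale": "W8A8",
    }
}

summary = summarize_quant_config(sample_config)
print(dict(summary))
```

For a real checkpoint, `summarize_quant_config(load_config(path))` would be called with the path of the config.json in question.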

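After quantization, each MatMul weight gains four additional tensors: input_scale, input_offset, quant_bias, and deq_scale. input_scale and input_offset are used to quantize the activations. MatMul then computes on the quantized activations and the quantized weights. quant_bias and deq_scale are used to dequantize the MatMul result.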
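The flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the actual kernel: the rounding rule round(x / input_scale + input_offset), the way quant_bias folds in the offset correction, and deq_scale being the product of input_scale and a per-channel weight scale are all assumptions chosen to make the arithmetic consistent. The function names and the `weight_scale` variable are hypothetical, and plain Python floats stand in for the stored dtypes.

```python
def quantize_activation(x, input_scale, input_offset):
    """Map float activations to int8 (assumed rule: round(x / scale + offset), clamped)."""
    return [max(-128, min(127, round(v / input_scale + input_offset))) for v in x]


def w8a8_linear(x, weight_q, input_scale, input_offset, quant_bias, deq_scale):
    """Quantize x, run an integer MatMul against weight_q ([n, k]), then dequantize."""
    x_q = quantize_activation(x, input_scale, input_offset)
    out = []
    for row, bias, scale in zip(weight_q, quant_bias, deq_scale):
        acc = sum(a * w for a, w in zip(x_q, row))  # integer accumulation
        out.append((acc + bias) * scale)            # dequantize with quant_bias and deq_scale
    return out


# Toy example with n = 2 output channels and k = 2 input features.
weight_scale = [0.25, 0.25]          # hypothetical per-channel weight scale
weight_q = [[2, -1], [4, 3]]         # int8 weights; float weights are weight_q * weight_scale
input_scale, input_offset = 0.5, 3

# Assumption: quant_bias cancels the contribution of input_offset (no float bias here).
quant_bias = [-input_offset * sum(row) for row in weight_q]
# Assumption: deq_scale combines the activation scale and the weight scale.
deq_scale = [input_scale * s for s in weight_scale]

x = [1.0, -2.0]
y = w8a8_linear(x, weight_q, input_scale, input_offset, quant_bias, deq_scale)

# Reference result computed directly in floating point.
y_ref = [sum(xv * w * s for xv, w in zip(x, row)) for row, s in zip(weight_q, weight_scale)]
print(y, y_ref)
```

The toy values are picked so that quantization is exact, which is why the quantized path reproduces the floating-point result. With real activations there would be a small rounding error.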

Figure 1 Inference flow with quantized weights

Table 1 dtype and shape of each tensor after quantizing float16 weights (assuming the original weight has shape [n, k])
Tensor	weight	input_scale	input_offset	quant_bias	deq_scale
dtype	int8	float16	float16	int32	int64
shape	[n, k]	[1]	[1]	[n]	[n]

Table 2 dtype and shape of each tensor after quantizing bfloat16 weights (assuming the original weight has shape [n, k])
Tensor	weight	input_scale	input_offset	quant_bias	deq_scale
dtype	int8	bfloat16	bfloat16	int32	float32
shape	[n, k]	[1]	[1]	[n]	[n]
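As a rough check on the size reduction, the dtypes and shapes in Tables 1 and 2 can be turned into a byte count for a single MatMul weight. The sketch below only encodes what the two tables state. The helper names `w8a8_tensor_specs` and `total_bytes` are hypothetical, and the 4096 x 4096 shape is an arbitrary example, not taken from any specific model.

```python
# Bytes per element for the dtypes that appear in Tables 1 and 2.
DTYPE_BYTES = {"int8": 1, "float16": 2, "bfloat16": 2, "int32": 4, "float32": 4, "int64": 8}


def w8a8_tensor_specs(n, k, orig_dtype):
    """Return {tensor name: (dtype, shape)} for one quantized MatMul weight."""
    if orig_dtype == "float16":
        scale_dtype, deq_dtype = "float16", "int64"      # Table 1
    elif orig_dtype == "bfloat16":
        scale_dtype, deq_dtype = "bfloat16", "float32"   # Table 2
    else:
        raise ValueError(f"unsupported original dtype: {orig_dtype}")
    return {
        "weight": ("int8", (n, k)),
        "input_scale": (scale_dtype, (1,)),
        "input_offset": (scale_dtype, (1,)),
        "quant_bias": ("int32", (n,)),
        "deq_scale": (deq_dtype, (n,)),
    }


def total_bytes(specs):
    """Sum the storage size of every tensor in a spec dict."""
    total = 0
    for dtype, shape in specs.values():
        count = 1
        for dim in shape:
            count *= dim
        total += count * DTYPE_BYTES[dtype]
    return total


n, k = 4096, 4096                                   # arbitrary example shape
original_bytes = n * k * DTYPE_BYTES["float16"]     # unquantized float16 weight
fp16_quant_bytes = total_bytes(w8a8_tensor_specs(n, k, "float16"))
bf16_quant_bytes = total_bytes(w8a8_tensor_specs(n, k, "bfloat16"))
print(original_bytes, fp16_quant_bytes, bf16_quant_bytes)
```

For this shape the extra tensors add only a few tens of kilobytes, so the quantized layer takes about half the space of the 16-bit original.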

Generating weights

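Taking the LLaMA3.1-70B-Instruct model as an example, and following the LLaMA3.1-70B W8A8 quantization method, you can generate W8A8 quantized weights with the following commands.

cd {msmodelslim installation path}/example/Llama/
python3 quant_llama.py --model_path {floating-point weight path} --save_directory {W8A8 quantized weight path} --calib_file ../common/boolq.jsonl --device_type npu --disable_level L5 --anti_method m3 --act_method 3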

Usage example

Once the weights are generated, you can start the vLLM service with "--tp 8" on port 12345 as shown below. You can choose any value for {model name}.

python -m vllm.entrypoints.openai.api_server \
       --model={W8A8 quantized weight path} \
       --served-model-name {model name} \
       --enforce-eager \
       --distributed_executor_backend "ray" \
       --tp 8 \
       --port 12345

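After the vLLM service has started successfully, you can send a request with the following command. The prompt is "What's Deep Learning?" and the output is limited to 16 tokens. {model name} is the name passed to --served-model-name when the service was started. The apostrophe in the prompt is written as '\'' so that it does not terminate the single-quoted shell string.

curl http://localhost:12345/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "{model name}",
        "max_tokens": 16,
        "prompt": "What'\''s Deep Learning?"
    }'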
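The same request can also be sent from Python, which avoids shell quoting. The sketch below uses only the standard library. The helper names `build_completion_request` and `send` are hypothetical, and "my-model" is a placeholder for whatever name was passed to --served-model-name.

```python
import json
import urllib.request


def build_completion_request(base_url, model_name, prompt, max_tokens):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model_name, "max_tokens": max_tokens, "prompt": prompt}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response (requires a running service)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# "my-model" is a placeholder; use the name given to --served-model-name.
request = build_completion_request(
    "http://localhost:12345", "my-model", "What's Deep Learning?", 16
)

# With the service running, the generated text could be read like this:
# result = send(request)
# print(result["choices"][0]["text"])
```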
