Accuracy Tuning Guide for Inference Quantization
Introduction
This section describes a systematic methodology for tuning quantization accuracy. It follows a progressive path: "confirming accuracy issues → adjusting the outlier suppression algorithm → modifying the quantization policy → refining the calibration dataset → performing quantization rollback". For each step, it covers the procedure, algorithm comparisons, and configuration examples, so that users can deploy quantized models efficiently while keeping the accuracy loss acceptable.
- Key points
  - Suppress activation outliers first, using an outlier suppression algorithm such as Iterative Smooth or Flex Smooth Quant.
  - Select a quantization method such as minmax, ssz, or autoround based on the scenario.
  - Improve accuracy by optimizing calibration datasets and rolling back sensitive layers.
- Core objective
  Deploy quantized models efficiently with an acceptable accuracy loss.
Tuning Procedure
- Confirming the accuracy issue
Before tuning, rule out environmental interference and confirm that the accuracy issue is actually caused by quantization. For details, see Table 1.
Table 1 Pre-tuning check items

| Verification Item | Procedure |
| --- | --- |
| Verifying the inference engine | Use the floating-point model for evaluation on the target inference engine and check whether the original accuracy can be reproduced. |
| Checking evaluation results | Check the output of the quantized model to verify that no non-quantization issues exist, such as context truncation and timeouts. |
| Determining the fluctuation range | Confirm the accuracy fluctuation range of the evaluation dataset and determine whether the current accuracy loss is abnormal. |
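One way to make the fluctuation-range check concrete is to evaluate the floating-point model several times and treat a quantized score as abnormal only if it falls below the observed band. The sketch below is a minimal illustration under that assumption; the function name, the two-standard-deviation threshold, and the scores are hypothetical and not part of the tool.

```python
import statistics

def is_abnormal_drop(baseline_runs, quantized_score, num_std=2.0):
    """Return True if quantized_score falls below the baseline fluctuation band."""
    mean = statistics.mean(baseline_runs)
    std = statistics.stdev(baseline_runs)  # sample standard deviation across repeated runs
    lower_bound = mean - num_std * std
    return quantized_score < lower_bound

# Hypothetical scores from five repeated evaluations of the floating-point model.
baseline_runs = [71.8, 72.4, 72.1, 71.9, 72.3]

print(is_abnormal_drop(baseline_runs, 71.9))  # within normal fluctuation
print(is_abnormal_drop(baseline_runs, 69.5))  # clearly below the band
```

If the quantized score stays inside the band, further tuning is unlikely to produce a measurable gain on that dataset.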
- Adjusting the outlier suppression algorithm (key step)
Outliers in activation values greatly expand the quantization range, so the bulk of the values is represented by only a few quantization levels, which leads to accuracy loss. Outlier suppression algorithms transfer part of this quantization difficulty from activations to weights. For details, see Table 2.
Table 2 Comparison of outlier suppression algorithms

| Algorithm | Feature | Applicable Scenario and Suggestion | Configuration Example Link |
| --- | --- | --- | --- |
| Smooth Quant | Smoothing is performed only on norm-linear subgraphs. Symmetric/Asymmetric processing is supported. | Not recommended. Popular models, such as Qwen and DeepSeek, have poor accuracy when using this algorithm. | - |
| Iterative Smooth | Resolves the issue of layers such as o_proj and down_proj failing to transfer the scale due to the absence of adjacent LayerNorm. Symmetric/Asymmetric processing is supported. | Recommended. This algorithm runs fast and delivers high accuracy. It is preferred for ultra-long sequence calibration sets. It can be optimized by adjusting alpha parameters. | |
| Flex Smooth Quant | Automatically searches for the optimal alpha and beta parameters through two-stage grid search to implement more refined balancing. | Recommended when Iterative Smooth does not meet the requirements, quantization time constraints are relaxed, and the GPU/NPU memory is sufficient. This algorithm runs slowly. | |
| QuaRot | By applying rotation transformations to the weights and activation values, outliers are dispersed across multiple channels for smooth distribution. | It can be used together with other algorithms as an alternative solution for higher accuracy. | |
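The rotation idea behind QuaRot can be illustrated with a small orthonormal Hadamard matrix. The sketch below is a simplified demonstration, not the tool's implementation: rotating both the activation vector and the weight vector by the same orthonormal matrix leaves their dot product unchanged, while the single outlier is spread over all channels. All values are hypothetical.

```python
def hadamard(n):
    """Build an n x n Hadamard matrix (n a power of two), scaled to be orthonormal."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = n ** -0.5
    return [[v * scale for v in row] for row in h]

def matvec(m, x):
    """Multiply matrix m by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

x = [0.5, -0.3, 40.0, 0.2]   # activation vector with one outlier channel
w = [0.1, 0.4, -0.2, 0.3]    # one weight row (hypothetical)
H = hadamard(4)

# The matrix built here is symmetric and orthonormal, so rotating both
# operands preserves their dot product.
x_rot = matvec(H, x)
w_rot = matvec(H, w)

print("max |x| before/after:", max(abs(v) for v in x), max(abs(v) for v in x_rot))
print("dot before/after:", dot(x, w), dot(x_rot, w_rot))
```

The rotated activations have a smaller dynamic range, so a fixed number of quantization levels covers them more evenly.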
Conclusions and Suggestions
- Iterative Smooth: It is preferred in most scenarios because it runs fast and achieves high accuracy.
- Symmetry selection: The asymmetric outlier suppression algorithm outperforms the symmetric solution in most cases. However, ensure that the inference engine has been adapted to the asymmetric outlier suppression algorithm beforehand.
- Parameter tuning: If the quantized model delivers lower-than-expected accuracy with the Iterative Smooth algorithm, you can tune the alpha parameter.
- Advanced solution: If the accuracy still cannot meet the requirements, you can try to enable the Flex Smooth Quant algorithm or use the QuaRot algorithm for collaborative tuning.
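To see what tuning alpha does, the sketch below uses the per-channel scale from the original SmoothQuant formulation, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). This is an assumption made for illustration: the exact formula used by Iterative Smooth or Flex Smooth Quant in the tool may differ, and the calibration statistics are hypothetical.

```python
def smooth_scales(act_absmax, weight_absmax, alpha=0.5):
    """Per-input-channel scales s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return [a ** alpha / w ** (1.0 - alpha) for a, w in zip(act_absmax, weight_absmax)]

# Hypothetical per-channel absolute maxima collected on a calibration set.
act_absmax = [1.0, 64.0, 2.0]      # channel 1 holds an activation outlier
weight_absmax = [0.5, 0.25, 0.5]

for alpha in (0.3, 0.5, 0.7):
    s = smooth_scales(act_absmax, weight_absmax, alpha)
    smoothed_act = [a / sj for a, sj in zip(act_absmax, s)]   # X' = X / s
    smoothed_w = [w * sj for w, sj in zip(weight_absmax, s)]  # W' = W * s
    print(alpha, round(max(smoothed_act), 3), round(max(smoothed_w), 3))
```

A larger alpha shrinks the activation range further at the cost of a wider weight range, which is why the best value depends on the model and is worth searching.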
- Quantization algorithm selection
Select a suitable algorithm based on the quantization object (weight or activation) and the number of bits. Table 3 compares weight quantization methods, and Table 4 compares activation quantization methods.
Table 3 Comparison of weight quantization methods

| Quantization Method | Feature | Quantization Accuracy | Quantization Speed | Applicable Scenario and Suggestion |
| --- | --- | --- | --- | --- |
| minmax | The minimum and maximum values of the weight tensor are calculated to determine the quantization range. This method is simple and computationally efficient. | Low | Fast | Recommended for INT8 quantization. This method is simple and fast, and usually can achieve good accuracy. |
| ssz | The optimal quantization parameters are iteratively searched to minimize the quantization error. | Medium | Medium | Recommended for low-bit quantization scenarios such as INT4. Compared with minmax, this method achieves higher quantization accuracy via a more refined search, but the quantization speed is slower. |
| autoround | Learnable rounding offset parameters, together with the SignSGD optimizer, adaptively adjust the rounding direction of each weight to obtain the optimal rounding compensation through training. | High | Slow | If the ssz method does not meet the accuracy requirement, you can use autoround to improve accuracy, especially in ultra-low bit scenarios. |
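As a rough illustration of why a searched scale can outperform minmax at low bit widths, the sketch below grid-searches a shrink factor on the minmax scale and keeps the candidate with the lowest reconstruction error. This is a generic scale search written for this guide, not the ssz or autoround implementation; the weights and the search range are hypothetical.

```python
def quantize_symmetric(values, scale, bits):
    """Symmetric fake quantization: round to the integer grid, clamp, and rescale."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def minmax_scale(values, bits):
    """Scale that maps the largest magnitude onto the largest integer level."""
    return max(abs(v) for v in values) / (2 ** (bits - 1) - 1)

def searched_scale(values, bits, steps=50):
    """Grid-search a shrink factor in [0.5, 1.0] that minimizes reconstruction MSE."""
    base = minmax_scale(values, bits)
    candidates = [base * (1.0 - 0.5 * i / steps) for i in range(steps + 1)]
    return min(candidates, key=lambda s: mse(values, quantize_symmetric(values, s, bits)))

# Hypothetical weights: a dense cluster plus one large value.
weights = [0.02 * i for i in range(-20, 21)] + [1.5]
bits = 4

err_minmax = mse(weights, quantize_symmetric(weights, minmax_scale(weights, bits), bits))
err_search = mse(weights, quantize_symmetric(weights, searched_scale(weights, bits), bits))
print("minmax MSE:", err_minmax)
print("searched MSE:", err_search)
```

Because the unmodified minmax scale is one of the candidates, the searched result is never worse than minmax on the data it was searched on; the price is the extra search time.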
Configuration example (YAML)
In the quantization configuration file, weight quantization is usually configured in the qconfig.weight part of the linear_quant processor.

    - type: "linear_quant"
      qconfig:
        weight:
          scope: "per_channel"   # Quantization granularity: per_channel
          dtype: "int8"          # Quantization data type: int8 or int4
          symmetric: true        # Whether to use symmetric quantization: weight quantization usually uses symmetric quantization
          method: "minmax"       # Quantization method: minmax, ssz, or autoround

Conclusions and Suggestions
- INT8 weight quantization: The minmax method is preferred. It is the fastest method and its accuracy is usually sufficient at this bit width.
- Low-bit weight quantization (INT4): The ssz method is preferred. If the accuracy is insufficient, try the autoround method.
- Quantization granularity (scope): per_channel is recommended for weight quantization, which is finer-grained than per_tensor and can achieve higher quantization accuracy.
- Symmetric: Weight quantization is usually set to true (symmetric quantization), which is simpler and more computationally efficient.
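The granularity recommendation can be checked on a toy weight matrix. The sketch below is a minimal illustration with hypothetical values: it applies symmetric minmax INT8 quantization once with a single scale for the whole matrix (per_tensor) and once with one scale per output channel (per_channel), then compares the squared reconstruction error.

```python
def absmax(values):
    return max(abs(v) for v in values)

def quantize_symmetric(row, scale, bits=8):
    """Symmetric fake quantization of one row with the given scale."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in row]

def total_sq_error(ref, approx):
    return sum((a - b) ** 2 for r, q in zip(ref, approx) for a, b in zip(r, q))

# Hypothetical weight matrix: one row per output channel.
weight = [
    [0.01, -0.02, 0.015, 0.005],   # small-magnitude channel
    [1.2, -0.8, 0.5, -1.0],        # large-magnitude channel
]

# per_tensor: one scale derived from the whole matrix.
tensor_scale = max(absmax(row) for row in weight) / 127
per_tensor = [quantize_symmetric(row, tensor_scale) for row in weight]

# per_channel: one scale per output channel (row).
per_channel = [quantize_symmetric(row, absmax(row) / 127) for row in weight]

err_tensor = total_sq_error(weight, per_tensor)
err_channel = total_sq_error(weight, per_channel)
print("per_tensor error:", err_tensor)
print("per_channel error:", err_channel)
```

With a shared scale, the small-magnitude channel is squeezed into one or two integer levels; a per-channel scale gives it the full integer range.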
Table 4 Comparison of activation quantization methods

| Quantization Method | Feature | Quantization Accuracy | Quantization Speed | Applicable Scenario and Suggestion |
| --- | --- | --- | --- | --- |
| minmax | The minimum and maximum values of the activation tensor are collected to determine the quantization range. This method is simple and computationally efficient. | Low | Fast | Preferentially recommended. This method is simple and fast, and is applicable to most scenarios. |
| histogram | Quantization accuracy and model performance are improved by analyzing the histogram distribution of activation values, automatically searching for the optimal truncation interval, and filtering outliers. | High | Slow | If the minmax method falls below the accuracy requirement, you can use histogram to improve accuracy. However, the speed is slower. |
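The difference between the two activation methods can be sketched as follows. This is a simplified stand-in for a histogram method, assuming a mean-squared-error criterion, symmetric quantization, and 4-bit levels so that the effect is visible on a small sample; the criterion used by the actual tool is not confirmed here, and the data is synthetic.

```python
import random

def build_histogram(values, num_bins):
    """Histogram of absolute values over [0, max|x|]."""
    top = max(abs(v) for v in values)
    width = top / num_bins
    counts = [0] * num_bins
    for v in values:
        counts[min(int(abs(v) / width), num_bins - 1)] += 1
    return counts, width

def quant_error(x, clip, levels):
    """Squared error when magnitude x is quantized with clipping threshold clip."""
    scale = clip / levels
    q = min(round(x / scale), levels) * scale
    return (x - q) ** 2

def search_clip(counts, width, levels):
    """Pick the bin edge that minimizes the histogram-weighted quantization error."""
    best_clip, best_err = None, float("inf")
    for edge in range(1, len(counts) + 1):
        clip = edge * width
        err = sum(c * quant_error((i + 0.5) * width, clip, levels)
                  for i, c in enumerate(counts) if c)
        if err < best_err:
            best_clip, best_err = clip, err
    return best_clip

def actual_error(values, clip, levels):
    """Total squared error on the raw samples for a given clipping threshold."""
    return sum(quant_error(abs(v), clip, levels) for v in values)

levels = 7  # largest positive level of a signed 4-bit integer
rng = random.Random(0)
acts = [rng.gauss(0.0, 1.0) for _ in range(2000)] + [30.0, -28.0]  # two outliers

counts, width = build_histogram(acts, num_bins=256)
minmax_clip = max(abs(v) for v in acts)          # minmax keeps the full range
hist_clip = search_clip(counts, width, levels)   # histogram search truncates it

print("clip:", minmax_clip, hist_clip)
print("error:", actual_error(acts, minmax_clip, levels), actual_error(acts, hist_clip, levels))
```

The search accepts a large error on the two outliers in exchange for finer resolution on the remaining samples, which is also why it needs a full pass over the histogram and runs slower than minmax.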
Activation value quantization granularity
Activation value quantization supports multiple granularities, which directly affect accuracy and performance. For details, see Table 5.
Table 5 Comparison of activation value quantization granularities

| Granularity Type | Feature | Application Scenario |
| --- | --- | --- |
| per_tensor | It is a static quantization method in which the entire tensor shares the same group of quantization parameters. The computing is simple, the performance is the best, and it is supported by almost all hardware. However, when the data distribution in a tensor varies greatly, the quantization error becomes significant. | Used when optimal performance is required. |
| per_token | Each token uses an independent group of quantization parameters. It is a dynamic quantization method. The finer granularity yields higher quantization accuracy, but the computing is more complex and the performance is lower. | Used when higher accuracy is required. |
| pd_mix | It is a hybrid quantization policy that uses per_token in the prefill phase and per_tensor in the decoding phase. It aims to balance accuracy and performance. | Used when both accuracy and performance need to be balanced. |
Configuration example (YAML)
In the quantization configuration file, activation value quantization is usually configured in the qconfig.act part of the linear_quant processor.

    - type: "linear_quant"
      qconfig:
        act:
          scope: "per_tensor"    # Quantization granularity: per_tensor, per_token, or pd_mix
          dtype: "int8"          # Quantization data type: int8 or int4
          symmetric: false       # Whether to use symmetric quantization: activation value quantization usually uses asymmetric quantization
          method: "minmax"       # Quantization method: minmax
        weight:
          scope: "per_channel"
          dtype: "int8"
          symmetric: true
          method: "minmax"

Conclusions and Suggestions
- Quantization method: The minmax method is preferred. If the accuracy is insufficient, the histogram method can be used.
- Quantization granularity (scope):
  - If performance is required, use per_tensor (static quantization).
  - If accuracy is required, use per_token (dynamic quantization).
  - If both accuracy and performance need to be balanced, use pd_mix (hybrid policy).
- Symmetric: Activation value quantization is usually set to false (asymmetric quantization) to better fit data distributions that are not centered at zero.
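The accuracy gap between per_tensor and per_token can be illustrated with asymmetric minmax INT8 quantization on a few hypothetical token rows. This is a minimal sketch: in real static quantization the per_tensor parameters come from calibration data rather than from the tensor being quantized, and the helper names are made up for this example.

```python
def asym_params(lo, hi, bits=8):
    """Scale and zero point mapping [lo, hi] onto the unsigned integer range."""
    qmax = 2 ** bits - 1
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # keep zero exactly representable
    scale = (hi - lo) / qmax or 1.0       # fall back to 1.0 for an all-zero range
    zero_point = round(-lo / scale)
    return scale, zero_point

def fake_quant(values, scale, zero_point, bits=8):
    """Quantize to unsigned integers with a zero point, then dequantize."""
    qmax = 2 ** bits - 1
    out = []
    for v in values:
        q = max(0, min(qmax, round(v / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out

def sq_error(ref, approx):
    return sum((a - b) ** 2 for a, b in zip(ref, approx))

# Hypothetical activations, one row per token; token 1 has a much wider range.
tokens = [
    [0.10, 0.25, 0.05, 0.30],
    [12.0, -3.0, 8.0, 20.0],
    [0.40, 0.20, 0.35, 0.15],
]

# per_tensor: a single (scale, zero_point) pair shared by all tokens.
flat = [v for row in tokens for v in row]
t_scale, t_zp = asym_params(min(flat), max(flat))
per_tensor_err = sum(sq_error(row, fake_quant(row, t_scale, t_zp)) for row in tokens)

# per_token: parameters are recomputed for every token at run time.
per_token_err = 0.0
for row in tokens:
    scale, zp = asym_params(min(row), max(row))
    per_token_err += sq_error(row, fake_quant(row, scale, zp))

print("per_tensor error:", per_tensor_err)
print("per_token error:", per_token_err)
```

The per_token variant is more accurate on the narrow-range tokens, but it has to compute a minimum and maximum for every token during inference, which is the performance cost noted in Table 5.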
- Calibration set adjustment
If algorithm adjustments have limited effect, optimize the calibration data to improve quantized model accuracy. The quality of the calibration set directly affects the accuracy of quantization parameters. For details, see Table 6.
Table 6 Calibration set adjustment policy

| Adjustment Policy | Procedure | Objective |
| --- | --- | --- |
| Increasing data volume | Increase data volume (10 to 50 samples recommended). | Improve the accuracy of quantization parameters. |
| Matching application scenarios | Use data that matches the application scenario of the model (for example, use Chinese data for a Chinese model and code data for a code model). | Make the calibration data closer to the actual application scenario. |
| Balancing data distribution | Extract samples from multiple datasets to balance data distribution. | Improve the diversity and balance of data distribution. |
| Deleting abnormal data | Delete the calibration data that causes significant accuracy drop. | Reduce the interference of abnormal samples on quantization parameters. |
| Adding bad-case data | Add the bad case data of the model on the dataset to better reflect the actual input distribution of the model. | Help the quantized model learn hard samples to improve accuracy. |
Conclusions and Suggestions
- Data volume: 10 to 50 samples are recommended. Too few samples may lead to inaccurate quantization parameter computation, while too many may increase quantization time.
- Scenario matching: Preferentially use the data that matches the model application scenario to ensure that the calibration dataset can represent the actual application scenario.
- Data quality: Delete abnormal data in a timely manner to avoid negative impact on quantization parameters.
- Hard samples: Add proper bad case data to improve the model's quantization accuracy on hard samples.
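The adjustment policies above can be combined in a small helper that assembles a calibration set. The sketch below is illustrative only: the function, dataset names, and prompts are hypothetical, and how the resulting list is passed to the quantization tool is outside its scope.

```python
import random

def build_calibration_set(sources, total, bad_cases=(), exclude=(), seed=0):
    """Draw a balanced calibration set from several named datasets.

    sources:   mapping of dataset name -> list of prompt strings
    total:     target number of samples (the guide suggests 10 to 50)
    bad_cases: prompts the quantized model handled poorly; always included
    exclude:   prompts known to hurt accuracy; never included
    """
    rng = random.Random(seed)
    excluded = set(exclude)
    # Start from the bad cases, dropping duplicates while keeping their order.
    selected = [p for p in dict.fromkeys(bad_cases) if p not in excluded]
    if not sources:
        return selected[:total]
    remaining = max(total - len(selected), 0)
    quota = -(-remaining // len(sources))  # ceiling division: equal share per dataset
    for name in sorted(sources):
        pool = [p for p in sources[name] if p not in excluded and p not in selected]
        selected.extend(rng.sample(pool, min(quota, len(pool))))
    return selected[:total]

# Hypothetical datasets matching the model's application scenarios.
sources = {
    "chat_zh": [f"zh prompt {i}" for i in range(30)],
    "code": [f"code prompt {i}" for i in range(30)],
    "math": [f"math prompt {i}" for i in range(30)],
}
calib = build_calibration_set(
    sources,
    total=20,
    bad_cases=["bad case A", "bad case B"],
    exclude=["code prompt 3"],
)
print(len(calib))
```

Fixing the random seed keeps the calibration set reproducible, which makes it easier to attribute an accuracy change to a specific sample that was added or removed.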
- Quantization rollback (last resort)
If the expected accuracy is still not achieved after algorithm and calibration set adjustments, you can roll back the most sensitive layers in the model to high precision (FP16/BF16) to mitigate the accuracy drop caused by quantization.
- Application scenarios
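- The accuracy still cannot meet the requirements after the preceding four steps (accuracy confirmation, outlier suppression adjustment, quantization algorithm selection, and calibration set adjustment) are performed.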
- A more refined balance between accuracy and performance is required.
- Some layers are extremely sensitive to quantization and need to remain high precision.
- Procedure
- Analyzing sensitive layers
Use the sensitive layer analysis tool provided by msModelSlim to identify quantization-sensitive layers. For details, see Quantization-Sensitive Layer Analysis Tool User Guide.
Function description
- Automatic evaluation: The tool automatically evaluates quantization sensitivity of the model's linear layers and generates a sensitivity score for each layer.
- Decision-making basis: Users can determine which highly sensitive layers need to be rolled back based on the generated sensitivity scores.
- Configuring rollback
Based on the results of the sensitive layer analysis, use the exclude field in the YAML configuration file of the quantization policy to exclude the highly sensitive layers that need to be rolled back.
- Example
    - type: "linear_quant"
      qconfig:
        act:
          scope: "per_tensor"
          dtype: "int8"
          symmetric: false
          method: "minmax"
        weight:
          scope: "per_channel"
          dtype: "int8"
          symmetric: true
          method: "minmax"
      include: ["*"]
      exclude: ["*model.layers.*.mlp.down_proj*"]  # Roll back all mlp.down_proj layers

- Conclusions and Suggestions
- Rollback priority: According to experience, the mlp.down_proj layer is usually one of the most quantization-sensitive layers. It is advised to roll back this layer first.
- Tradeoff: The rollback partially reduces the performance improvement and memory saving benefits brought by quantization. You need to determine the number of layers to be rolled back and the rollback range based on the specific service objectives.
- Rollback policy: You are advised to use the top-down policy to gradually roll back the layers with the highest sensitivity to achieve the optimal balance between model accuracy and computing performance.
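The top-down rollback policy can be sketched as a loop that rolls back layers in descending order of sensitivity and stops once the accuracy target is reached. Everything below is hypothetical: the scores, the stand-in evaluation function, and the assumption that exclude patterns behave like shell-style globs (as in Python's fnmatch) are for illustration and are not confirmed behavior of the tool.

```python
from fnmatch import fnmatch

def rollback_top_down(scores, evaluate, target, max_layers=None):
    """Roll back the most sensitive layers one at a time until accuracy reaches target.

    scores:   mapping of layer name -> sensitivity score (higher = more sensitive)
    evaluate: callable taking the list of rolled-back layer names, returning accuracy
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    limit = len(ranked) if max_layers is None else max_layers
    rolled_back = []
    accuracy = evaluate(rolled_back)
    for name in ranked[:limit]:
        if accuracy >= target:
            break
        rolled_back.append(name)
        accuracy = evaluate(rolled_back)
    return rolled_back, accuracy

# Hypothetical sensitivity scores, as a layer analysis might report them.
scores = {
    "model.layers.0.mlp.down_proj": 0.92,
    "model.layers.0.self_attn.o_proj": 0.35,
    "model.layers.1.mlp.down_proj": 0.88,
    "model.layers.1.mlp.up_proj": 0.10,
}

def fake_evaluate(rolled_back):
    # Stand-in for a real evaluation run: each rolled-back layer recovers
    # accuracy in proportion to its sensitivity score.
    return 70.0 + sum(2.0 * scores[name] for name in rolled_back)

layers, acc = rollback_top_down(scores, fake_evaluate, target=73.0)

# Turn the selected layers into exclude patterns for the YAML configuration.
exclude = [f"*{name}*" for name in layers]
print(exclude, acc)

# Assuming glob-style matching, a single wildcard pattern covers both layers.
pattern = "*model.layers.*.mlp.down_proj*"
print(all(fnmatch(name, pattern) for name in layers))
```

In practice each call to the evaluation function is a full quantization and evaluation run, so a max_layers cap helps bound the total tuning time.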