Accuracy Tuning Guide for Inference Quantization

Introduction

This section describes a systematic method for tuning quantization accuracy. It follows a progressive path: confirm the accuracy issue → adjust the outlier suppression algorithm → select the quantization algorithm → adjust the calibration set → roll back quantization. For each step, it covers the procedure, algorithm comparisons, and configuration examples, so that users can deploy quantized models efficiently within an acceptable accuracy loss.

  • Key points
    • Prefer outlier suppression algorithms such as Iterative Smooth and Flex Smooth Quant to suppress activation outliers.
    • Select a quantization method such as minmax, ssz, or autoround based on the scenario.
    • Improve accuracy by optimizing calibration datasets and rolling back sensitive layers.
  • Core objective

    Achieve efficient deployment of model quantization with acceptable accuracy loss.

Preparation

Install msModelSlim. For details, see the msModelSlim Installation Guide.

Tuning Procedure

  1. Confirming the accuracy issue

    Before starting the tuning, confirm that the issue persists after eliminating environmental interference. For details, see Table 1.

    Table 1 Pre-tuning check items

    | Verification Item | Procedure |
    | --- | --- |
    | Verifying the inference engine | Use the floating-point model for evaluation on the target inference engine and check whether the original accuracy can be reproduced. |
    | Checking evaluation results | Check the output of the quantized model to verify that no non-quantization issues exist, such as context truncation and timeouts. |
    | Determining the fluctuation range | Confirm the accuracy fluctuation range of the evaluation dataset and determine whether the current accuracy loss is abnormal. |
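
    The fluctuation-range check can be sketched as follows. This is a minimal illustration, not part of msModelSlim: the function name, the two-sigma threshold, and the scores are hypothetical, and it assumes that the floating-point model has been evaluated several times on the same dataset.

```python
from statistics import mean, stdev

def is_abnormal_drop(baseline_scores, quantized_score, num_sigmas=2.0):
    """Return True if the quantized score falls below the baseline's noise band."""
    center = mean(baseline_scores)
    spread = stdev(baseline_scores)  # sample standard deviation across repeated runs
    lower_bound = center - num_sigmas * spread
    return quantized_score < lower_bound

# Hypothetical accuracy (%) of the floating-point model over five repeated evaluations.
fp_runs = [71.8, 72.4, 72.1, 71.9, 72.3]

print(is_abnormal_drop(fp_runs, 71.7))  # within the fluctuation range
print(is_abnormal_drop(fp_runs, 69.5))  # clearly below the fluctuation range
```

    A score inside the band suggests ordinary evaluation noise rather than a quantization problem, in which case further tuning may not be needed.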

  2. Adjusting the outlier suppression algorithm (key step)

    Outliers in activation values greatly expand the quantization range, so most quantization levels are spent on a few extreme values and the remaining values lose resolution, leading to accuracy loss. You can use an outlier suppression algorithm to transfer part of the quantization difficulty from activations to weights. For details, see Table 2.

    Table 2 Comparison of outlier suppression algorithms

    | Algorithm | Feature | Applicable Scenario and Suggestion | Configuration Example Link |
    | --- | --- | --- | --- |
    | Smooth Quant | Smoothing is performed only on norm-linear subgraphs. Symmetric/Asymmetric processing is supported. | Not recommended. Popular models, such as Qwen and DeepSeek, have poor accuracy when using this algorithm. | - |
    | Iterative Smooth | Resolves the issue of layers such as o_proj and down_proj failing to transfer the scale due to the absence of adjacent LayerNorm. Symmetric/Asymmetric processing is supported. | Recommended. This algorithm runs fast and delivers high accuracy. It is preferred for ultra-long sequence calibration sets. It can be optimized by adjusting alpha parameters. | Iterative_Smooth.md |
    | Flex Smooth Quant | Automatically searches for the optimal alpha and beta parameters through two-stage grid search to implement more refined balancing. | Recommended when Iterative Smooth does not meet the requirements, quantization time constraints are relaxed, and the GPU/NPU memory is sufficient. This algorithm runs slowly. | Flex_Smooth_Quant.md |
    | QuaRot | By applying rotation transformations to the weights and activation values, outliers are dispersed across multiple channels for smooth distribution. | It can be used together with other algorithms as an alternative solution for higher accuracy. | QuaRot.md |
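
    The rotation idea behind QuaRot can be illustrated with a toy example. The sketch below assumes a normalized Hadamard matrix as the rotation, which is one common choice; it is not taken from the msModelSlim implementation, and all helper names and values are hypothetical. It shows that rotating the activations and counter-rotating the weights leaves the layer output unchanged while spreading a single-channel outlier across all channels.

```python
import math

def hadamard(n):
    """Normalized Sylvester-Hadamard matrix (orthogonal); n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    norm = math.sqrt(n)
    return [[v / norm for v in row] for row in h]

def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Toy activation (1 token, 4 channels) with an outlier in channel 1.
act = [[0.1, 8.0, -0.2, 0.3]]
# Toy weight, laid out as [in_features][out_features].
weight = [[0.2, -0.1], [0.05, 0.04], [-0.3, 0.25], [0.1, 0.0]]

rot = hadamard(4)
act_rot = matmul(act, rot)                   # X' = X R
weight_rot = matmul(transpose(rot), weight)  # W' = R^T W, so X' W' = X W

# Largest activation magnitude before and after the rotation.
print(max(abs(v) for v in act[0]), max(abs(v) for v in act_rot[0]))
```

    Because the rotated activations have a smaller peak magnitude, the same number of quantization levels covers a narrower range.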

    Conclusions and Suggestions

    1. Iterative Smooth: It is preferred in most scenarios because it runs fast and achieves high accuracy.
    2. Symmetry selection: The asymmetric outlier suppression algorithm outperforms the symmetric solution in most cases. However, ensure that the inference engine has been adapted to the asymmetric outlier suppression algorithm beforehand.
    3. Parameter tuning: If the quantized model delivers lower-than-expected accuracy with the Iterative Smooth algorithm, you can tune the alpha parameter.
    4. Advanced solution: If the accuracy still does not meet the requirements, try Flex Smooth Quant, or use QuaRot in combination with another algorithm.
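
    To make the role of alpha concrete, the sketch below uses the generic smoothing formula from the Smooth Quant family, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), applied per input channel. It is an illustration under that assumption only; the actual Iterative Smooth and Flex Smooth Quant implementations in msModelSlim may differ, and the helper names and values are hypothetical.

```python
def smooth_scales(act, weight, alpha=0.5, eps=1e-8):
    """Per-input-channel scale s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    scales = []
    for j, w_row in enumerate(weight):
        act_max = max(max(abs(row[j]) for row in act), eps)
        w_max = max(max(abs(v) for v in w_row), eps)
        scales.append(act_max ** alpha / w_max ** (1.0 - alpha))
    return scales

def apply_smoothing(act, weight, scales):
    """Divide activation channels by s and multiply the matching weight rows by s."""
    act_s = [[x / s for x, s in zip(row, scales)] for row in act]
    weight_s = [[w * s for w in row] for row, s in zip(weight, scales)]
    return act_s, weight_s

def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Two tokens, three input channels; channel 1 carries the outliers.
act = [[0.5, 40.0, -0.3], [-0.4, -35.0, 0.2]]
# Weight laid out as [in_features][out_features].
weight = [[0.2, -0.1], [0.05, 0.04], [-0.3, 0.25]]

scales = smooth_scales(act, weight, alpha=0.5)
act_s, weight_s = apply_smoothing(act, weight, scales)

# Largest activation magnitude before and after smoothing.
print(max(abs(v) for row in act for v in row),
      max(abs(v) for row in act_s for v in row))
```

    In this formulation, a larger alpha moves more of the quantization difficulty from activations to weights, which is why alpha is the natural parameter to tune when accuracy is lower than expected.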
  3. Quantization algorithm selection

    Select a suitable algorithm based on the quantization object (weight or activation) and the number of bits. Table 3 compares the weight quantization methods, and Table 4 compares the activation quantization methods.

    Table 3 Comparison of weight quantization methods

    | Quantization Method | Feature | Quantization Accuracy | Quantization Speed | Applicable Scenario and Suggestion |
    | --- | --- | --- | --- | --- |
    | minmax | The minimum and maximum values of the weight tensor are calculated to determine the quantization range. This method is simple and computationally efficient. | Low | Fast | Recommended for INT8 quantization. This method is simple and fast, and usually can achieve good accuracy. |
    | ssz | The optimal quantization parameters are iteratively searched to minimize the quantization error. | Medium | Medium | Recommended for low-bit quantization scenarios such as INT4. Compared with minmax, this method achieves higher quantization accuracy via a more refined search, but the quantization speed is slower. |
    | autoround | Learnable rounding offset parameters, together with the SignSGD optimizer, adaptively adjust the rounding direction of each weight to obtain the optimal rounding compensation through training. | High | Slow | If the ssz method does not meet the accuracy requirement, you can use autoround to improve accuracy, especially in ultra-low bit scenarios. |

    Configuration example (YAML)

    In the quantization configuration file, weight quantization is usually configured in the qconfig.weight part of the linear_quant processor.
    - type: "linear_quant"
      qconfig:
        weight:
          scope: "per_channel" # Quantization granularity: per_channel
          dtype: "int8" # Quantization data type: int8 or int4
          symmetric: true # Whether to use symmetric quantization: Weight quantization usually uses symmetric quantization.
          method: "minmax" # Quantization method: minmax, ssz, or autoround

    Conclusions and Suggestions

    1. INT8 weight quantization: The minmax method is preferred. It is the fastest method and usually preserves accuracy.
    2. Low-bit weight quantization (INT4): The ssz method is preferred. If the accuracy is insufficient, try the autoround method.
    3. Quantization granularity (scope): per_channel is recommended for weight quantization. It is finer-grained than per_tensor and can achieve higher quantization accuracy.
    4. Symmetric: Weight quantization is usually set to true (symmetric quantization), which is simpler and more computationally efficient.
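
    The effect of the scope setting can be seen in a small experiment. The sketch below is a generic symmetric minmax quantizer written for illustration, not msModelSlim code; the helper names and the toy weight are hypothetical. It compares the reconstruction error when one scale is shared by the whole tensor with the error when each output channel has its own scale.

```python
def fake_quant_symmetric(values, num_bits=8):
    """Quantize then dequantize with symmetric minmax (zero point fixed at 0)."""
    qmax = 2 ** (num_bits - 1) - 1
    abs_max = max(abs(v) for v in values)
    scale = abs_max / qmax if abs_max > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mse(a, b):
    """Mean squared error between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy weight: each row is one output channel; the two rows differ greatly in magnitude.
weight = [[0.02, -0.01, 0.015], [1.2, -0.9, 0.6]]
flat = [v for row in weight for v in row]

per_tensor = fake_quant_symmetric(flat)                                  # one shared scale
per_channel = [v for row in weight for v in fake_quant_symmetric(row)]  # one scale per row

print(mse(flat, per_tensor), mse(flat, per_channel))
```

    With a shared scale, the small-magnitude channel is quantized with a step sized for the large-magnitude channel, which is where most of the per_tensor error comes from.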
    Table 4 Comparison of activation quantization methods

    | Quantization Method | Feature | Quantization Accuracy | Quantization Speed | Applicable Scenario and Suggestion |
    | --- | --- | --- | --- | --- |
    | minmax | The minimum and maximum values of the activation tensor are collected to determine the quantization range. This method is simple and computationally efficient. | Low | Fast | Recommended as the first choice. This method is simple and fast, and is applicable to most scenarios. |
    | histogram | Quantization accuracy and model performance are improved by analyzing the histogram distribution of activation values, automatically searching for the optimal truncation interval, and filtering outliers. | High | Slow | If the minmax method falls below the accuracy requirement, you can use histogram to improve accuracy. However, the speed is slower. |
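
    The sketch below illustrates the general idea of a histogram-based truncation search: bin the absolute activation values, then pick the clipping threshold that minimizes the quantization error estimated from the bin centers. It is a simplified illustration under stated assumptions, not the histogram method as implemented in msModelSlim. The function names and data are hypothetical, and 4-bit quantization is used only to make the effect visible on a tiny dataset.

```python
def fake_quant(v, threshold, num_bits):
    """Clip to [-threshold, threshold], then symmetric quantize and dequantize."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = threshold / qmax
    return max(-qmax, min(qmax, round(v / scale))) * scale

def histogram_clip_threshold(values, num_bins=256, num_bits=8):
    """Search bin edges for the clipping threshold with the lowest estimated error."""
    abs_max = max(abs(v) for v in values)
    if abs_max == 0:
        raise ValueError("all values are zero")
    bin_width = abs_max / num_bins
    counts = [0] * num_bins
    for v in values:
        counts[min(int(abs(v) / bin_width), num_bins - 1)] += 1
    centers = [(i + 0.5) * bin_width for i in range(num_bins)]

    best_threshold, best_error = abs_max, float("inf")
    for k in range(1, num_bins + 1):
        threshold = k * bin_width
        error = sum(count * (center - fake_quant(center, threshold, num_bits)) ** 2
                    for center, count in zip(centers, counts) if count)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold

def quant_mse(values, threshold, num_bits):
    """Actual mean squared error of quantizing the values with a given threshold."""
    return sum((v - fake_quant(v, threshold, num_bits)) ** 2 for v in values) / len(values)

# Hypothetical activations: values spread over [-1, 1] plus a single outlier at 6.0.
values = [i / 500 - 1 for i in range(1001)] + [6.0]

minmax_thr = max(abs(v) for v in values)
hist_thr = histogram_clip_threshold(values, num_bits=4)
print(minmax_thr, hist_thr)
print(quant_mse(values, minmax_thr, 4), quant_mse(values, hist_thr, 4))
```

    The search accepts a large error on the single outlier in exchange for a finer step on the bulk of the values, which is the trade-off the histogram method automates. The additional search over thresholds is also why it is slower than minmax.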

    Activation value quantization granularity

    Activation value quantization supports multiple granularities, which directly affect accuracy and performance. For details, see Table 5.

    Table 5 Comparison of activation value quantization granularities

    | Granularity Type | Feature | Application Scenario |
    | --- | --- | --- |
    | per_tensor | A static quantization method in which the entire tensor shares the same group of quantization parameters. The computation is simple, the performance is the best, and it is supported by almost all hardware. However, when the data distribution within a tensor varies greatly, the quantization error becomes significant. | Used when optimal performance is required. |
    | per_token | A dynamic quantization method in which each token uses an independent group of quantization parameters. The finer granularity yields higher quantization accuracy, but the computation is more complex and the performance is lower. | Used when higher accuracy is required. |
    | pd_mix | A hybrid quantization policy that uses per_token in the prefill phase and per_tensor in the decoding phase. It aims to balance accuracy and performance. | Used when both accuracy and performance need to be balanced. |
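
    The difference between the granularities comes down to how many scales are computed and when. The following sketch is a generic illustration with hypothetical helper names and toy data, not msModelSlim code. It also assumes that the first phase of pd_mix is the prefill phase of LLM inference.

```python
def per_tensor_scale(act, qmax=127):
    """One scale shared by every token (fixed offline once calibrated)."""
    return max(abs(v) for row in act for v in row) / qmax

def per_token_scales(act, qmax=127):
    """One scale per token row, computed from that token's own values at runtime."""
    return [max(abs(v) for v in row) / qmax for row in act]

def scope_for_phase(phase):
    """pd_mix sketch: per_token during prefill, per_tensor during decoding."""
    return "per_token" if phase == "prefill" else "per_tensor"

# Three tokens, three channels; the second token has much larger values.
act = [[0.1, -0.2, 0.05], [6.0, -3.0, 1.5], [0.3, 0.1, -0.4]]

print(per_tensor_scale(act))   # a single step size, dominated by the second token
print(per_token_scales(act))   # small-valued tokens get a much finer step size
print(scope_for_phase("prefill"), scope_for_phase("decode"))
```

    With per_tensor, the first and third tokens are quantized with a step size set by the second token, while per_token gives each of them a step matched to its own range at the cost of computing scales during inference.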

    Configuration example (YAML)

    In the quantization configuration file, activation value quantization is usually configured in the qconfig.act part of the linear_quant processor.
    - type: "linear_quant"
      qconfig:
        act:
          scope: "per_tensor" # Quantization granularity: per_tensor, per_token, or pd_mix
          dtype: "int8" # Quantization data type: int8 or int4
          symmetric: false # Whether to use symmetric quantization: Activation value quantization usually uses asymmetric quantization.
          method: "minmax" # Quantization method: minmax
        weight:
          scope: "per_channel"
          dtype: "int8"
          symmetric: true
          method: "minmax"

    Conclusions and Suggestions

    1. Quantization method: The minmax method is preferred. If the accuracy is insufficient, the histogram method can be used.
    2. Quantization granularity (scope):
      • If performance is required, use per_tensor (static quantization).
      • If accuracy is required, use per_token (dynamic quantization).
      • If both accuracy and performance need to be balanced, use pd_mix (hybrid policy).
    3. Symmetric: Activation value quantization is usually set to false (asymmetric quantization), which better fits data distributions that are not centered at zero.
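
    The benefit of asymmetric quantization on data that is not centered at zero can be sketched as follows. Both quantizers are generic textbook formulations written for illustration (hypothetical names, toy data), not msModelSlim code, and they assume the input contains at least one non-zero value.

```python
def fake_quant_symmetric(values, num_bits=8):
    """Symmetric minmax: range [-max|v|, max|v|], zero point fixed at 0."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values], scale

def fake_quant_asymmetric(values, num_bits=8):
    """Asymmetric minmax: range [min, max] (extended to include 0) plus a zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(-lo / scale)
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out, scale

# Hypothetical activations that are all non-negative, so not centered at zero.
values = [i * 0.0137 for i in range(400)]

sym_out, sym_scale = fake_quant_symmetric(values)
asym_out, asym_scale = fake_quant_asymmetric(values)
print(sym_scale, asym_scale)
```

    For one-sided data, the symmetric range wastes the negative half of the integer grid, so its step size is roughly twice that of the asymmetric quantizer.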
  4. Calibration set adjustment

    If algorithm adjustments have limited effect, optimize the calibration data to improve quantized model accuracy. The quality of the calibration set directly affects the accuracy of quantization parameters. For details, see Table 6.

    Table 6 Calibration set adjustment policy

    | Adjustment Policy | Procedure | Objective |
    | --- | --- | --- |
    | Increasing data volume | Increase data volume (10 to 50 samples recommended). | Improve the accuracy of quantization parameters. |
    | Matching application scenarios | Use data that matches the application scenario of the model (for example, use Chinese data for a Chinese model and code data for a code model). | Make the calibration data closer to the actual application scenario. |
    | Balancing data distribution | Extract samples from multiple datasets to balance data distribution. | Improve the diversity and balance of data distribution. |
    | Deleting abnormal data | Delete the calibration data that causes a significant accuracy drop. | Reduce the interference of abnormal samples on quantization parameters. |
    | Adding bad-case data | Add samples on which the model performs poorly (bad cases) so that the calibration set better reflects the actual input distribution of the model. | Help the quantization parameters account for hard samples, improving accuracy on them. |

    Conclusions and Suggestions

    1. Data volume: 10 to 50 samples are recommended. Too few samples may lead to inaccurate quantization parameter computation, while too many may increase quantization time.
    2. Scenario matching: Preferentially use the data that matches the model application scenario to ensure that the calibration dataset can represent the actual application scenario.
    3. Data quality: Delete abnormal data in a timely manner to avoid negative impact on quantization parameters.
    4. Hard samples: Add proper bad case data to improve the model's quantization accuracy on hard samples.
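
    The adjustment policies above can be combined in a small helper. The sketch below is only an illustration: the function, its parameters, and the sample data are hypothetical and do not correspond to an msModelSlim API. It keeps all bad-case samples, drops blocklisted (abnormal) samples, and fills the remaining quota evenly from several source datasets.

```python
import random

def build_calibration_set(datasets, total=30, bad_cases=(), blocklist=(), seed=0):
    """Assemble a balanced calibration set from several named sample pools."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    blocked = set(blocklist)
    picked = [s for s in bad_cases if s not in blocked]  # always keep hard samples
    names = sorted(datasets)
    remaining = max(total - len(picked), 0)
    per_source, extra = divmod(remaining, len(names))
    for i, name in enumerate(names):
        pool = [s for s in datasets[name] if s not in blocked and s not in picked]
        quota = per_source + (1 if i < extra else 0)
        picked.extend(rng.sample(pool, min(quota, len(pool))))
    return picked

# Hypothetical source datasets that match the model's application scenarios.
datasets = {
    "chat": [f"chat sample {i}" for i in range(20)],
    "code": [f"code sample {i}" for i in range(20)],
    "math": [f"math sample {i}" for i in range(20)],
}
bad_cases = ["hard sample A", "hard sample B", "hard sample C"]
blocklist = ["chat sample 3"]  # a sample found to cause a significant accuracy drop

calib = build_calibration_set(datasets, total=30, bad_cases=bad_cases, blocklist=blocklist)
print(len(calib))
```

    Keeping the total within the recommended 10 to 50 samples keeps the quantization time bounded while still covering every source.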
  5. Quantization rollback (last resort)

    If the expected accuracy is still not achieved after algorithm and calibration set adjustments, you can roll back the most sensitive layers in the model to high precision (FP16/BF16) to mitigate the accuracy drop caused by quantization.

    • Application scenarios
      • The accuracy still cannot meet the requirements after steps 1 to 4 are performed.
      • A more refined balance between accuracy and performance is required.
      • Some layers are extremely sensitive to quantization and need to remain high precision.
    • Procedure
      1. Analyzing sensitive layers

        Use the sensitive layer analysis tool provided by msModelSlim to identify quantization-sensitive layers. For details, see Quantization-Sensitive Layer Analysis Tool User Guide.

        Function description

        • Automatic evaluation: The tool automatically evaluates quantization sensitivity of the model's linear layers and generates a sensitivity score for each layer.
        • Decision-making basis: Users can determine which highly sensitive layers need to be rolled back based on the generated sensitivity scores.
      2. Configuring rollback

        Based on the sensitive layer analysis results in step 1, use the exclude field in the YAML configuration file of the quantization policy to exclude the highly sensitive layers that need to be rolled back.

    • Example
      - type: "linear_quant"
        qconfig:
          act:
            scope: "per_tensor"
            dtype: "int8"
            symmetric: false
            method: "minmax"
          weight:
            scope: "per_channel"
            dtype: "int8"
            symmetric: true
            method: "minmax"
        include: ["*"]
        exclude: ["*model.layers.*.mlp.down_proj*"] # Roll back all mlp.down_proj layers
    • Conclusions and Suggestions
      1. Rollback priority: In practice, the mlp.down_proj layer is usually one of the most quantization-sensitive layers. It is advised to roll back this layer first.
      2. Tradeoff: Rollback partially reduces the performance improvement and memory savings brought by quantization. Determine the number of layers to roll back and the rollback range based on your specific service objectives.
      3. Rollback policy: Use a top-down policy. Roll back layers gradually, starting with the most sensitive ones, until model accuracy and computing performance reach an acceptable balance.
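
    The top-down policy can be sketched as follows. The sensitivity scores are hypothetical placeholders for the output of the sensitive layer analysis tool, the helper names are invented for illustration, and the sketch assumes that exclude patterns are matched glob-style, which the wildcard pattern in the example above suggests but does not confirm.

```python
from fnmatch import fnmatch

def rollback_patterns(sensitivity, top_k):
    """Return exclude patterns for the top_k most sensitive layers."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    return [f"*{name}*" for name in ranked[:top_k]]

def is_excluded(layer_name, patterns):
    """Check whether a layer name matches any exclude pattern (glob-style)."""
    return any(fnmatch(layer_name, p) for p in patterns)

# Hypothetical per-layer sensitivity scores (higher means more sensitive).
scores = {
    "model.layers.0.mlp.down_proj": 0.91,
    "model.layers.0.self_attn.o_proj": 0.40,
    "model.layers.1.mlp.down_proj": 0.75,
    "model.layers.1.self_attn.q_proj": 0.05,
}

patterns = rollback_patterns(scores, top_k=2)
print(patterns)
print([name for name in scores if is_excluded(name, patterns)])
```

    Starting with a small top_k and increasing it only while the accuracy target is still missed keeps the number of high-precision layers, and therefore the performance cost, as low as possible.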