Accuracy Tuning Example for Inference Quantization

This section uses the Qwen3-32B model as an example to describe how to perform quantization accuracy tuning in the W8A8 quantization scenario.

  • Initial status: When full static quantization (per-channel/per-tensor) is used together with the Smooth Quant outlier suppression algorithm, the quantized model produces garbled characters in the dialog, making it unusable.
  • Target status: Perform W8A8 quantization on the Qwen3-32B model to ensure that the accuracy loss of the quantized model is within a controllable range compared with the floating-point model.

Environment Preparation

Install msModelSlim. For details, see msModelSlim Installation Guide.

Tuning Procedure

  1. Confirming the accuracy issue

    Before starting the tuning, confirm that the issue persists after eliminating environmental interference.

    • Inference engine verification: The original accuracy of the floating-point model can be reproduced on the target inference engine.
    • Evaluation result check: The output of the quantized model is abnormal (garbled characters in the dialog), which is confirmed to be caused by quantization accuracy issues.
    • Fluctuation range determination: On the AIME25 evaluation dataset, the accuracy loss exceeds the normal evaluation fluctuation range.
  2. Adjusting the outlier suppression algorithm (primary step)

    The Smooth Quant algorithm in the initial configuration results in garbled characters in the dialog, so other outlier suppression algorithms are tried in sequence based on the tuning policy.

    Table 1 Outlier suppression algorithm

    | Outlier Suppression Algorithm | AIME25 (Acc.) | Quantization Time (s) | Remarks |
    | --- | --- | --- | --- |
    | Smooth Quant | Garbled characters in the dialog | 326 | Significant accuracy drop with the initial configuration. |
    | Iterative Smooth (symmetric/alpha:0.5) | 53.33% | 324 | Higher accuracy, yet still insufficient. |
    | Iterative Smooth (asymmetric/alpha:0.5) | 63.33% | 305 | 10 percentage points higher accuracy with the asymmetric solution, meeting the expectation. |
    | Iterative Smooth (symmetric/alpha:0.9) | 66.67% | 319 | Higher accuracy after alpha parameter adjustment. |
    | Flex Smooth Quant | 63.33% | 1,380 | Accuracy equivalent to Iterative Smooth (asymmetric/alpha:0.5), with a longer quantization time. |

    Tuning result: Considering accuracy and quantization time, the Iterative Smooth (symmetric/alpha:0.9) algorithm is selected. The analysis is as follows:

    1. Accuracy comparison and analysis
      • Iterative Smooth (symmetric/alpha:0.9): 66.67% accuracy, the highest among all solutions.
      • Iterative Smooth (symmetric/alpha:0.9) vs. Iterative Smooth (symmetric/alpha:0.5): 66.67% vs. 53.33% accuracy, an increase of 13.34 percentage points.
      • Iterative Smooth (symmetric/alpha:0.9) vs. Iterative Smooth (asymmetric/alpha:0.5): 66.67% vs. 63.33% accuracy, an increase of 3.34 percentage points.
      • Iterative Smooth (symmetric/alpha:0.9) vs. Flex Smooth Quant: 66.67% vs. 63.33% accuracy, an increase of 3.34 percentage points.
    2. Quantization time comparison and analysis
      • Iterative Smooth (symmetric/alpha:0.9) vs. Iterative Smooth (symmetric/alpha:0.5): 319s vs. 324s quantization time.
      • Iterative Smooth (symmetric/alpha:0.9) vs. Flex Smooth Quant: 319s vs. 1,380s quantization time, a 76.9% reduction.

    Final decision: Based on the preceding analysis, Iterative Smooth (symmetric/alpha:0.9) is the best choice in the current scenario.
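
    To make the role of alpha more concrete, the following is a minimal sketch based on the smoothing-scale formula of the published SmoothQuant method. It is an assumption that Iterative Smooth in msModelSlim uses alpha in a comparable way; the tool's internal algorithm is not described in this section. The function name apply_smoothing and the per-channel maxima are hypothetical illustration values.

```python
def apply_smoothing(act_absmax, weight_absmax, alpha):
    """Return per-channel (activation, weight) maxima after smoothing.

    Scale per input channel j: s_j = max|X_j|**alpha / max|W_j|**(1 - alpha).
    Activations are divided by s_j and weights are multiplied by s_j,
    so the product X @ W stays mathematically unchanged.
    """
    scales = [
        (a ** alpha) / (w ** (1.0 - alpha))
        for a, w in zip(act_absmax, weight_absmax)
    ]
    smoothed_act = [a / s for a, s in zip(act_absmax, scales)]
    smoothed_weight = [w * s for w, s in zip(weight_absmax, scales)]
    return smoothed_act, smoothed_weight


# Hypothetical per-channel maxima: channel 1 carries an activation outlier.
act_absmax = [1.0, 64.0, 2.0]
weight_absmax = [0.5, 0.5, 0.5]

for alpha in (0.5, 0.9):
    act, weight = apply_smoothing(act_absmax, weight_absmax, alpha)
    print(
        f"alpha={alpha}:",
        "activation maxima", [round(v, 3) for v in act],
        "weight maxima", [round(v, 3) for v in weight],
    )
```

    In this formulation, alpha = 0.5 leaves the outlier channel with equal activation and weight maxima, while alpha = 0.9 moves most of the outlier magnitude into the weights. The sketch only illustrates the mechanism; it does not by itself explain the accuracy difference measured in Table 1.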

  3. Quantization algorithm selection

    After the outlier suppression algorithm is determined, optimize the quantization algorithm configuration.

    Table 2 Quantization method comparison

    | Weight Quantization Method | Activation Quantization Granularity | AIME25 (Acc.) | Quantization Time (s) | Remarks |
    | --- | --- | --- | --- | --- |
    | minmax | per-tensor (static quantization) | 66.67% | 319 | Basic configuration (based on the result in step 2). |
    | minmax | per-token (dynamic quantization) | 80.00% | 289 | The accuracy improves by 13.33 percentage points after activation quantization is switched to per-token. |
    | ssz | per-tensor (static quantization) | 63.33% | 408 | Using the ssz method for weight quantization decreases accuracy. |
    | ssz | per-token (dynamic quantization) | 70.00% | 348 | The ssz + per-token method delivers lower accuracy than the minmax + per-token method. |

    Tuning result: Considering accuracy and quantization time, the minmax + per-token (dynamic quantization) configuration is selected. The analysis is as follows:

    1. Accuracy comparison and analysis
      • minmax + per-token (dynamic quantization) vs. minmax + per-tensor (static quantization): 80.00% vs. 66.67% accuracy, an increase of 13.33 percentage points.
      • ssz + per-tensor (static quantization) vs. minmax + per-tensor (static quantization): 63.33% vs. 66.67% accuracy, a reduction of 3.34 percentage points. This indicates that the ssz method is inferior to the minmax method in the INT8 quantization scenario.
      • ssz + per-token (dynamic quantization) vs. minmax + per-token (dynamic quantization): 70.00% vs. 80.00% accuracy, a reduction of 10 percentage points. This indicates that the minmax method is better than the ssz method in the INT8 dynamic quantization scenario.
    2. Quantization time comparison and analysis
      • minmax + per-token vs. minmax + per-tensor: 289s vs. 319s quantization time, 9.4% faster (30s shorter).
      • ssz + per-token vs. minmax + per-token: 348s vs. 289s quantization time, 20.4% slower (59s longer).
    3. Overall comparison and analysis
      • Accuracy: minmax + per-token (80.00%) is better than ssz + per-token (70.00%) and all static quantization solutions.
      • Quantization time: minmax + per-token (289s, shortest), minmax + per-tensor (319s, 30s longer, +10.4%), ssz + per-token (348s, 59s longer, +20.4%).
      • Complexity: The minmax method is easy to implement and computationally efficient. The computing of the ssz method is more complex due to iterative search. Therefore, the minmax method is preferred for INT8 quantization.

    Final decision: Based on the preceding analysis, minmax + per-token (dynamic quantization) is the best choice in the current scenario. It achieves the highest accuracy (80.00%, 10 percentage points higher than ssz + per-token) and the shortest quantization time (289s, 59s shorter than ssz + per-token), and it is easier to implement. The accuracy is 13.33 percentage points higher than the configuration in step 2 (66.67%), laying a foundation for subsequent tuning.
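
    The difference between per-tensor and per-token activation granularity can be illustrated with a small, self-contained sketch. It is not the msModelSlim implementation: the helper names and the activation values are hypothetical, both variants use symmetric int8 minmax ranges, and the per-tensor range is computed from the same data instead of from a separate calibration set.

```python
def quantize_dequantize(values, absmax, qmax=127):
    """Symmetric int8 round trip using the given absolute-max range."""
    scale = absmax / qmax if absmax > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(tokens, per_token):
    """Mean absolute round-trip error over all activation values."""
    tensor_absmax = max(abs(v) for row in tokens for v in row)
    total, count = 0.0, 0
    for row in tokens:
        # per-token: one range per row; per-tensor: one shared range.
        absmax = max(abs(v) for v in row) if per_token else tensor_absmax
        restored = quantize_dequantize(row, absmax)
        total += sum(abs(a - b) for a, b in zip(row, restored))
        count += len(row)
    return total / count


# Hypothetical activations: three tokens, the last one contains an outlier.
tokens = [
    [0.10, -0.20, 0.05, 0.15],
    [0.30, 0.25, -0.10, 0.02],
    [50.0, -0.40, 0.20, 0.10],
]

per_tensor_err = mean_abs_error(tokens, per_token=False)
per_token_err = mean_abs_error(tokens, per_token=True)
print(f"per-tensor mean abs error: {per_tensor_err:.4f}")
print(f"per-token mean abs error:  {per_token_err:.4f}")
```

    With a shared range, the single outlier sets the quantization step for every token, so small values in the other tokens are rounded coarsely. With one range per token, only the token that contains the outlier is affected.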

  4. Calibration set adjustment

    The quality of the calibration set directly affects the accuracy of quantization parameters. The accuracy reaches 80.00% in step 3, which meets the preset accuracy requirement. To demonstrate the complete tuning process and verify the effect of calibration set adjustment, this step is tested on the GPQA dataset, with Iterative Smooth and static quantization as the baseline configuration. Because the GPQA dataset contains more questions, it shows accuracy differences between configurations more clearly.

    Table 3 Calibration set adjustment policy

    | Adjustment Policy | Procedure | Calibration Set Change | Objective |
    | --- | --- | --- | --- |
    | Initializing calibration dataset | 10 random samples. | 10 samples | Establish the baseline configuration. |
    | Increasing data volume | Increase samples from 10 to 30. | 10 → 30 samples | Improve the accuracy of quantization parameters. |
    | Matching application scenarios | Replace random data with Chinese dialog data. | 30 samples (Chinese dialogs) | Make the calibration data closer to the actual application scenario. |
    | Balancing data distribution | Extract samples from multiple datasets such as GPQA, C-Eval, and MMLU. | 30 samples (multiple datasets) | Improve the diversity and balance of data distribution. |
    | Removing abnormal data | Remove three abnormal samples that cause a decrease in quantization accuracy. | 30 → 27 samples | Reduce the interference of abnormal samples on quantization parameters. |
    | Adding bad cases | Add five bad case samples of the floating-point model on GPQA. | 27 → 32 samples | Cover hard samples during calibration to improve accuracy. |

    Tuning process: Add bad-case samples from the AISBench evaluation result to the quantization calibration set, and then regenerate the quantization weights. The procedure is as follows:

    1. Obtaining bad case samples
      Extract a small number of bad-case samples from the AISBench evaluation result. For example, a bad-case sample is as follows:
      What is the correct answer to this question: Two quantum states with energies E1 and E2 have a lifetime of 10^-9 sec and 10^-8 sec, respectively. We want to clearly distinguish these two energy levels. Which one of the following options could be their energy difference so that they can be clearly resolved?
      
      Choices:
      (A)10^-11 eV
      (B)10^-8 eV
      (C)10^-9 eV
      (D)10^-4 eV
      Format your response as follows: "The correct answer is (insert answer here)"
    2. Converting formats
      • JSONL format: Refer to msmodelslim/lab_calib/mix_calib.jsonl. For details, see Link. Set the text as the value of the inputs_pretokenized field. The format is as follows:
        {
           "inputs_pretokenized":
             "What is the correct answer to this question: Two quantum states with energies E1 and E2 have a lifetime of 10^-9 sec and 10^-8 sec, respectively. We want to clearly distinguish these two energy levels. Which one of the following options could be their energy difference so that they can be clearly resolved?\n\nChoices:\n(A)10^-11 eV\n(B)10^-8 eV\n(C)10^-9 eV\n(D)10^-4 eV\nFormat your response as follows: \"The correct answer is (insert answer here)\""
        }
      • JSON format: Refer to msmodelslim/lab_calib/qwen3_cot_w4a4.json. For details, see Link. Add the text directly to the list of strings.
    3. Performing requantization

      Use the adjusted calibration dataset for quantization to regenerate the quantization weight.
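
    The format conversion in step 2 can be scripted. The following is a minimal sketch that appends one bad-case sample to a JSONL calibration file as a single-line record with the inputs_pretokenized field. The helper names, the file name, and the sample text are hypothetical; only the field name comes from the format shown above.

```python
import json
import tempfile
from pathlib import Path


def append_calibration_sample(jsonl_path, text):
    """Append one sample to a JSONL file as a single-line JSON record."""
    record = {"inputs_pretokenized": text}
    with Path(jsonl_path).open("a", encoding="utf-8") as f:
        # json.dumps escapes newlines and double quotes inside the text.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_calibration_samples(jsonl_path):
    """Read all sample texts back from a JSONL calibration file."""
    with Path(jsonl_path).open(encoding="utf-8") as f:
        return [json.loads(line)["inputs_pretokenized"] for line in f if line.strip()]


# Hypothetical sample text, for illustration only.
sample = (
    "Example question?\n\nChoices:\n(A)1\n(B)2\n"
    'Format your response as follows: "The correct answer is (insert answer here)"'
)

with tempfile.TemporaryDirectory() as tmp:
    calib_file = Path(tmp) / "calib_with_bad_cases.jsonl"  # hypothetical name
    append_calibration_sample(calib_file, sample)
    restored = load_calibration_samples(calib_file)

print(f"{len(restored)} sample(s) written; round trip intact: {restored[0] == sample}")
```

    Letting the JSON library produce the escapes avoids hand-editing \n and \" sequences, which is where format errors are most likely to slip in.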

    Table 4 Tuning result

    | Quantization Policy | GPQA (Acc.) | Remarks |
    | --- | --- | --- |
    | Iterative Smooth + static quantization | 46.97% | Baseline configuration. |
    | Iterative Smooth + static quantization + adding bad cases for calibration set adjustment | 55.56% | The accuracy is 8.59 percentage points higher than the baseline configuration, indicating that adding bad-case samples to the calibration set helps the quantized model handle difficult samples and improves quantization accuracy. |

  5. Quantization rollback (alternative solution)

    Quantization rollback refers to retaining the original floating-point precision of quantization-sensitive layers to improve quantized model accuracy. If the accuracy still cannot meet the requirements after steps 1 to 4 are performed, you can use the quantization rollback policy for further tuning. This section verifies the quantization rollback effect on the GPQA dataset and demonstrates the complete tuning process.

    Application scenarios

    Quantization rollback applies to the following scenarios:

    • The accuracy still cannot meet the requirements after steps 1 to 4 are performed.
    • A more refined balance between accuracy and performance is required.
    • Some layers are extremely sensitive to quantization and need to retain high precision.

    Tuning process

    1. Analyzing sensitive layers

      Use the sensitive layer analysis tool provided by msModelSlim to identify quantization-sensitive layers. For details, see Quantization-Sensitive Layer Analysis Tool User Guide.

      Run the analysis command:
      msmodelslim analyze \
          --model_type Qwen3-32B \
          --model_path ${model_path}

      The layers with the highest quantization sensitivity are as follows, sorted by sensitivity score in descending order.

      layers.3.mlp.down_proj
      layers.63.mlp.down_proj
      layers.2.mlp.down_proj
      layers.1.mlp.down_proj
      layers.4.mlp.down_proj
      layers.6.mlp.down_proj
      layers.7.mlp.down_proj
      layers.5.mlp.down_proj
      layers.0.mlp.down_proj
      layers.31.mlp.down_proj
      layers.62.mlp.down_proj
      layers.5.mlp.gate_proj
      layers.5.mlp.up_proj
      layers.32.mlp.down_proj
      layers.8.mlp.gate_proj
      layers.8.mlp.up_proj
      layers.6.mlp.gate_proj
      layers.6.mlp.up_proj

      Analysis result: The mlp.down_proj layers rank highest in sensitivity and are difficult to quantize. Therefore, they should be rolled back first.

    2. Modifying the quantization configuration
      In the quantization configuration in YAML format, use the exclude field to roll back the top 9 most sensitive layers (all mlp.down_proj layers).
      apiversion: modelslim_v1
      spec:
        process:
          - type: "iter_smooth"
            alpha: 0.9
            scale_min: 1e-5
            symmetric: True
            enable_subgraph_type:
              - 'norm-linear'
              - 'linear-linear'
              - 'ov'
              - 'up-down'
            include:
              - "*"
          - type: "linear_quant"
            qconfig:
              act:
                scope: "per_tensor"
                dtype: "int8"
                symmetric: false
                method: "minmax"
              weight:
                scope: "per_channel"
                dtype: "int8"
                symmetric: true
                method: "minmax"
            include: 
              - "*"
            exclude:
              - 'model.layers.3.mlp.down_proj'
              - 'model.layers.63.mlp.down_proj'
              - 'model.layers.2.mlp.down_proj'
              - 'model.layers.1.mlp.down_proj'
              - 'model.layers.4.mlp.down_proj'
              - 'model.layers.6.mlp.down_proj'
              - 'model.layers.7.mlp.down_proj'
              - 'model.layers.5.mlp.down_proj'
              - 'model.layers.0.mlp.down_proj'
        save:
          - type: "ascendv1_saver"
            part_file_size: 4
    3. Regenerating the quantization weight

      Perform quantization again based on the modified configuration to generate a quantized model that contains the layers to be rolled back.

      Table 5 Tuning result

      | Quantization Policy | GPQA (Acc.) | Remarks |
      | --- | --- | --- |
      | Iterative Smooth + static quantization | 46.97% | Baseline configuration. |
      | Iterative Smooth + static quantization + rolling back top 9 most sensitive layers | 51.51% | The accuracy is 4.54 percentage points higher than the baseline configuration, indicating that rolling back quantization-sensitive layers can effectively improve quantization accuracy, but may increase performance overhead and model size. |
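
    The exclude entries in the rollback configuration can be derived from the sensitivity ranking instead of being typed by hand. The following is a minimal sketch: the helper name build_exclude_list is hypothetical, the ranking is copied manually from the analysis output above (no tool output format is assumed), and the "model." prefix follows the entries shown in the YAML configuration.

```python
def build_exclude_list(ranked_layers, top_n, prefix="model."):
    """Take the top_n entries of a sensitivity ranking and add the module prefix."""
    return [prefix + name for name in ranked_layers[:top_n]]


# First entries of the ranking from the analysis step, most sensitive first.
ranked_layers = [
    "layers.3.mlp.down_proj",
    "layers.63.mlp.down_proj",
    "layers.2.mlp.down_proj",
    "layers.1.mlp.down_proj",
    "layers.4.mlp.down_proj",
    "layers.6.mlp.down_proj",
    "layers.7.mlp.down_proj",
    "layers.5.mlp.down_proj",
    "layers.0.mlp.down_proj",
    "layers.31.mlp.down_proj",
]

exclude = build_exclude_list(ranked_layers, top_n=9)
for name in exclude:
    # Emit lines that can be pasted under the exclude field of the YAML config.
    print(f"  - '{name}'")
```

    Changing top_n makes it straightforward to regenerate the list when testing how many rolled-back layers are needed to reach the target accuracy.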

Configuration Summary

  • Tuning steps
    Table 6 Tuning procedure

    | Step | Key Operation | AIME25 (Acc.) | Accuracy Improvement (percentage points) | Remarks |
    | --- | --- | --- | --- | --- |
    | Initial state | Smooth Quant + minmax + static quantization | Garbled characters | - | The model is unusable with the initial configuration. |
    | Step 2 | Iterative Smooth (symmetric/alpha:0.9) | 66.67% | +66.67 | The outlier suppression algorithm resolves the garbled character issue. |
    | Step 3 | minmax + per-token (dynamic quantization) | 80.00% | +13.33 | Activation quantization granularity is optimized to meet the accuracy requirements. |

    The accuracy reaches 80.00% in step 3, meeting the preset accuracy requirement. Steps 4 and 5 are performed on the GPQA dataset to verify the tuning effect of calibration set adjustment and quantization rollback.

  • Final configuration

    Outlier suppression algorithm: Iterative Smooth (symmetric/alpha: 0.9).

    Quantization configuration:

    • Weight quantization: minmax method, per_channel granularity, int8 data type, and symmetric quantization.
    • Activation quantization: minmax method, per_token granularity (dynamic quantization), int8 data type, and symmetric quantization.
  • Tuning summary
    1. Outlier suppression algorithms are critical to accuracy improvement.

      Switching the algorithm from Smooth Quant to Iterative Smooth (symmetric/alpha: 0.9) improves the model's accuracy to 66.67%, resolving the garbled-character issue and making the quantized model basically usable.

    2. The selected activation quantization policy directly affects model accuracy.

      Switching the quantization granularity from per-tensor (static quantization) to per-token (dynamic quantization) improves the model's accuracy from 66.67% to 80.00%, an increase of 13.33 percentage points (about 20% relative improvement). However, dynamic quantization may cause inference performance loss.

    3. The selected quantization algorithm has a significant impact on accuracy and efficiency.

      In the INT8 quantization scenario, the minmax method outperforms the ssz method in terms of accuracy (80.00% vs 70.00%), quantization time (289s vs 348s), and implementation simplicity. Therefore, the minmax method is recommended in this scenario.

    4. The quality of calibration datasets has an important impact on quantized model accuracy.

      Adding bad-case samples to the calibration dataset increases the model's accuracy on the GPQA dataset from 46.97% to 55.56% (an increase of 8.59 percentage points). This indicates that using data that matches the target scenario, especially hard samples, can effectively improve quantized model accuracy.

    5. Quantization rollback can be used as a supplementary method to maintain accuracy.

      In the GPQA dataset testing, the model's accuracy improved from 46.97% to 51.51% (an increase of 4.54 percentage points) by rolling back nine quantization-sensitive layers. Although this policy can improve model accuracy to some extent, it increases model size and inference overhead. Therefore, you are advised to use this policy as a last resort.
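
    The improvements quoted in this summary mix two measures: percentage points (absolute difference) and relative improvement. The following short sketch recomputes both from the accuracy values reported above; the helper name compare is hypothetical.

```python
def compare(before, after):
    """Return (percentage-point change, relative change in percent)."""
    delta_pp = after - before
    relative = delta_pp / before * 100.0
    return round(delta_pp, 2), round(relative, 1)


# Accuracy values (in percent) taken from the tables in this section.
results = {
    "AIME25: per-tensor -> per-token": (66.67, 80.00),
    "GPQA: baseline -> bad cases added": (46.97, 55.56),
    "GPQA: baseline -> 9 layers rolled back": (46.97, 51.51),
}

for label, (before, after) in results.items():
    delta_pp, relative = compare(before, after)
    print(f"{label}: +{delta_pp} percentage points ({relative}% relative)")
```

    Keeping the two measures separate avoids reading a 10 percentage point gap as a 10% improvement, which are different quantities.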