Accuracy Tuning Example for Inference Quantization

This section uses the Qwen3-32B model as an example to describe how to perform quantization accuracy tuning in the W8A8 quantization scenario.

Initial status: When full static quantization (per-channel/per-tensor) is used together with the Smooth Quant outlier suppression algorithm, the quantized model produces garbled characters in the dialog, making it unusable.
Target status: Perform W8A8 quantization on the Qwen3-32B model to ensure that the accuracy loss of the quantized model is within a controllable range compared with the floating-point model.

Environment Preparation

Install msModelSlim. For details, see msModelSlim Installation Guide.

Tuning Procedure

Confirming the accuracy issue
Before starting the tuning, confirm that the issue persists after eliminating environmental interference.
- Inference engine verification: The original accuracy of the floating-point model can be reproduced on the target inference engine.
- Evaluation result check: The output of the quantized model is abnormal (garbled characters in the dialog), which is confirmed to be caused by quantization accuracy issues.
- Fluctuation range determination: On the AIME25 evaluation dataset, the accuracy loss is beyond the normal fluctuation range.

Adjusting the outlier suppression algorithm (primary step)

With the Smooth Quant algorithm in the initial configuration, the quantized model produces garbled characters in the dialog. Following the tuning policy, the other outlier suppression algorithms are tried in sequence.

**Table 1** Outlier suppression algorithm
Outlier Suppression Algorithm	AIME25 (Acc.)	Quantization Time (s)	Remarks
Smooth Quant	Garbled characters in the dialog	326	Significant accuracy drop with the initial configuration.
Iterative Smooth (symmetric/alpha:0.5)	53.33%	324	Higher accuracy, yet still insufficient.
Iterative Smooth (asymmetric/alpha:0.5)	63.33%	305	10 percentage points higher than the symmetric configuration with the same alpha.
Iterative Smooth (symmetric/alpha:0.9)	66.67%	319	Higher accuracy after alpha parameter adjustment.
Flex Smooth Quant	63.33%	1,380	Same accuracy as Iterative Smooth (asymmetric/alpha:0.5), but a much longer quantization time.

Tuning result: Considering accuracy and quantization time, the Iterative Smooth (symmetric/alpha:0.9) algorithm is selected. The analysis is as follows:

Accuracy comparison and analysis
- Iterative Smooth (symmetric/alpha:0.9): 66.67% accuracy, the highest among all solutions.
- Iterative Smooth (symmetric/alpha:0.9) vs. Iterative Smooth (symmetric/alpha:0.5): 66.67% vs. 53.33% accuracy, an increase of 13.34 percentage points.
- Iterative Smooth (symmetric/alpha:0.9) vs. Iterative Smooth (asymmetric/alpha:0.5): 66.67% vs. 63.33% accuracy, an increase of 3.34 percentage points.
- Iterative Smooth (symmetric/alpha:0.9) vs. Flex Smooth Quant: 66.67% vs. 63.33% accuracy, an increase of 3.34 percentage points.
Quantization time comparison and analysis
- Iterative Smooth (symmetric/alpha:0.9) vs. Iterative Smooth (symmetric/alpha:0.5): 319s vs. 324s quantization time.
- Iterative Smooth (symmetric/alpha:0.9) vs. Flex Smooth Quant: 319s vs. 1,380s quantization time, a 76.9% reduction.

Final decision: Based on the preceding analysis, Iterative Smooth (symmetric/alpha:0.9) is the best choice in the current scenario.
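
To make the role of the alpha parameter more concrete, the following minimal sketch uses the smoothing-scale formula of the original SmoothQuant method, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). It is an illustration only: the Iterative Smooth implementation in msModelSlim may compute its scales differently, and the function name, the sample values, and the use of scale_min as a lower clamp are assumptions made for this example.

```python
def smooth_scales(act_absmax, weight_absmax, alpha, scale_min=1e-5):
    """Per-channel smoothing scales, SmoothQuant style (inputs must be > 0)."""
    scales = []
    for a, w in zip(act_absmax, weight_absmax):
        s = (a ** alpha) / (w ** (1.0 - alpha))
        scales.append(max(s, scale_min))  # assumed meaning of scale_min
    return scales


# Hypothetical per-channel absolute maxima; channel 0 has an activation outlier.
act_absmax = [80.0, 1.0, 0.5]
weight_absmax = [0.2, 0.3, 0.25]

smoothed = {}
for alpha in (0.5, 0.9):
    s = smooth_scales(act_absmax, weight_absmax, alpha)
    # Activations are divided by s and weights are multiplied by s,
    # so the floating-point product of the two is unchanged.
    smoothed[alpha] = {
        "act": [a / x for a, x in zip(act_absmax, s)],
        "weight": [w * x for w, x in zip(weight_absmax, s)],
    }
    print(alpha,
          [round(v, 2) for v in smoothed[alpha]["act"]],
          [round(v, 2) for v in smoothed[alpha]["weight"]])
```

In this sketch, a larger alpha shrinks the activation outlier further and moves more of the dynamic range onto the weights, which is one plausible reason why changing alpha affects accuracy.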

Quantization algorithm selection

After the outlier suppression algorithm is determined, optimize the quantization algorithm configuration.

**Table 2** Quantization method comparison
Weight Quantization Method	Activation Quantization Granularity	AIME25 (Acc.)	Quantization Time (s)	Remarks
minmax	per-tensor (static quantization)	66.67%	319	Baseline configuration (the result of step 2).
minmax	per-token (dynamic quantization)	80.00%	289	Accuracy improves by 13.33 percentage points after activation quantization is switched to per-token.
ssz	per-tensor (static quantization)	63.33%	408	Accuracy decreases when ssz is used for weight quantization.
ssz	per-token (dynamic quantization)	70.00%	348	ssz + per-token delivers lower accuracy than minmax + per-token.

Tuning result: Considering accuracy and quantization time, the minmax + per-token (dynamic quantization) configuration is selected. The analysis is as follows:

Accuracy comparison and analysis
- minmax + per-token (dynamic quantization) vs. minmax + per-tensor (static quantization): 80.00% vs. 66.67% accuracy, an increase of 13.33 percentage points.
- ssz + per-tensor (static quantization) vs. minmax + per-tensor (static quantization): 63.33% vs. 66.67% accuracy, a reduction of 3.34 percentage points. This indicates that the ssz method is inferior to the minmax method in the INT8 static quantization scenario.
- ssz + per-token (dynamic quantization) vs. minmax + per-token (dynamic quantization): 70.00% vs. 80.00% accuracy, a reduction of 10 percentage points. This indicates that the minmax method is better than the ssz method in the INT8 dynamic quantization scenario.
Quantization time comparison and analysis
- minmax + per-token vs. minmax + per-tensor: 289s vs. 319s quantization time, 9.4% faster (30s shorter).
- ssz + per-token vs. minmax + per-token: 348s vs. 289s quantization time, 20.4% slower (59s longer).
Overall comparison and analysis
- Accuracy: minmax + per-token (80.00%) is better than ssz + per-token (70.00%) and all static quantization solutions.
- Quantization time: minmax + per-token is the shortest (289s). ssz + per-token takes 348s (59s longer, +20.4%), and minmax + per-tensor takes 319s (30s longer, +10.4%).
- Complexity: The minmax method is easy to implement and computationally efficient. The computing of the ssz method is more complex due to iterative search. Therefore, the minmax method is preferred for INT8 quantization.
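
The comparisons above mix two kinds of figures: accuracy differences in percentage points and time differences as relative percentages. The short sketch below reproduces them from the values in Table 1 and Table 2. The helper names are hypothetical and exist only for this example.

```python
def percentage_point_diff(new_acc, old_acc):
    """Absolute difference between two accuracies given in percent."""
    return round(new_acc - old_acc, 2)


def relative_change(new_value, old_value):
    """Relative change of new_value with respect to old_value, in percent."""
    return round((new_value - old_value) / old_value * 100, 1)


# Accuracy: minmax + per-token vs. minmax + per-tensor
acc_gain_pp = percentage_point_diff(80.00, 66.67)
acc_gain_rel = relative_change(80.00, 66.67)

# Quantization time, each measured against the second argument
time_token_vs_tensor = relative_change(289, 319)   # minmax: per-token vs. per-tensor
time_ssz_vs_minmax = relative_change(348, 289)     # per-token: ssz vs. minmax
time_iter_vs_flex = relative_change(319, 1380)     # Iterative Smooth vs. Flex Smooth Quant

print(acc_gain_pp, acc_gain_rel)
print(time_token_vs_tensor, time_ssz_vs_minmax, time_iter_vs_flex)
```

Keeping the two units apart avoids reading a 10 percentage point gap as a 10% relative change.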

Final decision: Based on the preceding analysis, minmax + per-token (dynamic quantization) is the best choice in the current scenario. It achieves the highest accuracy (80.00%, 10 percentage points higher than ssz + per-token) and the shortest quantization time (289s, 59s shorter than ssz + per-token), and it is easier to implement. Its accuracy is also 13.33 percentage points higher than that of the step 2 configuration (66.67%), laying a foundation for subsequent tuning.
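
The following minimal sketch illustrates why per-token activation scales can preserve more accuracy than a single per-tensor scale when a few tokens carry outliers. It uses symmetric int8 minmax quantization on hypothetical values. As a simplification, the per-tensor scale is computed from the same data, whereas real static quantization derives it offline from the calibration set. None of the names below come from msModelSlim.

```python
def minmax_scale(values):
    """Symmetric int8 scale from the absolute maximum of the values."""
    absmax = max(abs(v) for v in values)
    return absmax / 127.0 if absmax > 0 else 1.0


def quantize_dequantize(values, scale):
    """Round to int8 in [-127, 127], then map back to floating point."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def mean_abs_error(tokens, scales):
    """Mean absolute round-trip error, one scale per token (row)."""
    total, count = 0.0, 0
    for row, scale in zip(tokens, scales):
        for v, r in zip(row, quantize_dequantize(row, scale)):
            total += abs(v - r)
            count += 1
    return total / count


# Hypothetical activations: three tokens, the last one contains outliers.
tokens = [
    [0.10, -0.20, 0.05, 0.15],
    [0.30, 0.25, -0.10, 0.02],
    [60.0, -45.0, 12.0, 3.0],
]

# Per-tensor: one scale shared by every token.
shared = minmax_scale([v for row in tokens for v in row])
per_tensor_err = mean_abs_error(tokens, [shared] * len(tokens))

# Per-token: each token gets its own scale, computed at run time.
per_token_err = mean_abs_error(tokens, [minmax_scale(row) for row in tokens])

print(per_tensor_err, per_token_err)
```

With a shared scale, the outlier token forces a coarse step size onto the small-valued tokens. Per-token scales avoid this at the cost of computing scales during inference.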

Calibration set adjustment

The accuracy reaches 80.00% in step 3, meeting the preset accuracy requirement. To demonstrate the complete tuning process and verify the effect of calibration set adjustment, this section performs testing and verification on the GPQA dataset. Because it contains more questions, the GPQA dataset shows accuracy differences between configurations more clearly. Iterative Smooth with static quantization is used as the baseline configuration. The quality of the calibration set directly affects the accuracy of quantization parameters.

**Table 3** Calibration set adjustment policy
Adjustment Policy	Procedure	Calibration Set Change	Objective
Initializing calibration dataset	Use 10 random samples.	10 samples	Establish the baseline configuration.
Increasing data volume	Increase samples from 10 to 30.	10 → 30 samples	Improve the accuracy of quantization parameters.
Matching application scenarios	Replace random data with Chinese dialog data.	30 samples (Chinese dialogs)	Make the calibration data closer to the actual application scenario.
Balancing data distribution	Extract samples from multiple datasets such as GPQA, C-Eval, and MMLU.	30 samples (multiple datasets)	Improve the diversity and balance of data distribution.
Removing abnormal data	Remove three abnormal samples that cause a decrease in quantization accuracy.	30 → 27 samples	Reduce the interference of abnormal samples on quantization parameters.
Adding bad cases	Add five bad-case samples of the floating-point model on GPQA.	27 → 32 samples	Make the quantization parameters cover hard samples to improve accuracy.
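
As an illustration of the "Balancing data distribution" policy, the sketch below draws an equal share of samples from several source datasets. The function name, the seed handling, and the placeholder prompts are assumptions for this example. Only the dataset names and the total of 30 samples come from the table.

```python
import random


def build_balanced_calibration_set(pools, total, seed=0):
    """Draw roughly total / len(pools) samples from each source, without replacement."""
    rng = random.Random(seed)
    names = sorted(pools)
    base, extra = divmod(total, len(names))
    selected = []
    for i, name in enumerate(names):
        # Spread any remainder over the first few sources.
        quota = min(base + (1 if i < extra else 0), len(pools[name]))
        selected.extend((name, text) for text in rng.sample(pools[name], quota))
    rng.shuffle(selected)  # avoid grouping samples by source
    return selected


# Placeholder prompts standing in for real dataset entries.
pools = {
    "GPQA": [f"gpqa prompt {i}" for i in range(50)],
    "C-Eval": [f"c-eval prompt {i}" for i in range(50)],
    "MMLU": [f"mmlu prompt {i}" for i in range(50)],
}

calib = build_balanced_calibration_set(pools, total=30)
print(len(calib))
```

Fixing the seed keeps the calibration set reproducible, which makes it easier to attribute an accuracy change to a specific adjustment.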

Tuning process: Add bad-case samples from the AISBench evaluation result to the quantization calibration set, and then regenerate the quantization weight. The procedure is as follows:

Obtaining bad case samples

Extract a small number of bad-case samples from the AISBench evaluation result. For example, a bad-case sample is as follows:

What is the correct answer to this question: Two quantum states with energies E1 and E2 have a lifetime of 10^-9 sec and 10^-8 sec, respectively. We want to clearly distinguish these two energy levels. Which one of the following options could be their energy difference so that they can be clearly resolved?

Choices:
(A)10^-11 eV
(B)10^-8 eV
(C)10^-9 eV
(D)10^-4 eV
Format your response as follows: "The correct answer is (insert answer here)"

Converting formats

JSONL format: Refer to msmodelslim/lab_calib/mix_calib.jsonl. For details, see Link. Place the text in the value of the inputs_pretokenized field. The format is as follows:

{
   "inputs_pretokenized":
     "What is the correct answer to this question: Two quantum states with energies E1 and E2 have a lifetime of 10^-9 sec and 10^-8 sec, respectively. We want to clearly distinguish these two energy levels. Which one of the following options could be their energy difference so that they can be clearly resolved?\n\nChoices:\n(A)10^-11 eV\n(B)10^-8 eV\n(C)10^-9 eV\n(D)10^-4 eV\nFormat your response as follows: \"The correct answer is (insert answer here)\""
}

JSON format: Refer to msmodelslim/lab_calib/qwen3_cot_w4a4.json. For details, see Link. Directly add the text to the character string list.
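
The two conversions described above can be sketched with the Python standard library. The inputs_pretokenized field name comes from this section. The helper names and the shortened sample prompt are hypothetical, and writing one record per line follows the general JSONL convention rather than a documented msModelSlim requirement.

```python
import json


def to_jsonl_line(prompt):
    """Serialize one calibration sample as a single-line JSONL record."""
    return json.dumps({"inputs_pretokenized": prompt}, ensure_ascii=False)


def append_to_json_list(json_text, prompt):
    """Add one prompt to a JSON array of strings and return the new JSON text."""
    samples = json.loads(json_text)
    samples.append(prompt)
    return json.dumps(samples, ensure_ascii=False, indent=2)


# Shortened stand-in for a bad-case prompt; line breaks and quotes are kept
# in the Python string and escaped automatically by json.dumps.
bad_case = (
    "What is the correct answer to this question: <question text>\n\n"
    "Choices:\n(A)10^-11 eV\n(B)10^-8 eV\n(C)10^-9 eV\n(D)10^-4 eV\n"
    'Format your response as follows: "The correct answer is (insert answer here)"'
)

jsonl_line = to_jsonl_line(bad_case)
json_list = append_to_json_list('["existing calibration prompt"]', bad_case)
print(jsonl_line)
```

Letting json.dumps handle the escaping avoids hand-written mistakes such as a missing backslash before an embedded quotation mark or a stray newline escape.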

Performing requantization
Use the adjusted calibration dataset for quantization to regenerate the quantization weight.

**Table 4** Tuning result
Quantization Policy	GPQA (Acc.)	Remarks
Iterative Smooth + static quantization	46.97%	Baseline configuration.
Iterative Smooth + static quantization + adding bad cases for calibration set adjustment	55.56%	The accuracy is 8.59 percentage points higher than the baseline configuration, indicating that calibrating with bad-case samples helps the quantized model handle difficult samples and improves quantization accuracy.

Quantization rollback (alternative solution)

Quantization rollback refers to retaining the original floating-point precision of quantization-sensitive layers to improve quantized model accuracy. If the accuracy still cannot meet the requirements after steps 1 to 4 are performed, you can use the quantization rollback policy for further tuning. This section verifies the quantization rollback effect on the GPQA dataset and demonstrates the complete tuning process.

Application scenarios

Quantization rollback applies to the following scenarios:

- The accuracy still cannot meet the requirements after steps 1 to 4 are performed.
- A more refined balance between accuracy and performance is required.
- Some layers are extremely sensitive to quantization and need to remain in high precision.

Tuning process

Analyzing sensitive layers

Use the sensitive layer analysis tool provided by msModelSlim to identify quantization-sensitive layers. For details, see Quantization-Sensitive Layer Analysis Tool User Guide.

Run the analysis command:

msmodelslim analyze \
    --model_type Qwen3-32B \
    --model_path ${model_path}

The layers with the highest quantization sensitivity are as follows, sorted by sensitivity score in descending order.

layers.3.mlp.down_proj
layers.63.mlp.down_proj
layers.2.mlp.down_proj
layers.1.mlp.down_proj
layers.4.mlp.down_proj
layers.6.mlp.down_proj
layers.7.mlp.down_proj
layers.5.mlp.down_proj
layers.0.mlp.down_proj
layers.31.mlp.down_proj
layers.62.mlp.down_proj
layers.5.mlp.gate_proj
layers.5.mlp.up_proj
layers.32.mlp.down_proj
layers.8.mlp.gate_proj
layers.8.mlp.up_proj
layers.6.mlp.gate_proj
layers.6.mlp.up_proj

Analysis result: The mlp.down_proj layers rank highest in sensitivity and are difficult to quantize. Therefore, they should be rolled back first.
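
The ranking above can be summarized with a few lines of Python, as a quick way to see which module type dominates and to derive rollback candidates. The list is copied from the analysis output above. The helper name and the choice of nine layers mirror this example and are not part of the analysis tool. The model. prefix matches the layer names used in the quantization configuration.

```python
from collections import Counter

# Ranked output of the sensitivity analysis, most sensitive first.
ranked_layers = [
    "layers.3.mlp.down_proj", "layers.63.mlp.down_proj", "layers.2.mlp.down_proj",
    "layers.1.mlp.down_proj", "layers.4.mlp.down_proj", "layers.6.mlp.down_proj",
    "layers.7.mlp.down_proj", "layers.5.mlp.down_proj", "layers.0.mlp.down_proj",
    "layers.31.mlp.down_proj", "layers.62.mlp.down_proj", "layers.5.mlp.gate_proj",
    "layers.5.mlp.up_proj", "layers.32.mlp.down_proj", "layers.8.mlp.gate_proj",
    "layers.8.mlp.up_proj", "layers.6.mlp.gate_proj", "layers.6.mlp.up_proj",
]


def module_type(layer_name):
    """Strip the 'layers.<index>.' prefix, e.g. 'mlp.down_proj'."""
    return layer_name.split(".", 2)[2]


# How often each module type appears among the listed layers.
type_counts = Counter(module_type(name) for name in ranked_layers)

# Candidate entries for the exclude field: the nine most sensitive layers.
top_n = 9
rollback_layers = ["model." + name for name in ranked_layers[:top_n]]

print(type_counts.most_common())
print(rollback_layers)
```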

Modifying the quantization configuration

In the YAML quantization configuration, use the exclude field to roll back the nine most sensitive layers, all of which are mlp.down_proj layers.

apiversion: modelslim_v1
spec:
  process:
    - type: "iter_smooth"
      alpha: 0.9
      scale_min: 1e-5
      symmetric: True
      enable_subgraph_type:
        - 'norm-linear'
        - 'linear-linear'
        - 'ov'
        - 'up-down'
      include:
        - "*"
    - type: "linear_quant"
      qconfig:
        act:
          scope: "per_tensor"
          dtype: "int8"
          symmetric: false
          method: "minmax"
        weight:
          scope: "per_channel"
          dtype: "int8"
          symmetric: true
          method: "minmax"
      include: 
        - "*"
      exclude:
        - 'model.layers.3.mlp.down_proj'
        - 'model.layers.63.mlp.down_proj'
        - 'model.layers.2.mlp.down_proj'
        - 'model.layers.1.mlp.down_proj'
        - 'model.layers.4.mlp.down_proj'
        - 'model.layers.6.mlp.down_proj'
        - 'model.layers.7.mlp.down_proj'
        - 'model.layers.5.mlp.down_proj'
        - 'model.layers.0.mlp.down_proj'
  save:
    - type: "ascendv1_saver"
      part_file_size: 4

Regenerating the quantization weight

Perform quantization again based on the modified configuration to generate a quantized model that contains the layers to be rolled back.
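
Before regenerating the weights, it can help to check which layers the include and exclude lists select. The sketch below assumes glob-style matching in which exclude takes priority over include. This is an assumption about how the fields are interpreted, not documented msModelSlim behavior, and the function name is hypothetical.

```python
from fnmatch import fnmatchcase


def is_quantized(layer_name, include, exclude):
    """True if the layer matches an include pattern and no exclude pattern."""
    included = any(fnmatchcase(layer_name, pattern) for pattern in include)
    excluded = any(fnmatchcase(layer_name, pattern) for pattern in exclude)
    return included and not excluded


include = ["*"]
exclude = [
    "model.layers.3.mlp.down_proj",
    "model.layers.63.mlp.down_proj",
    "model.layers.2.mlp.down_proj",
    "model.layers.1.mlp.down_proj",
    "model.layers.4.mlp.down_proj",
    "model.layers.6.mlp.down_proj",
    "model.layers.7.mlp.down_proj",
    "model.layers.5.mlp.down_proj",
    "model.layers.0.mlp.down_proj",
]

for name in (
    "model.layers.3.mlp.down_proj",   # rolled back, stays in floating point
    "model.layers.3.mlp.up_proj",     # still quantized
    "model.layers.31.mlp.down_proj",  # still quantized, not in the top nine
):
    print(name, is_quantized(name, include, exclude))
```

Because the exclude entries are full layer names, layer 3 does not accidentally match layers 31 or 32 under this matching rule.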

**Table 5** Tuning result
Quantization Policy	GPQA (Acc.)	Remarks
Iterative Smooth + static quantization	46.97%	Baseline configuration.
Iterative Smooth + static quantization + rolling back top 9 most sensitive layers	51.51%	The accuracy is 4.54 percentage points higher than the baseline configuration, indicating that rolling back quantization-sensitive layers can effectively improve quantization accuracy, but may increase performance overhead and model size.

Configuration Summary

Tuning steps

**Table 6** Tuning procedure
Step	Key Operation	AIME25 (Acc.)	Accuracy Improvement (Percentage Points)	Remarks
Initial state	Smooth Quant + minmax + static quantization	Garbled characters	-	The model is unusable with the initial configuration.
Step 2	Iterative Smooth (symmetric/alpha:0.9)	66.67%	+66.67	The outlier suppression algorithm resolves the garbled character issue.
Step 3	minmax + per-token (dynamic quantization)	80.00%	+13.33	Activation quantization granularity is optimized to meet the accuracy requirements.

The accuracy reaches 80.00% in step 3, meeting the preset accuracy requirement. Steps 4 and 5 are performed on the GPQA dataset to verify the tuning effect of calibration set adjustment and quantization rollback.

Final configuration
Outlier suppression algorithm: Iterative Smooth (symmetric/alpha: 0.9).

Quantization configuration:
- Weight quantization: minmax method, per_channel granularity, int8 data type, and symmetric quantization.
- Activation quantization: minmax method, per_token granularity (dynamic quantization), int8 data type, and symmetric quantization.
Tuning summary
1. Outlier suppression algorithms are critical to accuracy improvement.
  Switching the algorithm from Smooth Quant to Iterative Smooth (symmetric/alpha: 0.9) improves the model's accuracy to 66.67%, resolving the garbled-character issue and making the quantized model basically usable.
2. The selected activation quantization policy directly affects model accuracy.
  Switching the quantization granularity from per-tensor (static quantization) to per-token (dynamic quantization) improves the model's accuracy from 66.67% to 80.00%, an increase of 13.33 percentage points (about 20% relative improvement). However, dynamic quantization may cause inference performance loss.
3. The selected quantization algorithm has a significant impact on accuracy and efficiency.
  In the INT8 quantization scenario, the minmax method outperforms the ssz method in terms of accuracy (80.00% vs 70.00%), quantization time (289s vs 348s), and implementation simplicity. Therefore, the minmax method is recommended in this scenario.
4. The quality of calibration datasets has an important impact on quantized model accuracy.
  Adding bad-case samples to the calibration dataset increases the model's accuracy on the GPQA dataset from 46.97% to 55.56% (an increase of 8.59 percentage points). This indicates that using data that matches the target scenario, especially hard samples, can effectively improve quantized model accuracy.
5. Quantization rollback can be used as a supplementary method to maintain accuracy.
  In the GPQA dataset testing, the model's accuracy improved from 46.97% to 51.51% (an increase of 4.54 percentage points) by rolling back nine quantization-sensitive layers. Although this policy can improve model accuracy to some extent, it increases model size and inference overhead. Therefore, you are advised to use this policy as a last resort.