Uniform Quantization

Uniform quantization refers to quantization in which the quantization levels are evenly spaced. For example, INT8 quantization uses 8-bit INT8 data to represent 32-bit FP32 or 16-bit FP16 data, and converts an FP32 or FP16 convolution operation (multiply-add operation) into an INT8 convolution operation. This reduces the model size and speeds up computation. In uniform INT8 quantization, the quantized data is evenly distributed in the INT8 value range [–128, +127].
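
The following minimal NumPy sketch (illustrative only, not part of AMCT) shows what symmetric uniform INT8 quantization of a single tensor looks like: one scale maps FP32 values onto evenly spaced INT8 levels in [–128, +127].

    import numpy as np

    def uniform_quantize_int8(x):
        """Map FP32 values to INT8 using one evenly spaced step size (scale)."""
        scale = np.abs(x).max() / 127.0                      # one step size for the whole tensor
        q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
        return q, scale

    x = np.random.randn(4, 4).astype(np.float32)
    q, scale = uniform_quantize_int8(x)
    x_restored = q.astype(np.float32) * scale                # dequantize; rounding error <= scale / 2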

The layers that support uniform quantization and the restrictions are as follows. For details about quantization examples, see Sample List.

Table 1 Layers that support uniform quantization as well as their restrictions

Supported Layer Type | Restriction
Conv                 | 4D or 5D inputs are supported; the weight type is constant or initializer.
Gemm                 | transpose_a=false and alpha=beta=1.0; the weight type is constant or initializer.
MatMul               | When the second input is a constant, only 2D inputs are supported and the weight type is constant or initializer. When both inputs are tensors, only INT8 symmetric quantization is supported.
ConvTranspose        | Only 4D inputs are supported; the weight type is constant or initializer.
AveragePool          | Only quantization with 4D inputs is supported.
MaxPool              | Only tensor quantization is supported.
Add                  | Only tensor quantization is supported.

API Call Sequence

Figure 1 shows the API call sequence for uniform quantization.

Figure 1 API call sequence for uniform quantization
The user implements the operations in blue, while those in gray are implemented by using AMCT APIs.
  1. Build a source ONNX model and then generate a quantization configuration file by using the create_quant_config call.
  2. Modify the source ONNX model by calling the quantize_model API based on the quantization configuration file. This step optimizes the graph (for example, Conv+BN fusion) and inserts weight and activation quantization operators for weight quantization and calibration.
  3. Calibrate the model by running forward passes on the calibration dataset in the ONNX Runtime environment to complete activation quantization and save quantization factors to a file.
  4. Call the save_model API to save the quantized model, including a fake-quantized model file for the ONNX Runtime environment and a deployable model file for the Ascend AI Processor.
    • A fake-quantized model for accuracy simulation on ONNX Runtime with the file name containing the fake_quant keyword.
      The fake-quantized model is used to verify the accuracy of the quantized model and can run in the ONNX Runtime environment. During forward passes, the input activations and weights of the convolutional layers (and other quantizable layers) of the fake-quantized model are quantized and then dequantized to simulate quantization, which enables you to quickly verify the accuracy of the quantized model. As shown in the following figure, which takes the INT8 mode as an example, data flows through the Quant, Conv, and Dequant layers in Float32: the Quant layer quantizes activations and weights to INT8 and dequantizes them back to Float32, and the calculation at the convolutional layer is performed in Float32 (see the sketch after this list). This model is used only to verify the accuracy of the quantized model in the ONNX Runtime environment and cannot be converted into an .om model by ATC.
      Figure 2 Fake-quantized model
    • A deployable model with the file name containing the deploy keyword. The model can be deployed on Ascend AI Processor after being converted by the ATC tool.
      For example, in INT8 quantization, the weights of the deployable model have already been converted to the INT8 and INT32 types, so inference cannot be performed in the ONNX Runtime environment. As shown in the following figure, the AscendQuant layer of the deployable model quantizes activations from Float32 to INT8 as the input of the convolutional layer, which uses INT8 weights and outputs INT32 results. That is, in the deployable model, the calculation at the convolutional layer is based on the INT8 and INT32 types. The INT32 results are then converted to Float32 at the AscendDequant layer before being passed to the next layer.
      Figure 3 Deployable model
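
The following sketch (illustrative only, not AMCT code) contrasts the two saved models: the fake-quantized model quantizes and immediately dequantizes values so the convolution still runs in Float32, whereas the deployable model keeps the INT8 values and accumulates in INT32 on the Ascend AI Processor.

    import numpy as np

    def fake_quant(x, scale):
        """Quantize to INT8 and immediately dequantize back to Float32 (accuracy simulation)."""
        q = np.clip(np.round(x / scale), -128, 127)
        return (q * scale).astype(np.float32)   # values lie on the INT8 grid but stay Float32

    # Fake-quantized model: conv inputs and weights pass through fake_quant(); the math stays FP32.
    # Deployable model: the INT8 values themselves are kept, the convolution accumulates in INT32,
    # and AscendDequant converts the INT32 result back to Float32 for the next layer.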

Examples

This section details the PTQ template code line by line, helping you understand the AMCT workflow. You can adapt the template code to other network models with just a few tweaks.

For the sample code, see the resnet-101 sample. The PTQ workflow goes through the following steps:

  1. Prepare an already-trained model and necessary datasets.
  2. Verify the model accuracy and environment setup in the source ONNX Runtime environment.
  3. Write a PTQ script based on AMCT API calls.
  4. Run the PTQ script.
  5. Test the accuracy of the fake-quantized model in the ONNX Runtime environment.

The following details how to write a quantization script based on AMCT API calls.

  • Take the following steps to get started. Update the sample code based on your situation.
  • Tweak the arguments passed to AMCT API calls as required.
  1. Import the AMCT package and set the log level (see Set the environment variable for details).
    import amct_onnx as amct
    
  2. (Optional) Validate the inference script and environment setup in the source ONNX Runtime environment. Update the sample code based on your situation.

    You are advised to run inference in the ONNX Runtime environment using the source model to be quantized and its test dataset.

    This step is recommended because it confirms that the source model runs properly and delivers acceptable accuracy. You can use a subset of the test dataset to improve efficiency.

    user_do_inference(ori_model, test_data, test_iterations)
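
    Here user_do_inference is a placeholder for your own inference routine, not an AMCT API. A minimal sketch, assuming an image-classification model and preprocessed NumPy batches (the data format and accuracy metric are assumptions), might look like this:

    import numpy as np
    import onnxruntime as ort

    def user_do_inference(model_path, data, iterations):
        """Run `iterations` forward passes and report top-1 accuracy (sketch only)."""
        session = ort.InferenceSession(model_path)
        input_name = session.get_inputs()[0].name
        correct, total = 0, 0
        for i in range(iterations):
            batch, labels = data[i]                      # assumed: preprocessed NCHW batch + labels
            outputs = session.run(None, {input_name: batch})
            correct += int(np.sum(np.argmax(outputs[0], axis=1) == labels))
            total += len(labels)
        print('top-1 accuracy: %.4f' % (correct / total))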
    
  3. Run AMCT to quantize the model.
    1. Generate a quantization configuration file.
      config_file = './tmp/config.json'
      skip_layers = []
      batch_num = 1
      amct.create_quant_config(config_file=config_file,
                               model_file=ori_model,
                               skip_layers=skip_layers,
                               batch_num=batch_num)
      
    2. Modify the graph to insert activation and weight quantization operators for quantization parameter calculation.
      record_file = './tmp/record.txt'
      modified_model = './tmp/modified_model.onnx'
      amct.quantize_model(config_file=config_file,
                          model_file=ori_model,
                          modified_onnx_file=modified_model,
                          record_file=record_file)
      
    3. Run inference on the modified graph based on the calibration dataset to determine the quantization factors. Update the sample code based on your situation.

      Pay attention to the following points:

      1. Ensure that the calibration dataset and its preprocessing match the model, so that accuracy is preserved.
      2. The number of forward passes is specified by batch_num. If too few forward passes are run, the quantization factors are not written to the record file, and the record file cannot be read for verification.

      Refer to Why Am I Seeing the Message "IFMR node. Name:'layer_ifmr_op' Status Message: std::bad_alloc "? or Why Am I Seeing the "killed" Message During Calibration? for troubleshooting.

      If the message "IfmrQuantWithOffset scale is illegal" is displayed during the calibration, fix the error by referring to Why Do I See "[IfmrQuantWithoutOffset] scale is illegal" During Calibration?.

      user_do_inference(modified_model, calibration_data, batch_num)
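
      How calibration_data is prepared is up to you; the only requirement is that at least batch_num forward passes run. A minimal sketch, assuming the images have already been preprocessed into NumPy arrays (the helper name, shapes, and batch size are assumptions):

      import numpy as np

      def load_calibration_batches(image_tensors, labels, batch_size, batch_num):
          """Group preprocessed images into batch_num batches of batch_size (sketch only)."""
          batches = []
          for i in range(batch_num):
              start = i * batch_size
              batch = np.stack(image_tensors[start:start + batch_size]).astype(np.float32)
              batches.append((batch, labels[start:start + batch_size]))
          return batches

      calibration_data = load_calibration_batches(image_tensors, labels, batch_size=32, batch_num=batch_num)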
      
    4. Save the model.
      Call the save_model API to insert operators such as AscendQuant and AscendDequant into the modified model based on the quantization factors, and save the resulting models.
      quant_model_path = './results/user_model'
      amct.save_model(modified_onnx_file=modified_model,
                      record_file=record_file,
                      save_path=quant_model_path)
      
  4. (Optional) Run inference on the quantized model (quant_model) in the ONNX Runtime environment based on the test dataset (test_data) to test the accuracy. Update the sample code based on your situation.

    Compare the accuracy of the fake-quantized model with that of the source model (see step 2).

    quant_model = './results/user_model_fake_quant_model.onnx'
    user_do_inference(quant_model, test_data, test_iterations)
    

    If the inference speed of the fake-quantized model on the CPU is slower than that of the original model, set the following environment variable:

    # Set the number of threads used during execution.
    export OMP_NUM_THREADS=8
    

    The appropriate value depends on the data volume of the model and the number of CPU cores in the running environment. For example, for the ResNet-101 model with a data volume of 100,000, the number of threads can be set to 8.