Uniform Quantization
Uniform quantization is a type of quantization in which the quantization levels are equally spaced. For example, INT8 quantization represents 32-bit (float32) or 16-bit (float16) data as 8-bit (INT8) data, and converts a float32 or float16 convolution operation (multiply-add operation) into an INT8 convolution operation. This reduces the model size and speeds up computation. In uniform INT8 quantization, the quantized data is evenly distributed over the INT8 value range [–128, +127].
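The mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes a symmetric scheme with a single scale factor derived from the largest absolute value; the helper names `quantize_int8` and `dequantize_int8` are hypothetical and are not part of AMCT, which computes its own quantization factors during calibration.

```python
def quantize_int8(values, scale):
    """Map floats onto a uniform grid of step `scale`, saturating at the INT8 limits."""
    codes = []
    for v in values:
        q = round(v / scale)                  # nearest grid point
        codes.append(max(-128, min(127, q)))  # clamp to [-128, 127]
    return codes


def dequantize_int8(codes, scale):
    """Recover approximate float values from INT8 codes."""
    return [q * scale for q in codes]


data = [-1.0, -0.4, 0.0, 0.26, 1.0]
# Assumption: symmetric range, so the largest magnitude maps to code 127.
scale = max(abs(v) for v in data) / 127

codes = quantize_int8(data, scale)
restored = dequantize_int8(codes, scale)

for original, code, approx in zip(data, codes, restored):
    print(f"{original:+.4f} -> {code:4d} -> {approx:+.4f}")
```

Because the levels are equally spaced, the rounding error of any in-range value is at most half a step (`scale / 2`), while values outside the range saturate at –128 or +127.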
If the model accuracy is not satisfactory after uniform quantization, perform Accuracy-based Automatic Quantization, Quantization Aware Training, or Manual Tuning. The supported layers and restrictions are as follows. For details about the quantization sample, see Sample List.
| Supported Layer Type | Restriction | Remarks |
|---|---|---|
| torch.nn.Linear | - | Layers sharing the weight and bias parameters do not support quantization. |
| torch.nn.Conv2d | | |
| torch.nn.Conv3d | | |
| torch.nn.ConvTranspose2d | | |
| torch.nn.AvgPool2d | - | - |
API Call Sequence
Figure 1 shows the API call sequence for uniform quantization.
- Build an original PyTorch model and then generate a quantization configuration file by using the create_quant_config API.
- Optimize the original PyTorch model using the quantize_model API based on the quantization configuration file. The optimized model contains quantization algorithms.
- Calibrate the model by running forward passes on the calibration dataset in the PyTorch environment to obtain the quantization factor and save it into a file.
- Call the save_model API to save the quantized model, which includes the fake-quantized model file for the ONNX Runtime environment and the deployable model file for the Ascend AI Processor.
- A fake-quantized ONNX model file for accuracy simulation on ONNX Runtime, with the file name containing the fake_quant keyword. The fake-quantized model is used to verify the accuracy of the quantized model and can run in the ONNX Runtime environment. During forward passes, the input activations and weights of the convolutional layer (and other layers) of the fake-quantized model are quantized and then dequantized to simulate quantization, which enables you to quickly verify the accuracy of the quantized model. As shown in the following figure, which takes the INT8 mode as an example, data flows through the Quant, Conv2d, and Dequant layers in float32. The Quant layer quantizes activations and weights to INT8 and then dequantizes them back to float32. The computation at the convolutional layer is performed in float32. This model is used only to verify the accuracy of the quantized model in the ONNX Runtime environment and cannot be converted into an .om model by ATC.
Figure 2 Fake-quantized model

- A deployable ONNX model file, with the file name containing the deploy keyword. The model can be deployed on the Ascend AI Processor after being converted by ATC. For example, in INT8 quantization, the weights of the deployable model have been converted to the INT8 and INT32 types, so inference cannot be run in the ONNX Runtime environment. As shown in the following figure, the AscendQuant layer of the deployable model quantizes activations from float32 to INT8 as the input of the convolutional layer, which uses INT8 weights and outputs INT32 results. That is, in the deployable model, computation at the convolutional layer is based on the INT8 and INT32 types. The INT32 results are then converted into float32 at the AscendDeQuant layer before they are passed to the next layer.
Figure 3 Deployable model

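The difference between the two saved models can be illustrated with a single dot product, standing in for one output element of a convolution. The following is a simplified sketch in plain Python, assuming symmetric quantization with one scale per tensor; the function names and sample values are hypothetical, and the real Quant, AscendQuant, and AscendDeQuant operators carry more parameters than shown here.

```python
def quantize(values, scale):
    """Round to the nearest INT8 code and saturate at [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def fake_quant_dot(acts, weights, s_a, s_w):
    """Fake-quantized path: quantize, dequantize, then compute in float."""
    a_fq = [q * s_a for q in quantize(acts, s_a)]     # Quant -> Dequant round trip
    w_fq = [q * s_w for q in quantize(weights, s_w)]
    return sum(a * w for a, w in zip(a_fq, w_fq))     # float multiply-add


def deploy_dot(acts, weights, s_a, s_w):
    """Deployable path: integer multiply-add, then one dequantization."""
    a_q = quantize(acts, s_a)       # role of AscendQuant: float32 -> INT8
    w_q = quantize(weights, s_w)    # weights are stored as INT8
    acc = sum(a * w for a, w in zip(a_q, w_q))  # integer accumulator (INT32 on device)
    return acc * (s_a * s_w)        # role of AscendDeQuant: INT32 -> float32


acts = [0.5, -1.2, 0.8]
weights = [0.1, 0.3, -0.2]
s_a = max(abs(v) for v in acts) / 127      # assumed per-tensor scales
s_w = max(abs(v) for v in weights) / 127

print(fake_quant_dot(acts, weights, s_a, s_w))
print(deploy_dot(acts, weights, s_a, s_w))
print(sum(a * w for a, w in zip(acts, weights)))  # float reference
```

Both paths produce the same value up to floating-point rounding, which is why the fake-quantized model can stand in for the deployable model when checking accuracy on ONNX Runtime.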
Example
The post-training quantization (PTQ) workflow consists of the following steps:
- Prepare an already-trained model and necessary datasets.
- Validate the model accuracy and environment setup in the source PyTorch environment.
- Write a PTQ script based on AMCT API calls.
- Run the PTQ script.
- Test the accuracy of the fake-quantized model in the ONNX Runtime environment.
- Due to software restrictions (the input data cannot be of DT_INT8 type in the dynamic shape scenario), when ATC is used to convert the quantized deployable model, dynamic shape–related options must not be used, such as --dynamic_batch_size and --dynamic_image_size. Otherwise, the model conversion fails.
- When ATC is used to convert a deployable model quantized by AMCT, the high-precision feature cannot be used. For example, force_fp32 or must_keep_origin_dtype (fp32 input of the original graph) cannot be configured through --precision_mode, origin cannot be configured through --precision_mode_v2, and high_precision cannot be configured through --op_precision_mode. Combining quantization parameters with the high-precision mode provides neither the performance benefits of quantization nor the benefits of the high-precision mode.
- Take the following steps to get started. Update the sample code based on your situation.
- Tweak the arguments passed to AMCT API calls as required.
- Import the AMCT package and set the log level using the environment variable described in "AMCT (PyTorch)" in Post-installation Actions.
    import amct_pytorch as amct
- (Optional) Validate the inference script and environment setup in the source PyTorch environment. (Update the sample code based on your situation.)
You are advised to run inference on the original model in the PyTorch environment based on the test dataset to validate the environment setup and inference script.
This step is recommended as it guarantees a properly functioning original model for inference with acceptable accuracy. You can use a subset from the test dataset to improve the efficiency.
    user_do_inference_torch(ori_model, test_data, test_iterations)
- Run AMCT to quantize the model.
- Generate a quantization configuration file.
    config_file = './tmp/config.json'
    skip_layers = []
    batch_num = 1
    amct.create_quant_config(config_file=config_file,
                             model=ori_model,
                             input_data=ori_model_input_data,
                             skip_layers=skip_layers,
                             batch_num=batch_num)
- Modify the graph by inserting activation and weight quantization operators for quantization parameter calculation.
    record_file = './tmp/record.txt'
    modified_onnx_model = './tmp/modified_model.onnx'
    calibration_model = amct.quantize_model(config_file=config_file,
                                            modified_onnx_model=modified_onnx_model,
                                            record_file=record_file,
                                            model=ori_model,
                                            input_data=ori_model_input_data)
- Run inference on the modified model (calibration_model) in the PyTorch environment based on the calibration dataset (calibration_data) to determine the quantization factors. (Update the sample code based on your situation.)
Pay attention to the following points:
- Ensure that the calibration dataset and the preprocessed data match the model to preserve the accuracy.
- Ensure that the number of forward passes is at least the value specified by batch_num. If too few forward passes are run, the quantization factors are not written to the record file, and the record file then fails verification when it is read.
If you get the error "[IFMR]: Do layer xxx data calibration failed!" during the calibration, rectify the fault by referring to Why Do I See "[IFMR]: Do layer xxx data calibration failed!" During Calibration?
    user_do_inference_torch(calibration_model, calibration_data, batch_num)
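To make the role of the calibration passes more concrete, the sketch below tracks the activation range over a fixed number of batches and derives a scale and offset from it. It is an illustration only: it uses a plain min/max rule rather than the algorithm AMCT actually applies, the `MinMaxObserver` class and `to_int8` helper are hypothetical, and the convention `q = round(x / scale) + offset` is an assumption made for this example.

```python
class MinMaxObserver:
    """Collects the activation range over `batch_num` calibration batches."""

    def __init__(self, batch_num):
        self.batch_num = batch_num
        self.seen = 0
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, batch):
        """Update the running range with one batch (a flat list of floats)."""
        self.lo = min(self.lo, min(batch))
        self.hi = max(self.hi, max(batch))
        self.seen += 1

    def factors(self):
        """Return (scale, offset), or None if too few batches were observed."""
        if self.seen < self.batch_num:
            return None
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)  # keep 0.0 representable
        scale = (hi - lo) / 255 or 1.0                 # fall back to 1.0 for all-zero data
        offset = -128 - round(lo / scale)              # maps `lo` to code -128
        return scale, offset


def to_int8(x, scale, offset):
    """Quantize one value with the assumed convention q = round(x / scale) + offset."""
    return max(-128, min(127, round(x / scale) + offset))


observer = MinMaxObserver(batch_num=2)
observer.observe([-1.0, 0.5])
print(observer.factors())          # None: only one of two batches seen
observer.observe([3.0, 0.2])
scale, offset = observer.factors()
print(scale, offset, to_int8(-1.0, scale, offset), to_int8(3.0, scale, offset))
```

The `None` result after the first batch mirrors the behavior noted above: until enough forward passes have run, no quantization factor is available to record.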
- Save the model. Call the save_model API to insert operators such as AscendQuant and AscendDequant and save the quantized models based on the quantization factors.
    quant_model_path = './results/user_model'
    # modified_onnx_model is the path defined in the quantize_model step.
    amct.save_model(modified_onnx_file=modified_onnx_model,
                    record_file=record_file,
                    save_path=quant_model_path)
- (Optional) Run inference on the fake-quantized model (quant_model) in the ONNX Runtime environment based on the test dataset (test_data) to test the accuracy. (Update the sample code based on your situation.) Check the accuracy drop caused by quantization by comparing the accuracy of the fake-quantized model with that of the original model measured in step 2.
    quant_model = './results/user_model_fake_quant_model.onnx'
    user_do_inference_onnx(quant_model, test_data, test_iterations)
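The accuracy comparison itself can be as simple as the following sketch, which assumes a classification task in which both inference helpers return one predicted class per sample. The `top1_accuracy` function and the label and prediction lists are made up for illustration and are not AMCT output.

```python
def top1_accuracy(predictions, labels):
    """Fraction of samples whose predicted class equals the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Hypothetical results on the same test subset.
labels          = [0, 1, 2, 1, 0, 2, 1, 0]
original_preds  = [0, 1, 2, 1, 0, 2, 0, 0]   # original model in PyTorch
quantized_preds = [0, 1, 2, 1, 0, 1, 0, 0]   # fake-quantized model in ONNX Runtime

original_acc = top1_accuracy(original_preds, labels)
quantized_acc = top1_accuracy(quantized_preds, labels)
drop = original_acc - quantized_acc

print(f"original: {original_acc:.3f}, quantized: {quantized_acc:.3f}, drop: {drop:.3f}")
```

Evaluating both models on the same samples with the same preprocessing keeps the difference attributable to quantization rather than to the data.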
