Uniform Quantization

Uniform quantization is quantization in which the quantization levels are evenly spaced. For example, INT8 quantization represents 32-bit float32 or 16-bit float16 data as 8-bit INT8 data, and converts a float32 or float16 convolution operation (multiply-add operation) into an INT8 convolution operation. This reduces the model size and speeds up computation. In uniform INT8 quantization, the quantized data is evenly distributed over the INT8 value range [–128, +127].
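
To make "evenly spaced" concrete, here is a minimal sketch of symmetric uniform INT8 quantization in plain Python. It is a generic illustration, not AMCT's algorithm: the helper names and the choice of deriving the scale from the largest magnitude in the data are assumptions made for this example.

```python
def quantize_int8(values, scale):
    """Map floats onto the uniform INT8 grid: round(v / scale), clamped to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize_int8(codes, scale):
    """Recover approximate floats; neighbouring INT8 codes are exactly `scale` apart."""
    return [c * scale for c in codes]


data = [-1.0, -0.4, 0.0, 0.3, 1.0]
# Assumption for this sketch: a symmetric scale taken from the largest magnitude.
scale = max(abs(v) for v in data) / 127

codes = quantize_int8(data, scale)
restored = dequantize_int8(codes, scale)
print(codes)  # [-127, -51, 0, 38, 127]
```

Because the levels are uniformly spaced, each restored value differs from its original by at most half a step (`scale / 2`), as long as the original lies inside the representable range.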

If the model accuracy is not satisfactory after uniform quantization, perform Accuracy-based Automatic Quantization, Quantization Aware Training, or Manual Tuning.

The layers that support uniform quantization as well as their restrictions are listed as follows. For details about the quantization sample, see Sample List.

**Table 1** Layers that support uniform quantization as well as their restrictions
Supported Layer Type	Restriction	Remarks
MatMul	transpose_a = False, transpose_b = False, adjoint_a = False, adjoint_b = False	-
BatchMatMul/BatchMatMulV2	adj_x=False, adj_y=False	When both inputs are tensors, only INT8 symmetric quantization is supported. In this scenario, quantization brings benefits only on the following product models; on other product models, the precision decreases after quantization: Atlas 200I/500 A2 inference product; Atlas A2 training products/Atlas A2 inference products; Atlas A3 training series products/Atlas A3 inference series products. When the second input has no dynamic source (such as a placeholder) and has two dimensions, weight quantization is performed and the AscendWeightQuant operator is inserted. In other scenarios, both inputs are quantized and the AscendQuant operator is inserted.
Conv2D	-	The weights are of type const and do not have dynamic inputs (such as placeholders). For the DepthwiseConv2dNative layer: if both strides and dilation are greater than 1, the shape of the CPU/GPU inference result is incorrect in TensorFlow 1.15 and 2.6.5. This is a known TensorFlow issue and is not caused by AMCT. If only one of strides and dilation is greater than 1, the inference result is correct.
Conv3D	dilation_d = 1, dilation_h/dilation_w ≥ 1
DepthwiseConv2dNative	-
Conv2DBackpropInput	dilation = 1
AvgPool	-	-
MaxPool	Tensor quantization only	-
Add	Tensor quantization only	-

API Call Sequence

Figure 1 shows the API call sequence for uniform quantization.

Figure 1 API call sequence for uniform quantization

The user implements the operations in blue, while those in gray are implemented by AMCT APIs. Specifically, import the AMCT package into the source TensorFlow network inference code and call the APIs where appropriate for quantization. The steps are as follows.

1. Build an original TensorFlow model, then generate a quantization configuration file by calling the create_quant_config API.
2. Based on the TensorFlow model and the quantization configuration file, call the quantize_model API to optimize the original TensorFlow model and insert activation and weight quantization operators into the resulting model for quantization parameter calculation.
3. In the TensorFlow environment, run inference with the modified model output in step 2 on the test and calibration datasets provided by AMCT to obtain the quantization factors.
   The test dataset is used to test the accuracy of the quantized model in the TensorFlow environment, while the calibration dataset is used to generate the quantization factors that preserve accuracy.
4. Call the save_model API to insert operators including AscendQuant and AscendDequant and save the quantized model, which can either be used for accuracy simulation in the TensorFlow environment or be deployed on the Ascend AI Processor.
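
The call sequence above can be sketched as a single function. This is only an illustration under stated assumptions: `run_ptq` and the `calibrate` callback are hypothetical names introduced here, and the AMCT module is passed in as a parameter so that the ordering is explicit. Only the three AMCT calls and the keyword arguments shown in this document are used.

```python
def run_ptq(amct, ori_graph, calibrate, pb_model, outputs,
            config_file="./tmp/config.json",
            record_file="./tmp/record.txt",
            save_path="./results/user_model",
            batch_num=1):
    """Hypothetical wrapper that runs the PTQ calls in the documented order."""
    # Generate the quantization configuration file.
    amct.create_quant_config(config_file=config_file, graph=ori_graph,
                             skip_layers=[], batch_num=batch_num)
    # Insert activation and weight quantization operators into the graph.
    amct.quantize_model(graph=ori_graph, config_file=config_file,
                        record_file=record_file)
    # User-supplied calibration: run forward passes on the modified graph.
    calibrate(ori_graph, batch_num)
    # Insert AscendQuant/AscendDequant operators and save the quantized model.
    amct.save_model(pb_model=pb_model, outputs=outputs,
                    record_file=record_file, save_path=save_path)
```

In a real script, you would pass the imported package itself, for example `import amct_tensorflow as amct` followed by `run_ptq(amct, ...)`, together with your own calibration function.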

Example

This section details the PTQ template code line by line, helping you understand the AMCT workflow. You can adapt the template code to other network models with just a few tweaks. The PTQ workflow goes through the following steps:

Prepare an already-trained model and necessary datasets.
Validate the model accuracy and environment setup in the source TensorFlow environment.
Write a PTQ script based on AMCT API calls.
Run the PTQ script.
Test the accuracy of the fake-quantized model in the source TensorFlow environment.

The following shows how to write a script that calls the AMCT APIs to quantize a model.

The sample code below is a starting point: update it to match your model and adjust the arguments passed to the AMCT API calls as required.

Import the AMCT package and set the log level.

import amct_tensorflow as amct
amct.set_logging_level(print_level='info', save_level='info')

(Optional) Validate the inference script and environment setup in the source TensorFlow environment. (Update the sample code based on your situation.)
You are advised to first run inference on the original model (the model to be quantized) on the test dataset in the TensorFlow environment. This confirms that the inference script and environment work and that the original model reaches acceptable accuracy. To save time, you can use a subset of the test dataset.
user_do_inference(ori_model, test_data, test_iterations)
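
The `user_do_inference` function is user-defined, and its real shape depends on your model. As a rough sketch only, assuming a graph with a single input and a single output tensor, it might look like the following. The `input_name` and `output_name` parameters are hypothetical additions for this example, and the first argument is assumed to be a tf.Graph.

```python
def user_do_inference(graph, batches, iterations, input_name, output_name):
    """Run up to `iterations` forward passes and return the output of each pass."""
    import tensorflow as tf  # imported lazily; TensorFlow is required at call time

    # Assumption: tensor names such as "input:0" must be taken from your own model.
    input_tensor = graph.get_tensor_by_name(input_name)
    output_tensor = graph.get_tensor_by_name(output_name)

    results = []
    with tf.compat.v1.Session(graph=graph) as sess:
        for step, batch in enumerate(batches):
            if step >= iterations:
                break
            results.append(sess.run(output_tensor,
                                    feed_dict={input_tensor: batch}))
    return results
```

Accuracy computation is left out here; the returned outputs would be compared against your labels in whatever way suits the task.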
Prepare a tf.Graph based on the user_model.pb model file. (Update the sample code based on your situation.)
ori_graph = user_load_graph()
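
`user_load_graph` is also user-defined. One common way to build a tf.Graph from a frozen .pb file is sketched below. The `pb_path` parameter and its default are assumptions for illustration, and your model may need a different loading path (for example, if it is not a frozen GraphDef).

```python
def user_load_graph(pb_path="user_model.pb"):
    """Load a frozen GraphDef from `pb_path` into a new tf.Graph."""
    import tensorflow as tf  # imported lazily; TensorFlow is required at call time

    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(pb_path, "rb") as pb_file:
        graph_def.ParseFromString(pb_file.read())

    graph = tf.Graph()
    with graph.as_default():
        # name="" keeps the original node names instead of adding an "import/" prefix.
        tf.import_graph_def(graph_def, name="")
    return graph
```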

Run AMCT to quantize the model.

Generate a quantization configuration file.

config_file = './tmp/config.json'
skip_layers = []
batch_num = 1
amct.create_quant_config(config_file=config_file,
                         graph=ori_graph,
                         skip_layers=skip_layers,
                         batch_num=batch_num)

Modify the graph by inserting activation and weight quantization operators for quantization parameter calculation.
record_file = './tmp/record.txt'
amct.quantize_model(graph=ori_graph,
                    config_file=config_file,
                    record_file=record_file)
The quantize_model API modifies the original TensorFlow model. It inserts a searchN layer into the model, which changes the output node of the model. For details, see What Do I Do If My TensorFlow Network Output Node Is Changed by AMCT?. If an error message indicating empty tensor input is displayed during quantization, rectify the fault by referring to What Do I Do If an Error Message Is Displayed Indicating Empty Tensor Input During PTQ?
Run inference on the modified graph based on the calibration dataset to determine the quantization factors. (Update the sample code based on your situation.)
Pay attention to the following points:
1. Ensure that the calibration dataset and its preprocessing match the model to preserve accuracy.
2. Ensure that enough forward passes are run (the required number is specified by batch_num). If too few forward passes are run, the quantization factors are not written to the record file, and reading the record file for verification then fails.
user_do_inference(ori_graph, calibration_data, batch_num)
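
Since too few forward passes leave the record file incomplete, it can help to fail early when the calibration set is too small. The helper below is a hypothetical sketch, not part of AMCT. It assumes that one forward pass consumes one batch of `batch_size` samples, and it splits an in-memory sample list into exactly `batch_num` batches.

```python
def take_calibration_batches(samples, batch_size, batch_num):
    """Return `batch_num` batches of `batch_size` samples, or raise if data is short."""
    needed = batch_size * batch_num
    if len(samples) < needed:
        raise ValueError(
            f"calibration needs {needed} samples "
            f"({batch_num} batches of {batch_size}), got {len(samples)}"
        )
    return [samples[i * batch_size:(i + 1) * batch_size]
            for i in range(batch_num)]
```

The resulting list can then be fed to your own inference loop, one batch per forward pass.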
If the message "Invalid argument: You must feed a value for placeholder tensor **" is displayed, fix the error by referring to Why Is the Message "Invalid argument: You must feed a value for placeholder tensor **" Displayed During Calibration?

If the message "xxx calculate scale failed" is displayed, fix the error by referring to What If "inf or NaN" or "xxx calculate scale failed" Appears During IFMR Activation Quantization?

Save the model.

Call the save_model API to insert operators such as AscendQuant and AscendDequant and save the quantized models based on the quantization factors.

quant_model_path = './results/user_model'
amct.save_model(pb_model='user_model.pb',
                outputs=['user_model_outputs0', 'user_model_outputs1'],
                record_file=record_file,
                save_path=quant_model_path)

If the message "RuntimeError: cannot find shift_bit of layer ** in record_file" is displayed, fix the error by referring to Why Is the Message "RuntimeError: record_file is empty, no layers to be quantized" Displayed During Model Saving?
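
Because errors at this stage often trace back to a record file that was never filled during calibration, a quick check before calling save_model can shorten debugging. The helper below is a hypothetical sketch: it only checks that the file exists and is non-empty, and it does not validate the file's contents or format.

```python
import os


def record_file_has_content(record_file):
    """Return True if `record_file` exists and is not empty."""
    return os.path.isfile(record_file) and os.path.getsize(record_file) > 0
```

For example, a script could stop with a clear message if `record_file_has_content('./tmp/record.txt')` returns False, instead of proceeding to save the model.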

(Optional) Run inference on the quantized model user_model_quantized.pb in the TensorFlow environment based on the test dataset (test_data) to test the accuracy. (Update the sample code based on your situation.)
Compare the accuracy of the fake-quantized model with that of the original model obtained in the optional validation step earlier.
quant_model = './results/user_model_quantized.pb'
user_do_inference(quant_model, test_data, test_iterations)
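
How the two accuracies are compared depends on your task. For a classification model, a minimal sketch could compute top-1 accuracy for both models on the same test samples and report the difference. The function names and the toy predictions below are made up for illustration.

```python
def top1_accuracy(predictions, labels):
    """Fraction of samples whose predicted class equals the label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def accuracy_drop(ori_predictions, quant_predictions, labels):
    """Accuracy of the original model minus accuracy of the quantized model."""
    return (top1_accuracy(ori_predictions, labels)
            - top1_accuracy(quant_predictions, labels))


# Toy data: the quantized model gets one of four samples wrong.
labels = [0, 1, 2, 1]
ori_predictions = [0, 1, 2, 1]
quant_predictions = [0, 1, 2, 0]
print(accuracy_drop(ori_predictions, quant_predictions, labels))  # 0.25
```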
