Uniform Quantization
Uniform quantization is a quantization scheme in which the quantization levels are evenly spaced. For example, INT8 quantization represents 32-bit FP32 or 16-bit FP16 data with 8-bit INT8 data and converts FP32 or FP16 convolution operations (multiply-add operations) into INT8 convolution operations, which reduces the model size and speeds up computation. In uniform INT8 quantization, the quantized data is evenly distributed across the INT8 value range [–128, +127].
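As a minimal illustration of this mapping (not the algorithm AMCT actually uses; AMCT derives its quantization factors during calibration, for example with IFMR), symmetric max-abs scaling quantizes FP32 data onto the evenly spaced INT8 levels as follows:

```python
import numpy as np

# Illustrative sketch of symmetric uniform INT8 quantization. AMCT searches for
# its quantization factors during calibration rather than using this simple rule.
def quantize_int8(x):
    scale = np.abs(x).max() / 127.0                       # uniform step size
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale                   # FP32 approximation

x = np.random.randn(64).astype(np.float32)
q, scale = quantize_int8(x)
print(np.abs(x - dequantize_int8(q, scale)).max())        # error is at most scale / 2
```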
If the model accuracy is not satisfactory after uniform quantization, perform Quantization Aware Training, Accuracy-based Automatic Quantization, or Manual Tuning.
The layers that support uniform quantization and the restrictions are as follows. For details about quantization examples, see Sample List.
| Supported Layer Type | Restriction | Remarks |
|---|---|---|
| MatMul | transpose_a = False, transpose_b = False, adjoint_a = False, adjoint_b = False | - |
| BatchMatMul/BatchMatMulV2 | adj_x = False, adj_y = False | - |
| Conv2D | - | The weights are of type const and do not have dynamic inputs (such as placeholders). |
| Conv3D | dilation_d = 1, dilation_h/dilation_w ≥ 1 | - |
| DepthwiseConv2dNative | - | - |
| Conv2DBackpropInput | dilation = 1 | - |
| AvgPool | - | - |
| MaxPool | Tensor quantization only | - |
| Add | Tensor quantization only | - |
API Call Sequence
Figure 1 shows the API call sequence for uniform quantization.
- Build a source TensorFlow model and then generate a quantization configuration file by using the create_quant_config call.
- Using the quantize_model API, modify the source TensorFlow model based on the quantization configuration file by inserting activation and weight quantization operators used to calculate the quantization parameters.
- Run inference with the modified model on the test and calibration datasets provided by AMCT in the TensorFlow environment to obtain the quantization factors.
The test dataset is used to test the accuracy of the quantized model in the TensorFlow environment, while the calibration dataset is used to generate quantization factors to ensure accuracy.
- Using the save_model API, insert operators such as AscendQuant and AscendDequant based on the quantization factors and save the quantized models, which are suitable for accuracy simulation in the TensorFlow environment and for deployment on the Ascend AI Processor.
Examples
This section details the PTQ template code line by line, helping you understand the AMCT workflow. You can adapt the template code to other network models with just a few tweaks.
For the sample code, see the mobilenet_v2 sample. The PTQ workflow goes through the following steps:
- Prepare an already-trained model and necessary datasets.
- Validate the model accuracy and environment setup in the source TensorFlow environment.
- Write a PTQ script based on AMCT API calls.
- Run the PTQ script.
- Test the accuracy of the fake-quantized model in the source TensorFlow environment.
The following details how to write a quantization script based on AMCT API calls.
- Take the following steps to get started. Update the sample code based on your situation.
- Tweak the arguments passed to AMCT API calls as required.
- Import the AMCT package and set the log level.
import amct_tensorflow as amct
amct.set_logging_level(print_level='info', save_level='info')
- (Optional) Validate the inference script and environment setup in the source TensorFlow environment. Update the sample code based on your situation.
You are advised to run inference in the TensorFlow environment using the source model to be quantized and its test dataset.
This step is recommended because it verifies that the source model runs inference properly and delivers acceptable accuracy. You can use a subset of the test dataset to improve efficiency.
user_do_inference(ori_model, test_data, test_iterations)
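user_do_inference is a placeholder for your own inference routine; AMCT does not provide it. A minimal sketch of such a helper for a frozen TF1-style graph might look as follows, assuming input and output tensors named 'input:0' and 'output:0' (replace them with your model's actual tensor names):

```python
import tensorflow as tf

def user_do_inference(model, data, iterations):
    """Hypothetical inference helper: 'model' is either a frozen .pb path or an
    in-memory tf.Graph (as in the calibration step later in this example)."""
    if isinstance(model, tf.Graph):
        graph = model
    else:
        graph_def = tf.compat.v1.GraphDef()
        with tf.io.gfile.GFile(model, 'rb') as f:
            graph_def.ParseFromString(f.read())
        graph = tf.Graph()
        with graph.as_default():
            tf.import_graph_def(graph_def, name='')
    with tf.compat.v1.Session(graph=graph) as sess:
        input_t = graph.get_tensor_by_name('input:0')    # assumed tensor name
        output_t = graph.get_tensor_by_name('output:0')  # assumed tensor name
        for i in range(iterations):
            sess.run(output_t, feed_dict={input_t: data[i]})
            # accumulate your accuracy metric on the outputs here
```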
- Prepare a tf.Graph based on the user_model.pb model file. (Update the sample code based on your situation.)
ori_graph = user_load_graph()
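user_load_graph is likewise a user-supplied placeholder. A minimal sketch, assuming user_model.pb is a frozen GraphDef:

```python
import tensorflow as tf

def user_load_graph(pb_path='user_model.pb'):
    """Hypothetical loader: import the frozen GraphDef into a fresh tf.Graph."""
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(pb_path, 'rb') as f:
        graph_def.ParseFromString(f.read())
    graph = tf.Graph()
    with graph.as_default():
        tf.import_graph_def(graph_def, name='')
    return graph
```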
- Run AMCT to quantize the model.
- Generate a quantization configuration file.
config_file = './tmp/config.json'
skip_layers = []
batch_num = 1
amct.create_quant_config(config_file=config_file,
                         graph=ori_graph,
                         skip_layers=skip_layers,
                         batch_num=batch_num)
- Modify the graph to insert activation and weight quantization operators for quantization parameter calculation.
record_file = './tmp/record.txt'
amct.quantize_model(graph=ori_graph,
                    config_file=config_file,
                    record_file=record_file)
Call AMCT's quantize_model API to modify the source TensorFlow model. This API call inserts a searchN layer into the model, which means that the output node of the model will be changed. For details, see What Do I Do If My TensorFlow Network Output Node Is Changed by AMCT? If an error message indicating an empty tensor input is displayed during quantization, rectify the fault by referring to When PTQ is used for quantization, an error message is displayed when an empty tensor is input during quantization.
- Run inference on the modified graph based on the calibration dataset to determine the quantization factors. Update the sample code based on your situation.
Pay attention to the following points:
- Ensure that the calibration dataset and the preprocessed data match the model to preserve the accuracy.
- The number of forward passes is specified by batch_num. If the number of forward passes is insufficient, the quantization factor is not output to the record file. As a result, the record file fails to be read for verification.
user_do_inference(ori_graph, calibration_data, batch_num)
If the message "Invalid argument: You must feed a value for placeholder tensor **" is displayed, fix the error by referring to "Invalid argument: You must feed a value for placeholder tensor **" Is Displayed During Calibration.
If the message "xxx calculate scale failed" is displayed, fix the error by referring to During IFMR Activation Quantization, When "inf or NaN Value" or "xxx Calculate Scale Failed" Exists, an Error Is Reported..
- Save the model. Call save_model to insert operators such as AscendQuant and AscendDequant and save the quantized models based on the quantization factors.
quant_model_path = './results/user_model'
amct.save_model(pb_model='user_model.pb',
                outputs=['user_model_outputs0', 'user_model_outputs1'],
                record_file=record_file,
                save_path=quant_model_path)
If the message "RuntimeError: cannot find shift_bit of layer ** in record_file" is displayed, fix the error by referring to Why Is the Message "RuntimeError: record_file is empty, no layers to be quantized" During Model Saving?
- (Optional) Run inference on the quantized model user_model_quantized.pb in the TensorFlow environment based on the test dataset (test_data) to test the accuracy. (Update the sample code based on your situation.) Compare the accuracy of the quantized model with that of the source model obtained in step 2.
quant_model = './results/user_model_quantized.pb'
user_do_inference(quant_model, test_data, test_iterations)
