Manual Quantization

This section lists the layers that support post-training quantization (PTQ), describes the API call sequence, and provides an example.

The layers that support PTQ as well as their restrictions are listed as follows. For details about the quantization workflow, see Sample List.

**Table 1** Layers that support PTQ as well as their restrictions
Supported Layer Type	Restriction	Remarks
MatMul	transpose_a = False, transpose_b = False, adjoint_a = False, adjoint_b = False	-
BatchMatMul/BatchMatMulV2	adjoint_a = False, adjoint_b = False	-
Conv2D	-	The weights are of type const and do not have dynamic inputs (such as placeholders).
DepthwiseConv2dNative	dilation = 1
Conv2DBackpropInput	dilation = 1
AvgPool	-	-
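
The attribute restrictions in Table 1 can be expressed as a small lookup. The following is only a minimal sketch: the name `check_ptq_restrictions` and the plain-dict representation of layer attributes are assumptions made for illustration and are not part of AMCT, which performs its own checks. The remark about constant weights is not modeled here.

```python
# Restrictions from Table 1, keyed by layer type. An empty dict means
# the table lists no attribute restriction for that layer type.
PTQ_RESTRICTIONS = {
    "MatMul": {"transpose_a": False, "transpose_b": False,
               "adjoint_a": False, "adjoint_b": False},
    "BatchMatMul": {"adjoint_a": False, "adjoint_b": False},
    "BatchMatMulV2": {"adjoint_a": False, "adjoint_b": False},
    "Conv2D": {},
    "DepthwiseConv2dNative": {"dilation": 1},
    "Conv2DBackpropInput": {"dilation": 1},
    "AvgPool": {},
}


def check_ptq_restrictions(layer_type, attrs):
    """Return a list of violated restrictions; empty means the layer qualifies.

    `attrs` is a hypothetical plain dict of attribute values. Attributes
    missing from it are assumed to hold the required value.
    """
    if layer_type not in PTQ_RESTRICTIONS:
        return ["layer type %s is not listed in Table 1" % layer_type]
    violations = []
    for name, required in PTQ_RESTRICTIONS[layer_type].items():
        actual = attrs.get(name, required)
        if actual != required:
            violations.append("%s = %r (required: %r)" % (name, actual, required))
    return violations
```

For example, `check_ptq_restrictions("DepthwiseConv2dNative", {"dilation": 2})` reports the dilation mismatch, while a MatMul layer with no transposition yields an empty list.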

API Call Sequence

Figure 1 shows the API call sequence for PTQ.

Figure 1 API call sequence for PTQ

The user implements the operations in blue, while those in gray are implemented by AMCT APIs. Specifically, import the AMCT package into the source TensorFlow inference code and call the APIs at the appropriate points to quantize the model. The tool can be used in the following two scenarios. This document mainly describes online inference on the NPU. For details about inference with TensorFlow (CPU version), see Post-Training Quantization.

NPU-based online inference:
1. Build an original TensorFlow model and then generate a quantization configuration file by using the create_quant_config_ascend API.
2. Optimize the original TensorFlow model using the quantize_model_ascend API based on the quantization configuration file. The optimized model contains quantization algorithms. Run online inference on the test and calibration datasets provided by AMCT in the NPU environment to obtain the quantization factors.
  The test dataset is used to test the accuracy of the quantized model in the NPU environment, while the calibration dataset generates quantization factors to ensure accuracy.
3. Call the save_model_ascend API to save the quantized model, which is deployable in the NPU environment.
Inference in TensorFlow (CPU version):
- Scenario 1:
  1. Build an original TensorFlow model and then generate a quantization configuration file by using the create_quant_config API.
  2. Optimize the original TensorFlow model using the quantize_model API based on the quantization configuration file. The optimized model contains quantization algorithms. Run inference on the test and calibration datasets provided by AMCT in the TensorFlow (CPU version) environment to obtain the quantization factors.
    The test dataset is used to test the accuracy of the quantized model in the TensorFlow environment, while the calibration dataset generates quantization factors to ensure accuracy.
  3. Call the save_model API to save the quantized model, which can be used for accuracy simulation in the TensorFlow (CPU version) environment.
- Scenario 2:
  If you already have your own quantization factors for the original TensorFlow model, you do not need the APIs in scenario 1. Call the convert_model API to generate the quantized model.
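
The NPU call order described above can be condensed into a single function. This is a sketch under stated assumptions rather than a reference implementation: the `amct` module is passed in as an argument so that the call order can be exercised without NPU hardware, `run_calibration` is a hypothetical user callback, and the keyword arguments are limited to those used in the Example part of this section.

```python
def quantize_on_npu(amct, ori_graph, ori_model, outputs, run_calibration,
                    config_file="./tmp/config.json",
                    record_file="./tmp/record.txt",
                    save_path="./results/user_model"):
    """Run the three NPU-side PTQ steps in order.

    `amct` is the imported amct_tensorflow module. `run_calibration` is a
    hypothetical user callback that runs online inference on the
    calibration graph so that the quantization factors get recorded.
    """
    # Step 1: generate the quantization configuration file.
    amct.create_quant_config_ascend(config_file=config_file,
                                    graph=ori_graph,
                                    skip_layers=[])

    # Step 2: insert quantization operators, then calibrate.
    calibration_graph, calibration_outputs = amct.quantize_model_ascend(
        graph=ori_graph,
        config_file=config_file,
        record_file=record_file,
        outputs=outputs)
    run_calibration(calibration_graph, calibration_outputs)

    # Step 3: save the quantized model, built from the original .pb file
    # and the quantization factors in the record file.
    amct.save_model_ascend(pb_model=ori_model,
                           outputs=outputs,
                           record_file=record_file,
                           save_path=save_path)
    return save_path
```

Passing the module and the calibration callback as arguments keeps the ordering visible in one place: calibration must run after quantize_model_ascend and before save_model_ascend, because saving reads the record file that calibration fills in.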

Example

The PTQ workflow goes through the following steps:

1. Prepare an already-trained model and necessary datasets.
2. Validate the model accuracy and environment setup in the source TensorFlow environment.
3. Write a PTQ script based on AMCT API calls.
4. Run the PTQ script.
5. Test the accuracy of the fake-quantized model in the source TensorFlow environment.

The following procedure shows how to write a script that calls the AMCT APIs to quantize the model.

Take the following steps to get started. Update the sample code based on your situation.
Tweak the arguments passed to AMCT API calls as required.

Import the AMCT package and call the set_logging_level API to set the log level.

        
    import amct_tensorflow as amct
    amct.set_logging_level(print_level="info", save_level="info")

(Optional) Run inference on the original TensorFlow model in the NPU environment based on the test dataset to validate the inference script and environment setup. (Update the sample code based on your situation.)

Pay attention to the following points:

- Check that model inference completes successfully on the NPU with satisfactory accuracy. If inference fails, quantization of the model will fail as well. If the inference accuracy is unsatisfactory, the accuracy of the quantized model will be unreliable.
- To save time, you can run this check on a subset of the test dataset.

        
    user_do_inference_on_npu(ori_model, test_data)

Prepare a tf.Graph based on the original model user_model.pb. (Update the sample code based on your situation.)

        
    ori_model = 'user_model.pb'
    ori_graph = user_load_graph(ori_model)
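
The `user_load_graph` function is supplied by the user. One possible shape for it is sketched below, assuming that user_model.pb is a frozen GraphDef file and that the TensorFlow 1.x compatibility API is available. Adapt it if your model is stored in another format.

```python
def user_load_graph(pb_path):
    """Load a frozen GraphDef (.pb) file into a new tf.Graph."""
    # Imported here so that the helper can be defined before the
    # TensorFlow environment is set up.
    import tensorflow as tf

    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(pb_path, "rb") as f:
        graph_def.ParseFromString(f.read())

    graph = tf.Graph()
    with graph.as_default():
        # name="" keeps the original node names instead of adding the
        # default "import/" prefix.
        tf.import_graph_def(graph_def, name="")
    return graph
```

Keeping the original node names matters because the output names passed to AMCT later refer to nodes of this graph.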

Run AMCT to quantize the model.

Generate a quantization configuration file.

          
    config_file = './tmp/config.json'
    # Names of layers to exclude from quantization (none here).
    skip_layers = []
    amct.create_quant_config_ascend(config_file=config_file,
                                    graph=ori_graph,
                                    skip_layers=skip_layers)

Modify the graph by inserting quantization operators into the graph.

          
    record_file = './tmp/record.txt'
    # Placeholder output names; replace them with the outputs of your model.
    user_model_outputs = ['user_model_outputs0', 'user_model_outputs1']
    calibration_graph, calibration_outputs = amct.quantize_model_ascend(
        graph=ori_graph,
        config_file=config_file,
        record_file=record_file,
        outputs=user_model_outputs)

Run online inference on the modified graph based on the calibration dataset to determine the quantization factors. (Update the sample code based on your situation.)

Pay attention to the following points:

- Ensure that the calibration dataset and its preprocessing match the model to preserve accuracy.
- The outputs of the resulting calibration_graph are calibration_outputs. All of them must be executed during online inference.
- Ensure that the number of forward passes (specified by batch_num) is large enough. If too few forward passes are run, the quantization factors are not written to the record file, and reading the record file for verification then fails.

          
    user_do_inference_on_npu(calibration_graph, calibration_outputs, calibration_data)
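
The points above suggest a calibration loop that fetches every calibration output and fails early when too few batches are available. The following is a minimal sketch of such a loop. The names `run_calibration` and `run_step` are hypothetical, and `run_step` stands in for whatever executes one forward pass in your NPU inference code (for example, a wrapper around a session run).

```python
def run_calibration(run_step, calibration_outputs, batches, batch_num):
    """Run `batch_num` forward passes, fetching every calibration output.

    `run_step(fetches, batch)` is a hypothetical callback that executes
    one forward pass on the calibration graph.
    """
    done = 0
    for batch in batches:
        if done == batch_num:
            break
        # Fetch all calibration outputs so that every inserted
        # quantization operator is actually executed.
        run_step(list(calibration_outputs), batch)
        done += 1
    if done < batch_num:
        raise ValueError(
            "calibration needs %d batches but only %d were provided"
            % (batch_num, done))
    return done
```

Raising an error when the dataset runs short surfaces the problem at calibration time instead of later, when the record file is read.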

Save the model.

          
    quant_model_path = './results/user_model'
    amct.save_model_ascend(pb_model=ori_model,
                           outputs=user_model_outputs,
                           record_file=record_file,
                           save_path=quant_model_path)

(Optional) Run inference on the quantized model user_model_quantized.pb in the TensorFlow (CPU version) environment based on the test dataset to test the accuracy. (Update the sample code based on your situation.)

Compare the accuracy of the fake-quantized model with that of the original model, as measured in the optional inference step on the original model earlier in this procedure.

        
    quant_model = './results/user_model_quantized.pb'
    user_do_inference_on_cpu(quant_model, test_data)
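
The accuracy comparison can be as simple as the sketch below. It assumes a classification model evaluated by top-1 accuracy, with predictions and labels given as plain lists of class indices. The function names are illustrative and not part of AMCT, so substitute the metric that fits your model.

```python
def top1_accuracy(predictions, labels):
    """Fraction of samples whose predicted class equals the label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError(
            "predictions and labels must be non-empty and equal in length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def accuracy_drop(ori_predictions, quant_predictions, labels):
    """Accuracy of the original model minus that of the quantized model."""
    return (top1_accuracy(ori_predictions, labels)
            - top1_accuracy(quant_predictions, labels))
```

A positive result means that the quantized model is less accurate than the original on the same test data. Whether the drop is acceptable depends on your application.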
