Accuracy-based Automatic Quantization

Accuracy-based automatic quantization automatically searches for a model quantization configuration and performs post-training quantization (PTQ) to produce a quantized model that yields satisfactory accuracy.

Accuracy-based automatic quantization is similar to Manual Quantization. However, you do not need to manually tune the quantization configuration file, which reduces the optimization workload and makes quantization more efficient. For details about layers that can be quantized and their quantization restrictions, see Uniform Quantization. For the quantization example, see Additional Samples.

API Call Sequence

Figure 1 shows the API call sequence.

Figure 1 API call sequence for automatic quantization

The workflow goes through the following steps:

  1. Generate a quantization configuration file by using the create_quant_config API, and then run accuracy-based automatic quantization by using the accuracy_based_auto_calibration API.
  2. Pass an evaluator instance to the accuracy_based_auto_calibration API call to test the accuracy of the source model.

    In this process, the quantization strategy module in accuracy_based_auto_calibration is called to output the initialized quantization configuration file. The file records all layers that support quantization.

  3. Use the initial quantization configuration file (generated by the create_quant_config API call in step 1) to run PTQ on the model, obtaining the accuracy of the fake-quantized model.
  4. Compare the accuracy of the two models. If the accuracy drop of the fake-quantized model is below the predefined limit, output the quantized model. Otherwise, start the accuracy-based automatic search:
    1. Run inference on the source TensorFlow model and dump the input activations of each layer.
    2. Use the quantization factors obtained after calibration to build single-operator networks of quantization layers. Then, use the buffered activations to calculate the cosine similarity between the output data of each fake-quantized single-operator network and that of the source TensorFlow equivalent.
    3. Pass the cosine similarity list to the quantization strategy module in accuracy_based_auto_calibration. The module rolls back (dequantizes) certain layers from the initial quantization configuration file generated in step 2 and outputs a new quantization configuration file.
    4. Using the new quantization configuration file, run PTQ to obtain a new fake-quantized model.
    5. Analyze the accuracy of the new fake-quantized model by calling the evaluator module in accuracy_based_auto_calibration.
      • If the model accuracy is acceptable, output a fake-quantized model and a deployable model.
      • If the model accuracy is unacceptable, dequantize the layer with the lowest cosine similarity, and go back to step 4.3 to output a new quantization configuration file.
      • If the model accuracy is still unsatisfactory after all layers have been dequantized, quantization is canceled. In this case, no quantized model is generated.
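The rollback loop above can be sketched in a few lines. This is a minimal, framework-free illustration of the idea, not the AMCT internals: the function names, the dict-based data layout, and the `evaluate` callback (which is assumed to return the accuracy drop for a given set of still-quantized layers) are all illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flattened activation tensors."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def auto_search(outputs_fp32, outputs_quant, evaluate, acc_drop_limit):
    """Dequantize the least-similar layer until the accuracy drop is acceptable.

    outputs_fp32 / outputs_quant: dict layer_name -> output activation array
    evaluate: callable(quantized_layers) -> accuracy drop of resulting model
    Returns the list of layers kept quantized, or None if the search fails.
    """
    # Rank layers: the lowest cosine similarity marks the layer whose output
    # is most damaged by quantization, so it is rolled back first.
    sims = {name: cosine_similarity(outputs_fp32[name], outputs_quant[name])
            for name in outputs_fp32}
    quantized = sorted(sims, key=sims.get)  # ascending similarity
    while quantized:
        if evaluate(quantized) < acc_drop_limit:
            return quantized            # acceptable configuration found
        quantized.pop(0)                # dequantize the least similar layer
    return None                         # all layers rolled back: cancel
```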

Figure 2 shows the principles of the accuracy_based_auto_calibration API.

Figure 2 Principles of the accuracy-based automatic quantization API

Examples

This example demonstrates how to use AMCT to perform accuracy-based automatic quantization. In the process, you need to define a callback function that obtains the model inference accuracy. This user-defined callback function is important, as AMCT filters the quantization layers based on the returned accuracy.

  • Take the following steps to get started. Update the sample code based on your situation.
  • Tweak the arguments passed to AMCT API calls as required.
  1. Import the AMCT package and set the log level.
    import amct_tensorflow as amct
    from amct_tensorflow.common.auto_calibration import AutoCalibrationEvaluatorBase
    amct.set_logging_level(print_level="info", save_level="info")
    
  2. Define the following callback functions based on the source model and test dataset: calibration(), evaluate(), and metric_eval(). (Update the sample code based on your situation.)

    The arguments passed to these callback functions must be consistent with those defined by the AutoCalibrationEvaluatorBase base class, where:

    • calibration() calibrates the model by running forward passes.
    • evaluate() evaluates the model accuracy.
    • metric_eval() evaluates the accuracy drop of the fake-quantized model by comparing its accuracy with that of the source model. If the accuracy drop is below the predefined limit, it returns True; otherwise, False.
    class ModelEvaluator(AutoCalibrationEvaluatorBase):
        """Evaluator for the model."""
        def __init__(self, *args, **kwargs):
            # Initialize member variables.
            # Set the accuracy drop limit (expected_acc_loss is user-defined).
            self.diff = expected_acc_loss

        def calibration(self, graph, outputs):
            # Calibrate the model by running batch_num forward passes.
            pass

        def evaluate(self, graph, outputs):  # pylint: disable=R0914
            # Evaluate the input model and return its accuracy metric.
            pass

        def metric_eval(self, original_metric, new_metric):
            # Evaluate the accuracy drop of the fake-quantized model. Return
            # True if the drop is below the predefined limit; False otherwise.
            loss = original_metric - new_metric
            if loss < self.diff:
                return True, loss
            return False, loss
    
  3. Prepare a tf.Graph based on the user_model.pb model file. (Update the sample code based on your situation.)
    ori_graph = user_load_graph()
    
  4. Call AMCT to run accuracy-based automatic quantization.
    1. Generate a quantization configuration file.
      config_file = './tmp/config.json'
      skip_layers = []
      batch_num = 1
      amct.create_quant_config(config_file=config_file,
                               graph=ori_graph,
                               skip_layers=skip_layers,
                               batch_num=batch_num)
      
    2. Initialize an evaluator.
      evaluator = ModelEvaluator()
      
    3. Start automatic search for the model quantization configuration that yields satisfactory accuracy.
      amct.accuracy_based_auto_calibration(ori_pb_model, outputs, record_file, config_file, save_dir, evaluator)
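As a quick sanity check of the metric_eval contract (it returns a (bool, loss) tuple, and AMCT accepts a configuration when the first element is True), the threshold logic can be exercised standalone. The class name, the 0.005 threshold, and the sample accuracy values below are illustrative assumptions, not AMCT defaults:

```python
class ToyEvaluator:
    """Minimal stand-in that mirrors the metric_eval contract."""
    def __init__(self, expected_acc_loss):
        # Accuracy drop limit, e.g. 0.5 percentage points of top-1 accuracy.
        self.diff = expected_acc_loss

    def metric_eval(self, original_metric, new_metric):
        # Return (acceptable, loss): True when the drop stays within the limit.
        loss = original_metric - new_metric
        return loss < self.diff, loss

evaluator = ToyEvaluator(expected_acc_loss=0.005)
ok, loss = evaluator.metric_eval(0.761, 0.758)   # 0.3-point drop: acceptable
```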