Activation Quantization Balance Preprocessing

In scenarios where activations are unevenly distributed, per-tensor quantization of activations incurs a large error because outliers stretch the quantization range, whereas per-channel quantization incurs a much smaller error. However, the current hardware supports per-channel quantization only for weights, not for activations. To reduce the quantization error under this constraint, this section introduces a dedicated method based on AMCT.

Use the quantize_preprocess API to calculate a balance factor and perform a mathematically equivalent transformation between the model's activations and weights, balancing their distributions and migrating some of the quantization difficulty from the activations to the weights, as illustrated in the sketch below. The layers that support this feature are listed in Table 1.
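The transformation is easiest to see on a toy example: dividing each activation channel by a balance factor while multiplying the matching weight rows by the same factor leaves the layer output mathematically unchanged, yet evens out the activation range. The following sketch illustrates this equivalence for a MatMul layer; the variable names and the choice of balance factor are illustrative, not AMCT internals.

    import numpy as np

    # Toy MatMul: activations x of shape [batch, in_ch], weights w of shape [in_ch, out_ch].
    x = np.random.randn(4, 8).astype(np.float32)
    x[:, 0] *= 50.0                           # inject an outlier channel
    w = np.random.randn(8, 16).astype(np.float32)

    # Illustrative per-channel balance factor: each channel's maximum magnitude.
    balance_factor = np.abs(x).max(axis=0)

    # Mathematically equivalent transformation: scale activations down, weights up.
    x_balanced = x / balance_factor           # activation range is now even
    w_balanced = w * balance_factor[:, None]  # difficulty migrated to the weights

    # The layer output is unchanged (up to floating-point rounding).
    assert np.allclose(x @ w, x_balanced @ w_balanced, rtol=1e-4, atol=1e-2)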

Table 1 Supported layers and their restrictions

| Supported Layer Type | Restriction | Remarks |
| --- | --- | --- |
| MatMul | transpose_a = False, transpose_b = False, adjoint_a = False, adjoint_b = False | The weights are of type const and do not have dynamic inputs (such as placeholders). |
| Conv2D | - | - |
| Conv3D | dilation_d = 1, dilation_h/dilation_w ≥ 1 | - |
| DepthwiseConv2dNative | dilation = 1 | - |
| Conv2DBackpropInput | dilation = 1 | - |
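For reference, the following TensorFlow snippet builds a MatMul that satisfies the restrictions above: the transpose and adjoint flags keep their default value of False, and the weight is a const tensor rather than a dynamic input such as a placeholder (all names are illustrative).

    import numpy as np
    import tensorflow as tf

    graph = tf.Graph()
    with graph.as_default():
        # A dynamic activation input is allowed; only the weight must be const.
        x = tf.compat.v1.placeholder(tf.float32, shape=[None, 128], name='input')
        w = tf.constant(np.random.randn(128, 64).astype(np.float32), name='weight')
        # transpose_a/transpose_b (and the adjoint flags) default to False.
        y = tf.matmul(x, w, name='matmul')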

API Call Sequence

Figure 1 shows the API call sequence for balance preprocessing.

Figure 1 API call sequence for balance preprocessing
The user implements the operations in blue, while those in gray are implemented by calling AMCT APIs. Specifically, import the AMCT package into the source TensorFlow network inference code and call the APIs where appropriate to perform quantization. The following details how to use this feature.
  1. Construct a TensorFlow model, set the DMQ parameters in the simplified configuration file dmp_quant.cfg, and import the configuration file.
  2. Based on the TensorFlow model and the quantization configuration file, call quantize_preprocess to insert operators for balanced quantization into the original TensorFlow model; these operators calculate the balanced quantization parameters.
  3. Run inference with the modified model (the output of 2) on the test and calibration datasets in the TensorFlow environment to obtain the balance factor.

    The test dataset is used to test the accuracy of the quantized model in the TensorFlow environment, while the calibration dataset is used to generate a balance factor to ensure accuracy.

  4. Using the quantize_model API, modify the source TensorFlow model based on the quantization configuration file by inserting activation and weight quantization operators, which are used to calculate the quantization parameters.
  5. Run inference with the modified model on the test and calibration datasets in the TensorFlow environment to obtain the quantization factors.
  6. Using the save_model API, insert operators such as AscendQuant and AscendDequant and save the quantized models, which are suitable for accuracy simulation in the TensorFlow environment or deployable on the Ascend AI Processor.

Examples

This section details the activation quantization balance preprocessing template code line by line, helping you understand the AMCT workflow. You can adapt the template code to other network models with just a few tweaks.

  • Take the following steps to get started. Update the sample code based on your situation.
  • Tweak the arguments passed to AMCT API calls as required.
  1. Import the AMCT package and set the log level.
    import amct_tensorflow as amct
    amct.set_logging_level(print_level='info', save_level='info')
    
  2. (Optional) Validate the inference script and environment setup in the source TensorFlow environment. Update the sample code based on your situation.

    You are advised to use the source model to be quantized and related test dataset for running inference in the TensorFlow environment.

    This step is recommended as it guarantees a properly functioning source model for inference with acceptable accuracy. You can use a subset from the test dataset to improve the efficiency.

    user_do_inference(ori_model, test_data, test_iterations)
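
    Here, user_do_inference stands for your own inference code rather than an AMCT API. A minimal sketch of what it might look like, assuming it receives a tf.Graph and that the model exposes an input tensor 'input:0' and an output tensor 'output:0' (all of these names are assumptions):

    import tensorflow as tf

    def user_do_inference(graph, dataset, iterations):
        # Run forward passes on the graph; the tensor names are assumptions.
        with tf.compat.v1.Session(graph=graph) as sess:
            input_tensor = graph.get_tensor_by_name('input:0')
            output_tensor = graph.get_tensor_by_name('output:0')
            for i in range(iterations):
                batch = dataset[i]  # one preprocessed batch per forward pass
                sess.run(output_tensor, feed_dict={input_tensor: batch})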
    
  3. Prepare a tf.Graph based on the user_model.pb model file. (Update the sample code based on your situation.)
    ori_graph = user_load_graph()
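
    Like user_do_inference, user_load_graph is user-supplied. A minimal sketch, assuming a frozen GraphDef stored at user_model.pb:

    import tensorflow as tf

    def user_load_graph():
        # Load the frozen user_model.pb into a new tf.Graph.
        graph = tf.Graph()
        with graph.as_default():
            graph_def = tf.compat.v1.GraphDef()
            with tf.io.gfile.GFile('user_model.pb', 'rb') as f:
                graph_def.ParseFromString(f.read())
            tf.import_graph_def(graph_def, name='')
        return graph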
    
  4. Run AMCT to quantize the model.
    1. Generate a quantization configuration file.
      You can set the DMQ parameters in the simplified configuration file dmp_quant.cfg and import the configuration file through the config_defination parameter.
      config_defination = os.path.join(PATH, 'dmp_quant.cfg')
      config_file = './tmp/config.json'
      skip_layers = []
      batch_num = 1
      amct.create_quant_config(config_file=config_file,
                               graph=ori_graph,
                               skip_layers=skip_layers,
                               batch_num=batch_num,
                               config_defination=config_defination)
      
    2. Modify the graph. Call quantize_preprocess to insert balanced-quantization operators into the graph; they are used to calculate the balanced quantization parameters.
      record_file = './tmp/record.txt'
      amct.quantize_preprocess(graph=ori_graph,
                               config_file=config_file,
                               record_file=record_file)
      


    3. Run inference on the modified graph with the calibration dataset to calculate the balance factor. Update the sample code based on your situation.

      Pay attention to the following points:

      1. Ensure that the calibration dataset and the preprocessed data match the model to preserve the accuracy.
      2. Ensure that the number of forward passes (specified by batch_num) is 1.
      user_do_inference(ori_graph, calibration_data, batch_num)
      

      If the message "Invalid argument: You must feed a value for placeholder tensor **" is displayed, fix the error by referring to "Invalid argument: You must feed a value for placeholder tensor **" Is Displayed During Calibration.

    4. Reload the source model and insert activation and weight quantization operators into the graph; they are used to calculate the quantization parameters.
      ori_graph = user_load_graph()
      amct.quantize_model(graph=ori_graph,
                          config_file=config_file,
                          record_file=record_file)
      

      Calling AMCT's quantize_model API to modify the source TensorFlow model inserts a searchN layer into the model, which means that the output node of the model will be changed. For details, see What Do I Do If My TensorFlow Network Output Node Is Changed by AMCT? If an error message indicating an empty tensor input is displayed during quantization, rectify the fault by referring to "When PTQ is used for quantization, an error message is displayed when an empty tensor is input during quantization".

    5. Run inference on the modified graph based on the calibration dataset to determine the quantization factors. Update the sample code based on your situation.

      Pay attention to the following points:

      1. Ensure that the calibration dataset and the preprocessed data match the model to preserve the accuracy.
      2. Ensure that the number of forward passes (specified by batch_num) is large enough.
      user_do_inference(ori_graph, calibration_data, batch_num)
      

      If the message "Invalid argument: You must feed a value for placeholder tensor **" is displayed, fix the error by referring to "Invalid argument: You must feed a value for placeholder tensor **" Is Displayed During Calibration.

    6. Save the model.
      Call save_model to insert operators such as AscendQuant and AscendDequant and save the quantized models based on the quantization factors.
      quant_model_path = './results/user_model'
      amct.save_model(pb_model='user_model.pb',
                      outputs=['user_model_outputs0', 'user_model_outputs1'],
                      record_file=record_file,
                      save_path=quant_model_path)
      

      If the message "RuntimeError: cannot find shift_bit of layer ** in record_file" is displayed, fix the error by referring to Why Is the Message "RuntimeError: record_file is empty, no layers to be quantized" Displayed During Model Saving?

  5. (Optional) Run inference on the quantized model user_model_quantized.pb in the TensorFlow environment based on the test dataset (test_data) to test the accuracy. (Update the sample code based on your situation.)
    Compare the accuracy of the quantized model with that of the source model (see 2).
    quant_model = './results/user_model_quantized.pb'
    user_do_inference(quant_model, test_data, test_iterations)
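
    As a hypothetical illustration of this comparison, assuming user_do_inference is written to return an accuracy value:

    # Hypothetical usage: user_do_inference is assumed to return accuracy.
    ori_accuracy = user_do_inference(ori_model, test_data, test_iterations)
    quant_accuracy = user_do_inference(quant_model, test_data, test_iterations)
    print('accuracy drop: {:.4f}'.format(ori_accuracy - quant_accuracy))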