Activation Quantization Balance Preprocessing

In scenarios where activations are unevenly distributed, per-tensor quantization of activations incurs a large error due to outliers, whereas per-channel quantization incurs a small error. However, the current hardware supports per-channel quantization only for weights, not for activations. To reduce the quantization error, this section introduces a dedicated method based on AMCT.
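
The effect is easy to reproduce numerically. The following sketch (illustrative only, not part of AMCT) quantizes a tensor containing one outlier channel to INT8 with a per-tensor scale and with per-channel scales, then compares the mean absolute errors:

    import numpy as np

    np.random.seed(0)
    x = np.random.randn(4, 1024).astype(np.float32)  # 4 channels of activations
    x[3] *= 50.0                                     # channel 3 carries outliers

    def fake_quant(data, scale):
        # Symmetric INT8 quantize-dequantize with the given scale.
        return np.clip(np.round(data / scale), -128, 127) * scale

    # Per-tensor: a single scale for all channels, dominated by the outliers.
    scale_tensor = np.abs(x).max() / 127.0
    err_tensor = np.abs(x - fake_quant(x, scale_tensor)).mean()

    # Per-channel: one scale per channel, so small channels keep resolution.
    scale_channel = np.abs(x).max(axis=1, keepdims=True) / 127.0
    err_channel = np.abs(x - fake_quant(x, scale_channel)).mean()

    print(err_tensor, err_channel)  # the per-channel error is markedly smaller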

Use the quantize_preprocess API to calculate a balance factor and apply a mathematically equivalent transformation between the model's activations and weights. This balances their distributions and migrates part of the quantization difficulty from activations to weights.
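
The underlying identity: dividing each activation channel by a positive factor s while multiplying the matching weight rows by s leaves the layer output unchanged. The sketch below illustrates this equivalence; the factor formula used here is an assumption in the spirit of activation-weight smoothing, not AMCT's actual algorithm:

    import numpy as np

    np.random.seed(0)
    x = np.random.randn(8, 16)   # activations
    w = np.random.randn(16, 32)  # weights
    x[:, 5] *= 40.0              # an outlier activation channel

    # One possible per-channel balance factor: equalize the magnitudes of
    # activations and weights (hypothetical choice, for illustration only).
    s = np.sqrt(np.abs(x).max(axis=0) / np.abs(w).max(axis=1))

    x_bal = x / s            # activations become easier to quantize
    w_bal = w * s[:, None]   # weights absorb the migrated difficulty

    # Mathematically equivalent conversion: (x / s) @ (diag(s) @ w) == x @ w
    assert np.allclose(x @ w, x_bal @ w_bal)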

Table 1 Supported layers and their restrictions

| Supported Layer Type | Restriction | Remarks |
| --- | --- | --- |
| Conv | Quantization is supported only for 4D or 5D inputs. | Layers that share weights do not support activation quantization balance preprocessing. |
| Gemm | transpose_a = false, Alpha = Beta = 1.0 | |
| MatMul | Quantization is supported only when the weight has rank 2. | |
| ConvTranspose | Quantization is supported only for 4D inputs. | |
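
As a quick first-pass eligibility check, you can list which of the layer types in Table 1 appear in a model. This sketch uses the onnx package with a placeholder model path; the per-layer restrictions (input rank, attribute values, shared weights) still need to be verified separately:

    import onnx

    model = onnx.load('./model.onnx')  # placeholder path
    candidates = {'Conv', 'Gemm', 'MatMul', 'ConvTranspose'}
    for node in model.graph.node:
        if node.op_type in candidates:
            print(node.op_type, node.name)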

API Call Sequence

Figure 1 shows the API call sequence for balance preprocessing.

Figure 1 API call sequence for balance preprocessing
Operations in blue are implemented by the user; those in gray are implemented through AMCT APIs.
  1. Prepare the original ONNX model, set the DMQ parameters in the simplified configuration file dmp_quant.cfg, and import the configuration file.
  2. Call the quantize_preprocess API to optimize the original ONNX model before quantization based on the ONNX model and the quantization configuration file. Balanced quantization migrates the quantization difficulty from activations to weights.
  3. Calibrate the model by running a single forward pass on the calibration dataset in the ONNX Runtime environment to complete balanced quantization and save the balance factor to a record file.
  4. Based on the ONNX model, the quantization configuration file, and the record file, optimize the source ONNX model (including Conv+BN fusion) by using the quantize_model API. Then insert weight and activation quantization operators for weight quantization and calibration.
  5. Calibrate the model by running forward passes on the calibration dataset in the ONNX Runtime environment to complete activation quantization and save quantization factors to a file.
  6. Call the save_model API to save the quantized model, including the fake-quantized model file for the ONNX Runtime environment and the deployable model file for the Ascend AI Processor.

Examples

  • Take the following steps to get started. Update the sample code based on your situation.
  • Tweak the arguments passed to AMCT API calls as required.
  1. Import the AMCT package and set the log level (see "Set the environment variable" for details).
    import amct_onnx as amct
    
  2. (Optional) Validate the inference script and environment setup in the source ONNX Runtime environment. Update the sample code based on your situation.

    You are advised to use the source model to be quantized and related test dataset for running inference in the ONNX Runtime environment.

    This step is recommended because it verifies that the source model runs properly and delivers acceptable accuracy before quantization. You can use a subset of the test dataset to improve efficiency. (A sketch of one possible user_do_inference helper follows the code below.)

    user_do_inference(ori_model, test_data, test_iterations)
    
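    Because user_do_inference is user-supplied, the following is only a minimal sketch using onnxruntime. It assumes the model has a single input and that the data iterable yields (input_batch, labels) pairs; adapt the preprocessing and metric to your model.

    import onnxruntime as ort

    def user_do_inference(model_path, data, test_iterations):
        # Run up to test_iterations batches and return top-1 accuracy.
        session = ort.InferenceSession(model_path)
        input_name = session.get_inputs()[0].name
        correct = total = 0
        for i, (batch, labels) in enumerate(data):
            if i >= test_iterations:
                break
            logits = session.run(None, {input_name: batch})[0]
            correct += int((logits.argmax(axis=1) == labels).sum())
            total += len(labels)
        return correct / total
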
  3. Run AMCT to quantize the model.
    1. Generate a quantization configuration file.
      You can set the DMQ parameters in the simplified configuration file dmp_quant.cfg and import the configuration file through the config_defination parameter.
      config_defination = os.path.join(PATH, 'dmp_quant.cfg')
      config_file = './tmp/config.json'
      skip_layers = []
      batch_num = 1
      amct.create_quant_config(config_file=config_file,
                               model=ori_model,
                               input_data=ori_model_input_data,
                               skip_layers=skip_layers,
                               batch_num=batch_num,
                               config_defination=config_defination)
      
    2. Modify the graph: call the quantize_preprocess API to insert the operators that calculate the balance factor into the graph.
      record_file = './tmp/record.txt'
      modified_model = './tmp/modified_model.onnx'
      amct.quantize_preprocess(config_file=config_file,
                               record_file=record_file,
                               model_file=ori_model,
                               modified_onnx_file=modified_model)
      
    3. Run inference on the modified graph based on the calibration dataset to determine the balanced quantization factors. Update the sample code based on your situation.

      Pay attention to the following points:

      1. Ensure that the calibration dataset and the preprocessed data match the model to preserve the accuracy.
      2. Ensure that the number of forward passes is 1. If it exceeds 1, the subsequent process will fail because the balance factor is recorded on every inference run.
      user_do_inference(modified_model, calibration_data, test_iterations=1)
      
    4. Modify the graph to insert activation and weight quantization operators for quantization parameter calculation.
      modified_model = './tmp/modified_model.onnx'
      amct.quantize_model(config_file=config_file,
                          model_file=ori_model,
                          modified_onnx_file=modified_model,
                          record_file=record_file)
      
    5. Run inference on the modified graph based on the calibration dataset to determine the quantization factors. Update the sample code based on your situation.

      Pay attention to the following points:

      1. Ensure that the calibration dataset and the preprocessed data match the model to preserve the accuracy.
      2. Ensure that the number of forward passes (specified by batch_num) is large enough.

      Refer to Why Am I Seeing the Message "IFMR node. Name:'layer_ifmr_op' Status Message: std::bad_alloc"? or Why Am I Seeing the "killed" Message During Calibration? for troubleshooting.

      user_do_inference(modified_model, calibration_data, batch_num)
      
    6. Save the model.
      Call the save_model API to insert operators such as AscendQuant and AscendDequant into the modified model and save the resultant model based on the quantization factors.
      quant_model_path = './results/user_model'
      amct.save_model(modified_onnx_file=modified_model,
                      record_file=record_file,
                      save_path=quant_model_path)
      
  4. (Optional) Run inference on the quantized model (quant_model) in the ONNX Runtime environment based on the test dataset (test_data) to test the accuracy. Update the sample code based on your situation.
    Compare the accuracy of the fake-quantized model with that of the source model from step 2; a minimal comparison sketch follows the code below.
    quant_model = './results/user_model_fake_quant_model.onnx'
    user_do_inference(quant_model, test_data, test_iterations)
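
    For instance, assuming user_do_inference returns top-1 accuracy (as in the earlier sketch), the comparison can be as simple as:

    # Hypothetical comparison; assumes user_do_inference returns top-1 accuracy.
    ori_acc = user_do_inference(ori_model, test_data, test_iterations)
    quant_acc = user_do_inference(quant_model, test_data, test_iterations)
    print('top-1 accuracy: original {:.2%} -> fake-quantized {:.2%}'.format(
        ori_acc, quant_acc))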