QAT Model Adaptation to CANN Format

An already-quantized source ONNX model is referred to as a QAT (quantization-aware training) model. Before using ATC to generate an offline model adapted to the Ascend AI Processor, convert the QAT model into the CANN format with the function described in this section, and then use the ATC tool to convert the resulting CANN quantized model into an adapted offline model.

Note the following restrictions:

  • The source QAT model must contain FakeQuant layers (QuantizeLinear and DequantizeLinear). Channel-wise quantization takes effect on weights only. Each QuantizeLinear-DequantizeLinear pair must use the same quantization factors.
  • Only the Conv, Gemm, MatMul, and ConvTranspose layers can match fake_quant nodes, which means only these layers are adaptable. For details about the layer restrictions, see Table "Layers that support uniform quantization as well as their restrictions". A quick way to check whether a model contains QuantizeLinear/DequantizeLinear nodes is sketched after this list.
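
For reference, the following minimal sketch uses the onnx Python package to check whether a model contains QuantizeLinear/DequantizeLinear (FakeQuant) nodes before adaptation. The model path reuses the sample path from the example below; adjust it to your model.

    import onnx

    # Load the QAT model and list its FakeQuant (QuantizeLinear/DequantizeLinear) nodes.
    model = onnx.load("./pre_model/mobilenet_v2_qat.onnx")
    qdq_nodes = [node for node in model.graph.node
                 if node.op_type in ("QuantizeLinear", "DequantizeLinear")]
    print("QuantizeLinear/DequantizeLinear nodes found:", len(qdq_nodes))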

Adaptation Principles

Figure 1 shows the adaptation principle. The operations in blue are implemented by the user, while those in gray are implemented by AMCT's convert_qat_model API. Specifically, import the AMCT package into the source ONNX QAT network inference code and call the API where appropriate to adapt the model. For an adaptation example, see Sample List.

Figure 1 QAT model adaptation to Ascend format

Examples

This example details how to use AMCT to convert an ONNX QAT model to a CANN representation.

  • Take the following steps to get started. Update the sample code based on your situation.
  • Tweak the arguments passed to AMCT API calls as required.
  1. Import the AMCT package and set the log level (see Set the environment variable for details).
    import amct_onnx as amct
    
  2. (Optional) Validate the inference script and environment setup in the source ONNX Runtime environment. Update the sample code based on your situation.

    You are advised to run inference in the ONNX Runtime environment with the source model to be quantized and its test dataset.

    This step is recommended because it verifies that the source model runs correctly and delivers acceptable inference accuracy. You can use a subset of the test dataset to improve efficiency. A sketch of a possible user_do_inference helper is given after the call below.

    user_do_inference(ori_model, test_data, test_iterations)
    
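    Note that user_do_inference is a user-defined helper, not an AMCT API. The following is a minimal sketch of what it might look like for an image-classification model, assuming onnxruntime is installed, the first argument is a model file path, and test_data yields (input_batch, labels) pairs; adapt it to your own preprocessing and metrics.

    import numpy as np
    import onnxruntime as ort

    def user_do_inference(model_path, test_data, test_iterations):
        """Run inference for test_iterations batches and report top-1 accuracy."""
        session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
        input_name = session.get_inputs()[0].name
        correct = total = 0
        for i, (batch, labels) in enumerate(test_data):
            if i >= test_iterations:
                break
            outputs = session.run(None, {input_name: batch.astype(np.float32)})
            preds = np.argmax(outputs[0], axis=1)
            correct += int(np.sum(preds == labels))
            total += len(labels)
        print("Top-1 accuracy: {:.4f}".format(correct / total))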
  3. Call AMCT's convert_qat_model API.
    This API parses the model to be adapted into a graph, preprocesses the graph, modifies the parsed graph structure, inserts operators such as AscendQuant and AscendDequant, and saves the model as a quantized model.
    model_file = "./pre_model/mobilenet_v2_qat.onnx"
    save_path = "./results/model"
    amct.convert_qat_model(model_file, save_path)
    
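    After convert_qat_model returns, the files written under save_path include the fake-quantized model used for accuracy verification in the next step and the deployable model that is later converted with ATC. The exact file names depend on the AMCT version; the check below assumes the common "_fake_quant_model.onnx" and "_deploy_model.onnx" suffixes, so verify the names in your results directory.

    import os

    # Assumed output names; confirm them in ./results after the conversion.
    fake_quant_model = save_path + "_fake_quant_model.onnx"   # for ONNX Runtime accuracy checks
    deploy_model = save_path + "_deploy_model.onnx"           # input to the ATC tool
    for path in (fake_quant_model, deploy_model):
        print(path, "exists:", os.path.exists(path))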
  4. (Optional) Run inference on the fake-quantized model in the ONNX Runtime environment with the test dataset to check its accuracy. (Update the sample code based on your situation.)

    Compare the accuracy of the fake-quantized model with that of the source model (see step 2).

    quant_model = './results/model_fake_quant_model.onnx'
    user_do_inference(quant_model, test_data, test_iterations)
    

    If the inference accuracy of the deployable model differs greatly from that of the fake-quantized model after model adaptation, the cause is usually the bias: for on-board inference with the deployable model, the bias is quantized, while for fake-quantized model inference, the bias is quantized and then dequantized, so it may differ from the bias in the original model and lead to an accuracy gap. In this scenario, you are advised to quantize and dequantize the bias of the original model before model adaptation. The quantization and dequantization formula is as follows:

    round(bias/(scale_d*scale_w))*(scale_d*scale_w)
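
    As a minimal sketch of this bias quantize-dequantize step (assuming NumPy and illustrative scale values; scale_d and scale_w denote the data and weight quantization scales recorded in the FakeQuant layers):

    import numpy as np

    def fake_quant_bias(bias, scale_d, scale_w):
        """round(bias / (scale_d * scale_w)) * (scale_d * scale_w), element-wise."""
        scale = scale_d * scale_w  # per-channel if scale_w is per-channel
        return np.round(bias / scale) * scale

    # Illustrative values only: apply to a convolution bias before model adaptation.
    bias = np.array([0.013, -0.207, 0.450])
    print(fake_quant_bias(bias, scale_d=0.02, scale_w=np.array([0.001, 0.002, 0.0015])))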