QAT Model Adaptation to CANN Format

If you implemented quantization-aware training (QAT) with the native ONNX operators QuantizeLinear and DequantizeLinear, the resulting QAT model must be converted before ATC can generate an offline model for the Ascend AI Processor. First use the function described in this section to convert the QAT model into the CANN format, and then use ATC to convert the CANN quantized model into an offline model adapted to the Ascend AI Processor.

Table 1 Supported scenarios and restrictions

Scenario: The QuantizeLinear-DequantizeLinear graph structure quantizes both the inputs and the weights of the quantized operator. The input quantization operator is replaced with AscendQuant, constant weights are quantized offline to the target data type, and an AscendDequant operator is inserted on the output side for dequantization.

Supported Layer Type:

  • Conv
  • Gemm
  • ConvTranspose
  • Single-input MatMul

Restriction:

  • For details about layer restrictions, see Uniform Quantization. Only the float32 input data type is supported.
  • Conv and ConvTranspose: Only per-channel and per-tensor weight quantization is supported.
  • Gemm: Only per-tensor weight quantization is supported.
  • Single-input MatMul: Only per-tensor weight quantization is supported. The weights must be 2D constants.

Scenario: The QuantizeLinear-DequantizeLinear graph structure quantizes one or two inputs of the operator. The quantization operators on the operator input side are replaced with AscendQuant-AscendAntiquant.

Supported Layer Type: Add

Restriction: Add: Only per-tensor quantization is supported. The input data type must be float32.

Scenario: When a QuantizeLinear operator with a single output is a model output operator rather than an intermediate-layer operator, it does not need to be paired with a DequantizeLinear operator during model adaptation. The QuantizeLinear operator is replaced with AscendQuant to quantize the model output (single output).

Supported Layer Type: -

Restriction: -

Note:

  • If a QuantizeLinear operator is not a model output, the QAT model can be adapted only if its FakeQuant layers consist of paired QuantizeLinear and DequantizeLinear operators. Per-channel quantization is supported only for weights. Paired QuantizeLinear and DequantizeLinear layers must use the same quantization factors.
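To make the pairing requirement and the per-tensor versus per-channel distinction concrete, the following is a minimal pure-Python sketch of what a QuantizeLinear/DequantizeLinear pair computes for int8, following the ONNX operator definitions (round half to even, then saturate to [-128, 127]). The helper names `quantize_linear`, `dequantize_linear`, `fake_quant`, and `fake_quant_per_channel` are illustrative only; they are not part of AMCT or ONNX Runtime, and the sample values are made up.

```python
def quantize_linear(x, scale, zero_point=0, qmin=-128, qmax=127):
    """ONNX-style QuantizeLinear for one value: round half to even, then saturate."""
    q = round(x / scale) + zero_point  # Python's round() rounds half to even
    return max(qmin, min(qmax, q))


def dequantize_linear(q, scale, zero_point=0):
    """ONNX-style DequantizeLinear for one value."""
    return (q - zero_point) * scale


def fake_quant(values, scale, zero_point=0):
    """Per-tensor fake quantization: the Q and DQ steps share one scale."""
    return [
        dequantize_linear(quantize_linear(v, scale, zero_point), scale, zero_point)
        for v in values
    ]


def fake_quant_per_channel(weights, scales):
    """Per-channel fake quantization: one scale per output channel (row)."""
    return [fake_quant(row, s) for row, s in zip(weights, scales)]


# Per-tensor: a single scale for the whole activation tensor.
activations = [0.26, -1.0, 40.0]
per_tensor = fake_quant(activations, scale=0.25)

# Per-channel: each weight row (output channel) has its own scale.
weights = [[0.26, -1.0], [0.026, -0.1]]
per_channel = fake_quant_per_channel(weights, scales=[0.25, 0.025])
```

Because the quantize and dequantize steps use the same scale, the pair snaps each value onto a grid with step `scale` (and clips values outside the int8 range, as with `40.0` above). If the two operators used different scales, the dequantized value would no longer approximate the input.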

Adaptation Principles

Figure 1 shows the adaptation principles. The user implements the operations in blue, while those in gray are implemented by the convert_qat_model API in AMCT. Specifically, import the AMCT package into the inference code of the source ONNX QAT network and call the API at the appropriate point to adapt the model. For an adaptation example, see Sample List.

Figure 1 QAT Model Adaptation to CANN Format

Example

This example shows how to use AMCT to convert an ONNX QAT model to the CANN format.

  • Take the following steps to get started. Update the sample code based on your situation.
  • Tweak the arguments passed to AMCT API calls as required.
  1. Import the AMCT package and set the log level (see Setting Environment Variables for details).
    import amct_onnx as amct
    
  2. (Optional) Validate the inference script and environment setup in the source ONNX Runtime environment. (Update the sample code based on your situation.)

    You are advised to run inference on the original model (the model to be adapted) in the ONNX Runtime environment with the test dataset to validate the environment setup and the inference script.

    This step is recommended because it confirms that the original model runs inference correctly with acceptable accuracy. You can use a subset of the test dataset to save time.

    user_do_inference(ori_model, test_data, test_iterations)
    
  3. Call the convert_qat_model API in AMCT to perform model adaptation.
    This API parses the model to be adapted into a graph, preprocesses and modifies the graph, inserts operators such as AscendQuant and AscendDequant into the modified graph, and then saves the quantized model.
    # Path to the source ONNX QAT model.
    model_file = "./pre_model/mobilenet_v2_qat.onnx"
    # Save path for the adapted model files.
    save_path = "./results/model"
    amct.convert_qat_model(model_file, save_path)
    
  4. (Optional) Run inference on the fake-quantized model in the ONNX Runtime environment based on the test dataset to test the accuracy. (Update the sample code based on your situation.)

    Compare the accuracy of the fake-quantized model with that of the original model from step 2 to check the accuracy drop caused by quantization.

    quant_model = './results/model_fake_quant_model.onnx'
    user_do_inference(quant_model, test_data, test_iterations)
    

    If the inference accuracy of the deployable model and the fake-quantized model differs greatly from that of the original model, the reason is bias handling. Inference on the deployable model quantizes the bias, and inference on the fake-quantized model quantizes and then dequantizes the bias, so the bias that is actually applied may differ from the bias in the original model. In this case, you are advised to apply bias quantization and dequantization to the original model before model adaptation, using the following formula:

    round(bias/(scale_d × scale_w)) × (scale_d × scale_w)
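The formula above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not AMCT code: the helper name `quant_dequant_bias` is hypothetical, `scale_d` is taken to be the per-tensor data (activation) scale and `scale_w` the weight scale (one value, or one value per output channel), and rounding uses Python's built-in `round`, which rounds half to even. The source does not specify the rounding mode or any saturation, so treat those details as assumptions.

```python
def quant_dequant_bias(bias, scale_d, scale_w):
    """Apply round(bias / (scale_d * scale_w)) * (scale_d * scale_w) per element.

    bias    -- list of float bias values, one per output channel
    scale_d -- data (activation) scale, a single float (assumed per-tensor)
    scale_w -- weight scale: a single float (per-tensor) or a list with one
               value per output channel (per-channel)
    """
    if isinstance(scale_w, (int, float)):
        scale_w = [scale_w] * len(bias)
    if len(scale_w) != len(bias):
        raise ValueError("scale_w must have one entry per bias element")

    result = []
    for b, s_w in zip(bias, scale_w):
        step = scale_d * s_w  # quantization step of the bias
        result.append(round(b / step) * step)
    return result


# Hypothetical values: two output channels with per-channel weight scales.
bias = [0.3, -1.26]
adjusted = quant_dequant_bias(bias, scale_d=0.5, scale_w=[0.25, 0.5])
```

With these sample values the bias steps are 0.125 and 0.25, so the adjusted bias becomes `[0.25, -1.25]`. Writing the adjusted values back into the original model before adaptation makes the original, fake-quantized, and deployable models use the same effective bias.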