Workflow

This section lists the layers that support QAT and describes the API call sequence, followed by an example.

Currently, QAT supports quantization only for float32 network models. The layers that support QAT are listed as follows. For the quantization sample, see Sample List.

Table 1 Layers that support QAT as well as their restrictions

| Supported Layer Type | Restriction | Remarks |
|---|---|---|
| MatMul | transpose_a = False, transpose_b = False, adjoint_a = False, adjoint_b = False | - |
| Conv2D | Given hardware restrictions, do not perform QAT when the number of input channels (Cin) in the original model is less than or equal to 16, as this may hurt the quantized deployable model's inference accuracy. | |
| DepthwiseConv2dNative | (1) Given hardware restrictions, do not perform QAT when the number of input channels (Cin) in the original model is less than or equal to 16, as this may hurt the quantized deployable model's inference accuracy. (2) For the DepthwiseConv2dNative layer: If strides is greater than 1 and dilation is greater than 1, the shape of the CPU/GPU inference result is incorrect in TensorFlow 1.15 and 2.6.5. This is a known issue of TensorFlow and not caused by the AMCT. If only one of strides and dilation is greater than 1, the inference result is correct. | |
| Conv2DBackpropInput | (1) dilation = 1. (2) Given hardware restrictions, do not perform QAT when the number of input channels (Cin) in the original model is less than or equal to 16, as this may hurt the quantized deployable model's inference accuracy. | |
| AvgPool | - | Only INT8 quantization is supported. |
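
The restrictions in Table 1 can be checked before quantization. The following is a minimal sketch in plain Python, not part of the AMCT API: the helper name check_qat_layer, the attribute dictionary, and the cin argument are hypothetical, and the sketch assumes that strides and dilations are passed as lists of integers.

```python
# Hypothetical helper that encodes the Table 1 restrictions as plain checks.
# It is not part of the AMCT API.

SUPPORTED_TYPES = {
    "MatMul", "Conv2D", "DepthwiseConv2dNative", "Conv2DBackpropInput", "AvgPool",
}
CONV_TYPES = {"Conv2D", "DepthwiseConv2dNative", "Conv2DBackpropInput"}


def check_qat_layer(op_type, attrs=None, cin=None):
    """Return (supported, notes) for one layer, following Table 1."""
    attrs = attrs or {}
    if op_type not in SUPPORTED_TYPES:
        return False, [f"{op_type} is not listed in Table 1"]
    if op_type == "MatMul":
        # All four flags must be False for MatMul.
        for flag in ("transpose_a", "transpose_b", "adjoint_a", "adjoint_b"):
            if attrs.get(flag, False):
                return False, [f"{flag} must be False"]
    # Assumption: strides and dilations are given as lists of ints.
    strides = attrs.get("strides", [1])
    dilations = attrs.get("dilations", [1])
    if op_type == "Conv2DBackpropInput" and max(dilations) != 1:
        return False, ["dilation must be 1"]
    notes = []
    if op_type in CONV_TYPES and cin is not None and cin <= 16:
        notes.append("Cin <= 16: QAT may hurt inference accuracy")
    if (op_type == "DepthwiseConv2dNative"
            and max(strides) > 1 and max(dilations) > 1):
        notes.append("strides > 1 with dilation > 1: known TensorFlow shape issue")
    return True, notes
```

For example, check_qat_layer("Conv2D", cin=3) reports the layer as supported but attaches a note about the small channel count, so you can decide whether to skip QAT for that layer.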

API Call Sequence

Figure 1 shows the API call sequence of QAT. Training runs in the CPU/GPU environment of the TensorFlow framework. Based on the inference script of the open-source framework, the AMCT API is called to compress the model. Before the compressed model can be used for inference on the Ascend AI Processor, it must be converted with ATC into an offline model adapted to the Ascend AI Processor.

Figure 1 API call sequence

The user implements the operations in blue, while those in gray are implemented by using AMCT APIs. Specifically, import the package into the source TensorFlow network inference code and call the APIs where appropriate for quantization.

The workflow goes through the following steps:

  1. Construct a training graph and then call the create_quant_retrain_config API to generate a quantization configuration file.
  2. Call the create_quant_retrain_model API to modify the training graph before quantization based on the quantization configuration file, including inserting activation and weight quantization operators.
  3. Train the model and save the trained parameters as a checkpoint file.
  4. Call the create_quant_retrain_model API to modify the inference graph, including inserting the activation and weight quantization operators.
  5. Restore the training parameters, load the .ckpt file, infer the output node of the quantized model, write quantization factors to the record file, and freeze the inference graph into a .pb model.
  6. Call the save_quant_retrain_model API to insert quantization operators such as AscendQuant and AscendDequant and save the quantized model.
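
The activation and weight quantization operators mentioned in the steps above let the network see quantization error during training. As a rough illustration only, the following sketch shows generic symmetric INT8 quantize-dequantize ("fake quantization") in plain Python. The function name fake_quantize and the per-tensor symmetric scheme are assumptions made for this example; the operators that AMCT inserts and the quantization factors it records may be computed differently.

```python
def fake_quantize(values, num_bits=8):
    """Quantize floats to signed integers and map them back to floats.

    Generic symmetric, per-tensor scheme used for illustration only.
    """
    qmax = 2 ** (num_bits - 1) - 1   # 127 for INT8
    qmin = -qmax - 1                 # -128 for INT8
    max_abs = max(abs(v) for v in values)
    # Avoid a zero scale when all inputs are zero.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the integer range.
    quantized = [min(qmax, max(qmin, round(v / scale))) for v in values]
    # Map the integers back to floats; the difference from the input
    # is the quantization error that training learns to tolerate.
    dequantized = [q * scale for q in quantized]
    return quantized, dequantized, scale


quantized, dequantized, scale = fake_quantize([-1.0, 0.0, 0.25, 1.0])
errors = [abs(a - b) for a, b in zip([-1.0, 0.0, 0.25, 1.0], dequantized)]
```

In this scheme the rounding error of each value is at most half of scale, which is why a well-chosen scale matters for accuracy.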

Example

  • Take the following steps to get started. Update the sample code based on your situation.
  • Tweak the arguments passed to AMCT API calls as required. QAT relies on the user training result. Ensure that a TensorFlow training script that yields satisfactory training accuracy is available.
  1. Import the AMCT package and set the log level.
    import tensorflow as tf
    import amct_tensorflow as amct
    amct.set_logging_level(print_level='info', save_level='info')
    
  2. (Optional) Build a graph, read the trained parameters, and run inference on the graph in the TensorFlow environment to validate the inference script and environment setup. (Update the sample code based on your situation.)

    This step is recommended as it guarantees a properly functioning original model for inference with acceptable accuracy. You can use a subset from the test dataset to improve the efficiency.

    user_test_evaluate_model(evaluate_model, test_data)
    
  3. Build a training graph. (Update the sample code based on your situation.)
    train_graph = user_load_train_graph()
    
  4. Run AMCT to perform training with quantization parameters.
    1. Generate a quantization configuration file.
      Based on the training graph (that is, set is_training to True for BN), call the create_quant_retrain_config API to generate a quantization configuration file (corresponding to 1 in Figure 1).
      config_file = './tmp/config.json'
      simple_cfg = './retrain.cfg'
      amct.create_quant_retrain_config(config_file=config_file,
                                       graph=train_graph,
                                       config_defination=simple_cfg)
      
    2. Modify the training graph.
      Call the create_quant_retrain_model API to modify the training graph before quantization based on the quantization configuration file, that is, insert activation and weight quantization operators in the graph to calculate quantization parameters (corresponding to 2 in Figure 1).
      record_file = './tmp/record.txt'
      retrain_ops = amct.create_quant_retrain_model(graph=train_graph,
                                                    config_file=config_file,
                                                    record_file=record_file)
      
    3. Implement gradient descent optimization on the modified graph, train the graph on the training dataset, and train quantization factors. (Update the sample code based on your situation.)
      1. Call RMSPropOptimizer to implement gradient descent optimization. Perform this step after 4.b.
        # loss and ARGS come from your own training script.
        optimizer = tf.compat.v1.train.RMSPropOptimizer(
            ARGS.learning_rate, momentum=ARGS.momentum)
        train_op = optimizer.minimize(loss)
        
      2. Create a session to train the model, and save the trained parameters as a checkpoint file (corresponding to 3 and 4 in Figure 1).
        Note: Restore the model parameters from existing checkpoints before training. The parameters saved during training should include the quantization factors. Quantization factors are generated only after the first batch_num training iterations. If the model is trained for fewer than batch_num iterations, the training fails.
        # outputs, saver_save, and retrain_ckpt are defined in your training script.
        with tf.compat.v1.Session() as sess:
            sess.run(tf.compat.v1.global_variables_initializer())
            sess.run(outputs)
            # Save the trained parameters as a checkpoint file.
            saver_save.save(sess, retrain_ckpt, global_step=0)
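
Because training fails when it runs for fewer than batch_num iterations, it can help to validate the planned iteration count before starting a session. The following is a minimal sketch; check_training_steps is a hypothetical helper, not an AMCT API, and it assumes you know the batch_num value used by your quantization configuration.

```python
def check_training_steps(planned_steps, batch_num):
    """Return the number of iterations left after quantization factors exist.

    Raises ValueError if training would stop before batch_num iterations,
    that is, before the quantization factors are generated.
    """
    if batch_num <= 0:
        raise ValueError("batch_num must be a positive integer")
    if planned_steps < batch_num:
        raise ValueError(
            f"planned_steps ({planned_steps}) is less than batch_num "
            f"({batch_num}); quantization factors would not be generated")
    return planned_steps - batch_num
```

Call it once with the values from your training script, for example check_training_steps(1000, 2), before entering the training loop.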
        
  5. Build an inference graph. (Update the sample code based on your situation.)
    1
    test_graph = user_load_test_graph()
    
  6. Run AMCT to perform QAT.
    1. Modify the inference graph.

      Call the create_quant_retrain_model API to modify the inference graph (with is_training set to False for BN) before quantization based on the quantization configuration file, that is, insert activation and weight quantization operators into the graph (corresponding to 5 in Figure 1).

      record_file = './tmp/record.txt'
      # Pass the inference graph built in step 5, not the training graph.
      retrain_ops = amct.create_quant_retrain_model(graph=test_graph,
                                                    config_file=config_file,
                                                    record_file=record_file)
      
    2. Create a session to restore the training parameters, infer the output node (retrain_ops[-1]) of the quantized model, write the quantization factors to the record file, and freeze the inference graph into a .pb model (corresponding to 6 and 7 in Figure 1). (Update the sample code based on your situation.)
      Note: Inferring the output tensor of retrain_ops[-1] and restoring the parameters must happen in the same session. The .pb model generated from an inference graph contains the trained parameter values.
      variables_to_restore = tf.compat.v1.global_variables()
      saver_restore = tf.compat.v1.train.Saver(variables_to_restore)
      with tf.compat.v1.Session() as sess:
          sess.run(tf.compat.v1.global_variables_initializer())
          # Restore training parameters.
          saver_restore.restore(sess, retrain_ckpt)
          # Infer the quantization output node (retrain_ops[-1]) and write quantization factors to the record file.
          sess.run(retrain_ops[-1])
          # Save the model as a .pb model. output.name[:-2] drops the ":0" suffix of each tensor name to get the node name.
          constant_graph = tf.compat.v1.graph_util.convert_variables_to_constants(
              sess, test_graph.as_graph_def(), [output.name[:-2] for output in outputs])
          with tf.io.gfile.GFile(frozen_quant_eval_pb, 'wb') as f:
              f.write(constant_graph.SerializeToString())
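
The freezing call above expects node names, while TensorFlow tensor names have the form node_name:output_index. Slicing off the last two characters works only when the output index has a single digit. The following sketch shows a slightly more defensive conversion in plain Python; node_name is a hypothetical helper introduced here for illustration.

```python
def node_name(tensor_name):
    """Convert a tensor name such as 'logits/BiasAdd:0' to its node name."""
    name, sep, index = tensor_name.rpartition(":")
    # Strip the suffix only if it really is ':<output index>'.
    if sep and index.isdigit():
        return name
    return tensor_name


# Fixed slicing assumes a one-digit output index.
single_digit = "logits/BiasAdd:0"
double_digit = "split:10"
sliced = double_digit[:-2]          # leaves a trailing ':' behind
converted = node_name(double_digit) # handles any index width
```

With this helper, the list of output node names could be built as [node_name(output.name) for output in outputs].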
      
    3. Save the quantized model.
      Call the save_quant_retrain_model API to insert operators such as AscendQuant and AscendDequant based on the quantization factors and the .pb model, and then save the quantized model (corresponding to 8 in Figure 1).
      quant_model_path = './result/user_model'
      # trained_pb is the path of the .pb model frozen in the previous step.
      amct.save_quant_retrain_model(pb_model=trained_pb,
                                    outputs=user_model_outputs,
                                    record_file=record_file,
                                    save_path=quant_model_path)
      
  7. (Optional) Run inference on the quantized model user_model_quantized.pb in the TensorFlow environment based on the test dataset to test the accuracy. (Update the sample code based on your situation.)
    Compare the accuracy of the fake-quantized model with that of the original model (see step 2).
    quant_model = './result/user_model_quantized.pb'
    user_do_inference(quant_model, test_data)
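
To make the comparison with the original model concrete, one option is to compute the top-1 accuracy of both models on the same labeled test subset and report the difference. The following is a minimal sketch in plain Python; top1_accuracy and accuracy_drop are hypothetical helpers, and the prediction lists are assumed to hold the predicted class index for each sample, as collected by your own inference functions.

```python
def top1_accuracy(predictions, labels):
    """Fraction of samples whose predicted class equals the label."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(1 for pred, label in zip(predictions, labels) if pred == label)
    return correct / len(labels)


def accuracy_drop(original_preds, quantized_preds, labels):
    """Accuracy of the original model minus accuracy of the quantized model."""
    return top1_accuracy(original_preds, labels) - top1_accuracy(quantized_preds, labels)


# Toy data for illustration only, not measured results.
labels = [0, 1, 2, 1]
original_preds = [0, 1, 2, 1]
quantized_preds = [0, 1, 2, 0]
drop = accuracy_drop(original_preds, quantized_preds, labels)
```

A positive drop means the quantized model is less accurate than the original; whether the drop is acceptable depends on your application.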