Quantization Process

This section describes the layers that QAT can quantize, the API call sequence, and an example.

Currently, QAT supports quantization only for FP32 network models. For details about the quantization example, see Sample List. The layers that support quantization and their restrictions are as follows:

Table 1 Layers that support QAT and restrictions

  • MatMul: transpose_a = False, transpose_b = False, adjoint_a = False, adjoint_b = False
  • Conv2D: Given hardware restrictions, do not perform QAT when the number of input channels (Cin) in the source model is less than or equal to 16, as this may hurt the quantized model's inference accuracy.
  • DepthwiseConv2dNative: The same Cin restriction as Conv2D applies.
  • Conv2DBackpropInput: dilation = 1; the same Cin restriction as Conv2D applies.
  • AvgPool: No restrictions.
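The Cin restriction can be checked programmatically before enabling QAT. The following is a minimal sketch (not an AMCT API; the helper name find_small_cin_layers and the NHWC data-format assumption are ours) that scans a built TensorFlow graph for the layer types listed above and flags those whose input channel count is 16 or less.

    import tensorflow as tf

    def find_small_cin_layers(graph, cin_threshold=16):
        """List (name, type, Cin) for conv-type ops whose Cin <= cin_threshold."""
        conv_types = ('Conv2D', 'DepthwiseConv2dNative', 'Conv2DBackpropInput')
        flagged = []
        for op in graph.get_operations():
            if op.type not in conv_types:
                continue
            # Conv2DBackpropInput takes (input_sizes, filter, out_backprop);
            # its feature map is the third input. The other two take it first.
            feature_map = op.inputs[2] if op.type == 'Conv2DBackpropInput' else op.inputs[0]
            if feature_map.shape.rank is None:
                continue  # shape unknown; cannot check statically
            cin = feature_map.shape.as_list()[-1]  # assumes NHWC data format
            if cin is not None and cin <= cin_threshold:
                flagged.append((op.name, op.type, cin))
        return flagged

For example, find_small_cin_layers(train_graph) returns the layers that, per Table 1, should be excluded from QAT.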

API Call Sequence

Figure 1 shows the API call sequence for QAT.

Figure 1 API Call Sequence

The operations in blue are implemented by the user, while those in gray are implemented by calling AMCT APIs. Specifically, import the AMCT package into the source TensorFlow network inference code and call its APIs where appropriate to implement quantization.

The main steps are as follows:

  1. Construct a training graph and then call the create_quant_retrain_config API to generate a quantization configuration file.
  2. Call the create_quant_retrain_model API to modify the training graph before quantization based on the quantization configuration file, including inserting activation and weight quantization operators.
  3. Train the model and save the parameters as a checkpoint file.
  4. Call the create_quant_retrain_model API to modify the inference graph, including inserting the activation and weight quantization operators.
  5. Restore the training parameters, infer the output node of the quantized model, write quantization factors to the record file, and freeze the inference graph into a .pb model.
  6. Call the save_quant_retrain_model API to insert quantization operators such as AscendQuant and AscendDequant and save the quantized model.
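Condensed into code, the sequence above looks roughly like the following sketch. The user_* helpers and the file paths are placeholders standing in for user code; the full sample is walked through in Examples below.

    import amct_tensorflow as amct

    train_graph = user_load_train_graph()                          # user-built training graph
    amct.create_quant_retrain_config(config_file='./tmp/config.json',
                                     graph=train_graph,
                                     config_defination='./retrain.cfg')            # step 1
    retrain_ops = amct.create_quant_retrain_model(graph=train_graph,
                                                  config_file='./tmp/config.json',
                                                  record_file='./tmp/record.txt')  # step 2
    user_train_and_save_checkpoint(train_graph)                    # step 3, user code
    test_graph = user_load_test_graph()                            # user-built inference graph
    retrain_ops = amct.create_quant_retrain_model(graph=test_graph,
                                                  config_file='./tmp/config.json',
                                                  record_file='./tmp/record.txt')  # step 4
    frozen_pb = user_restore_infer_and_freeze(test_graph, retrain_ops)             # step 5, user code
    amct.save_quant_retrain_model(pb_model=frozen_pb,
                                  outputs=user_model_outputs,
                                  record_file='./tmp/record.txt',
                                  save_path='./results/user_model')                # step 6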

Examples

Take the following steps to get started. Update the sample code based on your situation, and tweak the arguments passed to the AMCT API calls as required. QAT relies on the user training result, so ensure that a TensorFlow training script that yields satisfactory training accuracy is available.
  1. Import the AMCT package and set the log level.
    import amct_tensorflow as amct
    amct.set_logging_level(print_level='info', save_level='info')
    
  2. (Optional) Build a graph, read the trained parameters, and run inference on the graph in the TensorFlow environment to validate the inference script and environment setup. (Update the sample code based on your situation.)

    This step is recommended because it confirms that the source model runs inference correctly with acceptable accuracy. You can use a subset of the test dataset to improve efficiency.

    user_test_evaluate_model(evaluate_model, test_data)
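    user_test_evaluate_model is user code. A minimal sketch of such a check, assuming evaluate_model is a user-defined forward pass and test_data is an (images, labels) pair, might look like this:

    def user_test_evaluate_model(evaluate_model, test_data):
        """Hypothetical helper: report top-1 accuracy of the source model on a test subset."""
        images, labels = test_data
        correct = 0
        for image, label in zip(images, labels):
            prediction = evaluate_model(image)   # user-defined forward pass
            correct += int(prediction == label)
        accuracy = correct / len(labels)
        print('Source model top-1 accuracy: {:.4f}'.format(accuracy))
        return accuracy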
    
  3. Build a training graph. (Update the sample code based on your situation.)
    train_graph = user_load_train_graph()
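    user_load_train_graph is user code. The following is a hypothetical toy example (real models will differ) showing a training graph with batch normalization in training mode (is_training set to True):

    import tensorflow as tf

    def user_load_train_graph():
        """Hypothetical example: a toy training graph with BN in training mode."""
        graph = tf.Graph()
        with graph.as_default():
            inputs = tf.compat.v1.placeholder(tf.float32, [None, 224, 224, 3], name='input')
            # Cin of conv1 is 3 (<= 16), so per Table 1 it should not be quantized.
            x = tf.compat.v1.layers.conv2d(inputs, filters=32, kernel_size=3, name='conv1')
            x = tf.compat.v1.layers.batch_normalization(x, training=True, name='bn1')
            # Cin of conv2 is 32, so it can be quantized.
            x = tf.compat.v1.layers.conv2d(x, filters=64, kernel_size=3, name='conv2')
            tf.nn.relu(x, name='output')
        return graph

    The inference graph built in step 5 follows the same pattern, with training=False for batch normalization.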
    
  4. Call AMCT to start training with quantization parameters.
    1. Generate a quantization configuration file.
      Based on the training graph (that is, with is_training set to True for BN), call the create_quant_retrain_config API to generate a quantization configuration file (corresponding to 1 in Figure 1).
      config_file = './tmp/config.json'
      simple_cfg = './retrain.cfg'
      amct.create_quant_retrain_config(config_file=config_file,
                                       graph=train_graph,
                                       config_defination=simple_cfg)
      
    2. Modify the training graph.
      Call the create_quant_retrain_model API to modify the training graph before quantization based on the quantization configuration file, that is, insert activation and weight quantization operators into the graph to calculate the quantization parameters (corresponding to 2 in Figure 1).
      record_file = './tmp/record.txt'
      retrain_ops = amct.create_quant_retrain_model(graph=train_graph,
                                                    config_file=config_file,
                                                    record_file=record_file)
      
    3. Implement gradient descent optimization on the modified graph, train the graph on the training dataset, and calculate quantization factors. (Update the sample code based on your situation.)
      1. On the modified graph, call the adaptive learning rate optimizer (RMSPropOptimizer) to build the backward gradient graph. Perform this step after 4.b.
        optimizer = tf.compat.v1.train.RMSPropOptimizer(
            ARGS.learning_rate, momentum=ARGS.momentum)
        train_op = optimizer.minimize(loss)
        
      2. Create a session to train the model, and save the trained parameters as a checkpoint file (corresponding to 3 and 4 in Figure 1).
        Note: Restore the model parameters from existing checkpoints and then train the model. The parameters saved during training must include the quantization factors, which are generated only after the first batch_num training iterations. If fewer than batch_num iterations are run, the training fails.
        # saver_save is a tf.compat.v1.train.Saver created for the training variables.
        with tf.compat.v1.Session() as sess:
            sess.run(tf.compat.v1.global_variables_initializer())
            sess.run(outputs)  # run the graph; see the training-loop sketch below for multiple iterations
            # Save the trained parameters as a checkpoint file.
            saver_save.save(sess, retrain_ckpt, global_step=0)
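        The sample above runs the graph once. A minimal training loop that honors the batch_num requirement might look like the following sketch; train_op, batch_num, ARGS.train_steps, and retrain_ckpt are assumed names taken from the surrounding sample.

        saver_save = tf.compat.v1.train.Saver()
        with tf.compat.v1.Session() as sess:
            sess.run(tf.compat.v1.global_variables_initializer())
            # Run at least batch_num iterations so that the quantization factors are generated.
            num_steps = max(ARGS.train_steps, batch_num)
            for _ in range(num_steps):
                sess.run(train_op)       # one gradient-descent step on the modified graph
            # Save the trained parameters, including the quantization factors.
            saver_save.save(sess, retrain_ckpt, global_step=num_steps)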
        
  5. Build an inference graph. (Update the sample code based on your situation.)
    test_graph = user_load_test_graph()
    
  6. Call AMCT to run QAT.
    1. Modify the inference graph.

      Call the create_quant_retrain_model API to modify the inference graph (with is_training set to False for BN) before quantization based on the quantization configuration file, that is, insert activation and weight quantization operators into the graph (corresponding to 5 in Figure 1).

      # Modify the inference graph built in step 5.
      record_file = './tmp/record.txt'
      retrain_ops = amct.create_quant_retrain_model(graph=test_graph,
                                                    config_file=config_file,
                                                    record_file=record_file)
      
    2. Create a session to restore the training parameters, infer the output node (retrain_ops[-1]) of the quantized model, write the quantization factors to the record file, and freeze the inference graph into a .pb model (corresponding to 6 and 7 in Figure 1). Update the sample code based on your situation.
      Note: Parameter restoration and inference must be performed in the same session. The inference is performed based on the output tensor of retrain_ops[-1]. When the inference graph is frozen into a .pb model, the trained parameters are included.
      variables_to_restore = tf.compat.v1.global_variables()
      saver_restore = tf.compat.v1.train.Saver(variables_to_restore)
      with tf.compat.v1.Session() as sess:
          sess.run(tf.compat.v1.global_variables_initializer())
          # Restore the training parameters.
          saver_restore.restore(sess, retrain_ckpt)
          # Infer the quantization output node (retrain_ops[-1]) and write the quantization factors to the record file.
          sess.run(retrain_ops[-1])
          # Freeze the inference graph into a .pb model, including the trained parameters.
          constant_graph = tf.compat.v1.graph_util.convert_variables_to_constants(
              sess, test_graph.as_graph_def(), [output.name[:-2] for output in outputs])
          with tf.io.gfile.GFile(frozen_quant_eval_pb, 'wb') as f:
              f.write(constant_graph.SerializeToString())
      
    3. Save the quantized model.
      Call the save_quant_retrain_model API to insert operators including AscendQuant and AscendDequant into the .pb model based on the quantization factors, and save the quantized model (corresponding to 8 in Figure 1).
      quant_model_path = './results/user_model'
      # frozen_quant_eval_pb is the .pb model frozen in the previous step.
      amct.save_quant_retrain_model(pb_model=frozen_quant_eval_pb,
                                    outputs=user_model_outputs,
                                    record_file=record_file,
                                    save_path=quant_model_path)
      
  7. (Optional) Run inference on the fake-quantized model user_model_quantized.pb in the TensorFlow environment based on the test dataset to test the accuracy. (Update the sample code based on your situation.)
    Check the accuracy loss of the fake-quantized model by comparing its accuracy with that of the source model (see step 2).
    quant_model = './results/user_model_quantized.pb'
    user_do_inference(quant_model, test_data)
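    user_do_inference is user code. A minimal sketch, assuming the fake-quantized model is a frozen GraphDef and the input and output tensor names are known ('input:0' and 'output:0' here are placeholders), might look like this:

    import tensorflow as tf

    def user_do_inference(pb_path, test_data, input_name='input:0', output_name='output:0'):
        """Hypothetical helper: run the frozen fake-quantized model on test data."""
        graph_def = tf.compat.v1.GraphDef()
        with tf.io.gfile.GFile(pb_path, 'rb') as f:
            graph_def.ParseFromString(f.read())
        with tf.Graph().as_default() as graph:
            tf.compat.v1.import_graph_def(graph_def, name='')
            input_tensor = graph.get_tensor_by_name(input_name)
            output_tensor = graph.get_tensor_by_name(output_name)
            with tf.compat.v1.Session(graph=graph) as sess:
                images, labels = test_data
                predictions = sess.run(output_tensor, feed_dict={input_tensor: images})
        return predictions  # compare against labels to measure the accuracy loss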