Workflow
This section describes the API call sequence and provides an example of compression combination.
API Call Sequence
In Figure 1, the user implements the operations shown in blue, while those in gray are implemented by calling AMCT APIs. Specifically, import the AMCT package into the source TensorFlow network inference code and call the APIs where appropriate for compression.
- Construct a training graph and then call the create_compressed_retrain_model API to modify the graph before compression based on the simplified configuration file, that is, to insert filter-level sparsity (or 2:4 structured sparsity) and QAT operators into the graph.
- Train and checkpoint the model.
- Construct an inference graph and then call the create_compressed_retrain_model API to modify the graph before compression based on the quantization configuration file, that is, to insert filter-level sparsity (or 2:4 structured sparsity) and QAT operators into the graph.
- Restore the training parameters, infer the output node of the quantized model, write quantization factors to the record file, and freeze the inference graph into a .pb model.
- Call the save_compressed_retrain_model API to export the compressed model based on the sparsity and quantization factor record file.
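For orientation, the sequence above maps onto AMCT calls roughly as follows. This is only an outline: the user_* helpers, simple_cfg, user_model_outputs, and frozen_pb are placeholders, and the full worked example is given under Examples below.

    import amct_tensorflow as amct

    record_file = './tmp/record.txt'

    # 1. Modify the training graph: insert sparsity and QAT operators.
    train_graph = user_load_train_graph()  # placeholder helper
    retrain_ops = amct.create_compressed_retrain_model(
        graph=train_graph, config_defination=simple_cfg,
        outputs=user_model_outputs, record_file=record_file)

    # 2. Train the modified graph and save a checkpoint (user code).

    # 3. Modify the inference graph in the same way.
    test_graph = user_load_test_graph()  # placeholder helper
    retrain_ops = amct.create_compressed_retrain_model(
        graph=test_graph, config_defination=simple_cfg,
        outputs=user_model_outputs, record_file=record_file)

    # 4. Restore the checkpoint, run sess.run(retrain_ops[-1]) to write the
    #    quantization factors, and freeze the graph into a .pb (user code).

    # 5. Export the compressed model from the frozen .pb and the record file.
    amct.save_compressed_retrain_model(
        pb_model=frozen_pb, outputs=user_model_outputs,
        record_file=record_file, save_path='./result/user_model')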
Examples
- Take the following steps to get started. Update the sample code based on your situation.
- Tweak the arguments passed to AMCT API calls as required. QAT relies on the user's training result, so ensure that a TensorFlow training script that yields satisfactory training accuracy is available.
- Import the AMCT package and set the log level.
    import amct_tensorflow as amct
    amct.set_logging_level(print_level='info', save_level='info')
- (Optional) Build a graph, read the trained parameters, and run inference on the graph in the TensorFlow environment to validate the inference script and environment setup. (Update the sample code based on your situation.)
This step is recommended because it confirms that the source model runs properly and delivers acceptable inference accuracy. You can use a subset of the test dataset to improve efficiency.
    user_test_evaluate_model(evaluate_model, test_data)
- Build a training graph. (Update the sample code based on your situation.)
    train_graph = user_load_train_graph()
- Call AMCT to train the model with the sparsity operators inserted and to compute the quantization parameters.
- Modify the graph to insert sparsity and quantization operators.
Before compression, call the create_compressed_retrain_model API to modify the trained graph based on the simplified configuration file and the source model. Specifically, insert the filter-level sparsity (or 2:4 structured sparsity) and QAT operators to generate the graph for compression combination (corresponding to 1 in Figure 1).
    record_file = './tmp/record.txt'
    retrain_ops = amct.create_compressed_retrain_model(graph=train_graph,
                                                       config_defination=simple_cfg,
                                                       outputs=user_model_outputs,
                                                       record_file=record_file)
- Implement gradient descent optimization on the modified graph, train the graph on the training dataset, and calculate quantization factors. (Update the sample code based on your situation.)
- Implement gradient descent optimization on the modified graph. Perform this step after the graph modification above. Call the adaptive-learning-rate optimizer (RMSPropOptimizer) to create the backward gradient graph.
    optimizer = tf.compat.v1.train.RMSPropOptimizer(ARGS.learning_rate,
                                                    momentum=ARGS.momentum)
    train_op = optimizer.minimize(loss)
- Create a session to train the model, and save the trained parameters as a checkpoint file (corresponding to 2 and 3 in Figure 1). Note: Restore the model parameters from existing checkpoints and then train the model. The parameters saved during training must include the quantization factors, which are generated only after the first batch_num training iterations. If the model is trained for fewer than batch_num iterations, the training fails.
    # Create a Saver; the saved parameters include the quantization factors.
    saver_save = tf.compat.v1.train.Saver()
    with tf.compat.v1.Session() as sess:
        sess.run(tf.compat.v1.global_variables_initializer())
        # Restore pre-trained parameters here if available, then train the
        # modified graph for at least batch_num iterations so that the
        # quantization factors are generated.
        sess.run(train_op)
        # Save the trained parameters as a checkpoint file.
        saver_save.save(sess, retrain_ckpt, global_step=0)
- Build an inference graph. (Update the sample code based on your situation.)
    test_graph = user_load_test_graph()
- Call AMCT to perform compression combination.
- Modify the inference graph.
Construct an inference graph (that is, set is_training to False for the BN layers), and then call the create_compressed_retrain_model API to modify the graph based on the quantization configuration file, that is, insert filter-level sparsity (or 2:4 structured sparsity) and QAT operators for subsequent model freezing and inference, and generate a sparsity and quantization factor record file (corresponding to 4 and 5 in Figure 1). A sketch of this call follows.
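This call mirrors the one made on the training graph; test_graph, simple_cfg, and record_file are assumed from the earlier steps:

    retrain_ops = amct.create_compressed_retrain_model(graph=test_graph,
                                                       config_defination=simple_cfg,
                                                       outputs=user_model_outputs,
                                                       record_file=record_file)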
- Create a session to restore the training parameters, infer the output node (retrain_ops[-1]) of the quantized model, write the quantization factors to the record file, and freeze the inference graph into a .pb model (corresponding to 6 and 7 in Figure 1). Update the sample code based on your situation. Note: Parameter restoration and inference must be performed in the same session. Inference is run on the output tensor of retrain_ops[-1]. When the inference graph is frozen into a .pb model, the trained parameters are included.
    variables_to_restore = tf.compat.v1.global_variables()
    saver_restore = tf.compat.v1.train.Saver(variables_to_restore)
    with tf.compat.v1.Session() as sess:
        sess.run(tf.compat.v1.global_variables_initializer())
        # Restore the training parameters.
        saver_restore.restore(sess, retrain_ckpt)
        # Write the quantization factors to the record file.
        # Note: If no quantization function is enabled, skip this step and go
        # to the next step.
        sess.run(retrain_ops[-1])
        # Freeze the model into a .pb model.
        constant_graph = tf.compat.v1.graph_util.convert_variables_to_constants(
            sess, test_graph.as_graph_def(),
            [output.name[:-2] for output in outputs])
        with tf.io.gfile.GFile(frozen_quant_eval_pb, 'wb') as f:
            f.write(constant_graph.SerializeToString())
- Save the combined compression model. Based on the sparsity and quantization factor record file and the frozen model, this step deletes the inserted sparsity operators, inserts quantization operators such as AscendQuant and AscendDequant, and saves the compressed model.
    compressed_model_path = './result/user_model'
    amct.save_compressed_retrain_model(pb_model=frozen_quant_eval_pb,
                                       outputs=user_model_outputs,
                                       record_file=record_file,
                                       save_path=compressed_model_path)
- (Optional) Use the compressed model user_model_compressed.pb and the test dataset to perform inference in the TensorFlow environment and test the accuracy of the compressed model. (Update the sample code based on your situation.) Compare the accuracy of the compressed model with that of the original model to observe the impact of compression on the accuracy.
    compressed_model = './result/user_model_compressed.pb'
    user_do_inference(compressed_model, test_data)
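user_do_inference is user code. A minimal sketch of what it might look like, assuming a frozen graph with a single input tensor named 'input:0' and a single output tensor named 'output:0' (both names are hypothetical):

    import tensorflow as tf

    def user_do_inference(pb_path, test_data):
        # Load the frozen GraphDef from disk.
        graph_def = tf.compat.v1.GraphDef()
        with tf.io.gfile.GFile(pb_path, 'rb') as f:
            graph_def.ParseFromString(f.read())
        # Import it into a fresh graph and run inference.
        with tf.Graph().as_default() as graph:
            tf.compat.v1.import_graph_def(graph_def, name='')
            input_tensor = graph.get_tensor_by_name('input:0')    # hypothetical name
            output_tensor = graph.get_tensor_by_name('output:0')  # hypothetical name
            with tf.compat.v1.Session(graph=graph) as sess:
                return sess.run(output_tensor, feed_dict={input_tensor: test_data})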
