Workflow
This section describes the API call sequence of compression combination and provides examples.
API Call Sequence

Some operations in the following sequence are implemented by the user, while the others are performed by calling AMCT APIs.
- Build a source PyTorch model and call the create_compressed_retrain_model API to modify the model. The modified model contains sparsity and quantization operators.
- Train the modified model, then save it. The full sequence is sketched in code below.
  - If training completes without interruption, run inference on the trained model. During inference, the quantization factors are written to the quantization factor record file. Then call the save_compressed_retrain_model API to save a fake-quantized model for accuracy simulation and a deployable model.
  - If training is interrupted, call the restore_compressed_retrain_model API with the saved .pth model parameters to rebuild the sparse network with quantization operators and resume retraining from the weights saved before the interruption. Then run inference on the resulting model and call the save_compressed_retrain_model API to save the quantized model.
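The sequence can be summarized in code as follows. This is a minimal sketch, assuming the hypothetical user helpers (user_train_model, user_infer_graph) and the file paths used in the examples below.

import amct_pytorch as amct

# Build the source model and modify it: sparsify it and insert
# quantization operators. If a previous run was interrupted, call
# amct.restore_compressed_retrain_model with the saved .pth file instead.
compressed_retrain_model = amct.create_compressed_retrain_model(
    model=ori_model,
    input_data=ori_model_input_data,
    config_defination='./compressed.cfg',
    record_file='./tmp/record.txt')

# Train, then run inference so the quantization factors are written
# to the record file (user-implemented helpers).
user_train_model(optimizer, compressed_retrain_model, train_data)
user_infer_graph(compressed_retrain_model)

# Save a fake-quantized model for accuracy simulation and a deployable model.
amct.save_compressed_retrain_model(
    model=compressed_retrain_model,
    record_file='./tmp/record.txt',
    save_path='./results/user_model',
    input_data=ori_model_input_data)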
Examples
- Training is performed in the PyTorch environment. Currently, only multi-device training in distributed mode (DistributedDataParallel) is supported; multi-device training in DataParallel mode is not. If DataParallel mode is used for training, an error is reported. (A typical DistributedDataParallel setup is sketched after these notes.)
- Adjust the arguments passed to AMCT API calls as required. Compression combination builds on your training result, so ensure that a PyTorch training script that yields satisfactory training accuracy is available.
- When the QAT feature of AMCT is used and the training process is suspended (hangs), check whether other ONNX Runtime programs are running on the current server (for example, by running the top command). If so, stop the other ONNX Runtime programs and perform QAT again.
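The DistributedDataParallel requirement in the first note can be illustrated with a typical setup. This is a hedged sketch of standard PyTorch DDP usage, not AMCT-specific code; user_create_model is a hypothetical helper, and the process-group details depend on your launcher (for example, torchrun).

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# The launcher supplies RANK/WORLD_SIZE/MASTER_ADDR via the environment.
dist.init_process_group(backend='nccl')
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = user_create_model().cuda()           # hypothetical user helper
model = DDP(model, device_ids=[local_rank])  # DataParallel is NOT supported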
- Take the following steps to get started. Update the sample code based on your situation.
- Import the AMCT package and set the log level (see Post-installation Actions for details).
import amct_pytorch as amct
- (Optional) Run inference on the source model in the PyTorch environment based on the test dataset to validate the inference script and environment setup. (Update the sample code based on your situation.)
This step is recommended because it confirms that the source model runs properly and delivers acceptable inference accuracy. You can use a subset of the test dataset to improve efficiency.
ori_model.load()
# Test the model.
user_test_model(ori_model, test_data, test_iterations)
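user_test_model is a user-supplied helper. A minimal sketch of what it might do, assuming a classification task and a top-1 accuracy metric:

import torch

def user_test_model(model, test_data, test_iterations):
    # Hypothetical helper: evaluate top-1 accuracy over a bounded
    # number of batches.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for step, (images, labels) in enumerate(test_data):
            if step >= test_iterations:
                break
            preds = model(images).argmax(dim=1)
            correct += int((preds == labels).sum())
            total += int(labels.numel())
    print('accuracy: {:.4f}'.format(correct / max(total, 1)))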
- Call AMCT to perform compression combination.
- Modify the model. Specifically, sparsify ori_model, insert quantization operators into it, load model weights, and save the new model as retrain_model.
Before performing this step, restore the already trained parameters, for example, with ori_model.load() as in the optional inference step above.
simple_cfg = './compressed.cfg'
record_file = './tmp/record.txt'
compressed_retrain_model = amct.create_compressed_retrain_model(
    model=ori_model,
    input_data=ori_model_input_data,
    config_defination=simple_cfg,
    record_file=record_file)
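The ori_model_input_data argument above is example input that AMCT uses to trace the model. A minimal sketch of how it might be constructed; the tuple form and the input shape are assumptions, so match your model's real input:

import torch

# Dummy input with the same shape and dtype as the model's real input;
# the shape below is an assumption for a typical image classifier.
ori_model_input_data = (torch.randn(1, 3, 224, 224),)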
- Implement gradient descent optimization on the modified graph, train the graph on the training dataset, and calculate quantization factors. (Update the sample code based on your situation.)
- Restore the model from existing checkpoints and train the model.
Note: Restore the model parameters from existing checkpoints and then train the model. The parameters saved during training must include the quantization factors. The quantization factors are generated only after the first batch_num training steps; if the number of training steps is less than batch_num, training fails.
compressed_pth = './ckpt/user_model'
user_train_model(optimizer, compressed_retrain_model, train_data)
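user_train_model above is user code. A minimal sketch, assuming a classification loss and one checkpoint per epoch saved under compressed_pth (the '_newest.ckpt' suffix matches the restore example later in this section):

import torch

def user_train_model(optimizer, model, train_data, epochs=1):
    # Hypothetical helper: train and checkpoint so training can resume
    # after an interruption.
    criterion = torch.nn.CrossEntropyLoss()  # assumed task/loss
    model.train()
    for epoch in range(epochs):
        for images, labels in train_data:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        # The saved parameters include the quantization factors, which
        # exist only after the first batch_num training steps.
        torch.save(model.state_dict(), compressed_pth + '_newest.ckpt')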
- After the training is complete, run inference to calculate and save the quantization factors.
user_infer_graph(compressed_retrain_model)
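user_infer_graph is likewise user code. A minimal sketch, reusing the test set and iteration count from the earlier inference step: it only runs forward passes in eval mode so that AMCT can calculate and record the quantization factors.

import torch

def user_infer_graph(model):
    # Hypothetical helper: forward passes only; AMCT records the
    # quantization factors during inference.
    model.eval()
    with torch.no_grad():
        for step, (images, _) in enumerate(test_data):
            if step >= test_iterations:
                break
            model(images)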
- Save the model.
save_path = './results/user_model'
amct.save_compressed_retrain_model(
    model=compressed_retrain_model,
    record_file=record_file,
    save_path=save_path,
    input_data=ori_model_input_data)
- (Optional) Run inference on the fake-quantized model in the ONNX Runtime environment based on the test dataset (test_data) to analyze the accuracy. Update the sample code based on your situation. Compare the accuracy of the simulation model after combined compression with that of the original model from the optional inference step above to observe the impact of combined compression on accuracy.
compressed_model = './results/user_model_fake_quant_model.onnx'
user_do_inference_onnx(compressed_model, test_data, test_iterations)
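user_do_inference_onnx is user code as well. A minimal sketch using the standard ONNX Runtime Python API; the input-name lookup and the top-1 accuracy metric are assumptions:

import numpy as np
import onnxruntime as ort

def user_do_inference_onnx(model_path, test_data, test_iterations):
    # Hypothetical helper: accuracy check on the fake-quantized ONNX model.
    session = ort.InferenceSession(model_path)
    input_name = session.get_inputs()[0].name
    correct = total = 0
    for step, (images, labels) in enumerate(test_data):
        if step >= test_iterations:
            break
        outputs = session.run(None, {input_name: images.numpy()})
        preds = np.argmax(outputs[0], axis=1)
        correct += int((preds == labels.numpy()).sum())
        total += int(labels.numel())
    print('fake-quant accuracy: {:.4f}'.format(correct / max(total, 1)))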
If the training process is interrupted, restore data from the checkpoints to resume the training.
- Import the AMCT package and set the log level (see Post-installation Actions for details).
import amct_pytorch as amct
- Prepare a source model.
ori_model = user_create_model()
- Call AMCT to perform compression combination.
- Modify the model. Specifically, sparsify ori_model, insert quantization operators into it, load the model weights saved before the interruption, and save the new model as retrain_model.
simple_cfg = './compressed.cfg'
record_file = './tmp/record.txt'
compressed_pth_file = './ckpt/user_model_newest.ckpt'
compressed_retrain_model = amct.restore_compressed_retrain_model(
    model=ori_model,
    input_data=ori_model_input_data,
    config_defination=simple_cfg,
    record_file=record_file,
    pth_file=compressed_pth_file)
- Implement gradient descent optimization on the modified graph, train the graph on the training dataset, and calculate quantization factors. (Update the sample code based on your situation.)
- Restore the model from existing checkpoints and train the model.
The quantization factors are saved to the checkpoints.
compressed_pth = './ckpt/user_model'
user_train_model(optimizer, compressed_retrain_model, train_data)
- After the training is complete, run inference to calculate and save the quantization factors.
user_infer_graph(compressed_retrain_model)
- Save the model.
save_path = './results/user_model'
amct.save_compressed_retrain_model(
    model=compressed_retrain_model,
    record_file=record_file,
    save_path=save_path,
    input_data=ori_model_input_data)
- (Optional) Run inference on the fake-quantized model in the ONNX Runtime environment based on the test dataset (test_data) to analyze the accuracy. Update the sample code based on your situation. Compare the accuracy of the simulation model after combined compression with that of the original model from the optional inference step above to observe the impact of combined compression on accuracy.
compressed_model = './results/user_model_fake_quant_model.onnx'
user_do_inference_onnx(compressed_model, test_data, test_iterations)