Quantization Process
This section describes the quantization layers supported by QAT, as well as the API call sequence and an example.
For the complete quantization example, see Sample List. The layers that support QAT are listed below.
| Supported Layer Type | Restriction |
|---|---|
| InnerProduct | transpose = false, axis = 1 |
| Convolution | 4 x 4 filter |
| Deconvolution | 1-dilated 4 x 4 filter with group = 1 |
| AvgPool | Global pooling is not supported. |
API Call Sequence
Figure 1 shows the API call sequence for QAT.
- Build the original Caffe model and generate a quantization configuration file by using the create_quant_retrain_config API.
- Using the create_quant_retrain_model API, modify the source Caffe model by inserting the activation and weight quantization operators used to calculate the quantization parameters. Retrain the model in the Caffe environment on the test and calibration datasets provided with AMCT to obtain the quantization factors.
- Execute the Caffe training process: configure the solver, add a test phase to the training process, and set the number of test iterations to a value greater than or equal to batch_num in the quantization configuration file.
- Using the save_quant_retrain_model API, insert operators including AscendQuant and AscendDequant and save the quantized model (including its weight file), which is suitable either for accuracy simulation in the Caffe environment or for deployment on the Ascend AI Processor.
Example
- Suggestions on the training process in the QAT scenario:
The training configuration after quantization should remain basically the same as that of the original training process. The main adjustments are as follows (a minimal sketch follows this list):
- Epoch number: 1/4–1/3 of the initial epoch number.
- Learning rate: 1/100 of the initial learning rate.
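A minimal sketch of these adjustments. The original values (orig_epochs, orig_lr) are illustrative placeholders, not values from the sample:

orig_epochs = 90   # hypothetical original training setup
orig_lr = 0.1

retrain_epochs = orig_epochs // 4   # 1/4 to 1/3 of the original epochs
retrain_lr = orig_lr / 100.0        # 1/100 of the original learning rate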
- Take the following steps to get started. Update the sample code based on your situation.
- Tweak the arguments passed to AMCT API calls as required. QAT relies on the user training result. Ensure that a Caffe training script that yields satisfactory training accuracy is available.
- Import the AMCT package and set the log level (see Set environment variables for details).

import amct_caffe as amct
- Set the device running mode.
AMCT runs on the CPU or GPU. To run AMCT on the GPU, configure the Caffe run mode and target device before configuring AMCT's run mode. Since the target device is specified here, you do not need to configure it again in the model inference function.

if gpu_id is not None:
    caffe.set_mode_gpu()
    caffe.set_device(gpu_id)
    amct.set_gpu_mode()
else:
    caffe.set_mode_cpu()
- (Optional) Run inference on the source model in the Caffe environment based on the test dataset to validate the inference script and environment setup. (Update the sample code based on your situation.)
This step is recommended because it verifies that the source model runs inference correctly with acceptable accuracy. You can use a subset of the test dataset to improve efficiency.
user_test_model(ori_model_file, ori_weights_file, test_data, test_iterations)
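user_test_model is a user-defined helper. A minimal sketch of what it might look like, assuming the network exposes an 'accuracy' output blob and that test_data is already wired into the model's data layer (both are assumptions):

import caffe

def user_test_model(model_file, weights_file, test_data, test_iterations):
    # Hypothetical sketch: average the 'accuracy' output blob over
    # test_iterations forward passes. test_data is assumed to be
    # referenced by the data layer declared in model_file.
    net = caffe.Net(model_file, weights_file, caffe.TEST)
    total = 0.0
    for _ in range(test_iterations):
        total += float(net.forward()['accuracy'])
    mean_acc = total / test_iterations
    print('mean accuracy: %.4f' % mean_acc)
    return mean_acc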
- Run AMCT to quantize the model.
- Generate a quantization configuration file.
config_file = './tmp/config.json'
amct.create_quant_retrain_config(config_file=config_file,
                                 model_file=ori_model_file,
                                 weights_file=ori_weights_file)
- Modify the model, insert the fake-quantization layers, and save the result as a new model file. Based on the quantization configuration file, this step inserts the activation and weight quantization operators used to calculate the quantization parameters.
modified_model_file = './tmp/modified_model.prototxt'
modified_weights_file = './tmp/modified_model.caffemodel'
scale_offset_record_file = './tmp/record.txt'
amct.create_quant_retrain_model(model_file=ori_model_file,
                                weights_file=ori_weights_file,
                                config_file=config_file,
                                modified_model_file=modified_model_file,
                                modified_weights_file=modified_weights_file,
                                scale_offset_record_file=scale_offset_record_file)
- Run gradient descent optimization on the modified model, training it on the training dataset to calculate the quantization factors. (Update the sample code based on your situation.)
- Add a test phase (test_interval > 0, test_iter > 0) to solver.prototxt to enable the search for the shift factor N during the test phase, and disable the pre-training test (test_initialization = false) to avoid triggering an unintended search for the shift factor N.
The solver.prototxt file must contain the following parameters:
test_iter: 1
test_interval: 4
base_lr: 9.999999747378752e-05
max_iter: 4
lr_policy: "step"
gamma: 0.10000000149011612
momentum: 0.8999999761581421
weight_decay: 9.999999747378752e-05
stepsize: 10
snapshot: 4
net: "$HOME/amct_path/sample/resnet50/tmp/modified_model.prototxt"
test_initialization: false
- test_iter: number of iterations per test. Because the search for the shift factor N runs during the test phase, this value must be greater than or equal to the batch_num parameter in the quantization configuration file; otherwise, the calculation of the shift factor N fails due to insufficient data. test_iter * batch_size indicates the number of images tested each time.
- test_interval: interval between tests, in training iterations. Defaults to 0. You are advised to set this parameter to a factor of max_iter (test_interval == max_iter in the sample), so that only one test is performed, after the training phase.
- max_iter: maximum number of training iterations.
- net: Caffe model to train. The same model is reused in the training and test phases; the operators to run in each phase are selected by the phase field of each operator. Alternatively, you can specify separate models with train_net and test_net. Because AMCT generates only one model, net is used here.
- test_initialization: whether to test the model before training starts (bool). The default value is true, meaning a pretest is performed. Because the quantization parameters are still at their initial values at that point, the pretest would yield an incorrect shift factor N; therefore, disable it by setting test_initialization to false.
- base_lr, lr_policy, and gamma: control the learning rate policy, that is, how the learning rate changes during training.
- momentum: weight of the previous gradient update.
- weight_decay: weight decay term, used to prevent overfitting.
- snapshot: interval (in training iterations) at which snapshots of the trained model and solver state are saved to the specified directory.
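A minimal sanity check for these constraints, as a hedged sketch: batch_num is read from the generated config.json (the key name is an assumption), and the numeric values mirror the sample solver.prototxt.

import json

# batch_num from the quantization configuration file (key name assumed).
with open('./tmp/config.json') as f:
    batch_num = json.load(f).get('batch_num', 1)

test_iter = 1      # values from the sample solver.prototxt
test_interval = 4
max_iter = 4

# The shift factor N search needs at least batch_num test batches.
assert test_iter >= batch_num
# test_interval should be a factor of max_iter (one test after training).
assert max_iter % test_interval == 0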
- Train the model.
user_train_model(modified_model_file, modified_weights_file, train_data)

During training, the activation quantization operators learn the upper and lower quantization bounds clip_max and clip_min, which are saved to the operator BLOBs. The weight quantization operators learn their quantization parameters and save the updated values to the model. The quantization factors are generated after batch_num training iterations; if fewer than batch_num iterations are run, the training fails.
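user_train_model is likewise user-supplied. A minimal sketch, assuming the solver.prototxt shown above lives at ./tmp/solver.prototxt and its net field points at the modified model (train_data is assumed to be wired into the model's data layer):

import caffe

def user_train_model(model_file, weights_file, train_data):
    # Hypothetical sketch: drive retraining with the solver shown above.
    # The solver's 'net' field is assumed to reference model_file, and
    # train_data to be referenced by the model's data layer.
    solver = caffe.SGDSolver('./tmp/solver.prototxt')
    solver.net.copy_from(weights_file)  # start from the modified weights
    solver.solve()  # runs max_iter training iterations plus the test phase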
- Save the model. Call the save_quant_retrain_model API to insert operators such as AscendQuant and AscendDequant and, based on the quantization factors and the retrained model, save the resultant deployable model and fake-quantized model.
quant_model_path = './result/user_model'
amct.save_quant_retrain_model(retrained_model_file=modified_model_file,
                              retrained_weights_file=modified_weights_file,
                              save_type='Both',
                              save_path=quant_model_path,
                              scale_offset_record_file=scale_offset_record_file,
                              config_file=config_file)
- (Optional) Run inference on the fake-quantized model (fake_quant_model with fake_quant_weights) in the Caffe environment based on the test dataset to test its accuracy. (Update the sample code based on your situation.) Check the accuracy loss of the fake-quantized model by comparing its result with that of the source model (see step 3).
fake_quant_model = './result/user_model_fake_quant_model.prototxt'
fake_quant_weights = './result/user_model_fake_quant_weights.caffemodel'
user_test_model(fake_quant_model, fake_quant_weights, test_data, test_iterations)
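A minimal comparison sketch, reusing the hypothetical user_test_model above (which returns the mean accuracy):

ori_acc = user_test_model(ori_model_file, ori_weights_file, test_data, test_iterations)
quant_acc = user_test_model(fake_quant_model, fake_quant_weights, test_data, test_iterations)
print('accuracy drop after quantization: %.4f' % (ori_acc - quant_acc))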
