Uniform quantization
Quantization in which the quantization levels are uniformly spaced is termed uniform quantization. For example, INT8 quantization represents 32-bit float32 data with 8-bit int8 data and converts float32 convolution operations (multiply-add operations) into int8 convolution operations, which reduces the model size and speeds up computation. In uniform INT8 quantization, the quantized data is evenly distributed over the int8 value range [-128, +127].
If the model accuracy is not satisfactory after uniform quantization, perform Quantization Aware Training, Accuracy-based Automatic Quantization, or Manual Tuning.
The layers that support uniform quantization and the restrictions are as follows. For details about quantization examples, see Sample List.
| Technique | Supported Layer Type | Restriction |
|---|---|---|
| Uniform quantization | InnerProduct | transpose = false, axis = 1 |
| Uniform quantization | Convolution | 4 x 4 filter |
| Uniform quantization | Deconvolution | 1-dilated 4 x 4 filter |
| Uniform quantization | AvgPool | Global Pooling is not supported. |
API Call Sequence
Figure 1 shows the API call sequence for uniform quantization. You cannot run uniform quantization on more than one GPU.
- Build an original Caffe model and then generate a quantization configuration file by using the create_quant_config API.
- Use the Caffe model and quantization configuration file to initialize the tool by using the init API, configure the quantization factor record file, and then parse the model into a graph.
- Using the quantize_model API, optimize the graph of the source Caffe model by inserting activation and weight quantization operators for quantization parameter calculation.
- In the Caffe environment, run inference on the modified model with the test and calibration datasets provided to AMCT to obtain the quantization factors.
The test dataset is used to test the accuracy of the quantized model in the Caffe environment, while the calibration dataset is used to generate the quantization factors that preserve accuracy.
- Using the save_model API, insert operators including AscendQuant and AscendDequant and save the quantized model (including its weight file) that is either suitable for accuracy simulation in the Caffe environment or deployable on Ascend AI Processor.
- A fake-quantized model for accuracy simulation in the Caffe environment, with its name containing the fake_quant keyword. The fake-quantized model is used to verify the accuracy of the quantized model and can run within the Caffe framework. During forward passes, the input activations and weights of the convolutional layers (and other quantized layers) are quantized and then dequantized to simulate quantization, which enables you to quickly verify the accuracy of the quantized model. As shown in the following figure, data flows through the Quant, Convolution, and Dequant layers in float32: the Quant layer quantizes activations and weights to INT8 and dequantizes them back to float32, and the calculation at the Convolution layer is performed in float32. This model is used only to verify the accuracy of the quantized model within the Caffe framework and is not suitable for conversion into an .om model by ATC.
Figure 2 Fake-quantized model

- A deployable model file, with its name containing the deploy keyword. The model can be deployed on the Ascend AI Processor after being converted by the ATC tool. Data in the deployable model, including weights, has already been converted into the INT8 or INT32 type, so you cannot run inference on this model within the Caffe framework. As shown in the following figure, the AscendQuant layer of the deployable model quantizes activations from float32 to INT8 as the input of the convolutional layer, which uses INT8 weights and outputs INT32 results. That is, in the deployable model, calculation at the convolutional layer is based on the INT8 and INT32 types. The INT32 results are then converted back to float32 at the AscendDequant layer before being passed to the next layer. A minimal sketch contrasting the two data flows follows this list.
Figure 3 Deployable model
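To illustrate the difference, the following minimal NumPy sketch contrasts the two data flows for a single multiply-add. The scale values are hypothetical and shown for illustration only; the actual quantization factors are computed by AMCT during calibration.

import numpy as np

# Hypothetical quantization factors for illustration; AMCT derives the real
# ones from calibration data.
act_scale, wts_scale = 0.05, 0.01

x_fp32 = np.random.randn(1, 16).astype(np.float32)   # activations
w_fp32 = np.random.randn(16, 4).astype(np.float32)   # weights

def quant(x, scale):
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

# Fake-quantized model: quantize and immediately dequantize, then compute in float32.
x_fake = quant(x_fp32, act_scale).astype(np.float32) * act_scale
w_fake = quant(w_fp32, wts_scale).astype(np.float32) * wts_scale
y_fake = x_fake @ w_fake                                        # float32 computation

# Deployable model: AscendQuant -> INT8 computation with INT32 accumulation -> AscendDequant.
x_int8 = quant(x_fp32, act_scale)
w_int8 = quant(w_fp32, wts_scale)
y_int32 = x_int8.astype(np.int32) @ w_int8.astype(np.int32)    # INT32 results
y_deploy = y_int32.astype(np.float32) * act_scale * wts_scale  # dequantize back to float32

print(np.abs(y_fake - y_deploy).max())  # the two paths produce (nearly) identical results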

Example
This section details the PTQ template code line by line, helping you understand the AMCT workflow. You can adapt the template code to other network models with just a few tweaks.
For the sample code, see "Uniform Quantization" in "resnet50". The PTQ workflow goes through the following steps:
- Prepare an already-trained model and necessary datasets.
- Validate the model accuracy and environment setup in the source Caffe environment.
- Write a PTQ script based on AMCT API calls.
- Run the PTQ script.
- Test the accuracy of the fake-quantized model in the source Caffe environment.
The following details how to write a quantization script based on AMCT API calls.
- Take the following steps to get started. Update the sample code based on your situation.
- Tweak the arguments passed to AMCT API calls as required.
- Import the AMCT package and set the log level (see Set environment variables for details).
import amct_caffe as amct
- Set the run mode and target device.
AMCT runs on the CPU (set_cpu_mode) or GPU (set_gpu_mode). The GPU run mode is closely tied to the Caffe framework: GPU device selection is implemented with the Caffe APIs caffe.set_mode_gpu() and caffe.set_device(args.gpu_id). To run AMCT on the GPU, configure Caffe's run mode and target device before configuring AMCT's run mode. Because the target device is already specified here, you do not need to configure it again in the model inference function.
if args.gpu_id is not None and not args.cpu_mode:
    caffe.set_mode_gpu()
    caffe.set_device(args.gpu_id)
    amct.set_gpu_mode()
else:
    caffe.set_mode_cpu()
- (Optional) Validate the inference script and environment setup in the source Caffe environment. Update the sample code based on your situation. You are advised to start by running inference on the original model under the Caffe framework.
# Run original model without quantize test
if args.pre_test:
    run_caffe_model(args.model_file, args.weights_file, args.iterations)
    print('[INFO]Run %s without quantize success!' %(args.model_name))
    return
- Run AMCT to quantize the model.
- Parse the user model and generate a complete quantization configuration file. There are two ways to generate a quantization configuration file:
- Tweak the simplified configuration file and set the config_defination parameter. The remaining parameters are ignored in this case.
- Call the API for generating a quantization configuration file, with the skip_layers, batch_num, and activation_offset arguments specified. The sample code is as follows.
# Generate quantize configurations
config_json_file = 'tmp/config.json'
batch_num = 2
if args.cfg_define is not None:
    amct.create_quant_config(config_json_file,
                             args.model_file,
                             args.weights_file,
                             config_defination=args.cfg_define)
else:
    skip_layers = []
    amct.create_quant_config(config_json_file,
                             args.model_file,
                             args.weights_file,
                             skip_layers,
                             batch_num)
- Initialize AMCT, read the complete quantization configuration file, parse the user model file, and generate the Graph IR of the optimized model.
# Phase0: Init amct task
scale_offset_record_file = 'tmp/scale_offset_record.txt'
graph = amct.init(config_json_file,
                  args.model_file,
                  args.weights_file,
                  scale_offset_record_file)
- Perform graph fusion and offline weight quantization, and insert activation quantization layers, resulting in a model ready for calibration. In the calibration process, the activations of this model will be quantized.
1 2 3 4 5
# Phase1: Do conv+bn+scale fusion, weights calibration and fake # quantize, insert data-quantize layer modified_model_file = 'tmp/modified_model.prototxt' modified_weights_file = 'tmp/modified_model.caffemodel' amct.quantize_model(graph, modified_model_file, modified_weights_file)
- Calibrate the model by running forward passes to complete activation quantization. Update the sample code based on your situation. The number of inference iterations in this step must be greater than or equal to the batch_num value set for activation quantization. Each calibration iteration consists of data preprocessing, a forward pass (net.forward), and data post-processing.
# Phase2: run caffe model to do activation calibration
run_caffe_model(modified_model_file, modified_weights_file, batch_num)
If the message "IfmrQuantWithOffset scale is illegal" is displayed during the calibration, fix the error by referring to " "IfmrQuantCalibration with offset scale is illegal"" or " "IfmrQuantCalibration without offset scale is illegal"" Is Displayed During Calibration
- Save the quantized model. Call the save_model API to insert operators such as AscendQuant and AscendDequant into the modified graph and save the resultant deployable and fake-quantized models based on the quantization factors.
# Phase3: save final model, one for caffe do fake quant test, one
# deploy model for ATC
result_path = 'results/%s' %(args.model_name)
amct.save_model(graph, 'Both', result_path)
If the message "Error: Cannot find scale_d of layer '**' in record file" is displayed, fix the error by referring to "Check scale and offset record file record.txt failed" Is Displayed During Quantization
- (Optional) Run inference on the fake-quantized model to test its accuracy. (Update the sample code based on your situation.)
# Phase4: if need test quantized model, uncomment to do final fake quant
# model test.
fake_quant_model = 'results/{0}_fake_quant_model.prototxt'.format(args.model_name)
fake_quant_weights = 'results/{0}_fake_quant_weights.caffemodel'.format(args.model_name)
run_caffe_model(fake_quant_model, fake_quant_weights, args.iterations)
If you want to use the preceding sample code to quantize other models, tweak it as follows:
- Modify the arguments.
Pass the arguments required to run AMCT. This step is optional; you can also implement it in your own way. A code example is as follows:
class Args(object):
    """struct for Args"""
    def __init__(self):
        self.model_name = ''    # Caffe model name as prefix to save model
        self.model_file = ''    # user caffe model txt define file
        self.weights_file = ''  # user caffe model binary weights file
        self.cpu = True         # If True, force to CPU mode, else set to False
        self.gpu_id = 0         # Set the gpu id to use
        self.pre_test = False   # Set true to run original model test, set
                                # False to run quantize with amct_caffe tool
        self.iterations = 5     # Iteration to run caffe model
        self.cfg_define = None  # If None use

args = Args()
#############################user modified start#########################
"""User set basic info to use amct_caffe tool
"""
# e.g.
args.model_name = 'ResNet50'
args.model_file = 'pre_model/ResNet-50-deploy.prototxt'
args.weights_file = 'pre_model/ResNet-50-model.caffemodel'
args.cpu = True
args.gpu_id = None
args.pre_test = False
args.iterations = 5
args.cfg_define = None
#############################user modified end###########################
- Modify the code lines for running Caffe model inference.
A code example is as follows:
def run_caffe_model(model_file, weights_file, iterations):
    """run caffe model forward"""
    net = caffe.Net(model_file, weights_file, caffe.TEST)
    #############################user modified start#########################
    """User modified to execute caffe model forward
    """
    # # e.g.
    # for iter_num in range(iterations):
    #     data = get_data()
    #     forward_kwargs = {'data': data}
    #     blobs_out = net.forward(**forward_kwargs)
    #     # if have label and need check network forward result
    #     post_process(blobs_out)
    # return
    #############################user modified end###########################
The code is described as follows. Implement model inference based on the specific service network.
- Pass the model file to instantiate a Caffe Net (set phase to caffe.TEST for inference).
net = caffe.Net(model_file, weights_file, caffe.TEST)
- Set iterations to the number of inference iterations.
- Obtain the network data required in each iteration. Preprocess the data based on the service network. For example, for ResNet-50, convert YUV images to the RGB format, resize them to 224 x 224, and subtract the per-channel mean values. Then construct the input blobs as a dictionary in key (blob name): value (NumPy array) format. An illustrative sketch of such a get_data() helper appears at the end of this section.
data = get_data()
forward_kwargs = {'data': data}
- Run a forward inference and obtain the network output.
blobs_out = net.forward(**forward_kwargs)
- The Caffe network output blobs_out is also stored in dictionary format, for example, {'prob1': blob1, 'prob2':blob2}. You can obtain a blob data structure by blob name.
- (Optional) If you need to test the network output, obtain the corresponding data in the preceding format and then compute the image classification or object detection result. This step is not required by AMCT, which only needs the forward pass to collect the network's hidden-layer data; whether to post-process the inference result is up to you.
post_process(blobs_out)
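For reference, the following is a minimal sketch of what a get_data() preprocessing helper could look like for a ResNet-50-style input. It is not part of AMCT; the image path, mean values, and the use of OpenCV are assumptions, so replace them with your own preprocessing pipeline.

import cv2
import numpy as np

# Hypothetical per-channel mean values (BGR order); use the values your model
# was trained with.
MEAN_BGR = np.array([104.0, 117.0, 123.0], dtype=np.float32)

def get_data(image_path='data/example.jpg'):
    """Illustrative preprocessing for a ResNet-50-style Caffe model."""
    img = cv2.imread(image_path)              # HWC, BGR, uint8
    img = cv2.resize(img, (224, 224))         # resize to the network input size
    img = img.astype(np.float32) - MEAN_BGR   # subtract per-channel means
    img = img.transpose(2, 0, 1)              # HWC -> CHW
    return img[np.newaxis, ...]               # add batch dimension -> NCHW blob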
