Non-Uniform Quantization
Non-uniform quantization (NUQ) is a type of quantization in which the quantization levels are unequal. After uniform quantization (UQ), NUQ clusters the activation distribution based on the probability distribution of the activations to be quantized. Clustering is driven by a target compression ratio, that is, the ratio of reserved values to original quantized values. Compared with uniform quantization, NUQ further compresses the activation volume while retaining as much of the high-probability activation information as possible, reducing activation loss.
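To illustrate the clustering idea (this is a sketch of the general technique, not the AMCT implementation), the following example uses a simple 1-D k-means over already-quantized values: cluster centers initialized at quantiles naturally land more densely where the distribution has high probability, and the `compression_ratio` argument mirrors the reserved-to-original level ratio described above. The function name and the synthetic INT8-like data are assumptions for illustration.

```python
import numpy as np

def nonuniform_quantize(values, compression_ratio=0.25, iters=20):
    """Cluster already-quantized values into fewer levels (1-D k-means).

    compression_ratio = reserved levels / original quantized levels, so
    high-probability regions keep more representative values.
    """
    levels = np.unique(values)
    k = max(1, int(len(levels) * compression_ratio))
    # Initialize centers at evenly spaced quantiles of the distribution,
    # which places more centers where the data is dense.
    centers = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        # Assign each value to its nearest center.
        idx = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        # Move each center to the mean of its assigned values.
        for j in range(k):
            if np.any(idx == j):
                centers[j] = values[idx == j].mean()
    return centers[idx]

# Synthetic "INT8 activations" drawn from a bell-shaped distribution.
rng = np.random.default_rng(0)
acts = np.round(rng.normal(0, 16, 4000)).clip(-128, 127)
clustered = nonuniform_quantize(acts, compression_ratio=0.25)
```

Because centers track the data density, values near the distribution's peak are mapped with finer granularity than values in the tails, which is what limits the accuracy loss relative to simply dropping levels uniformly.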
Due to hardware restrictions, you are advised not to perform NUQ in this version; otherwise, no performance benefit can be obtained.
During model inference on the Ascend AI Processor, you can increase the weight compression ratio through NUQ (used together with ATC to enable weight compression during compilation) to reduce the weight transmission overhead and further improve inference performance. After NUQ, test the inference accuracy of the fake-quantized model in the source Caffe environment. If the accuracy is not as expected, you can tune the NUQ configuration file config.json to recover it. NUQ is classified into static NUQ and auto NUQ, depending on whether the quantization configuration is tuned manually or automatically.
- Static NUQ: The user manually tunes the quantization parameters. For details, see Manual Tuning. For details about the quantization process, see Non-Uniform Quantization > Static Non-Uniform Quantization.
- Auto NUQ: AMCT automatically searches for a quantization configuration that yields a higher model compression ratio while meeting the accuracy requirements. First, implement a subclass that inherits from the provided auto NUQ base class AutoNuqEvaluatorBase. Then, implement the eval_model(self, model_file, weights_file, batch_num) and is_satisfied(self, original_metric, new_metric) methods.
- eval_model evaluates a model: it performs data preprocessing, model inference, and data postprocessing based on the batch_num argument, and returns the evaluation result. The result must be a single value, for example, the top 1 accuracy of an image classification network, the mAP of an object detection network, or a weighted combination of several metrics.
- is_satisfied determines whether the accuracy drop of the quantized model is within the defined limit, returning True if it is and False otherwise. For example, if the top 1 accuracy of a classification network is expressed as a decimal fraction, the condition can be if (original_metric - new_metric) * 100 < 1, meaning that the accuracy drop must be less than 1%.
For a quantization example, see Non-Uniform Quantization > Automatic Non-Uniform Quantization.
The layers that support NUQ are listed as follows.

| Technique | Supported Layer Type | Restriction |
|---|---|---|
| Non-Uniform Quantization | Convolution | 1-dilated 4 x 4 filter with group = 1 |
| Non-Uniform Quantization | InnerProduct | transpose = false, axis = 1 |
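Before quantizing, it can be useful to list up front which layers of a model are even candidates for NUQ. The sketch below scans a Caffe deploy prototxt for the supported layer types from the table above; it is a naive text-based helper written for this document (not an AMCT utility) and assumes a flat prototxt with `name:` and `type:` fields. The per-layer restrictions (dilation/group for Convolution, transpose/axis for InnerProduct) still have to be checked separately.

```python
import re

SUPPORTED_NUQ_TYPES = {"Convolution", "InnerProduct"}

def nuq_candidate_layers(prototxt_text):
    """Return (name, type) pairs for layers whose type supports NUQ."""
    candidates = []
    # Naive split on 'layer {' blocks; sufficient for simple, flat
    # deploy prototxts without nested sub-messages before name/type.
    for block in prototxt_text.split("layer {")[1:]:
        name = re.search(r'name:\s*"([^"]+)"', block)
        ltype = re.search(r'type:\s*"([^"]+)"', block)
        if name and ltype and ltype.group(1) in SUPPORTED_NUQ_TYPES:
            candidates.append((name.group(1), ltype.group(1)))
    return candidates
```

Layers that do not appear in the returned list (for example, ReLU or Pooling layers) are left untouched by NUQ.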
API Call Sequence
Figure 1 shows the API call sequence for NUQ. NUQ cannot be run on more than one GPU.
Follow these steps to perform auto NUQ. For the static NUQ procedure, see Uniform Quantization.
- Build an original Caffe model and then generate a quantization configuration file by using the create_quant_config API.
- Use the Caffe model and quantization configuration file to initialize the tool by using the init API, configure the quantization factor record file, and then parse the model into a graph.
- Pass the source model file, the JSON file converted by ATC (which preserves the weight compression information of certain layers in the fe_weight_compress field), and the evaluator instance to the auto_nuq API to run auto NUQ. The evaluator instance is used to evaluate the accuracy of the source model.
Calling Example
- Import the AMCT package.
```python
import amct_caffe as amct
from amct_caffe.auto_nuq import AutoNuqEvaluatorBase
```
- Derive a class from the given base class AutoNuqEvaluatorBase. Then, implement two methods: eval_model (for evaluating the model accuracy) and is_satisfied (for determining if the quantized model meets accuracy requirements).
```python
class AutoNuqEvaluator(AutoNuqEvaluatorBase):
    # Auto NUQ evaluator
    def __init__(self, evaluate_batch_num):
        super().__init__(self)
        self.evaluate_batch_num = evaluate_batch_num
```
- Implement the model accuracy evaluation method eval_model. eval_model evaluates a model: it performs data preprocessing, model inference, and data postprocessing based on the batch_num argument, and returns the evaluation result. The result must be a single value, for example, the top 1 accuracy of an image classification network, the mAP of an object detection network, or a weighted combination of several metrics.
```python
def eval_model(self, model_file, weights_file, batch_num):
    return do_benchmark_test(QUANT_ARGS, model_file, weights_file, batch_num)
```
- Implement the is_satisfied method for evaluating the accuracy loss. is_satisfied determines whether the accuracy drop of the quantized model is within the defined limit, returning True if it is and False otherwise. For example, if the top 1 accuracy of an image classification network is expressed as a decimal fraction, the condition can be if (original_metric - new_metric) * 100 < 1, meaning that the accuracy drop must be less than 1%.
- original_metric indicates the accuracy of the original unquantized model.
- new_metric indicates the accuracy of the fake-quantized model. True or False is returned depending on whether the accuracy loss meets the threshold.
```python
def is_satisfied(self, original_metric, new_metric):
    # The loss of top 1 accuracy must be less than 1%.
    if (original_metric - new_metric) * 100 < 1:
        return True
    return False
```
- Generate a quantization configuration file.
```python
config_json_file = os.path.join(TMP, 'config.json')
skip_layers = []
batch_num = 2
activation_offset = True
# Do weights calibration with the non-uniform quantization configuration.
amct.create_quant_config(
    config_json_file, args.model_file, args.weights_file, skip_layers,
    batch_num, activation_offset, args.cfg_define)
scale_offset_record_file = os.path.join(TMP, 'scale_offset_record.txt')
result_path = os.path.join(RESULT, 'ResNet50')
```
- Start the automatic non-uniform quantization process.
```python
evaluator = AutoNuqEvaluator(args.iterations)
amct.auto_nuq(
    args.model_file,
    args.weights_file,
    evaluator,
    config_json_file,
    scale_offset_record_file,
    result_path)
```

