Non-Uniform Quantization
Non-uniform quantization (NUQ) is a form of quantization in which the quantization levels are unequally spaced. After uniform quantization (UQ), NUQ clusters the activation values based on their probability distribution, guided by a target compression ratio: the ratio of reserved values to the original quantized values. Compared with uniform quantization, NUQ further compresses the activation volume while retaining the high-probability activation values as faithfully as possible, reducing the loss of activation information.
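The clustering idea can be illustrated with a minimal sketch: keep only the most frequent quantization levels (as dictated by the compression ratio) and map every other value to the nearest reserved level. This is a simplified illustration, not the AMCT implementation; the function name and frequency-based selection are assumptions.

```python
import numpy as np

def nuq_compress(q_values, compression_ratio):
    """Illustrative non-uniform re-quantization sketch (not the AMCT algorithm).

    q_values: integer array produced by uniform quantization.
    compression_ratio: fraction of distinct quantized values to reserve,
        i.e. reserved values / original quantized values.
    """
    values, counts = np.unique(q_values, return_counts=True)
    n_reserved = max(1, int(round(len(values) * compression_ratio)))
    # Reserve the most frequent quantization levels, so high-probability
    # activations are kept exactly.
    reserved = np.sort(values[np.argsort(counts)[::-1][:n_reserved]])
    # Map every value to its nearest reserved level.
    idx = np.abs(q_values[..., None] - reserved[None, :]).argmin(axis=-1)
    return reserved[idx]
```

With a ratio of 0.5 over four distinct levels, the two most frequent levels survive and the rare ones are snapped to their nearest neighbor, which is where the compression gain comes from.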
Due to hardware restrictions, performing NUQ is not recommended in this version; otherwise, the expected performance benefits cannot be obtained.
During model inference on the Ascend AI Processor, NUQ can increase the weight compression ratio (used together with ATC to enable weight compression during compilation), reducing the weight transmission overhead and further improving inference performance. After NUQ, test the inference accuracy of the fake-quantized model in the original ONNX Runtime environment. If the accuracy is not as expected, tune the NUQ configuration file config.json to recover it. For details, see Manual Tuning.
The supported layers and restrictions are as follows. For details about quantization examples, see resnet101 in Sample List.
| Supported Layer Type | Restriction |
|---|---|
| Conv | 1-dilated 4 x 4 filter with group = 1 |
| Gemm | transpose_a = false, alpha = beta = 1.0 |
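The restrictions above can be expressed as a simple eligibility check. The function below is illustrative only: the attribute names and the plain-dict representation stand in for real ONNX node attributes and are assumptions, not part of the AMCT API.

```python
def supports_weight_compression(op_type, attrs):
    """Check the documented NUQ layer restrictions.

    attrs is a plain dict standing in for ONNX node attributes
    (illustrative key names, not the AMCT API).
    """
    if op_type == "Conv":
        # Restriction: 1-dilated 4 x 4 filter with group = 1.
        return (attrs.get("dilation", 1) == 1
                and attrs.get("kernel_shape") == [4, 4]
                and attrs.get("group", 1) == 1)
    if op_type == "Gemm":
        # Restriction: transpose_a = false, alpha = beta = 1.0.
        return (not attrs.get("transpose_a", False)
                and attrs.get("alpha", 1.0) == 1.0
                and attrs.get("beta", 1.0) == 1.0)
    # Other layer types are not eligible for weight compression.
    return False
```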
Workflow
Static NUQ follows the same principles as Uniform Quantization. Figure 1 shows the basic workflow.
The workflow is described as follows.
- Obtain the deployable model and the fake-quantized model generated by uniform quantization (see Uniform Quantization).
- Convert the deployable model into the .json format (see ATC Instructions for details). The .json file records the fusion information of the quantized model as well as the weight compression information (in the fe_weight_compress field).
- If weight compression is required, pass the simplified configuration file (see Simplified PTQ Configuration File for details) together with the .json fusion file of the quantized model to NUQ.
  During NUQ, the fusion .json file is used to determine which layers support weight compression; NUQ then generates a new deployable model and a new quantization configuration file.
- After NUQ, test the inference accuracy of the fake-quantized model in the ONNX Runtime environment. If the accuracy is not as expected, iteratively tune the NUQ configuration file config.json until the expected accuracy is reached. For details, see Manual Tuning.
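The accuracy check in the last step amounts to comparing the fake-quantized model's predictions against the float baseline on a validation set. A minimal sketch of that comparison is below; the helper names are assumptions, and the logits would in practice be collected by running each ONNX model through `onnxruntime.InferenceSession`.

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Top-1 accuracy from a batch of logits.

    In practice, logits come from running the model with ONNX Runtime,
    e.g. onnxruntime.InferenceSession(model_path).run(...); here the
    arrays are passed in directly to keep the sketch self-contained.
    """
    return float(np.mean(np.argmax(logits, axis=1) == labels))

def accuracy_drop(baseline_logits, quantized_logits, labels):
    """Accuracy loss of the fake-quantized model versus the float baseline.

    If the drop exceeds your tolerance, tune config.json and rerun NUQ,
    as described in Manual Tuning.
    """
    return top1_accuracy(baseline_logits, labels) - top1_accuracy(quantized_logits, labels)
```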
