Non-Uniform Quantization
Non-uniform quantization (NUQ) is a type of quantization in which the quantization levels are unequally spaced. NUQ clusters the activation distribution obtained after uniform quantization, based on the probability distribution of the activations to be quantized. The clustering is driven by a target compression ratio, that is, the ratio of retained values to source quantized values. Compared with uniform quantization, NUQ further compresses the activation volume while retaining, as far as possible, the information carried by high-probability values, which reduces information loss.
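The clustering idea can be illustrated with a minimal sketch. It is not the tool's actual algorithm: the function name `nuq_cluster`, the "keep the most frequent levels, snap the rest to the nearest kept level" rule, and the sample data are all assumptions made for illustration only.

```python
from collections import Counter


def nuq_cluster(q_values, target_ratio):
    """Illustrative clustering of already uniformly quantized values.

    target_ratio is the number of retained levels divided by the number
    of distinct source levels (0 < target_ratio <= 1).
    """
    counts = Counter(q_values)
    n_keep = max(1, round(len(counts) * target_ratio))
    # Levels that occur most often (highest probability) are retained as-is.
    kept = sorted(level for level, _ in counts.most_common(n_keep))
    # Every value is mapped to the closest retained level;
    # on a tie, the smaller level wins.
    clustered = [min(kept, key=lambda k: (abs(k - v), k)) for v in q_values]
    return clustered, kept


# Hypothetical integer values produced by a prior uniform quantization step.
q = [0, 0, 0, 1, 1, 1, 1, 2, 5, 5, 5, -3]
clustered, kept = nuq_cluster(q, target_ratio=0.6)
print(kept)       # retained levels
print(clustered)  # values after clustering
```

In this sketch, rare levels are merged into their nearest frequent neighbors, so fewer distinct values need to be stored while the most probable values are left untouched.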
Due to hardware restrictions, NUQ is not recommended in this version because it does not yield performance benefits.
During model inference on the Ascend AI Processor, you can increase the weight compression ratio through NUQ (used together with ATC to enable weight compression during building) to reduce the weight transmission overhead and further improve the inference performance. If the inference accuracy of the fake-quantized model in the source ONNX Runtime environment fails to meet your requirement after NUQ, tune the NUQ configuration file config.json to improve the model accuracy. For details, see Manual Tuning.
The supported layers and restrictions are as follows. For details about the quantization sample, see Sample List.
| Supported Layer Type | Restriction |
|---|---|
| Conv | Using a 1-dilated 4 × 4 filter with group = 1 |
| Gemm | transpose_a = false, Alpha = Beta = 1.0 |
Workflow
Static NUQ follows the same principles as Uniform Quantization. Figure 1 shows the basic workflow.
The workflow is described as follows.
1. Obtain the deployable model and the fake-quantized model generated by uniform quantization (see Uniform Quantization for details).
2. Convert the deployable model generated in step 1 into JSON format (see ATC Instructions for details). The JSON file records the fusion information of the quantized model as well as the weight compression information (in the fe_weight_compress field).
3. If weight compression is required, prepare the simplified NUQ configuration file (see Simplified PTQ Configuration File for details) along with the JSON fusion file of the quantized model generated in step 2.
   During NUQ, the fusion JSON file is used to determine which layers support weight compression. A new deployable NUQ model and a new quantization configuration file are then generated.
4. After NUQ, test the inference accuracy of the fake-quantized model in the ONNX Runtime environment. If the accuracy is not as expected, iteratively tune the NUQ configuration file config.json until the expected accuracy is reached. For details about the tuning method, see Manual Tuning.
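To inspect which entries of the fusion JSON file carry weight compression information, a small helper along the following lines may be useful. It is only a sketch: the layout of the JSON file is not described here, so the helper walks any nested structure and reports every occurrence of the fe_weight_compress field. The function name `find_field` and the sample document are hypothetical.

```python
import json


def find_field(node, field, path=""):
    """Yield (path, value) for every occurrence of `field` in nested JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}/{key}" if path else key
            if key == field:
                yield child, value
            yield from find_field(value, field, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_field(item, field, f"{path}[{i}]")


# Hypothetical stand-in for the converted model; the real layout may differ.
sample = json.loads("""
{"graph": [{"op": [
    {"name": "conv1", "attr": {"fe_weight_compress": 1}},
    {"name": "relu1", "attr": {}}
]}]}
""")

hits = list(find_field(sample, "fe_weight_compress"))
for location, value in hits:
    print(location, value)
```

For a real file, replace the inline sample with `json.load()` on the file generated by ATC and check the reported paths against the layers you expect to be compressed.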
