Overview

This section describes the quantization options supported by AMCT.

Quantitative Classification

There are two quantization types based on whether retraining is required: PTQ (post-training quantization) and Performs quantization aware training. (quantization aware training). For details, see Quantization.

Post-training quantization
Quantization takes two forms: Manual Quantization and Accuracy-based Automatic Quantization based on whether the quantization configuration file is manually tuned after quantization. For details about the quantization algorithms for PTQ, see PTQ Algorithms.

Two forms Uniform Quantization and Non-Uniform Quantization are divided based on whether the weight data is compressed. If the quantization accuracy does not meet the requirement, you can perform accuracy-based automatic quantization. Accuracy-based automatic quantization is recommended.
Quantization aware training
Currently, QAT supports quantization only for FP32 network models. For details about the quantization algorithms for QAT, see QAT Algorithms.

Currently, only manual quantization is supported. If you find that the accuracy of the quantized model is not as expected, perform Manual Tuning.

Terminology

The terms used in the quantization process are explained as follows:

**Table 1** Related Concepts in the Quantization Process
Terminology	Description
Activation quantization and weight quantization	PTQ and QAT are further classified into activation or weight quantization based on different quantization objects. Currently, Ascend AI Processor supports symmetric/asymmetric quantization of activations, but supports only symmetric quantization of weights. These symmetric and asymmetric modes depend on whether the data center point is 0 after quantization. For details about the quantization algorithm, see Quantization Algorithm Principles. Activation quantization Activation quantization refers to quantizing activations to lower-bit representations according to the distribution of activation values. The activation volume is often high and the value distribution at each layer is not determinable until the forward pass (either inference or training). Therefore, activation quantization is inference-time or training-time based. In PTQ: online quantization. Modify your inference model by inserting a bypass quantization node at the layer to be quantized to collect activations input to the layer, and then obtain the quantization factors scale and offset through calibration. This process leverages a subset of samples from the dataset, improving the efficiency. In QAT: Train the quantization factors scale and offset for activations. Specifically, use scale and offset to quantize activations in the forward pass, and then incorporate the gradients into scale and offset updates in the backward pass. Weight quantization Weight quantization refers to quantizing weights to lower-bit representations according to the distribution of weight values. In PTQ, the weight quantization is conducted offline. Weights are read directly from your inference model, and are quantized using the quantization algorithm. After that, the quantized weights are written back to your model for activation quantization. In QAT: Train weights. Specifically, quantize the original floating-point weights in the forward pass, and use the quantized weights to run forward inference. Then incorporate the gradients into weight updates in the backward pass.
Quantization bit width	There are typically INT8, INT4, INT16, and binary quantization types based on the low bit width after quantization. The current version supports only INT8 quantization. INT8 quantization uses 8-bit (int8) data to represent 32-bit (float32) data, and converts a float32 convolution operation (multiply-add operation) into an int8 convolution operation. This reduces the model size, speeding up computing. INT4 quantization uses 4-bit (INT4) data to represent 32-bit (FP32) data. Compared with INT8 quantization, INT4 quantization results in smaller model size but higher accuracy loss. Fortunately, QAT can be used to compensate for the accuracy loss, though the whole process could be time-consuming. INT16 quantization uses 16-bit (INT16) data to represent 32-bit (FP32) data, and converts an FP32 convolution operation (multiply-add operation) into an INT16 convolution operation. This reduces the model size, speeding up computing. If the INT8 quantization accuracy cannot meet the requirements, you are advised to use INT16 quantization.
Test dataset	A test dataset is a dataset subset for the final test of model accuracy.
Calibration	Calibration refers to the forward inference process in PTQ. Calibration is conducted to determine the quantization factors for quantizing activations.
Calibration dataset	A calibration dataset is used during the forward inference in PTQ. The calibration dataset should contain a sufficient number of representative samples. A subset of samples from the test dataset is suggested. If the selected calibration dataset does not match your model or is not representative enough, the calculated quantization factors will show poor behavior on the complete dataset, resulting in high accuracy drop.
Training dataset	A subset of datasets, which is used to train models based on the datasets in the user training network.
Quantization factors	Quantization factors are the parameters used to quantize floating-point values into integer values. The factors include Scale and Offset. The formula for quantizing a floating point into an integer (for example, INT8) is as follows:
Scale	The quantization factor for scaling floating points, including: scale_d for activation quantization. Only unified activation quantization is supported. scale_w for weight quantization. If a scalar is set, the weights of the current layer are quantized in a unified manner; if a vector is set, the weights of the current layer are quantized channel-wise. For more details, see Record Files.
Offset	The quantization factor for the offset, including: offset_d for activation quantization. Only unified activation quantization is supported. offset_w for weight quantization. If a scalar is set, the weights of the current layer are quantized in a unified manner; if a vector is set, the weights of the current layer are quantized channel-wise. The dimensions of offset_w must be consistent with those of scale_w. For more details, see Record Files.
Quantization sensitivity	Model outputs vary at different precision levels. A model generally delivers better inference accuracy in high precision. Quantization, which reduces the precision of a model or certain layers, may hurt the model's inference accuracy. To quantify the impact, the concept of quantization sensitivity is introduced. Quantization sensitivity evaluates how a network model or a quantizable layer reacts to quantization. Quantization sensitivity is calculated by comparing the difference between the output of the network or the output of a layer before and after quantization. Common indicators include mean square error (MSE) and cosine similarity.
Bit complexity	The floating-point computation amount of a model layer is denoted as Flops. Bit complexity (denoted as bitops) is defined as the product of Flops and representational precisions, showing the differences in computational resource demands between precisions (for example, FP32, FP16, INT8, or INT4). The calculation formula is as follows, where act_bit indicates the activation precision and wts_bit the weight precision. $\text{[math]}$

Parent topic: Quantization