Quantization Algorithm Principles

Common quantization algorithms include binary quantization, linear quantization, and logarithmic quantization. Linear quantization is further classified as symmetric or asymmetric, depending on whether an offset is used. AMCT works with Ascend SoCs in linear quantization mode. Taking INT8 quantization as an example, the symmetric and asymmetric quantization modes are normalized as follows:

$$data_{float} = scale \times (data_{int8} - offset)$$

For activations and weights on quantization layers, the quantization factors scale (a floating-point number) and offset need to be provided. Their supported value ranges are as follows:

The following describes the origin of the preceding expressions.

Symmetric Quantization Algorithm

The relationship between source high-precision data and quantized INT8 data can be expressed as $data_{float} = scale \times data_{int8}$, where scale is a float32 value. To represent both positive and negative numbers, the signed INT8 data type is used for $data_{int8}$. The source data is converted into INT8 format as follows, where round is a rounding function:

$$data_{int8} = \text{round}\left(\frac{data_{float}}{scale}\right)$$

The value to be determined by the quantization algorithm is the constant scale.

The quantization of weights and activations can therefore be summarized as a search for scale. Because $data_{int8}$ is a signed number, the ranges represented by positive and negative values should be symmetric, so an absolute value operation is first performed on all data. This changes the range of the to-be-quantized data to $[0, data_{max}]$, and then scale is determined. The range of positive INT8 values is [0, 127]. Therefore, scale can be computed as follows:

$$scale = \frac{data_{max}}{127}$$

After scale is determined, the INT8 data represents the range $[-128 \times scale, 127 \times scale]$. Data beyond this range is saturated to the nearest boundary value, and then the quantization operation shown in the formula is performed.
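The symmetric scheme can be sketched in a few lines of Python. This is a minimal illustration, not AMCT code: it assumes scale = max|x| / 127, quantization as round(x / scale), and saturation to [-128, 127]. The function names are hypothetical, and Python's built-in round (which rounds halves to even) stands in for the unspecified rounding function.

```python
def symmetric_scale(data, qmax=127):
    """Return scale = max(|x|) / qmax for a non-empty list of floats."""
    data_max = max(abs(x) for x in data)
    # Guard against an all-zero input, which would give scale == 0.
    return data_max / qmax if data_max > 0 else 1.0


def quantize_symmetric(data, scale, qmin=-128, qmax=127):
    """Convert floats to signed INT8 values.

    Rounding first and clamping afterwards gives the same result as
    saturating the float to [qmin * scale, qmax * scale] before rounding.
    """
    out = []
    for x in data:
        q = round(x / scale)                 # assumption: round-half-to-even
        out.append(max(qmin, min(qmax, q)))  # saturate to the INT8 range
    return out


def dequantize_symmetric(q_data, scale):
    """Recover approximate floats: data_float = scale * data_int8."""
    return [scale * q for q in q_data]


if __name__ == "__main__":
    values = [-1.0, 0.0, 0.5, 1.27]
    s = symmetric_scale(values)
    q = quantize_symmetric(values, s)
    print(s, q, dequantize_symmetric(q, s))
```

Values outside the calibrated range, such as 3.0 with a scale of 0.01, saturate to 127 rather than wrapping around.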

Asymmetric Quantization Algorithm

Compared with the symmetric quantization algorithm, the asymmetric algorithm uses a different data conversion, and both the scale and offset constants need to be determined:

$$data_{float} = scale \times (data_{uint8} + offset)$$

The UINT8 data is computed from the original high-precision data, as shown in the following formula:

$$data_{uint8} = \text{round}\left(\frac{data_{float}}{scale}\right) - offset$$

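scale is a float32 floating-point number, $data_{uint8}$ is an unsigned 8-bit (UINT8) fixed-point number, and offset is an INT8 fixed-point number. Together they represent the data range $[scale \times offset, scale \times (255 + offset)]$. If the value range of the to-be-quantized data is $[data_{min}, data_{max}]$, scale and offset are computed as follows:

$$scale = \frac{data_{max} - data_{min}}{255}, \quad offset = \text{round}\left(\frac{data_{min}}{scale}\right)$$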
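As a rough illustration of the asymmetric scheme, the sketch below assumes scale = (data_max - data_min) / 255, offset = round(data_min / scale), and data_uint8 = round(data_float / scale) - offset, saturated to [0, 255]. These formulas are a reading of the description above, the function names are hypothetical, and Python's round (halves to even) is only a stand-in for the rounding function.

```python
def asymmetric_params(data_min, data_max, levels=255):
    """Return (scale, offset) for the value range [data_min, data_max]."""
    span = data_max - data_min
    # Guard against a degenerate range, which would give scale == 0.
    scale = span / levels if span > 0 else 1.0
    offset = round(data_min / scale)
    return scale, offset


def quantize_asymmetric(data, scale, offset):
    """Convert floats to UINT8 values: round, shift by offset, saturate."""
    out = []
    for x in data:
        q = round(x / scale) - offset
        out.append(max(0, min(255, q)))  # saturate to the UINT8 range
    return out


def dequantize_asymmetric(q_data, scale, offset):
    """Recover approximate floats: data_float = scale * (data_uint8 + offset)."""
    return [scale * (q + offset) for q in q_data]


if __name__ == "__main__":
    scale, offset = asymmetric_params(-1.0, 3.0)
    q = quantize_asymmetric([-1.0, 0.0, 3.0], scale, offset)
    print(scale, offset, q, dequantize_asymmetric(q, scale, offset))
```

For the range [-1.0, 3.0], the lower bound maps to 0, the upper bound to 255, and the float value 0.0 maps to the integer 64, which dequantizes back to exactly 0.0.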

Normalized Quantization Data Format

AMCT uses a unified quantization data format:

$$data_{float} = scale \times (data_{int8} - offset)$$

A simple transformation of the asymmetric quantization formula gives the quantized data the same type as in the symmetric quantization algorithm, INT8.

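The following shows the conversion process, using INT8 quantization as an example. The input source floating-point data is $data_{float}$, the source quantized fixed-point number is $data_{uint8}$, the quantization scale is scale, and the quantization offset is offset (the algorithm requires the quantized range to cross zero to prevent accuracy drop). The calculation principle of quantization is as follows:

$$data_{float} = scale \times (data_{uint8} + offset) = scale \times \left((data_{uint8} - 128) + (offset + 128)\right) = scale \times (data_{int8} - offset')$$

where $data_{int8} = data_{uint8} - 128$ and $offset' = -(offset + 128)$. This conversion expresses the data in INT8 format. After scale and the converted offset' are determined, the INT8 data is computed from the source floating-point data as follows:

$$data_{int8} = \text{round}\left(\frac{data_{float}}{scale}\right) + offset'$$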
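The conversion to the unified format can be checked numerically. The sketch below assumes the asymmetric form data_float = scale * (data_uint8 + offset) and the substitutions data_int8 = data_uint8 - 128 and offset' = -(offset + 128). These are a reconstruction from the description in this section, not a confirmed AMCT definition, and all function names are hypothetical.

```python
def to_normalized(q_uint8, offset):
    """Map (data_uint8, offset) to (data_int8, offset')."""
    q_int8 = [q - 128 for q in q_uint8]  # shift [0, 255] to [-128, 127]
    offset_n = -(offset + 128)           # assumed definition of offset'
    return q_int8, offset_n


def quantize_normalized(data, scale, offset_n):
    """data_int8 = round(data_float / scale) + offset', saturated to INT8."""
    out = []
    for x in data:
        q = round(x / scale) + offset_n  # assumption: round-half-to-even
        out.append(max(-128, min(127, q)))
    return out


def dequantize_normalized(q_int8, scale, offset_n):
    """data_float = scale * (data_int8 - offset')."""
    return [scale * (q - offset_n) for q in q_int8]


if __name__ == "__main__":
    # Example parameters for the range [-1.0, 3.0]: scale = 4 / 255, offset = -64.
    scale, offset = 4.0 / 255, -64
    q_int8, offset_n = to_normalized([0, 64, 255], offset)
    print(q_int8, offset_n)
    print(quantize_normalized([-1.0, 0.0, 3.0], scale, offset_n))
```

Under these assumptions, quantizing directly in the unified format yields the same INT8 values as quantizing to UINT8 first and then subtracting 128, and both forms dequantize to the same floating-point values.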