IFMR Algorithm
The Input Feature Map Reconstruction (IFMR) algorithm determines the optimal quantization factors by searching over the activation distribution. It applies to post-training quantization (PTQ). Figure 1 shows the quantization principle.
The meanings of the numbers in the figure are as follows: [1, 3] indicates [clip_min_start, clip_min_end], 2 indicates clip_min, [4, 6] indicates [clip_max_start, clip_max_end], and 5 indicates clip_max.
As shown in Figure 1, the quantization workflow of this algorithm consists of two basic steps. First, floating-point activations are truncated to the range [clip_min, clip_max] (denoted by [2, 5] in Figure 1). Then, the truncated activations are quantized from floating-point numbers to integers. Truncating sparse values near the boundaries generally improves accuracy, because the remaining values are represented with a finer quantization step.
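The two steps can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the function name clip_and_quantize, the num_bits parameter, and the unsigned asymmetric integer mapping are assumptions made for the example.

```python
def clip_and_quantize(values, clip_min, clip_max, num_bits=8):
    """Clip floats to [clip_min, clip_max], then map them to unsigned integers.

    Hypothetical helper: the integer format (unsigned, asymmetric) is an
    assumption chosen for illustration.
    """
    if clip_max <= clip_min:
        raise ValueError("clip_max must be greater than clip_min")
    levels = 2 ** num_bits - 1
    scale = (clip_max - clip_min) / levels  # float width of one integer step
    quantized = []
    for v in values:
        clipped = min(max(v, clip_min), clip_max)  # step 1: truncate
        quantized.append(round((clipped - clip_min) / scale))  # step 2: quantize
    return quantized, scale


q, scale = clip_and_quantize([-5.0, 0.0, 1.0, 10.4, 300.0], 0.0, 255.0)
print(q, scale)  # [0, 0, 1, 10, 255] 1.0
```

Values outside the clipping range (-5.0 and 300.0 here) collapse onto the boundary integers, which is the accuracy cost that a well-chosen clipping range keeps small.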
The algorithm tunes the clipping range [clip_min, clip_max] repeatedly to find the range that gives the best quantization result. The search range and search step of the IFMR algorithm determine which values of clip_min and clip_max are tried. In the quantization configuration, the clipping range is adjusted through the parameters provided by FMRQuantize. PyTorch is used as an example. For details about the parameters, see Table 1.
- The search range of clip_min (denoted by 2 in Figure 1) is [clip_min_start, clip_min_end] (denoted by [1, 3] in Figure 1), at a search step specified by search_step.
- The search range of clip_max (denoted by 5 in Figure 1) is [clip_max_start, clip_max_end] (denoted by [4, 6] in Figure 1), at a search step specified by search_step.
Sort the activations in descending order and determine clip_min_init based on min_percentile. clip_min_start and clip_min_end are then obtained as follows:
- clip_min_start = clip_min_init × search_range_start
- clip_min_end = clip_min_init × search_range_end
Sort the activations in ascending order and determine clip_max_init based on max_percentile. clip_max_start and clip_max_end are then obtained as follows:
- clip_max_start = clip_max_init × search_range_start
- clip_max_end = clip_max_init × search_range_end
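The derivation of the four search bounds can be sketched as follows. The exact rule for picking a value from the sorted activations is not given here, so the index rule in percentile_init is an assumption, as are the function names; only the four multiplications come from the formulas above.

```python
import math


def percentile_init(activations, percentile, descending):
    """Pick the value at the given percentile position of the sorted activations.

    Assumed rule: the element at position ceil(percentile * n) in the sorted
    order. The actual selection rule may differ.
    """
    ordered = sorted(activations, reverse=descending)
    index = math.ceil(percentile * len(ordered)) - 1
    index = min(len(ordered) - 1, max(0, index))
    return ordered[index]


def search_bounds(activations, min_percentile, max_percentile,
                  search_range_start, search_range_end):
    """Return (start, end) pairs for the clip_min and clip_max searches."""
    clip_min_init = percentile_init(activations, min_percentile, descending=True)
    clip_max_init = percentile_init(activations, max_percentile, descending=False)
    return {
        # With a negative clip_min_init, clip_min_start lies closer to zero
        # than clip_min_end.
        "clip_min": (clip_min_init * search_range_start,
                     clip_min_init * search_range_end),
        "clip_max": (clip_max_init * search_range_start,
                     clip_max_init * search_range_end),
    }


# Illustrative values only, not recommended settings.
bounds = search_bounds([-4.0, -2.0, 0.0, 1.0, 2.0, 8.0],
                       min_percentile=1.0, max_percentile=1.0,
                       search_range_start=0.5, search_range_end=1.5)
print(bounds)  # {'clip_min': (-2.0, -6.0), 'clip_max': (4.0, 12.0)}
```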
Generally, a wider search range (that is, wider [clip_min_start, clip_min_end] and [clip_max_start, clip_max_end] intervals) and a smaller search step (specified by search_step) result in higher quantization accuracy but longer quantization time.
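The trade-off can be made concrete with a small grid search. This sketch rests on several assumptions that are not confirmed here: search_step is applied to the multiplier between search_range_start and search_range_end, clip_min and clip_max are searched jointly, and the best candidate is the one with the lowest mean squared reconstruction error. All function names are hypothetical.

```python
def quantization_mse(values, clip_min, clip_max, num_bits=8):
    """Mean squared error after clipping, quantizing, and dequantizing."""
    scale = (clip_max - clip_min) / (2 ** num_bits - 1)
    error = 0.0
    for v in values:
        clipped = min(max(v, clip_min), clip_max)
        restored = round((clipped - clip_min) / scale) * scale + clip_min
        error += (v - restored) ** 2
    return error / len(values)


def factors(search_range_start, search_range_end, search_step):
    """Multipliers from search_range_start to search_range_end, inclusive."""
    count = int(round((search_range_end - search_range_start) / search_step)) + 1
    return [search_range_start + i * search_step for i in range(count)]


def search_clip_range(values, clip_min_init, clip_max_init,
                      search_range_start, search_range_end, search_step):
    """Try every (clip_min, clip_max) candidate and keep the lowest-error one."""
    grid = factors(search_range_start, search_range_end, search_step)
    best = None
    tried = 0
    for f_min in grid:
        for f_max in grid:
            clip_min = clip_min_init * f_min
            clip_max = clip_max_init * f_max
            if clip_max <= clip_min:
                continue  # skip degenerate ranges
            tried += 1
            mse = quantization_mse(values, clip_min, clip_max)
            if best is None or mse < best[0]:
                best = (mse, clip_min, clip_max)
    return best, tried


values = [-1.0, -0.5, 0.0, 0.3, 0.7, 1.2, 2.0, 4.0]
coarse_best, coarse_tried = search_clip_range(values, -1.0, 4.0, 0.5, 1.5, 0.5)
fine_best, fine_tried = search_clip_range(values, -1.0, 4.0, 0.5, 1.5, 0.25)
print(coarse_tried, fine_tried)  # 9 25
```

Halving the step here nearly triples the number of candidates evaluated. Because the finer grid contains every candidate of the coarser one, its best error can only be equal or lower.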
