ifmr: IFMR algorithm for activation quantization

The IFMR (Input Feature Map Reconstruction) algorithm searches a data distribution for the truncation range that gives the best quantization result. It is used for post-training quantization (PTQ). The following figure shows the quantization principle.

Figure 1 ifmr: IFMR algorithm for activation quantization

In the preceding figure, [1,3] indicates [clip_min_start, clip_min_end], 2 indicates clip_min, [4,6] indicates [clip_max_start, clip_max_end], and 5 indicates clip_max.

The quantization process has two steps. As shown in Figure 1, the first step truncates the floating-point data to the range [clip_min, clip_max], that is, the interval between points 2 and 5 in Figure 1. The second step quantizes the truncated floating-point data to the integer range. Generally, clipping the sparse values near the boundaries improves accuracy.
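The two steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the tool's implementation: the function name clip_and_quantize, the 8-bit width, and the unsigned integer mapping with an offset at clip_min are all assumptions made for the example.

```python
import numpy as np

def clip_and_quantize(x, clip_min, clip_max, num_bits=8):
    """Truncate x to [clip_min, clip_max], then map it to integer levels.

    Returns the integer codes and the dequantized (reconstructed) values.
    The unsigned mapping used here is an assumption for illustration.
    """
    levels = 2 ** num_bits - 1
    scale = (clip_max - clip_min) / levels
    clipped = np.clip(x, clip_min, clip_max)        # step 1: truncate
    q = np.round((clipped - clip_min) / scale)      # step 2: quantize to [0, levels]
    x_hat = q * scale + clip_min                    # reconstruction in float
    return q.astype(np.int32), x_hat

x = np.array([-4.0, -1.0, 0.0, 0.5, 1.0, 9.0])
q, x_hat = clip_and_quantize(x, clip_min=-1.0, clip_max=1.0)
```

The outliers -4.0 and 9.0 are pulled to the boundaries, so all integer levels are spent on the interval where most of the data lies.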

To obtain the optimal quantization effect, the truncation range [clip_min, clip_max] is varied repeatedly, and the range that gives the best quantization effect is selected as the final result. The IFMR algorithm defines a search range and a search step for clip_min and clip_max, and finds the best values by traversing the candidates. The parameters provided by FMRQuantize in the quantization configuration are used to adjust the truncation range. (The PyTorch framework is used as an example. For details about the parameters, see.)

  • The search range of clip_min (denoted by 2 in Figure 1) is [clip_min_start, clip_min_end] (denoted by [1, 3] in Figure 1), at a search step specified by search_step.
  • The search range of clip_max (denoted by 5 in Figure 1) is [clip_max_start, clip_max_end] (denoted by [4, 6] in Figure 1), at a search step specified by search_step.
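The traversal described above can be sketched as follows. The text does not state how the "quantization effect" is scored or whether clip_min and clip_max are searched jointly, so this sketch assumes a mean squared reconstruction error and an exhaustive search over all pairs. The helper names reconstruction_error and search_clip_range are hypothetical, and the candidate lists are passed in directly instead of being derived from search_step.

```python
import itertools
import numpy as np

def reconstruction_error(x, clip_min, clip_max, num_bits=8):
    """Mean squared error between x and its clipped, quantized reconstruction.

    The error metric and the 8-bit width are assumptions for this sketch.
    """
    scale = (clip_max - clip_min) / (2 ** num_bits - 1)
    q = np.round((np.clip(x, clip_min, clip_max) - clip_min) / scale)
    return float(np.mean((q * scale + clip_min - x) ** 2))

def search_clip_range(x, min_candidates, max_candidates):
    """Try every (clip_min, clip_max) pair and keep the one with the lowest error."""
    best = None
    for clip_min, clip_max in itertools.product(min_candidates, max_candidates):
        if clip_max <= clip_min:
            continue  # skip empty or inverted ranges
        err = reconstruction_error(x, clip_min, clip_max)
        if best is None or err < best[0]:
            best = (err, float(clip_min), float(clip_max))
    return best

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
min_candidates = np.linspace(-4.0, -2.0, 5)   # stands in for [clip_min_start, clip_min_end]
max_candidates = np.linspace(2.0, 4.0, 5)     # stands in for [clip_max_start, clip_max_end]
err, clip_min, clip_max = search_clip_range(x, min_candidates, max_candidates)
```

The cost grows with the product of the two candidate counts, which is why the range and step settings affect quantization time.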

With the activations sorted in descending order, clip_min_init is determined based on the min_percentile parameter. clip_min_start and clip_min_end are then derived from it:

  • clip_min_start = clip_min_init * search_range_start
  • clip_min_end = clip_min_init * search_range_end

With the activations sorted in ascending order, clip_max_init is determined based on the max_percentile parameter. clip_max_start and clip_max_end are then derived from it:

  • clip_max_start = clip_max_init * search_range_start
  • clip_max_end = clip_max_init * search_range_end
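A small sketch of how the four search bounds could be derived is shown below. The text does not spell out how a percentile maps to a position in the sorted data, so the indexing rule here (the element at fraction p of the way through the sorted sequence) is an assumption, as are the function name init_clip_bounds and the sample values.

```python
import numpy as np

def init_clip_bounds(x, min_percentile, max_percentile,
                     search_range_start, search_range_end):
    """Derive the search bounds for clip_min and clip_max from percentiles.

    Assumed indexing rule: pick the element located at fraction p of the
    sorted sequence (descending for the minimum, ascending for the maximum).
    """
    ascending = np.sort(np.asarray(x, dtype=np.float64).ravel())
    descending = ascending[::-1]
    last = ascending.size - 1
    clip_min_init = descending[int(round(min_percentile * last))]
    clip_max_init = ascending[int(round(max_percentile * last))]
    return {
        "clip_min_init": float(clip_min_init),
        "clip_max_init": float(clip_max_init),
        "clip_min_start": float(clip_min_init * search_range_start),
        "clip_min_end": float(clip_min_init * search_range_end),
        "clip_max_start": float(clip_max_init * search_range_start),
        "clip_max_end": float(clip_max_init * search_range_end),
    }

x = np.arange(-50, 51)  # 101 sample activations from -50 to 50
bounds = init_clip_bounds(x, min_percentile=0.9, max_percentile=0.9,
                          search_range_start=0.5, search_range_end=1.5)
```

With these sample values, clip_min_init is negative, so clip_min_start is numerically greater than clip_min_end. Code that enumerates candidates between the two bounds should not assume that start is the smaller one.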

Generally, wider search ranges [clip_min_start, clip_min_end] and [clip_max_start, clip_max_end] and a smaller search_step yield higher quantization precision but make quantization more time-consuming.
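The time cost can be estimated by counting candidates. The sketch below assumes that search_step is applied to the multiplier that runs from search_range_start to search_range_end. The text does not confirm this, and the helper num_candidates and the sample values are hypothetical.

```python
import math

def num_candidates(search_range_start, search_range_end, search_step):
    """Number of multipliers from start to end (inclusive) at the given step.

    A small epsilon guards against floating-point error in the division.
    """
    span = search_range_end - search_range_start
    return int(math.floor(span / search_step + 1e-9)) + 1

coarse = num_candidates(0.7, 1.3, 0.1)    # wider step, fewer candidates per bound
fine = num_candidates(0.7, 1.3, 0.01)     # 10x smaller step
```

If clip_min and clip_max are traversed jointly, the number of evaluated ranges is the product of the two per-bound counts, so reducing the step by a factor of 10 raises the work by roughly a factor of 100.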