Manual Tuning
If the accuracy after post-training quantization (PTQ) does not meet your requirements, you can manually adjust the parameters in the config.json file. This section describes the tuning principles and the parameters.
Workflow
If the model quantized with the initial config.json file generated by the create_quant_config API call is not accurate enough, tune the configuration parameters until the accuracy meets your requirement. Manual tuning of the PTQ configuration file config.json goes through the following three phases:
- Tune the amount of data used for calibration.
- Skip quantizing certain layers.
- Tune the quantization algorithm and parameters.
Specifically:
1. Run quantization based on the initial config.json file generated by the create_quant_config API. If the accuracy of the quantized model is satisfactory, stop tuning the configuration parameters. Otherwise, go to step 2.
2. Adjust the value of batch_num to tune the amount of data used for calibration.
   batch_num controls the number of batches used for quantization. Tune it based on the batch size and the dataset size.
   A larger value of batch_num means that more data samples are used for quantization, which generally reduces the accuracy drop of the quantized model. However, excessive data does not necessarily improve accuracy, while it does consume more memory and slow down quantization, and it can exhaust memory, video RAM, and thread resources. A good tradeoff is usually reached when the product of batch_num and batch_size (the number of images per batch) is 16 or 32.
3. Run quantization based on the new configuration from step 2. If the accuracy of the quantized model is satisfactory, stop tuning the configuration parameters. Otherwise, go to step 4.
4. Set quant_enable to skip quantizing certain layers.
   quant_enable is the quantization switch of a specified layer. The value false means that the layer is skipped during quantization, and true means that it is quantized. Removing the layer's configuration also skips the layer.
   Quantizing a model can reduce its accuracy. Layers that are sensitive to quantization show a marked increase in error once quantized and should therefore be left unquantized. Identify these layers as follows:
   - The input layer, the output layer, and layers with very few parameters are likely to be quantization-sensitive.
   - Use the Model Accuracy Analyzer to compare the outputs of the original model and the quantized model layer by layer (for example, expecting a cosine similarity of at least 0.99). Locate the layers that reduce accuracy the most and exclude them from quantization first.
5. Run quantization based on the new configuration from step 4. If the accuracy of the quantized model is satisfactory, stop tuning the configuration parameters. Otherwise, go to step 6.
6. Adjust the values of activation_quant_params and weight_quant_params to tune the quantization algorithms and their parameters.
   For details about the algorithm parameters, see the parameter description in Quantization Configuration File. For details about the algorithms, see PTQ Algorithms.
7. Run quantization based on the new configuration from step 6. If the accuracy of the quantized model is satisfactory, stop tuning the configuration parameters. Otherwise, the model is not suitable for quantization, and the quantization configuration should be removed.
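The layer-skipping step of this workflow can be sketched in Python. This is a minimal illustration, not the tool's own implementation. It assumes that the per-layer outputs of the original and quantized models have already been exported as flat lists of floats, and that each layer's configuration is a top-level object in config.json keyed by the layer name. The helper names cosine_similarity and disable_sensitive_layers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equally sized flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Two all-zero outputs count as identical; otherwise as a mismatch.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def disable_sensitive_layers(config, original_outputs, quantized_outputs,
                             threshold=0.99):
    """Set quant_enable to False for layers whose similarity is below threshold.

    config is the parsed config.json (assumed layout: one object per layer,
    keyed by layer name). Returns the names of the layers switched off,
    ordered from lowest to highest similarity.
    """
    scores = {}
    for name, reference in original_outputs.items():
        if name in quantized_outputs and isinstance(config.get(name), dict):
            scores[name] = cosine_similarity(reference, quantized_outputs[name])
    sensitive = sorted((n for n, s in scores.items() if s < threshold),
                       key=scores.get)
    for name in sensitive:
        config[name]["quant_enable"] = False
    return sensitive


# Illustrative data only: "conv1" and "fc" are made-up layer names.
config = {"batch_num": 1,
          "conv1": {"quant_enable": True},
          "fc": {"quant_enable": True}}
original = {"conv1": [1.0, 2.0, 3.0], "fc": [1.0, 0.0, 0.0]}
quantized = {"conv1": [1.0, 2.0, 3.1], "fc": [0.0, 1.0, 0.0]}
skipped = disable_sensitive_layers(config, original, quantized)
```

In practice the dictionary would be read from and written back to config.json with the standard json module before rerunning quantization.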
Quantization Configuration File
If inference with the config.json quantization configuration file generated by the create_quant_config call shows a significant accuracy drop, tune the config.json file until the accuracy is as expected. For an example of the JSON quantization configuration file, see Example. Keep the layer names unique in the file.
The following tables describe the parameters in the configuration file.

| Description | Version number of the quantization configuration file |
|---|---|
| Type | Integer |
| Value | 1 |
| Remarks | Currently, only version 1 is available. |
| Recommended Value | 1 |
| Required/Optional | Optional |

| Description | Batch number for quantization |
|---|---|
| Type | Integer |
| Value | Greater than 0 |
| Remarks | Defaults to 1. You are advised to keep the calibration dataset size within 50 images. Calculate batch_num from batch_size as follows: batch_num × batch_size = calibration dataset size, where batch_size is the number of images per batch. |
| Recommended Value | 1 |
| Required/Optional | Optional |
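
The relation batch_num × batch_size = calibration dataset size can be turned into a small helper. This is a sketch under stated assumptions: the function name batch_num_for and its parameters target_samples and max_samples are hypothetical, and the default target of 32 samples simply reflects the tradeoff suggested in the workflow.

```python
def batch_num_for(batch_size, target_samples=32, max_samples=50):
    """Pick batch_num so that batch_num * batch_size stays near target_samples.

    The target is capped at max_samples (the advised calibration set size).
    If a single batch already exceeds the target, 1 is returned, because
    batch_num must be greater than 0.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be greater than 0")
    target = min(target_samples, max_samples)
    return max(1, target // batch_size)


# With 8 images per batch, 4 batches give 32 calibration images.
example_batch_num = batch_num_for(8)
```

Integer division rounds down, so the resulting calibration set never exceeds the target unless one batch alone is larger than it.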

| Description | Selects symmetric or asymmetric quantization for activations. This is a global configuration parameter. If both the asymmetric and activation_offset parameters exist in the configuration file, asymmetric takes precedence. |
|---|---|
| Type | Boolean |
| Value | true or false |
| Remarks | - |
| Recommended Value | true |
| Required/Optional | Optional |

| Description | Fusion switch |
|---|---|
| Type | Boolean |
| Value | true or false |
| Remarks | Currently, only Conv+BN fusion is supported. |
| Recommended Value | true |
| Required/Optional | Optional |

| Description | Layers to skip fusion |
|---|---|
| Type | String |
| Value | Must be names of fusible layers. Currently, only Conv+BN fusion is supported. |
| Remarks | Sets the layers to skip fusion. |
| Recommended Value | - |
| Required/Optional | Optional |

| Description | Quantization configuration of a network layer |
|---|---|
| Type | Object |
| Value | - |
| Remarks | Includes the following parameters: |
| Recommended Value | - |
| Required/Optional | Optional |

| Description | Quantization enable |
|---|---|
| Type | Boolean |
| Value | true or false |
| Remarks | - |
| Recommended Value | true |
| Required/Optional | Optional |

| Description | Migration strength of the DMQ Balancer |
|---|---|
| Type | Float |
| Value | [0.2, 0.8] |
| Remarks | Degree to which the quantization difficulty of activations is migrated to weights. Set the migration strength to a small value if there are many outliers in the activation distribution. |
| Recommended Value | 0.5 |
| Required/Optional | Optional |

| Description | Activation quantization parameters |
|---|---|
| Type | Object |
| Value | - |
| Remarks | Includes the following parameters. Note that IFMR algorithm parameters and HFMG algorithm parameters are mutually exclusive within the same layer. |
| Recommended Value | - |
| Required/Optional | Optional |
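
The mutual exclusion between IFMR and HFMG parameters can be checked before running quantization. The sketch below is illustrative only: check_activation_params is a hypothetical helper, and apart from search_step and num_of_bins, the key names in IFMR_KEYS are assumptions about how the IFMR parameters are spelled in config.json.

```python
# Key names are assumptions except search_step and num_of_bins.
IFMR_KEYS = {"max_percentile", "min_percentile", "search_range", "search_step"}
HFMG_KEYS = {"num_of_bins"}


def check_activation_params(params):
    """Return "ifmr", "hfmg", or None for one layer's activation parameters.

    Raises ValueError if parameters of both algorithms appear together.
    """
    ifmr = sorted(IFMR_KEYS.intersection(params))
    hfmg = sorted(HFMG_KEYS.intersection(params))
    if ifmr and hfmg:
        raise ValueError(
            f"IFMR parameters {ifmr} cannot be combined with "
            f"HFMG parameters {hfmg} in the same layer"
        )
    if hfmg:
        return "hfmg"
    return "ifmr" if ifmr else None


family = check_activation_params({"search_step": 0.01})
```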

| Description | Weight quantization parameters |
|---|---|
| Type | Object |
| Value | - |
| Remarks | - |
| Recommended Value | - |
| Required/Optional | Optional |

| Description | Quantization bit width |
|---|---|
| Type | Integer |
| Value | 8 or 16 |
| Remarks | Currently, the value can only be 8, indicating that the INT8 quantization bit width is used. |
| Recommended Value | - |
| Required/Optional | Required |

| Description | Activation quantization algorithm |
|---|---|
| Type | String |
| Value | ifmr or hfmg |
| Remarks | ifmr: IFMR algorithm for activation quantization. hfmg: HFMG algorithm for activation quantization. |
| Recommended Value | - |
| Required/Optional | Optional |

| Description | Selects symmetric or asymmetric quantization for activations. It is used to select the quantization algorithm layer by layer. If both the asymmetric and activation_offset parameters exist in the configuration file, asymmetric takes precedence. |
|---|---|
| Type | Boolean |
| Value | true or false |
| Remarks | - |
| Recommended Value | true |
| Required/Optional | Optional |

| Description | Upper bound for the maximum-value search of the IFMR activation quantization algorithm |
|---|---|
| Type | Float |
| Value | (0.5, 1] |
| Remarks | For example, given 100 numeric values in descending order, the upper bound 1.0 indicates that the value at index 0 (100 – 100 × 1.0) is taken as the largest. A larger value indicates that the upper bound for clipping-based quantization is closer to the maximum value of the data to be quantized. |
| Recommended Value | 0.999999 |
| Required/Optional | Optional |

| Description | Lower bound for the minimum-value search of the IFMR activation quantization algorithm |
|---|---|
| Type | Float |
| Value | (0.5, 1] |
| Remarks | For example, given 100 numeric values in ascending order, the lower bound 1.0 indicates that the value at index 0 (100 – 100 × 1.0) is taken as the smallest. A larger value indicates that the lower bound for clipping-based quantization is closer to the minimum value of the data to be quantized. |
| Recommended Value | 0.999999 |
| Required/Optional | Optional |
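
The index arithmetic in the two examples above (index = n – n × bound on the sorted data) can be sketched as follows. This only illustrates how the bounds select the clipping values. It is not the IFMR implementation. The function name clip_bounds and its parameter names are hypothetical, and rounding the index to the nearest integer is an assumption.

```python
def clip_bounds(values, upper_bound=0.999999, lower_bound=0.999999):
    """Return (smallest, largest) clipping values selected by the two bounds.

    The largest is read from the data sorted in descending order and the
    smallest from the data sorted in ascending order, both at index
    n - n * bound.
    """
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    for bound in (upper_bound, lower_bound):
        if not 0.5 < bound <= 1.0:
            raise ValueError("bounds must be in (0.5, 1]")

    def index_for(bound):
        # Rounding to the nearest integer is an assumption of this sketch.
        return min(n - 1, max(0, round(n - n * bound)))

    descending = sorted(values, reverse=True)
    ascending = sorted(values)
    return ascending[index_for(lower_bound)], descending[index_for(upper_bound)]


data = list(range(1, 101))          # 100 values: 1, 2, ..., 100
full_range = clip_bounds(data, 1.0, 1.0)
clipped = clip_bounds(data, 0.9, 0.9)
```

With both bounds at 1.0 the full data range is kept, while smaller bounds discard the most extreme values at each end.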

| Description | Quantization factor search range ([search_range_start, search_range_end]) of the IFMR algorithm |
|---|---|
| Type | A list of two floats |
| Value | 0 < search_range_start < search_range_end |
| Remarks | Sets the quantization factor search range. |
| Recommended Value | [0.7, 1.3] |
| Required/Optional | Optional |

| Description | Quantization factor search step of the IFMR algorithm |
|---|---|
| Type | Float |
| Value | (0, (search_range_end – search_range_start)] |
| Remarks | Sets the step by which the upper bound for clipping-based quantization is varied during the search. A smaller value means a finer quantization factor search. The number of search iterations is calculated as: search_iteration = (search_range_end – search_range_start)/search_step. A larger number of search iterations increases the search time and can make the process appear to hang. |
| Recommended Value | 0.01 |
| Required/Optional | Optional |
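
To estimate the search cost before changing the range or step, the iteration formula and the value constraints from the two tables above can be evaluated directly. The function name search_iterations is hypothetical, and rounding the quotient to the nearest integer is an assumption made to absorb floating-point error.

```python
def search_iterations(search_range, search_step):
    """Number of IFMR search iterations for a given range and step.

    Implements search_iteration = (search_range_end - search_range_start)
    / search_step after checking the documented value constraints.
    """
    start, end = search_range
    if not 0 < start < end:
        raise ValueError("expected 0 < search_range_start < search_range_end")
    if not 0 < search_step <= end - start:
        raise ValueError("search_step must be in (0, end - start]")
    return round((end - start) / search_step)


# Recommended values: range [0.7, 1.3] with step 0.01.
recommended = search_iterations([0.7, 1.3], 0.01)
```

Halving the step doubles the number of iterations, so the search time grows accordingly.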

| Description | Number of bins (the minimum unit in a histogram) of the HFMG algorithm |
|---|---|
| Type | Unsigned integer |
| Value | {1024, 2048, 4096, 8192} |
| Remarks | A larger value of num_of_bins makes the histogram fit the data distribution better and improves the quantization result, but it also increases the PTQ time. |
| Recommended Value | 4096 |
| Required/Optional | Optional for quantization using the HFMG algorithm. |

| Description | Weight quantization algorithm |
|---|---|
| Type | String |
| Value | arq_quantize |
| Remarks | arq_quantize: ARQ algorithm |
| Recommended Value | - |
| Required/Optional | Optional |

| Description | Whether to use a different quantization factor for each channel in the ARQ algorithm |
|---|---|
| Type | Boolean |
| Value | true or false |
| Remarks | - |
| Recommended Value | true |
| Required/Optional | Optional |
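
The difference between per-channel and shared quantization factors can be illustrated with plain symmetric max-abs scaling. This is a generic sketch and not the ARQ algorithm itself. The function name weight_scales, its parameters, and the sample weights are made up for illustration.

```python
import math


def weight_scales(weights, channel_wise=True, num_bits=8):
    """Return one scale factor per output channel.

    weights is a list of channels, each a list of floats. With
    channel_wise=True every channel gets its own scale. Otherwise a single
    scale derived from the whole tensor is repeated for every channel.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    if channel_wise:
        return [max(abs(w) for w in channel) / qmax for channel in weights]
    tensor_max = max(abs(w) for channel in weights for w in channel)
    return [tensor_max / qmax] * len(weights)


# Two channels with very different value ranges.
sample = [[127.0, -1.0], [2.54, 1.0]]
per_channel = weight_scales(sample, channel_wise=True)
per_tensor = weight_scales(sample, channel_wise=False)
```

With a shared factor, the second channel is quantized with a scale sized for the first channel's much larger values, so it loses resolution. Per-channel factors avoid this by giving each channel its own scale.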