Basic Concepts
This section describes the concepts used in model compression and the principles of different compression methods.
Quantization
Quantization converts the weights and activations of a model to low-bit representations. This makes the generated network model lighter, which saves storage space, reduces transmission delay, and improves compute efficiency and performance.
AMCT separates quantization and model conversion to independently quantize quantizable operators in a model and output the quantized model. The resultant fake-quantized model can run on the CPU or GPU to complete accuracy simulation, while the quantized deployable model can run on the Ascend AI Processor to improve inference performance.
Currently, this tool can quantize only network models of the float32 or float16 data type (the Caffe and MindSpore frameworks do not support the float16 data type). The following figure shows the working principles of quantizing a network model to the INT8 data type.

- Post-training quantization
Post-training quantization (PTQ) is performed after model training is complete. The weights of the trained model are quantized from floating point to low-bit integers, and the activations are quantized by using a small calibration dataset to minimize the accuracy drop during quantization. PTQ is easy to use and requires only a small calibration dataset, so it suits scenarios where ease of use and resource saving take priority.
In general, the weights are fixed once model training is complete, so the quantization parameters of a weight can be calculated offline from the fixed weight values. Activations, in contrast, are produced online, so their exact value ranges are hard to obtain, and a small representative dataset is required to simulate their distribution. To obtain the quantization parameters of an activation, run forward passes on this dataset to collect the intermediate floating-point results, and then calculate the quantization parameters offline from those results. The following figure shows the working principles.
Figure 2 PTQ principles
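The offline calculation described above can be sketched as follows. This is a minimal illustration that assumes a simple min/max scheme with a symmetric scale for weights and a scale plus offset for activations. The function names are hypothetical and are not part of the AMCT API, and the calibration algorithms that AMCT actually uses may differ.

```python
def int_range(num_bits=8):
    """Signed integer range, for example (-128, 127) for INT8."""
    return -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1


def weight_scale(weights, num_bits=8):
    """Symmetric scale, computed offline from the fixed trained weights."""
    _, qmax = int_range(num_bits)
    return max(abs(w) for w in weights) / qmax or 1.0


def activation_params(calibration_outputs, num_bits=8):
    """Scale and offset from activations recorded during calibration passes."""
    qmin, qmax = int_range(num_bits)
    values = [v for batch in calibration_outputs for v in batch]
    # Extend the range to include 0.0 so that zero stays exactly representable.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    offset = qmin - round(lo / scale)
    return scale, offset


def quantize(x, scale, offset=0, num_bits=8):
    """Map a float to a clamped low-bit integer."""
    qmin, qmax = int_range(num_bits)
    return max(qmin, min(qmax, round(x / scale) + offset))


def dequantize(q, scale, offset=0):
    """Map a quantized integer back to an approximate float."""
    return (q - offset) * scale


# Weights are known after training, so their scale needs no data.
weights = [0.42, -1.27, 0.08, 0.90]
w_scale = weight_scale(weights)
q_weights = [quantize(w, w_scale) for w in weights]

# Activations (here: made-up post-ReLU values) come from a small calibration set.
calibration_outputs = [[0.0, 1.3, 2.55], [0.4, 0.0, 1.9]]
a_scale, a_offset = activation_params(calibration_outputs)
q_act = quantize(1.3, a_scale, a_offset)
restored = dequantize(q_act, a_scale, a_offset)
```

The round trip through `quantize` and `dequantize` loses at most half a quantization step for values inside the calibrated range, which is why the calibration data should be representative of the real inputs.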
- Quantization-aware training
Quantization-aware training (QAT) includes quantization in the retraining process to counter the effects of quantization on model accuracy. QAT generally requires a complete training dataset. It introduces fake quantization (quantizing floating-point numbers to fixed-point ones, and then dequantizing them back to floating-point ones) into the training process to emulate quantization errors in forward passes, and updates the weights during training with these errors taken into account, which reduces the accuracy drop caused by quantization.
QAT often achieves better model accuracy than PTQ, but it is more time-consuming and requires a complete training dataset.
The following figure shows the working principles.
Figure 3 QAT principles
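The fake-quantization step can be illustrated with a toy example. The sketch below trains a single weight whose forward pass uses the fake-quantized value. It assumes a fixed scale and a straight-through gradient (the gradient of the fake-quantized weight is applied directly to the floating-point weight), which is a common convention but is not confirmed here as the method AMCT uses. All names and values are hypothetical.

```python
def fake_quantize(x, scale, num_bits=8):
    """Quantize to a clamped integer, then dequantize back to a float."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale


scale = 0.05                      # assumed fixed quantization step
# Labeled training samples (x, y) generated by an ideal weight of 0.37.
samples = [(x, 0.37 * x) for x in (1.0, 2.0, -1.5, 0.5)]


def mean_loss(w):
    """Mean squared error with the quantization error present in the forward pass."""
    w_q = fake_quantize(w, scale)
    return sum((w_q * x - y) ** 2 for x, y in samples) / len(samples)


weight = 0.9                      # floating-point weight kept during training
initial_loss = mean_loss(weight)

for _ in range(200):
    w_q = fake_quantize(weight, scale)
    # Gradient of the loss with respect to the fake-quantized weight.
    grad_w_q = sum(2 * (w_q * x - y) * x for x, y in samples) / len(samples)
    # Straight-through assumption: treat d(w_q)/d(weight) as 1.
    weight -= 0.05 * grad_w_q

final_loss = mean_loss(weight)
trained_w_q = fake_quantize(weight, scale)
```

Because the loss is always computed with the fake-quantized weight, training settles on a grid point next to the ideal value rather than on a floating-point value that would be perturbed by quantization afterwards.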
Tensor Decomposition
Deep learning computation, especially in the context of computer vision (CV), involves a large volume of convolution operations. Tensor decomposition converts a convolution kernel into two smaller convolution kernels (low-rank tensors) that are applied in sequence, resulting in smaller storage and computational demands and lower inference overhead.
Take a 64 × 64 × 3 × 3 convolution as an example. Decomposing it into a cascade of a 32 × 64 × 3 × 1 convolution and a 64 × 32 × 1 × 3 convolution saves 66.7% of the computation workload, that is, 1 – (32 × 64 × 3 × 1 + 64 × 32 × 1 × 3)/(64 × 64 × 3 × 3), and yields a cost-effective performance benefit with an inconsequential accuracy drop. The following figure shows the working principles of tensor decomposition in the PyTorch framework.

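The saving in the example above can be checked with a few lines of arithmetic. The sketch assumes that the computation workload is proportional to the number of kernel weights, which holds when the output feature map size is unchanged. The helper name is hypothetical.

```python
def kernel_weight_count(c_out, c_in, k_h, k_w):
    """Number of weights in a convolution kernel of shape c_out x c_in x k_h x k_w."""
    return c_out * c_in * k_h * k_w


original = kernel_weight_count(64, 64, 3, 3)    # single 3 x 3 convolution
first = kernel_weight_count(32, 64, 3, 1)       # 3 x 1 convolution, 64 -> 32 channels
second = kernel_weight_count(64, 32, 1, 3)      # 1 x 3 convolution, 32 -> 64 channels

saving = 1 - (first + second) / original
print(f"{original} -> {first + second} weights, saving {saving:.1%}")
```

The intermediate channel count (32 here) controls the trade-off: a smaller value saves more computation but keeps less of the original kernel's information.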
Model Optimization for Deployment
The optimization mainly involves operator fusion, which fuses multiple operators in a model into a single operator through mathematical equivalence, reducing the amount of computation in forward passes. For example, a convolutional layer and a batch normalization (BN) layer can be fused into a new convolutional layer.
The following figure shows the working principles in the PyTorch framework.

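The convolution and BN fusion mentioned above relies on the fact that inference-time BN is a per-channel scale and shift, which can be folded into the convolution weights and bias. The sketch below shows this for a 1 × 1 convolution evaluated at a single pixel. It is a minimal illustration with hypothetical function names and made-up values, not the AMCT implementation.

```python
import math


def conv1x1(x, weight, bias):
    """1 x 1 convolution at one pixel: x is [c_in], weight is [c_out][c_in]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time BN applied per output channel."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]


def fuse_conv_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the BN scale and shift into new convolution weights and bias."""
    fused_weight, fused_bias = [], []
    for row, b, g, bt, m, s in zip(weight, bias, gamma, beta, mean, var):
        k = g / math.sqrt(s + eps)            # per-channel BN scale
        fused_weight.append([w * k for w in row])
        fused_bias.append((b - m) * k + bt)
    return fused_weight, fused_bias


weight = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.6]]   # 2 output channels, 3 input channels
bias = [0.1, -0.3]
gamma, beta = [1.2, 0.7], [0.05, -0.1]
mean, var = [0.4, -0.2], [0.9, 1.6]
x = [1.0, -2.0, 0.5]

reference = batch_norm(conv1x1(x, weight, bias), gamma, beta, mean, var)
fused_weight, fused_bias = fuse_conv_bn(weight, bias, gamma, beta, mean, var)
fused = conv1x1(x, fused_weight, fused_bias)    # one operator instead of two
```

The fused layer produces the same output as the two-layer sequence up to floating-point rounding, while the BN computation disappears from the forward pass.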
Sparsity
Sparsity refers to implementing weight sparsification for certain operators in a model via structured pruning, to generate a less computationally expensive model with fewer parameters. Currently, AMCT has two sparsity modes: filter-level sparsity and 2:4 structured sparsity. For compressible operators at the same layer, only one sparsity mode can be enabled at a time.
Compared with 2:4 structured sparsity, filter-level sparsity is coarser-grained and has a greater impact on model accuracy, but can yield more performance benefits. Select a sparsity mode based on your needs.
- Filter-level sparsity
Filter-level sparsity reduces the number of network channels (filters) based on retraining, to achieve fewer model parameters and a smaller computational demand while keeping network functionality intact. Filter-level sparsity prunes less important channels. However, channel pruning may cause significant degradation in model accuracy, which makes it necessary to retrain the sparsified model. Filter-level sparsity is a two-step process. The first step is filter selection, where a proper filter set is selected to retain the most information. The second step is the reconstruction of the next-layer output using the selected filters. The following figure shows the principles of filter-level sparsity.
Figure 6 Filter-level sparsity
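The filter selection step and its effect on the next layer can be sketched as follows. The example ranks filters by the L1 norm of their weights, which is an assumed selection criterion for illustration only; the criterion that AMCT applies is not stated here. The retraining and output reconstruction steps are omitted, and all names are hypothetical.

```python
def l1_norm(filt):
    """Sum of absolute weights, used here as an assumed importance score."""
    return sum(abs(w) for w in filt)


def select_filters(weight, keep):
    """Indices of the `keep` filters with the largest L1 norm, in original order."""
    ranked = sorted(range(len(weight)), key=lambda i: l1_norm(weight[i]), reverse=True)
    return sorted(ranked[:keep])


def prune_filters(weight, next_weight, keep):
    """Drop filters in this layer and the matching input channels of the next layer."""
    kept = select_filters(weight, keep)
    pruned = [weight[i] for i in kept]
    pruned_next = [[row[i] for i in kept] for row in next_weight]
    return kept, pruned, pruned_next


# Layer 1 has 4 filters (rows); each filter is flattened to a list of weights.
layer1 = [[0.9, -0.8], [0.01, 0.02], [-0.5, 0.4], [0.03, -0.01]]
# Layer 2 consumes the 4 output channels of layer 1 (one column per channel).
layer2 = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

kept, pruned1, pruned2 = prune_filters(layer1, layer2, keep=2)
```

Removing a filter shrinks both the current layer's output channels and the next layer's input channels, which is why this mode changes weight shapes across adjacent operators.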
- 2:4 structured sparsity
Due to hardware restrictions, the Atlas inference series products and Atlas training products do not support the 2:4 structured sparsity feature.
2:4 structured sparsity, based on retraining, retains the two weights with the largest absolute values among every four consecutive weights, and sets the remaining weights to 0. This sparsity mode is finer-grained and can therefore retain more important information. It also reduces the computing workload on specially designed hardware, keeping the performance advantage of structured sparsity. Unlike filter-level sparsity, 2:4 structured sparsity does not change the shape of the weight, and therefore does not affect the operators of the upper or lower layer.
The following figure shows the principles of this sparsity mode. Four adjacent elements in the cin dimension form a group. The two elements with the largest absolute values in each group are retained. If cin is not a multiple of 4, zeros are padded until it becomes a multiple of 4.
Figure 7 2:4 structured sparsity
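The grouping rule described above can be sketched for one row of weights along the cin dimension. The function name is hypothetical, and trimming the padding afterwards so that the original length is kept is an assumption of this sketch.

```python
def sparsify_2_4(row):
    """Keep the 2 largest-magnitude weights in every group of 4 and zero the rest."""
    padded = list(row) + [0.0] * (-len(row) % 4)   # pad cin to a multiple of 4
    result = []
    for start in range(0, len(padded), 4):
        group = padded[start:start + 4]
        # Indices of the two elements with the largest absolute values.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        result.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return result[:len(row)]                       # assumed: drop the padding again


row = [0.5, -0.1, 0.3, -0.9, 0.2, 0.7]             # cin = 6, not a multiple of 4
sparse_row = sparsify_2_4(row)
```

The output has the same length as the input, which matches the statement that this mode does not change the shape of the weight.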
Compression Combination
Compression combination, as its name suggests, applies a combination of sparsity and quantization. It goes through two phases: sparsity based on the configuration file, and then quantization. In the first phase, the sparsity operator is inserted based on the corresponding algorithm. In the next phase, the quantization layers for activations and weights and a searchN layer are inserted into the sparsified model to generate a compressed model that achieves a higher performance benefit. You then retrain this model and save it as a model that can be used for both accuracy simulation and inference deployment.
Layer-wise Distillation
Distillation is a model compression method that uses the supervision information of the original model to train the quantized model, achieving higher quantization accuracy.
This method uses the pre-trained original model as the teacher network and performs supervised training on the student network. The loss between the predicted values output by the teacher network and those output by the student network is computed, and the resulting gradients are used to update the student network, yielding a quantized model with higher accuracy.
- Compared with PTQ, knowledge distillation can achieve better accuracy.
- Compared with QAT, knowledge distillation does not require labeled datasets and can obtain good quantization results in a shorter time.
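The teacher and student setup can be illustrated for a single layer. In the sketch below, the teacher is a floating-point linear layer, and the student uses fixed 4-bit integer weights with one trainable scale. The student is trained only to match the teacher's outputs on unlabeled inputs. The trainable scale, the mean squared error loss, and all names and values are assumptions made for illustration; they are not taken from AMCT.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quantize_weights(weights, num_bits=4):
    """Integer weights plus an initial scale derived from the largest magnitude."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) for w in weights], scale


teacher_w = [0.82, -0.40, 0.13]                       # pre-trained float layer
unlabeled_inputs = [[1.0, 0.5, -1.0], [0.2, -0.7, 0.4],
                    [-0.9, 0.3, 0.8], [0.5, 1.0, 0.1]]

int_w, scale = quantize_weights(teacher_w)            # student: integers and a scale
teacher_out = [dot(teacher_w, x) for x in unlabeled_inputs]
student_acc = [dot(int_w, x) for x in unlabeled_inputs]   # integer-weight outputs


def distill_loss(s):
    """Mean squared error between student and teacher outputs; no labels involved."""
    pairs = zip(student_acc, teacher_out)
    return sum((s * z - t) ** 2 for z, t in pairs) / len(teacher_out)


initial_loss = distill_loss(scale)
for _ in range(100):
    pairs = zip(student_acc, teacher_out)
    grad = sum(2 * (scale * z - t) * z for z, t in pairs) / len(teacher_out)
    scale -= 0.01 * grad                              # gradient step on the student
final_loss = distill_loss(scale)
```

The improvement in this toy case is small, but it shows the mechanism: the teacher's outputs replace ground-truth labels as the training target for the quantized student.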
