Basic Concepts
This section describes the concepts used in model compression and the principles of different compression methods.
Quantization
The quantization process reduces the precision of model weights and activations to make the model lighter, improving compute efficiency and lowering transfer latency.
AMCT separates quantization from model conversion: it independently quantizes the quantizable operators in a model and outputs the quantized model. The resulting fake-quantized model can run on the CPU or GPU for accuracy simulation, while the deployable model runs on the Ascend AI Processor for improved inference performance.
Currently, this tool can quantize only network models of the float32 or float16 data type (the Caffe framework does not support the float16 data type). The following figure shows the working principle of quantizing a network model to the INT8 data type.

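As a rough illustration of the arithmetic behind INT8 quantization, the following is a minimal sketch of symmetric linear quantization in PyTorch. It is not AMCT's implementation, and the tensor and function names are made up for illustration.

```python
import torch

def quantize_int8(x: torch.Tensor):
    # Symmetric linear quantization: map the largest magnitude onto [-127, 127].
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Map int8 values back to float32 (this is what a fake-quantized model simulates).
    return q.to(torch.float32) * scale

w = torch.randn(64, 64, 3, 3)          # a stand-in float32 weight tensor
w_q, s = quantize_int8(w)
w_hat = dequantize(w_q, s)
print("max abs quantization error:", (w - w_hat).abs().max().item())
```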
- Post-training quantization
PTQ quantizes the weights of an already trained model from float32 to int8 and also quantizes the activations by using a small calibration dataset, minimizing the precision loss caused by quantization. PTQ is easy to use and requires only a small calibration dataset, making it suitable for scenarios where ease of use and resource saving take priority.
In general, the weights are fixed once training is complete, so the quantization parameters of a weight can be calculated offline from the known weight values. Activations, in contrast, depend on the runtime inputs, so their value ranges are hard to determine directly; a small, representative dataset is required to simulate the distribution of online activations. To obtain the quantization parameters of an activation, run forward passes with this dataset to collect the intermediate floating-point results, and then calculate the activation's quantization parameters offline from those results. The following figure shows the working principle; a minimal calibration sketch follows the figure.
Figure 2 PTQ principles
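The sketch below illustrates the calibration idea behind PTQ: weight scales are computed offline from the fixed weights, while activation scales are derived from forward passes over a small calibration set. The model, data, and max-calibration rule are illustrative assumptions, not AMCT's actual algorithm.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).eval()

# 1. Weight quantization parameters: computed offline from the fixed weights.
w = model[0].weight.data
w_scale = w.abs().max() / 127.0

# 2. Activation quantization parameters: run forward passes on a small,
#    representative calibration dataset and record the observed ranges.
calib_batches = [torch.randn(8, 3, 32, 32) for _ in range(4)]  # stand-in calibration set
act_max = 0.0
with torch.no_grad():
    for x in calib_batches:
        act_max = max(act_max, model(x).abs().max().item())
act_scale = act_max / 127.0

print(f"weight scale = {w_scale:.6f}, activation scale = {act_scale:.6f}")
```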
- Quantization aware training
Quantization aware training (QAT) includes quantization in the retraining process to compensate for the effect of quantization on model accuracy. QAT generally requires a complete training dataset. It emulates the errors introduced by quantization in forward inference by inserting fake quantization (quantizing floating-point numbers to fixed-point ones, then dequantizing them back to floating point) into the training process, and updates the weights during training against these errors, reducing the accuracy loss caused by quantization.
QAT is often better than PTQ for model accuracy, but it is more time-consuming and requires a complete training dataset.
The following figure shows the working principle; a minimal sketch of fake quantization follows the figure.
Figure 3 QAT principles
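The following is a minimal sketch of the fake-quantization step used during QAT. It is illustrative only: a straight-through estimator passes gradients through the rounding operation so the floating-point weights can still be updated by training.

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    # Quantize to integers and immediately dequantize, so the forward pass sees
    # the quantization error while the tensor stays in floating point.
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max() / qmax
    q = torch.clamp(torch.round(x / scale), -qmax, qmax)
    x_hat = q * scale
    # Straight-through estimator: treat the gradient of round() as identity.
    return x + (x_hat - x).detach()

w = torch.randn(16, 3, 3, 3, requires_grad=True)
loss = fake_quantize(w).pow(2).sum()   # toy loss on the fake-quantized weight
loss.backward()                        # gradients flow back to the float weight
print(w.grad.shape)
```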
Tensor Decomposition
Image analysis computations, especially in computer vision (CV), involve a large volume of convolution operations. Tensor decomposition converts a convolution kernel into two consecutively multiplied smaller convolution kernels (low-rank tensors), resulting in smaller storage and computational demands and lower inference overhead.
Take a 64 x 64 x 3 x 3 convolution as an example. Decomposing it into a cascade of a 32 x 64 x 3 x 1 convolution and a 64 x 32 x 1 x 3 convolution saves 66.7% of the computation, that is, 1 – (32 x 64 x 3 x 1 + 64 x 32 x 1 x 3)/(64 x 64 x 3 x 3), and yields cost-effective performance gains with negligible accuracy loss. The following figure shows the working principle of tensor decomposition (using the PyTorch framework as an example).

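The sketch below mirrors the decomposition structure described above in PyTorch. In practice the weights of the two smaller convolutions would be obtained by low-rank decomposition of the original kernel; that step is omitted here and only the structure and computation savings are shown.

```python
import torch
import torch.nn as nn

# Original convolution: 64 output channels, 64 input channels, 3 x 3 kernel.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)

# Decomposed cascade: a 32 x 64 x 3 x 1 convolution followed by a 64 x 32 x 1 x 3 convolution.
decomposed = nn.Sequential(
    nn.Conv2d(64, 32, kernel_size=(3, 1), padding=(1, 0), bias=False),
    nn.Conv2d(32, 64, kernel_size=(1, 3), padding=(0, 1), bias=False),
)

def num_macs(out_c, in_c, kh, kw):
    # Multiply-accumulate count per output pixel for one convolution.
    return out_c * in_c * kh * kw

orig = num_macs(64, 64, 3, 3)
dec = num_macs(32, 64, 3, 1) + num_macs(64, 32, 1, 3)
print(f"computation saved: {1 - dec / orig:.1%}")   # -> 66.7%

x = torch.randn(1, 64, 56, 56)
assert conv(x).shape == decomposed(x).shape   # same output shape
```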
Model Optimization for Deployment
The optimization mainly involves operator fusion, which merges multiple operators in a model into a single operator through mathematical equivalence, reducing the amount of computation in forward passes. For example, a convolutional layer and a batch normalization (BN) layer can be fused into a new convolutional layer.
The following figure shows the working principle (the PyTorch framework is used as an example).

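The following is a minimal sketch of Conv + BN fusion in PyTorch. The folding follows the standard mathematical equivalence; the helper name and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # Fold the BN affine transform into the convolution's weight and bias.
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, bias=True)
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std                               # per-output-channel factor
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias
    return fused

conv = nn.Conv2d(3, 16, 3, padding=1)
bn = nn.BatchNorm2d(16).eval()
fused = fuse_conv_bn(conv, bn)

x = torch.randn(1, 3, 32, 32)
with torch.no_grad():
    print(torch.allclose(bn(conv(x)), fused(x), atol=1e-5))   # True: outputs match
```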
Sparsity
Sparsity refers to implementing weight sparsification for certain operators in a model via structured pruning, to generate a less computationally expensive model with fewer parameters. Currently, AMCT has two sparsity modes: filter-level sparsity and 2:4 structured sparsity. Only one sparsity mode can be enabled at a time. That is, for compressible operators at the same layer, filter-level sparsity and 2:4 structured sparsity cannot be configured at the same time.
Compared with 2:4 structured sparsity, filter-level sparsity has a coarser granularity: it has a greater impact on model accuracy but can deliver larger performance gains. Select a sparsity mode based on your requirements.
- Filter-Level Sparsity
Filter-level (channel) sparsity is based on retraining. It reduces the number of network channels to cut the number of model parameters while maintaining network functionality, thereby reducing the computation of the entire network. Channel pruning is performed based on the importance of channels: channels with low importance are pruned. However, direct channel pruning has a large impact on network accuracy, so the pruned model must be retrained to preserve service accuracy. Filter-level sparsity is a two-step process: first, filter selection, where a proper filter set is chosen to retain the most information; second, reconstruction of the next layer's output using the selected filters. The following figure shows the principle of filter-level sparsity; a minimal sketch of filter selection follows the figure.
Figure 6 Filter-level sparsity
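The sketch below illustrates the filter-selection idea only: filters are ranked by an importance score (the L1 norm of each filter here, as an assumed criterion, not AMCT's selection algorithm) and the least important ones are dropped; the retraining and reconstruction steps are omitted.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)   # layer to sparsify
sparsity = 0.5                                         # prune half of the filters

# 1. Filter selection: score each output filter (L1 norm here) and keep the
#    most informative ones.
importance = conv.weight.detach().abs().sum(dim=(1, 2, 3))   # one score per filter
keep = importance.argsort(descending=True)[: int(128 * (1 - sparsity))]

# 2. Build the pruned layer; the next layer's output must then be reconstructed
#    and the model retrained to recover accuracy.
pruned = nn.Conv2d(64, len(keep), kernel_size=3, padding=1)
pruned.weight.data = conv.weight.data[keep]
pruned.bias.data = conv.bias.data[keep]
print(conv.weight.shape, "->", pruned.weight.shape)
```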
- 2:4 Structured Sparsity
Due to hardware restrictions, the Atlas 200/300/500 Inference Product and Atlas Training Series Product do not support the 2:4 structured sparsity feature; enabling it on these products yields little performance benefit.
2:4 structured sparsity is based on retraining. In every four consecutive weights, the two weights with higher importance are retained and the other two are set to 0. Because the sparsity granularity is small, 2:4 structured sparsity retains much of the important information and has the accuracy advantage of fine-grained sparsity. In addition, it reduces the amount of computation on specially designed hardware and therefore also has the performance advantage of structured sparsity. Unlike channel sparsity, 2:4 sparsity does not change the shape of the weight, so the upstream and downstream operators are not affected.
As shown in the following figure, every four adjacent elements in the CIN dimension form a group, and the two elements with the largest absolute values in each group are retained. If CIN is not a multiple of 4, it is zero-padded to a multiple of 4. A minimal sketch of the selection rule follows the figure.
Figure 7 2:4 structured sparsity
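The sketch below shows only the 2:4 selection rule: keep the two largest-magnitude values in every group of four along the input-channel dimension. CIN zero-padding and retraining are omitted, and the function name is illustrative.

```python
import torch

def sparsify_2_4(w: torch.Tensor) -> torch.Tensor:
    # w has shape [COUT, CIN]; CIN is assumed to be a multiple of 4 here
    # (otherwise it would first be zero-padded, as described above).
    cout, cin = w.shape
    groups = w.reshape(cout, cin // 4, 4)
    # Indices of the two largest-magnitude elements in each group of four.
    top2 = groups.abs().topk(k=2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, top2, 1.0)
    return (groups * mask).reshape(cout, cin)

w = torch.randn(8, 16)
w_sparse = sparsify_2_4(w)
print((w_sparse != 0).float().mean().item())   # -> 0.5: two of every four values kept
```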
Compression Combination
Compression combination, as its name suggests, applies sparsity and quantization in combination. It goes through two phases: sparsity based on the configuration file, followed by quantization. In the first phase, sparsity operators are inserted according to the corresponding algorithm. In the second phase, quantization layers for activations and weights, together with a searchN layer, are inserted into the sparsified model to generate a compressed model with a higher performance benefit. You then retrain this model and save it as a model that serves both accuracy simulation and inference deployment.
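The sketch below only illustrates the order of the two phases on a single weight tensor (sparsify first, then fake-quantize the sparsified result); the actual AMCT flow inserts dedicated sparsity, quantization, and searchN layers based on the configuration file, which is not reproduced here.

```python
import torch

def sparsify_2_4(w):   # phase 1: 2:4 structured sparsity (see the earlier sketch)
    cout, cin = w.shape
    g = w.reshape(cout, cin // 4, 4)
    mask = torch.zeros_like(g).scatter_(-1, g.abs().topk(2, dim=-1).indices, 1.0)
    return (g * mask).reshape(cout, cin)

def fake_quantize(x, num_bits=8):   # phase 2: weight/activation quantization
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max() / qmax
    return torch.clamp(torch.round(x / scale), -qmax, qmax) * scale

w = torch.randn(8, 16)
w_compressed = fake_quantize(sparsify_2_4(w))   # sparsity first, then quantization
print((w_compressed != 0).float().mean().item())   # at most 0.5 after both phases
```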
Layer-wise Distillation
Distillation is a model compression method that uses supervision information from the source model to train the quantized model, achieving better quantization accuracy.
This method uses the pre-trained source model as the teacher network and performs supervised training on the student (quantized) network. By computing a loss between the outputs predicted by the teacher and student networks and updating the gradients accordingly, a quantized model with higher accuracy is obtained. A minimal sketch is provided after the following comparison.
- Compared with post-training quantization, knowledge distillation can achieve better accuracy.
- Compared with quantization aware training, knowledge distillation does not require labeled datasets and can obtain good quantization results in a shorter time.
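The sketch below shows the layer-wise distillation idea on a single hypothetical layer: the student's fake-quantized output is driven toward the teacher's floating-point output with an MSE loss, so no labels are needed. The layers, loss choice, and training loop are illustrative assumptions, not AMCT's implementation.

```python
import torch
import torch.nn as nn

teacher = nn.Conv2d(3, 16, 3, padding=1).eval()       # pre-trained source (float) layer
student = nn.Conv2d(3, 16, 3, padding=1)               # layer being quantized
student.load_state_dict(teacher.state_dict())
optimizer = torch.optim.SGD(student.parameters(), lr=1e-3)

def fake_quantize(x, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max() / qmax
    x_hat = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    return x + (x_hat - x).detach()                     # straight-through estimator

for _ in range(10):                                     # unlabeled calibration batches
    x = torch.randn(8, 3, 32, 32)
    with torch.no_grad():
        target = teacher(x)                             # teacher supervision signal
    out = nn.functional.conv2d(x, fake_quantize(student.weight),
                               student.bias, padding=1)
    loss = nn.functional.mse_loss(out, target)          # layer-wise distillation loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```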
