Learning Wizard

This section describes the concept and advantages of the Ascend Model Compression Toolkit (AMCT), the intended audience of this document, and how AMCT differs across the supported frameworks. You can select a framework for model compression based on your actual requirements.

AMCT is a deep learning model compression toolkit designed for Ascend AI Processors. It reduces model size by means of various model compression techniques, including quantization and tensor decomposition. The resultant model supports low-bit computation on the Ascend AI Processor, achieving higher compute efficiency and improved performance.
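As a rough illustration of what low-bit quantization means, the following sketch maps floating-point values to INT8 with a single symmetric per-tensor scale and then restores them. It is not AMCT code: the helper names `quantize_int8` and `dequantize` are hypothetical, and the scheme shown is only one common quantization scheme, assumed here for illustration.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization of a float array to INT8 (illustrative only)."""
    max_abs = float(np.max(np.abs(x)))
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 values back to approximate float values."""
    return q.astype(np.float32) * scale

weights = np.array([0.5, -1.27, 0.02, 1.0], dtype=np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per element is bounded by half a quantization step.
max_error = float(np.max(np.abs(weights - restored)))
print(q, scale, max_error)
```

Each value is stored in 8 bits instead of 32, at the cost of a rounding error of at most half a quantization step per element.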

Built on open-source frameworks, AMCT implements low-bit quantization of activations and weights, tensor decomposition, and model optimization (mainly operator fusion) in network models. This toolkit has the following advantages:

Ease of use: You only need to install the tool package based on the original framework environment.
Intuitive APIs: You can complete model compression by calling APIs from the inference script of the open-source framework. The resultant model can run on the CPU and GPU.
Hardware compatibility: You can convert the resultant model by using the Ascend Tensor Compiler (ATC), and then implement inference on the Ascend AI Processor.
Configurable quantization: For optimal results, you can modify the quantization configuration file and adjust the compression strategy.

AMCT uses quantization and tensor decomposition for compression. Model optimization (mainly operator fusion) can be performed during quantization.
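To give a feel for why tensor decomposition saves storage and computation, the sketch below factors a weight matrix into two low-rank factors with a truncated SVD. This is a simplified stand-in: it works on a 2-D matrix rather than a convolution kernel, the helper `low_rank_factors` is hypothetical, and the decomposition algorithm AMCT actually applies is not described in this section.

```python
import numpy as np

def low_rank_factors(weight, rank):
    """Approximate `weight` (m x n) as a @ b with a: m x rank, b: rank x n."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # fold the singular values into the left factor
    b = vt[:rank, :]
    return a, b

rng = np.random.default_rng(0)
# Build a 64 x 64 matrix whose rank is 8, so a rank-8 factorization is lossless.
weight = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))

a, b = low_rank_factors(weight, rank=8)
original_params = weight.size          # 64 * 64
factored_params = a.size + b.size      # 64 * 8 + 8 * 64
relative_error = np.linalg.norm(weight - a @ b) / np.linalg.norm(weight)
print(original_params, factored_params, relative_error)
```

Real weights are rarely exactly low-rank, so in practice the chosen rank trades accuracy against the reduction in parameters and multiply-accumulate operations.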

Advantages and Disadvantages of Compression Modes

**Table 1** Comparison of compression modes
Compression Mode	Type	Advantage	Disadvantage	Supported Framework	Applicable Products
Quantization	Post-training quantization (PTQ)	Model retraining is not required. Only a small amount of calibration data is required.	This mode depends on the distribution of the calibration dataset. If that distribution differs greatly from the distribution of the validation dataset, the quantization result is poor. Because the weights are not retrained, model accuracy can drop significantly after quantization.	Caffe; TensorFlow; PyTorch; ONNX; MindSpore; TensorFlow, Ascend	Atlas inference series products; Atlas training products; Atlas 200I/500 A2 inference product; Atlas A2 training products/Atlas A2 inference products; Atlas A3 training series products/Atlas A3 inference series products
Quantization	Quantization aware training (QAT)	Because the model is retrained, the accuracy drop is small.	Quantization during training is time-consuming. More data is required, usually a complete training dataset.	Caffe; TensorFlow; PyTorch; MindSpore	Atlas inference series products; Atlas training products; Atlas 200I/500 A2 inference product; Atlas A2 training products/Atlas A2 inference products; Atlas A3 training series products/Atlas A3 inference series products
Sparsity	Filter-level sparsity	The sparsity granularity is coarser, so more performance benefits can be obtained. The sparsity ratio is configurable.	Changing the shape of the weight affects the operators of the upper or lower layer. The retraining required after sparsity is time-consuming. The model accuracy is greatly affected.	TensorFlow; PyTorch	Atlas inference series products; Atlas training products; Atlas 200I/500 A2 inference product; Atlas A2 training products/Atlas A2 inference products; Atlas A3 training series products/Atlas A3 inference series products
Sparsity	2:4 structured sparsity	The finer sparsity granularity retains more of the important information, resulting in an accuracy advantage.	This feature is supported only by some chips due to hardware restrictions. The retraining required after sparsity is time-consuming. The sparsity ratio is fixed at 50%.	TensorFlow; PyTorch	Atlas 200I/500 A2 inference product; Atlas A2 training products/Atlas A2 inference products; Atlas A3 training series products/Atlas A3 inference series products This feature is not supported.
Compression combination	-	The model can be quantized and sparsified at the same time to obtain a higher compression ratio.	This feature involves retraining, which is time-consuming. Because quantization and sparsity are applied at the same time, the model accuracy is significantly affected.	TensorFlow; PyTorch	Atlas inference series products; Atlas training products; Atlas 200I/500 A2 inference product; Atlas A2 training products/Atlas A2 inference products; Atlas A3 training series products/Atlas A3 inference series products
Tensor decomposition	-	A convolution kernel is decomposed into low-rank tensors to reduce the storage space and computation workload.	-	Caffe; TensorFlow; PyTorch	Atlas inference series products; Atlas training products; Atlas 200I/500 A2 inference product; Atlas A2 training products/Atlas A2 inference products; Atlas A3 training series products/Atlas A3 inference series products
Activation quantization balance preprocessing	-	The impact of activation outliers on the accuracy of the quantized model is reduced.	-	TensorFlow; PyTorch; ONNX	Atlas inference series products; Atlas training products; Atlas 200I/500 A2 inference product; Atlas A2 training products/Atlas A2 inference products; Atlas A3 training series products/Atlas A3 inference series products
Layer-wise distillation	-	Weights can be fine-tuned based on quantization to maintain high accuracy and shorten the duration of weight training.	-	PyTorch	Atlas inference series products; Atlas training products; Atlas 200I/500 A2 inference product; Atlas A2 training products/Atlas A2 inference products; Atlas A3 training series products/Atlas A3 inference series products
KV cache quantization	-	No model retraining is required. Only a small amount of calibration data is required.	Only the node output is quantized, which does not improve the model running efficiency.	PyTorch	Atlas inference series products; Atlas A2 training products/Atlas A2 inference products; Atlas A3 training series products/Atlas A3 inference series products
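To make the 2:4 structured sparsity row of Table 1 concrete, the sketch below keeps the two largest-magnitude values in every group of four consecutive weights and zeroes the other two, which is why the sparsity ratio is fixed at 50%. The helper `prune_2_of_4` and the magnitude-based selection rule are assumptions for illustration, not the AMCT implementation, and the retraining step that normally follows is omitted.

```python
import numpy as np

def prune_2_of_4(weight):
    """Zero the two smallest-magnitude values in each group of four (illustrative only)."""
    if weight.size % 4 != 0:
        raise ValueError("number of weights must be a multiple of 4")
    groups = weight.reshape(-1, 4)
    # Column indices of the two smallest magnitudes in every group.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return np.where(mask, groups, 0.0).reshape(weight.shape)

weight = np.array([[0.9, -0.1, 0.05, -0.7],
                   [0.2, 0.3, -0.4, 0.01]])
pruned = prune_2_of_4(weight)
print(pruned)
```

Exactly half of the weights survive regardless of their values, in contrast to filter-level sparsity, where whole filters are removed and the ratio is configurable.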

Differences of AMCT Frameworks

Document	Description
AMCT (PyTorch)	To compress models under the PyTorch framework, you need to set up the PyTorch environment and then install AMCT.
AMCT (ONNX)	To compress ONNX models, you need to set up the ONNX Runtime environment and then install AMCT.
AMCT (TensorFlow)	To compress models under the TensorFlow framework, you need to set up the TensorFlow environment and then install AMCT.
AMCT (Caffe)	To compress models under the Caffe framework, you need to set up the Caffe environment and then install AMCT. The following products do not support the Caffe framework: Atlas A2 training products/Atlas A2 inference products; Atlas A3 training series products/Atlas A3 inference series products.
AMCT (TensorFlow, Ascend)	To compress models in this environment, you need to set up a TensorFlow environment that uses the NPU-powered online inference environment, and then install AMCT.

Intended Audience

This document provides guidance for developers to use AMCT to compress models. By reading this document, you can:

Understand different compression methods of AMCT.
Compress different models based on the methods provided in the document.
Master quantization, a common compression method.

To get the most out of this document, you should be familiar with basic Linux commands, be able to develop programs in Python, and have a basic understanding of machine learning and deep learning.