Learning Wizard
This section describes the concept and advantages of the Ascend Model Compression Toolkit (AMCT), the intended audience of this document, and how AMCT differs across the supported frameworks. Select the framework for model compression that matches your requirements.
AMCT is a deep learning model compression toolkit designed for Ascend AI Processors. It slims models down by means of model compression techniques, including quantization and tensor decomposition. The resultant model supports low-bit computation on the Ascend AI Processor, achieving higher compute efficiency and improved performance.
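To make the idea of low-bit quantization concrete, the following is a minimal sketch of symmetric per-tensor INT8 quantization written in plain Python. It does not use AMCT APIs and is not AMCT's algorithm. The function names `quantize_int8` and `dequantize` and the sample weights are assumptions made for illustration only.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization (illustrative, not AMCT's API).

    Returns the integer codes and the scale that maps codes back to floats.
    """
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from INT8 codes."""
    return [c * scale for c in codes]


# Hypothetical FP32 weights of one layer.
weights = [0.02, -0.5, 0.31, 1.27, -1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# The rounding error of each value is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored as an 8-bit integer plus one shared scale, which is where the reduction in storage and the possibility of low-bit arithmetic come from. The price is the rounding error reported in `max_error`.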
AMCT, a toolkit based on the open framework, implements low-bit quantization of activations and weights, tensor decomposition, and model optimization (mainly operator fusion) in network models. This toolkit has the following advantages:
- Ease of use: You only need to install the tool package based on the original framework environment.
- Intuitive APIs: You can complete model compression using APIs based on the open framework inference script. The resultant model can run on the CPU and GPU.
- Hardware compatibility: You can convert the resultant model by using the Ascend Tensor Compiler (ATC), and then implement inference on the Ascend AI Processor.
- Configurable quantization: For optimal results, you can modify the quantization configuration file and adjust the compression strategy.
AMCT uses quantization and tensor decomposition for compression. Model optimization (mainly operator fusion) can be performed during quantization.
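The source does not state which operators AMCT fuses. As a hedged illustration of what operator fusion means in general, the sketch below folds a batch normalization step into the preceding linear operation, which is a commonly used fusion. All names (`fold_batchnorm`, `linear`, `batchnorm`) and all numeric values are assumptions for illustration and are not part of AMCT.

```python
import math


def linear(x, weight, bias):
    """One output channel of a linear (or 1x1 convolution) layer."""
    return sum(w * xi for w, xi in zip(weight, x)) + bias


def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization of a single value."""
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch normalization parameters into the weight and bias."""
    factor = gamma / math.sqrt(var + eps)
    fused_weight = [w * factor for w in weight]
    fused_bias = (bias - mean) * factor + beta
    return fused_weight, fused_bias


# Hypothetical input and parameters.
x = [1.0, -2.0, 0.5]
weight, bias = [0.3, 0.1, -0.7], 0.05
gamma, beta, mean, var = 1.5, -0.2, 0.1, 0.64

unfused = batchnorm(linear(x, weight, bias), gamma, beta, mean, var)
fused_w, fused_b = fold_batchnorm(weight, bias, gamma, beta, mean, var)
fused = linear(x, fused_w, fused_b)

# Both paths compute the same value; the fused path needs one operator.
print(unfused, fused)
```

After fusion, only one set of weights remains to be quantized, which is one reason fusion is typically done together with quantization.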
Advantages and Disadvantages of Compression Modes
| Compression Mode | | Advantage | Disadvantage | Supported Framework | Applicability |
|---|---|---|---|---|---|
| Quantization | Post-training quantization (PTQ) | | This mode depends on the distribution of the calibration dataset. If the distribution of the calibration dataset differs greatly from that of the validation dataset, the quantization result quality is poor. If the weight is not retrained, the model accuracy drops significantly after quantization. | | |
| Quantization | Quantization aware training (QAT) | | | | |
| Sparsity | Filter-level sparsity | | | | |
| Sparsity | 2:4 structured sparsity | Smaller sparse granularity retains more important information, resulting in a precision advantage. | | | This feature is not supported. |
| Compression combination | - | The model can be quantized and sparsified at the same time to obtain a higher compression ratio. | This feature involves retraining, which is time-consuming. It also performs quantization and sparsity at the same time, significantly affecting the model accuracy. | | |
| Tensor decomposition | - | A convolution kernel is decomposed into low-rank tensors to reduce the storage space and computation workload. | - | | |
| Activation quantization balance preprocessing | - | The impact of activation outliers on the accuracy of the quantized model is reduced. | - | | |
| Layer-wise distillation | - | Weights can be fine-tuned based on quantization to ensure high precision and shorten the duration of weight training. | - | PyTorch | |
| KV cache quantization | - | No model retraining is required. Only a small amount of calibration data is required. | Only the node output is quantized, which does not improve the model running efficiency. | PyTorch | |
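The 2:4 structured sparsity pattern listed in the table above can be sketched as follows: in every group of four consecutive weights, two are kept and two are set to zero. The sketch keeps the two values with the largest magnitude, which is an assumption for illustration. The source does not describe how AMCT selects the retained weights, and `prune_2_of_4` is a hypothetical helper, not an AMCT API.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude values in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


# Hypothetical weight row with two groups of four.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse_row = prune_2_of_4(row)
print(sparse_row)
```

Because the choice is made inside each group of four rather than across a whole filter, large weights are less likely to be discarded, which matches the accuracy advantage the table attributes to the smaller sparse granularity.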
Differences of AMCT Frameworks
| Document | Description |
|---|---|
| | To compress models under the PyTorch framework, you need to set up the PyTorch environment and then install AMCT. |
| | To compress ONNX models, you need to set up the ONNX Runtime environment and then install AMCT. |
| | To compress models under the TensorFlow framework, you need to set up the TensorFlow environment and then install AMCT. |
| | To compress models under the Caffe framework, you need to set up the Caffe environment and then install AMCT. The following Products do not support the Caffe framework: |
| | You need to set up a TensorFlow environment and use the online inference environment powered by NPUs. After the environment is set up, install AMCT. |
Intended Audience
This document provides guidance for developers to use AMCT to compress models. By reading this document, you can:
- Understand different compression methods of AMCT.
- Compress different models based on the methods provided in the document.
- Master quantization, a common compression method.
To get the most out of this document, you should be familiar with basic Linux commands, be able to develop programs in Python, and have a basic understanding of machine learning and deep learning.