Learning Wizard
This section describes the concept, advantages, and intended audience of the Ascend Model Compression Toolkit (AMCT), and how AMCT differs across frameworks. You can select a framework for model compression based on your requirements.
AMCT is a deep learning model compression toolkit designed for Ascend AI Processors. It slims models down using compression techniques such as quantization and tensor decomposition. The resulting model supports low-bit computation on Ascend AI Processors, which improves compute efficiency and performance.
AMCT is built on open source frameworks and implements low-bit quantization of activations and weights, tensor decomposition, and model optimization (mainly operator fusion) in network models. The toolkit has the following advantages:
- Easy to use: You only need to install the tool package in the original framework environment.
- Easy-to-use APIs: You can complete model compression by calling the APIs from the inference script of the open source framework. The resulting model can run on the CPU and GPU.
- Hardware compatibility: You can convert the resulting model by using the Ascend Tensor Compiler (ATC) tool, and then run 8-bit inference on Ascend AI Processors.
- Configurable quantization: For optimal results, you can modify the quantization configuration file and adjust the compression strategy.
AMCT uses quantization and tensor decomposition for compression. Model optimization (mainly operator fusion) can be performed during quantization.
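To make the idea of low-bit quantization concrete, the following is a minimal sketch in plain Python. It is not AMCT code and does not use any AMCT API: the function names `quantize_int8` and `dequantize` are hypothetical, and the symmetric per-tensor scheme shown here is an assumption chosen for simplicity, not necessarily the algorithm AMCT applies.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to signed 8-bit integers.

    Hypothetical helper for illustration only; not part of AMCT.
    """
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map 8-bit integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


weights = [0.6, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The round-trip error of each value is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one 8-bit integer plus a shared scale, which is where the reduction in storage and the opportunity for low-bit computation come from. The price is the rounding error reported in `max_error`.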
Advantages and Disadvantages of Different Compression Modes
| Compression Mode | Type | Advantage | Disadvantage | Supported Framework | Supported Product |
|---|---|---|---|---|---|
| Quantization | PTQ | | This depends on the distribution of the calibration dataset. If the distribution of the calibration dataset differs greatly from that of the validation dataset, the quantization result is poor. If the weight is not retrained, the model accuracy drops greatly after quantization. | | |
| Quantization | Quantization aware training | | | | |
| Sparsity | Filter-level sparsity | | | | |
| Sparsity | 2:4 structured sparsity | Smaller sparse granularity retains more important information, resulting in a precision advantage. | | | |
| Compression combination | - | The model can be quantized and sparsified at the same time to obtain a higher compression ratio. | Retraining is time-consuming. Quantization and sparsity are performed at the same time, which greatly affects the model precision. | | |
| Tensor decomposition | - | A convolution kernel is decomposed into low-rank tensors to reduce storage space and computation workload. | - | | |
| Automatic mixed precision search | - | An optimal solution is automatically provided for the calculation precision configuration of each layer, eliminating the difficulty of manual optimization. | - | | |
| Activation quantization balance preprocessing | - | The impact of activation outliers on the accuracy of the quantized model is reduced. | - | | |
| Layer-wise distillation | - | Weights can be fine-tuned based on quantization to ensure high precision and shorten the execution duration of weight training. | - | PyTorch | |
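As an illustration of the 2:4 structured sparsity entry in the table, the sketch below keeps two values in every group of four consecutive weights and zeroes the other two. It is plain Python rather than AMCT code: the function name `prune_2_of_4` is hypothetical, and choosing the two largest-magnitude weights per group is an assumed selection rule used for illustration.

```python
def prune_2_of_4(weights):
    """Zero out two of every four consecutive weights (2:4 sparsity).

    Hypothetical helper for illustration only; not part of AMCT.
    Assumes the two largest-magnitude weights in each group are kept.
    """
    if len(weights) % 4 != 0:
        raise ValueError("number of weights must be a multiple of 4")

    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Positions of the two entries with the largest absolute value.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]
sparse = prune_2_of_4(weights)
print(sparse)
```

Because the pruning decision is made within each group of four rather than for a whole filter, the large weights in every group survive. This is the fine granularity that the table credits with retaining more important information.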
Differences of AMCT Frameworks
| Document | Description |
|---|---|
| | To compress models under the PyTorch framework, you need to set up the PyTorch environment and then install AMCT. |
| | To compress ONNX models, you need to set up the ONNX Runtime environment and then install AMCT. |
| | To compress TensorFlow models, you need to set up the TensorFlow environment and then install AMCT. |
| | To compress models under the Caffe framework, you need to set up the Caffe environment and then install AMCT. |
| | You need to set up a TensorFlow environment and use the online inference environment with NPU devices. After the environment is set up, install the AMCT tool. |
Intended Audience
This document provides guidance for developers to use AMCT to compress models. By reading this document, you can achieve the following objectives:
- Understand different compression methods of AMCT.
- Compress different models using the methods provided in this document.
- Master the common compression method: quantization.
To get the most out of this document, you should be familiar with the basic architecture and features of Linux, be able to develop programs in Python, and have a basic understanding of machine learning and deep learning.