Learning Wizard

This section describes the concept, advantages, and intended audience of the Ascend Model Compression Toolkit (AMCT), and how AMCT differs across frameworks. You can select a framework for model compression based on your requirements.

AMCT is a deep learning model compression toolkit designed for Ascend AI Processors. It slims models down by means of model compression techniques, including quantization and tensor decomposition. The resulting model supports low-bit computation on Ascend AI Processors, which improves compute efficiency and performance.
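
As a rough illustration of what low-bit quantization does to a tensor, the sketch below maps floating-point weights to signed 8-bit integers with a single scale factor and then maps them back. It assumes symmetric per-tensor quantization. The function names `quantize_int8` and `dequantize` are hypothetical and are not AMCT APIs, and the algorithm AMCT actually applies may differ.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to signed 8-bit integers."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.05, 0.4, -0.63]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
# The rounding error per value is at most half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q_weights)
print(scale)
print(max_error)
```

Each value is stored in 8 bits instead of 32, at the cost of a rounding error bounded by half the scale. This is the trade-off between model size and accuracy that the rest of this section discusses.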

AMCT is built on open-source frameworks and implements low-bit quantization of activations and weights, tensor decomposition, and model optimization (mainly operator fusion) for network models. The toolkit has the following advantages:

  • Easy to use: You only need to install the tool package in the existing framework environment.
  • Easy-to-use APIs: You can complete model compression by calling the APIs from an inference script written for the open-source framework. The resulting model can run on the CPU and GPU.
  • Hardware compatibility: You can convert the resultant model by using the Ascend Tensor Compiler (ATC) tool, and then implement 8-bit inference on Ascend AI Processor.
  • Configurable quantization: For optimal results, you can modify the quantization configuration file and adjust the compression strategy.

AMCT uses quantization and tensor decomposition for compression. Model optimization (mainly operator fusion) can be performed during quantization.
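
One common form of operator fusion is folding an inference-time batch normalization into the layer before it, so that two operators become one. The sketch below shows that arithmetic on a single output channel, with a dot product standing in for a convolution. Whether AMCT performs exactly this fusion is an assumption here, and all function names are hypothetical.

```python
import math


def linear(x, weights, bias):
    """A single output channel: dot product plus bias (stand-in for a convolution)."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with fixed statistics."""
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters into the preceding layer's weights and bias."""
    factor = gamma / math.sqrt(var + eps)
    fused_weights = [w * factor for w in weights]
    fused_bias = (bias - mean) * factor + beta
    return fused_weights, fused_bias


x = [0.5, -1.0, 2.0]
weights, bias = [0.2, -0.4, 0.1], 0.3
gamma, beta, mean, var = 1.5, -0.2, 0.1, 0.8

# Two operators applied one after the other.
separate = batch_norm(linear(x, weights, bias), gamma, beta, mean, var)

# One fused operator with rescaled weights and bias.
fused_weights, fused_bias = fold_batch_norm(weights, bias, gamma, beta, mean, var)
fused = linear(x, fused_weights, fused_bias)

print(abs(separate - fused) < 1e-9)
```

The fused layer produces the same output with one operator fewer, which is why this kind of optimization can be combined with quantization without changing the model's behavior.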

Advantages and Disadvantages of Different Compression Modes

Table 1 Comparison of compression modes

Quantization: post-training quantization (PTQ)

  • Advantages: The model does not need to be retrained. Only a small amount of calibration data is required.
  • Disadvantages: The result depends on the distribution of the calibration dataset. If that distribution differs greatly from that of the validation dataset, the quantization result is poor. Because the weights are not retrained, the model accuracy can drop significantly after quantization.
  • Supported frameworks: Caffe; TensorFlow; PyTorch; ONNX; TensorFlow,Ascend
  • Supported products: Atlas 200/300/500 Inference Product; Atlas Training Series Product

Quantization: quantization-aware training

  • Advantages: The model is retrained, so the accuracy loss is small.
  • Disadvantages: Quantization during training is time-consuming. More data is required, usually a complete training dataset.
  • Supported frameworks: Caffe; TensorFlow; PyTorch
  • Supported products: Atlas 200/300/500 Inference Product; Atlas Training Series Product

Sparsity: filter-level sparsity

  • Advantages: The sparse granularity is larger, so more performance benefits can be obtained. The sparsity ratio is configurable.
  • Disadvantages: Changing the shape of the weight affects the operators of the upper or lower layer. The retraining required after sparsity is time-consuming. The model accuracy is greatly affected.
  • Supported frameworks: TensorFlow; PyTorch
  • Supported products: Atlas 200/300/500 Inference Product; Atlas Training Series Product

Sparsity: 2:4 structured sparsity

  • Advantages: The smaller sparse granularity retains more important information, which gives an accuracy advantage.
  • Disadvantages: This feature is supported only by some chips due to hardware restrictions. The retraining required after sparsity is time-consuming. The sparsity ratio is fixed at 50%.
  • Supported frameworks: TensorFlow; PyTorch

Compression combination

  • Advantages: The model can be quantized and sparsified at the same time to obtain a higher compression ratio.
  • Disadvantages: Retraining is time-consuming. Performing quantization and sparsity at the same time greatly affects the model accuracy.
  • Supported frameworks: TensorFlow; PyTorch
  • Supported products: Atlas 200/300/500 Inference Product; Atlas Training Series Product

Tensor decomposition

  • Advantages: A convolution kernel is decomposed into low-rank tensors to reduce storage space and computation workload.
  • Disadvantages: -
  • Supported frameworks: Caffe; TensorFlow; PyTorch
  • Supported products: Atlas 200/300/500 Inference Product; Atlas Training Series Product

Automatic mixed precision search

  • Advantages: An optimal compute-precision configuration is found automatically for each layer, eliminating the difficulty of manual tuning.
  • Disadvantages: -
  • Supported frameworks: TensorFlow; PyTorch
  • Supported products: Atlas 200/300/500 Inference Product; Atlas Training Series Product

Activation quantization balance preprocessing

  • Advantages: The impact of activation outliers on the accuracy of the quantized model is reduced.
  • Disadvantages: -
  • Supported frameworks: TensorFlow; PyTorch; ONNX
  • Supported products: Atlas 200/300/500 Inference Product; Atlas Training Series Product

Layer-wise distillation

  • Advantages: Weights are fine-tuned on top of quantization to keep accuracy high while shortening the weight training time.
  • Disadvantages: -
  • Supported frameworks: PyTorch
  • Supported products: Atlas 200/300/500 Inference Product; Atlas Training Series Product
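
To make the 2:4 structured sparsity entry in Table 1 more concrete, the sketch below keeps the two largest-magnitude values in every group of four weights and zeroes the other two, which is why the sparsity ratio is fixed at 50%. It is a minimal illustration on a flat list. The function name `prune_2_of_4` is hypothetical, and the grouping dimension and selection rule that AMCT uses are assumptions here. The retraining step mentioned in the table is not shown.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in every group of four weights."""
    if len(weights) % 4 != 0:
        raise ValueError("weight count must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes in this group are kept.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse = prune_2_of_4(weights)
sparsity = sparse.count(0.0) / len(sparse)

print(sparse)    # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.25, 0.0]
print(sparsity)  # 0.5
```

Filter-level sparsity, by contrast, removes whole filters, which changes the weight shape and therefore affects neighboring operators, as the table notes.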

Differences of AMCT Frameworks

Each AMCT document covers one framework and requires the following environment:

  • AMCT (PyTorch): To compress models under the PyTorch framework, set up the PyTorch environment and then install AMCT.
  • AMCT (ONNX): To compress ONNX models, set up the ONNX Runtime environment and then install AMCT.
  • AMCT (TensorFlow): To compress TensorFlow models, set up the TensorFlow environment and then install AMCT.
  • AMCT (Caffe): To compress models under the Caffe framework, set up the Caffe environment and then install AMCT.
  • AMCT (TensorFlow,Ascend): Set up a TensorFlow environment and use the online inference environment with NPU devices. After the environment is set up, install AMCT.
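
Because each entry above requires the framework environment to be in place before AMCT is installed, a quick check of which frameworks are importable can help you choose the matching AMCT document. The sketch below is one way to do that with the Python standard library. The import names (`torch`, `onnxruntime`, `tensorflow`, `caffe`) are the usual ones for these frameworks, the helper `available_frameworks` is hypothetical, and the script does not check AMCT itself or the NPU environment.

```python
import importlib.util

# Import names of the frameworks listed above. These are the frameworks'
# own packages, not AMCT; the AMCT package itself is not checked here.
FRAMEWORK_MODULES = {
    "PyTorch": "torch",
    "ONNX Runtime": "onnxruntime",
    "TensorFlow": "tensorflow",
    "Caffe": "caffe",
}


def available_frameworks():
    """Return a mapping of framework name to whether it can be imported."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in FRAMEWORK_MODULES.items()
    }


for name, present in available_frameworks().items():
    print(f"{name}: {'found' if present else 'not found'}")
```

`find_spec` only locates the package without importing it, so the check stays fast even when a large framework is installed.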

Intended Audience

This document guides developers in using AMCT to compress models. After reading it, you will be able to:

  • Understand the different compression methods of AMCT.
  • Compress different models using the methods provided in the document.
  • Master the most common compression method: quantization.

To get the most out of this document, you should be familiar with the basic architecture and features of Linux, be able to develop programs in Python, and have a basic understanding of machine learning and deep learning.