Learning Wizard

This section introduces the Ascend Model Compression Toolkit (AMCT) and its advantages, compares the compression modes it offers, describes how AMCT differs across frameworks, and identifies the intended audience of this document. Use it to select the framework for model compression that fits your requirements.

AMCT is a deep learning model compression toolkit designed for Ascend AI Processors. It reduces model size through compression techniques such as quantization and tensor decomposition. The compressed model supports low-bit computation on the Ascend AI Processor, which improves compute efficiency and performance.

AMCT is built on top of open-source frameworks and implements low-bit quantization of activations and weights, tensor decomposition, and model optimization (mainly operator fusion) for network models. The toolkit has the following advantages:

  • Ease of use: You only need to install the AMCT package in your existing framework environment.
  • Intuitive APIs: You compress a model by calling AMCT APIs from the framework's inference script. The resulting model can run on CPUs and GPUs.
  • Hardware compatibility: You can convert the resulting model with the Ascend Tensor Compiler (ATC) and then run inference on the Ascend AI Processor.
  • Configurable quantization: For optimal results, you can modify the quantization configuration file to adjust the compression strategy.
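
As a rough illustration of what low-bit quantization of weights means, the following Python sketch maps floating-point values to 8-bit integer codes with a single scale factor and then maps them back. It is a minimal sketch of generic symmetric quantization, not AMCT's algorithm or API; the function names `quantize_int8` and `dequantize` and the sample weights are made up for this example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to int8 codes."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, which would give a zero scale.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.824, -1.27, 0.051, 0.4, -0.636]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(scale)
print(max_error)
```

Because the largest magnitude maps exactly to the edge of the integer range, no value is clipped here, and each restored value differs from the original by at most about half a quantization step (`scale / 2`).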

AMCT compresses models by using quantization and tensor decomposition. Model optimization (mainly operator fusion) can be performed during quantization.
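
To see why decomposing a convolution kernel into low-rank tensors can save storage, the sketch below counts weights before and after one common low-rank scheme: a k x k convolution is replaced by a k x 1 convolution followed by a 1 x k convolution with `rank` intermediate channels. The scheme, the function names, and the layer sizes are assumptions chosen for illustration; the decomposition that AMCT actually applies may differ.

```python
def conv_params(c_in, c_out, k_h, k_w):
    """Number of weights in a convolution layer (bias ignored)."""
    return c_in * c_out * k_h * k_w


def decomposed_params(c_in, c_out, k, rank):
    """Weights after replacing one k x k convolution with a k x 1
    convolution (c_in -> rank) followed by a 1 x k convolution
    (rank -> c_out)."""
    return conv_params(c_in, rank, k, 1) + conv_params(rank, c_out, 1, k)


# Hypothetical layer: 256 input channels, 256 output channels, 3 x 3 kernel.
original = conv_params(256, 256, 3, 3)
compressed = decomposed_params(256, 256, k=3, rank=64)

print(original, compressed, compressed / original)
```

For this hypothetical layer, the decomposed pair holds one sixth of the original weights. A lower rank compresses more but approximates the original kernel less closely.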

Advantages and Disadvantages of Compression Modes

Table 1 Comparison of compression modes

Each compression mode in the table is described by its advantages, disadvantages, supported frameworks, and applicability (the products on which it can be used).

Quantization

Post-training quantization (PTQ)

Advantage:

  • Model retraining is not required.
  • Only a small amount of calibration data is required.

Disadvantage:

  • The result depends on the distribution of the calibration dataset. If that distribution differs greatly from the distribution of the validation dataset, the quantization quality is poor.
  • Because the weights are not retrained, the model accuracy can drop significantly after quantization.

Supported framework:

  • Caffe
  • TensorFlow
  • PyTorch
  • ONNX
  • MindSpore
  • TensorFlow, Ascend

Applicability:

  • Atlas inference series products
  • Atlas training products
  • Atlas 200I/500 A2 inference product
  • Atlas A2 training products/Atlas A2 inference products
  • Atlas A3 training series products/Atlas A3 inference series products

Quantization aware training (QAT)

Advantage:

  • The accuracy drop is small.

Disadvantage:

  • The model needs to be retrained.
  • Quantization during training is time-consuming.
  • More data is required, usually a complete training dataset.

Supported framework:

  • Caffe
  • TensorFlow
  • PyTorch
  • MindSpore

Applicability:

  • Atlas inference series products
  • Atlas training products
  • Atlas 200I/500 A2 inference product
  • Atlas A2 training products/Atlas A2 inference products
  • Atlas A3 training series products/Atlas A3 inference series products

Sparsity

Filter-level sparsity

Advantage:

  • The sparsity granularity is coarser, so larger performance gains can be obtained.
  • The sparsity ratio is configurable.

Disadvantage:

  • Pruning changes the shape of the weight, which affects the operators in the preceding or following layer.
  • The retraining required after sparsification is time-consuming.
  • The model accuracy is greatly affected.

Supported framework:

  • TensorFlow
  • PyTorch

Applicability:

  • Atlas inference series products
  • Atlas training products
  • Atlas 200I/500 A2 inference product
  • Atlas A2 training products/Atlas A2 inference products
  • Atlas A3 training series products/Atlas A3 inference series products

2:4 structured sparsity

Advantage: The finer sparsity granularity retains more of the important information, which gives an accuracy advantage.

Disadvantage:

  • This feature is supported only by some chips due to hardware restrictions.
  • The retraining required after sparsification is time-consuming.
  • The sparsity ratio is fixed at 50%.

Supported framework:

  • TensorFlow
  • PyTorch

Applicability:

  • Atlas 200I/500 A2 inference product
  • Atlas A2 training products/Atlas A2 inference products
  • Atlas A3 training series products/Atlas A3 inference series products

Other products: This feature is not supported.

Compression combination

Advantage: The model can be quantized and sparsified at the same time to obtain a higher compression ratio.

Disadvantage: Retraining is required, which is time-consuming. Applying quantization and sparsity at the same time also significantly affects the model accuracy.

Supported framework:

  • TensorFlow
  • PyTorch

Applicability:

  • Atlas inference series products
  • Atlas training products
  • Atlas 200I/500 A2 inference product
  • Atlas A2 training products/Atlas A2 inference products
  • Atlas A3 training series products/Atlas A3 inference series products

Tensor decomposition

Advantage: A convolution kernel is decomposed into low-rank tensors to reduce the storage space and computation workload.

Disadvantage: -

Supported framework:

  • Caffe
  • TensorFlow
  • PyTorch

Applicability:

  • Atlas inference series products
  • Atlas training products
  • Atlas 200I/500 A2 inference product
  • Atlas A2 training products/Atlas A2 inference products
  • Atlas A3 training series products/Atlas A3 inference series products

Activation quantization balance preprocessing

Advantage: The impact of activation outliers on the accuracy of the quantized model is reduced.

Disadvantage: -

Supported framework:

  • TensorFlow
  • PyTorch
  • ONNX

Applicability:

  • Atlas inference series products
  • Atlas training products
  • Atlas 200I/500 A2 inference product
  • Atlas A2 training products/Atlas A2 inference products
  • Atlas A3 training series products/Atlas A3 inference series products

Layer-wise distillation

Advantage: Weights are fine-tuned on top of quantization, which preserves accuracy and shortens the weight training time.

Disadvantage: -

Supported framework: PyTorch

Applicability:

  • Atlas inference series products
  • Atlas training products
  • Atlas 200I/500 A2 inference product
  • Atlas A2 training products/Atlas A2 inference products
  • Atlas A3 training series products/Atlas A3 inference series products

KV cache quantization

Advantage: No model retraining is required. Only a small amount of calibration data is required.

Disadvantage: Only the node output is quantized, which does not improve the model running efficiency.

Supported framework: PyTorch

Applicability:

  • Atlas inference series products
  • Atlas A2 training products/Atlas A2 inference products
  • Atlas A3 training series products/Atlas A3 inference series products
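
The 2:4 structured sparsity pattern in Table 1 can be sketched in a few lines: in every group of four consecutive weights, the two with the smallest magnitude are set to zero, which is why the sparsity ratio is fixed at 50%. This is a minimal, framework-free sketch under the assumption that magnitude is used as the pruning criterion; `prune_2_of_4` is a hypothetical helper, not an AMCT API, and the retraining step mentioned in the table is omitted.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude weights in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


# Hypothetical weight row, chosen only for illustration.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse_row = prune_2_of_4(row)
sparsity = sparse_row.count(0.0) / len(sparse_row)

print(sparse_row)
print(sparsity)
```

Filter-level sparsity, by contrast, removes whole filters, which changes the weight shape instead of leaving zeros in place.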

Differences of AMCT Frameworks

A separate AMCT document covers each framework:

  • AMCT (PyTorch): To compress models under the PyTorch framework, set up the PyTorch environment and then install AMCT.
  • AMCT (ONNX): To compress ONNX models, set up the ONNX Runtime environment and then install AMCT.
  • AMCT (TensorFlow): To compress models under the TensorFlow framework, set up the TensorFlow environment and then install AMCT.
  • AMCT (Caffe): To compress models under the Caffe framework, set up the Caffe environment and then install AMCT. The following products do not support the Caffe framework: Atlas A2 training products/Atlas A2 inference products, and Atlas A3 training series products/Atlas A3 inference series products.
  • AMCT (TensorFlow, Ascend): Set up a TensorFlow environment and use the online inference environment powered by NPUs. After the environment is set up, install AMCT.
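
Because each AMCT variant is installed on top of an existing framework environment, it can help to check which frameworks are importable before installing AMCT. The sketch below uses Python's standard `importlib.util.find_spec`; the mapping from AMCT variant to import name (`torch`, `onnxruntime`, `tensorflow`, `caffe`) is an assumption based on the usual import names of those frameworks, and AMCT (TensorFlow, Ascend) is left out because it also requires an NPU-based environment that this check cannot detect.

```python
import importlib.util

# Assumed import names of the frameworks; the mapping to AMCT variants
# is for illustration only.
FRAMEWORK_MODULES = {
    "AMCT (PyTorch)": "torch",
    "AMCT (ONNX)": "onnxruntime",
    "AMCT (TensorFlow)": "tensorflow",
    "AMCT (Caffe)": "caffe",
}


def is_installed(module_name):
    """Return True if the top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def available_variants(modules=FRAMEWORK_MODULES):
    """List the AMCT variants whose framework is present in this environment."""
    return [variant for variant, module in modules.items() if is_installed(module)]


print(available_variants())
```

An empty list means that none of the listed frameworks is installed in the current Python environment, so a framework has to be set up first.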

Intended Audience

This document guides developers through compressing models with AMCT. After reading this document, you will be able to:

  • Understand different compression methods of AMCT.
  • Compress different models based on the methods provided in the document.
  • Master quantization, a common compression method.

To get the most out of this document, you should be familiar with basic Linux commands, be able to develop programs in Python, and have a basic understanding of machine learning and deep learning.