Overview
Quantization reduces the precision of model weights and activations. This makes the model smaller, improves compute efficiency, and lowers data transfer latency. This section describes how to quantize a graph.
Figure 1 shows the working principle.
Quantization also applies deployment-oriented model optimization, mainly operator fusion, as shown in Figure 2.
Quantization is classified into automatic and manual quantization.
- Automatic quantization: The aclgrphCalibration API inserts quantization operators automatically and performs operator fusion on some structures in the model. Automatic quantization is recommended.
- Manual quantization: The model is modified manually to insert quantization operators.
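To illustrate what a quantization operator does numerically, the following is a minimal sketch of symmetric per-tensor 8-bit quantization in plain Python. It is not the aclgrphCalibration implementation: the function names (compute_scale, quantize, dequantize), the symmetric scheme, and the sample weights are assumptions chosen for illustration only.

```python
def compute_scale(values, num_bits=8):
    """Return a per-tensor scale for symmetric signed quantization (assumed scheme)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max((abs(v) for v in values), default=0.0)
    # Fall back to 1.0 so an all-zero tensor does not cause division by zero.
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Map floats to integers: divide by scale, round, clamp to the signed range."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -qmax - 1
    return [min(qmax, max(qmin, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map integers back to approximate floats."""
    return [q * scale for q in q_values]


# Hypothetical weights, used only to show the round trip.
weights = [0.5, -1.27, 0.03, 1.0]
scale = compute_scale(weights)
q = quantize(weights, scale)
restored = dequantize(q, scale)

print("scale:", scale)
print("quantized:", q)
print("restored:", restored)
```

Each restored value differs from the original by at most half the scale, which is the precision given up in exchange for storing every value in 8 bits.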

