Manual Quantization
This section describes how to manually modify a model by inserting quantization operators to implement the quantization function.
Overview
You can choose the appropriate tools for the framework in use to perform quantization and inject the quantization factors (scale_d, scale_w, and offset_d) into the model when building it.
- Currently, only the Conv2D, DepthwiseConv2D and FullyConnection operators support quantization.
- If the channel size of the input data of a Conv2D, DepthwiseConv2D, or FullyConnection operator is less than or equal to 16, INT8 quantization brings no performance gain because of padding. Therefore, quantize these three operators only when the channel size of their input data is greater than 16.
For example, to apply INT8 quantization to the Conv2D operator, insert the AscendQuant quantization operator before it and the AscendDequant dequantization operator after it, as shown in Figure 1.
The AscendQuant quantization operator converts float data into int8 data according to the following formula: data_int8 = round((data_float × scale) + offset), where scale = 1/scale_d and offset = offset_d. The rounding behaves like rint() in C with the FE_TONEAREST rounding mode.
The AscendDequant operator converts int32 data into float32 data according to the following formula: data_float32 = data_int32 × deq_scale, where deq_scale = scale_d × scale_w.
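For intuition, the following standalone sketch reproduces the two formulas on scalar values. It is not part of the graph-building code, and the quantization factors (scale_d, scale_w, offset_d) are illustrative assumptions rather than values from a real calibration; the clamp to the int8 range is also added here only so the cast is well defined.

#include <cfenv>
#include <cmath>
#include <cstdint>

// Illustrative quantization factors (assumptions, not from a real calibration).
const float scale_d  = 0.0078125f;   // data scale
const float scale_w  = 0.01f;        // weight scale
const float offset_d = -128.0f;      // data offset

// AscendQuant: data_int8 = round((data_float * scale) + offset),
// where scale = 1/scale_d and offset = offset_d.
int8_t Quantize(float data_float) {
    std::fesetround(FE_TONEAREST);                       // match the documented rounding behavior of rint()
    float scale = 1.0f / scale_d;
    float q = std::rint(data_float * scale + offset_d);
    if (q > 127.0f)  q = 127.0f;                         // clamp to the int8 range (assumption, not stated by the formula)
    if (q < -128.0f) q = -128.0f;
    return static_cast<int8_t>(q);
}

// AscendDequant: data_float32 = data_int32 * deq_scale, where deq_scale = scale_d * scale_w.
float Dequantize(int32_t data_int32) {
    float deq_scale = scale_d * scale_w;
    return static_cast<float>(data_int32) * deq_scale;
}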
Inserting AscendQuant Before Conv2D
The AscendQuant operator prototype is defined as follows.
REG_OP(AscendQuant)
    .INPUT(x, TensorType({DT_FLOAT16, DT_FLOAT32}))
    .OUTPUT(y, TensorType({DT_INT8, DT_INT4}))
    .REQUIRED_ATTR(scale, Float)
    .REQUIRED_ATTR(offset, Float)
    .ATTR(sqrt_mode, Bool, false)
    .ATTR(round_mode, String, "Round")
    .ATTR(dst_type, Int, DT_INT8)
    .OP_END_FACTORY_REG(AscendQuant)
The AscendQuant operator has one input (x), one output (y), two required attributes (scale and offset), and three optional attributes (sqrt_mode, round_mode, and dst_type). They are described as follows:
- x: a tensor of type float16 or float32, for the input of the AscendQuant operator.
- y: a tensor of type int8 or int4, for the output of the operator.
- scale: a float, for the quantization factor (scale = 1/scale_d). The value should be within the float16 range; otherwise, set sqrt_mode to True.
- offset: a float, for the quantization offset (offset = offset_d).
- sqrt_mode: a bool to specify whether to perform square root extraction on scale. Defaults to False (recommended). If the value of scale exceeds the float16 range, set sqrt_mode to True to avoid accuracy loss (square root extraction is performed on scale).
- round_mode: method for converting a float type to an int type. Selected from Round (default), Floor, Ceiling, and Truncate.
- dst_type: output data type. The default value is 2, indicating that the data type is DT_INT8.
Create an AscendQuant operator instance based on the operator prototype definition.
auto quant = op::AscendQuant("quant")
    .set_input_x(data)
    .set_attr_scale(1.00049043)   // Specify scale.
    .set_attr_offset(-128.0);     // Specify offset.
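The optional attributes can be set in the same builder style if required. The following sketch assumes the set_attr_<name> accessors generated from the operator registration; the values shown simply restate the defaults.

auto quant_full = op::AscendQuant("quant_full")
    .set_input_x(data)
    .set_attr_scale(1.00049043)
    .set_attr_offset(-128.0)
    .set_attr_sqrt_mode(false)       // Set to true only if scale exceeds the float16 range.
    .set_attr_round_mode("Round")    // Round / Floor / Ceiling / Truncate.
    .set_attr_dst_type(DT_INT8);     // Output int8.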
Conv2D
Set AscendQuant as the input of Conv2D, and set the output type to int32.
// const op: conv2d weight
auto weight_shape = ge::Shape({ 5, 17, 1, 1 });
TensorDesc desc_weight_1(weight_shape, FORMAT_NCHW, DT_INT8);
Tensor weight_tensor(desc_weight_1);
uint32_t weight_1_len = weight_shape.GetShapeSize();
bool res = GetConstTensorFromBin(PATH + "const_0.bin", weight_tensor, weight_1_len);
if (!res) {
    std::cout << "GetConstTensorFromBin Failed!" << std::endl;
    return -1;
}
auto conv_weight = op::Const("const_0")
    .set_attr_value(weight_tensor);

// const op: conv2d bias
auto bias_shape = ge::Shape({ 5 });
TensorDesc desc_bias(bias_shape, FORMAT_NCHW, DT_INT32);
Tensor bias_tensor(desc_bias);
uint32_t bias_len = bias_shape.GetShapeSize() * sizeof(int32_t);
res = GetConstTensorFromBin(PATH + "const_1.bin", bias_tensor, bias_len);
if (!res) {
    std::cout << "GetConstTensorFromBin Failed!" << std::endl;
    return -1;
}
auto conv_bias = op::Const("const_1")
    .set_attr_value(bias_tensor);

// conv2d op
auto conv2d = op::Conv2D("Conv2d")
    .set_input_x(quant)            // AscendQuant is used as the input of the Conv2D operator.
    .set_input_filter(conv_weight)
    .set_input_bias(conv_bias)
    .set_attr_strides({ 1, 1, 1, 1 })
    .set_attr_pads({ 0, 0, 0, 0 })
    .set_attr_dilations({ 1, 1, 1, 1 });

TensorDesc conv2d_input_desc_x(ge::Shape(), FORMAT_NCHW, DT_INT8);   // After quantization, set the data type of input x to int8.
TensorDesc conv2d_input_desc_filter(ge::Shape(), FORMAT_NCHW, DT_INT8);
TensorDesc conv2d_input_desc_bias(ge::Shape(), FORMAT_NCHW, DT_INT32);
TensorDesc conv2d_output_desc_y(ge::Shape(), FORMAT_NCHW, DT_INT32);
conv2d.update_input_desc_x(conv2d_input_desc_x);
conv2d.update_input_desc_filter(conv2d_input_desc_filter);
conv2d.update_input_desc_bias(conv2d_input_desc_bias);
conv2d.update_output_desc_y(conv2d_output_desc_y);
Inserting AscendDequant After Conv2D
The AscendDequant operator prototype is defined as follows.
REG_OP(AscendDequant)
    .INPUT(x, TensorType({DT_INT32}))
    .INPUT(deq_scale, TensorType({DT_FLOAT16, DT_UINT64}))
    .OUTPUT(y, TensorType({DT_FLOAT16, DT_FLOAT}))
    .ATTR(sqrt_mode, Bool, false)
    .ATTR(relu_flag, Bool, false)
    .ATTR(dtype, Int, DT_FLOAT)
    .OP_END_FACTORY_REG(AscendDequant)
The AscendDequant operator has two inputs (x and deq_scale), and three optional attributes (sqrt_mode, relu_flag, and dtype). The parameters are described as follows:
- x: a tensor of type int32 for the input of the AscendDequant operator.
- deq_scale: a tensor of type uint64 for the dequantization factor (deq_scale = scale_d × scale_w). Its shape is either 1 or matches the channel dimension of the Conv2D output.
You need to convert the float32 value obtained by multiplying scale_d and scale_w into uint64, filling the float32 bit pattern into the lower 32 bits of each deq_scale element. The upper 32 bits must be all 0s. For example:
import numpy as np

def trans_float32_scale_deq_to_uint64(scale_deq):
    float32_scale_deq = np.array(scale_deq, np.float32)
    uint32_scale_deq = np.frombuffer(float32_scale_deq, np.uint32)
    uint64_result = np.zeros(float32_scale_deq.shape, np.uint64)
    uint64_result |= np.uint64(uint32_scale_deq)
    return uint64_result
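If you prepare the deq_scale constant from C++ instead, the same bit-level conversion can be sketched as follows. This helper is illustrative and not part of the GE API; it reinterprets one float32 value per element, matching the layout described above.

#include <cstdint>
#include <cstring>

// Place the raw float32 bit pattern of deq_scale (scale_d * scale_w)
// in the lower 32 bits of a uint64; the upper 32 bits remain all 0s.
uint64_t TransFloat32ScaleDeqToUint64(float scale_deq) {
    uint32_t lower = 0;
    std::memcpy(&lower, &scale_deq, sizeof(lower));   // copy the float32 bit pattern
    return static_cast<uint64_t>(lower);              // upper 32 bits are zero
}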
- sqrt_mode: a bool to specify whether to perform square root extraction on deq_scale. Defaults to False (recommended). If the value of deq_scale exceeds the float16 range, set sqrt_mode to True to avoid accuracy loss (square root extraction is performed on deq_scale).
- relu_flag: a bool to specify whether to perform ReLU. Defaults to False.
- dtype: output data type. The default value is 0, indicating that the data type is DT_FLOAT.
Create an AscendDequant operator instance based on the operator prototype definition.
// Construct dequant_scale.
TensorDesc desc_dequant_shape(ge::Shape({ 5 }), FORMAT_NCHW, DT_UINT64);
Tensor dequant_tensor(desc_dequant_shape);
uint32_t dequant_scale_len = 5 * sizeof(uint64_t);
res = GetConstTensorFromBin(PATH + "const_2.bin", dequant_tensor, dequant_scale_len);
if (!res) {
    std::cout << "GetConstTensorFromBin Failed!" << std::endl;
    return -1;
}
auto dequant_scale = op::Const("dequant_scale")
    .set_attr_value(dequant_tensor);

// Define the AscendDequant operator.
auto dequant = op::AscendDequant("dequant")
    .set_input_x(conv2d)   // Conv2D is used as the input of the AscendDequant operator.
    .set_input_deq_scale(dequant_scale);
Set the output of AscendDequant as the input of other operators, or as the graph output.
auto bias_add_1 = op::BiasAdd("bias_add_1")
    .set_input_x(dequant)
    .set_input_bias(bias_weight_1)
    .set_attr_data_format("NCHW");
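To use the dequantized result directly as a graph output instead, a minimal sketch looks like the following. It assumes that data is the graph input operator defined earlier and that the graph is built with the GE Graph interface.

ge::Graph graph("quantized_graph");
std::vector<ge::Operator> inputs{ data };      // Graph input: the operator feeding AscendQuant.
std::vector<ge::Operator> outputs{ dequant };  // Graph output: the AscendDequant result.
graph.SetInputs(inputs).SetOutputs(outputs);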
