Quantization and Dequantization of Matrix Multiplication Outputs

Overview

For specific input and output data types, Matmul allows you to quantize or dequantize the elements of the output matrix C when the computation result is moved from CO1 to the global memory.

Matmul quantization scenario: The left matrix A and right matrix B are of the half or bfloat16_t data type, and the output matrix C is of the int8_t data type. When the data of matrix C is moved from CO1 to the global memory, the final result is quantized to the int8_t type, as shown in the following figure.
Figure 1 Matmul quantization scenario

Matmul dequantization scenario: The left matrix A and right matrix B are of the int8_t or int4b_t data type, and the output matrix C is of the half data type. Alternatively, both the left matrix A and right matrix B are of the int8_t data type, and the output matrix C is of the int8_t data type. When the data of matrix C is moved from CO1 to the global memory, the final result is dequantized to the half or int8_t type, as shown in the following figure.
Figure 2 Matmul dequantization scenario
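
The two scenarios above can be illustrated with a minimal host-side C++ sketch of the per-element arithmetic. It is not the hardware implementation: the helper names QuantizeToInt8 and DequantizeToFloat are hypothetical, float stands in for the half type, and the round-to-nearest and saturation behavior is an assumption, because this section does not specify the rounding or overflow rules applied on the way from CO1 to the global memory.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical reference helper: quantize one floating-point result to int8_t.
// Assumption: round to nearest, then saturate to [-128, 127].
inline std::int8_t QuantizeToInt8(float acc, float scale) {
    const float scaled = std::nearbyint(acc * scale);
    const float clamped = std::min(127.0f, std::max(-128.0f, scaled));
    return static_cast<std::int8_t>(clamped);
}

// Hypothetical reference helper: dequantize one int32_t accumulator value.
// float is used here as a stand-in for the half output type.
inline float DequantizeToFloat(std::int32_t acc, float scale) {
    return static_cast<float>(acc) * scale;
}
```

A reference of this kind is mainly useful for checking operator output on the host with a tolerance, since the real output type and rounding may differ from the assumptions made here.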

There are two Matmul quantization or dequantization modes: quantization or dequantization of the same coefficient, and vector quantization or dequantization. You can call the SetDequantType API in the operator tiling process to set the quantization or dequantization mode. The differences between the two modes are as follows:

Quantization or dequantization of the same coefficient (PER_TENSOR mode): The entire matrix C shares one quantization parameter, whose shape is [1]. Call the SetQuantScalar API in the operator kernel to set this parameter.
Vector quantization or dequantization (PER_CHANNEL mode): Each channel, that is, each column of the [m, n] matrix C, has its own quantization parameter, so the parameters have the shape [n]. Call the SetQuantVector API in the operator kernel to set these parameters.

**Table 1** API configurations corresponding to the quantization or dequantization modes
Mode	Tiling API	Kernel API
Quantization or dequantization of the same coefficient	SetDequantType(DequantType::SCALAR)	SetQuantScalar(gmScalar)
Vector quantization or dequantization	SetDequantType(DequantType::TENSOR)	SetQuantVector(gmTensor)
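
The difference between the two modes can be sketched as a host-side C++ reference. This is an illustration under stated assumptions, not part of the Matmul API: ApplyScales is a hypothetical helper, the accumulator matrix is assumed to be row-major int32_t, and float stands in for the output type. A scale vector of length 1 models the PER_TENSOR mode, and a vector of length n models the PER_CHANNEL mode.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference, not part of the Matmul API.
// acc is a row-major [m, n] matrix. scales holds 1 value (PER_TENSOR)
// or n values, one per column of matrix C (PER_CHANNEL).
std::vector<float> ApplyScales(const std::vector<std::int32_t>& acc,
                               std::size_t m, std::size_t n,
                               const std::vector<float>& scales) {
    assert(acc.size() == m * n);
    assert(scales.size() == 1 || scales.size() == n);
    const bool perChannel = scales.size() != 1;
    std::vector<float> out(m * n);
    for (std::size_t row = 0; row < m; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            // PER_CHANNEL picks the scale of the current column;
            // PER_TENSOR reuses the single scale for every element.
            const float scale = perChannel ? scales[col] : scales[0];
            out[row * n + col] = static_cast<float>(acc[row * n + col]) * scale;
        }
    }
    return out;
}
```

For example, calling ApplyScales on a 2 x 2 matrix with the single scale 0.5 halves every element, whereas passing the scales {1.0, 2.0} leaves the first column unchanged and doubles the second.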

Application Scenarios

Matrix computation results need to be quantized or dequantized. The following table lists the data types supported by the Matmul input and output matrices in this scenario.

**Table 2** Data types supported by Matmul quantization or dequantization
Matrix A	Matrix B	Matrix C	Supporting Platform
half	half	int8_t	Atlas A3 training products / Atlas A3 inference products; Atlas A2 training products / Atlas A2 inference products
bfloat16_t	bfloat16_t	int8_t	Atlas A3 training products / Atlas A3 inference products; Atlas A2 training products / Atlas A2 inference products
int8_t	int8_t	half	Atlas A3 training products / Atlas A3 inference products; Atlas A2 training products / Atlas A2 inference products
int4b_t	int4b_t	half	Atlas A3 training products / Atlas A3 inference products; Atlas A2 training products / Atlas A2 inference products
int8_t	int8_t	int8_t	Atlas A3 training products / Atlas A3 inference products; Atlas A2 training products / Atlas A2 inference products

Restrictions

The SetQuantScalar and SetQuantVector APIs must be called before the Iterate or IterateAll API.

The quantization or dequantization mode set on the kernel must be the same as that set in the tiling process.
- If the kernel calls the SetQuantScalar API (quantization or dequantization of the same coefficient), the tiling process must call the SetDequantType API with DequantType::SCALAR.
- If the kernel calls the SetQuantVector API (vector quantization or dequantization), the tiling process must call the SetDequantType API with DequantType::TENSOR.

If matrix A and matrix B are of the int8_t or int4b_t type and matrix C is of the half type, the output of the feature described in this section does not support the INF_NAN mode. If the results must be output in INF_NAN mode, you are advised to output the Matmul results to TPosition::VECIN with the output data type set to int32_t, and then call the high-level API AscendDequant on the AIV core to dequantize the results to the half type.

Example

For a complete operator example, see the matmul_quant operator sample.

Tiling implementation

Call the SetDequantType API to set the quantization or dequantization mode. Other implementation details are the same as those in basic scenarios.

         
auto ascendcPlatform = platform_ascendc::PlatformAscendC(context->GetPlatformInfo());
matmul_tiling::MatmulApiTiling tiling(ascendcPlatform); 
tiling.SetAType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_INT8);
tiling.SetBType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_INT8);   
tiling.SetCType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_INT32);   
tiling.SetBiasType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_INT32);   
tiling.SetShape(M, N, K);   
tiling.SetOrgShape(M, N, K);  
tiling.EnableBias(true);
tiling.SetDequantType(DequantType::SCALAR); // Use quantization or dequantization of the same coefficient (PER_TENSOR mode).
// tiling.SetDequantType(DequantType::TENSOR); // Alternatively, use vector quantization or dequantization (PER_CHANNEL mode).
... // Perform other configuration operations.

Kernel implementation

Call the SetQuantScalar or SetQuantVector API to set quantization parameters in line with the quantization mode. Other implementation details are the same as those in basic scenarios.

Quantization or dequantization of the same coefficient

           
REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm, &tiling);
float tmp = 0.1;  // Quantization coefficient: results are multiplied by 0.1 when they are written to the global memory.
uint64_t ans = static_cast<uint64_t>(*reinterpret_cast<int32_t*>(&tmp)); // Pass the bit pattern of the float coefficient, widened to uint64_t, rather than its numeric value.
mm.SetQuantScalar(ans);
mm.SetTensorA(gm_a);
mm.SetTensorB(gm_b);
mm.SetBias(gm_bias);
mm.IterateAll(gm_c);
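
The float-to-uint64_t conversion in the sample above passes the bit pattern of the coefficient rather than its numeric value. The following host-side C++ sketch shows the same idea with std::memcpy, which avoids the pointer cast. PackScaleBits is a hypothetical helper name, and whether std::memcpy is usable inside a kernel is not confirmed by this section, so treat it as a host-side illustration only. Note one difference: the sample goes through int32_t, so a negative coefficient would be sign-extended, whereas this sketch zero-extends the 32 bits.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy the 32-bit pattern of a float into the low
// 32 bits of a uint64_t. The upper 32 bits are left as zero.
inline std::uint64_t PackScaleBits(float scale) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &scale, sizeof(bits));
    return static_cast<std::uint64_t>(bits);
}
```

On a platform with IEEE 754 single-precision floats, PackScaleBits(1.0f) yields 0x3F800000, which is clearly different from the numeric conversion static_cast<uint64_t>(1.0f), which yields 1.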

Vector quantization or dequantization

           
GlobalTensor<uint64_t> gmQuant; // Global memory tensor that holds the per-channel quantization parameters, shape [n].
...
REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm, &tiling);
mm.SetQuantVector(gmQuant);
mm.SetTensorA(gm_a);
mm.SetTensorB(gm_b);
mm.SetBias(gm_bias);
mm.IterateAll(gm_c);
