Enabling Full Constant Quantization for Matmul Tiling Using High-Level APIs
Case Study
This case shows how enabling full constant quantization for Matmul tiling improves operator performance when the Matmul high-level API is used for matrix multiplication. The Matmul API performs a large amount of scalar computation during initialization and between iterations. Scalar computation during initialization increases the instruction header overhead, and scalar computation between iterations may block the MTE2 pipeline. Replacing the TCubeTiling variable parameter with the MatmulApiStaticTiling parameter when calling the Matmul API moves this scalar computation to the compilation phase, which reduces the runtime scalar overhead and improves operator performance.
- Matmul tiling constant quantization is applicable to the following scenarios:
  - A large amount of scalar computation is performed during Matmul initialization, increasing the instruction header overhead.
  - A large amount of scalar computation is performed between Matmul iterations, blocking the MTE2 pipeline.
- Matmul tiling constant quantization requires some tiling parameters to be determined in the compilation phase. Depending on which parameters are determined, there are two scenarios: full constant quantization and partial constant quantization. To use Matmul tiling constant quantization, either of the following conditions must be met:
  - Full constant quantization: Both the constant singleCore shape (singleCoreM/singleCoreN/singleCoreK) and the constant base shape (basicM/basicN/basicK, also called baseM/baseN/baseK) can be determined.
  - Partial constant quantization: Only the constant base shape (basicM/basicN/basicK, also called baseM/baseN/baseK) can be determined.

  Full constant quantization removes more scalar computation overhead than partial constant quantization.
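The difference between the two scenarios can be sketched in plain C++. This is only an analogy under stated assumptions: `ToyTiling`, `CeilDiv`, `BlocksPerCore`, and `PartialConstBlocks` are hypothetical names that are not part of the Ascend C API, and the shape values are the ones used later in this case.

```cpp
#include <cstdint>

// Hypothetical tiling record for illustration; NOT the Ascend C TCubeTiling layout.
struct ToyTiling {
    int32_t singleCoreM, singleCoreN, singleCoreK;
    int32_t baseM, baseN, baseK;
};

// Ceiling division: number of blocks of size `block` needed to cover `extent`.
constexpr int32_t CeilDiv(int32_t extent, int32_t block) {
    return (extent + block - 1) / block;
}

// Number of base blocks a single core iterates over.
constexpr int32_t BlocksPerCore(const ToyTiling& t) {
    return CeilDiv(t.singleCoreM, t.baseM) *
           CeilDiv(t.singleCoreN, t.baseN) *
           CeilDiv(t.singleCoreK, t.baseK);
}

// Full constant case: the single-core shape and the base shape are both
// compile-time constants, so the block count is computed by the compiler.
constexpr ToyTiling FULL_CONST_TILING{128, 1280, 64, 128, 256, 64};
constexpr int32_t FULL_CONST_BLOCKS = BlocksPerCore(FULL_CONST_TILING);
static_assert(FULL_CONST_BLOCKS == 5, "1 x 5 x 1 base blocks per core");

// Partial constant case: only the base shape is fixed at compile time.
constexpr int32_t PARTIAL_BASE_M = 128;
constexpr int32_t PARTIAL_BASE_N = 256;
constexpr int32_t PARTIAL_BASE_K = 64;

// The single-core shape is an ordinary argument. In a kernel these values
// would arrive at run time, so this arithmetic would stay on the scalar unit.
// The function is constexpr only so that it can also be checked at compile time.
constexpr int32_t PartialConstBlocks(int32_t singleCoreM, int32_t singleCoreN,
                                     int32_t singleCoreK) {
    return CeilDiv(singleCoreM, PARTIAL_BASE_M) *
           CeilDiv(singleCoreN, PARTIAL_BASE_N) *
           CeilDiv(singleCoreK, PARTIAL_BASE_K);
}
```

In the full constant case nothing is left to compute at run time, whereas in the partial constant case the divisions by the constant base shape still have to be evaluated once the single-core shape is known.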
The operator specifications are as follows.
| Input | Shape | Data type | Format |
|---|---|---|---|
| a | 128, 64 | float16 | ND |
| b | 64, 30720 | float16 | ND |
The AI processor used in this case has 24 cores, each of which contains one AIC core and two AIV cores.
The tiling parameters are as follows:
- Original shape: M = 128, N = 30720, K = 64.
- Single-core shape: The input is split across 24 AIC cores, with singleCoreM = 128, singleCoreN = 1280, and singleCoreK = 64.
  Matrix B is split along the N axis into 24 blocks of singleCoreN columns each (30720 / 24 = 1280), so a single core processes K x singleCoreN data from matrix B. The M axis of matrix A is not split, that is, singleCoreM = M, so a single core processes singleCoreM x K data from matrix A. All 24 cores take part in the computation.
- Basic block shape: baseM=128, baseN=256, baseK=64.
- L1-related tiling parameters: stepM=1, stepN=1, stepKa=4, stepKb=4, depthA1=8, depthB1=8.
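The tiling arithmetic above can be checked with a few compile-time constants. This is a minimal sketch in plain C++, not Ascend C code. All constant names are illustrative. The 2-byte element size follows from the float16 inputs, and the output element size is left as a parameter because the case does not state the data type of matrix C.

```cpp
#include <cstddef>
#include <cstdint>

// Values taken from the case study.
constexpr int32_t M = 128;
constexpr int32_t N = 30720;
constexpr int32_t K = 64;
constexpr int32_t AIC_CORES = 24;
constexpr int32_t BASE_M = 128;
constexpr int32_t BASE_N = 256;
constexpr int32_t BASE_K = 64;

// Only the N axis is split across cores; M and K stay whole.
static_assert(N % AIC_CORES == 0, "N must divide evenly across the AIC cores");
constexpr int32_t SINGLE_CORE_M = M;
constexpr int32_t SINGLE_CORE_N = N / AIC_CORES;
constexpr int32_t SINGLE_CORE_K = K;

// Base blocks each core walks through along every axis.
constexpr int32_t M_BLOCKS_PER_CORE = SINGLE_CORE_M / BASE_M;
constexpr int32_t N_BLOCKS_PER_CORE = SINGLE_CORE_N / BASE_N;
constexpr int32_t K_BLOCKS_PER_CORE = SINGLE_CORE_K / BASE_K;

// Footprint of one base block of A and B; float16 elements are 2 bytes.
constexpr std::size_t HALF_BYTES = 2;
constexpr std::size_t BASE_A_BYTES = std::size_t(BASE_M) * BASE_K * HALF_BYTES;
constexpr std::size_t BASE_B_BYTES = std::size_t(BASE_N) * BASE_K * HALF_BYTES;

// Footprint of one base block of C for a given output element size (assumption:
// the caller supplies the element size, since the case does not specify cType).
constexpr std::size_t BaseCBytes(std::size_t elemBytes) {
    return std::size_t(BASE_M) * BASE_N * elemBytes;
}
```

With these values each core handles a single block along M and K and five blocks along N. One base block of A occupies 16 KB and one base block of B occupies 32 KB, which are the quantities that must fit in the L0A and L0B buffers of the target processor.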
Obtaining Profile Data
Use the msProf tool to obtain the operator simulation pipeline and board profiling data. Compared with the basic scenario, constant quantization converts some or all tiling parameters from variables to constant values during compilation. During operator execution, the constant tiling parameters are used directly, which reduces the scalar overhead. The analysis therefore focuses on the scalar pipeline.
Analyzing Main Bottlenecks
- The following figure shows the pipeline diagram before optimization. By default, constant quantization of tiling is disabled, so the tiling parameters must be copied from the host to the kernel. As a result, a large amount of scalar computation is performed during Matmul initialization. The first MTE2 instruction starts at about 3.536 μs, which means that the instruction header overhead before MTE2 accounts for a large proportion of the entire operator pipeline. Therefore, the scalar computation needs to be optimized.
Optimization Solution
As shown in the following figure, when tiling constant quantization is disabled (the default), you create a tiling object on the host and call the API to obtain the tiling parameters automatically. The tiling parameters are then transferred from the host to the kernel during kernel-side initialization. During operator execution, the tiling variable is used to perform matrix multiplication.
As shown in the following figure, when constant quantization is enabled, you only need to call the GetMatmulApiTiling API when creating the Matmul object on the kernel side. The API produces the constant-quantized tiling information during compilation. During operator execution, the constant-quantized tiling parameters are used to perform matrix multiplication, reducing the scalar computation overhead.
For details about how to use the Matmul API to enable full constant quantization of tiling, see the operator sample for Matmul static tiling. To enable full constant quantization of tiling, perform the following steps:
- When the GetMMConfig API is called to obtain the MatmulConfig template, set MatmulShapeParams to a constant value to obtain a customized MatmulConfig template with constant parameters.
```cpp
constexpr int32_t MAX_M = 10000; // custom matmul kernel support max value of M Dim shape
constexpr int32_t MAX_N = 10000; // custom matmul kernel support max value of N Dim shape
constexpr int32_t MAX_K = 10000; // custom matmul kernel support max value of K Dim shape
constexpr int32_t BASE_M = 128;  // BASE_M * BASE_K * sizeof(typeA) <=L0A size
constexpr int32_t BASE_N = 256;  // BASE_N * BASE_K * sizeof(typeB) <=L0B size
constexpr int32_t BASE_K = 64;   // BASE_M * BASE_N * sizeof(typeC) <=L0C size
constexpr MatmulShapeParams shapeParams = { MAX_M, MAX_N, MAX_K, BASE_M, BASE_N, BASE_K };
constexpr MatmulConfig CUSTOM_CFG = GetMMConfig<MatmulConfigMode::CONFIG_MDL>(shapeParams);
```
- Create a Matmul object. Call the GetMatmulApiTiling API to perform constant quantization on the tiling information and obtain the constant quantization template parameter CONSTANT_CFG, which contains the constant-quantized Matmul tiling information and the MatmulConfig template. Pass CONSTANT_CFG as a template parameter when creating the Matmul object.
```cpp
using A_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, aType>;
using B_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, bType>;
using C_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, cType>;
using BIAS_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, biasType>;
constexpr static auto CONSTANT_CFG = AscendC::GetMatmulApiTiling<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE>(CUSTOM_CFG);
AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CONSTANT_CFG> matmulObj;
```
- Perform the initialization operation. During full constant quantization, you can use a null pointer to replace the position of the tiling parameter in the input parameter of the REGIST_MATMUL_OBJ API. During partial constant quantization, tiling is still required when the REGIST_MATMUL_OBJ API is used to initialize the Matmul object on the kernel side.
```cpp
// Example of initialization in the full constant quantization scenario
REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), matmulObj, (TCubeTiling*)nullptr);

// Example of initialization in the partial constant quantization scenario
REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), matmulObj, &tiling);
```
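Why a null tiling pointer is acceptable in the full constant case can be illustrated with a small plain C++ analogy. The names `ToyRuntimeTiling`, `ToyMatmul`, and `SAMPLE_TILING` are hypothetical and do not reflect how the Ascend C Matmul class is implemented. The sketch only assumes that an object whose shape is baked into its type never needs to read the run-time tiling.

```cpp
#include <cstdint>

// Hypothetical run-time tiling record; NOT the Ascend C TCubeTiling layout.
struct ToyRuntimeTiling {
    int32_t singleCoreN;
};

// Hypothetical stand-in for a Matmul object.
// FULL_CONST selects whether the single-core shape is baked into the type.
template <bool FULL_CONST, int32_t CONST_SINGLE_CORE_N, int32_t CONST_BASE_N>
struct ToyMatmul {
    // Returns the number of N-axis base blocks the object will iterate over.
    // Full constant: `tiling` is never dereferenced, so nullptr is fine.
    // Partial constant: the single-core shape is read through `tiling`.
    static constexpr int32_t Init(const ToyRuntimeTiling* tiling) {
        return FULL_CONST ? CONST_SINGLE_CORE_N / CONST_BASE_N
                          : tiling->singleCoreN / CONST_BASE_N;
    }
};

// Sample tiling that a host would normally compute and pass to the kernel.
constexpr ToyRuntimeTiling SAMPLE_TILING{1280};

// Full constant: no tiling object is needed at initialization.
constexpr int32_t FULL_N_BLOCKS = ToyMatmul<true, 1280, 256>::Init(nullptr);

// Partial constant: only the base shape is a template constant, so the
// single-core shape still has to be supplied.
constexpr int32_t PARTIAL_N_BLOCKS = ToyMatmul<false, 0, 256>::Init(&SAMPLE_TILING);
```

Both variants arrive at five N-axis blocks for this case. The difference is that the partial constant variant still depends on a tiling object being passed in.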
Verifying Optimization Benefits
- The following figure shows the optimized pipeline. With full constant quantization of tiling enabled, the tiling parameters no longer need to be copied from the host to the kernel. Instead, tiling constant quantization is completed during compilation, which reduces the scalar computation during Matmul initialization. The Matmul initialization time is the interval from 0 μs to the start of the first MTE2 instruction. It is reduced from 3.536 μs to 2.185 μs.
- The following figure shows the optimized profiling data. According to the aic_time data in column C, the maximum operator execution time among multiple cores is 7.87 μs, which is 25.9% shorter than the 10.62 μs before the optimization. According to the aic_scalar_time data in column G, the average scalar execution time is 3.38 μs, which is 46.5% shorter than the 6.32 μs before the optimization.
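The percentages quoted above follow from the usual relative-reduction formula, (before - after) / before. The helper below is a small illustrative sketch in plain C++ for reproducing them. `ReductionPercent` is a name introduced here and is not part of any profiling tool.

```cpp
// Relative reduction of a timing, in percent of the original value.
constexpr double ReductionPercent(double before, double after) {
    return (before - after) / before * 100.0;
}

// Timings in microseconds, taken from the figures quoted in this section.
constexpr double AIC_TIME_REDUCTION = ReductionPercent(10.62, 7.87);    // about 25.9
constexpr double SCALAR_TIME_REDUCTION = ReductionPercent(6.32, 3.38);  // about 46.5
constexpr double INIT_TIME_REDUCTION = ReductionPercent(3.536, 2.185);  // about 38.2
```

Applying the same formula to the initialization time (3.536 μs before, 2.185 μs after) gives a reduction of roughly 38.2%.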
Summary
When an operator calls the Matmul API to complete matrix multiplication, if there are a large number of scalar computations during Matmul initialization, the instruction header overhead is affected. Alternatively, if there are a large number of scalar computations between Matmul iterations, the MTE2 pipeline is blocked. In these two scenarios, if the conditions for enabling tiling constant quantization (full constant quantization or partial constant quantization) are met, you can enable tiling constant quantization to reduce the scalar computation overhead and improve the operator performance.



