Matmul High-level API Enabling UnitFlag

Case Study

This case study shows the performance gain from enabling UnitFlag in a matrix multiplication operator built on the high-level Matmul API. UnitFlag provides fine-grained, memory-access-based synchronization between the MMAD computation instructions and the FIXPIPE data transfer instructions in an AIC, so that computation and transfer can run in parallel. To enable UnitFlag, set the enUnitFlag parameter in MatmulConfig to true. For details about the enUnitFlag parameter, see MatmulConfig.

Application scenarios of enabling UnitFlag
UnitFlag is useful when the MMAD and FIXPIPE pipelines of the operator run serially: FIXPIPE moves the result out only after the MMAD computation is complete, and the instruction synchronization time accounts for a large proportion of the overall operator execution time. In this scenario, enabling UnitFlag lets the MMAD and FIXPIPE pipelines run in parallel. If the time spent in the MMAD and FIXPIPE pipelines is already hidden by another pipeline (for example, when the operator is MTE2 bound), enabling UnitFlag brings only a small overall benefit.

Constraints on enabling UnitFlag
- UnitFlag supports only the Norm, IBShare, and MDL templates.
- When UnitFlag is enabled, the operator cannot simultaneously have a pipeline that moves data from CO1 (L0C) to the global memory and a pipeline that moves data from A1 (L1) to the global memory.
- When both UnitFlag and L0C accumulation are enabled, the pattern of multiple Iterate computations followed by a single GetTensorC output is not supported.

The operator specifications are as follows.

**Table 1** Operator specifications
Input	Shape	Data Type	Format
a	128, 64	float16	ND
b	64, 30720	float16	ND

The AI Processor used in this case has 20 cores, each of which contains one AIC and two AIVs.

The tiling parameters of the operator are as follows:

Original shape: M = 128, N = 30720, K = 64.
Single-core shape: The tiling is performed based on 20 AICs. singleCoreM = 128, singleCoreN = 1536, and singleCoreK = 64.
For matrix B, the tiling is performed along the N axis, resulting in 20 single-core tiles of width singleCoreN. A single core processes K x singleCoreN data. For matrix A, the M axis is not tiled, that is, singleCoreM = M. A single core processes singleCoreM x K data. A total of 20 cores are involved in the calculation.
Basic block shape: baseM = 128, baseN = 256, and baseK = 64.
L1-related tiling parameters: stepM = 1, stepN = 1, stepKa = 4, stepKb = 4, depthA1 = 8, and depthB1 = 8.
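The single-core shape above follows from simple arithmetic on the original shape. The sketch below reproduces it in plain host-side C++, assuming the N axis divides evenly across cores as it does in this case. TilingSketch, CeilDiv, and SplitAlongN are hypothetical names used only for illustration; they are not part of the Matmul API or its tiling interface.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for this sketch; not part of the Matmul API.
struct TilingSketch {
    uint32_t singleCoreM;
    uint32_t singleCoreN;
    uint32_t singleCoreK;
    uint32_t baseBlocksPerCore;  // number of baseM x baseN output blocks per core
};

constexpr uint32_t CeilDiv(uint32_t a, uint32_t b)
{
    return (a + b - 1) / b;
}

// Splits only the N axis evenly across cores; M and K stay whole on each core.
inline TilingSketch SplitAlongN(uint32_t m, uint32_t n, uint32_t k,
                                uint32_t coreNum, uint32_t baseM, uint32_t baseN)
{
    assert(coreNum > 0 && n % coreNum == 0);  // this sketch assumes an even split
    TilingSketch t{};
    t.singleCoreM = m;
    t.singleCoreN = n / coreNum;
    t.singleCoreK = k;
    t.baseBlocksPerCore = CeilDiv(t.singleCoreM, baseM) * CeilDiv(t.singleCoreN, baseN);
    return t;
}
```

Calling SplitAlongN(128, 30720, 64, 20, 128, 256) yields singleCoreN = 1536 and six 128 x 256 output blocks per core, so under this split each core runs six rounds of MMAD followed by FIXPIPE.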

Obtaining Profile Data

Use the msProf tool to obtain the Operator Simulation Pipeline and On-board Profiling data. UnitFlag is mainly used to optimize the serial execution of MMAD and FIXPIPE pipelines. Therefore, after obtaining the profile data, the focus is on analyzing Cube and FIXPIPE pipelines.

Analyzing Main Bottlenecks

The pipeline before optimization is as follows. As shown in the red box in the following figure, the MMAD computation pipeline and the FIXPIPE data transfer-out pipeline are executed in serial mode in each round. FIXPIPE data transfer-out starts only after the MMAD computation is complete. To optimize the operator performance, consider implementing parallel execution of the MMAD and FIXPIPE pipelines.
The following figure shows the profiling data before optimization. According to the aic_time data in column C, the maximum operator execution time among multiple cores is 37.39 μs.

Optimization Solution

As shown in the following figure, when UnitFlag is disabled, MMAD and FIXPIPE are synchronized at the instruction level. The FIXPIPE instruction transfers the result only after the MMAD instruction is executed. MMAD and FIXPIPE pipelines are executed in serial mode.

Figure 1 UnitFlag disabled

As shown in the following figure, when UnitFlag is enabled, MMAD and FIXPIPE instructions are synchronized at a fine granularity of 512 bytes. While an MMAD instruction is executing, FIXPIPE starts to transfer each 512-byte portion of the result as soon as that portion has been computed. In this way, the MMAD and FIXPIPE pipelines are executed in parallel, improving operator performance.

Figure 2 UnitFlag enabled
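The difference between the two synchronization modes can be illustrated with a toy two-stage pipeline model. This is only a sketch in plain host-side C++: the per-chunk costs are hypothetical time units, the model assumes uniform chunk costs, and it ignores everything else the hardware does. ChunkCount, SerialTime, and OverlappedTime are illustrative names, not Ascend C APIs.

```cpp
#include <cassert>
#include <cstdint>

// Toy two-stage pipeline model. All costs are hypothetical time units per
// chunk, not measured hardware figures.

// Number of synchronization chunks needed to cover resultBytes.
constexpr uint64_t ChunkCount(uint64_t resultBytes, uint64_t chunkBytes)
{
    return (resultBytes + chunkBytes - 1) / chunkBytes;
}

// Instruction-level sync: the transfer starts only after every chunk is computed.
constexpr uint64_t SerialTime(uint64_t chunks, uint64_t computePerChunk,
                              uint64_t movePerChunk)
{
    return chunks * computePerChunk + chunks * movePerChunk;
}

// Chunk-level sync: each chunk is moved out as soon as it has been computed,
// so after the first chunk the slower stage sets the pace.
inline uint64_t OverlappedTime(uint64_t chunks, uint64_t computePerChunk,
                               uint64_t movePerChunk)
{
    assert(chunks > 0);
    const uint64_t slower =
        computePerChunk > movePerChunk ? computePerChunk : movePerChunk;
    return computePerChunk + (chunks - 1) * slower + movePerChunk;
}
```

For example, a 128 x 256 block of 4-byte results (the 4-byte width is an assumption about the accumulator format) occupies 131072 bytes, or 256 chunks of 512 bytes. With hypothetical costs of 4 units to compute and 1 unit to move each chunk, the serial model gives 1280 units and the overlapped model 1025 units. The actual gain depends on the real MMAD and FIXPIPE timings.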

For a complete example of enabling UnitFlag with the Matmul API, see the Matmul API performance optimization sample. The procedure for enabling UnitFlag is as follows:

Customize MatmulConfig template parameters and set enUnitFlag to true to enable UnitFlag.

        
// Start from the predefined MDL template and turn on UnitFlag.
__aicore__ inline constexpr MatmulConfig GetCustomMDLCFG()
{
    auto mmCfg = CFG_MDL;
    mmCfg.enUnitFlag = true;
    return mmCfg;
}
constexpr static MatmulConfig CUSTOM_CFG_MDL = GetCustomMDLCFG();

Create a Matmul object based on the customized MatmulConfig template parameters.

        
// All operands reside in global memory in ND format.
using A_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, AType>;
using B_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BType>;
using C_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, CType>;
using BIAS_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BiasType>;
// Pass the customized config as the last template argument.
AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CUSTOM_CFG_MDL> matmulObj;

Verifying Optimization Benefits

The following figure shows the optimized pipeline. The MMAD computation pipeline and the FIXPIPE data transfer-out pipeline are now executed in parallel.
The following table shows the profiling data after optimization. According to the aic_time data in column C, the maximum operator execution time among multiple cores is 34.66 μs, about 7.3% less than the 37.39 μs measured before optimization.
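The improvement figure is the relative reduction in the maximum aic_time. A minimal sketch of that arithmetic in plain C++ follows; TimeReductionPercent is a hypothetical helper name, not an msProf or Ascend C function.

```cpp
#include <cassert>

// Relative reduction in execution time, as a percentage of the baseline.
inline double TimeReductionPercent(double beforeUs, double afterUs)
{
    assert(beforeUs > 0.0);
    return (beforeUs - afterUs) / beforeUs * 100.0;
}
```

With the measurements from this case, TimeReductionPercent(37.39, 34.66) evaluates to approximately 7.3.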

Summary

When the MMAD computation pipeline and the FIXPIPE data transfer-out pipeline of an operator are executed in serial mode and their time is not hidden by other pipelines (for example, in an MTE2-bound operator), consider enabling UnitFlag so that the two pipelines run in parallel, improving operator performance.
