Enabling UnitFlag Using the Matmul High-Level API

Case Study

This case shows how to implement a matrix multiplication operator with the Matmul high-level API and how to enable the UnitFlag function to improve operator performance. UnitFlag provides fine-grained, memory-access-based synchronization between the MMAD computation instructions and the FIXPIPE data movement instructions in the AI Core, so that computation and data movement can run in parallel. To enable UnitFlag, set the enUnitFlag parameter in MatmulConfig to true. For details about the enUnitFlag parameter, see MatmulConfig.

  • Application scenarios of enabling UnitFlag

    Enable UnitFlag when the MMAD and FIXPIPE pipelines of the operator run serially, that is, FIXPIPE moves the result out only after the MMAD computation is complete, and the instruction synchronization wait time accounts for a large proportion of the overall operator execution time. In this scenario, UnitFlag lets the MMAD and FIXPIPE pipelines run in parallel. If the MMAD and FIXPIPE pipelines are already masked by another pipeline (for example, the operator is MTE2 bound), enabling UnitFlag brings only a small overall benefit.

  • Restrictions on enabling UnitFlag
    • The UnitFlag function supports only the Norm, IBShare, and MDL templates.
    • When the UnitFlag function is enabled, the operator cannot have two pipelines that move data from CO1 (L0C) to the global memory and from A1 (L1) to the global memory at the same time.
    • When the UnitFlag function is enabled together with the L0C accumulation function, the pattern of multiple Iterate computations followed by a single GetTensorC output is not supported.

The operator specifications are as follows.

Table 1 Operator specifications

Input    Shape        Data type    Format
a        128, 64      float16      ND
b        64, 30720    float16      ND

The AI Processor used in this case has 20 cores, each of which contains one AI Cube core (AIC) and two AI Vector cores (AIV).

The tiling parameters of the operator are as follows:

  • Original shape: M = 128, N = 30720, K = 64.
  • Single-core shape: The operator is tiled across 20 AICs, with singleCoreM = 128, singleCoreN = 1536, and singleCoreK = 64.

    Matrix B is split along the N axis into 20 parts of singleCoreN columns each, so every core processes K x singleCoreN of matrix B. The M axis of matrix A is not split, that is, singleCoreM = M, so every core processes singleCoreM x K of matrix A. A total of 20 cores take part in the computation.

  • Basic block shape: baseM=128, baseN=256, baseK=64.
  • L1-related tiling parameters: stepM=1, stepN=1, stepKa=4, stepKb=4, depthA1=8, depthB1=8.
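The tiling values above can be cross-checked with a few lines of compile-time arithmetic. The following is a minimal sketch in plain C++ that does not use the Ascend C headers; the helper name CeilDiv and the derived names blocksM, blocksN, blocksK, and baseBlocksPerCore are hypothetical and introduced only for illustration. It assumes that each baseM x baseN output block corresponds to one round of MMAD computation followed by one FIXPIPE move-out.

```cpp
#include <cstdint>

// Shapes and tiling values taken from the case description.
constexpr uint32_t M = 128, N = 30720, K = 64;
constexpr uint32_t coreNum = 20;
constexpr uint32_t baseM = 128, baseN = 256, baseK = 64;

// Ceiling division helper (hypothetical name, not an Ascend C API).
constexpr uint32_t CeilDiv(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

// Only the N axis is split across cores; M and K stay whole.
constexpr uint32_t singleCoreM = M;
constexpr uint32_t singleCoreN = N / coreNum;
constexpr uint32_t singleCoreK = K;

// Number of base blocks along each axis on one core.
constexpr uint32_t blocksM = CeilDiv(singleCoreM, baseM);
constexpr uint32_t blocksN = CeilDiv(singleCoreN, baseN);
constexpr uint32_t blocksK = CeilDiv(singleCoreK, baseK);

// Output base blocks produced by one core (assumed: one MMAD/FIXPIPE round each).
constexpr uint32_t baseBlocksPerCore = blocksM * blocksN;

static_assert(N % coreNum == 0, "N must split evenly across the cores");
static_assert(singleCoreN == 1536, "matches the single-core shape in the text");
static_assert(blocksK == 1, "K fits in a single baseK step");
static_assert(baseBlocksPerCore == 6, "six output base blocks per core");
```

Under these assumptions, each core produces six 128 x 256 output blocks, so the MMAD and FIXPIPE synchronization cost is paid once per block on every core.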

Obtaining Profile Data

Use the msProf tool to obtain the simulated pipeline chart and the on-board profiling data of the operator. Because UnitFlag targets the serialization of the MMAD and FIXPIPE pipelines, focus the analysis on the Cube and FIXPIPE pipelines after obtaining the performance data.

Analyzing Main Bottlenecks

  • The following figure shows the pipeline chart before the optimization. As shown in the red box, in each round the MMAD computation pipeline and the FIXPIPE data move-out pipeline run serially: FIXPIPE starts moving data out only after the MMAD computation is complete. Operator performance can therefore be improved by running MMAD and FIXPIPE in parallel.

  • The following figure shows the profiling data before the optimization. According to the aic_time data in column C, the maximum operator execution time among the cores is 37.39 μs.

Optimization Solution

As shown in the following figure, when the UnitFlag function is disabled, MMAD and FIXPIPE are synchronized at the instruction level. The FIXPIPE instruction starts moving data out only after the MMAD instruction has finished, so MMAD and FIXPIPE run serially.

Figure 1 The UnitFlag function is not enabled.

As shown in the following figure, when the UnitFlag function is enabled, the MMAD and FIXPIPE instructions are synchronized at a fine granularity of 512 bytes. During the execution of an MMAD instruction, each time the calculation of a 512-byte data result is complete, FIXPIPE immediately starts to move the 512-byte data out. In this way, pipeline parallelism between MMAD and FIXPIPE is implemented, improving operator performance.

Figure 2 The UnitFlag function is enabled.
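The difference between instruction-level and 512-byte synchronization can be illustrated with a toy timing model. The following sketch is plain C++ and is not based on Ascend C APIs or on measured hardware costs. The function names SerialTime and OverlappedTime, and the per-chunk costs tMmad and tFix, are hypothetical. The model assumes that MMAD produces one 512-byte chunk every tMmad time units and that FIXPIPE needs tFix time units to move one chunk out.

```cpp
#include <algorithm>
#include <cstdint>

// Instruction-level synchronization: FIXPIPE starts only after MMAD has
// produced every chunk, so the two phases simply add up.
constexpr uint64_t SerialTime(uint64_t chunks, uint64_t tMmad, uint64_t tFix) {
    return chunks * tMmad + chunks * tFix;
}

// Chunk-level synchronization: FIXPIPE may move chunk i as soon as that chunk
// has been produced and the previous chunk has been moved out.
inline uint64_t OverlappedTime(uint64_t chunks, uint64_t tMmad, uint64_t tFix) {
    uint64_t fixDone = 0;  // time at which FIXPIPE finished the previous chunk
    for (uint64_t i = 0; i < chunks; ++i) {
        const uint64_t produced = (i + 1) * tMmad;  // MMAD finishes chunk i
        fixDone = std::max(fixDone, produced) + tFix;
    }
    return fixDone;
}
```

For example, with 256 chunks, tMmad = 2, and tFix = 1 (illustrative values only), the serial model gives 768 time units and the overlapped model gives 513, because only the move-out of the final chunk is left exposed. A 128 x 256 base block of 4-byte results would span 256 chunks of 512 bytes, but the 4-byte result size is an assumption about the L0C data type rather than something stated in this case.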

For details about how to enable the UnitFlag function using the Matmul API, see Matmul API performance optimization sample. To enable the UnitFlag function, perform the following steps:

  1. Customize MatmulConfig template parameters and set enUnitFlag to true to enable the UnitFlag function.
    __aicore__ inline constexpr MatmulConfig GetCustomMDLCFG()
    {
        auto mmCfg = CFG_MDL;
        mmCfg.enUnitFlag = true;
        return mmCfg;
    }
    constexpr static MatmulConfig CUSTOM_CFG_MDL = GetCustomMDLCFG();
    
  2. Create a Matmul object based on the customized MatmulConfig template parameters.
    using A_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, AType>;
    using B_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BType>;
    using C_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, CType>;
    using BIAS_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BiasType>;
    AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CUSTOM_CFG_MDL> matmulObj;
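The two steps rely on the configuration being a compile-time constant that is passed to the Matmul template. The following sketch reproduces that pattern in plain C++ so that it can be compiled without the Ascend C toolkit. MockMatmulConfig, MOCK_CFG_MDL, GetCustomMockCfg, CUSTOM_MOCK_CFG, and MockMatmul are hypothetical stand-ins invented for this illustration; they model only the enUnitFlag field and do not describe how the real MatmulConfig or Matmul class is implemented.

```cpp
// Hypothetical stand-in for MatmulConfig; only enUnitFlag is modeled.
struct MockMatmulConfig {
    bool enUnitFlag;
};

// Hypothetical stand-in for a predefined template config such as CFG_MDL.
constexpr MockMatmulConfig MOCK_CFG_MDL{false};

// Copy the base config and switch the flag on, all at compile time.
constexpr MockMatmulConfig GetCustomMockCfg() {
    MockMatmulConfig cfg = MOCK_CFG_MDL;
    cfg.enUnitFlag = true;
    return cfg;
}
constexpr static MockMatmulConfig CUSTOM_MOCK_CFG = GetCustomMockCfg();

// Hypothetical stand-in for the Matmul class template: the config is a
// template argument, so the flag is known when the class is instantiated.
template <const MockMatmulConfig& CFG>
struct MockMatmul {
    static constexpr bool UsesUnitFlag() { return CFG.enUnitFlag; }
};

static_assert(!MockMatmul<MOCK_CFG_MDL>::UsesUnitFlag(), "base config leaves the flag off");
static_assert(MockMatmul<CUSTOM_MOCK_CFG>::UsesUnitFlag(), "custom config turns the flag on");
```

Because the flag is fixed at compile time in this pattern, the choice between the two synchronization modes costs nothing at run time; whether the real library resolves the flag in the same way is not stated in this case.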
    

Verifying Optimization Benefits

  • The following figure shows the pipeline chart after the optimization. The MMAD computation pipeline and the FIXPIPE data move-out pipeline now run in parallel.

  • The following figure shows the profiling data after the optimization. According to the aic_time data in column C, the maximum operator execution time among the cores is 34.66 μs, about 7.3% shorter than the 37.39 μs measured before the optimization.
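The 7.3% figure follows directly from the two aic_time values. A minimal sketch of the arithmetic in plain C++ is given below; the function names RelativeReduction and Speedup are hypothetical helpers, not part of msProf or Ascend C.

```cpp
// Fraction of the original execution time that was saved.
constexpr double RelativeReduction(double before, double after) {
    return (before - after) / before;
}

// Ratio of the original execution time to the optimized execution time.
constexpr double Speedup(double before, double after) {
    return before / after;
}

// aic_time maxima from this case, in microseconds.
constexpr double kBeforeUs = 37.39;
constexpr double kAfterUs = 34.66;

static_assert(RelativeReduction(kBeforeUs, kAfterUs) > 0.072 &&
              RelativeReduction(kBeforeUs, kAfterUs) < 0.074,
              "execution time drops by roughly 7.3%");
```

The same numbers expressed as a speedup ratio give a value slightly below 1.08.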

Summary

If the MMAD computation pipeline and the FIXPIPE data move-out pipeline run serially and are not masked by another pipeline (for example, the operator is not MTE2 bound), enable the UnitFlag function so that the two pipelines run in parallel and operator performance improves.