Enabling the Pure Cube Mode Using the Matmul High-Level API
Case Study
This case study shows the performance gain from running the Matmul high-level API in pure Cube mode for a matrix multiplication operator. As shown in the following figure, the Matmul API uses MIX mode by default: the user issues a message from the AIV, the message communication framework forwards it, and the Matmul computation is then performed on the AIC. This message handling adds scalar overhead. Pure Cube mode bypasses the message communication framework and performs the Matmul computation directly, which improves operator performance.
- Application scenarios of the pure Cube mode
Non-fused operators that perform only matrix computation. Unlike MIX mode, which combines matrix computation and vector computation, these operators involve no vector computation. The operator specifications are as follows.
| Input | Shape | Data type | Format |
|---|---|---|---|
| a | 128, 64 | float16 | ND |
| b | 64, 30720 | float16 | ND |
The AI processor used in this case has 24 cores, each of which contains one AIC core and two AIV cores.
The tiling parameters are as follows:
- Original shape: M = 128, N = 30720, K = 64.
- Single-core shape:
  - MIX scenario: The computation is tiled based on 48 AIV cores. singleCoreM = 128, singleCoreN = 640, and singleCoreK = 64.
  - Cube-only scenario: The computation is tiled based on 24 AIC cores. singleCoreM = 128, singleCoreN = 1280, and singleCoreK = 64.
- Basic block shape: baseM = 128, baseN = 256, and baseK = 64.
- L1-related tiling parameters: stepM = 1, stepN = 1, stepKa = 4, stepKb = 4, depthA1 = 8, and depthB1 = 8.
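The single-core shapes above follow from splitting N evenly across the cores used for tiling. The sketch below reproduces that arithmetic in plain C++. The helper names (`CeilDiv`, `SingleCoreN`, `BaseBlocksAlongN`) are hypothetical and are not part of the Matmul API, and counting a partial tail block as one full block is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an Ascend C API): ceiling division for positive integers.
inline int32_t CeilDiv(int32_t value, int32_t divisor)
{
    assert(value >= 0 && divisor > 0);
    return (value + divisor - 1) / divisor;
}

// Hypothetical helper: N extent handled by one core when N is split evenly
// across coreNum cores (M and K are not split in this case study).
inline int32_t SingleCoreN(int32_t n, int32_t coreNum)
{
    return CeilDiv(n, coreNum);
}

// Hypothetical helper: number of baseN-wide basic blocks needed to cover one
// core's N extent. Assumption: a partial tail block counts as one block.
inline int32_t BaseBlocksAlongN(int32_t singleCoreN, int32_t baseN)
{
    return CeilDiv(singleCoreN, baseN);
}
```

With the values from this case, `SingleCoreN(30720, 48)` gives 640 and `SingleCoreN(30720, 24)` gives 1280, matching the single-core shapes listed above. `BaseBlocksAlongN(640, 256)` gives 3, where the last block is only half full, while `BaseBlocksAlongN(1280, 256)` gives 5 with no tail.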
Obtaining Profile Data
Use the msProf tool to obtain the operator simulation pipeline diagram and on-board profiling data. The pure Cube mode is mainly used to optimize the scalar pipeline performance. Therefore, you can focus on analyzing the scalar pipeline.
Analyzing Main Bottlenecks
- The following figure shows the profiling data before the optimization. According to the aic_time data in column C, the maximum operator execution time across the cores is 17.85 μs. According to the aic_scalar_time data in column G, the average scalar execution time is 15.02 μs, so the performance bottleneck lies in the scalar pipeline.
- The following figure shows the pipeline before the optimization. In the default MIX mode, each Matmul computation goes through the message communication framework to process messages. As a result, the scalar pipeline is heavily loaded and the overhead is large, as shown in the red box in the following figure.
Optimization Solution
In the default MIX mode, the user issues a message on the AIV side. The message communication framework forwards the message, and the Matmul computation is then performed on the AIC side. Because of this flow, you can use the REGIST_MATMUL_OBJ macro when writing operator code with the Matmul high-level API, without distinguishing between the AIV and the AIC. However, this message handling adds performance overhead, as shown in Figure 1, Matmul process in the default MIX mode.
The procedure for implementing the default MIX mode is as follows:
- On the kernel side, define the Matmul object.
    #include "lib/matmul_intf.h"

    using A_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, AType>;
    using B_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BType>;
    using C_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, CType>;
    using BIAS_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BiasType>;
    AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CFG_NORM> matmulObj;
- On the host side, the Matmul multi-core tiling object calls the SetDim API to set the number of cores involved in the computation.
    auto ascendcPlatform = platform_ascendc::PlatformAscendCManager::GetInstance();
    matmul_tiling::MultiCoreMatmulTiling cubeTiling(*ascendcPlatform);
    // In MIX mode, use GetCoreNumAiv to obtain the number of available AIV cores on the AI processor.
    int32_t blockDim = ascendcPlatform->GetCoreNumAiv();
    cubeTiling.SetDim(blockDim);
- Call the kernel function and set its blockDim parameter. For details, see Kernel Function Definition and Calling.
    // In MIX mode, cores are started as AIV and AIC combinations. blockDim sets the number of AI Cores to start.
    matmul_custom_do(ascendcPlatform->GetCoreNumAic(), stream, x1, x2, bias, y, workspaceDevice, tilingDevice);
In the operator scenario without vector computation, you can skip the message communication framework mechanism and enable the pure Cube mode to complete Matmul computation, reducing the performance overhead of message communication and improving the operator performance.
For a complete example of enabling the pure Cube mode with the Matmul API, see Matmul API performance optimization sample. The procedure for enabling the pure Cube mode is as follows:
- On the kernel side, set the ASCENDC_CUBE_ONLY macro before including the matmul_intf.h header file in the code for defining the Matmul object.
    #define ASCENDC_CUBE_ONLY // Set the ASCENDC_CUBE_ONLY macro before #include "lib/matmul_intf.h".
    #include "lib/matmul_intf.h"

    using A_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, AType>;
    using B_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BType>;
    using C_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, CType>;
    using BIAS_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BiasType>;
    AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CFG_NORM> matmulObj;
- On the host side, the Matmul multi-core tiling object calls the SetDim API to set the number of cores involved in the computation.
    auto ascendcPlatform = platform_ascendc::PlatformAscendCManager::GetInstance();
    matmul_tiling::MultiCoreMatmulTiling cubeTiling(*ascendcPlatform);
    // In Cube-only mode, use GetCoreNumAic to obtain the number of available AIC cores on the AI processor.
    int32_t blockDim = ascendcPlatform->GetCoreNumAic();
    cubeTiling.SetDim(blockDim);
- Call the kernel function and set its blockDim parameter. For details, see Kernel Function Definition and Calling.
    // The operator contains only Cube computation. blockDim sets the number of AICs to start.
    matmul_custom_do(ascendcPlatform->GetCoreNumAic(), stream, x1, x2, bias, y, workspaceDevice, tilingDevice);
- On the kernel side, add an early-return branch for the AIV to the kernel function implementation.
    extern "C" __global__ __aicore__ void matmul_custom(GM_ADDR a, GM_ADDR b, GM_ADDR bias, GM_ADDR c, GM_ADDR workspace, GM_ADDR tilingGm)
    {
        if (g_coreType == AscendC::AIV) { // In Cube-only mode, the AIV returns immediately.
            return;
        }
        ... // Other code
    }
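The core-count choices in the two procedures above can be summarized in a small host-side model. This is a sketch in plain C++ for illustration only: `MatmulMode`, `CoreCounts`, `TilingDim`, and `LaunchBlockDim` are hypothetical names that do not exist in Ascend C, and the functions simply restate which value each step passes to SetDim and to blockDim.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model types for illustration; none of these names come from Ascend C.
enum class MatmulMode { Mix, CubeOnly };

struct CoreCounts {
    int32_t aic; // number of AIC cores on the processor
    int32_t aiv; // number of AIV cores on the processor
};

// Core count given to the tiling object's SetDim in each mode:
// MIX tiles across AIV cores, Cube-only tiles across AIC cores.
inline int32_t TilingDim(MatmulMode mode, const CoreCounts& cores)
{
    return mode == MatmulMode::Mix ? cores.aiv : cores.aic;
}

// blockDim passed at kernel launch: both procedures above use the AIC count.
inline int32_t LaunchBlockDim(const CoreCounts& cores)
{
    assert(cores.aic > 0);
    return cores.aic;
}
```

For the processor in this case (24 AIC cores and 48 AIV cores), `TilingDim` yields 48 in MIX mode and 24 in Cube-only mode, while `LaunchBlockDim` yields 24 in both. Only the tiling granularity changes between the modes, not the launch size.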
Verifying Optimization Benefits
- The following figure shows the profiling data after the optimization. According to the aic_time data in column C, the maximum operator execution time across the cores is 11.21 μs, down from 17.85 μs before the optimization (a reduction of about 37%). According to the aic_scalar_time data in column G, the average scalar execution time drops from 15.02 μs to 5.17 μs (a reduction of about 66%).
- The following figure shows the pipeline after the optimization. Compared with the pipeline before the optimization, the scalar pipeline in the red box is noticeably sparser. Compared with the MIX mode, the pure Cube mode removes the message communication processing and lowers the overall scalar overhead.
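The gains reported above can be checked with a short calculation. The sketch below is plain C++, and `ReductionPercent` is a hypothetical helper introduced only for this illustration.

```cpp
#include <cassert>

// Hypothetical helper: relative reduction, in percent, from a "before" time
// to an "after" time. Both arguments use the same unit (here, microseconds).
inline double ReductionPercent(double beforeUs, double afterUs)
{
    assert(beforeUs > 0.0);
    return (beforeUs - afterUs) / beforeUs * 100.0;
}
```

Applied to the measurements in this case, `ReductionPercent(17.85, 11.21)` is roughly 37.2 for aic_time and `ReductionPercent(15.02, 5.17)` is roughly 65.6 for aic_scalar_time, so most of the saving comes from the scalar pipeline.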
Summary
In scenarios that involve only matrix computation and no vector computation, you can enable the pure Cube mode to remove the message communication overhead from Matmul computation and improve operator performance.




