Enabling CUBE_ONLY for the Matmul High-Level API

Case Study

This case shows the performance gain from enabling CUBE_ONLY in the high-level Matmul API for a matrix multiplication operator. As shown in the following figure, the Matmul API uses MIX mode by default: the user initiates a message on the AIV, the message communication framework forwards it, and the Matmul computation is then performed on the AIC. This message processing mechanism adds Scalar performance overhead. In CUBE_ONLY mode, the Matmul computation bypasses the message communication framework entirely, which improves operator performance.

Figure 1 Matmul process in MIX mode by default

Application scenarios for enabling CUBE_ONLY
CUBE_ONLY applies to non-fused operators that involve only Cube computation. Unlike MIX mode, which includes both Cube and Vector computation, this scenario has no Vector computation. The operator specifications for this case are as follows.

Table 1 Operator case specifications
Input	Shape	Data type	Format
a	128, 64	float16	ND
b	64, 30720	float16	ND

On the AI processor used in the current case, there are 24 cores in total, and each core contains one AIC and two AIVs.

The tiling parameters are as follows:

Original shape: M = 128, N = 30720, K = 64.
Single-core shape:
- MIX scenario: The tiling is performed based on 48 AIVs. singleCoreM = 128, singleCoreN = 640, singleCoreK = 64.
- CUBE_ONLY scenario: The tiling is performed based on 24 AICs. singleCoreM = 128, singleCoreN = 1280, singleCoreK = 64.
Basic block shape: baseM = 128, baseN = 256, and baseK = 64.
L1-related tiling parameters: stepM = 1, stepN = 1, stepKa = 4, stepKb = 4, depthA1 = 8, and depthB1 = 8.
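
The single-core shapes above are consistent with splitting N evenly across the cores that take part in tiling. The following is a minimal C++ sketch of that arithmetic only. It assumes an even split along N, and the helper names (CeilDiv, SingleCoreN, BaseBlocksAlongN) are hypothetical: they are not part of the Matmul tiling API, which derives these values itself.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; not part of the Ascend C tiling API.

// Ceiling division for positive integers.
constexpr int32_t CeilDiv(int32_t value, int32_t divisor)
{
    return (value + divisor - 1) / divisor;
}

// Columns of N handled by one core, assuming N is split evenly across cores.
constexpr int32_t SingleCoreN(int32_t n, int32_t coreNum)
{
    return CeilDiv(n, coreNum);
}

// Number of baseN-wide basic blocks needed to cover one core's share of N.
constexpr int32_t BaseBlocksAlongN(int32_t singleCoreN, int32_t baseN)
{
    return CeilDiv(singleCoreN, baseN);
}

constexpr int32_t kN = 30720;             // N of the original shape
constexpr int32_t kBaseN = 256;           // baseN of the basic block
constexpr int32_t kAicNum = 24;           // 24 cores, one AIC each
constexpr int32_t kAivNum = 2 * kAicNum;  // two AIVs per core

static_assert(SingleCoreN(kN, kAivNum) == 640, "MIX: tiling over 48 AIVs");
static_assert(SingleCoreN(kN, kAicNum) == 1280, "CUBE_ONLY: tiling over 24 AICs");
static_assert(BaseBlocksAlongN(640, kBaseN) == 3, "MIX: 3 blocks, the last one partial");
static_assert(BaseBlocksAlongN(1280, kBaseN) == 5, "CUBE_ONLY: 5 full blocks");
```

Under this assumption, halving the number of tiling units from 48 to 24 doubles singleCoreN from 640 to 1280, while singleCoreM and singleCoreK stay the same.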

Obtaining Profile Data

Use the msProf tool to obtain the operator simulation pipeline and on-board profiling data. Because CUBE_ONLY mode mainly optimizes Scalar pipeline performance, focus on analyzing the Scalar pipeline.

Analyzing Main Bottlenecks

The following figure shows the profiling data before optimization. According to the aic_time data in column C, the maximum operator execution time among multiple cores is 17.85 μs. According to the aic_scalar_time data in column G, the average Scalar execution time is 15.02 μs, and the performance bottleneck lies in the Scalar pipeline.
The following figure shows the pipeline before optimization. In the default MIX mode, every Matmul computation goes through the message communication framework to process messages. As a result, the Scalar pipeline is heavily loaded and incurs high performance overhead, as shown in the red box in the following figure.

Optimization Solution

In the default MIX mode, the user initiates a message on the AIV side. The message communication framework forwards the message, and the Matmul computation is then performed on the AIC side. Because of this process, when you write operator code with the high-level Matmul API, you can use the REGIST_MATMUL_OBJ macro without distinguishing between the AIV and the AIC. However, this message processing mechanism adds performance overhead, as shown in Figure 1 Matmul process in MIX mode by default.

The procedure for implementing the default MIX mode is as follows:

On the kernel side, define the Matmul object.

#include "lib/matmul_intf.h"

using A_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, AType>;
using B_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BType>;
using C_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, CType>;
using BIAS_TYPE =  AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BiasType>;
AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CFG_NORM> matmulObj;

On the host side, the Matmul multi-core tiling object calls the SetDim API to set the number of cores involved in the computation.

auto ascendcPlatform = platform_ascendc::PlatformAscendCManager::GetInstance();
matmul_tiling::MultiCoreMatmulTiling cubeTiling(*ascendcPlatform);
int32_t blockDim = ascendcPlatform->GetCoreNumAiv(); // In MIX mode, use GetCoreNumAiv to obtain the number of available cores of the AI processor.
cubeTiling.SetDim(blockDim);

Call the kernel function and set the blockDim parameter of the kernel function by referring to Kernel Function Definition and Calling.

matmul_custom_do(ascendcPlatform->GetCoreNumAic(), stream, x1, x2, bias, y, workspaceDevice, tilingDevice); // In MIX mode, cores are launched in groups of AIVs and AICs. blockDim sets the number of AI Cores to be started.

For operators without Vector computation, you can bypass the message communication framework by enabling CUBE_ONLY mode for the Matmul computation. This removes the message communication overhead and improves operator performance.

Figure 2 Matmul process in CUBE_ONLY mode

For a complete example of enabling CUBE_ONLY mode with the Matmul API, see Matmul API performance optimization sample. The procedure for enabling CUBE_ONLY mode is as follows:

On the kernel side, in the code that defines the Matmul object, define the ASCENDC_CUBE_ONLY macro before including the matmul_intf.h header file.

#define ASCENDC_CUBE_ONLY // Define the ASCENDC_CUBE_ONLY macro before #include "lib/matmul_intf.h".
#include "lib/matmul_intf.h"

using A_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, AType>;
using B_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BType>;
using C_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, CType>;
using BIAS_TYPE =  AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BiasType>;
AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CFG_NORM> matmulObj;

On the host side, the Matmul multi-core tiling object calls the SetDim API to set the number of cores involved in the computation.

auto ascendcPlatform = platform_ascendc::PlatformAscendCManager::GetInstance();
matmul_tiling::MultiCoreMatmulTiling cubeTiling(*ascendcPlatform);
int32_t blockDim = ascendcPlatform->GetCoreNumAic(); // In CUBE_ONLY mode, the GetCoreNumAic API is used to obtain the number of available cores of the AI processor.
cubeTiling.SetDim(blockDim);

Call the kernel function and set the blockDim parameter of the kernel function by referring to Kernel Function Definition and Calling.

matmul_custom_do(ascendcPlatform->GetCoreNumAic(), stream, x1, x2, bias, y, workspaceDevice, tilingDevice); // For operators that contain only Cube Units, blockDim sets the number of AICs to be started.

On the kernel side, add an early-return branch for the AIV to the kernel function implementation.

extern "C" __global__ __aicore__ void matmul_custom(GM_ADDR a, GM_ADDR b,
    GM_ADDR bias, GM_ADDR c, GM_ADDR workspace, GM_ADDR tilingGm)
{
    if (g_coreType == AscendC::AIV) { // CUBE_ONLY mode, with AIV directly returning
        return;
    }
    ...
    // Other code
}
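
The host-side difference between the two procedures comes down to which core count is passed to SetDim; in the snippets above, the kernel launch uses GetCoreNumAic() in both modes. The sketch below restates that choice in plain C++. The MatmulMode enum, the CoreCounts struct, and SelectCoreCounts are hypothetical names introduced for illustration and are not part of the Ascend C API; aicNum and aivNum stand in for the values returned by GetCoreNumAic() and GetCoreNumAiv().

```cpp
#include <cstdint>

// Hypothetical types for illustration; not part of the Ascend C API.
enum class MatmulMode { MIX, CUBE_ONLY };

struct CoreCounts {
    int32_t tilingDim;  // value that would be passed to cubeTiling.SetDim(...)
    int32_t launchDim;  // blockDim that would be passed when calling the kernel
};

// aicNum / aivNum stand in for GetCoreNumAic() / GetCoreNumAiv().
constexpr CoreCounts SelectCoreCounts(MatmulMode mode, int32_t aicNum, int32_t aivNum)
{
    // MIX: tiling is based on the AIV count, the launch on the AIC count.
    // CUBE_ONLY: both tiling and launch are based on the AIC count.
    return (mode == MatmulMode::MIX) ? CoreCounts{aivNum, aicNum}
                                     : CoreCounts{aicNum, aicNum};
}

// Core counts of the processor used in this case: 24 AICs and 48 AIVs.
static_assert(SelectCoreCounts(MatmulMode::MIX, 24, 48).tilingDim == 48, "MIX tiling");
static_assert(SelectCoreCounts(MatmulMode::MIX, 24, 48).launchDim == 24, "MIX launch");
static_assert(SelectCoreCounts(MatmulMode::CUBE_ONLY, 24, 48).tilingDim == 24, "CUBE_ONLY tiling");
static_assert(SelectCoreCounts(MatmulMode::CUBE_ONLY, 24, 48).launchDim == 24, "CUBE_ONLY launch");
```

In a real operator these values would come from the platform object rather than from constants; the sketch only makes the pairing of tiling count and launch count explicit for each mode.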

Verifying Optimization Benefits

The following figure shows the profiling data after optimization. According to the aic_time data in column C, the maximum operator execution time among multiple cores is 11.21 μs, down from 17.85 μs before optimization (about 37% lower). According to the aic_scalar_time data in column G, the average Scalar execution time is reduced from 15.02 μs before optimization to 5.17 μs (about 66% lower).
The pipeline after optimization is as follows. Compared with the pipeline before optimization, the Scalar pipeline in the red box is noticeably sparser. Compared with MIX mode, CUBE_ONLY mode removes the message communication processing and reduces the overall Scalar performance overhead.
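
To put the before and after figures on a common scale, the relative reduction and the speedup can be computed directly from the reported times. The following is a small C++ sketch of that arithmetic; ReductionPercent and Speedup are hypothetical helper names, and the constants are the profiling values quoted in this case.

```cpp
// Hypothetical helpers for illustration; they only restate the arithmetic.

// Relative reduction of a time measurement, in percent.
constexpr double ReductionPercent(double before, double after)
{
    return (before - after) / before * 100.0;
}

// Speedup factor: how many times faster the optimized run is.
constexpr double Speedup(double before, double after)
{
    return before / after;
}

// Profiling values from this case, in microseconds.
constexpr double kAicTimeBefore = 17.85;     // max aic_time, MIX mode
constexpr double kAicTimeAfter = 11.21;      // max aic_time, CUBE_ONLY mode
constexpr double kScalarTimeBefore = 15.02;  // average aic_scalar_time, MIX mode
constexpr double kScalarTimeAfter = 5.17;    // average aic_scalar_time, CUBE_ONLY mode

constexpr double kAicReduction = ReductionPercent(kAicTimeBefore, kAicTimeAfter);
constexpr double kScalarReduction = ReductionPercent(kScalarTimeBefore, kScalarTimeAfter);
constexpr double kAicSpeedup = Speedup(kAicTimeBefore, kAicTimeAfter);
```

With these inputs, kAicReduction is roughly 37.2, kScalarReduction is roughly 65.6, and kAicSpeedup is roughly 1.59. The Scalar time drops by 9.85 μs while the maximum operator time drops by 6.64 μs, so the Scalar saving does not translate one-to-one into end-to-end time.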

Summary

In scenarios that involve only Cube computation and no Vector computation, you can enable CUBE_ONLY mode to remove the message communication overhead in Matmul computation and improve operator performance.
