Operator Implementation
Workflow
The previous section describes the data tiling solution and data flows of Matmul. Ascend C provides a group of Matmul high-level APIs that encapsulate the common algorithm logic for tiling, data movement, and compute, helping you quickly implement Matmul. You call APIs on the host to automatically obtain the tiling parameters, pass them to the kernel, and use them during initialization; the matrix multiplication can then be completed through several simple API calls. For details about the complete example, see here.

Procedure for the host to automatically obtain tiling parameters:
- Create a tiling object.
```cpp
auto ascendcPlatform = platform_ascendc::PlatformAscendC(context->GetPlatformInfo());
matmul_tiling::MatmulApiTiling cubeTiling(ascendcPlatform);
```
When creating an object, you need to pass the hardware platform information, which can be obtained by calling GetPlatformInfo.
- Set the data types and formats of A, B, and bias.
```cpp
cubeTiling.SetAType(AscendC::TPosition::GM, CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT16);
cubeTiling.SetBType(AscendC::TPosition::GM, CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT16);
cubeTiling.SetCType(AscendC::TPosition::GM, CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT);
cubeTiling.SetBiasType(AscendC::TPosition::GM, CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT);
```
- Set the matrix shape.
```cpp
cubeTiling.SetShape(M, N, K);
cubeTiling.SetOrgShape(M, N, K);
```
- Set the size of the available space.
```cpp
cubeTiling.SetBufferSpace(-1, -1, -1);
```
- Set other parameters as required, for example, bias that will participate in the compute.
```cpp
cubeTiling.SetBias(true);
```
- Obtain tiling parameters.
```cpp
MatmulCustomTilingData tiling;
if (cubeTiling.GetTiling(tiling.cubeTilingData) == -1) {
    return ge::GRAPH_FAILED;
}
```
- Perform other operations such as serialization and saving of tiling parameters.
Procedure for using the Matmul APIs in the kernel:
- Create a Matmul object.
The following is an example of creating a Matmul object:
- In the CUBE_ONLY (with only Cube computation) scenario, you need to set the ASCENDC_CUBE_ONLY code macro. This section uses the CUBE_ONLY mode as an example.
- By default, the MIX mode (including Cube computation and Vector computation) is used. In this scenario, the ASCENDC_CUBE_ONLY code macro cannot be set. For more information, see Fusion Operator Programming.
```cpp
// In CUBE_ONLY mode, define this macro before #include "lib/matmul_intf.h".
#define ASCENDC_CUBE_ONLY
#include "lib/matmul_intf.h"

typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half> aType;
typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half> bType;
typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float> cType;
typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float> biasType;
matmul::Matmul<aType, bType, cType, biasType> mm;
```
During object creation, input the type information of parameters A, B, C, and Bias. The type information is defined by MatmulType, including the logical location of memory, data format, and data type.
- Perform initialization.
```cpp
REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm, &tiling); // Initialization
```
The system workspace is required for internal implementation of Matmul high-level APIs. You need to:
- Set the total workspace size (including the user workspace and system workspace) when implementing tiling on the host. The workspace space is allocated and managed by the framework. The size of the system workspace can be obtained by calling GetLibApiWorkSpaceSize.
```cpp
size_t userWorkspaceSize = 0;
size_t systemWorkspaceSize = static_cast<size_t>(ascendcPlatform.GetLibApiWorkSpaceSize());
size_t *currentWorkspace = context->GetWorkspaceSizes(1);
currentWorkspace[0] = userWorkspaceSize + systemWorkspaceSize;
```
- If the operator project is neither a custom operator project nor a kernel launch operator project compiled with the -DHAVE_WORKSPACE macro, the kernel must set the system workspace by calling SetSysWorkSpace before Matmul initialization.
```cpp
// The workspace must be set when Matmul is used.
SetSysWorkspace(workspace);
if (GetSysWorkSpacePtr() == nullptr) {
    return;
}
```
- Set the left matrix A, right matrix B, and bias.
```cpp
mm.SetTensorA(gm_a);   // Set the left matrix A.
mm.SetTensorB(gm_b);   // Set the right matrix B.
mm.SetBias(gm_bias);   // Set the bias.
```
- Execute the matrix multiplication.
- Call Iterate to complete a single iterative computation, and use a while loop to compute the full data on a single core. The Iterate method allows for flexible control over the number of iterations required to compute the desired amount of data.
```cpp
while (mm.Iterate()) {
    mm.GetTensorC(gm_c);
}
```
- Call IterateAll to compute all data on a single core. The IterateAll method does not require cyclic iterations and is relatively simple to use.
```cpp
mm.IterateAll(gm_c);
```
- End the matrix multiplication.
```cpp
mm.End();
```
Setting Shape Information
Shape information is set during host tiling to drive the tiling computation. Some shape information can also be modified at kernel runtime, for scenarios such as tail-block handling and Matmul reuse (multiple Matmul computations reusing one Matmul object). This section describes the shape concepts involved and provides guidance on how to set the tiling information on the host and in the kernel.
- orgShape: M, N, K
- singleCoreShape: singleCoreM, singleCoreN, singleCoreK
- singleShape: singleM, singleN, singleK
- baseShape: baseM, baseN, baseK
In Data Tiling, we have learned the concepts of orgShape (M, N, and K), singleCoreShape (singleCoreM, singleCoreN, and singleCoreK), and baseShape (baseM, baseN, and baseK), as shown in the following figure.

In addition, during single-core Matmul tiling, the shape that actually participates in the Matmul computation can be a part of the original shape. singleM, singleN, and singleK express the shape that actually participates in the computation, as shown in the following figure. In single-core scenarios, singleM, singleN, and singleK are passed in through singleCoreM, singleCoreN, and singleCoreK.

- Kernel runtime settings
- SetTail and SetSingleShape are used to modify singleCoreM, singleCoreN, and singleCoreK at runtime: SetTail handles tail blocks, and SetSingleShape resets the shapes in Matmul reuse scenarios (one Matmul object serving multiple Matmul computations).
- SetOrgShape is used to modify M, N, and K during runtime. You can also use it to reset shapes in the Matmul reuse scenarios.
- Single-core tiling settings
- Multi-core tiling settings
Setting the Format
During Matmul object creation, input the type information of parameters A, B, C, and Bias. The type information is defined by MatmulType, including the logical location of memory, data format, and data type. The following is an example:
```cpp
typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half> aType;
typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half> bType;
typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float> cType;
typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float> biasType;
matmul::Matmul<aType, bType, cType, biasType> mm;
```
Data formats include CubeFormat::ND, CubeFormat::NZ, and CubeFormat::ND_ALIGN. For details about the ND and NZ formats, see Data Format.
ND_ALIGN is used to output the Matmul result matrix padded according to certain alignment rules. The following figure shows the ND to ND_ALIGN conversion. Assume that the matrix data type is uint32_t, the result matrix is output to the UB, and the N direction of the original matrix is not 32-byte aligned; zeros are padded so that each row is aligned to 32 bytes.
