Operator Implementation

The following uses the MatMul+LeakyReLU fused operator as an example to describe how a mix fused operator is designed and implemented. This example can run only on the Atlas A2 training products/Atlas A2 inference products.

The operator design process consists of operator analysis, data flow analysis, and tiling strategy design.

Operator Analysis

Operator analysis is to specify the mathematical expression, input, output, and kernel function name of an operator.

Specify the mathematical expression and compute logic of the operator. The operator first performs matrix multiplication and then applies LeakyReLU, with the alpha parameter, to the matrix multiplication result. The mathematical expression is as follows:

c = LeakyRelu(a * b + bias, alpha);
Specify the input and output.
- The inputs of the MatMul+LeakyRelu operator include a, b, and bias, and the output is c. alpha, the coefficient of the activation function LeakyRelu, is a fixed value and can be used as a constant value for compute in operator implementation.
- In this example, the data type supported by the operator inputs a and b is the half type (float16), the data type supported by the operator input bias is float32, and the data type of the operator output c is float32.
- The shape of the input matrix a is [M, K], the shape of the input matrix b is [K, N], the shape of the output matrix c is [M, N], and the shape of the input bias is [1, N].
- The input and output data format of the operator is ND.
Define the kernel function name and parameters.
- You can customize the kernel function name. In this example, the kernel function is named matmul_leakyrelu_custom.
- Based on the analysis of the operator input and output, the kernel function has these parameters: a, b, bias, and c. a, b, and bias indicate the memory address of the input in the global memory, and c indicates the memory address of the output in the global memory.

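**Table 1** Design specifications of the Ascend C MatMul+LeakyRelu operator

| Item | Name | Shape | Data Type | Format |
| --- | --- | --- | --- | --- |
| OpType | MATMUL_LEAKYRELU | | | |
| Operator input | a | [M, K] | half | ND |
| Operator input | b | [K, N] | half | ND |
| Operator input | bias | [1, N] | float32 | - |
| Operator output | c | [M, N] | float32 | ND |
| Kernel function name | matmul_leakyrelu_custom | | | |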
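
As a cross-check for the specification above, the following is a minimal CPU reference sketch in plain C++, not Ascend C code. It assumes row-major ND storage and uses float as a stand-in for half, because standard C++ has no half type. The function name MatmulLeakyReluRef is hypothetical and is not part of any Ascend C API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for c = LeakyRelu(a * b + bias, alpha).
// a: [M, K], b: [K, N], bias: [1, N], c: [M, N], all row-major (ND).
// Assumption: float stands in for half on the inputs a and b.
std::vector<float> MatmulLeakyReluRef(const std::vector<float>& a,
                                      const std::vector<float>& b,
                                      const std::vector<float>& bias,
                                      std::size_t M, std::size_t K, std::size_t N,
                                      float alpha)
{
    assert(a.size() == M * K && b.size() == K * N && bias.size() == N);
    std::vector<float> c(M * N);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float acc = bias[n];  // bias [1, N] is added to every row
            for (std::size_t k = 0; k < K; ++k) {
                acc += a[m * K + k] * b[k * N + n];
            }
            // LeakyReLU: keep non-negative values, scale negative ones by alpha.
            c[m * N + n] = (acc >= 0.0f) ? acc : alpha * acc;
        }
    }
    return c;
}
```

A reference of this kind could be used on the host to compare against the output of matmul_leakyrelu_custom within a tolerance suited to half-precision inputs.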

Data Flow Analysis

Analyze the data flow of the operator. The data flow direction is as follows: After Matmul compute is completed on the AI Cube Core, data is moved to the AI Vector Core for LeakyRelu compute. Based on the preceding data flows and the programming paradigm of the fused operator, plan parallel pipeline tasks. The following figure shows the details:

1. Move input data from the global memory to the AI Cube Core.
2. Perform internal MatMul compute. The compute formula is C = A x B + bias. Note that the shape of bias is [1, N], so bias is added to each row of the A x B result matrix.
   Figure 1 MatMul matrix multiplication
3. Move the MatMul compute result to the AI Vector Core.
4. Perform Vector compute. In this example, LeakyReLU compute is performed.
   LeakyReLU (Leaky Rectified Linear Unit) is an activation function commonly used in artificial neural networks. Its mathematical expression is as follows:

   $\mathrm{LeakyReLU}(x) = \begin{cases} x, & x \ge 0 \\ \alpha x, & x < 0 \end{cases}$
5. Move the output to the global memory.

The first three steps are encapsulated in the MatMul high-level APIs, so the implementation in this example has only three stages: MatMul compute, LeakyReLU compute, and CopyOut. The following figure shows the details:

According to the preceding analysis, the MatMul high-level APIs, LeakyReLU Vector Compute API, DataCopy, EnQue, and DeQue are used during the implementation.

Tiling Strategy Design

Tiling strategy design mainly includes multi-core tiling and intra-core tiling strategies.

Multi-core tiling: Based on the number of available cores, tile the input shape (M, K, N) across multiple cores to obtain the single-core shapes singleCoreM, singleCoreK, and singleCoreN.
Intra-core tiling: Based on the restrictions on the local memory size, further tile the single-core shapes to obtain the shape sizes baseM, baseN, and baseK of matrices A, B, and C participating in one matrix multiplication instruction. Note that if the result of GetTensorC is placed in the local memory (UB), the size of baseM x baseN must not exceed the UB limit.

The following figure shows the tiling strategy. For details, see Data Tiling.
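
The two tiling levels can be illustrated with a small host-side sketch in plain C++. It is only an illustration of the arithmetic: in the actual operator, the tiling parameters are produced by the MatMul tiling API on the host. The names SingleCoreShape, SplitAcrossCores, and FitsInUb are hypothetical, K is assumed to be left unsplit, and the UB budget is passed in as a parameter rather than taken from any real device.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for illustration; not part of the Ascend C API.
struct SingleCoreShape {
    uint32_t singleCoreM;
    uint32_t singleCoreN;
    uint32_t singleCoreK;
};

// Integer division rounded up.
constexpr uint32_t CeilDiv(uint32_t x, uint32_t y) { return (x + y - 1) / y; }

// Multi-core tiling: split M across coresM cores and N across coresN cores.
// Assumption: K is not split in this sketch, so singleCoreK == K.
inline SingleCoreShape SplitAcrossCores(uint32_t M, uint32_t N, uint32_t K,
                                        uint32_t coresM, uint32_t coresN)
{
    return {CeilDiv(M, coresM), CeilDiv(N, coresN), K};
}

// Intra-core tiling check: one [baseM, baseN] result block with elements of
// elemSize bytes must fit into the given UB budget (in bytes).
constexpr bool FitsInUb(uint32_t baseM, uint32_t baseN,
                        uint32_t elemSize, uint32_t ubBytes)
{
    return static_cast<uint64_t>(baseM) * baseN * elemSize <= ubBytes;
}
```

For example, with M = 1024, N = 640, K = 256 and a hypothetical 2 x 4 core split, each core handles a [512, 160] slice of C. With a hypothetical 192 KB budget, a 128 x 128 float32 block fits, whereas a 256 x 256 block does not.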

Operator Implementation

In Cube Programming, Ascend C provides a group of MatMul high-level APIs, which encapsulate common algorithm logic for tiling, data movement, and compute, helping you quickly implement MatMul matrix multiplication. The implementation of cube programming in a fused operator is similar. You can call APIs on the host to automatically obtain tiling parameters. After the parameters are passed to the kernel during initialization, matrix multiplication can be completed by using several simple APIs. With reference to the foregoing programming paradigm of the fused operator, the procedure for implementing the fused operator is as follows. For details, see MatmulLeakyRelu.

The following figure shows the code framework implemented on the kernel. After the MatMul object is initialized and the left matrix A, right matrix B, and bias are set, a single while loop that calls Iterate completes the subsequent MatMul compute, LeakyReLU compute, and CopyOut process.

template<typename aType, typename bType, typename cType, typename biasType>
__aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::Process(){
    uint32_t computeRound = 0;
    // Initialize the MatMul object.
    REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), matmulObj);
    // Set the input of MatMul (including the left matrix, right matrix, and bias).
    matmulObj.Init(&tiling);
    matmulObj.SetTensorA(aGlobal);
    matmulObj.SetTensorB(bGlobal);
    matmulObj.SetBias(biasGlobal);
    // Call MatMul iterate to obtain the compute result of [baseM, baseN].
    while (matmulObj.template Iterate<true>())
    {
        MatmulCompute();
        LeakyReluCompute();
        CopyOut(computeRound);
        computeRound++;
    }
    matmulObj.End();
}

The implementation code of MatMul compute, LeakyReLU compute, and CopyOut is as follows:

MatMul compute:

Move input data from the global memory to the AI Cube Core.
Perform the internal MatMul compute.

Move the MatMul compute result to the AI Vector Core.

template<typename aType, typename bType, typename cType, typename biasType>
__aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::Process(){
    uint32_t computeRound = 0;
    // ...
    // Call MatMul iterate to obtain the compute result of [baseM, baseN].
    while (matmulObj.template Iterate<true>())
    {
        MatmulCompute();
        // ...
        computeRound++;

    }
    matmulObj.End();
}

template<typename aType, typename bType, typename cType, typename biasType>
__aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::MatmulCompute(){
    reluOutLocal = reluOutQueue_.AllocTensor<cType>();
    // Call GetTensorC to move the MatMul compute result to the AI Vector Core.
    matmulObj.template GetTensorC<true>(reluOutLocal, false, true);
}

Perform LeakyReLU compute.

// Call the LeakyRelu API to compute.
template<typename aType, typename bType, typename cType, typename biasType>
__aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::LeakyReluCompute(){
    AscendC::LeakyRelu(reluOutLocal, reluOutLocal, (cType)alpha, tiling.baseM * tiling.baseN);
    reluOutQueue_.EnQue(reluOutLocal);
}

CopyOut: move the output to the global memory.

// Move the result to the global memory.
template<typename aType, typename bType, typename cType, typename biasType>
__aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::CopyOut(uint32_t count){
    reluOutQueue_.DeQue<cType>();
    const uint32_t roundM = tiling.singleCoreM / tiling.baseM;
    const uint32_t roundN = tiling.singleCoreN / tiling.baseN;
    uint32_t startOffset = (count % roundM * tiling.baseM * tiling.N + count / roundM * tiling.baseN);
    AscendC::DataCopyParams copyParam = {(uint16_t)tiling.baseM,
        (uint16_t)(tiling.baseN * sizeof(cType) / DEFAULT_C0_SIZE), 0,
        (uint16_t)((tiling.N - tiling.baseN) * sizeof(cType) / DEFAULT_C0_SIZE)};
    AscendC::DataCopy(cGlobal[startOffset], reluOutLocal, copyParam);
    reluOutQueue_.FreeTensor(reluOutLocal);
}
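
The offset and stride arithmetic in CopyOut can be checked on the host with the following plain C++ sketch. It mirrors the expressions in the code above and is not Ascend C code. It assumes that DEFAULT_C0_SIZE is 32 bytes, and the names kBlockBytes, StartOffset, CopyParamSketch, and MakeCopyParam are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: DEFAULT_C0_SIZE corresponds to a 32-byte data block.
constexpr uint32_t kBlockBytes = 32;

// Element offset, relative to cGlobal, of the count-th [baseM, baseN] block.
// Mirrors startOffset in CopyOut: blocks advance along M first, then along N.
constexpr uint32_t StartOffset(uint32_t count, uint32_t singleCoreM,
                               uint32_t baseM, uint32_t baseN, uint32_t N)
{
    return (count % (singleCoreM / baseM)) * baseM * N
         + (count / (singleCoreM / baseM)) * baseN;
}

// Hypothetical mirror of the four values passed to DataCopyParams in CopyOut.
struct CopyParamSketch {
    uint16_t blockCount;  // number of rows to copy (baseM)
    uint16_t blockLen;    // length of one row, in 32-byte blocks
    uint16_t srcStride;   // gap between rows in the source, in 32-byte blocks
    uint16_t dstStride;   // gap between rows in the destination, in 32-byte blocks
};

inline CopyParamSketch MakeCopyParam(uint32_t baseM, uint32_t baseN,
                                     uint32_t N, uint32_t elemSize)
{
    return {static_cast<uint16_t>(baseM),
            static_cast<uint16_t>(baseN * elemSize / kBlockBytes),
            0,
            static_cast<uint16_t>((N - baseN) * elemSize / kBlockBytes)};
}
```

With the illustrative values singleCoreM = 256, baseM = 128, baseN = 64, N = 512, and a 4-byte output type, rounds 0 to 3 start at element offsets 0, 65536, 64, and 65600. Each copy moves 128 rows of 8 blocks and skips 56 blocks in the destination between rows.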

Procedure for the host to automatically obtain tiling parameters:

Create a tiling object.

auto ascendcPlatform = platform_ascendc::PlatformAscendC(context->GetPlatformInfo());
matmul_tiling::MultiCoreMatmulTiling cubeTiling(ascendcPlatform);

When creating an object, you need to pass the hardware platform information, which can be obtained by calling GetPlatformInfo.

Set the data types and formats of A, B, C, and bias.

The following is an example. TPosition::LCM is a logical position on the Unified Buffer and is equivalent to TPosition::VECCALC. For details about TPosition, see TPosition.

cubeTiling.SetAType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT16);
cubeTiling.SetBType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT16);
cubeTiling.SetCType(matmul_tiling::TPosition::LCM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT);
cubeTiling.SetBiasType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT);

Set the matrix shape.

cubeTiling.SetShape(M, N, K);
cubeTiling.SetOrgShape(M, N, K);

Set the size of the available space.

Set the size of the available space on the L1 Buffer, L0C Buffer, and Unified Buffer for MatMul compute. The value -1 indicates that the full size of the corresponding buffer on the AI processor is used.

cubeTiling.SetBufferSpace(-1, -1, -1);

Set other parameters as required, for example, whether bias participates in the compute.

cubeTiling.SetBias(true);

Obtain tiling parameters.

MatmulLeakyreluCustomTilingData tiling;
if (cubeTiling.GetTiling(tiling.cubeTilingData) == -1){
    return ge::GRAPH_FAILED;
}

Perform other operations such as serialization and saving of tiling parameters.

In multi-core scenarios, you need to set the number of cores used for MatMul compute through SetDim. In MIX mode (which includes both Cube compute and Vector compute), the rules are as follows:
- Separated mode: The MatMul API is initiated from the AIV. When Iterate is called for compute, the AIV only notifies the AIC to perform Cube compute. After that, the AIC notifies the AIV that the compute is complete. In this architecture, SetBlockDim is set to the number of AI Cores (AIC + AIV) used for compute, and SetDim is set to the number of AIVs used for compute. For example, when SetBlockDim is set to 20, 20 AI Cores (AIC + AIV) are started, and when SetDim is set to 40, tiling is performed for 40 AIVs.
- Coupled mode: The number of cores started by SetBlockDim is the same as the number used by the MatMul API for compute, so SetDim and SetBlockDim are set to the same value.
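
The relationship between the two values can be summarized in a short plain C++ sketch. The names MixMode, DimConfig, and ChooseDims are hypothetical, and the default of two AIVs per AI Core is an assumption chosen to match the 20 and 40 example above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types for illustration; not part of the Ascend C API.
enum class MixMode { Separated, Coupled };

struct DimConfig {
    uint32_t blockDim;   // value to pass to SetBlockDim
    uint32_t matmulDim;  // value to pass to SetDim
};

// aivPerCore: number of AIVs per AI Core in separated mode.
// Assumption: 2, to match the example in which 20 AI Cores give 40 AIVs.
inline DimConfig ChooseDims(MixMode mode, uint32_t aiCores, uint32_t aivPerCore = 2)
{
    if (mode == MixMode::Separated) {
        // SetBlockDim counts AI Cores (AIC + AIV); SetDim counts AIVs.
        return {aiCores, aiCores * aivPerCore};
    }
    // Coupled mode: both values are the same.
    return {aiCores, aiCores};
}
```

Under these assumptions, ChooseDims(MixMode::Separated, 20) yields 20 for SetBlockDim and 40 for SetDim, and ChooseDims(MixMode::Coupled, 20) yields 20 for both.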

The system workspace is required for internal implementation of MatMul high-level APIs. You need to:

Set the total workspace size (including the user workspace and system workspace) when implementing tiling on the host. The workspace is allocated and managed by the framework. The size of the system workspace can be obtained by calling GetLibApiWorkSpaceSize.

size_t userWorkspaceSize = 0;
size_t systemWorkspaceSize = ascendcPlatform.GetLibApiWorkSpaceSize();
size_t *currentWorkspace = context->GetWorkspaceSizes(1);
currentWorkspace[0] = userWorkspaceSize + systemWorkspaceSize;

If the operator project is neither a custom operator project nor a kernel launch operator project with the HAVE_WORKSPACE compilation macro, the kernel needs to set the system workspace through SetSysWorkspace before MatMul initialization.

// The workspace must be set when MatMul is used.
SetSysWorkspace(workspace);
if (GetSysWorkSpacePtr() == nullptr) {
    return;
}

In the preceding implementation method, code isolation and inter-core synchronization between the AIC and AIV are handled by the framework. Alternatively, you can implement fused operators in separated mode through underlying coding, which is more flexible. When using the underlying coding method, pay attention to the following points:

Use ASCEND_IS_AIV and ASCEND_IS_AIC to isolate the AIV and AIC code.

Implement synchronization between AICs and AIVs. For example, in the MatMul + LeakyRelu operator sample, ensure that the AIV performs LeakyRelu compute after the AIC completes Cube compute.
When using MatMul high-level APIs, you need to set ASCENDC_CUBE_ONLY to indicate that MatMul APIs are called only on the AIC.
Use the Setting Kernel Type API to set the kernel type to KERNEL_TYPE_MIX_xxx and enable both AIVs and AICs.

#define ASCENDC_CUBE_ONLY // Specify that MatMul operators run on AICs.
KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_MIX_AIC_1_2); // Set the kernel type to KERNEL_TYPE_MIX_xxx.
if ASCEND_IS_AIC {
    ...
    // The AIC performs MatMul compute.
    // After the AIC completes the compute, the synchronization flag is sent by using AscendC::CrossCoreSetFlag<modeId, pipe>(flagId).
}
if ASCEND_IS_AIV {
    ...
    // The AIV receives the synchronization flag by using AscendC::CrossCoreWaitFlag(flagId).
    // The AIV performs LeakyReLU compute.
}

For details, refer to the BareMix sample.
