Operator Implementation

The following uses the MatMul+LeakyReLU fused operator as an example to describe how to design and implement a mix fused operator. This sample runs only on Atlas A2 training products/Atlas A2 inference products.

The operator design process consists of operator analysis, data flow analysis, and tiling policy design.

Operator Analysis

Operator analysis specifies the mathematical expression, inputs, outputs, and kernel function name of the operator.

  1. Specify the mathematical expression and computation logic of the operator. A matrix multiplication is performed first, and then LeakyReLU is applied to the matrix multiplication result with the alpha parameter. The mathematical expression is as follows:
    c = LeakyRelu(a * b + bias, alpha);
    
  2. Specify the input and output.
    • The inputs of the MatMul+LeakyRelu operator include a, b, and bias, and the output is c. alpha, the coefficient of the activation function LeakyRelu, is a fixed value and can be used as a constant value for computation in operator implementation.
    • In this sample, the data type supported by the operator inputs a and b is the half type (float16), the data type supported by the operator input bias is float32, and the data type of the operator output c is float32.
    • The shape of the input matrix a is [M, K], the shape of the input matrix b is [K, N], the shape of the output matrix c is [M, N], and the shape of the input bias is [1, N].
    • The input and output data format of the operator is ND.
  3. Define the kernel function name and parameters.
    • You can customize the kernel function name. In this example, the kernel function is named matmul_leakyrelu_custom.
    • Based on the analysis of the operator input and output, the kernel function has these parameters: a, b, bias, and c. a, b, and bias indicate the memory address of the input in the global memory, and c indicates the memory address of the output in the global memory.
Table 1 Design specifications of the Ascend C MatMul+LeakyRelu operator

OpType                  MATMUL_LEAKYRELU

Operator input          Name    Shape     Data Type    Format
                        a       [M, K]    half         ND
                        b       [K, N]    half         ND
                        bias    [1, N]    float32      -

Operator output         c       [M, N]    float32      ND

Kernel function name    matmul_leakyrelu_custom
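
To make the specification above concrete, the following is a minimal host-side reference sketch in plain C++. It is not part of the Ascend C sample: the function name MatmulLeakyReluReference is hypothetical, all operands are held as float (the sample stores a and b as half), and row-major storage is assumed. A function like this could serve as a golden reference when checking the kernel output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not part of the Ascend C sample.
// a: [M, K] row-major, b: [K, N] row-major, bias: [N], result: [M, N] row-major.
// Assumption: float is used everywhere; the sample stores a and b as half.
std::vector<float> MatmulLeakyReluReference(const std::vector<float>& a,
                                            const std::vector<float>& b,
                                            const std::vector<float>& bias,
                                            std::size_t M, std::size_t K, std::size_t N,
                                            float alpha)
{
    assert(a.size() == M * K && b.size() == K * N && bias.size() == N);
    std::vector<float> c(M * N);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float acc = bias[n];  // the [1, N] bias is added to every row
            for (std::size_t k = 0; k < K; ++k) {
                acc += a[m * K + k] * b[k * N + n];
            }
            // LeakyReLU: keep non-negative values, scale negative values by alpha.
            c[m * N + n] = acc >= 0.0f ? acc : alpha * acc;
        }
    }
    return c;
}
```

The triple loop mirrors the expression c = LeakyRelu(a * b + bias, alpha) directly and makes no attempt at tiling or performance.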

Data Flow Analysis

Analyze the data flow of the operator: after MatMul computation is completed on the Cube Core, the data is moved to the AI Vector Core for LeakyReLU computation. Plan the parallel pipeline tasks based on this data flow and the programming paradigm of the fused operator. The following figure shows the details:

  1. Move input data from the global memory to the Cube Core.
  2. Perform internal MatMul computation. The compute formula is as follows:
    C = A x B + bias
    Note that the shape of bias is [1, N], so bias is added to each row of the A x B result matrix.
    Figure 1 MatMul matrix multiplication
  3. Move the MatMul compute result to the AI Vector core.
  4. Perform vector compute. In this example, LeakyReLU compute is performed.

    The LeakyReLU (Leaky Rectified Linear Unit) activation function is commonly used in artificial neural networks. Its mathematical expression and function graph are shown below:
    LeakyReLU(x) = x if x >= 0; alpha * x if x < 0

  5. Move the output to the global memory.

The first three steps are encapsulated in the MatMul high-level APIs, so this example has only three stages: MatMul compute, LeakyReLU compute, and CopyOut. The following figure shows the details:

According to the preceding analysis, the MatMul high-level APIs, LeakyReLU Vector Compute API, DataCopy, EnQue, and DeQue are used during the implementation.

Tiling Policy Design

Tiling policy design mainly covers multi-core tiling and intra-core tiling.

  • Multi-core tiling: Based on the current number of cores, tile the input shape (M, K, N) across multiple cores to obtain the single-core shapes singleCoreM, singleCoreK, and singleCoreN.
  • Intra-core tiling: Based on the constraints of the local memory size, further tile the single-core shapes to obtain the shape sizes baseM, baseN, and baseK of the A, B, and C matrices participating in one matrix multiplication instruction. Note that if the result of GetTensorC is placed in the local memory (UB), the size of baseM x baseN must not exceed the UB limit.

The following figure shows the tiling policy. For details, see Data Tiling.
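
The two tiling levels can be illustrated with a small plain C++ sketch. This is only an illustration under simplifying assumptions: the real parameters are produced by the MatMul tiling APIs on the host, the helper names SplitAcrossCores and BaseBlockFitsInUb are hypothetical, K is not split, and the UB capacity is passed in by the caller rather than queried from the platform.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type; the real values come from the MatMul tiling APIs.
struct SingleCoreShape {
    uint32_t singleCoreM;
    uint32_t singleCoreN;
    uint32_t singleCoreK;
};

inline uint32_t CeilDiv(uint32_t x, uint32_t y) { return (x + y - 1) / y; }

// Multi-core tiling: spread M over coresAlongM cores and N over coresAlongN cores.
// Assumption: K is left whole in this sketch.
SingleCoreShape SplitAcrossCores(uint32_t M, uint32_t N, uint32_t K,
                                 uint32_t coresAlongM, uint32_t coresAlongN)
{
    assert(coresAlongM > 0 && coresAlongN > 0);
    return {CeilDiv(M, coresAlongM), CeilDiv(N, coresAlongN), K};
}

// Intra-core constraint: one [baseM, baseN] result block with elements of
// elemBytes bytes must fit into ubBytes bytes of Unified Buffer.
bool BaseBlockFitsInUb(uint32_t baseM, uint32_t baseN, uint32_t elemBytes, uint64_t ubBytes)
{
    return static_cast<uint64_t>(baseM) * baseN * elemBytes <= ubBytes;
}
```

For example, splitting M = 1024 over two cores along M gives singleCoreM = 512, and a candidate baseM x baseN block is accepted only if its byte size does not exceed the UB capacity supplied by the caller.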

Operator Implementation

In Cube Programming, Ascend C provides a group of MatMul high-level APIs, which encapsulate common algorithm logic for tiling, data movement, and compute, helping you quickly implement MatMul. Cube programming in a fused operator works in a similar way: you call APIs on the host to obtain the tiling parameters automatically, pass them to the kernel during initialization, and then complete the matrix multiplication with a few simple API calls. Following the programming paradigm of the fused operator described above, the fused operator is implemented in the steps below. For details about the complete example, see MatmulLeakyRelu.

The following code shows the framework of the kernel-side implementation. After the MatMul object is initialized and the left matrix A, right matrix B, and bias are set, a while loop calls Iterate repeatedly, and each iteration completes the MatMul compute, LeakyReLU compute, and CopyOut for one result block.

template<typename aType, typename bType, typename cType, typename biasType>
__aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::Process(){
    uint32_t computeRound = 0;
    // Initialize the MatMul object.
    REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), matmulObj);
    // Set the input of MatMul (including the left matrix, right matrix, and bias).
    matmulObj.Init(&tiling);
    matmulObj.SetTensorA(aGlobal);
    matmulObj.SetTensorB(bGlobal);
    matmulObj.SetBias(biasGlobal);
    // Call MatMul iterate to obtain the compute result of [baseM, baseN].
    while (matmulObj.template Iterate<true>())
    {
        MatmulCompute();
        LeakyReluCompute();
        CopyOut(computeRound);
        computeRound++;

    }
    matmulObj.End();
}

The implementation code of MatMul compute, LeakyReLU compute, and CopyOut is as follows:

  1. MatMul compute:
    1. Move input data from the global memory to the Cube Core.
    2. Perform the internal MatMul compute.
    3. Move the MatMul compute result to the AI Vector core.
      template<typename aType, typename bType, typename cType, typename biasType>
      __aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::Process(){
          uint32_t computeRound = 0;
          // ...
          // Call MatMul iterate to obtain the compute result of [baseM, baseN].
          while (matmulObj.template Iterate<true>())
          {
              MatmulCompute();
              // ...
              computeRound++;
      
          }
          matmulObj.End();
      }
      
      template<typename aType, typename bType, typename cType, typename biasType>
      __aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::MatmulCompute(){
          reluOutLocal = reluOutQueue_.AllocTensor<cType>();
          // Call GetTensorC to move the MatMul compute result to the AI Vector Core.
          matmulObj.template GetTensorC<true>(reluOutLocal, false, true);
      }
      
  2. Perform LeakyReLU compute.
    // Call the LeakyRelu API to compute.
    template<typename aType, typename bType, typename cType, typename biasType>
    __aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::LeakyReluCompute(){
        AscendC::LeakyRelu(reluOutLocal, reluOutLocal, (cType)alpha, tiling.baseM * tiling.baseN);
        reluOutQueue_.EnQue(reluOutLocal);
    }
    
  3. CopyOut: move the output to the global memory.
    // Move the result to the global memory.
    template<typename aType, typename bType, typename cType, typename biasType>
    __aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::CopyOut(uint32_t count){
        // Dequeue the LeakyReLU result that LeakyReluCompute enqueued.
        reluOutQueue_.DeQue<cType>();
        // Number of [baseM, baseN] blocks along M and along N within one core.
        const uint32_t roundM = tiling.singleCoreM / tiling.baseM;
        const uint32_t roundN = tiling.singleCoreN / tiling.baseN;
        // count % roundM selects the block row and count / roundM selects the block column.
        uint32_t startOffset = (count % roundM * tiling.baseM * tiling.N + count / roundM * tiling.baseN);
        // Copy baseM rows; the row length and destination stride are in units of DEFAULT_C0_SIZE bytes.
        AscendC::DataCopyParams copyParam = {(uint16_t)tiling.baseM,
            (uint16_t)(tiling.baseN * sizeof(cType) / DEFAULT_C0_SIZE), 0,
            (uint16_t)((tiling.N - tiling.baseN) * sizeof(cType) / DEFAULT_C0_SIZE)};
        AscendC::DataCopy(cGlobal[startOffset], reluOutLocal, copyParam);
        reluOutQueue_.FreeTensor(reluOutLocal);
    }
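
The offset and stride arithmetic in CopyOut can be checked on the host with a small plain C++ sketch. The struct TilingView and the functions BlockStartOffset and DstStrideBlocks are hypothetical stand-ins written for illustration; they copy the expressions from CopyOut, and the data block size is passed in as a parameter instead of relying on DEFAULT_C0_SIZE.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the tiling fields that CopyOut reads.
struct TilingView {
    uint32_t N;            // row length of the full output matrix, in elements
    uint32_t singleCoreM;  // rows handled by one core
    uint32_t baseM;        // rows in one result block
    uint32_t baseN;        // columns in one result block
};

// Element offset in C of result block number `count`,
// using the same arithmetic as CopyOut.
uint32_t BlockStartOffset(const TilingView& t, uint32_t count)
{
    assert(t.baseM > 0 && t.singleCoreM >= t.baseM);
    const uint32_t roundM = t.singleCoreM / t.baseM;  // blocks along M
    const uint32_t blockRow = count % roundM;
    const uint32_t blockCol = count / roundM;
    return blockRow * t.baseM * t.N + blockCol * t.baseN;
}

// Gap between the end of one copied row and the start of the next row in C,
// expressed in data blocks of blockBytes bytes, as passed to DataCopyParams.
uint32_t DstStrideBlocks(const TilingView& t, uint32_t elemBytes, uint32_t blockBytes)
{
    assert(blockBytes > 0 && t.N >= t.baseN);
    return (t.N - t.baseN) * elemBytes / blockBytes;
}
```

With N = 256, singleCoreM = 256, and baseM = baseN = 128, consecutive values of count first move down the M direction and then step to the next group of baseN columns.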
    

Procedure for the host to automatically obtain tiling parameters:

  1. Create a tiling object.
    auto ascendcPlatform = platform_ascendc::PlatformAscendC(context->GetPlatformInfo());
    matmul_tiling::MultiCoreMatmulTiling cubeTiling(ascendcPlatform);
    

    When creating an object, you need to pass the hardware platform information, which can be obtained by calling GetPlatformInfo.

  2. Set the data types and formats of A, B, and bias.
    The following is a setting example. TPosition::LCM is the logical location on Unified Buffer, which is equivalent to TPosition::VECCALC. For details about TPosition, see TPosition.
    cubeTiling.SetAType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT16);
    cubeTiling.SetBType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT16);
    cubeTiling.SetCType(matmul_tiling::TPosition::LCM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT);
    cubeTiling.SetBiasType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT);
    
  3. Set the matrix shape.
    cubeTiling.SetShape(M, N, K);
    cubeTiling.SetOrgShape(M, N, K);
    
  4. Set the size of the available space.
    Set the size of the available space on L1 Buffer, L0C Buffer, or Unified Buffer for MatMul compute. The value -1 indicates that the full size of the corresponding buffer on the AI processor is used.
    cubeTiling.SetBufferSpace(-1, -1, -1);
    
  5. Set other parameters as required, for example, bias that will participate in the compute.
    cubeTiling.SetBias(true);
    
  6. Obtain tiling parameters.
    MatmulLeakyreluCustomTilingData tiling;
    if (cubeTiling.GetTiling(tiling.cubeTilingData) == -1){
        return ge::GRAPH_FAILED;
    }
    
  7. Perform other operations such as serialization and saving of tiling parameters.
  • In multi-core scenarios, you need to set the number of cores used for MatMul compute through SetDim. The rules for setting the MIX mode (including Cube computation and Vector computation) are as follows:
    • Separated mode: The MatMul API is initiated from the AIV. When Iterate is called for computation, the AIV only notifies the AIC to perform Cube computation. After that, the AIC notifies the AIV that the computation is complete. In this architecture, SetBlockDim is set to the number of AI Cores (AIC + AIV) used for computation, and SetDim is set to the number of AIVs used for computation. For example, setting SetBlockDim to 20 starts 20 AI Cores (AIC + AIV), and setting SetDim to 40 tiles the computation across 40 AIVs.
    • Coupled mode: The number of cores loaded by SetBlockDim is the same as that used by the MatMul API for computation. The values of SetDim and SetBlockDim are the same.
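
The relationship between the two settings can be written down as a small plain C++ sketch. The names MixMode, DimSettings, and ChooseDims are hypothetical and only restate the rules above; the ratio of two AIVs per AI Core is an assumption taken from the 20/40 example and is passed in as a parameter.

```cpp
#include <cassert>
#include <cstdint>

enum class MixMode { Separated, Coupled };

// Hypothetical pair of values to pass to SetBlockDim and SetDim.
struct DimSettings {
    uint32_t blockDim;   // value for SetBlockDim
    uint32_t tilingDim;  // value for SetDim
};

// aiCoreCount: number of AI Cores to launch.
// aivPerAiCore: vector cores per AI Core, used only in separated mode
// (assumption: 2 matches the 20/40 example in the text).
DimSettings ChooseDims(MixMode mode, uint32_t aiCoreCount, uint32_t aivPerAiCore)
{
    assert(aiCoreCount > 0 && aivPerAiCore > 0);
    if (mode == MixMode::Separated) {
        // SetDim counts AIVs, SetBlockDim counts AI Cores.
        return {aiCoreCount, aiCoreCount * aivPerAiCore};
    }
    // Coupled mode: both values are the same.
    return {aiCoreCount, aiCoreCount};
}
```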
  • The system workspace is required for internal implementation of MatMul high-level APIs. You need to:
    • Set the total workspace size (including the user workspace and system workspace) when implementing tiling on the host. The workspace is allocated and managed by the framework. The size of the system workspace can be obtained by calling GetLibApiWorkSpaceSize.
      size_t userWorkspaceSize = 0;
      size_t systemWorkspaceSize = ascendcPlatform.GetLibApiWorkSpaceSize();
      size_t *currentWorkspace = context->GetWorkspaceSizes(1);
      currentWorkspace[0] = userWorkspaceSize + systemWorkspaceSize;
      
    • If the operator project is neither a custom operator project nor a kernel launch operator project with the HAVE_WORKSPACE compilation macro, the kernel needs to set the system workspace through SetSysWorkspace before MatMul initialization.
      // The system workspace must be set when MatMul is used.
      SetSysWorkspace(workspace);
      if (GetSysWorkSpacePtr() == nullptr) {
          return;
      }
      
  • In the preceding implementation method, code isolation and inter-core synchronization between the AIC and AIV sides are handled by the framework, so you do not need to manage them. Alternatively, you can implement fused operators in separated mode through underlying coding, which is more flexible. When using the underlying coding method, pay attention to the following points:
    • Implement synchronization between AICs and AIVs. For example, in the MatMul + LeakyRelu operator sample, ensure that AIV performs LeakyRelu computation after AIC completes Cube computation.
    • When using MatMul high-level APIs, you need to set ASCENDC_CUBE_ONLY to indicate that MatMul APIs are called only on the AIC side.
    • Use the API for setting the kernel type to set the kernel type to KERNEL_TYPE_MIX_xxx and enable both AIVs and AICs.
    #define ASCENDC_CUBE_ONLY // Specify that MatMul operators run on AICs.
    KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_MIX_AIC_1_2); // Set the kernel type to KERNEL_TYPE_MIX_xxx.
    if ASCEND_IS_AIC {
        ...
        // AICs perform MatMul compute.
        // After AICs complete the compute, they send the synchronization flag by using AscendC::CrossCoreSetFlag<modeId, pipe>(flagId).
    }
    if ASCEND_IS_AIV {
        ...
        // AIVs receive the synchronization flag by using AscendC::CrossCoreWaitFlag(flagId).
        // AIVs perform LeakyReLU compute.
    } 
    

    For details about the complete example, see the BareMix sample.