Basics

Before learning fused operator programming, make sure that you are familiar with Cube programming.

CV Fused Operator

A fused operator combines multiple independent small operators into one. It is functionally equivalent to those operators but usually performs better. Vector and Cube operators can be fused based on specific algorithms to improve performance. An operator that fuses Cube and Vector computation is called a CV fused operator.

For example, the following figure shows the core implementation of Flash Attention, a core fused operator in large language models (LLMs). The MatMul operator (Cube), Scale operator (Vector), Mask operator (Vector), and SoftMax operator (Vector) in the figure are fused into one large operator, Flash Attention.

Figure 1 Core implementation of Flash Attention
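
To make "functionally equivalent, but fused" concrete, the following is a minimal host-side sketch in plain C++ (not Ascend C, and not the Flash Attention implementation). It covers only the Scale, Mask, and SoftMax stages on a single row of scores and leaves MatMul out. All function names are hypothetical, and the sketch assumes a non-empty row with at least one unmasked element.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Unfused: three small operators, each writing a full intermediate buffer.
std::vector<float> ScaleOp(const std::vector<float>& x, float scale) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = x[i] * scale;
    return y;
}

std::vector<float> MaskOp(const std::vector<float>& x, const std::vector<bool>& keep) {
    assert(x.size() == keep.size());
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = keep[i] ? x[i] : -std::numeric_limits<float>::infinity();
    return y;
}

std::vector<float> SoftmaxOp(const std::vector<float>& x) {
    assert(!x.empty());
    const float maxVal = *std::max_element(x.begin(), x.end());
    std::vector<float> y(x.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(x[i] - maxVal);  // subtract the max for numerical stability
        sum += y[i];
    }
    for (float& v : y) v /= sum;
    return y;
}

// Fused: the same math, but only the output buffer is allocated.
// Scale and mask are applied while the row maximum is being collected.
std::vector<float> FusedScaleMaskSoftmax(const std::vector<float>& x,
                                         const std::vector<bool>& keep,
                                         float scale) {
    assert(!x.empty() && x.size() == keep.size());
    const float negInf = -std::numeric_limits<float>::infinity();
    std::vector<float> y(x.size());
    float maxVal = negInf;
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = keep[i] ? x[i] * scale : negInf;
        maxVal = std::max(maxVal, y[i]);
    }
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(y[i] - maxVal);
        sum += y[i];
    }
    for (float& v : y) v /= sum;
    return y;
}
```

Calling `SoftmaxOp(MaskOp(ScaleOp(scores, s), keep))` and `FusedScaleMaskSoftmax(scores, keep, s)` on the same row should give matching results. The fused version simply avoids materializing the two intermediate vectors.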

Application Scenarios and Advantages

A Vector operator and a Cube operator that have a data dependency can be fused into a single operator, implemented in one operator kernel function, to improve performance. The following figure compares the execution duration of independent Vector and Cube operators with that of a mix fused operator, and shows where the performance gain of the mix fused operator comes from.

Figure 2 Execution duration comparison between independent vector operators, cube operators, and mix fused operators
  • Independent vector and cube operators: The Cube computation result must be moved to the global memory and then to the local memory for Vector computation. Computation and data movement are performed serially, and scheduling multiple operators increases the total scheduling time.
  • Fused operator: The data can be tiled, and the Vector Unit and the Cube Unit then compute in parallel through pipeline design. Scheduling one fused operator is also more efficient than scheduling multiple single operators.
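
The comparison above can be approximated with a small cost model. The sketch below is an idealized illustration in plain C++, not a measurement. The struct, its fields, and the numbers in the usage note are all assumptions. It treats the fused operator as a two-stage pipeline (Cube, then Vector) with enough buffering that the slower stage sets the steady-state rate, and it ignores data movement inside the fused operator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical per-tile costs in arbitrary time units.
struct TileCost {
    std::uint64_t cube;         // Cube compute time for one tile
    std::uint64_t vector;       // Vector compute time for one tile
    std::uint64_t gmRoundTrip;  // write Cube result to global memory and read it back
    std::uint64_t launch;       // scheduling overhead of one operator launch
};

// Two independent operators: two launches, and every tile pays
// Cube compute + global-memory round trip + Vector compute in series.
std::uint64_t SeparateOperatorsTime(const TileCost& c, std::uint64_t tiles) {
    return 2 * c.launch + tiles * (c.cube + c.gmRoundTrip + c.vector);
}

// One fused operator: a single launch and a two-stage pipeline.
// The first tile fills the pipeline, the last tile drains it, and the
// slower stage determines the time per tile in between.
std::uint64_t FusedOperatorTime(const TileCost& c, std::uint64_t tiles) {
    assert(tiles > 0);
    return c.launch + c.cube + (tiles - 1) * std::max(c.cube, c.vector) + c.vector;
}
```

With the assumed values `TileCost{4, 3, 2, 5}` and 8 tiles, the model gives 2*5 + 8*(4+2+3) = 82 units for separate operators and 5 + 4 + 7*4 + 3 = 40 units for the fused operator. Real gains depend on the hardware and on how well the two stages are balanced.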

In addition to effectively improving operator performance and fully utilizing the computing power of AI processors, fused operators have the following advantages:

  • Reduced computation amount: Combining multiple operators into one can simplify the compute process, reduce the amount of computation, and improve computing efficiency.
  • Reduced memory usage: A fused operator can combine the intermediate results of multiple operators into one to reduce memory usage and improve resource efficiency.
  • Optimized data flows: A fused operator can optimize data flows and reduce data transmissions between different operators, improving data processing efficiency.
  • Simplified code implementation: Fused operators can simplify code implementation, reduce the code amount, and improve code readability and maintainability.

In summary, a fused operator can streamline computation, improve computing efficiency and memory utilization, optimize data flows, and simplify code implementation.

Programming Paradigm

Ascend C provides a fused operator programming paradigm. You can use it to express the data flow of a fused operator and quickly implement custom fused operators.

The data flow of a fused operator describes how its inputs and outputs move between storage locations. Take a typical Cube and Vector fused operator as an example. The following figure shows the data flows between logical locations.

  • The output of the cube computation can be used as the input of the vector computation: CO2 -> VECIN.
  • The output of the vector computation can be used as the input of the cube computation: VECOUT -> A1 -> A2 or VECOUT -> B1 -> B2.
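
The two data flows listed above can be written down as a tiny table of allowed hops. The sketch below is plain C++ for illustration only. The `Position` enum is a hypothetical stand-in that reuses the logical location names from the text, not the Ascend C type, and it encodes only the hops that the two bullets mention.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the logical locations named in the text.
enum class Position { CO2, VECIN, VECOUT, A1, A2, B1, B2 };

// True only for the single hops that appear in the two listed data flows.
bool IsListedHop(Position from, Position to) {
    switch (from) {
        case Position::CO2:    return to == Position::VECIN;
        case Position::VECOUT: return to == Position::A1 || to == Position::B1;
        case Position::A1:     return to == Position::A2;
        case Position::B1:     return to == Position::B2;
        default:               return false;
    }
}

// A path is accepted when it has at least two locations and every hop is listed.
bool IsListedPath(const std::vector<Position>& path) {
    if (path.size() < 2) return false;
    for (std::size_t i = 0; i + 1 < path.size(); ++i) {
        if (!IsListedHop(path[i], path[i + 1])) return false;
    }
    return true;
}

// Usage example: the flows from the text pass, a reversed hop does not.
void CheckListedPaths() {
    const std::vector<Position> cubeToVector = {Position::CO2, Position::VECIN};
    const std::vector<Position> vectorToA = {Position::VECOUT, Position::A1, Position::A2};
    const std::vector<Position> reversed = {Position::VECIN, Position::CO2};
    assert(IsListedPath(cubeToVector));
    assert(IsListedPath(vectorToA));
    assert(!IsListedPath(reversed));
}
```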

With the fused operator programming paradigm built on the MatMul high-level APIs, the preceding data flows are simplified as follows.
Figure 3 Fused operator programming paradigm
  1. Initialize a MatMul object and move the input data from the global memory to the AI Cube core.
  2. Perform the internal MatMul compute.
  3. Move the MatMul compute result to the AI Vector core.
  4. Perform the vector compute.
  5. Move the output to the global memory.

The following pseudocode outlines the entire process. For the complete example, see MatmulLeakyRelu.

template<typename aType, typename bType, typename cType, typename biasType>
__aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::Process()
{
    // Step 1: Initialize a MatMul object and move the input data from the global memory to the AI Cube core.
    uint32_t computeRound = 0;
    REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), matmulObj);
    matmulObj.Init(&tiling);
    matmulObj.SetTensorA(aGlobal);
    matmulObj.SetTensorB(bGlobal);
    matmulObj.SetBias(biasGlobal);
    
    while (matmulObj.template Iterate<true>()) { // Step 2: Perform the internal MatMul compute.
        // Step 3: Move the MatMul compute result to the AI Vector core.
        reluOutLocal = reluOutQueue_.AllocTensor<cType>();
        matmulObj.template GetTensorC<true>(reluOutLocal, false, true);
        // Step 4: Perform the vector compute.
        AscendC::LeakyRelu(reluOutLocal, reluOutLocal, (cType)alpha, tiling.baseM * tiling.baseN);
        reluOutQueue_.EnQue(reluOutLocal);
        // Step 5: Move the output to the global memory.
        reluOutLocal = reluOutQueue_.DeQue<cType>();
        ...
        AscendC::DataCopy(cGlobal[startOffset], reluOutLocal, copyParam);
        reluOutQueue_.FreeTensor(reluOutLocal);

        computeRound++;
    }
    matmulObj.End();
}
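
When validating a kernel like the one above, a CPU reference that follows the same tile-by-tile order can be handy. The following is a minimal sketch in plain C++, not part of the MatmulLeakyRelu sample. The function name, the row-major layout, the bias of length N added to every row, and the tile offset arithmetic are all assumptions made for illustration. In particular, the offset arithmetic is not the code elided by `...` in the pseudocode.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// LeakyReLU on a single value.
inline float LeakyRelu(float x, float alpha) { return x >= 0.0f ? x : alpha * x; }

// Computes c = LeakyRelu(a * b + bias) for row-major a[M x K], b[K x N], bias[N],
// one baseM x baseN output tile at a time.
std::vector<float> MatmulLeakyReluReference(const std::vector<float>& a,
                                            const std::vector<float>& b,
                                            const std::vector<float>& bias,
                                            std::size_t M, std::size_t N, std::size_t K,
                                            std::size_t baseM, std::size_t baseN,
                                            float alpha) {
    assert(a.size() == M * K && b.size() == K * N && bias.size() == N);
    assert(baseM > 0 && baseN > 0 && M % baseM == 0 && N % baseN == 0);
    std::vector<float> c(M * N);
    std::vector<float> tile(baseM * baseN);  // plays the role of the local result tensor
    for (std::size_t m0 = 0; m0 < M; m0 += baseM) {
        for (std::size_t n0 = 0; n0 < N; n0 += baseN) {
            // Steps 2 and 3: MatMul (with bias) for one tile, result kept in the tile buffer.
            for (std::size_t i = 0; i < baseM; ++i) {
                for (std::size_t j = 0; j < baseN; ++j) {
                    float acc = bias[n0 + j];
                    for (std::size_t k = 0; k < K; ++k)
                        acc += a[(m0 + i) * K + k] * b[k * N + (n0 + j)];
                    tile[i * baseN + j] = acc;
                }
            }
            // Step 4: vector compute on the whole tile.
            for (float& v : tile) v = LeakyRelu(v, alpha);
            // Step 5: copy the tile to its place in the output matrix.
            for (std::size_t i = 0; i < baseM; ++i)
                for (std::size_t j = 0; j < baseN; ++j)
                    c[(m0 + i) * N + (n0 + j)] = tile[i * baseN + j];
        }
    }
    return c;
}
```

Comparing the kernel output against this reference within a small tolerance is one way to check the fused operator. The tolerance needed depends on the data types used on the device.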