Basics

Before learning fused operator programming, make sure that you are familiar with cube programming.

CV Fused Operator

A fused operator combines multiple independent small operators into one. It is functionally equivalent to those operators but usually performs better. Vector and cube operators can be fused for specific algorithms to gain performance. An operator that fuses cube and vector compute is called a CV fused operator.

For example, the following figure shows the core implementation of Flash Attention, a key fused operator in large language models (LLMs). The MatMul operator (Cube), Scale operator (Vector), Mask operator (Vector), and SoftMax operator (Vector) in the figure are fused into one large operator, Flash Attention.

Figure 1 Core implementation of Flash Attention
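As a rough illustration of what the four stages compute when they run one after another, the following host-side C++ sketch applies MatMul, Scale, Mask, and SoftMax to small row-major matrices. It is a plain CPU reference and not Ascend C kernel code. The function name AttentionScores, the mask convention (nonzero means masked out), and the scale argument are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical CPU reference: softmax(mask(scale * Q * K^T)), computed row by row.
// q is [m x d], k is [n x d], mask is [m x n] (nonzero = masked out); all row-major.
// Assumes every row keeps at least one unmasked column.
std::vector<float> AttentionScores(const std::vector<float>& q,
                                   const std::vector<float>& k,
                                   const std::vector<int>& mask,
                                   std::size_t m, std::size_t n, std::size_t d,
                                   float scale)
{
    assert(q.size() == m * d && k.size() == n * d && mask.size() == m * n);
    const float negInf = -std::numeric_limits<float>::infinity();
    std::vector<float> out(m * n, 0.0f);

    for (std::size_t i = 0; i < m; ++i) {
        float rowMax = negInf;
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t t = 0; t < d; ++t) {
                acc += q[i * d + t] * k[j * d + t];  // MatMul (cube-style work)
            }
            acc *= scale;                            // Scale (vector-style work)
            if (mask[i * n + j] != 0) {
                acc = negInf;                        // Mask (vector-style work)
            }
            out[i * n + j] = acc;
            rowMax = std::max(rowMax, acc);
        }
        assert(std::isfinite(rowMax));               // at least one unmasked column

        // SoftMax (vector-style work), stabilized by subtracting the row maximum.
        float sum = 0.0f;
        for (std::size_t j = 0; j < n; ++j) {
            out[i * n + j] = std::exp(out[i * n + j] - rowMax);
            sum += out[i * n + j];
        }
        for (std::size_t j = 0; j < n; ++j) {
            out[i * n + j] /= sum;
        }
    }
    return out;
}
```

In this unfused form, every stage reads and writes the full intermediate matrix. A fused operator aims to produce the same values while keeping intermediate tiles on chip.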

Application Scenarios and Advantages

A vector operator and a cube operator that have a data dependency can be fused into a single operator, implemented by one operator kernel function, to gain performance. The following figure compares the execution duration of independent vector and cube operators with that of a mix fused operator and shows where the performance gain comes from.

Figure 2 Execution duration comparison between independent vector operators, cube operators, and mix fused operators

Implementation of independent vector operators and cube operators: The Cube compute result must be moved to the global memory and then to the local memory before Vector compute can start. Compute and data movement run serially, and scheduling multiple operators increases the total scheduling duration.
Implementation of fused operators: Data is tiled, and the Vector Unit and the Cube Unit compute in parallel through pipeline design. Scheduling one fused operator is also more efficient than scheduling several single operators.
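The difference between the two approaches can be estimated with a simple two-stage pipeline model. The sketch below is a host-side C++ calculation, not a measurement. The struct TileCost, both function names, and all cost figures are hypothetical. The model assumes that every tile costs the same, that the Cube and Vector stages of neighboring tiles overlap perfectly, and that any transfer cost inside the fused operator is already included in the per-tile stage costs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical per-tile costs in arbitrary time units.
struct TileCost {
    uint32_t cube;         // Cube compute for one tile
    uint32_t vector;       // Vector compute for one tile
    uint32_t gmRoundTrip;  // write the Cube result to global memory and read it back
    uint32_t launch;       // scheduling overhead of one operator launch
};

// Two independent operators: two launches, and every step runs serially.
uint64_t SerialTime(const TileCost& c, uint32_t tiles)
{
    const uint64_t perTile = static_cast<uint64_t>(c.cube) + c.gmRoundTrip + c.vector;
    return 2ull * c.launch + static_cast<uint64_t>(tiles) * perTile;
}

// One fused operator: a single launch, and the Vector stage of tile i
// overlaps with the Cube stage of tile i + 1 (classic two-stage pipeline).
uint64_t FusedTime(const TileCost& c, uint32_t tiles)
{
    assert(tiles > 0);
    const uint64_t slowest = std::max(c.cube, c.vector);
    return static_cast<uint64_t>(c.launch) + c.cube
           + static_cast<uint64_t>(tiles - 1) * slowest + c.vector;
}
```

With the hypothetical costs cube = 4, vector = 3, gmRoundTrip = 2, launch = 5 and 8 tiles, the model gives 82 time units for the independent operators and 40 for the fused operator. Real gains depend on the actual tile shapes and hardware.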

In addition to effectively improving operator performance and fully utilizing the computing power of AI processors, fused operators have the following advantages:

Reduced compute amount: Combining multiple operators into one can simplify the compute process, reduce the compute amount, and improve compute efficiency.
Reduced memory usage: A fused operator can combine the intermediate results of multiple operators into one to reduce memory usage and improve resource efficiency.
Optimized data flows: A fused operator can optimize data flows and reduce data transmissions between different operators, improving data processing efficiency.
Simplified code implementation: Fused operators can simplify code implementation, reduce the code amount, and improve code readability and maintainability.

The fused operator can effectively optimize compute, improve compute efficiency and memory usage, optimize data flows, and simplify code implementation.

Programming Paradigm

Ascend C provides a fused operator programming paradigm. You can use it to express the data flow of a fused operator and quickly implement custom fused operators.

The fused operator data flow refers to the flow direction of the input and output of the fused operator between storage locations. Take a typical cube and vector fused operator as an example. The following figure shows the data flows between logical locations.

The output of cube can be used as the input of vector: CO2 -> VECIN.
The output of vector can be used as the input of cube: VECOUT -> A1 -> A2 or VECOUT -> B1 -> B2.
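The two flows above can be written down as a small transition table. The following C++ sketch is only a way to restate them. The enum Pos and the functions IsListedHop and IsListedPath are hypothetical names for this example, they do not use the Ascend C API, and the table encodes only the hops named in this section rather than every transfer the hardware supports.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical labels for the logical locations mentioned in the text.
enum class Pos { CO2, VECIN, VECOUT, A1, A2, B1, B2 };

// Returns true only for the single hops listed in this section.
bool IsListedHop(Pos from, Pos to)
{
    switch (from) {
        case Pos::CO2:    return to == Pos::VECIN;                  // cube output -> vector input
        case Pos::VECOUT: return to == Pos::A1 || to == Pos::B1;    // vector output -> cube input
        case Pos::A1:     return to == Pos::A2;
        case Pos::B1:     return to == Pos::B2;
        default:          return false;
    }
}

// Checks that every consecutive pair in a path is a listed hop.
bool IsListedPath(const std::vector<Pos>& path)
{
    assert(path.size() >= 2);
    for (std::size_t i = 0; i + 1 < path.size(); ++i) {
        if (!IsListedHop(path[i], path[i + 1])) {
            return false;
        }
    }
    return true;
}
```

For example, the path VECOUT, A1, A2 passes the check, while a direct hop from VECOUT to A2 does not, because the text routes it through A1.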

Based on the fused operator programming paradigm of the MatMul high-level APIs, the preceding data flows are simplified as follows.

Figure 3 Fused operator programming paradigm

Initialize a MatMul object and move the input data from the global memory to the AI Cube core.
Perform the internal MatMul compute.
Move the MatMul compute result to the AI Vector core.
Perform the vector compute.
Move the output to the global memory.

The sample code (pseudocode) of the entire process is as follows. For details, see MatmulLeakyRelu.

template<typename aType, typename bType, typename cType, typename biasType>
__aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::Process()
{
    // Step 1: Initialize a MatMul object and move the input data from the global memory to the AI Cube core.
    uint32_t computeRound = 0;
    REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), matmulObj);
    matmulObj.Init(&tiling);
    matmulObj.SetTensorA(aGlobal);
    matmulObj.SetTensorB(bGlobal);
    matmulObj.SetBias(biasGlobal);
    
    while (matmulObj.template Iterate<true>()) { // Step 2: Perform the internal MatMul compute.
        // Step 3: Move the MatMul compute result to the AI Vector core.
        reluOutLocal = reluOutQueue_.AllocTensor<cType>();
        matmulObj.template GetTensorC<true>(reluOutLocal, false, true);
        // Step 4: Perform the vector compute.
        AscendC::LeakyRelu(reluOutLocal, reluOutLocal, (cType)alpha, tiling.baseM * tiling.baseN);
        reluOutQueue_.EnQue(reluOutLocal);
        // Step 5: Move the output to the global memory.
        reluOutLocal = reluOutQueue_.DeQue<cType>();
        ...
        AscendC::DataCopy(cGlobal[startOffset], reluOutLocal, copyParam);
        reluOutQueue_.FreeTensor(reluOutLocal);

        computeRound++;
    }
    matmulObj.End();
}
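To make the numerical effect of the fused loop concrete, the following host-side C++ sketch mirrors its structure on the CPU. Each iteration produces one baseM x baseN block of A * B + bias, applies LeakyRelu to that block, and writes the block to the output. It does not use the Ascend C API. The function name FusedMatmulLeakyRelu, the row-major layouts, the block traversal order, and the requirement that M and N are multiples of the block sizes are assumptions made for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for C = LeakyRelu(A * B + bias), processed block by block.
// a is [M x K], b is [K x N], bias is [N], the result is [M x N]; all row-major.
std::vector<float> FusedMatmulLeakyRelu(const std::vector<float>& a,
                                        const std::vector<float>& b,
                                        const std::vector<float>& bias,
                                        std::size_t M, std::size_t N, std::size_t K,
                                        std::size_t baseM, std::size_t baseN,
                                        float alpha)
{
    assert(a.size() == M * K && b.size() == K * N && bias.size() == N);
    assert(baseM > 0 && baseN > 0 && M % baseM == 0 && N % baseN == 0);

    std::vector<float> c(M * N, 0.0f);
    std::vector<float> block(baseM * baseN);  // stands in for the local result tensor

    for (std::size_t m0 = 0; m0 < M; m0 += baseM) {
        for (std::size_t n0 = 0; n0 < N; n0 += baseN) {
            // Cube-style stage: one block of A * B + bias.
            for (std::size_t i = 0; i < baseM; ++i) {
                for (std::size_t j = 0; j < baseN; ++j) {
                    float acc = bias[n0 + j];
                    for (std::size_t k = 0; k < K; ++k) {
                        acc += a[(m0 + i) * K + k] * b[k * N + (n0 + j)];
                    }
                    block[i * baseN + j] = acc;
                }
            }
            // Vector-style stage: LeakyRelu on the block, in place.
            for (float& x : block) {
                if (x < 0.0f) {
                    x *= alpha;
                }
            }
            // Copy-out stage: write the block to its place in the full result.
            for (std::size_t i = 0; i < baseM; ++i) {
                for (std::size_t j = 0; j < baseN; ++j) {
                    c[(m0 + i) * N + (n0 + j)] = block[i * baseN + j];
                }
            }
        }
    }
    return c;
}
```

Because LeakyRelu is applied element by element, the result does not depend on the block size. This is the property that lets the vector stage start on one block while the next block is still being computed.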
