Basics
Before learning fusion operator programming, ensure that you are familiar with Cube operator programming.
Fusion Operator
A fusion operator (also called a fused operator) combines multiple independent small operators into one. It is functionally equivalent to those operators but usually delivers better performance. Vector and Cube operators can be fused based on specific algorithms to achieve performance gains.
For example, the following figure shows the core implementation of Flash Attention, a key fusion operator in LLMs. The Matmul operator (Cube), Scale operator (Vector), Mask operator (Vector), and SoftMax operator (Vector) in the figure are fused into one large operator, Flash Attention; a minimal scalar sketch of these stages follows.

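For intuition only, here is a hedged scalar reference of the stages named above. It is neither the tiled Flash Attention algorithm nor Ascend C code; the function name, shapes, and the additive mask are assumptions made for illustration.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Scalar reference of the fused stages: Matmul (Cube) -> Scale (Vector) ->
// Mask (Vector) -> SoftMax (Vector) -> Matmul (Cube). Q, K, V are n x d;
// mask is an n x n additive mask (assumption for illustration).
Matrix AttentionReference(const Matrix& Q, const Matrix& K, const Matrix& V,
                          const Matrix& mask, float scale)
{
    const std::size_t n = Q.size(), d = Q[0].size();
    Matrix S(n, std::vector<float>(n, 0.0f));
    // Matmul (Cube): S = Q * K^T, then Scale and Mask (Vector).
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < d; ++k) acc += Q[i][k] * K[j][k];
            S[i][j] = acc * scale + mask[i][j];
        }
    }
    // SoftMax (Vector): row-wise, stabilized by subtracting the row maximum.
    for (std::size_t i = 0; i < n; ++i) {
        float rowMax = S[i][0];
        for (std::size_t j = 1; j < n; ++j) rowMax = std::max(rowMax, S[i][j]);
        float sum = 0.0f;
        for (std::size_t j = 0; j < n; ++j) { S[i][j] = std::exp(S[i][j] - rowMax); sum += S[i][j]; }
        for (std::size_t j = 0; j < n; ++j) S[i][j] /= sum;
    }
    // Matmul (Cube): O = S * V.
    Matrix O(n, std::vector<float>(d, 0.0f));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < d; ++k) O[i][k] += S[i][j] * V[j][k];
    return O;
}

The fused implementation computes these stages tile by tile instead of materializing the full intermediate matrix in global memory, which is exactly the kind of intermediate-result saving that fusion enables.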
Application Scenarios and Advantages of the Fusion Operator
When an operator has high performance requirements, a Vector operator and a Cube operator can be fused into a single fusion operator, carried by one operator kernel function, to achieve performance gains. The following figure compares the execution duration of the independent Vector operator and Cube operator with that of the fused (mix) operator; the comparison shows why developing a fused operator brings performance gains.

- Independent Vector and Cube operators: The Cube compute result must be moved to the global memory and then back to the local memory before the Vector computation can start, so computation and data movement run strictly in sequence. Scheduling and executing multiple operators also increases the total scheduling time.
- Fusion operator: Data is tiled, and the Vector unit and the Cube unit compute in parallel through pipeline design. Scheduling one fusion operator is also more efficient than scheduling several single operators; the toy timing sketch after this list illustrates the gain.
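As rough intuition only, here is a toy timing model with assumed numbers, not measured data: with T tiles, a Cube time c and a Vector time v per tile, running two separate operators costs about T * (c + v) plus two launch overheads, while a fused two-stage pipeline completes one tile every max(c, v) in steady state.

#include <algorithm>
#include <cstdio>

int main()
{
    // Assumed, illustrative numbers only.
    const int tiles = 8;
    const double cube = 2.0;   // Cube time per tile
    const double vec = 1.0;    // Vector time per tile
    const double launch = 0.5; // per-operator scheduling/launch overhead

    // Two independent operators: no overlap between Cube and Vector work.
    const double serial = 2 * launch + tiles * (cube + vec);
    // One fused operator: tiles flow through a Cube -> Vector pipeline
    // (pipeline fill + steady state + drain).
    const double fused = launch + cube + (tiles - 1) * std::max(cube, vec) + vec;

    std::printf("serial = %.1f, fused = %.1f\n", serial, fused); // serial = 25.0, fused = 17.5
    return 0;
}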
In addition to effectively improving operator performance and fully utilizing the computing power of AI processors, fusion operators have the following advantages:
- Reduced computation: Combining multiple operators can simplify the compute process, reducing the amount of computation and improving compute efficiency.
- Reduced memory usage: A fusion operator can merge the intermediate results of multiple operators instead of materializing each of them, reducing memory usage.
- Optimized data flow: A fusion operator can optimize the data flow and reduce data copies between operators, improving data processing efficiency.
- Simplified code: A fusion operator can simplify the implementation, reduce the amount of code, and improve readability and maintainability.
In summary, fusion operators effectively optimize computation, improve compute efficiency and memory usage, optimize data flow, and simplify code implementation.
Programming Paradigm
Ascend C provides a fusion operator programming paradigm so that you can express the data flow of a fusion operator based on the paradigm and quickly implement your own fusion operator.
The fusion operator data flow refers to how the inputs and outputs of the fusion operator move between storage locations. Taking a typical Cube and Vector fusion operator as an example, the following figure shows the data flow between logical positions; a queue-declaration sketch follows the list below.
- The output of the Cube unit can serve as the input of the Vector unit: CO2 -> VECIN.
- The output of the Vector unit can serve as the input of the Cube unit: VECOUT -> A1 -> A2 or VECOUT -> B1 -> B2.
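In Ascend C, these logical positions appear as TPosition template arguments when queues are declared. A minimal sketch, assuming the queue names, a single buffer per queue, and a tileBytes size chosen for illustration:

#include "kernel_operator.h"

constexpr int32_t BUFFER_NUM = 1;

// Bind queues to the logical positions used in the data flow above.
__aicore__ inline void InitFusionQueues(
    AscendC::TPipe& pipe,
    AscendC::TQue<AscendC::TPosition::VECIN, BUFFER_NUM>& vecInQueue,
    AscendC::TQue<AscendC::TPosition::VECOUT, BUFFER_NUM>& vecOutQueue,
    uint32_t tileBytes)
{
    pipe.InitBuffer(vecInQueue, BUFFER_NUM, tileBytes);  // receives Cube output (CO2 -> VECIN)
    pipe.InitBuffer(vecOutQueue, BUFFER_NUM, tileBytes); // holds Vector output before it feeds Cube (VECOUT -> A1/B1)
}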


Taking the Matmul + LeakyRelu fusion operator as an example, the overall process is as follows:
1. Initialize a Matmul object and move the input data from the global memory to the Cube core.
2. Perform the internal Matmul computation.
3. Move the Matmul compute result to the Vector core.
4. Perform the Vector computation.
5. Move the output to the global memory.
The sample code (pseudocode) of the entire process is as follows:
template<typename aType, typename bType, typename cType, typename biasType>
__aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::Process()
{
    // Step 1: Initialize a Matmul object and move the input data from the global memory to the Cube core.
    uint32_t computeRound = 0;
    REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), matmulObj);
    matmulObj.Init(&tiling);
    matmulObj.SetTensorA(aGlobal);
    matmulObj.SetTensorB(bGlobal);
    matmulObj.SetBias(biasGlobal);
    while (matmulObj.template Iterate<true>()) {
        // Step 2: Perform the internal Matmul computation.
        // Step 3: Move the Matmul compute result to the Vector core.
        reluOutLocal = reluOutQueue_.AllocTensor<cType>();
        matmulObj.template GetTensorC<true>(reluOutLocal, false, true);
        // Step 4: Perform the Vector computation.
        AscendC::LeakyRelu(reluOutLocal, reluOutLocal, (cType)alpha, tiling.baseM * tiling.baseN);
        reluOutQueue_.EnQue(reluOutLocal);
        // Step 5: Move the output to the global memory.
        reluOutQueue_.DeQue<cType>();
        ...
        AscendC::DataCopy(cGlobal[startOffset], reluOutLocal, copyParam);
        reluOutQueue_.FreeTensor(reluOutLocal);
        computeRound++;
    }
    matmulObj.End();
}
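Note how the kernel realizes the data flow described earlier: GetTensorC fetches each Matmul result block into Vector local memory (the CO2 -> VECIN direction), LeakyRelu computes on it in place, and DataCopy moves the result to the global memory. The EnQue/DeQue calls on reluOutQueue_ synchronize these stages tile by tile, allowing Cube and Vector work to overlap across iterations.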