Overview of Matmul Performance Optimization Strategies

This section provides a series of performance tuning cases for operators that involve Matmul computation. You can apply the optimization methods and ideas from these cases to your own scenarios. Table 1 summarizes the optimization categories, the scenario each strategy targets, and the corresponding case. The sections that follow describe each case in detail.

**Table 1** Overview of Matmul performance optimization strategies
Category	Subcategory	Application Scenario	Case
Tiling optimization	Tiling optimization: Optimize how the computation is tiled across cores and into basic blocks.	Large-shape scenarios with a sufficiently large data size.	Tiling Strategy for Matmul Operator Optimization
Parallelism optimization	Inter-core task parallelism: Properly allocate data to different cores to execute tasks.	Scenarios where the K-axis of the matrix is large and the M-axis and N-axis are smaller than the K-axis.	Matmul High-level API Enabling K-axis Tiling of Matrix Data in Multi-core Parallel Computation
	Inter-core data access parallelism: Optimize the multi-core parallel data access mechanism, for example by reducing address access conflicts when multiple cores access the same memory data, to improve multi-core data access efficiency.	Scenarios where Matmul is executed on multiple cores, the K-axis of the input matrix is large, and the K-axis is not fully loaded.	Matmul High-level API Enabling Multi-core K-axis Staggered Access to Device Memory
	Intra-core pipeline parallelism: Different instruction queues can be executed independently and in parallel, which can be used to optimize intra-core pipeline parallelism.	The MMAD pipeline and FIXPIPE pipeline of the operator are executed in serial mode. The synchronization waiting time accounts for a large proportion of the total execution time of the operator.	Matmul High-level API Enabling UnitFlag
		Scenarios where the operator is MTE2-bound and the MTE2 pipeline is executed in serial mode with other pipelines.	Matmul High-level API Enabling NBuffer33 Template
Memory optimization	Memory sharing and reuse: Reduce the overhead caused by repeated data movement through buffer sharing and cache reuse.	MIX scenarios where multiple AIVs use the same GM address for matrix A or matrix B, and the matrix A or matrix B reused by the AIVs is fully loaded on L1 Buffer.	Matmul High-level API Enabling IBShare Template for Sharing Matrix A and Matrix B Data; Matmul High-level API Enabling IBShare Template for Sharing Matrix B Data
Memory optimization	Memory alignment: Ensure that the processed data meets specific alignment requirements. Use different data movement strategies for unaligned data to improve the data movement efficiency.	Scenarios where the axis in the input matrix is not 256-byte aligned and the data size is large.	ND2NZ Format Conversion on AIVs
Scalar optimization	Static tiling: Complete Matmul tiling computation during kernel compilation. Convert variables into constants and propagate them at compile time to reduce Scalar computation and improve performance.	Scenarios where a large number of Scalar computations are performed during Matmul initialization, which increases the instruction header overhead, or between Matmul iterations, which blocks the MTE2 pipeline.	Matmul High-level API Enabling Full Static Tiling
Scalar optimization	CUBE_ONLY: Reduce the extra Scalar overhead caused by the message processing mechanism.	Scenarios where, unlike in MIX mode, no Vector computation is performed and only Cube computation is performed.	Matmul High-level API Enabling CUBE_ONLY
Movement optimization	Movement throughput optimization: Properly control the size of the data block to be moved to improve the bandwidth utilization and movement efficiency.	Large-shape scenarios with a large number of MTE2 cyclic movements.	Matmul High-level API Enabling MDL Template
		Scenarios where the size of input and output data exceeds the L2 cache size.	Matmul High-level API Enabling L2 Cache Tiling
	Preloading movement: Preload the data blocks to be moved to reduce the gap between pipelines.	Scenarios where the MTE2 pipeline gap is large and the value of M or N is large.	Matmul High-level API Enabling MTE2 Preload
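
The K-axis tiling row in Table 1 targets shapes where K is large and M and N are small. The sketch below illustrates why this can help: tiling only M and N yields few blocks to distribute, whereas slicing K gives each core a partial product that is summed afterwards. This is plain Python for illustration only and does not use the Matmul high-level API. The function names and the `num_cores` parameter are hypothetical, and K is assumed to be greater than zero.

```python
# Illustrative only: plain Python, not the Matmul high-level API.
# All names below (matmul, split_k_ranges, split_k_matmul) are hypothetical.

def matmul(a, b):
    """Reference product of a (M x K) and b (K x N), both lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def split_k_ranges(k, num_cores):
    """Split [0, k) into at most num_cores contiguous, near-equal ranges."""
    chunk = -(-k // num_cores)  # ceiling division
    return [(start, min(start + chunk, k)) for start in range(0, k, chunk)]


def split_k_matmul(a, b, num_cores):
    """Each simulated core multiplies one K slice; partial results are summed."""
    m, n = len(a), len(b[0])
    c = [[0] * n for _ in range(m)]
    for k0, k1 in split_k_ranges(len(b), num_cores):
        a_slice = [row[k0:k1] for row in a]   # M x (k1 - k0)
        b_slice = b[k0:k1]                    # (k1 - k0) x N
        partial = matmul(a_slice, b_slice)    # work assigned to one core
        for i in range(m):
            for j in range(n):
                c[i][j] += partial[i][j]      # accumulate across cores
    return c


# Small M and N, large K: the shape this strategy targets.
M, K, N = 2, 64, 3
A = [[(i + p) % 5 for p in range(K)] for i in range(M)]
B = [[(p * j + 1) % 7 for j in range(N)] for p in range(K)]
C = split_k_matmul(A, B, num_cores=8)
```

The sketch runs the slices sequentially. On real hardware the slices would run on different cores, and the accumulation step is where the cost of combining partial results appears.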

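The memory alignment row in Table 1 applies when an axis of the input matrix is not 256-byte aligned. As a rough way to check whether a shape falls into that scenario, the helper below tests the byte size of one row and computes the padded length. It assumes the axis in question is the innermost (contiguous) one and that the element size divides 256. The function names are hypothetical, and the sketch does not perform the ND2NZ conversion itself.

```python
# Illustrative only: helper names are hypothetical, not part of any Ascend C API.
ALIGN_BYTES = 256  # alignment granularity named in Table 1


def is_aligned(num_elements, dtype_bytes, align=ALIGN_BYTES):
    """True if one row of num_elements occupies a multiple of align bytes."""
    return (num_elements * dtype_bytes) % align == 0


def padded_elements(num_elements, dtype_bytes, align=ALIGN_BYTES):
    """Smallest element count >= num_elements whose row size is a
    multiple of align bytes (assumes dtype_bytes divides align)."""
    row_bytes = num_elements * dtype_bytes
    padded_bytes = -(-row_bytes // align) * align  # round up to next multiple
    return padded_bytes // dtype_bytes


# A row of 1000 half-precision (2-byte) elements is 2000 bytes: not aligned.
needs_handling = not is_aligned(1000, 2)
padded = padded_elements(1000, 2)
```

For example, a row of 1000 two-byte elements occupies 2000 bytes, which is not a multiple of 256, so the row would need 1024 elements (2048 bytes) to be aligned.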