Matmul Performance Optimization Policies

This section presents performance tuning cases for operators that involve Matmul computation. You can adapt the optimization methods and reasoning in these cases to your own scenarios. Table 1 lists the cases by category and summarizes when each applies. For details, see the following sections.

Table 1 Overview of Matmul performance optimization policies

| Category | Subcategory | Application Scenario | Case |
| --- | --- | --- | --- |
| Tiling optimization | Tiling optimization: Optimize the tiling strategy for partitioning work across cores and splitting basic blocks. | Large-shape scenarios with sufficient data volume. | |
| Parallelism optimization | Inter-core task parallelism: Allocate data across cores so that every core has a balanced share of the work. | Scenarios where the K axis of the matrix is large and the M and N axes are smaller than the K axis. | |
| | Inter-core data access parallelism: Optimize how multiple cores access data in parallel, for example by avoiding address conflicts when several cores access the same memory data, to improve multi-core data access efficiency. | Scenarios where Matmul runs on multiple cores, the K axis of the input matrix is large, and the K axis is not fully loaded. | |
| | Intra-core pipeline parallelism: Exploit the independence of the different instruction queues, which can execute in parallel, to increase pipeline parallelism within a core. | Scenarios where the MMAD and FIXPIPE pipelines of the operator run serially and synchronization waits account for a large share of the operator's total execution time. | |
| | | Scenarios where the operator is MTE2-bound and the MTE2 pipeline runs serially with the other pipelines. | |
| Memory optimization | Memory sharing and reuse: Reduce the overhead of repeated data movement through buffer sharing and cache reuse. | MIX scenarios where multiple AIVs read the A or B matrix from the same GM address and that matrix is fully loaded into the L1 buffer. | |
| | Memory alignment: Ensure that the processed data meets specific alignment requirements. Use a different data movement policy for unaligned data to improve data movement efficiency. | Scenarios where an axis of the input matrix is not 256-byte aligned and the data volume is large. | |
| Scalar optimization | Tiling constantization: Complete the Matmul tiling computation at kernel compile time. Variables become constants, and constant propagation reduces the amount of scalar computation at run time. | Scenarios with heavy scalar computation during Matmul initialization, which adds to the instruction head overhead, or between Matmul iterations, which blocks the MTE2 pipeline. | |
| | Cube-only mode: Reduce the extra scalar overhead introduced by the message processing mechanism. | Scenarios that, unlike MIX mode, perform only matrix computation and no vector computation. | |
| Transfer optimization | Transfer throughput optimization: Choose the size of each transferred data block to improve bandwidth utilization and transfer efficiency. | Large-shape scenarios where MTE2 repeats its transfer loop many times. | |
| | | Scenarios where the input and output data volume exceeds the size of the L2 cache. | |
| | Preloading transfer: Preload the data blocks to be transferred to reduce the gaps between pipelines. | Scenarios where the MTE2 pipeline has large gaps and M or N is large. | |
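
As an illustration of the inter-core task parallelism scenario (K much larger than M and N), one possible way to spread the work is to split the K axis across cores and sum the partial results. The sketch below is plain Python, not Ascend C, and it simulates the cores sequentially. The function names `split_k_ranges` and `split_k_matmul` are hypothetical and are not part of any Matmul API. Whether a given case uses this exact scheme is an assumption.

```python
def matmul(a, b):
    """Reference matmul on lists of lists: (M x K) @ (K x N)."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def split_k_ranges(k, num_cores):
    """Partition [0, k) into at most num_cores contiguous, near-equal slices.

    Hypothetical helper: the first (k % num_cores) slices get one extra element.
    """
    base, extra = divmod(k, num_cores)
    ranges, start = [], 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)
        if size == 0:
            break  # more cores than K elements: leave the remaining cores idle
        ranges.append((start, start + size))
        start += size
    return ranges


def split_k_matmul(a, b, num_cores):
    """Compute a @ b by giving each simulated core one slice of the K axis."""
    m, n = len(a), len(b[0])
    out = [[0] * n for _ in range(m)]
    for k0, k1 in split_k_ranges(len(b), num_cores):
        # Work assigned to one core: a full M x N partial product over its K slice.
        a_slice = [row[k0:k1] for row in a]
        b_slice = b[k0:k1]
        partial = matmul(a_slice, b_slice)
        # Reduction step: accumulate the partial product into the final output.
        for i in range(m):
            for j in range(n):
                out[i][j] += partial[i][j]
    return out


if __name__ == "__main__":
    a = [[1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]]  # M=2, K=8
    b = [[i + j for j in range(2)] for i in range(8)]         # K=8, N=2
    print(split_k_ranges(8, 4))
    print(split_k_matmul(a, b, 4) == matmul(a, b))
```

With small M and N there are few output blocks to hand out, so splitting only those axes would leave most cores idle. Splitting K gives every core a slice of the reduction, at the cost of an extra accumulation step over the partial results.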