Optimization Suggestion Overview

**Table 1** Overview of performance optimization suggestions
Category	Description	Optimization Suggestions
Tiling strategy	Provides suggestions to help you select a proper tiling strategy.	Inter-core Load Balancing
Head and tail overhead optimization	Provides optimization suggestions for reducing head and tail overheads (latency incurred before and after the operator performs computation).	Setting a Proper Number of Cores and Operator Kernel Type
		Restricting the Size of the TilingData Structure
		Preventing TPipe from Being Created and Initialized Inside the Object
Pipeline orchestration	Improves hardware resource utilization and achieves higher throughput by means of task parallelization and asynchronous scheduling.	Enabling Double Buffer
		Enabling the Asynchronous Iterate or IterateAll API to Avoid AIC/AIV Synchronization Dependency
Memory access	Maximizes the transfer efficiency by controlling the size of the data block to be transferred and the GM address. Reduces the memory usage and improves the computing efficiency by sharing and reusing buffers, compressing and simplifying data, using dedicated storage space, and optimizing memory access scheduling.	Transferring a Large Data Block at a Time
		Using 512-Byte Alignment for the GM Address
		Using the Data Transfer API Efficiently
		Avoiding Same-Address Access
		Setting a Proper L2 CacheMode
		Using Shared Temporary Buffer for Operators and High-Level APIs
		Reusing VECIN and VECOUT for Movement Operators
		Reducing the Tensor ShapeInfo Dimensions to Optimize the Stack Space
		Avoiding Bank Conflicts in the Unified Buffer
		L2 Cache Tiling
Vector compute	Provides optimization suggestions related to vector compute.	Continuous Vector Computations Through UB Fusion
		Using the Counter Mode for Vector Operators
		Selecting Low-Latency Instructions to Optimize Reduction Operation Performance
Cube compute	Provides optimization suggestions related to Cube compute.	Efficient Bias Computation by Using the BT Buffer
		Efficient Quantization by Storing Quantization Parameters in the FP Buffer
		Efficient Matrix Multiplication Accumulation by Using the L0C Buffer
		Smaller Matrices Residing on the L1 Buffer, Only Larger Matrices Transferred in Batches
		Enabling the AtomicAdd Option for Matmul
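
As a rough illustration of the inter-core load balancing suggestion in Table 1, the sketch below splits a workload across cores so that no core receives more than one block beyond any other. It is plain host-side C++, not Ascend C API code: the `CoreSlice` struct and the `SplitEvenly` function are hypothetical names introduced only for this example, and a real tiling function would also need to account for alignment and buffer-size constraints.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical type for illustration only; not part of any Ascend C API.
struct CoreSlice {
    uint32_t offset;  // index of the first block handled by this core
    uint32_t length;  // number of blocks handled by this core
};

// Hypothetical helper: distributes totalBlocks over coreNum cores.
// Every core gets totalBlocks / coreNum blocks, and the first
// (totalBlocks % coreNum) cores each take one extra block, so the
// per-core workload differs by at most one block.
std::vector<CoreSlice> SplitEvenly(uint32_t totalBlocks, uint32_t coreNum)
{
    std::vector<CoreSlice> slices;
    if (coreNum == 0) {
        return slices;  // nothing to distribute to
    }
    slices.reserve(coreNum);

    const uint32_t base = totalBlocks / coreNum;
    const uint32_t remainder = totalBlocks % coreNum;

    uint32_t offset = 0;
    for (uint32_t i = 0; i < coreNum; ++i) {
        const uint32_t length = base + (i < remainder ? 1u : 0u);
        slices.push_back({offset, length});
        offset += length;
    }
    return slices;
}
```

Spreading the remainder over the leading cores, rather than assigning it all to the last core, keeps the slowest core's workload as small as possible, which is the general idea behind balancing load across cores.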
