Optimization Suggestion Overview

Table 1 Overview of performance optimization suggestions

| Category | Description | Optimization Suggestions |
| --- | --- | --- |
| Tiling strategy | Provides tiling-related optimization suggestions to help you select a proper tiling strategy. | Inter-core Load Balancing |
| Header and tail overhead optimization | Provides optimization suggestions for reducing the header and tail overheads (latency incurred before and after the operator performs compute). | Setting a Proper Number of Cores and Operator Kernel Type |
| | | Restricting the Size of the TilingData Structure |
| | | Preventing TPipe from Being Created and Initialized Inside the Object |
| Pipeline orchestration | Improves hardware resource utilization and achieves higher throughput by means of task parallelization and asynchronous scheduling. | Enabling Double Buffer |
| | | Enabling the Asynchronous Iterate or IterateAll API to Avoid AIC/AIV Synchronization Dependency |
| Memory access | Maximizes transfer efficiency by controlling the size of the data block to be transferred and the alignment of the GM address. Reduces memory usage and improves computing efficiency by sharing and reusing buffers, compressing and simplifying data, using dedicated storage space, and optimizing memory access scheduling. | Transferring a Large Data Block at a Time |
| | | Using 512-Byte Alignment for the GM Address |
| | | Using the Data Transfer API Efficiently |
| | | Avoiding Same-Address Access |
| | | Setting a Proper L2 CacheMode |
| | | Using Shared Temporary Buffer for Operators and High-Level APIs |
| | | Reusing VECIN and VECOUT for Movement Operators |
| | | Reducing the Tensor ShapeInfo Dimensions to Optimize the Stack Space |
| | | Avoiding Bank Conflicts in the Unified Buffer |
| | | L2 Cache Tiling |
| Vector compute | Provides optimization suggestions related to vector compute. | Continuous Vector Computations Through UB Fusion |
| | | Using the Counter Mode for Vector Operators |
| | | Selecting Low-Latency Instructions to Optimize Reduction Operation Performance |
| Cube compute | Provides optimization suggestions related to Cube compute. | Efficient Bias Computation by Using the BT Buffer |
| | | Efficient Quantization by Storing Quantization Parameters in the FP Buffer |
| | | Efficient Matrix Multiplication Accumulation by Using the L0C Buffer |
| | | Smaller Matrices Residing on L1 Buffer, Only Larger Matrices Transferred in Batches |
| | | Enabling the AtomicAdd Option for Matmul |
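
As a small illustration of the memory access suggestion on 512-byte alignment of the GM address, the following sketch shows how a byte offset could be checked and rounded up on the host side. It is plain C++ and does not use any Ascend C API; the helper names `IsGmAligned` and `AlignUpGm` and the constant `kGmAlignBytes` are hypothetical and introduced only for this example. The sketch assumes that alignment is measured as a byte offset from an already aligned base address.

```cpp
#include <cstdint>

// Alignment value taken from the suggestion title
// "Using 512-Byte Alignment for the GM Address".
constexpr std::uint64_t kGmAlignBytes = 512;

// Hypothetical helper: returns true when the byte offset is a multiple of 512.
constexpr bool IsGmAligned(std::uint64_t byteOffset) {
    return byteOffset % kGmAlignBytes == 0;
}

// Hypothetical helper: rounds a byte offset up to the next multiple of 512.
// An offset that is already aligned is returned unchanged.
constexpr std::uint64_t AlignUpGm(std::uint64_t byteOffset) {
    return (byteOffset + kGmAlignBytes - 1) / kGmAlignBytes * kGmAlignBytes;
}

// Worked values, checked at compile time.
static_assert(IsGmAligned(0), "offset 0 is aligned");
static_assert(!IsGmAligned(100), "offset 100 is not aligned");
static_assert(AlignUpGm(100) == 512, "100 rounds up to 512");
static_assert(AlignUpGm(512) == 512, "512 stays at 512");
static_assert(AlignUpGm(1000) == 1024, "1000 rounds up to 1024");
```

A tiling function might use a helper of this kind to pad the per-core starting offset so that each core begins its transfers on an aligned address. Whether the padding pays off depends on the operator and the data layout.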