Custom Pass

Cases

Case 1: BMM + Tile

The BMM (batch matrix multiplication) operator broadcasts its inputs internally. If a neighboring operator performs the same broadcast explicitly, that operator is redundant and can be removed to improve performance. As shown in the following figure, deleting the Tile operator improves performance in the following ways:

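The execution time of the Tile operator itself is saved.
The 37.5 MB of memory access introduced by the Tile operator (128 x 8 x 300 x 32 x 4 bytes) is eliminated, so it no longer lowers the cache hit ratio of other operators.
The MTE (memory transfer engine) stage of the BMM operator runs faster, because BMM loads the smaller, un-tiled input and broadcasts it internally.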
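The equivalence behind this pass can be checked with a minimal NumPy sketch. The tensor shapes below are assumptions inferred from the sizes quoted in this section (a 128 x 8 x 1 x 32 left input and a 1 x 8 x 300 x 32 right input in float32); the real graph may differ, and NumPy only stands in for the NPU operators.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes, inferred from the figures quoted in the text.
a = rng.standard_normal((128, 8, 1, 32), dtype=np.float32)
b = rng.standard_normal((1, 8, 300, 32), dtype=np.float32)

# Original graph: Tile materializes the broadcast, then BMM consumes it.
b_tiled = np.tile(b, (128, 1, 1, 1))                    # (128, 8, 300, 32)
out_tiled = np.matmul(a, np.swapaxes(b_tiled, -1, -2))  # (128, 8, 1, 300)

# After the pass: the batched matmul broadcasts the size-1 batch axis itself.
out_bcast = np.matmul(a, np.swapaxes(b, -1, -2))        # (128, 8, 1, 300)

# Size of the intermediate tensor that the Tile operator had to write.
tile_mib = b_tiled.nbytes / 2**20

print("Tile output size (MiB):", tile_mib)
print("results match:", np.allclose(out_tiled, out_bcast, atol=1e-4))
```

Both paths produce the same result, while the second never creates the tiled intermediate tensor.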

Case 2: Concat operator

In this scenario, the graph is built with the Concat operator to do what the Tile operator does. However, the Tile operator moves only 1/240 of the data that the Concat operator moves. In addition, when the tail axis is not aligned, the Concat operator loses considerable performance.
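As a small illustration of why the two operators are interchangeable here, the NumPy sketch below builds the same repeated tensor both ways. The shape and repeat count are hypothetical and chosen only for readability; the 1/240 ratio depends on the actual graph and is not reproduced.

```python
import numpy as np

# Hypothetical input: one row that must be repeated n times.
x = np.arange(6, dtype=np.float32).reshape(1, 6)
n = 4

# Graph as originally built: Concat over n references to the same tensor.
via_concat = np.concatenate([x] * n, axis=0)

# Equivalent graph after the pass: a single Tile along the same axis.
via_tile = np.tile(x, (n, 1))

print("same result:", np.array_equal(via_concat, via_tile))
```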

Case 3: BMM operator

In the original graph, each batch matrix multiplication (BMM) multiplies a 1 x 32 matrix by a 300 x 32 matrix. On the NPU, this uses only 1/16 of the Cube matrix unit's capacity, so most of the available compute is wasted. Because the leading axis is a broadcast axis, it can be folded into the M axis (the row dimension of the left matrix): 128 x 8 multiplications of (1 x 32) x (300 x 32) are replaced with 8 multiplications of (128 x 32) x (300 x 32), which raises overall Cube utilization. The Reshape operator that this introduces is not executed on the NPU, and the cost of the introduced Transpose operator is smaller than the gain in computing efficiency, so the net benefit is clear.
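The rewrite can be sketched in NumPy as follows. The shapes are assumptions based on the sizes quoted above (left input 128 x 8 x 1 x 32, right input 8 x 300 x 32), and the exact position of the Reshape and Transpose operators in the real graph may differ. The sketch folds the 128-element broadcast axis into the row dimension of the left operand and checks that the result is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes, inferred from the sizes quoted in the text.
a = rng.standard_normal((128, 8, 1, 32), dtype=np.float32)
b = rng.standard_normal((8, 300, 32), dtype=np.float32)
b_t = np.swapaxes(b, -1, -2)                           # (8, 32, 300)

# Original: 128 x 8 small products of (1 x 32) by (32 x 300).
ref = np.matmul(a, b_t)                                # (128, 8, 1, 300)

# Rewritten: 8 larger products of (128 x 32) by (32 x 300).
a_folded = a.reshape(128, 8, 32).transpose(1, 0, 2)    # (8, 128, 32)
out = np.matmul(a_folded, b_t)                         # (8, 128, 300)

# Undo the layout change so the consumer sees the original shape.
out = out.transpose(1, 0, 2).reshape(128, 8, 1, 300)

print("results match:", np.allclose(ref, out, atol=1e-4))
```

The Reshape calls only reinterpret the existing layout, which is consistent with the statement that Reshape is not executed on the NPU; the two Transpose steps are the extra data movement that the pass trades for better matrix-unit utilization.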

Case 4: Tile + Concat

Swapping the execution order of Tile and Concat reduces the read access volume (in elements) from 1 x 32 x 3 + 128 x 32 x 3 to 1 x 32 x 3 + 1 x 96, a reduction of about 98.4%. The write access volume drops from 128 x 32 x 3 + 128 x 96 to 1 x 96 + 128 x 96, a reduction of about 50%.
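The arithmetic and the equivalence of the two orderings can be checked with a short NumPy sketch. It assumes three 1 x 32 inputs that are each tiled 128 times and concatenated along the last axis, which is the layout implied by the expressions above; access volumes are counted in elements.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = [rng.standard_normal((1, 32), dtype=np.float32) for _ in range(3)]

# Original order: Tile each input, then Concat the tiled tensors.
ref = np.concatenate([np.tile(x, (128, 1)) for x in xs], axis=1)  # (128, 96)

# Swapped order: Concat the small inputs first, then Tile once.
out = np.tile(np.concatenate(xs, axis=1), (128, 1))               # (128, 96)

# Access volumes in elements (reads and writes of both operators).
reads_before = 1 * 32 * 3 + 128 * 32 * 3     # Tile reads + Concat reads
reads_after = 1 * 32 * 3 + 1 * 96            # Concat reads + Tile reads
writes_before = 128 * 32 * 3 + 128 * 96      # Tile writes + Concat writes
writes_after = 1 * 96 + 128 * 96             # Concat writes + Tile writes

read_saving = 1 - reads_after / reads_before
write_saving = 1 - writes_after / writes_before

print("same result:", np.array_equal(ref, out))
print(f"read saving:  {read_saving:.2%}")
print(f"write saving: {write_saving:.2%}")
```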
