ND2NZ format conversion on the AIV core
Case Study
This case shows how to use the Matmul high-level API in a matrix multiplication operator whose input matrix is not 256-byte aligned along the inner axis (the row direction of the matrix), and how converting the ND format to the NZ format on the AIV core improves operator performance.
To improve the computation efficiency of the Cube Unit, an input matrix in ND format is converted to the NZ format before Cube computation. For details about the ND and NZ formats, see Data Formats. Internally, the Matmul API uses the ND2NZ instruction to convert the format while moving data. However, when the data is not 256-byte aligned, the bandwidth utilization of the ND2NZ instruction is low. In that case, the Vector Unit on the AIV core can convert the matrix from ND to NZ before the Matmul computation, which avoids the inefficient transfer of unaligned data and improves operator performance.
- Application scenarios of ND2NZ format conversion on the AIV core
The inner axis of the input matrix is not 256-byte aligned and the data volume is large, so the format conversion performed by the ND2NZ instruction is inefficient.
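The alignment condition can be checked on the host before choosing a conversion path. The following is a minimal sketch in standard C++, not part of the Matmul API: the helper names `RowRemainderBytes` and `IsInnerAxisAligned` are hypothetical, and the only inputs are the number of elements along the inner axis and the element size in bytes.

```cpp
#include <cassert>
#include <cstddef>

// Alignment boundary discussed in this case (bytes).
constexpr std::size_t kAlignBytes = 256;

// Hypothetical helper: bytes left over after the last full 256-byte chunk of one row.
inline std::size_t RowRemainderBytes(std::size_t innerElems, std::size_t elemBytes) {
    assert(elemBytes > 0 && "element size must be non-zero");
    return (innerElems * elemBytes) % kAlignBytes;
}

// Hypothetical helper: true when one row along the inner axis is a multiple of 256 bytes.
inline bool IsInnerAxisAligned(std::size_t innerElems, std::size_t elemBytes) {
    return RowRemainderBytes(innerElems, elemBytes) == 0;
}
```

For float16 (2 bytes per element), a row of 1024 elements occupies 2048 bytes and is aligned, whereas a row of 4095 elements occupies 8190 bytes and leaves a 254-byte remainder, so it is not aligned.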
The operator specifications are as follows.
| Input | Shape | Data type | Format |
|---|---|---|---|
| a | 1024, 1024 | float16 | ND |
| b | 1024, 4095 | float16 | ND |
In this case, the AI Processor has 24 cores, and the high-level API Matmul is enabled in pure Cube mode. The tiling parameters of the MDL template are as follows:
- Original shape: M = 1024, N = 4095, K = 1024.
- Single-core shape: singleCoreM = 128, singleCoreN = 1408, singleCoreK = 1024.
- Basic block shape: baseM = 128, baseN = 256, baseK = 64.
- Tiling parameters related to the L1 cache: stepM = 1, stepN = 1, stepKa = 4, stepKb = 4.
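The relationship between these tiling values can be checked with a little host-side arithmetic. The sketch below is standard C++ and only illustrative: `TilingSketch` and the helper functions are hypothetical names, not Matmul API types. It assumes that the number of cores used is the number of single-core blocks along M times the number along N (singleCoreK equals K here, so K is not split), and that the K length of matrix A cached in L1 per load is stepKa × baseK.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side mirror of the tiling values listed above.
struct TilingSketch {
    std::uint32_t M, N, K;
    std::uint32_t singleCoreM, singleCoreN, singleCoreK;
    std::uint32_t baseM, baseN, baseK;
    std::uint32_t stepKa, stepKb;
};

// Values taken from this case.
constexpr TilingSketch kCaseTiling{1024, 4095, 1024, 128, 1408, 1024, 128, 256, 64, 4, 4};

constexpr std::uint32_t CeilDiv(std::uint32_t a, std::uint32_t b) { return (a + b - 1) / b; }

// Assumption: one core per (singleCoreM x singleCoreN) block, with no split along K.
inline std::uint32_t UsedCores(const TilingSketch& t) {
    assert(t.singleCoreM > 0 && t.singleCoreN > 0);
    return CeilDiv(t.M, t.singleCoreM) * CeilDiv(t.N, t.singleCoreN);
}

// Width of the last block along N (the tail handled by the last column of cores).
inline std::uint32_t TailN(const TilingSketch& t) {
    const std::uint32_t rem = t.N % t.singleCoreN;
    return rem == 0 ? t.singleCoreN : rem;
}

// Assumption: K elements of matrix A held in L1 per load = stepKa * baseK.
inline std::uint32_t KPerL1LoadA(const TilingSketch& t) { return t.stepKa * t.baseK; }
```

With the values of this case, `UsedCores(kCaseTiling)` evaluates to 8 × 3 = 24, which matches the 24 cores stated above; the tail along N is 1279 columns, and each L1 load of matrix A covers 256 elements along K.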
Obtaining Profile Data
Use the msProf tool to obtain the operator simulation pipeline diagram and on-board profiling data, and analyze the MTE2 pipeline.
Analyzing Main Bottlenecks
- The following figure shows the Cube pipeline before the optimization. Because the ND2NZ instruction is used, the data format is converted during the MTE2 data transfer. As a result, MTE2 accounts for a large proportion of the total time.

- The following figure shows the profiling data before the optimization. Only the Cube Unit is used for computation. The maximum aic_time is 149.04 μs, and aic_mte2_ratio accounts for a large proportion of it.

Optimization Solution
For the input matrix in ND format, the ND2NZ instruction that converts the format during data transfer is no longer used. Instead, the Vector Unit performs the format conversion. First, the DataCopyPad API moves the unaligned matrix data to the Unified Buffer, and the Duplicate API fills the positions that must be padded for alignment. Then, the Copy API is called row by row to rearrange the data from the ND format to the NZ format, and the rearranged NZ data is written to the workspace memory. Finally, the NZ data in the workspace is read directly for Matmul computation.
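To make the target layout concrete, the following host-side reference shows where each ND element lands in an NZ buffer. It is a sketch in standard C++, not the AIV kernel and not the sample's MatrixBtoNZ implementation. It assumes the common NZ fractal layout in which each fractal has 16 rows and 32 bytes per row (16 elements for float16), fractals are ordered column-block first, and elements inside a fractal are row-major. The name `NdToNzReference` is hypothetical, and zero fill stands in for the padding written by Duplicate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kFractalRows = 16;  // assumed fractal height
constexpr std::size_t kC0Bytes = 32;      // assumed bytes per fractal row

// Hypothetical reference: rearrange a row-major (ND) rows x cols matrix into NZ order.
// Rows are padded up to a multiple of 16 and columns up to a multiple of c0.
template <typename T>
std::vector<T> NdToNzReference(const std::vector<T>& nd, std::size_t rows, std::size_t cols) {
    static_assert(sizeof(T) <= kC0Bytes, "element must fit in one fractal row");
    assert(nd.size() == rows * cols);
    const std::size_t c0 = kC0Bytes / sizeof(T);  // elements per fractal row
    const std::size_t rowsPad = (rows + kFractalRows - 1) / kFractalRows * kFractalRows;
    const std::size_t colBlocks = (cols + c0 - 1) / c0;

    std::vector<T> nz(colBlocks * rowsPad * c0, T{});  // zero fill acts as padding
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            // Column block first, then the row, then the position inside the c0 group.
            const std::size_t dst = (c / c0) * rowsPad * c0 + r * c0 + (c % c0);
            nz[dst] = nd[r * cols + c];
        }
    }
    return nz;
}
```

Using `std::uint16_t` as a stand-in for float16, a 2 × 17 matrix becomes a buffer of 2 × 16 × 16 elements: the first 16 columns of every row fill the first column block, and column 16 starts the second column block at offset 256.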
For details about the complete example of ND2NZ format conversion on the AIV core, see Matmul operator sample for converting the ND format of the input matrix to the NZ format. The main steps to implement ND2NZ format conversion on the AIV core are as follows:
- When creating a Matmul object, define the format of matrix B whose inner axis is not 256-byte aligned as NZ.
    using A_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, ATYPE, true>;
    // Use CubeFormat::NZ to define the type information of matrix B.
    using B_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::NZ, BType, true>;
    using C_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, CType>;
    using BIAS_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BiasType>;
    AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CFG_MDL> matmulObj;
- Use the Vector Unit to implement ND2NZ format conversion. In the following code, MatrixBtoNZ is a function that converts the ND format of matrix B to the NZ format. For details about the implementation of this function, see the complete sample code.
    // Vector ND2NZ
    if ASCEND_IS_AIV {
        pipe->InitBuffer(ubBuf, TOTAL_UB_SIZE);
        MatrixBtoNZ<typename B_TYPE::T>(tempGM, bGMNZ, tiling, isTransB, ubBuf, tiling.baseK,
            tiling.baseN); // ND2NZ format conversion function
        SyncAll();
        // CV SYNC
        NotifyEvent<PIPE_MTE3>(4);
        return;
    }
    if ASCEND_IS_AIC {
        WaitEvent(4); // Wait for the Vector to complete the ND2NZ format conversion.
    }
- Set the left matrix A, right matrix B, and bias to complete the matrix multiplication operation.
    matmulObj.SetTail(tailM, tailN, shapes.k);
    matmulObj.SetTensorA(aGlobal, false);
    matmulObj.SetTensorB(bGlobal, false);
    if (shapes.isBias) {
        matmulObj.SetBias(biasGlobal);
    }
    matmulObj.IterateAll(cGlobal);
Verifying Optimization Benefits
- The following figure shows the optimized Vector pipeline. The Vector Unit is used to convert the data format of matrix B.

- The following figure shows the optimized Cube pipeline. Because the ND2NZ instruction is no longer used to convert the format of matrix B, the proportion of MTE2 decreases significantly.

- The following figure shows the optimized profiling data. The Cube Unit and the Vector Unit are now used at the same time. The maximum aic_time is 90.95 μs, and the proportion of aic_mte2_ratio is significantly reduced.


| Optimization Method | Total Time (μs) | Average AIC_MTE2 Time (μs) | Average AIV_MTE2 Time (μs) |
|---|---|---|---|
| ND2NZ instruction | 149.82 | 130.77 | 0 |
| Vector-side ND2NZ | 93.76 | 22.85 | 10.31 |
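As the preceding table shows, replacing the ND2NZ instruction with Vector-side conversion reduces the total time from 149.82 μs to 93.76 μs (about 37%) and the average AIC MTE2 time from 130.77 μs to 22.85 μs, while adding 10.31 μs of average AIV MTE2 time. The end-to-end performance is significantly improved.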
Summary
When the inner axis of an input matrix is not 256-byte aligned in matrix multiplication, the bandwidth utilization of the ND2NZ instruction is low, which degrades operator performance. In this scenario, performing the ND2NZ data rearrangement on the AIV core improves the overall operator performance. Note that bandwidth utilization depends on the data volume. If the total amount of matrix data is too small, converting on the AIV core does not significantly improve the effective bandwidth, and the multi-core synchronization it requires can make the end-to-end operator performance worse.