ND2NZ Format Conversion on AIVs

Case Study

This case shows how operator performance improves when input matrices that are not 256-byte aligned are converted from ND to NZ format on AIVs, in a matrix multiplication operator built on the high-level Matmul API. To improve the computing efficiency of the Cube Unit, an input matrix in ND format is converted to the NZ format before Cube computation. For details about the ND and NZ formats, see Data Formats. The Matmul API internally uses the real-time ND2NZ instruction to perform format conversion and data transfer in a single step. However, when the data is not 256-byte aligned, the bandwidth utilization of the real-time ND2NZ instruction is low. Therefore, when the inner axis of the input matrix is not 256-byte aligned, the Vector Unit on the AIV can be used to convert the matrix from ND to NZ format before Matmul computation. This avoids the inefficiency of real-time non-aligned data transfer and improves operator performance.

Applicable scenarios of ND2NZ format conversion on AIVs
The inner axis of the input matrix is not 256-byte aligned, and the data size is large, affecting the efficiency of real-time format conversion.
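The applicability condition can be checked on the host before choosing a conversion path. The following is a minimal sketch, not part of the sample: the helper name IsInnerAxis256BAligned is hypothetical, and it assumes an untransposed row-major ND matrix, so that the inner axis is the last dimension.

```cpp
#include <cassert>
#include <cstddef>

// Alignment granularity discussed in this case, in bytes.
constexpr std::size_t kAlignBytes = 256;

// Hypothetical helper: returns true when one row along the inner (contiguous)
// axis occupies a multiple of 256 bytes.
// innerAxisElems: number of elements along the inner axis.
// elemBytes: size of one element in bytes (2 for float16).
inline bool IsInnerAxis256BAligned(std::size_t innerAxisElems, std::size_t elemBytes) {
    assert(elemBytes > 0);
    return (innerAxisElems * elemBytes) % kAlignBytes == 0;
}
```

For the matrices in this case, a float16 row of 1024 elements is 2048 bytes and is aligned, whereas a float16 row of 4095 elements is 8190 bytes, which leaves a remainder of 254 bytes and is therefore not aligned.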

The operator specifications are as follows.

**Table 1** Operator specifications
Input	Shape	Data Type	Format
a	1024, 1024	float16	ND
b	1024, 4095	float16	ND

The AI processor used in this case has 24 cores in total. The operator uses the high-level Matmul API with CUBE_ONLY enabled and the MDL template. The tiling parameters are as follows:

Original shape: M = 1024, N = 4095, K = 1024.
Single-core shape: singleCoreM = 128, singleCoreN = 1408, singleCoreK = 1024.
Base block shape: baseM = 128, baseN = 256, baseK = 64.
Tiling parameters related to the L1 cache: stepM = 1, stepN = 1, stepKa = 4, stepKb = 4.
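As a sanity check on these parameters, the number of single-core blocks and the tail sizes can be derived from the original and single-core shapes. The sketch below is illustrative only: CeilDiv, BlockSplit, and SplitShape are hypothetical names, and it assumes that each single-core block is assigned to one core.

```cpp
#include <cassert>
#include <cstdint>

// Integer division rounded up.
inline uint32_t CeilDiv(uint32_t a, uint32_t b) {
    assert(b != 0);
    return (a + b - 1) / b;
}

// Hypothetical summary of how the M x N output is split into single-core blocks.
struct BlockSplit {
    uint32_t blocksM; // number of blocks along M
    uint32_t blocksN; // number of blocks along N
    uint32_t tailM;   // size of the last block along M
    uint32_t tailN;   // size of the last block along N
};

inline BlockSplit SplitShape(uint32_t m, uint32_t n,
                             uint32_t singleCoreM, uint32_t singleCoreN) {
    BlockSplit s{};
    s.blocksM = CeilDiv(m, singleCoreM);
    s.blocksN = CeilDiv(n, singleCoreN);
    // The last block along each axis holds whatever remains.
    s.tailM = m - (s.blocksM - 1) * singleCoreM;
    s.tailN = n - (s.blocksN - 1) * singleCoreN;
    return s;
}
```

With M = 1024, N = 4095, singleCoreM = 128, and singleCoreN = 1408, this gives 8 blocks along M and 3 blocks along N, that is, 24 blocks, which matches the 24 cores. The last block along N covers 1279 columns instead of 1408.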

Obtaining Profile Data

Use the msProf tool to obtain the Operator Simulation Pipeline and On-board Profiling data, and focus on analyzing the MTE2 pipeline.

Analyzing Main Bottlenecks

The following figure shows the Cube pipeline before optimization. Because the real-time ND2NZ instruction is used, the data format is converted during the MTE2 data transfer. As a result, MTE2 accounts for a large proportion of the pipeline time.
The profile data before optimization is as follows. Only the Cube Unit is used for computation. The maximum aic_time is 149.04 μs, and aic_mte2_ratio is high.

Optimization Solution

For the input matrix in ND format, the real-time ND2NZ instruction is no longer used for format conversion. Instead, the Vector Unit converts the data format. First, use the DataCopyPad API to transfer the non-aligned matrix data to the Unified Buffer. Then, use the Duplicate API to fill the padding area required for alignment. Next, call the Copy API row by row to rearrange the data from ND to NZ format, and write the rearranged NZ data to the workspace memory. Finally, read the NZ data directly from the workspace and perform Matmul computation.
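The effect of the pad-then-rearrange steps can be illustrated with a host-side reference that is useful for checking the output of a device implementation. This is a minimal sketch and not the MatrixBtoNZ function from the sample. It assumes that NZ stores the matrix as column blocks of width c0, each holding all padded rows contiguously, with 16 x 16 fractals for float16 and zero as the padding value. The names AlignUp and NdToNzReference are hypothetical, and uint16_t stands in for the float16 bit pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Rounds v up to the next multiple of a.
inline std::size_t AlignUp(std::size_t v, std::size_t a) {
    return (v + a - 1) / a * a;
}

// Hypothetical host-side reference.
// src: rows x cols matrix in row-major ND layout.
// Returns a buffer laid out as [colsPad / c0][rowsPad][c0] (assumed NZ layout).
inline std::vector<uint16_t> NdToNzReference(const std::vector<uint16_t>& src,
                                             std::size_t rows, std::size_t cols,
                                             std::size_t r0 = 16, std::size_t c0 = 16) {
    assert(src.size() == rows * cols);
    assert(r0 > 0 && c0 > 0);
    const std::size_t rowsPad = AlignUp(rows, r0);
    const std::size_t colsPad = AlignUp(cols, c0);
    // Zero initialization plays the role of the padding fill step.
    std::vector<uint16_t> dst(rowsPad * colsPad, 0);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            // Column block first, then row, then position inside the block row.
            const std::size_t idx = (c / c0) * rowsPad * c0 + r * c0 + (c % c0);
            dst[idx] = src[r * cols + c];
        }
    }
    return dst;
}
```

For example, with a 2 x 3 matrix {1, 2, 3; 4, 5, 6} and a reduced block size of r0 = c0 = 2, the result is {1, 2, 4, 5, 3, 0, 6, 0}: the first column block holds columns 0 and 1 of both rows, and the second holds column 2 followed by one padded zero per row.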

For details about the sample of ND2NZ format conversion on the AIV, see Matmul operator sample for converting the ND format of the input matrix to the NZ format. The main steps for implementing ND2NZ format conversion on the AIV are as follows:

When creating a Matmul object, define the format of matrix B whose inner axis is not 256-byte aligned as NZ.

        
using A_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, AType, true>;
// Use CubeFormat::NZ to define the type information of matrix B.
using B_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::NZ, BType, true>;
using C_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, CType>;
using BIAS_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BiasType>;
AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CFG_MDL> matmulObj;

Use the Vector Unit to implement ND2NZ format conversion. In the following code, MatrixBtoNZ is a function that converts the ND format of matrix B to the NZ format. For details about the implementation of this function, see the complete sample code.

        
// Vector ND2NZ
if ASCEND_IS_AIV {
    pipe->InitBuffer(ubBuf, TOTAL_UB_SIZE);
    MatrixBtoNZ<typename B_TYPE::T>(tempGM, bGMNZ, tiling, isTransB, ubBuf, tiling.baseK,
        tiling.baseN); // ND2NZ format conversion function
    SyncAll(); // Synchronize the cores so that the whole matrix is converted before notifying.
    // CV SYNC
    NotifyEvent<PIPE_MTE3>(4); // Notify the Cube side that the NZ data is ready.
    return;
}
if ASCEND_IS_AIC {
    WaitEvent(4); // Wait for the vector to complete ND2NZ format conversion.
}

Set the left matrix A, right matrix B, and bias to complete the matrix multiplication operation.

        
matmulObj.SetTail(tailM, tailN, shapes.k);
matmulObj.SetTensorA(aGlobal, false);
matmulObj.SetTensorB(bGlobal, false);
if (shapes.isBias) {
    matmulObj.SetBias(biasGlobal);
}
matmulObj.IterateAll(cGlobal);

Verifying Optimization Benefits

The following figure shows the vector pipeline after optimization. The Vector Unit is used to convert the data format of matrix B.
The following figure shows the Cube pipeline after optimization. Because the real-time ND2NZ instruction is no longer used to convert the format of matrix B, the proportion of MTE2 decreases significantly.
The following table shows the profile data after optimization. When both the Cube Unit and Vector Unit are used, the maximum aic_time is 90.95 μs, and aic_mte2_ratio decreases significantly.

**Table 2** End-to-end performance comparison
Optimization Method	Total Duration (μs)	Average AIC_MTE2 Duration (μs)	Average AIV_MTE2 Duration (μs)
Real-time ND2NZ	149.82	130.77	0
Vector-side ND2NZ	93.76	22.85	10.31

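As shown in the preceding table, when the real-time ND2NZ instruction is replaced with Vector-side conversion, the total duration drops from 149.82 μs to 93.76 μs (about 37% shorter), and the average AIC MTE2 duration drops from 130.77 μs to 22.85 μs, at the cost of 10.31 μs of MTE2 time on the AIV. The end-to-end performance is significantly improved.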

Summary

When the inner axis of the matrix is not 256-byte aligned during matrix multiplication, the bandwidth utilization of the real-time ND2NZ instruction is low, which degrades operator performance. Rearranging the data from ND to NZ on the AIV with the Vector Unit, instead of using the real-time ND2NZ instruction, can improve the overall operator performance. Note that bandwidth utilization depends on the data size. If the total size of the matrix data is too small, the effective bandwidth cannot be significantly improved even if the ND2NZ conversion is performed on the AIV. Instead, the multi-core synchronization that this approach introduces may degrade the end-to-end performance of the operator.
