GroupedMatmul Operator Performance Optimization

Case Study

This case study analyzes and optimizes the performance of the GroupedMatmul operator in the per-token quantization scenario. The computation performed by the operator can be expressed in Python-style pseudocode as follows:

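offset = 0
for i in range(g):
    # Matrix multiplication (not element-wise) of the i-th token group with its weight, plus bias
    mmOut = x[offset:offset + groupList[i]] @ weight[i] + bias[i]
    # scale[i] applies per output channel (column); pertokenScale applies per token (row)
    y[offset:offset + groupList[i]] = Gelu(mmOut * scale[i] * pertokenScale[offset:offset + groupList[i]])
    offset += groupList[i]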
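
The pseudocode above can be turned into a small NumPy reference, which is handy for checking kernel output on small shapes. This is a minimal sketch under stated assumptions: the function names `gelu` and `grouped_matmul_ref` are hypothetical, the tanh approximation of GELU is assumed because the text does not say which variant the operator uses, and the tiny shapes are for illustration only.

```python
import math
import numpy as np

def gelu(v):
    # Assumption: tanh approximation of GELU; the operator's exact variant is not stated.
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * v * (1.0 + np.tanh(c * (v + 0.044715 * v ** 3)))

def grouped_matmul_ref(x, weight, bias, group_list, scale, pertoken_scale):
    m, n = x.shape[0], weight.shape[2]
    y = np.zeros((m, n), dtype=np.float16)
    offset = 0
    for i, rows in enumerate(group_list):
        rows = int(rows)
        sl = slice(offset, offset + rows)
        # int8 x int8 accumulated in int32, plus the int32 bias of group i
        mm_out = x[sl].astype(np.int32) @ weight[i].astype(np.int32) + bias[i]
        # Per-channel scale broadcasts over columns, per-token scale over rows
        deq = mm_out.astype(np.float32) * scale[i][None, :] * pertoken_scale[sl][:, None]
        y[sl] = gelu(deq).astype(np.float16)
        offset += rows
    return y

# Tiny illustrative shapes (the case study uses m=1024, k=1024, n=8192, g=8).
rng = np.random.default_rng(0)
g, m, k, n = 2, 6, 4, 5
group_list = np.array([2, 4], dtype=np.int64)
x = rng.integers(-8, 8, size=(m, k), dtype=np.int8)
weight = rng.integers(-8, 8, size=(g, k, n), dtype=np.int8)
bias = rng.integers(-4, 4, size=(g, n), dtype=np.int32)
scale = rng.random((g, n), dtype=np.float32)
pertoken_scale = rng.random(m, dtype=np.float32)
y = grouped_matmul_ref(x, weight, bias, group_list, scale, pertoken_scale)
print(y.shape, y.dtype)
```

Each group of rows in `x` is multiplied only by its own `weight[i]`, so changing the weight of one group leaves the output rows of the other groups untouched.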

The verification platform is Atlas A2 training products/Atlas A2 inference products.

The following operator specifications are used as an example.

**Table 1** Operator specifications
Tensor	Shape	Data type	Format
x	(1024,1024)	int8	ND
weight	(8,1024,8192)	int8	NZ
bias	(8,8192)	int32	ND
groupList	8	int64	ND
scale	(8,8192)	float	ND
pertokenScale	1024	float	ND
y	(1024,8192)	float16	ND

The optimization methods are described below:

Set the startup ratio of AICs to AIVs in the AI Core to 1:2 if the Vector ratio is high (reaching the vector bound).
Optimize the CV parallelism pipeline to reduce the idle waiting time between the Cube and Vector computations.
Optimize the vector computation pipeline to improve the parallel Vector computation speed.

Obtaining Profile Data

Eight cores are used for the test: blockDim is fixed at 8 for the baseline measurement and for all subsequent tiling optimizations.

Use the msProf tool to obtain the operator profile data.

Obtain the profile data (ArithmeticUtilization.csv for cycle ratios of instructions) executed in the actual environment, including the ratio of each pipeline.
Obtain the simulation profile data (instruction pipeline chart), including the utilization of each pipeline. You can observe the dependency between pipelines to optimize the parallelism efficiency.

Analyzing Main Bottlenecks

Eight cores are used for the test. Run the msprof op command to obtain the cycle ratios of instructions.

Figure 1 ArithmeticUtilization.csv for cycle ratios of instructions (total time: 218.1 μs)

The following figure shows the instruction pipeline chart obtained by running msprof op simulator.

Figure 2 Instruction pipeline chart

Analyze the performance based on the preceding two types of data (real data and simulation data).

The operator is vector bound while the startup ratio of AICs to AIVs is 1:1 (a setting that reduces the core startup overhead), so the Vector computation is the main bottleneck.
Once the preceding problem is fixed and the Vector ratio decreases, gaps with waiting time appear between the Cube and Vector computations.
Double buffering is not enabled for the Vector computation, so computation and data transfer do not run in parallel.

Optimization Solution

Set the startup ratio of AICs to AIVs in the AI Core to 1:2. Each time the AIC outputs data, the corresponding dequantization and activation functions are computed by two AIVs in parallel. In the Vector loop, AIV0 and AIV1 take alternate iterations (this only helps when the loop runs more than once). Sample code is as follows:

uint32_t vecCount = 0;
uint32_t taskRation = GetTaskRation();
for (uint32_t offsetN = 0; offsetN < curCubeSingleN; offsetN += mnConfig.baseN) {
    if (unlikely(offsetN + mnConfig.baseN >= curCubeSingleN)) {
        curVecBaseN = curCubeSingleN - offsetN;
    }
    uint32_t alignBaseN = Ceil(curVecBaseN, uint32_t(8)) * 8;  //  8: num int32_t in 32B ub block
    DataCopyScale(curVecBaseN, alignBaseN, scaleOffset + offsetN);
    uint32_t curVecBaseM = vecBaseM;
    uint64_t mmOutOffset = mnConfig.workSpaceOffset + offsetN * mnConfig.baseM;
    CrossCoreWaitFlag(SYNC_AIC_TO_AIV);
    for (uint32_t offsetM = 0; offsetM < curCubeSingleM; offsetM += vecBaseM) {
        vecCount++;
        if (vecCount % taskRation != subBlockIdx) {
            continue;  // AIV0 and AIV1 are used for computing alternately.
        }
        if (unlikely(offsetM + vecBaseM >= curCubeSingleM)) { 
            curVecBaseM = curCubeSingleM - offsetM; 
        }
        // Use the AscendDequant API to perform per-channel dequantization.
        LocalTensor<cT::T> mmOutLocal = vecInQueue.AllocTensor<cT::T>();
        DataCopyPad2D(mmOutLocal, mmOutGm[mmOutOffset + offsetM * curVecBaseN],
                      curVecBaseM, curVecBaseN, curVecBaseN);
        vecInQueue.EnQue(mmOutLocal);
        ComputeDequantAndActivate(mnConfig, curVecBaseM, alignBaseN, curVecBaseN, offsetM);
        LocalTensor<DTYPE_Y> yLocal = vecOutQueue.DeQue<DTYPE_Y>();
        DataCopyPad2D(yGm[outOffset + offsetM * tiling->n + offsetN], yLocal,
                      curVecBaseM, curVecBaseN, alignBaseN, tiling->n);
        vecOutQueue.FreeTensor(yLocal);
    }
    ...
}
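
The alternating assignment in the loop above can be sketched in plain Python to see which AIV handles which tile. This is an illustration only: the helper `assign_vector_tiles` and its parameters are hypothetical, and it models nothing but the `vecCount % taskRation != subBlockIdx` check.

```python
def assign_vector_tiles(num_n_tiles, num_m_tiles, task_ratio=2):
    """Map each AIV sub-block index to the (n_tile, m_tile) pairs it processes."""
    work = {idx: [] for idx in range(task_ratio)}
    vec_count = 0
    for n_tile in range(num_n_tiles):
        for m_tile in range(num_m_tiles):
            vec_count += 1  # incremented before the check, as in the kernel loop
            work[vec_count % task_ratio].append((n_tile, m_tile))
    return work

balanced = assign_vector_tiles(num_n_tiles=2, num_m_tiles=3)
single = assign_vector_tiles(num_n_tiles=1, num_m_tiles=1)
print(balanced)
print(single)
```

With six tiles, each AIV receives three of them. With a single tile, one AIV receives nothing, which is why the 1:2 ratio only pays off when the loop runs more than once.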

After the startup ratio of AICs to AIVs in the AI Core is set to 1:2, gaps with waiting time remain between the Cube and Vector computations. The reason is that the Vector and Cube computations use the same workspace for data transfer, so each has to wait until the other has finished with it. To relax this dependency, use four workspaces: the workspace is allocated on the host based on 4 x baseM x baseN, and the Cube computation can skip the first four rounds of waiting.

if ASCEND_IS_AIC {
    if (cubeCount >= tiling->parallNum) {  // tiling->parallNum is set to 4.
        CrossCoreWaitFlag(SYNC_AIV_TO_AIC);
    }
    mm.SetOrgShape(mnConfig.m, tiling->n, tiling->k);
    mm.SetSingleShape(curSingleM, curSingleN, tiling->k);
    mm.SetTensorA(xGm[xOffset]);
    auto weightSlice = weightGm[weightOffset];
    if (mnConfig.blockDimM == 1) {
        weightSlice.SetL2CacheHint(CacheMode::CACHE_MODE_DISABLE);
    }
    mm.SetTensorB(weightSlice);
    uint64_t workspaceOffset = mnConfig.workSpaceOffset;
    while (mm.Iterate()) {
        mm.GetTensorC(mmOutGm[workspaceOffset], 0, true);
        CrossCoreSetFlag<2, PIPE_FIX>(SYNC_AIC_TO_AIV);
        workspaceOffset += (mnConfig.baseM * mnConfig.baseN);
    }
}
cubeCount++;
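
The effect of the number of workspaces can be illustrated with a toy timeline model. This is a simplified sketch, not a model of the real hardware: the function `pipeline_time` is hypothetical, the per-tile durations are made-up numbers in arbitrary units, and synchronization overhead is ignored. The Cube stage may write tile i only after the Vector stage has consumed the tile that previously occupied the same workspace slot.

```python
def pipeline_time(cube_times, vec_times, num_workspaces):
    """Finish time of the last Vector tile when Cube and Vector share a ring of workspaces."""
    cube_done, vec_done = [], []
    for i, (t_cube, t_vec) in enumerate(zip(cube_times, vec_times)):
        cube_free = cube_done[i - 1] if i > 0 else 0
        # Cube may overwrite a slot only after Vector has consumed the tile stored there.
        slot_free = vec_done[i - num_workspaces] if i >= num_workspaces else 0
        cube_done.append(max(cube_free, slot_free) + t_cube)
        vec_free = vec_done[i - 1] if i > 0 else 0
        # Vector needs the Cube result of tile i and must have finished tile i - 1.
        vec_done.append(max(cube_done[i], vec_free) + t_vec)
    return vec_done[-1]

# Hypothetical per-tile durations in arbitrary units (not measured data).
cube_times = [1, 1, 1, 1, 3, 3, 3, 3]
vec_times = [3, 3, 3, 3, 1, 1, 1, 1]
results = {w: pipeline_time(cube_times, vec_times, w) for w in (1, 2, 4)}
print(results)
```

In this model, one workspace fully serializes the two stages, and extra slots let the Cube stage run ahead while the Vector stage is busy. The gain depends on how uneven the per-tile durations are, so the number of workspaces is worth tuning against real profile data.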

To enable double buffering for the Vector computation, set the number of buffers allocated by InitBuffer to 2.

pipe->InitBuffer(scaleInQueue, 2, tiling->mmTilingData.baseN * sizeof(DTYPE_SCALE));
pipe->InitBuffer(perTokenScaleInQueue, 2, tiling->mmTilingData.baseM * sizeof(float));
pipe->InitBuffer(vecInQueue, 2, tiling->ubCalSize * sizeof(cT::T));
pipe->InitBuffer(vecOutQueue, 2, tiling->ubCalSize * sizeof(DTYPE_Y));
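
Doubling the buffer count doubles the Unified Buffer (UB) space taken by these queues, so it is worth checking the footprint first. The sketch below is a rough estimate under loudly stated assumptions: the tiling values `base_m`, `base_n`, and `ub_cal_size` are hypothetical, the element sizes assume `DTYPE_SCALE` is float, `cT::T` is int32, and `DTYPE_Y` is float16, and the 192 KB UB capacity is an assumption to be verified for the target device.

```python
UB_CAPACITY_BYTES = 192 * 1024  # assumption: verify against the target device

def queue_bytes(base_m, base_n, ub_cal_size, buffer_num):
    """UB bytes taken by the four queues for a given buffer count."""
    per_buffer = {
        "scaleInQueue": base_n * 4,          # assumed float scale
        "perTokenScaleInQueue": base_m * 4,  # float
        "vecInQueue": ub_cal_size * 4,       # assumed int32 matmul output
        "vecOutQueue": ub_cal_size * 2,      # assumed float16 output
    }
    return {name: size * buffer_num for name, size in per_buffer.items()}

# Hypothetical tiling values, for illustration only.
single = sum(queue_bytes(128, 256, 4096, buffer_num=1).values())
double = sum(queue_bytes(128, 256, 4096, buffer_num=2).values())
print(single, double, double <= UB_CAPACITY_BYTES)
```

Temporary buffers used inside the computation also live in the UB, so the real budget is tighter than this estimate suggests.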

Verifying Optimization Benefits

After the startup ratio of AICs to AIVs in the AI Core is set to 1:2, the total execution time decreases from 218.1 μs to 154.2 μs. The instruction pipeline chart shows that the waiting time between Cube computations decreases.
As shown in the preceding figure, the Vector computation no longer reaches the vector bound, but there are gaps between the Cube and Vector computations (as indicated by the two arrows). Possible causes are as follows:
The Vector computation has to wait for the output data of the Cube computation, while the Cube computation can store the result of the next round only after the Vector computation completes and releases the workspace. Currently, two workspaces are used for parallel Cube and Vector computations.

The Vector and Cube computations may use the same workspace for data transfer, which creates a data dependency and therefore waiting intervals.

Four workspaces can be used for optimization.

After four workspaces are used, the total execution time decreases from 154.2 μs to 131.8 μs. The instruction pipeline chart shows that the gaps between the Vector and Cube computations are significantly reduced.
After the double buffering function is enabled for Vector computation, the total execution time decreases from 131.8 μs to 128.1 μs.
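
The relative benefit of each step can be worked out from the timings reported above. The short script below only does that arithmetic; the step labels are shorthand for this illustration.

```python
timings_us = [
    ("baseline", 218.1),
    ("AIC:AIV startup ratio 1:2", 154.2),
    ("four workspaces", 131.8),
    ("double buffering", 128.1),
]

step_reduction_pct = []
for (_, before), (name, after) in zip(timings_us, timings_us[1:]):
    pct = (before - after) / before * 100.0
    step_reduction_pct.append(pct)
    print(f"{name}: {before} us -> {after} us ({pct:.1f}% less time)")

overall_speedup = timings_us[0][1] / timings_us[-1][1]
print(f"overall speedup: {overall_speedup:.2f}x")
```

The three steps cut the execution time by about 29.3%, 14.5%, and 2.8% respectively, for an overall speedup of about 1.70x. Most of the gain comes from the startup ratio change.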

Summary

If the main bottleneck lies in Vector computation, set the startup ratio of AICs to AIVs in the AI Core to 1:2.
If the Cube and Vector computation times are close and there are waiting gaps between them, use multiple workspaces (four in this case) so that the Cube computation does not have to wait for the Vector computation to release a workspace.
Check whether data transfer and computation overlap (mask each other). If successive rounds of computation have no data dependency and the buffer is large enough, enabling double buffering can improve the parallelism efficiency.
