Batch Scenario

Each call to IterateBatch computes multiple C matrices of size singleCoreM × singleCoreN. When the shape processed by a single Matmul computation is small, performance can suffer because every computation incurs internal communication overhead. This API therefore batches multiple Matmul computations into a single call.

In the following example, there are four matrix multiplications: a × a, b × b, c × c, and d × d. Multiple singleCoreM × singleCoreN operations need to be computed on a single core; if each shape is small, BatchMatmul can be enabled to process them in batches. Take the BMK × BKN = BMN scenario as an example (for details about the format, see IterateBatch): the results A = a × a, B = b × b, C = c × c, and D = d × d can be computed concurrently, as the reference sketch below illustrates.
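For reference, the following plain C++ loop is a minimal sketch of the BMK × BKN = BMN semantics (a host-side illustration only, not the Ascend C API): batch index b selects one M × K block of A and one K × N block of B, and all blocks are laid out contiguously in memory.

#include <cstddef>

// Reference semantics of BMK x BKN = BMN: C[b] = A[b] * B[b] for each batch b.
// All tensors are batch-major and contiguous in memory.
void BatchMatmulReference(const float* a, const float* b, float* c,
                          int batch, int m, int n, int k)
{
    for (int bi = 0; bi < batch; ++bi) {
        const float* aBlk = a + static_cast<size_t>(bi) * m * k;  // A[bi]: M x K
        const float* bBlk = b + static_cast<size_t>(bi) * k * n;  // B[bi]: K x N
        float* cBlk = c + static_cast<size_t>(bi) * m * n;        // C[bi]: M x N
        for (int i = 0; i < m; ++i) {
            for (int j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p) {
                    acc += aBlk[i * k + p] * bBlk[p * n + j];
                }
                cBlk[i * n + j] = acc;
            }
        }
    }
}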

When instantiating the Matmul object, set the layout mode of the inputs and output to NORMAL through MatmulType (NORMAL denotes the BMNK data layout). During tiling on the host, call SetBatchInfoForNormal to set the M, N, and K axes of the A, B, and C matrices as well as the batch counts of matrices A and B, as sketched below.
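The host-side flow might look like the following sketch. The tiling class and the Set* calls follow the common Ascend C tiling API; the argument order of SetBatchInfoForNormal (batch counts first, then the M, N, and K axes) is an assumption to verify against the tiling reference of your CANN release.

// Host-side tiling sketch (assumed flow; verify SetBatchInfoForNormal's exact
// signature against your CANN version).
#include "tiling/tiling_api.h"

bool PrepareBatchMatmulTiling(int32_t batchA, int32_t batchB,
                              int32_t m, int32_t n, int32_t k,
                              optiling::TCubeTiling& tilingData)
{
    matmul_tiling::MultiCoreMatmulTiling tiling;
    tiling.SetAType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND,
                    matmul_tiling::DataType::DT_FLOAT16, false);
    tiling.SetBType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND,
                    matmul_tiling::DataType::DT_FLOAT16, true);
    tiling.SetCType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND,
                    matmul_tiling::DataType::DT_FLOAT);
    tiling.SetShape(m, n, k);
    tiling.SetOrgShape(m, n, k);
    // Assumed argument order: batch counts of A and B, then the M/N/K axes.
    tiling.SetBatchInfoForNormal(batchA, batchB, m, n, k);
    return tiling.GetTiling(tilingData) != -1;
}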

In the following example, the aGM and bGM matrices are multiplied and the result is written to cGM. The data layout mode of aGM, bGM, and cGM is NORMAL. Each call processes batchA M × K blocks of the left matrix and batchB K × N blocks of the right matrix. For more examples of data formats, see the BatchMatmul sample.
#include "kernel_operator.h"
#include "lib/matmul_intf.h"

extern "C" __global__  __aicore__ void kernel_matmul_rpc_batch(GM_ADDR aGM, GM_ADDR bGM, GM_ADDR cGM, GM_ADDR biasGM, GM_ADDR tilingGM, GM_ADDR workspaceGM, uint32_t isTransposeAIn, uint32_t isTransposeBIn, int32_t batchA,  int32_t batchB)
{
    // Define the MatmulType of each input and output.
    typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half, false, LayoutMode::NORMAL> aType;
    typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half, true, LayoutMode::NORMAL> bType;
    typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float, false, LayoutMode::NORMAL> cType;
    typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float> biasType;
    // Element types used below, derived from the MatmulType definitions.
    using A_T = aType::T;
    using B_T = bType::T;
    using C_T = cType::T;
    using BiasT = biasType::T;

    // Copy the tiling data from GM into a local TCubeTiling structure.
    TCubeTiling tiling;
    auto tempTilingGM = (__gm__ uint32_t*)tilingGM;
    auto tempTiling = (uint32_t*)&tiling;
    for (uint32_t i = 0; i < sizeof(TCubeTiling) / sizeof(uint32_t); ++i, ++tempTilingGM, ++tempTiling) {
        *tempTiling = *tempTilingGM;
    }

    // Initialize GM data.
    AscendC::GlobalTensor<half> aGlobal;
    AscendC::GlobalTensor<half> bGlobal;
    AscendC::GlobalTensor<float> cGlobal;
    AscendC::GlobalTensor<float> biasGlobal;
    int32_t sizeA = tiling.ALayoutInfoB * tiling.singleCoreM * tiling.singleCoreK * sizeof(A_T);
    int32_t sizeB = tiling.BLayoutInfoB * tiling.singleCoreK * tiling.singleCoreN * sizeof(B_T);
    int32_t sizeC = tiling.CLayoutInfoB * tiling.singleCoreM * tiling.singleCoreN * sizeof(C_T);
    int32_t sizeBias = tiling.CLayoutInfoB * tiling.singleCoreN * sizeof(C_T);
    aGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ half*>(aGM), sizeA);
    bGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ half*>(bGM), sizeB);
    cGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(cGM), sizeC);
    biasGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(biasGM), sizeBias);
    // Resource sharing configuration.
    tiling.shareMode = 0;
    tiling.shareL1Size = 512 * 1024;
    tiling.shareL0CSize = 128 * 1024;
    tiling.shareUbSize = 0;
    int offset_a = 0, offset_b = 0, offset_c = 0, offset_bias = 0;
    AscendC::GlobalTensor<A_T> gm_a;
    gm_a.SetGlobalBuffer(const_cast<__gm__ A_T*>(aGlobal[offset_a].GetPhyAddr()), tiling.singleCoreM * tiling.singleCoreK);
    AscendC::GlobalTensor<B_T> gm_b;
    gm_b.SetGlobalBuffer(const_cast<__gm__ B_T*>(bGlobal[offset_b].GetPhyAddr()), tiling.singleCoreK * tiling.singleCoreN);
    AscendC::GlobalTensor<C_T> gm_c;
    gm_c.SetGlobalBuffer(const_cast<__gm__ C_T*>(cGlobal[offset_c].GetPhyAddr()), tiling.singleCoreM * tiling.singleCoreN);
    AscendC::GlobalTensor<BiasT> gm_bias;
    gm_bias.SetGlobalBuffer(const_cast<__gm__ BiasT*>(biasGlobal[offset_bias].GetPhyAddr()), tiling.singleCoreN);
    // Create a Matmul instance.
    constexpr MatmulConfig MM_CFG = GetNormalConfig(false, false, false, BatchMode::BATCH_LESS_THAN_L1);
    matmul::Matmul<aType, bType, cType, biasType, MM_CFG> mm1;
    AscendC::TPipe pipe;
    g_cubeTPipePtr = &pipe;
    SetSysWorkspace(workspaceGM);
    REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm1);
    mm1.Init(&tiling);
    mm1.SetTensorA(gm_a, isTransposeAIn);
    mm1.SetTensorB(gm_b, isTransposeBIn);
    if (tiling.isBias) {
        mm1.SetBias(gm_bias);
    }
    // Execute the multi-batch Matmul computation. batchA and batchB are the
    // batch counts of the left and right matrices; the final flag
    // (enSequentialWrite) is left disabled here.
    mm1.IterateBatch(gm_c, batchA, batchB, false);
}
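
In this sample, GetNormalConfig(false, false, false, BatchMode::BATCH_LESS_THAN_L1) selects the batch mode whose name indicates that the combined data of all batches fits into L1 at once. The batchA and batchB values passed to IterateBatch at run time must be consistent with the batch counts configured through SetBatchInfoForNormal during host tiling.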