IterateNBatch

Function Description

A single call to IterateNBatch performs the IterateBatch computation N times, producing N multi-batch C matrices, each of size singleCoreM × singleCoreN.

For the BSNGD, SBNGD, and BNGS1S2 layouts, before calling this API, you need to use SetALayout, SetBLayout, SetCLayout, and SetBatchNum in tiling to set the layout axis information and maximum number of batches for matrices A, B, and C. For the NORMAL layout, use SetBatchInfoForNormal to set the M, N, and K axes of matrices A, B, and C, and the number of batches of matrices A and B. During Matmul instantiation, MatmulType is used to set the layout type. Currently, four layout types are supported: BSNGD, SBNGD, BNGS1S2, and NORMAL.
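To make the batch bookkeeping behind these tiling calls concrete, the sketch below models it with plain arithmetic. The struct, function names, and the assumption that the per-call loop count is the total batch count divided by the batch number set via SetBatchNum are illustrative, not part of the Ascend C API.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative helper (not an Ascend C API): for a BSNGD-layout matrix,
// the total batch count is B * N * G, and one IterateNBatch call performs
// TotalBatches / batchNum IterateBatch loops, assuming the total is an
// exact multiple of batchNum (the value set via SetBatchNum in tiling).
struct LayoutInfo {
    uint32_t b, s, n, g, d; // axis sizes set via SetALayout/SetBLayout/SetCLayout
};

// Total number of batches contained in a BSNGD-layout matrix.
uint32_t TotalBatches(const LayoutInfo& info) {
    return info.b * info.n * info.g;
}

// Number of IterateBatch loops covered by one IterateNBatch call.
uint32_t BatchLoops(const LayoutInfo& info, uint32_t batchNum) {
    return TotalBatches(info) / batchNum;
}
```

With B=2, N=8, G=2 there are 32 batches in total; with 4 batches per IterateBatch, one IterateNBatch call would run 8 loops.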

Prototype

template <bool sync = true, bool waitIterateBatch = false>
__aicore__ inline void IterateNBatch(const uint32_t batchLoop, uint32_t batchA, uint32_t batchB, bool enSequentialWrite, const uint32_t matrixStrideA = 0, const uint32_t matrixStrideB = 0, const uint32_t matrixStrideC = 0)

Parameters

Table 1 Parameters in the template

  • sync: Specifies whether matrix C is obtained in synchronous or asynchronous mode: true for synchronous mode (wait until the IterateBatch execution completes) and false for asynchronous mode (no need to wait). The synchronous mode is used by default. In asynchronous scenarios, this API must be used together with GetBatchC.
  • waitIterateBatch: Used only in asynchronous scenarios; indicates whether to use WaitIterateBatch to wait for the completion of IterateBatch execution. The default value is false. If true, WaitIterateBatch is used to wait for the completion of IterateBatch execution; if false, WaitIterateBatch is not used and developers can handle the waiting themselves.

  • batchLoop (input): Number of BMM computations performed by this call, that is, N.
  • batchA (input): Number of batches of the left matrix in a single BMM call.
  • batchB (input): Number of batches of the right matrix in a single BMM call. batchA and batchB can differ in the broadcast (brc) scenario.
  • enSequentialWrite (input): Whether the output data is stored contiguously.
  • matrixStrideA (input): Offset between the start addresses of adjacent nd matrices of matrix A's source operand.
  • matrixStrideB (input): Offset between the start addresses of adjacent nd matrices of matrix B's source operand.
  • matrixStrideC (input): Offset between the start addresses of adjacent nd matrices of matrix C's source operand.
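As an illustration of how the matrixStride parameters describe operand placement, the sketch below computes the start offset of the i-th nd matrix. The element-unit stride and the convention that a stride of 0 means the matrices are packed back-to-back at their natural size are assumptions made for this example, not guarantees taken from the API.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative only: start offset (in elements) of the i-th nd matrix of an
// operand, assuming matrixStride is expressed in elements and a stride of 0
// means the nd matrices are packed contiguously at their natural rows * cols size.
uint32_t NdMatrixOffset(uint32_t i, uint32_t matrixStride,
                        uint32_t rows, uint32_t cols) {
    uint32_t stride = (matrixStride != 0) ? matrixStride : rows * cols;
    return i * stride;
}
```

For 4 × 8 matrices, the fourth matrix (i = 3) starts at element 96 when packed (stride 0), or at element 120 with an explicit stride of 40, leaving an 8-element gap between matrices.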

Returns

None

Availability

Example

In this example, the aGM and bGM matrices are multiplied and the result is saved to cGM. The layout formats of the aGM, bGM, and cGM data are BSNGD, BSNGD, and BNGS1S2, respectively. Each BMM computes batchA SD blocks of the left matrix and batchB SD blocks of the right matrix.
#include "kernel_operator.h"
#include "lib/matmul_intf.h"

extern "C" __global__  __aicore__ void kernel_matmul_rpc_batch(GM_ADDR aGM, GM_ADDR bGM, GM_ADDR cGM, GM_ADDR biasGM, GM_ADDR tilingGM, GM_ADDR workspaceGM, uint32_t isTransposeAIn, uint32_t isTransposeBIn, int32_t batchA,  int32_t batchB)
{
    // Define MatmulType.
    typedef matmul::MatmulType <AscendC::TPosition::GM, CubeFormat::ND, half, false, LayoutMode::BSNGD> aType;
    typedef matmul::MatmulType <AscendC::TPosition::GM, CubeFormat::ND, half, true, LayoutMode::BSNGD> bType;
    typedef matmul::MatmulType <AscendC::TPosition::GM, CubeFormat::ND, float, false, LayoutMode::BNGS1S2> cType;
    typedef matmul::MatmulType <AscendC::TPosition::GM, CubeFormat::ND, float> biasType;
    set_atomic_none();

   // Initialize tiling data.
    TCubeTiling tiling;
    auto tempTilingGM = (__gm__ uint32_t*)tilingGM;
    auto tempTiling = (uint32_t*)&tiling;
    for (int i = 0; i < sizeof(TCubeTiling) / sizeof(int32_t); ++i, ++tempTilingGM, ++tempTiling) {
        *tempTiling = *tempTilingGM;
    }

    // Initialize GM data.
    AscendC::GlobalTensor<half> aGlobal;
    AscendC::GlobalTensor<half> bGlobal;
    AscendC::GlobalTensor<float> cGlobal;
    AscendC::GlobalTensor<float> biasGlobal;
    int32_t sizeA = tiling.ALayoutInfoB * tiling.ALayoutInfoS * tiling.ALayoutInfoN * tiling.ALayoutInfoG * tiling.ALayoutInfoD * sizeof(half);
    int32_t sizeB = tiling.BLayoutInfoB * tiling.BLayoutInfoS * tiling.BLayoutInfoN * tiling.BLayoutInfoG * tiling.BLayoutInfoD * sizeof(half);
    int32_t sizebias = tiling.CLayoutInfoB * tiling.CLayoutInfoN * tiling.CLayoutInfoG * tiling.CLayoutInfoS2 * sizeof(float);
    aGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ half*>(aGM), sizeA);
    bGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ half*>(bGM), sizeB);
    biasGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(biasGM), sizebias);
    tiling.shareMode = 0;            
    tiling.shareL1Size = 512 * 1024;
    tiling.shareL0CSize = 128 * 1024;
    tiling.shareUbSize = 0;
    int offset_a=0, offset_b=0, offset_c=0, offset_bias=0;
    AscendC::GlobalTensor<half> gm_a;
    gm_a.SetGlobalBuffer(const_cast<__gm__ half*>(aGlobal[offset_a].GetPhyAddr()), tiling.ALayoutInfoS * tiling.ALayoutInfoN * tiling.ALayoutInfoG * tiling.ALayoutInfoD);
    AscendC::GlobalTensor<half> gm_b;
    gm_b.SetGlobalBuffer(const_cast<__gm__ half*>(bGlobal[offset_b].GetPhyAddr()), tiling.BLayoutInfoS * tiling.BLayoutInfoN * tiling.BLayoutInfoG * tiling.BLayoutInfoD);
    AscendC::GlobalTensor<float> gm_bias;
    gm_bias.SetGlobalBuffer(const_cast<__gm__ float*>(biasGlobal[offset_bias].GetPhyAddr()), tiling.CLayoutInfoN * tiling.CLayoutInfoG * tiling.CLayoutInfoS2);
   // Create a Matmul instance.
    matmul::Matmul<aType, bType, cType, biasType> mm1;
    AscendC::TPipe pipe;
    g_cubeTPipePtr = &pipe;
   
    REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm1);
    mm1.Init(&tiling);
    // Number of IterateBatch loops: total batch count divided by batches per loop.
    int for_extent = tiling.ALayoutInfoB * tiling.ALayoutInfoN * tiling.ALayoutInfoG / tiling.BatchNum;
    mm1.SetTensorA(gm_a[0], isTransposeAIn);
    mm1.SetTensorB(gm_b[0], isTransposeBIn);
    mm1.SetWorkspace(workspaceGM, 0);
    if (tiling.isBias) {
        mm1.SetBias(gm_bias[0]);
    }
    // Execute multi-batch Matmul computation.
    mm1.IterateNBatch(for_extent, batchA, batchB, false);
}

Precautions

  • The computation within a single BMM complies with the constraints described earlier.
  • For the BSNGD, SBNGD, and BNGS1S2 layouts, the total size of the multiple batches of matrix A and matrix B must be less than the L1 buffer size. The NORMAL layout has no such restriction, but you need to configure the relationship between the multi-batch sizes of matrices A and B and the L1 buffer size through MatmulConfig.
  • The total output size of multiple BMMs must be less than the memory allocated on the vector core.
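The L1 buffer precaution above can be sketched as a simple capacity check. The function name, the per-tile byte sizes, and the 512 KB default (taken from the shareL1Size value in the example's tiling) are illustrative assumptions; the actual L1 capacity depends on the target chip.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative check for the L1 buffer precaution: the combined size of the
// batched A and B tiles staged per IterateBatch must fit in the L1 buffer.
// 512 KB matches the shareL1Size used in the example tiling; real capacity
// is chip-dependent.
bool FitsInL1(uint32_t batchA, uint32_t batchB,
              size_t aTileBytes, size_t bTileBytes,
              size_t l1Bytes = 512 * 1024) {
    return batchA * aTileBytes + batchB * bTileBytes <= l1Bytes;
}
```

For example, 4 batches of 32 KB tiles on each side (256 KB total) fit in a 512 KB L1 buffer, while 8 batches of 48 KB tiles on each side (768 KB total) do not, so the tiling would have to reduce the batch number or tile size.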