IterateNBatch
Function Description
A single call to IterateNBatch performs the IterateBatch computation N times, producing N multi-batch results of matrix C, each of size singleCoreM × singleCoreN.
For the BSNGD, SBNGD, and BNGS1S2 layouts, call SetALayout, SetBLayout, SetCLayout, and SetBatchNum during tiling before calling this API, to set the layout axis information and the maximum number of batches for matrices A, B, and C (a host-side sketch follows). For the NORMAL layout, use SetBatchInfoForNormal instead to set the M, N, and K axes of matrices A, B, and C and the number of batches of matrices A and B. The layout type is set through MatmulType during Matmul instantiation; four layout types are currently supported: BSNGD, SBNGD, BNGS1S2, and NORMAL.
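For illustration, a minimal host-side tiling sketch for the BSNGD layout is shown below. The tiling class, the (B, S, N, G, D) axis order of the layout setters, and the example axis values are assumptions; see the tiling API reference for the authoritative signatures.

```cpp
// Host-side tiling sketch (assumed class and axis order; axis values are examples only).
#include "tiling/tiling_api.h"

void SetBatchLayoutTiling(optiling::TCubeTiling &tcubeTiling)
{
    matmul_tiling::MultiCoreMatmulTiling tiling;
    tiling.SetAType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND,
                    matmul_tiling::DataType::DT_FLOAT16, false);
    tiling.SetBType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND,
                    matmul_tiling::DataType::DT_FLOAT16, true);
    tiling.SetCType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND,
                    matmul_tiling::DataType::DT_FLOAT);
    tiling.SetALayout(2, 64, 4, 1, 128);  // layout axes (B, S, N, G, D) of matrix A
    tiling.SetBLayout(2, 64, 4, 1, 128);  // layout axes (B, S, N, G, D) of matrix B
    tiling.SetCLayout(2, 64, 4, 1, 64);   // layout axes of matrix C
    tiling.SetBatchNum(4);                // maximum number of batches in one IterateBatch
    tiling.GetTiling(tcubeTiling);        // emit the TCubeTiling consumed by the kernel
}
```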
Prototype
```cpp
template <bool sync = true, bool waitIterateBatch = false>
__aicore__ inline void IterateNBatch(const uint32_t batchLoop, uint32_t batchA, uint32_t batchB,
    bool enSequentialWrite, const uint32_t matrixStrideA = 0, const uint32_t matrixStrideB = 0,
    const uint32_t matrixStrideC = 0)
```
Parameters
Template parameters

| Parameter | Description |
|---|---|
| sync | Specifies whether matrix C is obtained in synchronous or asynchronous mode: true for synchronous (the default), false for asynchronous. In asynchronous scenarios, this API must be used together with GetBatchC. |
| waitIterateBatch | Used only in asynchronous scenarios; indicates whether WaitIterateBatch is used to wait for the completion of IterateBatch execution. The default value is false. true: WaitIterateBatch waits for IterateBatch to complete. false: WaitIterateBatch is not used; developers handle the waiting themselves. |
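For the asynchronous mode, the overall call pattern looks roughly as follows. This is a sketch only: the GetBatchC arguments shown are illustrative, and the exact signature is documented on the GetBatchC page.

```cpp
// Asynchronous sketch: sync = false, waitIterateBatch = true (assumed call pattern).
mm.IterateNBatch<false, true>(batchLoop, batchA, batchB, false);  // launch without blocking
for (uint32_t i = 0; i < batchLoop; ++i) {
    // Asynchronous mode must be paired with GetBatchC; the arguments here are illustrative.
    auto batchC = mm.GetBatchC(batchA, batchB);
    // ... consume or copy out this batch of matrix C ...
}
```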
Function parameters

| Parameter | Input/Output | Description |
|---|---|---|
| batchLoop | Input | Number of BMM computations to perform, that is, the N in IterateNBatch. |
| batchA | Input | Number of batches of the left matrix in a single BMM computation. |
| batchB | Input | Number of batches of the right matrix in a single BMM computation. batchA and batchB differ in the broadcast (brc) scenario. |
| enSequentialWrite | Input | Whether the output data is stored contiguously. |
| matrixStrideA | Input | Offset between the start addresses of adjacent ND matrices of the matrix A source operand. |
| matrixStrideB | Input | Offset between the start addresses of adjacent ND matrices of the matrix B source operand. |
| matrixStrideC | Input | Offset between the start addresses of adjacent ND matrices of the matrix C operand. |
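As a worked illustration of the stride parameters (the shapes below are hypothetical, and the strides are assumed to be measured in elements):

```cpp
// Hypothetical shapes for illustration; the strides are assumed to be in elements.
constexpr uint32_t m = 64, k = 128, n = 256;
constexpr uint32_t gap = 16;           // assumed extra elements between adjacent matrices
uint32_t matrixStrideA = m * k + gap;  // start-to-start distance of adjacent A matrices
uint32_t matrixStrideB = k * n + gap;  // start-to-start distance of adjacent B matrices
// With the default value 0, no extra offset is applied between adjacent matrices.
```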
Returns
None
Availability
Example
```cpp
#include "kernel_operator.h"
#include "lib/matmul_intf.h"

extern "C" __global__ __aicore__ void kernel_matmul_rpc_batch(GM_ADDR aGM, GM_ADDR bGM, GM_ADDR cGM, GM_ADDR biasGM,
    GM_ADDR tilingGM, GM_ADDR workspaceGM, uint32_t isTransposeAIn, uint32_t isTransposeBIn, int32_t batchA, int32_t batchB)
{
    // Define MatmulType.
    typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half, false, LayoutMode::BSNGD> aType;
    typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half, true, LayoutMode::BSNGD> bType;
    typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float, false, LayoutMode::BNGS1S2> cType;
    typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float> biasType;
    // Element types of the A, B, and bias operands, taken from the MatmulType definitions.
    using A_T = typename aType::T;
    using B_T = typename bType::T;
    using BiasT = typename biasType::T;
    set_atomic_none();

    // Initialize tiling data by copying it from GM.
    TCubeTiling tiling;
    auto tempTilingGM = (__gm__ uint32_t*)tilingGM;
    auto tempTiling = (uint32_t*)&tiling;
    for (int i = 0; i < sizeof(TCubeTiling) / sizeof(int32_t); ++i, ++tempTilingGM, ++tempTiling) {
        *tempTiling = *tempTilingGM;
    }

    // Initialize GM data.
    AscendC::GlobalTensor<half> aGlobal;
    AscendC::GlobalTensor<half> bGlobal;
    AscendC::GlobalTensor<float> biasGlobal;
    int32_t sizeA = tiling.ALayoutInfoB * tiling.ALayoutInfoS * tiling.ALayoutInfoN * tiling.ALayoutInfoG *
        tiling.ALayoutInfoD * sizeof(half);
    int32_t sizeB = tiling.BLayoutInfoB * tiling.BLayoutInfoS * tiling.BLayoutInfoN * tiling.BLayoutInfoG *
        tiling.BLayoutInfoD * sizeof(half);
    int32_t sizeBias = tiling.CLayoutInfoB * tiling.CLayoutInfoN * tiling.CLayoutInfoG * tiling.CLayoutInfoS2 * sizeof(float);
    aGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ half*>(aGM), sizeA);
    bGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ half*>(bGM), sizeB);
    biasGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(biasGM), sizeBias);
    tiling.shareMode = 0;
    tiling.shareL1Size = 512 * 1024;
    tiling.shareL0CSize = 128 * 1024;
    tiling.shareUbSize = 0;
    int offset_a = 0, offset_b = 0, offset_bias = 0;
    AscendC::GlobalTensor<A_T> gm_a;
    gm_a.SetGlobalBuffer(const_cast<__gm__ half*>(aGlobal[offset_a].GetPhyAddr()),
        tiling.ALayoutInfoS * tiling.ALayoutInfoN * tiling.ALayoutInfoG * tiling.ALayoutInfoD);
    AscendC::GlobalTensor<B_T> gm_b;
    gm_b.SetGlobalBuffer(const_cast<__gm__ half*>(bGlobal[offset_b].GetPhyAddr()),
        tiling.BLayoutInfoS * tiling.BLayoutInfoN * tiling.BLayoutInfoG * tiling.BLayoutInfoD);
    AscendC::GlobalTensor<BiasT> gm_bias;
    gm_bias.SetGlobalBuffer(const_cast<__gm__ float*>(biasGlobal[offset_bias].GetPhyAddr()),
        tiling.CLayoutInfoN * tiling.CLayoutInfoG * tiling.CLayoutInfoS2);

    // Create a Matmul instance.
    matmul::Matmul<aType, bType, cType, biasType> mm1;
    AscendC::TPipe pipe;
    g_cubeTPipePtr = &pipe;
    REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm1);
    mm1.Init(&tiling);
    // Number of BMM computations: total layout batches divided by the batches per IterateBatch.
    int for_extent = tiling.ALayoutInfoB * tiling.ALayoutInfoN * tiling.ALayoutInfoG / tiling.BatchNum;
    mm1.SetTensorA(gm_a[0], isTransposeAIn);
    mm1.SetTensorB(gm_b[0], isTransposeBIn);
    mm1.SetWorkspace(workspaceGM, 0);
    if (tiling.isBias) {
        mm1.SetBias(gm_bias[0]);
    }
    // Execute the multi-batch Matmul computation.
    mm1.IterateNBatch(for_extent, batchA, batchB, false);
}
```
Precautions
- The computation within a single BMM is subject to the constraints described earlier.
- For the BSNGD, SBNGD, and BNGS1S2 layouts, the total size of the multiple batches of matrix A and matrix B must be less than the size of the L1 Buffer. The NORMAL layout has no such restriction, but you must declare the relationship between the multi-batch size of matrices A and B and the L1 Buffer size through MatmulConfig, as sketched after this list.
- The total output size of the multiple BMM computations must be less than the memory size allocated on the vector core.
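Below is a sketch of the MatmulConfig setting mentioned above for the NORMAL layout. The batchMode field, the BatchMode values, and GetNormalConfig are assumptions drawn from the general Matmul configuration pattern; consult the MatmulConfig reference for the exact names.

```cpp
// Sketch only: declaring via MatmulConfig that the multi-batch A/B data may exceed the
// L1 Buffer for the NORMAL layout. The batchMode field and BatchMode values are assumptions.
constexpr MatmulConfig GetNBatchConfig()
{
    MatmulConfig cfg = GetNormalConfig();
    cfg.batchMode = BatchMode::BATCH_LARGE_THAN_L1;  // multi-batch A/B larger than the L1 Buffer
    return cfg;
}
constexpr MatmulConfig MM_CFG = GetNBatchConfig();
// Pass the config as the fifth template argument at instantiation time.
matmul::Matmul<aType, bType, cType, biasType, MM_CFG> mm;
```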