IterateNBatch

Applicability

Product	Supported
Atlas A3 training products/Atlas A3 inference products	√
Atlas A2 training products/Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
Atlas inference product's AI Core	x
Atlas inference product's Vector Core	x
Atlas training products	x

Function

When IterateNBatch is called once, IterateBatch computation is performed for N times to compute N multi-batch matrix C with a size of singleCoreM × singleCoreN. Before calling this API, set isNBatch in MatmulConfig to true to enable the multi-batch input and output function, and call SetWorkspace to allocate temporary space for caching computation results. That is, the result of IterateNBatch is output to the global memory specified by SetWorkspace.

For the BSNGD, SBNGD, and BNGS1S2 layouts, before calling this API, you need to use SetALayout, SetBLayout, SetCLayout, and SetBatchNum in tiling to set the layout axis information and maximum number of batches for matrices A, B, and C. For the NORMAL layout, use SetBatchInfoForNormal to set the M, N, and K axes of matrices A, B, and C, and the number of batches of matrices A and B. During Matmul instantiation, MatmulType is used to set the layout type. Currently, three layout types are supported: BSNGD, SBNGD, and BNGS1S2.

Prototype

template <bool sync = true, bool waitIterateBatch = false>
__aicore__ inline void IterateNBatch(const uint32_t batchLoop, uint32_t batchA, uint32_t batchB, bool enSequentialWrite, const uint32_t matrixStrideA = 0, const uint32_t matrixStrideB = 0, const uint32_t matrixStrideC = 0)

Parameters

**Table 1** Template parameters
Parameter	Description
sync	Matrix C can be obtained in synchronous or asynchronous mode. Synchronous: Wait until IterateNBatch is executed. Developers can obtain the computation result output to the global memory. Asynchronous: Do not need to wait until IterateNBatch is executed. This parameter specifies the two modes: true for the synchronous mode and false for the asynchronous mode. The synchronous mode is used by default.
waitIterateBatch	Used only in asynchronous scenarios, indicating whether to use WaitIterateBatch to wait for the completion of IterateNBatch execution. The default value is false. true: Use the WaitIterateBatch API to wait until IterateNBatch is executed. Then, developers can obtain the computation result output to the global memory. false: Do not need to use the WaitIterateBatch API to wait until IterateNBatch is executed. After calling this API, you need to call the GetBatchTensorC API to obtain matrix C or wait until IterateNBatch is executed.

Parameter	Input/Output	Description
batchLoop	Input	Number of currently computed BMMs.
batchA	Input	Number of batches of the left matrix in a single BMM call.
batchB	Input	Number of batches of the right matrix in a single BMM call. batchA and batchB are different in the brc scenario.
enSequentialWrite	Input	Whether the output data is stored continuously.
matrixStrideA	Input	Offset between the start addresses of adjacent nd matrices of the matrix A's source operand. The default value is 0.
matrixStrideB	Input	Offset between the start addresses of adjacent nd matrices of the matrix B's source operand. The default value is 0.
matrixStrideC	Input	This parameter is reserved and can be ignored.

Returns

None

Restrictions

The computation within a single BMM complies with the previous constraints.
For the BSNGD, SBNGD, and BNGS1S2 layout formats, the total batch data input to matrices A and B must be less than the size of L1 Buffer.
This API is not supported when enableMixDualMaster (dual-master mode) is set to true.

Example

In this example, the aGM and bGM matrices are multiplied and the result is saved to cGM. The layout format of the aGM, bGM, and cGM data is BSNGD, BSNGD, and BNGS1S2, respectively. The left matrix computes batchA SD data each time, and the right matrix computes batchB SD data each time.

#include "kernel_operator.h"
#include "lib/matmul_intf.h"

extern "C" __global__  __aicore__ void kernel_matmul_rpc_batch(GM_ADDR aGM, GM_ADDR bGM, GM_ADDR cGM, GM_ADDR biasGM, GM_ADDR tilingGM, GM_ADDR workspaceGM, uint32_t isTransposeAIn, uint32_t isTransposeBIn, int32_t batchA,  int32_t batchB)
{
    // Define MatmulType.
    typedef AscendC::MatmulType <AscendC::TPosition::GM, CubeFormat::ND, half, false, LayoutMode::BSNGD> aType;
    typedef AscendC::MatmulType <AscendC::TPosition::GM, CubeFormat::ND, half, true, LayoutMode::BSNGD> bType;
    typedef AscendC::MatmulType <AscendC::TPosition::GM, CubeFormat::ND, float, false, LayoutMode::BNGS1S2> cType;
    typedef AscendC::MatmulType <AscendC::TPosition::GM, CubeFormat::ND, float> biasType;
    SetAtomicNone();

   // Initialize tiling data.
    TCubeTiling tiling;
    auto tempTilingGM = (__gm__ uint32_t*)tilingGM;
    auto tempTiling = (uint32_t*)&tiling;
    for (int i = 0; i < sizeof(TCubeTiling) / sizeof(int32_t); ++i, ++tempTilingGM, ++tempTiling) {
        *tempTiling = *tempTilingGM;
    }

    // Initialize GM data.
    AscendC::GlobalTensor<half> aGlobal;
    AscendC::GlobalTensor<half> bGlobal;
    AscendC::GlobalTensor<float> cGlobal;
    AscendC::GlobalTensor<float> biasGlobal;
    int32_t sizeA = tiling.ALayoutInfoB * tiling.ALayoutInfoS * tiling.ALayoutInfoN * tiling.ALayoutInfoG * tiling.ALayoutInfoD * sizeof(half);
    int32_t sizeB = tiling.BLayoutInfoB * tiling.BLayoutInfoS * tiling.BLayoutInfoN * tiling.BLayoutInfoG * tiling.BLayoutInfoD * sizeof(half);
    int32_t sizebias = tiling.CLayoutInfoB * tiling.CLayoutInfoN * tiling.CLayoutInfoG * tiling.CLayoutInfoS2 * sizeof(float);
    aGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ half*>(aGM), sizeA);
    bGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ half*>(bGM), sizeB);
    biasGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(biasGM), sizebias);
    tiling.shareMode = 0;            
    tiling.shareL1Size = 512 * 1024;
    tiling.shareL0CSize = 128 * 1024;
    tiling.shareUbSize = 0;
    int offset_a=0, offset_b=0, offset_c=0, offset_bias=0;
    AscendC::GlobalTensor<A_T> gm_a;
    gm_a.SetGlobalBuffer(const_cast<__gm__ half*>(aGlobal[offset_a].GetPhyAddr()), tiling.ALayoutInfoS * tiling.ALayoutInfoN * tiling.ALayoutInfoG * tiling.ALayoutInfoD);
    AscendC::GlobalTensor<B_T> gm_b;
    gm_b.SetGlobalBuffer(const_cast<__gm__ half*>(bGlobal[offset_b].GetPhyAddr()), tiling.BLayoutInfoS * tiling.BLayoutInfoN * tiling.BLayoutInfoG * tiling.BLayoutInfoD);
    AscendC::GlobalTensor<BiasT> gm_bias;
    gm_bias.SetGlobalBuffer(const_cast<__gm__ float*>(biasGlobal[offset_bias].GetPhyAddr()), tiling.CLayoutInfoN * tiling.CLayoutInfoG * tiling.CLayoutInfoS2);
   // Create a Matmul instance.
    AscendC::Matmul<aType, bType, cType, biasType> mm1;
    AscendC::TPipe pipe;
    g_cubeTPipePtr = &pipe;
   
    REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm1);
    mm1.Init(&tiling);
    int g_lay = tiling.ALayoutInfoG > tiling.BLayoutInfoG ? tiling.ALayoutInfoG : tiling.BLayoutInfoG;
    int for_extent = tiling.ALayoutInfoB * tiling.ALayoutInfoN * g_lay / tiling.BatchNum;
    mm1.SetTensorA(gm_a[0], isTransposeAIn);
    mm1.SetTensorB(gm_b[0], isTransposeBIn);
    mm1.SetWorkspace(workspaceGM, 0);
    if (tiling.isBias) {
        mm1.SetBias(gm_bias[0]);
    }
    // Execute multi-batch Matmul computation.
    mm1.IterateNBatch(for_extent, batchA, batchB, false);
}

Parent topic: Matmul Kernel APIs