IterateNBatch
Applicability
Product |
Supported |
|---|---|
√ |
|
√ |
|
x |
|
x |
|
x |
|
x |
Function
When IterateNBatch is called once, IterateBatch computation is performed for N times to compute N multi-batch matrix C with a size of singleCoreM × singleCoreN. Before calling this API, set isNBatch in MatmulConfig to true to enable the multi-batch input and output function, and call SetWorkspace to allocate temporary space for caching computation results. That is, the result of IterateNBatch is output to the global memory specified by SetWorkspace.
For the BSNGD, SBNGD, and BNGS1S2 layouts, before calling this API, you need to use SetALayout, SetBLayout, SetCLayout, and SetBatchNum in tiling to set the layout axis information and maximum number of batches for matrices A, B, and C. For the NORMAL layout, use SetBatchInfoForNormal to set the M, N, and K axes of matrices A, B, and C, and the number of batches of matrices A and B. During Matmul instantiation, MatmulType is used to set the layout type. Currently, three layout types are supported: BSNGD, SBNGD, and BNGS1S2.
Prototype
1 2 | template <bool sync = true, bool waitIterateBatch = false> __aicore__ inline void IterateNBatch(const uint32_t batchLoop, uint32_t batchA, uint32_t batchB, bool enSequentialWrite, const uint32_t matrixStrideA = 0, const uint32_t matrixStrideB = 0, const uint32_t matrixStrideC = 0) |
Parameters
Parameter |
Description |
|---|---|
sync |
Matrix C can be obtained in synchronous or asynchronous mode.
This parameter specifies the two modes: true for the synchronous mode and false for the asynchronous mode. The synchronous mode is used by default. |
waitIterateBatch |
Used only in asynchronous scenarios, indicating whether to use WaitIterateBatch to wait for the completion of IterateNBatch execution. The default value is false. true: Use the WaitIterateBatch API to wait until IterateNBatch is executed. Then, developers can obtain the computation result output to the global memory. false: Do not need to use the WaitIterateBatch API to wait until IterateNBatch is executed. After calling this API, you need to call the GetBatchTensorC API to obtain matrix C or wait until IterateNBatch is executed. |
Parameter |
Input/Output |
Description |
|---|---|---|
batchLoop |
Input |
Number of currently computed BMMs. |
batchA |
Input |
Number of batches of the left matrix in a single BMM call. |
batchB |
Input |
Number of batches of the right matrix in a single BMM call. batchA and batchB are different in the brc scenario. |
enSequentialWrite |
Input |
Whether the output data is stored continuously. |
matrixStrideA |
Input |
Offset between the start addresses of adjacent nd matrices of the matrix A's source operand. The default value is 0. |
matrixStrideB |
Input |
Offset between the start addresses of adjacent nd matrices of the matrix A's source operand. The default value is 0. |
matrixStrideC |
Input |
This parameter is reserved and can be ignored. |
Returns
None
Restrictions
- The computation within a single BMM complies with the previous constraints.
- For the BSNGD, SBNGD, and BNGS1S2 layout formats, the total batch data input to matrices A and B must be less than the size of L1 Buffer.
- This API is not supported when enableMixDualMaster (dual-master mode) is set to true.
Example
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 | #include "kernel_operator.h" #include "lib/matmul_intf.h" extern "C" __global__ __aicore__ void kernel_matmul_rpc_batch(GM_ADDR aGM, GM_ADDR bGM, GM_ADDR cGM, GM_ADDR biasGM, GM_ADDR tilingGM, GM_ADDR workspaceGM, uint32_t isTransposeAIn, uint32_t isTransposeBIn, int32_t batchA, int32_t batchB) { // Define MatmulType. typedef AscendC::MatmulType <AscendC::TPosition::GM, CubeFormat::ND, half, false, LayoutMode::BSNGD> aType; typedef AscendC::MatmulType <AscendC::TPosition::GM, CubeFormat::ND, half, true, LayoutMode::BSNGD> bType; typedef AscendC::MatmulType <AscendC::TPosition::GM, CubeFormat::ND, float, false, LayoutMode::BNGS1S2> cType; typedef AscendC::MatmulType <AscendC::TPosition::GM, CubeFormat::ND, float> biasType; SetAtomicNone(); // Initialize tiling data. TCubeTiling tiling; auto tempTilingGM = (__gm__ uint32_t*)tilingGM; auto tempTiling = (uint32_t*)&tiling; for (int i = 0; i < sizeof(TCubeTiling) / sizeof(int32_t); ++i, ++tempTilingGM, ++tempTiling) { *tempTiling = *tempTilingGM; } // Initialize GM data. AscendC::GlobalTensor<half> aGlobal; AscendC::GlobalTensor<half> bGlobal; AscendC::GlobalTensor<float> cGlobal; AscendC::GlobalTensor<float> biasGlobal; int32_t sizeA = tiling.ALayoutInfoB * tiling.ALayoutInfoS * tiling.ALayoutInfoN * tiling.ALayoutInfoG * tiling.ALayoutInfoD * sizeof(half); int32_t sizeB = tiling.BLayoutInfoB * tiling.BLayoutInfoS * tiling.BLayoutInfoN * tiling.BLayoutInfoG * tiling.BLayoutInfoD * sizeof(half); int32_t sizebias = tiling.CLayoutInfoB * tiling.CLayoutInfoN * tiling.CLayoutInfoG * tiling.CLayoutInfoS2 * sizeof(float); aGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ half*>(aGM), sizeA); bGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ half*>(bGM), sizeB); biasGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(biasGM), sizebias); tiling.shareMode = 0; tiling.shareL1Size = 512 * 1024; tiling.shareL0CSize = 128 * 1024; tiling.shareUbSize = 0; int offset_a=0, offset_b=0, offset_c=0, offset_bias=0; AscendC::GlobalTensor<A_T> gm_a; gm_a.SetGlobalBuffer(const_cast<__gm__ half*>(aGlobal[offset_a].GetPhyAddr()), tiling.ALayoutInfoS * tiling.ALayoutInfoN * tiling.ALayoutInfoG * tiling.ALayoutInfoD); AscendC::GlobalTensor<B_T> gm_b; gm_b.SetGlobalBuffer(const_cast<__gm__ half*>(bGlobal[offset_b].GetPhyAddr()), tiling.BLayoutInfoS * tiling.BLayoutInfoN * tiling.BLayoutInfoG * tiling.BLayoutInfoD); AscendC::GlobalTensor<BiasT> gm_bias; gm_bias.SetGlobalBuffer(const_cast<__gm__ float*>(biasGlobal[offset_bias].GetPhyAddr()), tiling.CLayoutInfoN * tiling.CLayoutInfoG * tiling.CLayoutInfoS2); // Create a Matmul instance. AscendC::Matmul<aType, bType, cType, biasType> mm1; AscendC::TPipe pipe; g_cubeTPipePtr = &pipe; REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm1); mm1.Init(&tiling); int g_lay = tiling.ALayoutInfoG > tiling.BLayoutInfoG ? tiling.ALayoutInfoG : tiling.BLayoutInfoG; int for_extent = tiling.ALayoutInfoB * tiling.ALayoutInfoN * g_lay / tiling.BatchNum; mm1.SetTensorA(gm_a[0], isTransposeAIn); mm1.SetTensorB(gm_b[0], isTransposeBIn); mm1.SetWorkspace(workspaceGM, 0); if (tiling.isBias) { mm1.SetBias(gm_bias[0]); } // Execute multi-batch Matmul computation. mm1.IterateNBatch(for_extent, batchA, batchB, false); } |