IterateNBatch
Function Description
A single call to IterateNBatch performs the IterateBatch computation N times, producing N multi-batch results of matrix C, each of size singleCoreM × singleCoreN.
For the BSNGD, SBNGD, and BNGS1S2 layouts, call SetALayout, SetBLayout, SetCLayout, and SetBatchNum during tiling before calling this API, to set the layout axis information and the maximum number of batches for matrices A, B, and C (a host-side sketch follows). For the NORMAL layout, use SetBatchInfoForNormal instead to set the M, N, and K axes of matrices A, B, and C and the number of batches of matrices A and B. The layout type is set through MatmulType during Matmul instantiation; four layout types are currently supported: BSNGD, SBNGD, BNGS1S2, and NORMAL.
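For illustration, a minimal host-side tiling sketch for the BSNGD layout is shown below. The tiling class, the (B, S, N, G, D) axis order of the layout setters, and the example axis values are assumptions; see the tiling API reference for the authoritative signatures.

```cpp
// Host-side tiling sketch (assumed class and axis order; axis values are examples only).
#include "tiling/tiling_api.h"

void SetBatchLayoutTiling(optiling::TCubeTiling &tcubeTiling)
{
    matmul_tiling::MultiCoreMatmulTiling tiling;
    tiling.SetAType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND,
                    matmul_tiling::DataType::DT_FLOAT16, false);
    tiling.SetBType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND,
                    matmul_tiling::DataType::DT_FLOAT16, true);
    tiling.SetCType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND,
                    matmul_tiling::DataType::DT_FLOAT);
    tiling.SetALayout(2, 64, 4, 1, 128);  // layout axes (B, S, N, G, D) of matrix A
    tiling.SetBLayout(2, 64, 4, 1, 128);  // layout axes (B, S, N, G, D) of matrix B
    tiling.SetCLayout(2, 64, 4, 1, 64);   // layout axes of matrix C
    tiling.SetBatchNum(4);                // maximum number of batches in one IterateBatch
    tiling.GetTiling(tcubeTiling);        // emit the TCubeTiling consumed by the kernel
}
```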
Prototype
```cpp
template <bool sync = true, bool waitIterateBatch = false>
__aicore__ inline void IterateNBatch(const uint32_t batchLoop, uint32_t batchA, uint32_t batchB,
    bool enSequentialWrite, const uint32_t matrixStrideA = 0, const uint32_t matrixStrideB = 0,
    const uint32_t matrixStrideC = 0)
```
Parameters
Template parameters

| Parameter | Description |
|---|---|
| sync | Specifies whether matrix C is obtained in synchronous or asynchronous mode: true for synchronous (the default), false for asynchronous. In asynchronous scenarios, this API must be used together with GetBatchC. |
| waitIterateBatch | Used only in asynchronous scenarios; indicates whether WaitIterateBatch is used to wait for the completion of IterateBatch execution. The default value is false. true: WaitIterateBatch waits for IterateBatch to complete. false: WaitIterateBatch is not used; developers handle the waiting themselves. |
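For the asynchronous mode, the overall call pattern looks roughly as follows. This is a sketch only: the GetBatchC arguments shown are illustrative, and the exact signature is documented on the GetBatchC page.

```cpp
// Asynchronous sketch: sync = false, waitIterateBatch = true (assumed call pattern).
mm.IterateNBatch<false, true>(batchLoop, batchA, batchB, false);  // launch without blocking
for (uint32_t i = 0; i < batchLoop; ++i) {
    // Asynchronous mode must be paired with GetBatchC; the arguments here are illustrative.
    auto batchC = mm.GetBatchC(batchA, batchB);
    // ... consume or copy out this batch of matrix C ...
}
```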
Function parameters

| Parameter | Input/Output | Description |
|---|---|---|
| batchLoop | Input | Number of BMM computations to perform, that is, the N in IterateNBatch. |
| batchA | Input | Number of batches of the left matrix in a single BMM computation. |
| batchB | Input | Number of batches of the right matrix in a single BMM computation. batchA and batchB differ in the broadcast (brc) scenario. |
| enSequentialWrite | Input | Whether the output data is stored contiguously. |
| matrixStrideA | Input | Offset between the start addresses of adjacent ND matrices of the matrix A source operand. |
| matrixStrideB | Input | Offset between the start addresses of adjacent ND matrices of the matrix B source operand. |
| matrixStrideC | Input | Offset between the start addresses of adjacent ND matrices of the matrix C operand. |
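As a worked illustration of the stride parameters (the shapes below are hypothetical, and the strides are assumed to be measured in elements):

```cpp
// Hypothetical shapes for illustration; the strides are assumed to be in elements.
constexpr uint32_t m = 64, k = 128, n = 256;
constexpr uint32_t gap = 16;           // assumed extra elements between adjacent matrices
uint32_t matrixStrideA = m * k + gap;  // start-to-start distance of adjacent A matrices
uint32_t matrixStrideB = k * n + gap;  // start-to-start distance of adjacent B matrices
// With the default value 0, no extra offset is applied between adjacent matrices.
```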
Returns
None
Availability
Example
```cpp
#include "kernel_operator.h"
#include "lib/matmul_intf.h"

extern "C" __global__ __aicore__ void kernel_matmul_rpc_batch(GM_ADDR aGM, GM_ADDR bGM, GM_ADDR cGM, GM_ADDR biasGM,
    GM_ADDR tilingGM, GM_ADDR workspaceGM, uint32_t isTransposeAIn, uint32_t isTransposeBIn, int32_t batchA, int32_t batchB)
{
    // Define MatmulType.
    typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half, false, LayoutMode::BSNGD> aType;
    typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half, true, LayoutMode::BSNGD> bType;
    typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float, false, LayoutMode::BNGS1S2> cType;
    typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float> biasType;
    // Element types of the A, B, and bias operands, taken from the MatmulType definitions.
    using A_T = typename aType::T;
    using B_T = typename bType::T;
    using BiasT = typename biasType::T;
    set_atomic_none();

    // Initialize tiling data by copying it from GM.
    TCubeTiling tiling;
    auto tempTilingGM = (__gm__ uint32_t*)tilingGM;
    auto tempTiling = (uint32_t*)&tiling;
    for (int i = 0; i < sizeof(TCubeTiling) / sizeof(int32_t); ++i, ++tempTilingGM, ++tempTiling) {
        *tempTiling = *tempTilingGM;
    }

    // Initialize GM data.
    AscendC::GlobalTensor<half> aGlobal;
    AscendC::GlobalTensor<half> bGlobal;
    AscendC::GlobalTensor<float> biasGlobal;
    int32_t sizeA = tiling.ALayoutInfoB * tiling.ALayoutInfoS * tiling.ALayoutInfoN * tiling.ALayoutInfoG *
        tiling.ALayoutInfoD * sizeof(half);
    int32_t sizeB = tiling.BLayoutInfoB * tiling.BLayoutInfoS * tiling.BLayoutInfoN * tiling.BLayoutInfoG *
        tiling.BLayoutInfoD * sizeof(half);
    int32_t sizeBias = tiling.CLayoutInfoB * tiling.CLayoutInfoN * tiling.CLayoutInfoG * tiling.CLayoutInfoS2 * sizeof(float);
    aGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ half*>(aGM), sizeA);
    bGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ half*>(bGM), sizeB);
    biasGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(biasGM), sizeBias);
    tiling.shareMode = 0;
    tiling.shareL1Size = 512 * 1024;
    tiling.shareL0CSize = 128 * 1024;
    tiling.shareUbSize = 0;
    int offset_a = 0, offset_b = 0, offset_bias = 0;
    AscendC::GlobalTensor<A_T> gm_a;
    gm_a.SetGlobalBuffer(const_cast<__gm__ half*>(aGlobal[offset_a].GetPhyAddr()),
        tiling.ALayoutInfoS * tiling.ALayoutInfoN * tiling.ALayoutInfoG * tiling.ALayoutInfoD);
    AscendC::GlobalTensor<B_T> gm_b;
    gm_b.SetGlobalBuffer(const_cast<__gm__ half*>(bGlobal[offset_b].GetPhyAddr()),
        tiling.BLayoutInfoS * tiling.BLayoutInfoN * tiling.BLayoutInfoG * tiling.BLayoutInfoD);
    AscendC::GlobalTensor<BiasT> gm_bias;
    gm_bias.SetGlobalBuffer(const_cast<__gm__ float*>(biasGlobal[offset_bias].GetPhyAddr()),
        tiling.CLayoutInfoN * tiling.CLayoutInfoG * tiling.CLayoutInfoS2);

    // Create a Matmul instance.
    matmul::Matmul<aType, bType, cType, biasType> mm1;
    AscendC::TPipe pipe;
    g_cubeTPipePtr = &pipe;
    REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm1);
    mm1.Init(&tiling);
    // Number of BMM computations: total layout batches divided by the batches per IterateBatch.
    int for_extent = tiling.ALayoutInfoB * tiling.ALayoutInfoN * tiling.ALayoutInfoG / tiling.BatchNum;
    mm1.SetTensorA(gm_a[0], isTransposeAIn);
    mm1.SetTensorB(gm_b[0], isTransposeBIn);
    mm1.SetWorkspace(workspaceGM, 0);
    if (tiling.isBias) {
        mm1.SetBias(gm_bias[0]);
    }
    // Execute the multi-batch Matmul computation.
    mm1.IterateNBatch(for_extent, batchA, batchB, false);
}
```
Precautions
- The computation within a single BMM is subject to the constraints described earlier.
- For the BSNGD, SBNGD, and BNGS1S2 layouts, the total size of the multiple batches of matrix A and matrix B must be less than the size of the L1 Buffer. The NORMAL layout has no such restriction, but you must declare the relationship between the multi-batch size of matrices A and B and the L1 Buffer size through MatmulConfig, as sketched after this list.
- The total output size of the multiple BMM computations must be less than the memory size allocated on the vector core.
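Below is a sketch of the MatmulConfig setting mentioned above for the NORMAL layout. The batchMode field, the BatchMode values, and GetNormalConfig are assumptions drawn from the general Matmul configuration pattern; consult the MatmulConfig reference for the exact names.

```cpp
// Sketch only: declaring via MatmulConfig that the multi-batch A/B data may exceed the
// L1 Buffer for the NORMAL layout. The batchMode field and BatchMode values are assumptions.
constexpr MatmulConfig GetNBatchConfig()
{
    MatmulConfig cfg = GetNormalConfig();
    cfg.batchMode = BatchMode::BATCH_LARGE_THAN_L1;  // multi-batch A/B larger than the L1 Buffer
    return cfg;
}
constexpr MatmulConfig MM_CFG = GetNBatchConfig();
// Pass the config as the fifth template argument at instantiation time.
matmul::Matmul<aType, bType, cType, biasType, MM_CFG> mm;
```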