IterateBatch
Applicability
| Product | Supported |
|---|---|
| | √ |
| | √ |
| | x |
| | √ |
| | x |
| | x |
Function
Each call to IterateBatch computes multiple matrices C, each of size singleCoreM × singleCoreN. When the shape processed by a single Matmul computation is small, performance can suffer because every computation involves internal communication. This API addresses that by processing Matmul computations in batches.
Before using this API, you need to understand the following data formats:
- NORMAL: BMNK data format. B indicates the batch processing size. M, N, and K indicate the dimensions of the matrix multiplication [M, K] × [K, N]. The following figure shows the layout format.

- BSH/SBH: B indicates the batch size; S indicates the sequence length; H = N × D, where N is the number of heads and D is the size of each head. The following figure shows the layout format.


- BSNGD: shape after reshaping the original BSH shape. S and D are the M axis (or N axis) and K axis of matrix multiplication of a single batch. An SD is the computation data of a batch. Its layout is shown as follows.

- SBNGD: shape after reshaping the original SBH shape. S and D are the M axis (or N axis) and K axis of matrix multiplication. An SD is the computation data of a batch. Its layout is shown as follows.

- BNGS1S2: matrix multiplication output of the first two layouts. The S1S2 data is stored continuously. An S1S2 is the computation data of a batch. Its layout is shown as follows.

When instantiating the Matmul, you need to set the input and output layouts through MatmulType. Currently, four layouts are supported: BSNGD, SBNGD, BNGS1S2, and NORMAL (BMNK).
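To make the layout names concrete, the following host-side sketch computes linear element offsets for BSNGD and SBNGD. It assumes that each layout name lists its axes from outermost to innermost in a dense row-major buffer, which is an interpretation of the names above rather than something the API guarantees. `QkvShape` and the helper functions are hypothetical and are not part of the Matmul API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical shape descriptor, for illustration only.
struct QkvShape {
    uint64_t B, S, N, G, D;
};

// BSNGD: axes ordered B, S, N, G, D from outermost to innermost (assumed).
inline uint64_t OffsetBSNGD(const QkvShape& sh, uint64_t b, uint64_t s,
                            uint64_t n, uint64_t g, uint64_t d) {
    return (((b * sh.S + s) * sh.N + n) * sh.G + g) * sh.D + d;
}

// SBNGD: same axes, with S outermost and B second (assumed).
inline uint64_t OffsetSBNGD(const QkvShape& sh, uint64_t b, uint64_t s,
                            uint64_t n, uint64_t g, uint64_t d) {
    return (((s * sh.B + b) * sh.N + n) * sh.G + g) * sh.D + d;
}

// Distance in elements between row s and row s + 1 of one batch's S x D matrix.
inline uint64_t RowStrideBSNGD(const QkvShape& sh) { return sh.N * sh.G * sh.D; }
inline uint64_t RowStrideSBNGD(const QkvShape& sh) { return sh.B * sh.N * sh.G * sh.D; }
```

Under this assumption, the rows of a single S × D matrix are not adjacent in memory: consecutive rows are separated by N × G × D elements in BSNGD and by B × N × G × D elements in SBNGD.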
For the BSNGD, SBNGD, and BNGS1S2 layouts, before calling this API, use SetALayout, SetBLayout, SetCLayout, and SetBatchNum in the host tiling implementation to set the layout axis information and the maximum number of batches for matrices A, B, and C. For the NORMAL layout, use SetBatchInfoForNormal to set the M, N, and K axes of matrices A, B, and C and the batch counts of matrices A and B.
The iteration sequence of a single matrix multiplication can be adjusted using the tiling parameter iterateOrder.
For details about batch processing in matrix programming, see Batch Matmul basic functions.
Prototype
- Mix mode
  - Output to GM

        template <bool sync = true, bool waitIterateBatch = false>
        __aicore__ inline void IterateBatch(const GlobalTensor<DstT>& gm, uint32_t batchA, uint32_t batchB, bool enSequentialWrite, const uint32_t matrixStrideA = 0, const uint32_t matrixStrideB = 0, const uint32_t matrixStrideC = 0, const bool enPartialSum = false, const uint8_t enAtomic = 0)

  - Output to VECIN

        template <bool sync = true>
        __aicore__ inline void IterateBatch(const LocalTensor<DstT>& ubCmatrix, uint32_t batchA, uint32_t batchB, bool enSequentialWrite, const uint32_t matrixStrideA = 0, const uint32_t matrixStrideB = 0, const uint32_t matrixStrideC = 0, const bool enPartialSum = false, const uint8_t enAtomic = 0)

- CUBE-ONLY mode

  Before using this function, call SetBatchNum to set the sizes of batch A and batch B.
  - Output to GM

        __aicore__ inline void IterateBatch(const GlobalTensor<DstT>& gm, bool enPartialSum, uint8_t enAtomic, bool enSequentialWrite, const uint32_t matrixStrideA = 0, const uint32_t matrixStrideB = 0, const uint32_t matrixStrideC = 0)

  - Output to VECIN

        __aicore__ inline void IterateBatch(const LocalTensor<DstT>& ubCmatrix, bool enPartialSum, uint8_t enAtomic, bool enSequentialWrite, const uint32_t matrixStrideA = 0, const uint32_t matrixStrideB = 0, const uint32_t matrixStrideC = 0)

Parameters
| Parameter | Description |
|---|---|
| sync | Specifies whether matrix C is obtained in synchronous or asynchronous mode: true for synchronous mode, false for asynchronous mode. The default is true (synchronous). In asynchronous scenarios, this API must be used together with WaitIterateBatch. |
| waitIterateBatch | Used only in asynchronous scenarios. Indicates whether WaitIterateBatch is used to wait for IterateBatch to complete. true: WaitIterateBatch is used to wait for completion. false (default): WaitIterateBatch is not used, and the developer handles the wait. |

| Parameter | Input/Output | Description |
|---|---|---|
| gm | Output | Matrix C. The type is GlobalTensor. |
| ubCmatrix | Output | Matrix C. The type is LocalTensor. |
| batchA | Input | Number of batches of the left matrix. |
| batchB | Input | Number of batches of the right matrix. If batchA and batchB differ, broadcast is performed by default. Multi-batch computation supports input broadcast and output reduce on the G axis. The G-axis dimensions of the left and right matrices must be integer multiples of each other. |
| enSequentialWrite | Input | Whether the output data is stored contiguously, that is, whether continuous write mode is enabled. Continuous write: data is written to [baseM, baseN]. Non-continuous write: data is written to the corresponding position in [singleCoreM, singleCoreN]. |
| matrixStrideA | Input | Offset, in elements, between the start addresses of adjacent ND matrices in the source operand of matrix A. The default value is 0. |
| matrixStrideB | Input | Offset, in elements, between the start addresses of adjacent ND matrices in the source operand of matrix B. The default value is 0. |
| matrixStrideC | Input | This parameter is reserved and can be ignored. |
| enPartialSum | Input | Whether to accumulate the matrix multiplication result onto the existing CO1 data. The default value is false. During L0C accumulation, matrix C produced by multiplying matrix A and matrix B must satisfy singleM == baseM && singleN == baseN. |
| enAtomic | Input | Whether to enable the Atomic operation. Values: 0 (default): disables the Atomic operation. 1: enables AtomicAdd (accumulation). 2: enables AtomicMax (maximum value). 3: enables AtomicMin (minimum value). |
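The batchA and batchB semantics can be illustrated with a plain C++ reference loop for the NORMAL (BMNK) layout. This is a host-side sketch for reasoning about results, not the kernel implementation. `BatchMatmulRef` is a hypothetical name, and the broadcast rule is an assumption: the side with fewer batches is reused for contiguous groups of output batches, which matches broadcasting along an inner G axis. Output reduce on the G axis is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference batched matmul for illustration only.
// a: batchA row-major matrices of shape [M, K]; b: batchB row-major matrices of shape [K, N].
// Returns max(batchA, batchB) row-major matrices of shape [M, N].
std::vector<float> BatchMatmulRef(const std::vector<float>& a, const std::vector<float>& b,
                                  size_t batchA, size_t batchB,
                                  size_t M, size_t N, size_t K) {
    assert(batchA > 0 && batchB > 0);
    const size_t batchC = batchA > batchB ? batchA : batchB;
    // The larger batch count must be an integer multiple of the smaller one.
    assert(batchC % batchA == 0 && batchC % batchB == 0);
    const size_t repA = batchC / batchA;  // outputs that share one A matrix
    const size_t repB = batchC / batchB;  // outputs that share one B matrix

    std::vector<float> c(batchC * M * N, 0.0f);
    for (size_t i = 0; i < batchC; ++i) {
        // Assumed broadcast rule: contiguous groups of outputs reuse the same input.
        const float* pa = a.data() + (i / repA) * M * K;
        const float* pb = b.data() + (i / repB) * K * N;
        float* pc = c.data() + i * M * N;
        for (size_t m = 0; m < M; ++m) {
            for (size_t n = 0; n < N; ++n) {
                float acc = 0.0f;
                for (size_t k = 0; k < K; ++k) {
                    acc += pa[m * K + k] * pb[k * N + n];
                }
                pc[m * N + n] = acc;
            }
        }
    }
    return c;
}
```

For example, with batchA = 2 and batchB = 1, both output matrices are computed against the single B matrix.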
Returns
None
Restrictions
- This API supports only the Norm template. That is, BatchMatmul operators support only the Norm template.
- For the BSNGD, SBNGD, and BNGS1S2 layouts, the total size of multiple batches of matrix A and matrix B, after being aligned according to the fractal dimension, must be less than the size of L1 Buffer. There is no such restriction on the NORMAL layout mode, but you need to configure the relationship between the size of multiple batches of matrix A and matrix B and the size of L1 Buffer by using MatmulConfig.
- For the BSNGD, SBNGD, and BNGS1S2 layouts, if the G axes of the left and right matrices are ALayoutInfoG and BLayoutInfoG, respectively, then ALayoutInfoG/batchA = BLayoutInfoG/batchB must hold. For the NORMAL layout, one of batchA and batchB must be an integer multiple of the other.
- If data is output to Unified Buffer, the size of the output matrix C (BaseM × BaseN) must be less than the size of the allocated Unified Buffer.
- When the API data is output to Unified Buffer and the size of the N direction for single-core computation (singleCoreN) is not 32-byte aligned, CubeFormat of matrix C only supports the ND_ALIGN format. When matrix C slices are output, the data along the singleCoreN direction is automatically padded to 32 bytes.
- For the BSNGD and SBNGD layouts, the input and output data must be in ND format. For the BNGS1S2 and NORMAL layouts, the input data can be in ND or NZ format.
- For the BSNGD and SBNGD layouts, continuous write is not supported.
- This API does not support the quantization mode. That is, SetQuantScalar and SetQuantVector APIs are not supported.
- In the BSNGD scenario, multiple rows of SDs cannot be computed at a time. Cyclic computation is required in the operator program. That is, (ALayoutInfoN × ALayoutInfoG)/batchA and (BLayoutInfoN × BLayoutInfoG)/batchB must be integers.
- In asynchronous mode, IterateBatch does not support output to Unified Buffer (UB).
- This API is not supported when enableMixDualMaster (dual-master mode) is set to true.
- For the AI Core of the Atlas inference product, only the NORMAL layout is supported.
- For the AI Core of the Atlas inference product, inputs whose logical memory position for matrices A and B is TPosition::TSCM are not supported.
- For the AI Core of the Atlas inference product, bias cannot be reused, and the shape of bias must be Batch × N.
- When this API is used, matrices A and B do not support int4b_t inputs. That is, BatchMatmul does not support int4b_t matrix inputs.
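Several of the numeric restrictions above can be checked on the host before launching the kernel. The helpers below are a hypothetical sketch, not part of the tiling API. They assume that the G-axis equation is meant as an exact ratio (so it is checked by cross-multiplication) and that the element size in bytes divides 32.

```cpp
#include <cassert>
#include <cstdint>

// ALayoutInfoG / batchA == BLayoutInfoG / batchB, checked by cross-multiplication
// so that integer division does not hide a mismatch.
inline bool GAxisRatioOk(uint32_t aLayoutG, uint32_t batchA,
                         uint32_t bLayoutG, uint32_t batchB) {
    if (batchA == 0 || batchB == 0) {
        return false;
    }
    return static_cast<uint64_t>(aLayoutG) * batchB ==
           static_cast<uint64_t>(bLayoutG) * batchA;
}

// BSNGD: (LayoutInfoN * LayoutInfoG) / batch must be an integer.
inline bool BatchDividesNG(uint32_t layoutN, uint32_t layoutG, uint32_t batch) {
    return batch != 0 && (static_cast<uint64_t>(layoutN) * layoutG) % batch == 0;
}

// NORMAL: one of batchA and batchB must be an integer multiple of the other.
inline bool NormalBatchOk(uint32_t batchA, uint32_t batchB) {
    if (batchA == 0 || batchB == 0) {
        return false;
    }
    return batchA % batchB == 0 || batchB % batchA == 0;
}

// Row length in elements after padding singleCoreN up to a 32-byte boundary.
// Assumes elemBytes is nonzero and divides 32 (for example 1, 2, or 4).
inline uint32_t PaddedSingleCoreN(uint32_t singleCoreN, uint32_t elemBytes) {
    const uint32_t elemsPer32B = 32 / elemBytes;
    return (singleCoreN + elemsPer32B - 1) / elemsPer32B * elemsPer32B;
}
```

For instance, a float output with singleCoreN = 10 occupies 40 bytes per row, so under these assumptions each row is padded to 16 elements (64 bytes).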
Example
- In this example, the aGM and bGM matrices are multiplied and the result is saved to cGm. The layout format of the aGM, bGM, and cGM data is NORMAL. The left matrix computes batchA MK data each time, the right matrix computes batchB KN data each time.

      // Define MatmulType.
      typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half, false, LayoutMode::NORMAL> aType;
      typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half, true, LayoutMode::NORMAL> bType;
      typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float, false, LayoutMode::NORMAL> cType;
      typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float> biasType;
      // Create a Matmul instance.
      constexpr static MatmulConfig MM_CFG = GetNormalConfig(false, false, false, BatchMode::BATCH_LESS_THAN_L1);
      AscendC::Matmul<aType, bType, cType, biasType, MM_CFG> mm1;
      REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm1);
      mm1.Init(&tiling);
      mm1.SetTensorA(gm_a, isTransposeAIn);
      mm1.SetTensorB(gm_b, isTransposeBIn);
      if (tiling.isBias) {
          mm1.SetBias(gm_bias);
      }
      // Execute multi-batch Matmul computation.
      mm1.IterateBatch(gm_c, batchA, batchB, false);

- In this example, the aGM and bGM matrices are multiplied and the result is saved to cGM. The layout formats of the aGM, bGM, and cGM data are BSNGD, BSNGD, and BNGS1S2, respectively. Each call processes batchA S × D matrices of the left matrix and batchB S × D matrices of the right matrix. For the complete BatchMatmul example where the aGM, bGM, and cGM data is in BSNGD format, see BatchMatmul sample.

      // Define MatmulType.
      typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half, false, LayoutMode::BSNGD> aType;
      typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half, true, LayoutMode::BSNGD> bType;
      typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float, false, LayoutMode::BNGS1S2> cType;
      typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float> biasType;
      // Create a Matmul instance.
      AscendC::Matmul<aType, bType, cType, biasType> mm1;
      REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm1);
      mm1.Init(&tiling);
      int batchC = batchA > batchB ? batchA : batchB;
      int g_lay = tiling.ALayoutInfoG > tiling.BLayoutInfoG ? tiling.ALayoutInfoG : tiling.BLayoutInfoG;
      // Calculate the number of loops required for multi-batch computation.
      int for_exent = tiling.ALayoutInfoB * tiling.ALayoutInfoN * g_lay / tiling.BatchNum;
      for (int i = 0; i < for_exent; ++i) {
          // Calculate the start address of matrices A and B for this multi-batch computation.
          int batchOffsetA = i * tiling.ALayoutInfoD * batchA;
          int batchOffsetB = i * tiling.BLayoutInfoD * batchB;
          mm1.SetTensorA(gm_a[batchOffsetA], isTransposeAIn);
          mm1.SetTensorB(gm_b[batchOffsetB], isTransposeBIn);
          int idx_c = i * batchC;
          if (tiling.CLayoutInfoG == 1 && (tiling.BLayoutInfoG != 1 || tiling.ALayoutInfoG != 1)) {
              idx_c = idx_c / (tiling.BLayoutInfoG > tiling.ALayoutInfoG ? tiling.BLayoutInfoG : tiling.ALayoutInfoG);
          }
          if (tiling.isBias) {
              int batchOffsetBias = idx_c * tiling.CLayoutInfoS2;
              mm1.SetBias(gm_bias[batchOffsetBias]);
          }
          int batchOffsetC = idx_c * tiling.CLayoutInfoS2;
          if (C_TYPE::layout == LayoutMode::BNGS1S2) {
              batchOffsetC = idx_c * tiling.CLayoutInfoS2 * tiling.CLayoutInfoS1;
          }
          // Execute multi-batch Matmul computation.
          mm1.IterateBatch(gm_c[batchOffsetC], batchA, batchB, false);
      }

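The loop arithmetic in the BSNGD example can be traced on the host with concrete numbers. The sketch below mirrors only the index calculations for a BNGS1S2 output. `DemoTiling`, `IterOffsets`, `LoopCount`, and `OffsetsFor` are hypothetical stand-ins introduced for illustration; `DemoTiling` is not the real tiling structure and carries only the fields the example reads.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the tiling fields used by the example.
struct DemoTiling {
    uint32_t ALayoutInfoB, ALayoutInfoN, ALayoutInfoG, ALayoutInfoD;
    uint32_t BLayoutInfoG, BLayoutInfoD;
    uint32_t CLayoutInfoG, CLayoutInfoS1, CLayoutInfoS2;
    uint32_t BatchNum;
};

// Element offsets into gm_a, gm_b, and gm_c for one loop iteration.
struct IterOffsets {
    uint32_t a, b, c;
};

inline uint32_t MaxG(const DemoTiling& t) {
    return t.ALayoutInfoG > t.BLayoutInfoG ? t.ALayoutInfoG : t.BLayoutInfoG;
}

// Number of IterateBatch calls, computed as in the example.
inline uint32_t LoopCount(const DemoTiling& t) {
    return t.ALayoutInfoB * t.ALayoutInfoN * MaxG(t) / t.BatchNum;
}

// Offsets for iteration i, mirroring the example's arithmetic for a BNGS1S2 output.
inline IterOffsets OffsetsFor(const DemoTiling& t, uint32_t i,
                              uint32_t batchA, uint32_t batchB) {
    const uint32_t batchC = batchA > batchB ? batchA : batchB;
    uint32_t idxC = i * batchC;
    // When C has no G axis but an input does, the output index is divided by G.
    if (t.CLayoutInfoG == 1 && (t.BLayoutInfoG != 1 || t.ALayoutInfoG != 1)) {
        idxC /= MaxG(t);
    }
    return {i * t.ALayoutInfoD * batchA,
            i * t.BLayoutInfoD * batchB,
            idxC * t.CLayoutInfoS2 * t.CLayoutInfoS1};
}
```

With B = 1, N = 4, G = 1, D = 64, S1 = S2 = 16, and BatchNum = batchA = batchB = 2, the loop runs twice, and the second iteration starts at element 128 of A and B and element 512 of C.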