IterateBatch

Applicability

Product	Supported
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
AI Core of Atlas inference products	√
Vector Core of Atlas inference products	x
Atlas training products	x

Function

Each call to IterateBatch computes multiple C matrices, each of size singleCoreM × singleCoreN. When the shape processed by a single Matmul computation is small, performance can suffer because every computation involves internal communication. This API reduces that cost by processing Matmul computations in batches.

Before using this API, you need to understand the following data formats:

NORMAL: BMNK data format. B is the batch size. M, N, and K are the dimensions of the matrix multiplication [M, K] × [K, N]. The following figure shows the layout format.
BSH/SBH: B is the batch size; S is the sequence length; H = N × D, where N is the number of heads and D is the size of each head. The following figure shows the layout format.
BSNGD: shape obtained by reshaping the original BSH shape. S and D are the M axis (or N axis) and the K axis of the matrix multiplication for a single batch. One SD block is the computation data of one batch. Its layout is shown as follows.
SBNGD: shape obtained by reshaping the original SBH shape. S and D are the M axis (or N axis) and the K axis of the matrix multiplication. One SD block is the computation data of one batch. Its layout is shown as follows.
BNGS1S2: matrix multiplication output for the two preceding layouts. The S1S2 data is stored contiguously. One S1S2 block is the computation data of one batch. Its layout is shown as follows.
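
The axis order in each layout name can be made concrete with a small sketch that computes linear element offsets. It assumes dense storage with no padding and the axes ordered from outermost to innermost exactly as the layout name reads. The types `InShape` and `OutShape` and the functions `OffsetBSNGD`, `OffsetSBNGD`, and `OffsetBNGS1S2` are hypothetical helpers written for this illustration; they are not part of the Ascend C API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper types for illustration only; not part of the Ascend C API.
struct InShape {   // axis sizes of a BSNGD or SBNGD tensor
    uint32_t b, s, n, g, d;
};
struct OutShape {  // axis sizes of a BNGS1S2 tensor
    uint32_t b, n, g, s1, s2;
};

// BSNGD: axes ordered B, S, N, G, D from outermost to innermost (assumed dense).
inline uint64_t OffsetBSNGD(const InShape& sh, uint32_t b, uint32_t s,
                            uint32_t n, uint32_t g, uint32_t d) {
    assert(b < sh.b && s < sh.s && n < sh.n && g < sh.g && d < sh.d);
    return (((static_cast<uint64_t>(b) * sh.s + s) * sh.n + n) * sh.g + g) * sh.d + d;
}

// SBNGD: same as BSNGD with the S and B axes swapped.
inline uint64_t OffsetSBNGD(const InShape& sh, uint32_t b, uint32_t s,
                            uint32_t n, uint32_t g, uint32_t d) {
    assert(b < sh.b && s < sh.s && n < sh.n && g < sh.g && d < sh.d);
    return (((static_cast<uint64_t>(s) * sh.b + b) * sh.n + n) * sh.g + g) * sh.d + d;
}

// BNGS1S2: each S1 x S2 result block is contiguous in memory.
inline uint64_t OffsetBNGS1S2(const OutShape& sh, uint32_t b, uint32_t n,
                              uint32_t g, uint32_t s1, uint32_t s2) {
    assert(b < sh.b && n < sh.n && g < sh.g && s1 < sh.s1 && s2 < sh.s2);
    return (((static_cast<uint64_t>(b) * sh.n + n) * sh.g + g) * sh.s1 + s1) * sh.s2 + s2;
}
```

Under these assumptions, consecutive rows of one SD matrix in BSNGD are N × G × D elements apart, so a single batch is strided in memory, whereas in BNGS1S2 a whole S1S2 block occupies one contiguous range.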

When instantiating the Matmul, you need to set the input and output layouts through MatmulType. Currently, four layouts are supported: BSNGD, SBNGD, BNGS1S2, and NORMAL (BMNK).

For the BSNGD, SBNGD, and BNGS1S2 layouts, before calling this API, you need to use SetALayout, SetBLayout, SetCLayout, and SetBatchNum in the host tiling implementation to set the layout axis information and maximum number of batches for matrices A, B, and C. For the NORMAL layout, use SetBatchInfoForNormal to set the M, N, and K axes of matrices A, B, and C and the number of batches of matrices A and B.

The iteration sequence of a single matrix multiplication can be adjusted using the tiling parameter iterateOrder.

For details about batch processing in matrix programming, see Basic Functions of Batch Matmul.

Prototype

Mix mode

Output to GM

template <bool sync = true, bool waitIterateBatch = false>
__aicore__ inline void IterateBatch(const GlobalTensor<DstT>& gm, uint32_t batchA, uint32_t batchB, bool enSequentialWrite, const uint32_t matrixStrideA = 0, const uint32_t matrixStrideB = 0, const uint32_t matrixStrideC = 0, const bool enPartialSum = false, const uint8_t enAtomic = 0)

Output to VECIN

template <bool sync = true>
__aicore__ inline void IterateBatch(const LocalTensor<DstT>& ubCmatrix, uint32_t batchA, uint32_t batchB, bool enSequentialWrite, const uint32_t matrixStrideA = 0, const uint32_t matrixStrideB = 0, const uint32_t matrixStrideC = 0, const bool enPartialSum = false, const uint8_t enAtomic = 0)

CUBE-ONLY mode

Before using this function, call SetBatchNum to set the number of batches of matrix A and matrix B.

Output to GM

__aicore__ inline void IterateBatch(const GlobalTensor<DstT>& gm, bool enPartialSum, uint8_t enAtomic, bool enSequentialWrite, const uint32_t matrixStrideA = 0, const uint32_t matrixStrideB = 0, const uint32_t matrixStrideC = 0)

Output to VECIN

__aicore__ inline void IterateBatch(const LocalTensor<DstT>& ubCmatrix, bool enPartialSum, uint8_t enAtomic, bool enSequentialWrite, const uint32_t matrixStrideA = 0, const uint32_t matrixStrideB = 0, const uint32_t matrixStrideC = 0)

Parameters

**Table 1** Template parameters
Parameter	Description
sync	Whether matrix C is obtained in synchronous or asynchronous mode. true (default): synchronous mode; the call waits until IterateBatch has finished executing. false: asynchronous mode; the call does not wait for IterateBatch to finish. In asynchronous scenarios, this API must be used together with WaitIterateBatch.
waitIterateBatch	Used only in asynchronous scenarios. Specifies whether WaitIterateBatch is used to wait for IterateBatch to complete. true: WaitIterateBatch is used to wait for completion. false (default): WaitIterateBatch is not used, and the developer handles the wait.

**Table 2** Parameters
Parameter	Input/Output	Description
gm	Output	Matrix C. The type is GlobalTensor. For Atlas A3 training products / Atlas A3 inference products, the supported data types are half, bfloat16_t, int32_t, and float. For Atlas A2 training products / Atlas A2 inference products, the supported data types are half, bfloat16_t, int32_t, and float. For the AI Core of Atlas inference products, the supported data types are half, bfloat16_t, int32_t, and float.
ubCmatrix	Output	Matrix C. The type is LocalTensor. For Atlas A3 training products / Atlas A3 inference products, the supported data types are half, bfloat16_t, int32_t, and float. For Atlas A2 training products / Atlas A2 inference products, the supported data types are half, bfloat16_t, int32_t, and float. For the AI Core of Atlas inference products, the supported data types are half, bfloat16_t, int32_t, and float.
batchA	Input	Number of batches of the left matrix.
batchB	Input	Number of batches of the right matrix. If batchA and batchB differ, broadcasting is performed by default. Multi-batch computation supports input broadcast and output reduce on the G axis. The G-axis dimensions of the left and right matrices must be related by an integer multiple, that is, one must be an integer multiple of the other.
enSequentialWrite	Input	Whether the output data is stored contiguously, that is, whether continuous write mode is enabled. Continuous write: data is written to [baseM, baseN]. Non-continuous write: data is written to the corresponding position in [singleCoreM, singleCoreN]. If the left matrix, right matrix, and output matrix are all stored in Unified Buffer, set enSequentialWrite to true. If the output matrix is stored in GM, set enSequentialWrite to false.
matrixStrideA	Input	Offset, in elements, between the start addresses of adjacent ND matrices in the source operand of matrix A. The default value is 0.
matrixStrideB	Input	Offset, in elements, between the start addresses of adjacent ND matrices in the source operand of matrix B. The default value is 0.
matrixStrideC	Input	This parameter is reserved and can be ignored.
enPartialSum	Input	Whether to accumulate the matrix multiplication result onto the existing CO1 data. The default value is false. When accumulating in L0C, the matrix C produced by multiplying matrix A by matrix B must satisfy singleM == baseM && singleN == baseN.
enAtomic	Input	Whether to enable an atomic operation. Values: 0 (default): atomic operations are disabled. 1: AtomicAdd (accumulation) is enabled. 2: AtomicMax (maximum value) is enabled. 3: AtomicMin (minimum value) is enabled.
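
Because enAtomic is passed as a raw uint8_t, named constants can make call sites easier to read. The following is a minimal sketch of one way to do that. The enum `AtomicMode` and the helpers `ToEnAtomic` and `IsValidEnAtomic` are hypothetical names introduced here, not Ascend C definitions; only the numeric values 0 to 3 come from the table above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical named constants for the documented enAtomic values.
// Ascend C itself takes a plain uint8_t; this enum is an illustration only.
enum class AtomicMode : uint8_t {
    kNone = 0,  // atomic operations disabled (default)
    kAdd  = 1,  // AtomicAdd: accumulate into the destination
    kMax  = 2,  // AtomicMax: keep the maximum value
    kMin  = 3   // AtomicMin: keep the minimum value
};

// Convert the named mode to the raw value expected by the enAtomic parameter.
inline uint8_t ToEnAtomic(AtomicMode mode) {
    return static_cast<uint8_t>(mode);
}

// Check that a raw value is one of the four documented settings.
inline bool IsValidEnAtomic(uint8_t value) {
    return value <= static_cast<uint8_t>(AtomicMode::kMin);
}

// Intended use in a mix-mode kernel (not compiled here; follows the GM prototype):
//   mm1.IterateBatch(gm_c, batchA, batchB, false, 0, 0, 0, false,
//                    ToEnAtomic(AtomicMode::kAdd));
```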

Returns

None

Restrictions

This API supports only the Norm template. That is, BatchMatmul operators support only the Norm template.
For the BSNGD, SBNGD, and BNGS1S2 layouts, the total size of multiple batches of matrix A and matrix B, after being aligned according to the fractal dimension, must be less than the size of L1 Buffer. There is no such restriction on the NORMAL layout mode, but you need to configure the relationship between the size of multiple batches of matrix A and matrix B and the size of L1 Buffer by using MatmulConfig.
For the BSNGD, SBNGD, and BNGS1S2 layouts, if the G axes of the left matrix and right matrix are ALayoutInfoG and BLayoutInfoG, respectively, the following equation must hold: ALayoutInfoG/batchA = BLayoutInfoG/batchB. For the NORMAL layout, one of batchA and batchB must be an integer multiple of the other.
If data is output to Unified Buffer, the size of the output matrix C (BaseM × BaseN) must be less than the size of the allocated Unified Buffer.
When this API outputs data to Unified Buffer and the size of the N direction for single-core computation (singleCoreN) is not 32-byte aligned, CubeFormat of matrix C supports only the ND_ALIGN format. When matrix C slices are output, the data along the singleCoreN direction is automatically padded to a 32-byte boundary.
For the BSNGD and SBNGD layouts, the input and output data must be in ND format. For the BNGS1S2 and NORMAL layouts, the input data can be in ND or NZ format.
For the BSNGD and SBNGD layouts, continuous write is not supported.
This API does not support the quantization mode. That is, SetQuantScalar and SetQuantVector APIs are not supported.
In the BSNGD scenario, multiple rows of SDs cannot be computed at a time. Cyclic computation is required in the operator program. That is, (ALayoutInfoN × ALayoutInfoG)/batchA and (BLayoutInfoN × BLayoutInfoG)/batchB must be integers.
In asynchronous mode, IterateBatch does not support moving the result to Unified Buffer.
This API is not supported when enableMixDualMaster (dual-master mode) is set to true.
For the AI Core of Atlas inference products, only the NORMAL layout is supported.
For the AI Core of Atlas inference products, inputs whose logical memory position for matrices A and B is TPosition::TSCM are not supported.
For the AI Core of Atlas inference products, bias cannot be reused, and the shape size of bias must be Batch × N.
When this API is used, matrices A and B do not support int4b_t inputs. That is, BatchMatmul does not support int4b_t matrix inputs.
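
Several of the restrictions above are simple arithmetic conditions that can be checked before launching a kernel. The sketch below shows one possible set of checks. All function names are hypothetical and not part of Ascend C. The padding helper additionally assumes that the element size divides 32 bytes evenly, which holds for the 2-byte and 4-byte types listed in the parameter table.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical validation helpers; names are illustrative, not Ascend C APIs.

// BSNGD/SBNGD/BNGS1S2: ALayoutInfoG / batchA must equal BLayoutInfoG / batchB.
// Cross-multiplying avoids truncation from integer division.
inline bool GAxisRatioMatches(uint32_t aLayoutInfoG, uint32_t batchA,
                              uint32_t bLayoutInfoG, uint32_t batchB) {
    if (batchA == 0 || batchB == 0) return false;
    return static_cast<uint64_t>(aLayoutInfoG) * batchB ==
           static_cast<uint64_t>(bLayoutInfoG) * batchA;
}

// NORMAL: one batch count must be an integer multiple of the other.
inline bool NormalBatchesCompatible(uint32_t batchA, uint32_t batchB) {
    if (batchA == 0 || batchB == 0) return false;
    return batchA % batchB == 0 || batchB % batchA == 0;
}

// BSNGD: (LayoutInfoN * LayoutInfoG) / batch must be an integer.
inline bool BsngdBatchDivides(uint32_t layoutInfoN, uint32_t layoutInfoG,
                              uint32_t batch) {
    if (batch == 0) return false;
    return (static_cast<uint64_t>(layoutInfoN) * layoutInfoG) % batch == 0;
}

// Row length in elements after padding singleCoreN up to a 32-byte boundary.
// Assumes elemBytes divides 32 (for example 2 for half, 4 for float).
inline uint32_t PaddedSingleCoreN(uint32_t singleCoreN, uint32_t elemBytes) {
    assert(elemBytes != 0 && 32 % elemBytes == 0);
    const uint32_t elemsPer32B = 32 / elemBytes;
    return (singleCoreN + elemsPer32B - 1) / elemsPer32B * elemsPer32B;
}
```

For example, a float output with singleCoreN = 10 occupies 40 bytes per row, so under this sketch it is padded to 16 elements (64 bytes).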

Example

In this example, the aGM and bGM matrices are multiplied and the result is saved to cGM. The layout format of the aGM, bGM, and cGM data is NORMAL. The left matrix computes batchA MK data each time, and the right matrix computes batchB KN data each time.

// Define MatmulType.
typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half, false, LayoutMode::NORMAL> aType;
typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half, true, LayoutMode::NORMAL> bType;
typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float, false, LayoutMode::NORMAL> cType;
typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float> biasType;
// Create a Matmul instance.
constexpr static MatmulConfig MM_CFG = GetNormalConfig(false, false, false, BatchMode::BATCH_LESS_THAN_L1);
AscendC::Matmul<aType, bType, cType, biasType, MM_CFG> mm1;
REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm1);
mm1.Init(&tiling);
mm1.SetTensorA(gm_a, isTransposeAIn);
mm1.SetTensorB(gm_b, isTransposeBIn);
if (tiling.isBias) {
    mm1.SetBias(gm_bias);
}
// Execute multi-batch Matmul computation.
mm1.IterateBatch(gm_c, batchA, batchB, false);

In this example, the aGM and bGM matrices are multiplied and the result is saved to cGM. The layout formats of the aGM, bGM, and cGM data are BSNGD, BSNGD, and BNGS1S2, respectively. The left matrix computes batchA SD data each time, and the right matrix computes batchB SD data each time. For the complete BatchMatmul example where the aGM, bGM, and cGM data is in BSNGD format, see BatchMatmul sample.

// Define MatmulType.
typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half, false, LayoutMode::BSNGD> aType;
typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half, true, LayoutMode::BSNGD> bType;
typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float, false, LayoutMode::BNGS1S2> cType;
typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float> biasType;
// Create a Matmul instance.
AscendC::Matmul<aType, bType, cType, biasType> mm1;
REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm1);
mm1.Init(&tiling);
int batchC = batchA > batchB ? batchA : batchB;
int g_lay = tiling.ALayoutInfoG > tiling.BLayoutInfoG ? tiling.ALayoutInfoG : tiling.BLayoutInfoG;
// Calculate the number of loops required for multi-batch computation.
int for_extent = tiling.ALayoutInfoB * tiling.ALayoutInfoN * g_lay / tiling.BatchNum;
for (int i = 0; i < for_extent; ++i) {
    // Calculate the start address of matrices A and B for each multi-batch computation.
    int batchOffsetA = i * tiling.ALayoutInfoD * batchA;
    int batchOffsetB = i * tiling.BLayoutInfoD * batchB;
    mm1.SetTensorA(gm_a[batchOffsetA], isTransposeAIn);
    mm1.SetTensorB(gm_b[batchOffsetB], isTransposeBIn);
    int idx_c = i * batchC;
    if (tiling.CLayoutInfoG == 1 && (tiling.BLayoutInfoG != 1 || tiling.ALayoutInfoG != 1)) {
        idx_c = idx_c / (tiling.BLayoutInfoG > tiling.ALayoutInfoG ? tiling.BLayoutInfoG : tiling.ALayoutInfoG);
    }
    if(tiling.isBias) {
        int batchOffsetBias = idx_c * tiling.CLayoutInfoS2;
        mm1.SetBias(gm_bias[batchOffsetBias]);
    }
    int batchOffsetC = idx_c * tiling.CLayoutInfoS2;
    if (cType::layout == LayoutMode::BNGS1S2) {
        batchOffsetC = idx_c * tiling.CLayoutInfoS2 * tiling.CLayoutInfoS1;
    }
    // Execute multi-batch Matmul computation.
    mm1.IterateBatch(gm_c[batchOffsetC], batchA, batchB, false);
}
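
The loop-count and offset arithmetic in the example above can be checked on the host with plain integers. The sketch below mirrors that arithmetic for a BNGS1S2 output. `TilingSketch` is a hypothetical stand-in that contains only the tiling fields the example reads, and the helper names `LoopCount`, `OffsetA`, and `OffsetC` are illustrative; none of them are Ascend C definitions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the tiling fields read by the example above.
struct TilingSketch {
    int ALayoutInfoB, ALayoutInfoN, ALayoutInfoG, ALayoutInfoD;
    int BLayoutInfoG, BLayoutInfoD;
    int CLayoutInfoG, CLayoutInfoS1, CLayoutInfoS2;
    int BatchNum;
};

// Number of IterateBatch calls needed to cover all batches.
inline int LoopCount(const TilingSketch& t) {
    const int gLay = std::max(t.ALayoutInfoG, t.BLayoutInfoG);
    return t.ALayoutInfoB * t.ALayoutInfoN * gLay / t.BatchNum;
}

// Start offset (in elements) of matrix A for loop iteration i.
inline int OffsetA(const TilingSketch& t, int i, int batchA) {
    return i * t.ALayoutInfoD * batchA;
}

// Start offset (in elements) of matrix C for loop iteration i, BNGS1S2 output.
inline int OffsetC(const TilingSketch& t, int i, int batchA, int batchB) {
    int idxC = i * std::max(batchA, batchB);
    // When C has a single G but an input has several, the G axis is reduced.
    if (t.CLayoutInfoG == 1 && (t.BLayoutInfoG != 1 || t.ALayoutInfoG != 1)) {
        idxC /= std::max(t.ALayoutInfoG, t.BLayoutInfoG);
    }
    return idxC * t.CLayoutInfoS2 * t.CLayoutInfoS1;
}
```

With B = 2, N = 4, G = 1, D = 16, S1 = S2 = 8, and BatchNum = batchA = batchB = 4, this gives 2 loop iterations; the second iteration starts at element 64 of matrix A and element 256 of matrix C.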

