IterateBatch

Function Usage

Each call to IterateBatch computes multiple matrices C of size singleCoreM × singleCoreN. If the shape processed in a single Matmul computation is small, performance may suffer because every computation incurs internal communication overhead. This API therefore provides the ability to process Matmul computations in batches.

Before using this API, you need to understand the following data formats:

  • General data format: BMNK data layout.

  • BSH/SBH: B is the batch size; S is the sequence length; H = N × D, where N is the number of heads and D is the head size.

  • BSNGD: the shape obtained by reshaping the original BSH shape. S and D are the M axis (or N axis) and K axis of the matrix multiplication of a single batch; each S × D block is the computation data of one batch.

  • SBNGD: the shape obtained by reshaping the original SBH shape. S and D are the M axis (or N axis) and K axis of the matrix multiplication; each S × D block is the computation data of one batch.

  • BNGS1S2: the matrix multiplication output of the two preceding layouts. The S1S2 data is stored contiguously; each S1 × S2 block is the output of one batch.
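
As a worked example with illustrative dimensions (not taken from the API): for B = 2, S = 4, N = 2, G = 1, D = 16, a BSNGD tensor has shape [2, 4, 2, 1, 16], there are B × N × G = 4 batches, and each batch's computation data is one 4 × 16 (S × D) block; the corresponding BNGS1S2 output stores one contiguous 4 × 4 (S1 × S2) block per batch.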

When instantiating the Matmul object, set the input and output layouts through MatmulType. Four layout modes are currently supported: BSNGD, SBNGD, BNGS1S2, and NORMAL (BMNK).
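
For instance, a minimal sketch of such an instantiation (the BSNGD/BNGS1S2 pairing and data types here are illustrative; the complete NORMAL-layout pattern is shown in the example at the end of this section):

    // Sketch: declaring Matmul input/output types with explicit LayoutMode values.
    typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half, false, LayoutMode::BSNGD> aType;
    typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half, false, LayoutMode::BSNGD> bType;
    typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float, false, LayoutMode::BNGS1S2> cType;
    typedef matmul::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float> biasType;
    matmul::Matmul<aType, bType, cType, biasType> mm;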

For the BSNGD, SBNGD, and BNGS1S2 layouts, before calling this API, you need to use SetALayout, SetBLayout, SetCLayout, and SetBatchNum in the host tiling implementation to set the layout axis information and the maximum number of batches for matrices A, B, and C. For the NORMAL layout, use SetBatchInfoForNormal to set the M, N, and K axes of matrices A, B, and C and the batch counts of matrices A and B.
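
A host-side tiling sketch, assuming illustrative axis values and the tiling methods named above (the remaining required tiling settings, such as the type/shape configuration and the final GetTiling call, are omitted):

    // Host tiling sketch for a BSNGD-layout batched Matmul (values illustrative).
    matmul_tiling::MultiCoreMatmulTiling tilingApi;
    tilingApi.SetALayout(2, 4, 2, 1, 16);  // B, S, N, G, D axes of matrix A
    tilingApi.SetBLayout(2, 4, 2, 1, 16);  // B, S, N, G, D axes of matrix B
    tilingApi.SetCLayout(2, 4, 2, 1, 4);   // B, S, N, G, D axes of matrix C
    tilingApi.SetBatchNum(4);              // maximum number of batches per call
    // For the NORMAL layout, call SetBatchInfoForNormal(batchA, batchB, M, N, K) instead.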

The iteration order of a single matrix multiplication can be adjusted using the tiling parameter iterateOrder.

For details about batch processing in matrix programming, see Batch Scenario.

Prototype

  • CUBE-ONLY mode
    • Output to GM

      __aicore__ inline void IterateBatch(const GlobalTensor<DstT>& gm, bool enPartialSum, uint8_t enAtomic, bool enSequentialWrite, const uint32_t matrixStrideA = 0, const uint32_t matrixStrideB = 0, const uint32_t matrixStrideC = 0)

Parameters

Table 1 Parameters in the template

sync

Whether matrix C is obtained in synchronous or asynchronous mode.

  • Synchronous mode (true, default): the call returns only after IterateBatch has finished executing.
  • Asynchronous mode (false): the call returns without waiting for IterateBatch to finish. In asynchronous scenarios, this API must be used together with WaitIterateBatch (see the sketch after this table).

waitIterateBatch

Used only in asynchronous scenarios; indicates whether WaitIterateBatch is used to wait for the completion of IterateBatch. The default value is false.

  • true: WaitIterateBatch waits for IterateBatch to complete.
  • false: WaitIterateBatch is not used; developers handle the waiting process themselves.
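
A minimal sketch of the asynchronous pattern, assuming a Matmul instance mm and operands configured as in the example at the end of this section:

    // Sketch: asynchronous IterateBatch paired with WaitIterateBatch.
    mm.IterateBatch<false>(gm_c, batchA, batchB, false);  // sync = false: returns immediately
    // ... other work can overlap with the batched computation here ...
    mm.WaitIterateBatch();  // block until IterateBatch has finished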

Table 2 Parameters

gm (input)

Address for storing matrix C in the global memory.

ubCmatrix (input)

Address for storing matrix C in the local memory.

batchA (input)

Number of batches of the left matrix.

batchB (input)

Number of batches of the right matrix. If batchA and batchB differ, a broadcast operation is performed by default. Multi-batch computation supports input broadcast and output reduction along the G axis; the G-axis dimensions of the left and right matrices must be integer multiples of each other.

enSequentialWrite (input)

Whether the output data is stored contiguously.

  • If the left and right matrices and the output matrix reside in the Unified Buffer, set enSequentialWrite to true.
  • If the output matrix resides in the GM, set enSequentialWrite to false.

matrixStrideA (input)

Offset between the start addresses of adjacent ND matrices in the source operand of matrix A, in elements.

matrixStrideB (input)

Offset between the start addresses of adjacent ND matrices in the source operand of matrix B, in elements.

matrixStrideC (input)

Offset between the start addresses of adjacent ND matrices in the destination operand of matrix C, in elements.

enPartialSum (input)

Whether to accumulate the matrix multiplication result into the existing CO1 (L0C) data. The default value is false. During L0C accumulation, the matrix C produced by multiplying matrix A and matrix B must satisfy singleM == baseM && singleN == baseN.

enAtomic (input)

Whether to enable atomic operations on the output. Values:

  • 0 (default): disable atomic operations.
  • 1: enable AtomicAdd (accumulation).
  • 2: enable AtomicMax (maximum value calculation).
  • 3: enable AtomicMin (minimum value calculation).
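
Tying these parameters to the GM-output prototype above, a hedged sketch (the parameter values are illustrative, and mm/cGlobal are assumed to be configured as in the example at the end of this section):

    // Sketch: GM-output call following the prototype above.
    // enPartialSum = false: do not accumulate onto existing CO1 data.
    // enAtomic = 1: write results to GM with AtomicAdd.
    // enSequentialWrite = false: write each batch at its layout offset.
    mm.IterateBatch(cGlobal, false, 1, false);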

Returns

None

Availability

Constraints

  • This API supports only the Norm template; that is, BatchMatmul operators support only the Norm template.
  • For the BSNGD, SBNGD, and BNGS1S2 layouts, the total size of the multiple batches of matrices A and B, after alignment to the fractal dimension, must be less than the L1 buffer size. The NORMAL layout mode has no such restriction, but you must configure the relationship between the multi-batch sizes of matrices A and B and the L1 buffer size through MatmulConfig.
  • If data is output to the Unified Buffer, the size of the output matrix C (baseM × baseN) must be less than the size of the allocated Unified Buffer.
  • For the BSNGD and SBNGD layout modes, the input and output data must be in ND format. For the BNGS1S2 and NORMAL layout modes, the input data can be in ND or NZ format.
  • This API does not support the quantization mode; that is, the SetQuantScalar and SetQuantVector APIs are not supported.
  • In the BSNGD scenario, multiple rows of S × D blocks cannot be computed in one call; loop over them in the operator program, as shown in the sketch after this list.
  • In asynchronous mode, IterateBatch cannot output matrix C to the Unified Buffer.
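
A minimal sketch of that cyclic computation, assuming hypothetical per-iteration element offsets (aOffset, bOffset, cOffset) and a loop count that the operator derives from its own layout information:

    // Sketch: looping over rows of S x D blocks in the BSNGD scenario.
    for (uint32_t i = 0; i < loopCount; ++i) {
        mm.SetTensorA(aGlobal[i * aOffset], isTransposeA);
        mm.SetTensorB(bGlobal[i * bOffset], isTransposeB);
        mm.IterateBatch(cGlobal[i * cOffset], batchA, batchB, false);
    }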

Example

  • In this example, the aGM and bGM matrices are multiplied and the result is saved to cGM. The data layout of aGM, bGM, and cGM is NORMAL. Each call processes batchA M × K blocks of the left matrix and batchB K × N blocks of the right matrix.
    #include "kernel_operator.h"
    #include "lib/matmul_intf.h"
    
    extern "C" __global__  __aicore__ void kernel_matmul_rpc_batch(GM_ADDR aGM, GM_ADDR bGM, GM_ADDR cGM, GM_ADDR biasGM, GM_ADDR tilingGM, GM_ADDR workspaceGM, uint32_t isTransposeAIn, uint32_t isTransposeBIn, int32_t batchA,  int32_t batchB)
    {
        // Define MatmulType.
        typedef matmul::MatmulType <AscendC::TPosition::GM, CubeFormat::ND, half, false, LayoutMode::NORMAL> aType;
        typedef matmul::MatmulType <AscendC::TPosition::GM, CubeFormat::ND, half, true, LayoutMode::NORMAL> bType;
        typedef matmul::MatmulType <AscendC::TPosition::GM, CubeFormat::ND, float, false, LayoutMode::NORMAL> cType;
        typedef matmul::MatmulType <AscendC::TPosition::GM, CubeFormat::ND, float> biasType;
        // Element types used below, derived from the MatmulType definitions.
        using A_T = typename aType::T;
        using B_T = typename bType::T;
        using C_T = typename cType::T;
        using BiasT = typename biasType::T;
    
        // Initialize tiling data.
        TCubeTiling tiling;
        auto tempTilingGM = (__gm__ uint32_t*)tilingGM;
        auto tempTiling = (uint32_t*)&tiling;
        for (int i = 0; i < sizeof(TCubeTiling) / sizeof(int32_t); ++i, ++tempTilingGM, ++tempTiling) {
            *tempTiling = *tempTilingGM;
        }
    
        // Initialize GM data.
        AscendC::GlobalTensor<half> aGlobal;
        AscendC::GlobalTensor<half> bGlobal;
        AscendC::GlobalTensor<float> cGlobal;
        AscendC::GlobalTensor<float> biasGlobal;
        int32_t sizeA = tiling.ALayoutInfoB * tiling.singleCoreM * tiling.singleCoreK * sizeof(A_T);
        int32_t sizeB = tiling.BLayoutInfoB * tiling.singleCoreK * tiling.singleCoreN * sizeof(B_T);
        int32_t sizeC = tiling.CLayoutInfoB * tiling.singleCoreM * tiling.singleCoreN * sizeof(C_T);
        int32_t sizeBias = tiling.CLayoutInfoB * tiling.singleCoreN * sizeof(C_T);
        aGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ half*>(aGM), sizeA);
        bGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ half*>(bGM), sizeB);
        cGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(cGM), sizeC);
        biasGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(biasGM), sizeBias);
        tiling.shareMode = 0;
        tiling.shareL1Size = 512 * 1024;
        tiling.shareL0CSize = 128 * 1024;
        tiling.shareUbSize = 0;
        int offset_a = 0, offset_b = 0, offset_c = 0, offset_bias = 0;
        AscendC::GlobalTensor<A_T> gm_a;
        gm_a.SetGlobalBuffer(const_cast<__gm__ A_T*>(aGlobal[offset_a].GetPhyAddr()), tiling.singleCoreM * tiling.singleCoreK);
        AscendC::GlobalTensor<B_T> gm_b;
        gm_b.SetGlobalBuffer(const_cast<__gm__ B_T*>(bGlobal[offset_b].GetPhyAddr()), tiling.singleCoreK * tiling.singleCoreN);
        AscendC::GlobalTensor<C_T> gm_c;
        gm_c.SetGlobalBuffer(const_cast<__gm__ C_T*>(cGlobal[offset_c].GetPhyAddr()), tiling.singleCoreM * tiling.singleCoreN);
        AscendC::GlobalTensor<BiasT> gm_bias;
        gm_bias.SetGlobalBuffer(const_cast<__gm__ BiasT*>(biasGlobal[offset_bias].GetPhyAddr()), tiling.singleCoreN);
        // Create a Matmul instance.
        constexpr static MatmulConfig MM_CFG = GetNormalConfig(false, false, false, BatchMode::BATCH_LESS_THAN_L1);
        matmul::Matmul<aType, bType, cType, biasType, MM_CFG> mm1;
        AscendC::TPipe pipe;
    
        REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm1);
        mm1.Init(&tiling);
        mm1.SetTensorA(gm_a, isTransposeAIn);
        mm1.SetTensorB(gm_b, isTransposeBIn);
        mm1.SetWorkspace(workspaceGM, 0);
        if (tiling.isBias) {
            mm1.SetBias(gm_bias);
        }
        // Execute multi-batch Matmul computation.
        mm1.IterateBatch(gm_c, batchA, batchB, false);
    }