Basic Functions of Batch Matmul

Functions

Batch Matmul refers to the scenario where Matmul computations are processed in batches. For this scenario, Matmul provides the IterateBatch API: a single call to IterateBatch computes multiple C matrices, each of size singleCoreM x singleCoreN.

Every Matmul computation moves data in and out. When many Matmul computations are performed and each one has a small input shape, this data movement accounts for a large proportion of the total duration. Processing the computations in batches with the IterateBatch API effectively improves bandwidth utilization.

Currently, Batch Matmul supports four layout types: BSNGD, SBNGD, BNGS1S2, and NORMAL (BMNK data format). For details about the data format, see IterateBatch.

The following figure shows the Batch Matmul compute in NORMAL data format. The entire Matmul compute involves four matrix multiplication operations: mat_a1 x mat_b1, mat_a2 x mat_b2, mat_a3 x mat_b3, and mat_a4 x mat_b4. Four singleCoreM x singleCoreN operations need to be computed on a single core. In this scenario, if the shape is small, you can consider it as a Batch Matmul scenario for batch processing to improve performance. mat_c1 = mat_a1 x mat_b1, mat_c2 = mat_a2 x mat_b2, mat_c3 = mat_a3 x mat_b3, and mat_c4 = mat_a4 x mat_b4 can be computed concurrently when IterateBatch is called once.

Figure 1 Batch Matmul in NORMAL format
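
The computation in the figure can be sketched on the host as a plain C++ reference. This is a minimal illustration, not the Ascend C implementation: the function name BatchMatmulReference is hypothetical, and it assumes that the NORMAL (BMNK) layout stores each batch contiguously in row-major order, with A as batch x M x K, B as batch x K x N, and C as batch x M x N.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: computes c[bi] = a[bi] x b[bi] for every batch.
// Assumed layout: a is batch x M x K, b is batch x K x N, result is batch x M x N,
// all row-major and contiguous.
std::vector<float> BatchMatmulReference(const std::vector<float>& a,
                                        const std::vector<float>& b,
                                        std::size_t batch, std::size_t m,
                                        std::size_t n, std::size_t k)
{
    assert(a.size() == batch * m * k);
    assert(b.size() == batch * k * n);
    std::vector<float> c(batch * m * n, 0.0f);
    for (std::size_t bi = 0; bi < batch; ++bi) {
        // Each batch is an independent small matrix multiplication.
        const float* aMat = a.data() + bi * m * k;
        const float* bMat = b.data() + bi * k * n;
        float* cMat = c.data() + bi * m * n;
        for (std::size_t i = 0; i < m; ++i) {
            for (std::size_t kk = 0; kk < k; ++kk) {
                const float aVal = aMat[i * k + kk];
                for (std::size_t j = 0; j < n; ++j) {
                    cMat[i * n + j] += aVal * bMat[kk * n + j];
                }
            }
        }
    }
    return c;
}
```

A reference like this is mainly useful for checking kernel output on small shapes. The batches are independent of one another, which is what allows a single IterateBatch call to process them together.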

Use Case

During Matmul compute, multiple matrices C with the size of singleCoreM x singleCoreN need to be computed, and the shape processed by a single Matmul compute operation is small.

Restrictions

Only the Norm template can be enabled.
For the BSNGD, SBNGD, and BNGS1S2 layouts, the total size of the multiple batches of matrix A and matrix B, after alignment to the fractal dimensions, must be less than the size of the L1 buffer. The NORMAL layout has no such restriction, but you must declare how the total size of the multiple batches of matrix A and matrix B relates to the size of the L1 buffer by setting the batchMode parameter in MatmulConfig.
For the BSNGD, SBNGD, and BNGS1S2 layouts, if the G axes of the left matrix and right matrix are ALayoutInfoG and BLayoutInfoG, respectively, the following equation must hold: ALayoutInfoG/batchA = BLayoutInfoG/batchB. For the NORMAL layout, one of batchA and batchB must be an integer multiple of the other. The batch dimension of the bias, whose shape is (batch, n), must be the same as that of matrix C.
If data is output to Unified Buffer, the size of the output matrix C (BaseM × BaseN) must be less than the size of the allocated Unified Buffer.
For the BSNGD and SBNGD layouts, the input and output data must be in ND format. For the BNGS1S2 and NORMAL layouts, the input data can be in ND or NZ format.
Batch Matmul does not support the quantization/dequantization mode, that is, the SetQuantScalar or SetQuantVector API is not supported.
In the BSNGD scenario, multiple rows of SDs cannot be computed at a time. Cyclic compute operations are required in the operator program.
In asynchronous mode, IterateBatch does not support output to Unified Buffer.
Batch Matmul is not supported in the MixDualMaster (dual-master mode) scenario, that is, when the template parameter enableMixDualMaster (default value: false) is set to true.
In the batch scenario, matrices A and B support the data type of half/float/bfloat16_t/int8_t, but do not support the data type of int4b_t.
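
Some of the restrictions above are simple arithmetic and can be checked on the host before tiling. The following is a minimal sketch under stated assumptions: all function names are hypothetical helpers, not Ascend C APIs; the G-axis equation is read as an exact ratio; and the fractal alignment, element size, and L1 buffer size are passed in as parameters because they depend on the data type and the hardware.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round value up to the next multiple of align (align > 0).
constexpr int64_t AlignUp(int64_t value, int64_t align)
{
    return (value + align - 1) / align * align;
}

// NORMAL layout: one batch count must be an integer multiple of the other.
constexpr bool BatchCountsCompatible(int64_t batchA, int64_t batchB)
{
    return batchA > 0 && batchB > 0 &&
           (batchA % batchB == 0 || batchB % batchA == 0);
}

// BSNGD, SBNGD, BNGS1S2: ALayoutInfoG / batchA == BLayoutInfoG / batchB,
// written as a cross-multiplication so that no division is needed.
constexpr bool GAxisRatioMatches(int64_t aLayoutInfoG, int64_t batchA,
                                 int64_t bLayoutInfoG, int64_t batchB)
{
    return batchA > 0 && batchB > 0 &&
           aLayoutInfoG * batchB == bLayoutInfoG * batchA;
}

// Bytes occupied by batchA A matrices (M x K) and batchB B matrices (K x N)
// after each dimension is aligned to fractalAlign (an assumed alignment unit).
constexpr int64_t AlignedInputBytes(int64_t batchA, int64_t batchB,
                                    int64_t m, int64_t n, int64_t k,
                                    int64_t fractalAlign, int64_t bytesPerElem)
{
    return (batchA * AlignUp(m, fractalAlign) * AlignUp(k, fractalAlign) +
            batchB * AlignUp(k, fractalAlign) * AlignUp(n, fractalAlign)) *
           bytesPerElem;
}

// The restriction is strict: the aligned inputs must be smaller than L1.
constexpr bool FitsInL1(int64_t alignedInputBytes, int64_t l1Bytes)
{
    return alignedInputBytes < l1Bytes;
}
```

For example, with three batches each of A and B, M = 32, N = 256, K = 64, 2-byte elements, and an assumed alignment of 16, AlignedInputBytes returns 110592. Comparing that figure with the actual L1 buffer size of the target device may also help when choosing batchMode for the NORMAL layout.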

Examples

The following is an example of calling Batch Matmul in NORMAL format. For an example of Batch Matmul in BSNGD format, see BatchMatmul sample.

Tiling implementation

Use SetBatchInfoForNormal to set the M, N, and K axes of matrices A, B, and C and BatchNum of matrix A and matrix B.

```
// context is the tiling context passed to the host-side tiling function.
auto ascendcPlatform = platform_ascendc::PlatformAscendC(context->GetPlatformInfo());
matmul_tiling::MultiCoreMatmulTiling tiling(ascendcPlatform);
int32_t M = 32;
int32_t N = 256;
int32_t K = 64;
tiling.SetDim(1);
tiling.SetAType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT16);
tiling.SetBType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT16);
tiling.SetCType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT);
tiling.SetBiasType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT);
tiling.SetShape(M, N, K);
tiling.SetOrgShape(M, N, K);
tiling.EnableBias(true);
tiling.SetBufferSpace(-1, -1, -1);

constexpr int32_t BATCH_NUM = 3;
tiling.SetBatchInfoForNormal(BATCH_NUM, BATCH_NUM, M, N, K);  // Set batchA, batchB, and the M, N, and K axes for the NORMAL layout.

optiling::TCubeTiling tilingData;
int ret = tiling.GetTiling(tilingData);  // Check ret before using tilingData.
```

Kernel implementation

Create a Matmul object.

Set the layout format of the input and output to NORMAL through MatmulType.

```
#include "lib/matmul_intf.h"

typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half, false, LayoutMode::NORMAL> aType;
typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half, true, LayoutMode::NORMAL> bType;
typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float, false, LayoutMode::NORMAL> cType;
typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float> biasType;
constexpr MatmulConfig MM_CFG = GetNormalConfig(false, false, false, BatchMode::BATCH_LESS_THAN_L1);
AscendC::Matmul<aType, bType, cType, biasType, MM_CFG> mm;
```

Perform the initialization operation.

```
REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm, &tiling); // Initialize the matmul object.
```

Set the left matrix A, right matrix B, and bias.

```
mm.SetTensorA(gm_a);    // Set the left matrix A.
mm.SetTensorB(gm_b);    // Set the right matrix B.
mm.SetBias(gm_bias);    // Set the bias.
```

Execute the matrix multiplication. Each call processes batchA M x K matrices from the left matrix and batchB K x N matrices from the right matrix.

```
mm.IterateBatch(gm_c, batchA, batchB, false);
```
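
To check the kernel output on small shapes, a host-side golden reference matching this configuration can help. The sketch below is an illustration under assumptions, not part of the Ascend C API: the function name BatchMatmulBiasTransB is hypothetical; it assumes that the transposed right matrix (bType is declared with the transpose flag set to true) is stored as batch x N x K, that the bias of shape (batch, n) is added to every row of the corresponding C matrix, and that batchA equals batchB. Inputs are held as float on the host rather than half.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical golden reference: c[bi] = a[bi] x transpose(bT[bi]) + bias[bi].
// Assumed layouts (row-major, contiguous):
//   a    : batch x M x K
//   bT   : batch x N x K  (right matrix stored transposed)
//   bias : batch x N      (one row per batch, added to every row of c[bi])
std::vector<float> BatchMatmulBiasTransB(const std::vector<float>& a,
                                         const std::vector<float>& bT,
                                         const std::vector<float>& bias,
                                         std::size_t batch, std::size_t m,
                                         std::size_t n, std::size_t k)
{
    assert(a.size() == batch * m * k);
    assert(bT.size() == batch * n * k);
    assert(bias.size() == batch * n);
    std::vector<float> c(batch * m * n);
    for (std::size_t bi = 0; bi < batch; ++bi) {
        for (std::size_t i = 0; i < m; ++i) {
            for (std::size_t j = 0; j < n; ++j) {
                // Start from the bias value, then accumulate the dot product in float.
                float acc = bias[bi * n + j];
                for (std::size_t kk = 0; kk < k; ++kk) {
                    acc += a[(bi * m + i) * k + kk] * bT[(bi * n + j) * k + kk];
                }
                c[(bi * m + i) * n + j] = acc;
            }
        }
    }
    return c;
}
```

Because the device multiplies half inputs, a comparison against this reference would normally use a tolerance rather than exact equality.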

End the matrix multiplication.
```
mm.End();
```
