Multi-Core Aligned Tiling

Overview

To implement multi-core parallelism for higher efficiency, matrix data needs to be tiled and allocated across cores for processing. There are two main tiling policies: K-axis tiling and non-K-axis tiling.

The policies for tiling only the M and N axes but not the K axis are as follows:

Matrix A is tiled into multiple tiles of singleCoreM along the M axis. A single core processes singleCoreM × K data.
Matrix B is tiled into multiple tiles of singleCoreN along the N axis. A single core processes K × singleCoreN data.
For matrix C, the singleCoreM × K block of matrix A is multiplied by the K × singleCoreN block of matrix B to obtain a singleCoreM × singleCoreN block, which is the portion of matrix C output by a single core.

As shown in the following figure, eight cores participate in the computation. Matrix A is tiled into four blocks along the M axis, and matrix B is tiled into two blocks along the N axis. Each core processes only one block of each matrix (for example, the green part in the figure is the data computed on core5). The matrix A block with the size of singleCoreM × K is multiplied by the matrix B block with the size of K × singleCoreN to obtain the matrix C block with the size of singleCoreM × singleCoreN.
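The mapping from a core to its block of matrix C can be sketched as follows. This is a minimal illustration, not the Ascend C implementation: the BlockOffset type and the CoreToBlock helper are hypothetical, and the sketch assumes that cores are numbered row by row over the grid of blocks, which may differ from the numbering used in the figure.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type, not part of the Ascend C API.
struct BlockOffset {
    int64_t mStart;  // first row of matrix A (and matrix C) handled by this core
    int64_t nStart;  // first column of matrix B (and matrix C) handled by this core
};

// Maps a core index to the start of its block in matrix C.
// Assumption: cores are numbered row-major over an (mBlocks x nBlocks) grid,
// so consecutive core indices first walk along the N axis.
inline BlockOffset CoreToBlock(int64_t coreIdx, int64_t nBlocks,
                               int64_t singleCoreM, int64_t singleCoreN)
{
    assert(coreIdx >= 0 && nBlocks > 0);
    const int64_t mIdx = coreIdx / nBlocks;  // block row along the M axis
    const int64_t nIdx = coreIdx % nBlocks;  // block column along the N axis
    return {mIdx * singleCoreM, nIdx * singleCoreN};
}
```

For example, with M = 512 and N = 256 tiled into four blocks along M and two blocks along N (singleCoreM = 128, singleCoreN = 128), core index 5 under this assumed numbering starts at row 256 and column 128 of matrix C.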

The following figure shows the strategies of tiling the M, N, and K axes.

Matrix A is tiled into multiple tiles of singleCoreM along the M axis and multiple tiles of singleCoreK along the K axis. A single core processes data of the size of singleCoreM × singleCoreK.
Matrix B is tiled into multiple tiles of singleCoreK along the K axis and into multiple tiles of singleCoreN along the N axis. A single core processes data of the size of singleCoreK × singleCoreN.
For matrix C, matrix A with the size of singleCoreM × singleCoreK is multiplied by matrix B with the size of singleCoreK × singleCoreN, and accumulation is performed to obtain matrix C blocks with the size of singleCoreM × singleCoreN.

As shown in the following figure, block R of matrix C is obtained by accumulating A1 × B1 + A2 × B2 + A3 × B3, where the partial products A1 × B1, A2 × B2, and A3 × B3 can be computed in parallel on multiple cores.
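The accumulation along the K axis can be checked with a small host-side sketch. It is plain C++ that runs the K slices sequentially, not Ascend C kernel code; the Matrix alias and both function names are hypothetical, and integer entries are used so that the results compare exactly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;  // row-major storage

// Adds the partial product A[:, kBegin:kEnd) x B[kBegin:kEnd, :) into C.
// In a multi-core run, each such slice could be assigned to a different core.
inline void AccumulatePartial(const Matrix& a, const Matrix& b,
                              std::size_t kBegin, std::size_t kEnd, Matrix& c)
{
    for (std::size_t i = 0; i < a.size(); ++i) {
        for (std::size_t j = 0; j < b[0].size(); ++j) {
            for (std::size_t k = kBegin; k < kEnd; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

// Computes C = A x B by splitting the K axis into slices of singleCoreK
// and accumulating the partial products.
inline Matrix SplitKMatmul(const Matrix& a, const Matrix& b, std::size_t singleCoreK)
{
    assert(!a.empty() && !b.empty() && a[0].size() == b.size() && singleCoreK > 0);
    const std::size_t kTotal = b.size();
    Matrix c(a.size(), std::vector<int>(b[0].size(), 0));
    for (std::size_t kBegin = 0; kBegin < kTotal; kBegin += singleCoreK) {
        const std::size_t kEnd =
            (kBegin + singleCoreK < kTotal) ? kBegin + singleCoreK : kTotal;
        AccumulatePartial(a, b, kBegin, kEnd, c);
    }
    return c;
}
```

Whatever slice size is chosen, the accumulated result equals the result of the unsplit multiplication, which is why the partial products can be computed independently and summed afterwards.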

The preceding tiling strategies are reflected in the tiling parameters, such as singleCoreM, singleCoreN, and singleCoreK. You can call APIs on the host to obtain the tiling parameters automatically. Unlike the single-core scenario, the multi-core scenario uses MultiCoreMatmulTiling to construct the tiling object and calls the SetDim API to set the number of cores that can be used for Matmul computation. Note that the value passed to SetDim is used only to compute the tiling parameters in multi-core scenarios. SetBlockDim, in contrast, sets the number of cores used for operator computation, that is, the number of cores on which the operator is loaded, and must always be set. For details about the setting rules of SetBlockDim, see the blockDim description. The setting rules of SetDim are as follows:

CUBE_ONLY mode (Cube computation only): SetDim sets the number of available cores of the AI Processor. The number of cores actually used for Matmul computation is calculated by tiling and is less than or equal to the number of available cores. Set SetBlockDim to the number of cores actually used.
MIX mode (both matrix computation and vector computation): For details about the setting rules, see Rules for Setting the Number of Cores in the MIX Scenario.
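The relationship between the tiling result and the number of cores actually used can be illustrated with a short arithmetic sketch. It assumes that each single-core block is assigned to exactly one core. The CeilDiv and UsedCoreNum helpers are hypothetical and do not reproduce the tiling algorithm of the library.

```cpp
#include <cassert>
#include <cstdint>

// Rounds the quotient a / b up to the next integer (a >= 0, b > 0).
inline int64_t CeilDiv(int64_t a, int64_t b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

// Counts the single-core blocks produced by a given tiling.
// Assumption: one block per core, so this is also the number of cores used.
inline int64_t UsedCoreNum(int64_t m, int64_t n, int64_t k,
                           int64_t singleCoreM, int64_t singleCoreN, int64_t singleCoreK)
{
    return CeilDiv(m, singleCoreM) * CeilDiv(n, singleCoreN) * CeilDiv(k, singleCoreK);
}
```

For example, M = 512, N = 256, and K = 64 with singleCoreM = 128, singleCoreN = 128, and singleCoreK = 64 yields 4 × 2 × 1 = 8 blocks, so a tiling of this shape fits only if at least eight cores are available.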

Application Scenarios

Matmul with multiple cores.

Restrictions

None

Example

The key sample code for this scenario is shown below. For the complete examples in the Matmul multi-core aligned scenario, see the following samples: matmul multi-core kernel launch sample (multi-core M and N tiling) and operator sample for multi-core K splitting (multi-core K tiling).

// Construct a multi-core tiling object.
auto ascendcPlatform = platform_ascendc::PlatformAscendCManager::GetInstance(socVersion);
matmul_tiling::MultiCoreMatmulTiling cubeTiling(*ascendcPlatform);
// For operators that involve Cube computation only, set the number of cores that can participate in matrix multiplication to the number of Cube cores on the AI Processor.
cubeTiling.SetDim(ascendcPlatform->GetCoreNumAic());
cubeTiling.SetAType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT16);
cubeTiling.SetBType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT16);
cubeTiling.SetCType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT);
cubeTiling.SetBiasType(matmul_tiling::TPosition::GM, matmul_tiling::CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT);
cubeTiling.SetOrgShape(M, N, K);
cubeTiling.SetShape(M, N, K);
cubeTiling.EnableBias(isBias);
optiling::TCubeTiling tilingData;
// Obtain tiling parameters. A return value of -1 indicates that tiling generation failed.
int ret = cubeTiling.GetTiling(tilingData);
