Multi-Core Scenario

Multi-Core Tiling

To exploit multi-core parallelism for higher efficiency, the matrix data must be tiled and distributed across cores for processing. There are two main tiling policies: K-axis tiling and non-K-axis tiling.

The policies for tiling only the M and N axes but not the K axis are as follows:

  • Matrix A is tiled into multiple tiles of singleCoreM along the M axis. A single core processes singleCoreM x K data.
  • Matrix B is tiled into multiple tiles of singleCoreN along the N axis. A single core processes K x singleCoreN data.
  • For matrix C, the singleCoreM x K tile of matrix A is multiplied by the K x singleCoreN tile of matrix B to produce a singleCoreM x singleCoreN tile of matrix C, which is the output of a single core.

As shown in the following figure, eight cores participate in the computation. Matrix A is tiled into four blocks along the M axis, and matrix B is tiled into two blocks along the N axis. A single core processes only one block (for example, the green part in the figure is the data computed on core5). The matrix A block with the size of singleCoreM x K is multiplied by the matrix B block with the size of K x singleCoreN to obtain the matrix C block with the size of singleCoreM x singleCoreN.
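The M/N-only tiling above boils down to simple offset arithmetic per core. The following standalone sketch uses hypothetical shapes (M = 4096, N = 2048, K = 1024, split 4 x 2 across eight cores to match the figure) and a hypothetical blockIdx-to-tile mapping; it is illustrative only and not the framework's actual implementation:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical shapes: 8 cores, 4 blocks along M and 2 blocks along N.
constexpr int32_t M = 4096, N = 2048, K = 1024;
constexpr int32_t singleCoreM = M / 4;        // 1024 rows of A per core
constexpr int32_t singleCoreN = N / 2;        // 1024 columns of B per core
constexpr int32_t mBlocks = M / singleCoreM;  // 4 blocks along the M axis

// Map a core index to the element offsets of its A, B, and C tiles
// (row-major ND layout assumed).
void CalcOffsets(int32_t blockIdx, int64_t& offsetA, int64_t& offsetB, int64_t& offsetC) {
    int32_t mIdx = blockIdx % mBlocks;  // this core's block index along M
    int32_t nIdx = blockIdx / mBlocks;  // this core's block index along N
    offsetA = static_cast<int64_t>(mIdx) * singleCoreM * K;  // A tile: singleCoreM x K
    offsetB = static_cast<int64_t>(nIdx) * singleCoreN;      // B tile: K x singleCoreN
    offsetC = static_cast<int64_t>(mIdx) * singleCoreM * N + static_cast<int64_t>(nIdx) * singleCoreN;
}
```

With this mapping, core5 lands on the second block along both M and N, matching the kind of per-core block assignment shown in the figure.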

The following figure shows the policies of tiling the M, N, and K axes.

  • Matrix A is tiled into multiple tiles of singleCoreM along the M axis and multiple tiles of singleCoreK along the K axis. A single core processes data of the size of singleCoreM x singleCoreK.
  • Matrix B is tiled into multiple tiles of singleCoreK along the K axis and into multiple tiles of singleCoreN along the N axis. A single core processes data of the size of singleCoreK x singleCoreN.
  • For matrix C, matrix A with the size of singleCoreM x singleCoreK is multiplied by matrix B with the size of singleCoreK x singleCoreN and accumulated to obtain matrix C blocks with the size of singleCoreM x singleCoreN.

As shown in the following figure, a block R of matrix C is obtained by accumulating the partial products a x a + b x b + c x c, and these partial products can be computed in parallel on multiple cores.
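The K-axis accumulation above can be illustrated with a toy scalar dot product: each singleCoreK segment yields an independent partial product, and the final result is their sum. The helper below is an illustrative sketch only, not a framework API:

```cpp
#include <cassert>
#include <vector>

// Toy illustration of K-axis tiling for a single output element of C:
// K is split into singleCoreK segments (labelled a, b, c in the figure),
// each partial product is independent (and could run on its own core),
// and the results are accumulated at the end.
int DotTiled(const std::vector<int>& rowA, const std::vector<int>& colB, int singleCoreK) {
    int acc = 0;
    for (std::size_t base = 0; base < rowA.size(); base += singleCoreK) {
        int partial = 0;  // one per-core partial product
        for (int k = 0; k < singleCoreK; ++k) {
            partial += rowA[base + k] * colB[base + k];
        }
        acc += partial;   // accumulation across K tiles
    }
    return acc;
}
```

The result is identical for any singleCoreK that divides K, which is what makes the per-tile partial products safe to compute in parallel before accumulation.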

The preceding tiling policies are reflected in the tiling parameters singleCoreM, singleCoreN, and singleCoreK. You can call APIs on the host side to obtain these parameters automatically. Unlike the single-core scenario, multi-core tiling requires MultiCoreMatmulTiling to construct the tiling object and the SetDim API to set the number of cores used for Matmul computation. Note that the number set by SetDim is the number of cores used for Matmul computation and is used only to compute tiling parameters in multi-core scenarios. SetBlockDim, which is mandatory, specifies the number of cores used for operator computation, that is, the number of cores to be launched. For details about the setting rules of SetBlockDim, see Tiling Implementation on the Host. The setting rules of SetDim are as follows:
  • In the CUBE_ONLY scenario (Cube computation only), the number of cores launched by SetBlockDim is the same as the number of cores used by the Matmul API for computation. Therefore, SetDim and SetBlockDim take the same value.

  • For details about the rules for setting the MIX mode (including Cube computation and Vector computation), see Rules for Setting the Number of Cores in the MIX Scenario.

For a complete example of the Matmul multi-core scenario, see the Matmul multi-core single-operator invocation sample and the Matmul multi-core kernel launch sample.

// Construct a multi-core tiling object.
auto ascendcPlatform = platform_ascendc::PlatformAscendC(context->GetPlatformInfo());
matmul_tiling::MultiCoreMatmulTiling tiling(ascendcPlatform);
// Set the number of cores that participate in Matmul computation.
tiling.SetDim(5);
tiling.SetAType(AscendC::TPosition::GM, CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT16);
...
optiling::TCubeTiling tilingData;
// Obtain the tiling parameters.
int ret = tiling.GetTiling(tilingData);    // If ret == -1, tiling generation failed.

Tail Block Processing in the Non-Alignment Scenario

In the multi-core scenario, if M, N, and K cannot be evenly divided by singleCoreM, singleCoreN, and singleCoreK when tiling a matrix, tail blocks appear, as shown by the matrix blocks in the last row and last column of matrix A and matrix B in the following figure.

In this case, a block R of matrix C is still obtained through the accumulation a x a + b x b + c x c + d x d. When tail blocks are involved in these partial products, the size of the tail block must be set on the kernel side. Because the original tiling cannot be changed, reset singleCoreM, singleCoreN, and singleCoreK for the computation: the tail block is moved and computed based on the configured tailM, tailN, and tailK. In the following example, if tailM is less than singleCoreM, the tail block needs to be processed, and you can call SetTail to set its size. SetTail must be called before Iterate and IterateAll.

template<typename aType, typename bType, typename cType, typename biasType>
__aicore__ inline void MatmulKernel<aType, bType, cType, biasType>::CalcOffset(int32_t blockIdx, const TCubeTiling& tiling, int32_t& offsetA, int32_t& offsetB, int32_t& offsetC, int32_t& offsetBias, bool isAtrans, bool isBtrans){
    auto mSingleBlocks = Ceil(tiling.M, tiling.singleCoreM);  // number of blocks along the M axis
    auto mCoreIndx = blockIdx % mSingleBlocks;                // this core's block index along M
    auto nCoreIndx = blockIdx / mSingleBlocks;                // this core's block index along N

    ...
    // Process the tail block.
    int tailM = tiling.M - mCoreIndx * tiling.singleCoreM;
    tailM = tailM < tiling.singleCoreM ? tailM : tiling.singleCoreM;
    int tailN = tiling.N - nCoreIndx * tiling.singleCoreN;
    tailN = tailN < tiling.singleCoreN ? tailN : tiling.singleCoreN;
    if (tailM < tiling.singleCoreM || tailN < tiling.singleCoreN) {
        matmulObj.SetTail(tailM, tailN);
    }
}
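As a sanity check on the tail arithmetic above, the following standalone sketch mirrors the kernel's ceiling division and tail clamping with hypothetical non-aligned shapes (M = 1000 tiled by singleCoreM = 384); the numbers are illustrative only:

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division: number of blocks needed to cover `a` with blocks of `b`.
int32_t Ceil(int32_t a, int32_t b) { return (a + b - 1) / b; }

// Size of the block a given core actually processes along one axis:
// full singleCore size for all cores except the last, which gets the tail.
int32_t TailSize(int32_t total, int32_t singleCore, int32_t coreIdx) {
    int32_t tail = total - coreIdx * singleCore;     // remaining data for this core
    return tail < singleCore ? tail : singleCore;    // clamp to the full block size
}
```

For M = 1000 and singleCoreM = 384, three cores are needed along M; the first two process 384 rows each and the last processes the 232-row tail, which is the value that would be passed to SetTail.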