TCubeTiling Structure
The TCubeTiling structure contains parameters related to the Matmul tiling algorithm that are passed to the Matmul kernel for Matmul tiling, data movement, and computation. For details about the parameters of the TCubeTiling structure, see Table 1.
| Parameter | Data Type | Description |
|---|---|---|
| usedCoreNum | int | Number of AI Processor cores used. Set this parameter based on your actual requirements. Value range: [1, maximum number of AI Processor cores]. Its relationship to the shape-related parameters is: usedCoreNum = (M/singleCoreM) x (N/singleCoreN). |
| M, N, Ka, Kb | int | Shape sizes of the original inputs of matrices A, B, and C, in elements. M and Ka are the original input shape of matrix A, and Kb and N are the original input shape of matrix B. |
| singleCoreM, singleCoreN, singleCoreK | int | Shape sizes of matrices A, B, and C on a single core, in elements. singleCoreK = K (K is not tiled during multi-core processing); singleCoreM ≤ M; singleCoreN ≤ N. Note: If the input of matrix A is in NZ format, singleCoreM must be 16-element aligned and singleCoreK must be aligned with C0_size × fractal_num. If the input of matrix B is in NZ format, singleCoreK must be aligned with C0_size × fractal_num and singleCoreN must be 16-element aligned. For input in NZ format, C0_size of half/bfloat16_t data is 16, C0_size of float data is 8, C0_size of int8_t data is 32, and C0_size of int4_t data is 64. |
| baseM, baseN, baseK | int | Shape sizes of matrices A, B, and C involved in a single matrix multiplication instruction, in elements. These shapes must be fractal-aligned. For details, see the data format description in Mmad. |
| depthA1, depthB1 | int | Number of full-load A2/B2 slices of matrices A and B buffered in A1/B1. depthA1 is counted in base blocks of size baseM x baseK, and depthB1 is counted in base blocks of size baseN x baseK. The value must be greater than 0. |
| stepM, stepN, stepKa, stepKb | int | stepM is the number of baseM blocks of the left matrix buffered in A1 in the M direction. stepN is the number of baseN blocks of the right matrix buffered in B1 in the N direction. stepKa is the number of baseK blocks of the left matrix buffered in A1 in the Ka direction. stepKb is the number of baseK blocks of the right matrix buffered in B1 in the Kb direction. The value must be greater than 0. |
| isBias | int | Whether to enable Bias. The value 0 disables Bias and the value 1 enables Bias. |
| transLength | int | max(A1Length, B1Length, C1Length, BiasLength). A1Length, B1Length, C1Length, and BiasLength indicate the sizes of the UB space that the A, B, C, and Bias matrices temporarily occupy during computation, respectively. |
| iterateOrder | int | Each Iterate call computes a slice of matrix C of size [baseM, baseN]. After an Iterate call completes, Matmul automatically offsets the position in matrix C that the next Iterate call outputs. iterateOrder indicates the automatic offset order. Values: |
| dbL0A, dbL0B, dbL0C | int | Whether to enable double buffering. dbL0A: whether to enable double buffering for the left matrix MTE1. dbL0B: whether to enable double buffering for the right matrix MTE1. dbL0C: whether to enable double buffering for MMAD. Values: |
| shareMode | int | This parameter is reserved and can be ignored. |
| shareL1Size | int | This parameter is reserved and can be ignored. |
| shareL0CSize | int | This parameter is reserved and can be ignored. |
| shareUbSize | int | This parameter is reserved and can be ignored. |
| batchM | int | This parameter is reserved and can be ignored. |
| batchN | int | This parameter is reserved and can be ignored. |
| singleBatchM | int | This parameter is reserved and can be ignored. |
| singleBatchN | int | This parameter is reserved and can be ignored. |
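The relationship between usedCoreNum and the single-core shapes can be illustrated with a minimal sketch. The helper names CeilDivInt and CoresForSplit are hypothetical and not part of the Matmul API. Rounding the divisions up is an assumption made here so that a tail block still occupies a core; the table itself only states usedCoreNum = (M/singleCoreM) x (N/singleCoreN).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the Matmul API): ceiling division
// for a non-negative dividend and a positive divisor.
inline int32_t CeilDivInt(int32_t a, int32_t b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

// Number of cores implied by splitting an M x N output into
// singleCoreM x singleCoreN blocks.
// Assumption: a partial (tail) block still needs its own core,
// so both divisions are rounded up.
inline int32_t CoresForSplit(int32_t m, int32_t n,
                             int32_t singleCoreM, int32_t singleCoreN) {
    return CeilDivInt(m, singleCoreM) * CeilDivInt(n, singleCoreN);
}
```

For example, CoresForSplit(1024, 512, 256, 256) gives 4 blocks along M and 2 along N, that is, 8 cores. The result still has to stay within the number of cores of the target AI Processor.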
Call the GetTiling API to obtain the TCubeTiling structure. For details, see Instructions for Use. To modify tiling, set the parameters by referring to the following TCubeTiling parameter constraints and recommended values for performance tuning.
- TCubeTiling constraints
A group of valid TCubeTiling parameters must meet all the constraints listed in Table 2. If the MatmulConfig template of a Matmul object is the MDL template, the constraints listed in Table 3 must also be met in addition to those listed in Table 2.
Table 2 TCubeTiling constraints

| Constraint | Description |
|---|---|
| usedCoreNum <= aiCoreCnt | The number of used cores is less than or equal to the maximum number of cores configured in the current AI processor. |
| baseM x baseK x sizeof(A_type) x dbL0A < l0a_size | The size of a base block of matrix A does not exceed the size of the l0a buffer. |
| baseN x baseK x sizeof(B_type) x dbL0B < l0b_size | The size of a base block of matrix B does not exceed the size of the l0b buffer. |
| baseM x baseN x sizeof(int32_t) x dbL0C < l0c_size | The size of a base block of matrix C does not exceed the size of the l0c buffer. |
| baseN x sizeof(Bias_type) < biasT_size | The size of a base block of Bias does not exceed the size of the BiasTable buffer. |
| stepM x stepKa x db = depthA1 | db indicates whether double buffering is enabled for the left matrix MTE2, that is, whether double buffering is enabled for L1. The value can be 1 (double buffering disabled) or 2 (double buffering enabled). The value of depthA1 is the same as that of stepM x stepKa x db. |
| stepN x stepKb x db = depthB1 | db indicates whether double buffering is enabled for the right matrix MTE2, that is, whether double buffering is enabled for L1. The value can be 1 (double buffering disabled) or 2 (double buffering enabled). The value of depthB1 is the same as that of stepN x stepKb x db. |
| baseM x baseK x depthA1 x sizeof(A_type) + baseN x baseK x depthB1 x sizeof(B_type) ≤ L1_size | Matrix A and matrix B meet the buffer size limit of the L1 Buffer block. |
| baseM x baseK, baseK x baseN, and baseM x baseN are fractal-aligned in NZ format. | The base blocks of matrix A, matrix B, and matrix C must meet the following alignment constraints: baseM and baseN must be 16-element aligned, and baseK must be C0_size aligned. Note: C0_size of half/bfloat16_t data is 16, C0_size of float data is 8, C0_size of int8_t data is 32, and C0_size of int4_t data is 64. |
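The Table 2 constraints can be checked mechanically before a modified tiling is passed to the kernel. The following is a minimal sketch, not an API of the Matmul library: TilingSketch, PlatformSketch, TypeSketch, IsDepthValid, and SatisfiesTable2 are hypothetical names, the structs only mirror the fields that Table 2 refers to, and the buffer sizes must be supplied by the caller for the actual AI processor. The Bias constraint is applied unconditionally, as listed in the table.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the TCubeTiling fields that Table 2 refers to.
struct TilingSketch {
    int32_t usedCoreNum;
    int32_t baseM, baseN, baseK;
    int32_t depthA1, depthB1;
    int32_t stepM, stepN, stepKa, stepKb;
    int32_t dbL0A, dbL0B, dbL0C;
};

// Hypothetical platform limits (buffer sizes in bytes).
// Real values depend on the target AI processor.
struct PlatformSketch {
    int32_t aiCoreCnt;
    int64_t l0aSize, l0bSize, l0cSize, l1Size, biasTableSize;
};

// Element sizes in bytes plus C0_size for the input data type.
struct TypeSketch {
    int64_t sizeofA, sizeofB, sizeofBias;
    int32_t c0Size;
};

// depth must equal stepOuter x stepK x db, where db is 1 or 2.
inline bool IsDepthValid(int32_t stepOuter, int32_t stepK, int32_t depth) {
    const int32_t blocks = stepOuter * stepK;
    return blocks > 0 && (depth == blocks || depth == 2 * blocks);
}

// Returns true only if every Table 2 constraint holds.
inline bool SatisfiesTable2(const TilingSketch& t, const PlatformSketch& p,
                            const TypeSketch& ty) {
    assert(ty.c0Size > 0);
    // Widen to 64 bits before multiplying to avoid overflow.
    const int64_t baseM = t.baseM, baseN = t.baseN, baseK = t.baseK;
    const int64_t l1Bytes = baseM * baseK * t.depthA1 * ty.sizeofA +
                            baseN * baseK * t.depthB1 * ty.sizeofB;
    return t.usedCoreNum <= p.aiCoreCnt
        && baseM * baseK * ty.sizeofA * t.dbL0A < p.l0aSize
        && baseN * baseK * ty.sizeofB * t.dbL0B < p.l0bSize
        && baseM * baseN * static_cast<int64_t>(sizeof(int32_t)) * t.dbL0C < p.l0cSize
        && baseN * ty.sizeofBias < p.biasTableSize
        && IsDepthValid(t.stepM, t.stepKa, t.depthA1)
        && IsDepthValid(t.stepN, t.stepKb, t.depthB1)
        && l1Bytes <= p.l1Size
        && t.baseM % 16 == 0 && t.baseN % 16 == 0
        && t.baseK % ty.c0Size == 0;
}
```

A caller would fill TilingSketch from the tiling it intends to use and PlatformSketch from the hardware description, then reject the tiling if SatisfiesTable2 returns false. Returning a single boolean keeps the sketch short; a production check would more likely report which constraint failed.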
Table 3 Additional constraints for the MDL template

| Constraint | Description |
|---|---|
| When data in the Ka direction is not fully loaded, that is, Ka/baseK > stepKa, stepM = 1. | When data in the K direction is not fully loaded, data in the M direction must be moved block-wise. |
| When data in the Kb direction is not fully loaded, that is, Kb/baseK > stepKb, stepN = 1. | When data in the K direction is not fully loaded, data in the N direction must be moved block-wise. |
| kaStepIter_ % kbStepIter_ = 0 or kbStepIter_ % kaStepIter_ = 0, where kaStepIter_ = CeilDiv(tiling_->singleCoreK_, tiling_->baseK * tiling_->stepKa) and kbStepIter_ = CeilDiv(tiling_->singleCoreK_, tiling_->baseK * tiling_->stepKb) | For the K-direction cyclic movement in the MDL template, the numbers of iterations in the Ka and Kb directions must be multiples of each other. kaStepIter_: number of cyclic movement iterations in the Ka direction. kbStepIter_: number of cyclic movement iterations in the Kb direction. |
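The additional MDL constraints in Table 3 can be expressed as a short check. This is a sketch under stated assumptions: MdlTilingSketch, CeilDivK, and SatisfiesMdlConstraints are hypothetical names, CeilDivK is a local stand-in for the CeilDiv used in the table, and Ka/baseK and Kb/baseK are assumed to count base blocks rounded up.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in for the CeilDiv used in Table 3.
inline int32_t CeilDivK(int32_t a, int32_t b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

// Hypothetical subset of the TCubeTiling fields needed for the MDL checks.
struct MdlTilingSketch {
    int32_t Ka, Kb, singleCoreK;
    int32_t baseK;
    int32_t stepM, stepN, stepKa, stepKb;
};

// Returns true only if all three Table 3 constraints hold.
inline bool SatisfiesMdlConstraints(const MdlTilingSketch& t) {
    assert(t.singleCoreK > 0 && t.baseK > 0 && t.stepKa > 0 && t.stepKb > 0);
    // Assumption: Ka/baseK and Kb/baseK are block counts, rounded up.
    const bool kaFullyLoaded = CeilDivK(t.Ka, t.baseK) <= t.stepKa;
    const bool kbFullyLoaded = CeilDivK(t.Kb, t.baseK) <= t.stepKb;
    // K not fully loaded: M (or N) must be moved one base block at a time.
    if (!kaFullyLoaded && t.stepM != 1) return false;
    if (!kbFullyLoaded && t.stepN != 1) return false;
    // Iteration counts in the Ka and Kb directions must divide one another.
    const int32_t kaStepIter = CeilDivK(t.singleCoreK, t.baseK * t.stepKa);
    const int32_t kbStepIter = CeilDivK(t.singleCoreK, t.baseK * t.stepKb);
    return kaStepIter % kbStepIter == 0 || kbStepIter % kaStepIter == 0;
}
```

For example, with singleCoreK = 1024 and baseK = 64, stepKa = 4 and stepKb = 2 give 4 and 8 iterations, which satisfies the constraint, whereas stepKa = 4 and stepKb = 3 give 4 and 6 iterations, which does not.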
- Recommended values for performance tuning
Based on tiling tuning experience, the recommended or example values for some TCubeTiling parameters are as follows:
- Recommended base block (baseM, baseN, baseK): (128, 256, 64)
- dbL0A = 2 and dbL0B = 2
- depthA1/(stepM x stepKa) = 2
- depthB1/(stepN x stepKb) = 2
- Set stepKa and stepKb first so that data in the K direction is fully loaded, and only then increase stepM or stepN to buffer more data in the M or N direction.
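
The recommendations above can be turned into a small planning sketch: choose stepKa and stepKb so that K is fully loaded, keep stepM = stepN = 1, and use double buffering in L1 so that depthA1/(stepM x stepKa) = 2 and depthB1/(stepN x stepKb) = 2. L1PlanSketch, PlanFullKLoad, and L1BytesNeeded are hypothetical names, and the sketch does not check the result against L1_size; the caller must still do that, and fall back to smaller steps if the plan does not fit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the L1-related tiling parameters.
struct L1PlanSketch {
    int32_t stepM, stepN, stepKa, stepKb;
    int32_t depthA1, depthB1;
};

// Chooses steps so that the K direction is fully loaded
// (stepK x baseK covers singleCoreK), with L1 double buffering (db = 2).
inline L1PlanSketch PlanFullKLoad(int32_t singleCoreK, int32_t baseK) {
    assert(singleCoreK > 0 && baseK > 0);
    const int32_t kBlocks = (singleCoreK + baseK - 1) / baseK;
    const int32_t db = 2;
    L1PlanSketch plan{};
    plan.stepM = 1;
    plan.stepN = 1;
    plan.stepKa = kBlocks;
    plan.stepKb = kBlocks;
    plan.depthA1 = plan.stepM * plan.stepKa * db;
    plan.depthB1 = plan.stepN * plan.stepKb * db;
    return plan;
}

// L1 bytes required by the plan, following the L1 constraint in Table 2.
inline int64_t L1BytesNeeded(const L1PlanSketch& plan,
                             int64_t baseM, int64_t baseN, int64_t baseK,
                             int64_t sizeofA, int64_t sizeofB) {
    return baseM * baseK * plan.depthA1 * sizeofA +
           baseN * baseK * plan.depthB1 * sizeofB;
}
```

With the recommended base block (128, 256, 64), 2-byte inputs, and singleCoreK = 256, the plan has stepKa = stepKb = 4 and depthA1 = depthB1 = 8, and needs 393216 bytes of L1.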