TCubeTiling Structure

The TCubeTiling structure contains parameters related to the Matmul tiling algorithm that are passed to the Matmul kernel for Matmul tiling, data movement, and computation. For details about the parameters of the TCubeTiling structure, see Table 1.

**Table 1** Description for the TCubeTiling structure
Parameter	Data Type	Description
usedCoreNum	int	Number of AI Processor cores used. Set this parameter based on your actual requirements. Value range: [1, Maximum number of AI Processor cores]. The relationship between this parameter and the shape-related parameters is as follows: usedCoreNum = (M/singleCoreM) x (N/singleCoreN).
M, N, Ka, Kb	int	Shape sizes of the original inputs of matrices A, B, and C, in elements. M and Ka are the original input shape of matrix A, and Kb and N are the original input shape of matrix B. Size constraints: If matrix A is in ND format and not transposed, the value range of Ka is [1, 65535] and the size of M is not limited; if matrix A is transposed, the value range of M is [1, 65535] and the size of Ka is not limited. If matrix B is in ND format and not transposed, the value range of N is [1, 65535] and the size of Kb is not limited; if matrix B is transposed, the value range of Kb is [1, 65535] and the size of N is not limited. Alignment constraints: If the input of matrix A is in NZ format, M must be 16-element aligned and K must be aligned with C0_size. If the input of matrix B is in NZ format, K must be aligned with C0_size and N must be 16-element aligned. If matrices A and B are in ND format, there is no alignment constraint. Note: For input in NZ format, C0_size of half/bfloat16_t data is 16, C0_size of float data is 8, C0_size of int8_t data is 32, and C0_size of int4_t data is 64.
singleCoreM, singleCoreN, singleCoreK	int	Shape sizes of matrices A, B, and C in a single core, in elements. singleCoreK = K (K is not tiled during multi-core processing); singleCoreM ≤ M; singleCoreN ≤ N. Note: If the input of matrix A is in NZ format, singleCoreM must be 16-element aligned and singleCoreK must be aligned with C0_size × fractal_num. If the input of matrix B is in NZ format, singleCoreK must be aligned with C0_size × fractal_num and singleCoreN must be 16-element aligned. For input in NZ format, C0_size of half/bfloat16_t data is 16, C0_size of float data is 8, C0_size of int8_t data is 32, and C0_size of int4_t data is 64.
baseM, baseN, baseK	int	Shape sizes of matrices A, B, and C involved in a single matrix multiplication instruction, in elements. Buffer constraints: baseM x baseN x sizeof(l0c_dtype) ≤ L0C_size, where l0c_dtype is int32_t or float; baseM x baseK x sizeof(Input_dtype) ≤ L0A_size; baseK x baseN x sizeof(Input_dtype) ≤ L0B_size. The shapes of matrices A, B, and C participating in a matrix multiplication must be fractal-aligned. For details, see the data format description in Mmad.
depthA1, depthB1	int	Number of slices of matrices A and B buffered in A1/B1, where each slice is one full load of A2/B2. depthA1 is counted in units of baseM x baseK blocks, and depthB1 is counted in units of baseN x baseK blocks. Each value must be greater than 0.
stepM, stepN, stepKa, stepKb	int	stepM: number of baseM-sized blocks of the left matrix buffered in A1 along the M direction. stepN: number of baseN-sized blocks of the right matrix buffered in B1 along the N direction. stepKa: number of baseK-sized blocks of the left matrix buffered in A1 along the Ka direction. stepKb: number of baseK-sized blocks of the right matrix buffered in B1 along the Kb direction. Each value must be greater than 0.
isBias	int	Whether to enable Bias. The value 0 disables Bias and the value 1 enables Bias.
transLength	int	max(A1Length, B1Length, C1Length, BiasLength). A1Length, B1Length, C1Length, and BiasLength indicate the sizes of UB space that needs to be temporarily occupied by the A, B, C, and Bias matrices during computation, respectively.
iterateOrder	int	Each Iterate call computes a slice of matrix C of size [baseM, baseN]. After an Iterate call completes, Matmul automatically offsets the position in matrix C that the next Iterate call outputs. iterateOrder specifies the order of this automatic offset. Values: 0: offsets along the M-axis direction first and then along the N-axis direction. 1: offsets along the N-axis direction first and then along the M-axis direction.
dbL0A, dbL0B, dbL0C	int	Whether to enable double buffering for MTE1. dbL0A: Whether to enable double buffering for the left matrix MTE1. dbL0B: Whether to enable double buffering for the right matrix MTE1. dbL0C: Whether to enable double buffering for MMAD. Values: 1: disables double buffering. 2: enables double buffering.
shareMode	int	This parameter is reserved and can be ignored.
shareL1Size	int	This parameter is reserved and can be ignored.
shareL0CSize	int	This parameter is reserved and can be ignored.
shareUbSize	int	This parameter is reserved and can be ignored.
batchM	int	This parameter is reserved and can be ignored.
batchN	int	This parameter is reserved and can be ignored.
singleBatchM	int	This parameter is reserved and can be ignored.
singleBatchN	int	This parameter is reserved and can be ignored.
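
The iterateOrder description in Table 1 can be illustrated with a small host-side sketch. BlockOrder is a hypothetical helper, not part of the Matmul API, and it assumes that "offsets along the M-axis direction first" means the M block index advances fastest. It lists the (mIdx, nIdx) coordinates of the [baseM, baseN] slices in the order that successive Iterate calls would visit them.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper (not part of the Matmul API): returns the (mIdx, nIdx)
// coordinates of the [baseM, baseN] output slices in visiting order.
// Assumption: "M-axis first" means the M index varies fastest.
std::vector<std::pair<int, int>> BlockOrder(int mBlocks, int nBlocks, int iterateOrder) {
    assert(mBlocks > 0 && nBlocks > 0);
    assert(iterateOrder == 0 || iterateOrder == 1);
    std::vector<std::pair<int, int>> order;
    if (iterateOrder == 0) {
        // 0: walk down the M axis, then step to the next N position.
        for (int n = 0; n < nBlocks; ++n) {
            for (int m = 0; m < mBlocks; ++m) {
                order.emplace_back(m, n);
            }
        }
    } else {
        // 1: walk along the N axis, then step to the next M position.
        for (int m = 0; m < mBlocks; ++m) {
            for (int n = 0; n < nBlocks; ++n) {
                order.emplace_back(m, n);
            }
        }
    }
    return order;
}
```

Under this reading, a 2 x 2 grid of slices is visited as (0,0), (1,0), (0,1), (1,1) when iterateOrder is 0, and as (0,0), (0,1), (1,0), (1,1) when it is 1.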

Call the GetTiling API to obtain the TCubeTiling structure. For details, see Instructions for Use. To modify tiling, set the parameters by referring to the following TCubeTiling parameter constraints and recommended values for performance tuning.
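
When adjusting the per-core split by hand, the relationship usedCoreNum = (M/singleCoreM) x (N/singleCoreN) from Table 1 is a useful sanity check. The sketch below is host-side arithmetic only. CoresNeeded is a hypothetical helper, and rounding up when M or N is not an exact multiple of the per-core size is an assumption, since the table writes a plain division.

```cpp
#include <cassert>

// Round-up integer division for positive operands.
int CeilDiv(int a, int b) {
    assert(a > 0 && b > 0);
    return (a + b - 1) / b;
}

// Hypothetical helper: cores implied by the M/N split across cores.
// K is not tiled across cores (singleCoreK = K), so it does not appear.
// Assumption: partial tail blocks still occupy a core, hence the round-up.
int CoresNeeded(int M, int N, int singleCoreM, int singleCoreN) {
    assert(singleCoreM <= M && singleCoreN <= N);
    return CeilDiv(M, singleCoreM) * CeilDiv(N, singleCoreN);
}
```

For example, M = 1024 and N = 2048 with singleCoreM = 256 and singleCoreN = 512 gives 4 x 4 = 16 cores, which must not exceed the number of cores available.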

TCubeTiling constraints

A group of valid TCubeTiling parameters must meet all the constraints listed in Table 2. If the MatmulConfig template of a Matmul object is an MDL template, the constraints listed in Table 3 must also be met in addition to those listed in Table 2.

**Table 2** TCubeTiling constraints
Constraint	Description
usedCoreNum <= aiCoreCnt	The number of used cores is less than or equal to the maximum number of cores configured in the current AI processor.
baseM x baseK x sizeof(A_type) x dbL0A < l0a_size	The size of a base block of matrix A does not exceed the size of the l0a buffer.
baseN x baseK x sizeof(B_type) x dbL0B < l0b_size	The size of a base block of matrix B does not exceed the size of the l0b buffer.
baseM x baseN x sizeof(int32_t) x dbL0C < l0c_size	The size of a base block of matrix C does not exceed the size of the l0c buffer.
baseN x sizeof(Bias_type) < biasT_size	The size of a base block of Bias does not exceed the size of the BiasTable buffer.
stepM x stepKa x db = depthA1 db indicates whether double buffering is enabled for the left matrix MTE2, that is, whether double buffering is enabled for L1. The value can be 1 (double buffering disabled) or 2 (double buffering enabled).	The value of depthA1 is the same as that of stepM x stepKa x db.
stepN x stepKb x db = depthB1 db indicates whether double buffering is enabled for the right matrix MTE2, that is, whether double buffering is enabled for L1. The value can be 1 (double buffering disabled) or 2 (double buffering enabled).	The value of depthB1 is the same as that of stepN x stepKb x db.
baseM x baseK x depthA1 x sizeof(A_type) + baseN x baseK x depthB1 x sizeof(B_type) ≤ L1_size	Matrix A and matrix B meet the buffer size limit of the L1 Buffer block.
baseM x baseK, baseK x baseN, and baseM x baseN are fractal-aligned in NZ format.	The base blocks of matrix A, matrix B, and matrix C must meet the following alignment constraints: baseM and baseN must be 16-element aligned, and baseK must be C0_size aligned. Note: C0_size of half/bfloat16_t data is 16, C0_size of float data is 8, C0_size of int8_t data is 32, and C0_size of int4_t data is 64.
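
The constraints in Table 2 can be checked on the host before a modified tiling is passed to the kernel. The following is a minimal sketch, not the library's own validation. TilingSketch and PlatformSketch are hypothetical structures that mirror only the fields involved, and every buffer size and element size is supplied by the caller rather than queried from the platform. The comparison operators are copied from the Constraint column.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the TCubeTiling fields used in Table 2.
struct TilingSketch {
    int usedCoreNum, baseM, baseN, baseK;
    int depthA1, depthB1, stepM, stepN, stepKa, stepKb;
    int dbL0A, dbL0B, dbL0C;
};

// Hypothetical platform description; all values come from the caller.
struct PlatformSketch {
    int aiCoreCnt;
    int64_t l0aSize, l0bSize, l0cSize, biasTSize, l1Size;  // bytes
    int64_t aBytes, bBytes, biasBytes;                     // sizeof(A_type) etc.
    int c0Size;                                            // elements
};

bool SatisfiesTable2(const TilingSketch& t, const PlatformSketch& p) {
    assert(p.c0Size > 0);
    const int64_t aBlock = int64_t(t.baseM) * t.baseK * p.aBytes;
    const int64_t bBlock = int64_t(t.baseN) * t.baseK * p.bBytes;
    const int64_t cBlock = int64_t(t.baseM) * t.baseN * 4;  // sizeof(int32_t)
    // depth = step x step x db, where db is 1 or 2 (L1 double buffering).
    const int stepsA = t.stepM * t.stepKa;
    const int stepsB = t.stepN * t.stepKb;
    const bool depthAOk = t.depthA1 == stepsA || t.depthA1 == 2 * stepsA;
    const bool depthBOk = t.depthB1 == stepsB || t.depthB1 == 2 * stepsB;
    return t.usedCoreNum <= p.aiCoreCnt
        && aBlock * t.dbL0A < p.l0aSize
        && bBlock * t.dbL0B < p.l0bSize
        && cBlock * t.dbL0C < p.l0cSize
        && t.baseN * p.biasBytes < p.biasTSize
        && depthAOk && depthBOk
        && aBlock * t.depthA1 + bBlock * t.depthB1 <= p.l1Size
        && t.baseM % 16 == 0 && t.baseN % 16 == 0
        && t.baseK % p.c0Size == 0;
}
```

A caller would fill PlatformSketch with the limits of the target AI processor and the element sizes of the chosen data types, then reject any tiling for which SatisfiesTable2 returns false.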

**Table 3** Additional constraints for the MDL template
Constraint	Description
When data in the Ka direction is not fully loaded, that is, Ka/baseK > stepKa, stepM = 1.	When data in the K direction is not fully loaded, data in the M direction must be moved block-wise.
When data in the Kb direction is not fully loaded, that is, Kb/baseK > stepKb, stepN = 1.	When data in the K direction is not fully loaded, data in the N direction must be moved block-wise.
kaStepIter_ % kbStepIter_ = 0 or kbStepIter_ % kaStepIter_ = 0, where kaStepIter_ = CeilDiv(tiling_->singleCoreK_, tiling_->baseK * tiling_->stepKa) and kbStepIter_ = CeilDiv(tiling_->singleCoreK_, tiling_->baseK * tiling_->stepKb).	For the K-direction cyclic movement in the MDL template, one of the iteration counts in the Ka and Kb directions must be an integer multiple of the other. kaStepIter_: number of cyclic movement iterations in the Ka direction. kbStepIter_: number of cyclic movement iterations in the Kb direction.
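
The additional MDL constraints in Table 3 reduce to a few integer comparisons. The sketch below is a hypothetical host-side check, not code from the Matmul library. SatisfiesMdl and its parameter list are illustrative names, and Ka/baseK is evaluated with truncating integer division, which is an assumption about how the table's expression is meant.

```cpp
#include <cassert>

// Round-up integer division for positive operands.
int CeilDiv(int a, int b) {
    assert(a > 0 && b > 0);
    return (a + b - 1) / b;
}

// Hypothetical check of the Table 3 constraints for the MDL template.
bool SatisfiesMdl(int Ka, int Kb, int singleCoreK, int baseK,
                  int stepM, int stepN, int stepKa, int stepKb) {
    assert(baseK > 0 && stepKa > 0 && stepKb > 0);
    // K not fully loaded for A: M must move one base block at a time.
    if (Ka / baseK > stepKa && stepM != 1) {
        return false;
    }
    // K not fully loaded for B: N must move one base block at a time.
    if (Kb / baseK > stepKb && stepN != 1) {
        return false;
    }
    // One K-direction iteration count must be a multiple of the other.
    const int kaStepIter = CeilDiv(singleCoreK, baseK * stepKa);
    const int kbStepIter = CeilDiv(singleCoreK, baseK * stepKb);
    return kaStepIter % kbStepIter == 0 || kbStepIter % kaStepIter == 0;
}
```

For example, with K = 1024 and baseK = 64, stepKa = 4 and stepKb = 8 give 4 and 2 iterations, which passes, whereas stepKa = 4 and stepKb = 6 give 4 and 3 iterations, which fails.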

Recommended values for performance tuning
Based on the tiling tuning experience, the recommended values or example values for some TCubeTiling parameters are as follows:
- Recommended base block (baseM, baseN, baseK): (128, 256, 64)
- dbL0A/dbL0B = 2
- depthA1/(stepM x stepKa) = 2
- depthB1/(stepN x stepKb) = 2
- Set the stepKa and stepKb parameters first so that data in the K direction is fully loaded where possible, and only then increase the load in the M or N direction.
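
To see what the recommended values cost in buffer space, the byte footprints can be computed directly. The sketch below assumes 2-byte inputs (for example half) and a 4-byte accumulator. BlockBytes and L1Bytes are hypothetical helpers, and the step values in the example that follows are illustrative rather than recommendations.

```cpp
#include <cassert>
#include <cstdint>

// Bytes occupied by one rows x cols base block of elemBytes-sized elements.
int64_t BlockBytes(int rows, int cols, int elemBytes) {
    assert(rows > 0 && cols > 0 && elemBytes > 0);
    return int64_t(rows) * cols * elemBytes;
}

// Left-hand side of the L1 constraint in Table 2:
// baseM x baseK x depthA1 x sizeof(A_type) + baseN x baseK x depthB1 x sizeof(B_type).
int64_t L1Bytes(int baseM, int baseN, int baseK,
                int depthA1, int depthB1, int aBytes, int bBytes) {
    return BlockBytes(baseM, baseK, aBytes) * depthA1
         + BlockBytes(baseN, baseK, bBytes) * depthB1;
}
```

With the base block (128, 256, 64), one A block takes 16384 bytes, one B block 32768 bytes, and one C block 131072 bytes. Taking stepM = stepN = 1 and stepKa = stepKb = 4 as an example, the recommended ratio of 2 gives depthA1 = depthB1 = 8, so L1Bytes returns 393216 bytes, which must fit within L1_size.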
