TCubeTiling Structure

The TCubeTiling structure holds the parameters produced by the Matmul tiling algorithm. It is passed to the Matmul kernel, which uses it for tiling, data movement, and computation. Table 1 describes the parameters of the TCubeTiling structure.

Table 1 Description for the TCubeTiling structure

Parameter

Data Type

Description

usedCoreNum

int

Number of AI Processor cores used. Set this parameter based on your actual requirements. Value range: [1, Maximum number of AI Processor cores]. The relationship between this parameter and the shape-related parameters is as follows: usedCoreNum = (M/singleCoreM) x (N/singleCoreN).

M, N, Ka, Kb

int

Shape size of the original input of matrices A, B, and C, in elements. M and Ka are the original input shapes of matrix A, and Kb and N are the original input shapes of matrix B.

  • Size constraints
    • If matrix A is in ND format and not transposed, the value range of Ka is [1, 65535] and the size of M is not limited; if matrix A is transposed, the value range of M is [1, 65535] and the size of Ka is not limited.
    • If matrix B is in ND format and not transposed, the value range of N is [1, 65535] and the size of Kb is not limited; if matrix B is transposed, the value range of Kb is [1, 65535] and the size of N is not limited.
  • Alignment constraints
    • If the input of matrix A is in NZ format, M must be 16-element aligned and K must be aligned with C0_size. If the input of matrix B is in NZ format, K must be aligned with C0_size and N must be 16-element aligned.
    • If matrices A and B are in ND format, there is no alignment constraint.

    Note: For input in NZ format, C0_size of half/bfloat16_t data is 16, C0_size of float data is 8, C0_size of int8_t data is 32, and C0_size of int4_t is 64.

singleCoreM, singleCoreN, singleCoreK

int

Shape sizes of matrices A, B, and C in a single core, in elements.

singleCoreK = K (K is not tiled during multi-core processing); singleCoreM ≤ M; singleCoreN ≤ N

Note: If the input of matrix A is in NZ format, singleCoreM must be 16-element aligned and singleCoreK must be aligned with C0_size × fractal_num. If the input of matrix B is in NZ format, singleCoreK must be aligned with C0_size × fractal_num and singleCoreN must be 16-element aligned.

For input in NZ format, C0_size of half/bfloat16_t data is 16, C0_size of float data is 8, C0_size of int8_t data is 32, and C0_size of int4_t is 64.

baseM, baseN, baseK

int

Shape sizes of matrices A, B, and C involved in a matrix multiplication instruction, in elements.

  • baseM x baseN x sizeof(l0c_dtype) ≤ L0C_size, where l0c_dtype is of the int32_t or float type.
  • baseM x baseK x sizeof(Input_dtype) ≤ L0A_size
  • baseK x baseN x sizeof(Input_dtype) ≤ L0B_size

The shape size of matrices A, B, and C participating in a matrix multiplication must be fractal-aligned. For details, see the data format description in Mmad.

depthA1, depthB1

int

Number of slices of matrices A and B stored in A1 and B1, where each slice fully loads A2 or B2. depthA1 counts slices of size baseM x baseK, and depthB1 counts slices of size baseN x baseK. The value must be greater than 0.

stepM, stepN, stepKa, stepKb

int

stepM is the size of the left-matrix buffer in A1 along the M direction, expressed as a multiple of baseM.

stepN is the size of the right-matrix buffer in B1 along the N direction, expressed as a multiple of baseN.

stepKa is the size of the left-matrix buffer in A1 along the Ka direction, expressed as a multiple of baseK.

stepKb is the size of the right-matrix buffer in B1 along the Kb direction, expressed as a multiple of baseK.

The value must be greater than 0.

isBias

int

Whether to enable Bias. The value 0 disables Bias and the value 1 enables Bias.

transLength

int

max(A1Length, B1Length, C1Length, BiasLength). A1Length, B1Length, C1Length, and BiasLength indicate the sizes of the UB space that the A, B, C, and Bias matrices temporarily occupy during computation, respectively.

iterateOrder

int

Each Iterate call computes a slice of matrix C of size [baseM, baseN]. After an Iterate call completes, Matmul automatically offsets the position in matrix C that the next Iterate call writes to. iterateOrder indicates the order of this automatic offset. Values:

  • 0: offsets along the M-axis direction first and then along the N-axis direction.
  • 1: offsets along the N-axis direction first and then along the M-axis direction.

dbL0A, dbL0B, dbL0C

int

Whether to enable double buffering for MTE1.

dbL0A: Whether to enable double buffering for the left matrix MTE1. dbL0B: Whether to enable double buffering for the right matrix MTE1. dbL0C: Whether to enable double buffering for MMAD. Values:

  • 1: disables double buffering.
  • 2: enables double buffering.

shareMode

int

This parameter is reserved and can be ignored.

shareL1Size

int

This parameter is reserved and can be ignored.

shareL0CSize

int

This parameter is reserved and can be ignored.

shareUbSize

int

This parameter is reserved and can be ignored.

batchM

int

This parameter is reserved and can be ignored.

batchN

int

This parameter is reserved and can be ignored.

singleBatchM

int

This parameter is reserved and can be ignored.

singleBatchN

int

This parameter is reserved and can be ignored.
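Several relationships in Table 1 are plain integer arithmetic and can be checked on the host before a tiling is used. The following is a minimal sketch, not part of the Ascend C API: the helper names (CeilDiv, CoresNeeded, C0Size, NzShapeAlignedA) are hypothetical, and the use of ceiling division for shapes that do not divide evenly is an assumption. C0Size relies on the observation that every C0_size value in the note corresponds to 256 bits of data.

```cpp
// Hypothetical helper: ceiling division for positive integers.
constexpr int CeilDiv(int a, int b) { return (a + b - 1) / b; }

// Number of cores implied by the single-core split of the M and N axes,
// following usedCoreNum = (M/singleCoreM) x (N/singleCoreN).
// Rounding up for shapes that do not divide evenly is an assumption.
constexpr int CoresNeeded(int M, int N, int singleCoreM, int singleCoreN) {
    return CeilDiv(M, singleCoreM) * CeilDiv(N, singleCoreN);
}

// C0_size in elements for NZ-format input, keyed by element width in bits:
// half/bfloat16_t = 16 bits, float = 32 bits, int8_t = 8 bits, int4_t = 4 bits.
constexpr int C0Size(int elemBits) { return 256 / elemBits; }

// NZ-format alignment check for the original shape of matrix A:
// M must be 16-element aligned and Ka must be aligned with C0_size.
constexpr bool NzShapeAlignedA(int M, int Ka, int elemBits) {
    return M % 16 == 0 && Ka % C0Size(elemBits) == 0;
}
```

For example, CoresNeeded(1024, 2048, 256, 512) evaluates to 16, which must not exceed the number of available AI Processor cores.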

Call the GetTiling API to obtain the TCubeTiling structure. For details, see Instructions for Use. To modify tiling, set the parameters by referring to the following TCubeTiling parameter constraints and recommended values for performance tuning.

  • TCubeTiling constraints
    A group of valid TCubeTiling parameters must meet all the constraints listed in Table 2. If the MatmulConfig template of a Matmul object is an MDL template, the constraints listed in Table 3 must also be met in addition to those listed in Table 2.
    Table 2 TCubeTiling constraints

    Constraint

    Description

    usedCoreNum <= aiCoreCnt

    The number of used cores is less than or equal to the maximum number of cores configured in the current AI processor.

    baseM x baseK x sizeof(A_type) x dbL0A < l0a_size

    The size of a base block of matrix A does not exceed the size of the l0a buffer.

    baseN x baseK x sizeof(B_type) x dbL0B < l0b_size

    The size of a base block of matrix B does not exceed the size of the l0b buffer.

    baseM x baseN x sizeof(int32_t) x dbL0C < l0c_size

    The size of a base block of matrix C does not exceed the size of the l0c buffer.

    baseN x sizeof(Bias_type) < biasT_size

    The size of a base block of Bias does not exceed the size of the BiasTable buffer.

    stepM x stepKa x db = depthA1

    db indicates whether double buffering is enabled for the left matrix MTE2, that is, whether double buffering is enabled for L1. The value can be 1 (double buffering disabled) or 2 (double buffering enabled).

    The value of depthA1 is the same as that of stepM x stepKa x db.

    stepN x stepKb x db = depthB1

    db indicates whether double buffering is enabled for the right matrix MTE2, that is, whether double buffering is enabled for L1. The value can be 1 (double buffering disabled) or 2 (double buffering enabled).

    The value of depthB1 is the same as that of stepN x stepKb x db.

    baseM x baseK x depthA1 x sizeof(A_type) + baseN x baseK x depthB1 x sizeof(B_type) ≤ L1_size

    Matrix A and matrix B meet the buffer size limit of the L1 Buffer block.

    baseM x baseK, baseK x baseN and baseM x baseN are fractal-aligned in NZ format.

    The base blocks of matrix A, matrix B, and matrix C must meet the following alignment constraints:

    • baseM and baseN must be 16-element aligned, and baseK must be C0_size aligned.

    Note: C0_size of half/bfloat16_t data is 16, C0_size of float data is 8, C0_size of int8_t data is 32, and C0_size of int4_t data is 64.
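The constraints in Table 2 can be collected into a single host-side check. The sketch below is illustrative only: TilingSketch, PlatformSketch, TypeSketch, and SatisfiesTable2 are hypothetical names, not Ascend C types, and the buffer sizes must be supplied by the caller because they depend on the AI processor. The comparison operators follow Table 2 as written, and sub-byte types such as int4_t are not covered.

```cpp
#include <cstdint>

// Hypothetical mirror of the TCubeTiling fields that Table 2 refers to.
struct TilingSketch {
    int usedCoreNum;
    int baseM, baseN, baseK;
    int depthA1, depthB1;
    int stepM, stepN, stepKa, stepKb;
    int dbL0A, dbL0B, dbL0C;
};

// Placeholder hardware description, in bytes; real values depend on the AI processor.
struct PlatformSketch {
    int aiCoreCnt;
    std::int64_t l0aSize, l0bSize, l0cSize, l1Size, biasTSize;
};

// Element sizes in bytes plus C0_size in elements for the input type.
struct TypeSketch {
    int aBytes, bBytes, biasBytes, c0Size;
};

// Returns true when every Table 2 constraint holds.
// dbA1 and dbB1 are the L1 double-buffering factors called "db" in the table.
bool SatisfiesTable2(const TilingSketch& t, const PlatformSketch& p,
                     const TypeSketch& ty, int dbA1, int dbB1) {
    const std::int64_t aBlock = std::int64_t{t.baseM} * t.baseK * ty.aBytes;
    const std::int64_t bBlock = std::int64_t{t.baseN} * t.baseK * ty.bBytes;
    const std::int64_t cBlock = std::int64_t{t.baseM} * t.baseN * 4;  // sizeof(int32_t)
    const bool dbValid = (dbA1 == 1 || dbA1 == 2) && (dbB1 == 1 || dbB1 == 2);
    return dbValid
        && t.usedCoreNum <= p.aiCoreCnt
        && aBlock * t.dbL0A < p.l0aSize
        && bBlock * t.dbL0B < p.l0bSize
        && cBlock * t.dbL0C < p.l0cSize
        && std::int64_t{t.baseN} * ty.biasBytes < p.biasTSize
        && t.stepM * t.stepKa * dbA1 == t.depthA1
        && t.stepN * t.stepKb * dbB1 == t.depthB1
        && aBlock * t.depthA1 + bBlock * t.depthB1 <= p.l1Size
        && t.baseM % 16 == 0 && t.baseN % 16 == 0 && t.baseK % ty.c0Size == 0;
}
```

A check like this is most useful after hand-editing a tiling obtained from GetTiling, since a single changed field (for example depthA1) can silently break one of the equalities.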

    Table 3 Additional constraints for the MDL template

    Constraint

    Description

    When data in the Ka direction is not fully loaded, that is, Ka/baseK > stepKa, stepM = 1.

    When data in the K direction is not fully loaded, data in the M direction must be moved block-wise.

    When data in the Kb direction is not fully loaded, that is, Kb/baseK > stepKb, stepN = 1

    When data in the K direction is not fully loaded, data in the N direction must be moved block-wise.

    kaStepIter_ % kbStepIter_ = 0 or kbStepIter_ % kaStepIter_ = 0

    kaStepIter_ = CeilDiv(tiling_->singleCoreK_, tiling_->baseK * tiling_->stepKa)

    kbStepIter_ = CeilDiv(tiling_->singleCoreK_, tiling_->baseK * tiling_->stepKb)

    For the K-direction cyclic movement in the MDL template, the numbers of iterations in the Ka and Kb directions must be multiples of each other.

    kaStepIter_: number of cyclic movement iterations in the Ka direction

    kbStepIter_: number of cyclic movement iterations in the Kb direction
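The additional MDL constraints in Table 3 can be sketched in the same way. MdlTilingSketch and SatisfiesTable3 are hypothetical names used only for illustration. The sketch assumes that Ka/baseK and Kb/baseK are integer divisions, as the table writes them, and that CeilDiv rounds up for positive integers.

```cpp
// Ceiling division for positive integers, mirroring CeilDiv in Table 3.
constexpr int CeilDiv(int a, int b) { return (a + b - 1) / b; }

// Hypothetical subset of the tiling fields that Table 3 refers to.
struct MdlTilingSketch {
    int Ka, Kb, singleCoreK, baseK;
    int stepM, stepN, stepKa, stepKb;
};

// Returns true when the additional MDL-template constraints hold.
bool SatisfiesTable3(const MdlTilingSketch& t) {
    // K not fully loaded for matrix A: M must be moved one base block at a time.
    if (t.Ka / t.baseK > t.stepKa && t.stepM != 1) return false;
    // K not fully loaded for matrix B: N must be moved one base block at a time.
    if (t.Kb / t.baseK > t.stepKb && t.stepN != 1) return false;
    // Iteration counts in the Ka and Kb directions must be multiples of each other.
    const int kaStepIter = CeilDiv(t.singleCoreK, t.baseK * t.stepKa);
    const int kbStepIter = CeilDiv(t.singleCoreK, t.baseK * t.stepKb);
    return kaStepIter % kbStepIter == 0 || kbStepIter % kaStepIter == 0;
}
```

With singleCoreK = 768 and baseK = 64, for instance, stepKa = 4 and stepKb = 6 give 3 and 2 iterations, which are not multiples of each other, so the check rejects that combination.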

  • Recommended values for performance tuning

    Based on the tiling tuning experience, the recommended values or example values for some TCubeTiling parameters are as follows:

    • Recommended base block (baseM, baseN, baseK): (128, 256, 64)
    • dbL0A/dbL0B = 2
    • depthA1/(stepM x stepKa) = 2
    • depthB1/(stepN x stepKb) = 2
    • Set stepKa and stepKb first so that data in the K direction is fully loaded, and only then increase the amount of data loaded in the M or N direction.
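The recommendations above can be combined into a starting point for manual tuning. The sketch below is an illustration under stated assumptions, not a tuned result or an Ascend C API: TuningSketch, RecommendedStart, and L1Bytes are hypothetical names, full loading in the K direction is modelled as stepKa = stepKb = CeilDiv(singleCoreK, baseK), and stepM = stepN = 1 is assumed as the initial value. L1Bytes evaluates the left-hand side of the L1 constraint so that the result can be compared with the L1 size of the target processor.

```cpp
#include <cstdint>

// Ceiling division for positive integers.
constexpr int CeilDiv(int a, int b) { return (a + b - 1) / b; }

// Hypothetical subset of the tiling fields touched by the recommendations.
struct TuningSketch {
    int baseM, baseN, baseK;
    int stepM, stepN, stepKa, stepKb;
    int depthA1, depthB1;
    int dbL0A, dbL0B;
};

// Builds a starting configuration from the recommended values.
inline TuningSketch RecommendedStart(int singleCoreK) {
    TuningSketch t{};
    t.baseM = 128;  // recommended base block (128, 256, 64)
    t.baseN = 256;
    t.baseK = 64;
    t.dbL0A = 2;    // double buffering enabled for L0A and L0B
    t.dbL0B = 2;
    // Assumption: load the whole K direction first, one base block in M and N.
    t.stepKa = CeilDiv(singleCoreK, t.baseK);
    t.stepKb = t.stepKa;
    t.stepM = 1;
    t.stepN = 1;
    // depthA1/(stepM x stepKa) = 2 and depthB1/(stepN x stepKb) = 2.
    t.depthA1 = 2 * t.stepM * t.stepKa;
    t.depthB1 = 2 * t.stepN * t.stepKb;
    return t;
}

// L1 bytes required by matrices A and B for the given element sizes in bytes.
inline std::int64_t L1Bytes(const TuningSketch& t, int aBytes, int bBytes) {
    return std::int64_t{t.baseM} * t.baseK * t.depthA1 * aBytes
         + std::int64_t{t.baseN} * t.baseK * t.depthB1 * bBytes;
}
```

If L1Bytes exceeds the available L1 size, reducing stepKa and stepKb is the natural next step, keeping in mind that the MDL template then requires stepM = 1 and stepN = 1.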