GetMMConfig

Applicability

Product	Supported
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
Atlas inference product's AI Core	√
Atlas inference product's Vector Core	x
Atlas training products	x

Function

Flexibly customizes Matmul template parameters. You can set MatmulConfigMode, MatmulShapeParams, MatmulQuantParams, MatmulBatchParams, and MatmulFuncParams to obtain a custom MatmulConfig template.

MatmulConfigMode specifies which MatmulConfig template to obtain and modify. For details about each template, see Table 1. You modify the parameters of the selected template by passing one or more of the variable parameters MatmulShapeParams, MatmulQuantParams, MatmulBatchParams, and MatmulFuncParams, in any order. Compared with the GetNormalConfig and GetMDLConfig APIs, which also obtain templates, this API provides a more flexible way to customize Matmul template parameters.

Prototype

      
template <MatmulConfigMode configMode, typename... ArgTypes>
__aicore__ inline constexpr MatmulConfig GetMMConfig(ArgTypes&&... args)

Parameters

**Table 1** Template parameters
Parameter	Description
configMode	MatmulConfig template to obtain. For the available values, see Table 3.
ArgTypes	Variadic template parameter pack.

**Table 2** Parameters
Parameter	Input/Output	Description
args	Input	Variadic arguments. Pass one or more of MatmulShapeParams, MatmulQuantParams, MatmulBatchParams, and MatmulFuncParams as needed, in any order.
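
The "one or more, in any order" behavior of args can be pictured with a standalone C++17 sketch. ShapeParams, FuncParams, Config, Apply, and MakeConfig below are hypothetical stand-ins written for this illustration; they are not the Ascend C definitions, and the actual GetMMConfig implementation may work differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for illustration only; not the Ascend C definitions.
struct ShapeParams { uint32_t singleCoreM, singleCoreN, singleCoreK; };
struct FuncParams  { bool intrinsicsLimit, enVecND2NZ; };
struct Config {
    uint32_t singleCoreM = 0, singleCoreN = 0, singleCoreK = 0;
    bool intrinsicsLimit = false, enVecND2NZ = false;
};

// One overload per parameter group; each copies its fields into the config.
constexpr void Apply(Config& cfg, const ShapeParams& p) {
    cfg.singleCoreM = p.singleCoreM;
    cfg.singleCoreN = p.singleCoreN;
    cfg.singleCoreK = p.singleCoreK;
}
constexpr void Apply(Config& cfg, const FuncParams& p) {
    cfg.intrinsicsLimit = p.intrinsicsLimit;
    cfg.enVecND2NZ = p.enVecND2NZ;
}

// Start from a default config, then apply each argument in call order.
template <typename... ArgTypes>
constexpr Config MakeConfig(ArgTypes&&... args) {
    Config cfg{};
    (Apply(cfg, args), ...);  // C++17 fold over the comma operator
    return cfg;
}

constexpr ShapeParams kShape{128, 128, 128};
constexpr FuncParams kFunc{false, true};
constexpr Config kA = MakeConfig(kShape, kFunc);
constexpr Config kB = MakeConfig(kFunc, kShape);  // same groups, different order
static_assert(kA.singleCoreM == kB.singleCoreM && kA.enVecND2NZ == kB.enVecND2NZ);
```

Because each parameter group is a distinct type, overload resolution picks the matching Apply regardless of argument position, and groups that are not passed keep their defaults.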

**Table 3** MatmulConfigMode parameters
Parameter	Description
CONFIG_NORM	Sets MatmulConfig to the Norm template by default.
CONFIG_MDL	Sets MatmulConfig to the MDL template by default.
CONFIG_SPECIALMDL	Sets MatmulConfig to the SpecialMDL template by default.
CONFIG_IBSHARE	Sets MatmulConfig to the IBShare template by default.

**Table 4** MatmulShapeParams parameters
Parameter	Data Type	Description
singleCoreM	uint32_t	Shape size of a single core on the M axis, in elements.
singleCoreN	uint32_t	Shape size of a single core on the N axis, in elements.
singleCoreK	uint32_t	Shape size of a single core on the K axis, in elements.
basicM	uint32_t	Equivalent to the baseM parameter in the TCubeTiling structure. It indicates the length of the M axis of a base block during Matmul computation. The unit is element.
basicN	uint32_t	Equivalent to the baseN parameter in the TCubeTiling structure. It indicates the length of the N axis of a base block during Matmul computation. The unit is element.
basicK	uint32_t	Equivalent to the baseK parameter in the TCubeTiling structure. It indicates the length of the K axis of a base block during Matmul computation. The unit is element.

**Table 5** MatmulQuantParams parameters
Parameter	Data Type	Description
isPerTensor	bool	Whether quantization for matrix B is performed per tensor or per channel when matrix A's input type is half and matrix B's input type is int8_t. Values: true: per-tensor quantization; false: per-channel quantization.
hasAntiQuantOffset	bool	Whether to use the offset coefficient when matrix B quantization is enabled in the scenario where matrix A's input type is half and matrix B's input type is int8_t.

**Table 6** MatmulBatchParams parameters
Parameter	Data Type	Description
isNBatch	bool	Whether to enable multi-batch input and output. This parameter is valid only for BatchMatmul. After this parameter is enabled, only the Norm template is supported, and IterateNBatch needs to be called to implement multi-batch input and output. Values: false (default): disables the multi-batch function. true: enables the multi-batch function.
batchMode	BatchMode	Relationship between the total amount of multi-batch data for input matrices A and B and the size of L1 Buffer in the BatchMatmul scenario when the layout type is set to Normal. Values: BatchMode::BATCH_LESS_THAN_L1: total amount of multi-batch data < size of L1 Buffer; BatchMode::BATCH_LARGE_THAN_L1: total amount of multi-batch data > size of L1 Buffer; BatchMode::SINGLE_LARGE_THAN_L1: total amount of single-batch data > size of L1 Buffer.
isBiasBatch	bool	Whether the bias size includes the batch axis in the BatchMatmul scenario. Values: true (default): the bias size includes the batch axis and is Batch × N; false: the bias size does not include the batch axis and is N, and the bias is reused in the BatchMatmul computation. Note: In the BatchMode::SINGLE_LARGE_THAN_L1 scenario, this parameter can only be set to true. For Atlas A2 training products / Atlas A2 inference products, this parameter is supported. For Atlas A3 training products / Atlas A3 inference products, this parameter is supported. For Atlas inference product's AI Core, this parameter cannot be set to false. For Atlas 200I/500 A2 inference products, this parameter cannot be set to false.
bmmOutMode	BatchOutMode	Reserved parameter
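
The three batchMode values in Table 6 can be read as a simple decision on data sizes. The sketch below is a host-side illustration only: BatchModeSketch and ChooseBatchMode are hypothetical names, the byte counts are supplied by the caller, and the handling of the exact-equality case is an assumption because the table only defines strict inequalities.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the BatchMode values in Table 6; illustration only.
enum class BatchModeSketch {
    BATCH_LESS_THAN_L1,
    BATCH_LARGE_THAN_L1,
    SINGLE_LARGE_THAN_L1
};

// singleBatchBytes: bytes of matrices A and B for one batch.
// batchNum: number of batches. l1Bytes: L1 Buffer size in bytes.
constexpr BatchModeSketch ChooseBatchMode(uint64_t singleBatchBytes,
                                          uint64_t batchNum,
                                          uint64_t l1Bytes) {
    if (singleBatchBytes > l1Bytes) {
        return BatchModeSketch::SINGLE_LARGE_THAN_L1;
    }
    if (singleBatchBytes * batchNum > l1Bytes) {
        return BatchModeSketch::BATCH_LARGE_THAN_L1;
    }
    // Assumption: totals that exactly equal the L1 size are treated as fitting.
    return BatchModeSketch::BATCH_LESS_THAN_L1;
}

// Example value only; the real L1 Buffer size depends on the product.
constexpr uint64_t kExampleL1Bytes = 512 * 1024;
static_assert(ChooseBatchMode(64 * 1024, 4, kExampleL1Bytes) ==
              BatchModeSketch::BATCH_LESS_THAN_L1);
```

Per the note on isBiasBatch, a result of SINGLE_LARGE_THAN_L1 also means isBiasBatch must stay true.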

**Table 7** MatmulFuncParams parameters
Parameter	Data Type	Description

intrinsicsLimit

bool

Whether to enable cyclic data move-in from the Global Memory to L1 Buffer when the inner axis (last axis) of the left or right matrix on a single core is greater than or equal to 65535 elements. For example, for the left matrix A [M, K], if singleCoreK (the inner axis on a single core) is greater than 65535 and this parameter is set to true, the API moves the data in cyclically. Values:

false (default): When the inner axis of the left or right matrix on a single core is greater than or equal to 65535, data is not moved in cyclically.
true: When the inner axis of the left or right matrix on a single core is greater than or equal to 65535, data is moved in cyclically.

enVecND2NZ

bool

Whether to enable ND2NZ (converting data from ND format to NZ format) using vector. To enable this function, you must call SetLocalWorkspace. Values:

false (default): disables ND2NZ using the vector.
true: enables ND2NZ using the vector.

For Atlas inference product's AI Core, when the Unified Buffer space is sufficient (greater than twice the value of transLength of TCubeTiling), you are advised to enable this parameter for better data movement performance.

enableDoubleCache

bool

Whether to cache two blocks in L1 Buffer after the IBShare template is enabled. Values:

false (default): caches one block in L1 Buffer.
true: caches two blocks in L1 Buffer.

Note: If this parameter is set to true, the base block size must be controlled to ensure that the cached data blocks do not exceed the L1 Buffer capacity.

enableL1CacheUB

bool

Whether to cache Unified Buffer computing blocks in L1 Buffer. It is recommended that this parameter be used in scenarios where the MTE3 and MTE2 pipelines are frequently used in serial mode. Values:

true: caches Unified Buffer computing blocks in L1 Buffer.
false: does not cache Unified Buffer computing blocks in L1 Buffer.

To cache Unified Buffer computing blocks in L1 Buffer, you must call SetMatmulConfigParams in the tiling implementation to set enableL1CacheUBIn to true.

For Atlas A3 training products / Atlas A3 inference products, this parameter is not supported.

For Atlas A2 training products / Atlas A2 inference products, this parameter is not supported.

For Atlas inference product's AI Core, this parameter is supported.

For Atlas 200I/500 A2 inference products, this parameter is not supported.

doMTE2Preload

uint32_t

Whether to enable preloading in the M or N direction. Preloading helps when the MTE2 pipeline gap is large and the M or N value is large: once enabled, it reduces the MTE2 pipeline gap and improves performance. Preloading is valid only for the MDL template. Values:

0 (default): disables the function.
1: enables preloading in the M direction.
2: enables preloading in the N direction.

Note: When preloading in the M/N direction is enabled, ensure that the data is fully loaded in the K direction and DoubleBuffer is enabled in the M/N direction. The condition for full load in the M direction is that singleCoreK/baseK is less than or equal to stepKa, and that in the N direction is that singleCoreK/baseK is less than or equal to stepKb.

For details about how to use this parameter, see Matmul operator sample for preloading in the M and N directions.
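
The full-load condition in the note above can be checked before doMTE2Preload is set. The following is a minimal sketch, not part of the Matmul API: PreloadConditionHolds and CeilDiv are hypothetical helpers, and rounding singleCoreK/baseK up is an assumption (the text does not say how a non-divisible K is handled). The DoubleBuffer requirement is not checked here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration; not part of the Matmul API.
constexpr uint32_t CeilDiv(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

// doMTE2Preload: 0 = disabled, 1 = preload in M direction, 2 = preload in N direction.
// Returns true when the K-direction full-load condition for that mode is met.
constexpr bool PreloadConditionHolds(uint32_t doMTE2Preload, uint32_t singleCoreK,
                                     uint32_t baseK, uint32_t stepKa, uint32_t stepKb) {
    // Assumption: a partial trailing K block counts as one more iteration.
    const uint32_t kIters = CeilDiv(singleCoreK, baseK);
    if (doMTE2Preload == 1) {
        return kIters <= stepKa;  // full load in K for M-direction preloading
    }
    if (doMTE2Preload == 2) {
        return kIters <= stepKb;  // full load in K for N-direction preloading
    }
    return doMTE2Preload == 0;    // disabled is always acceptable; other values are not
}

static_assert(PreloadConditionHolds(1, 256, 64, 4, 2));
static_assert(!PreloadConditionHolds(2, 256, 64, 4, 2));
```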

iterateOrder

IterateOrder

Iteration sequence for Matmul to perform cube computation. The meaning of this parameter is the same as that of iterateOrder in Table 1. This parameter is valid only when ScheduleType is set to ScheduleType::OUTER_PRODUCT. Values:

          
enum class IterateOrder {
    ORDER_M = 0,   // Offset along the M axis first, then along the N axis.
    ORDER_N,       // Offset along the N axis first, then along the M axis.
    UNDEF,         // Currently invalid.
};

Note: When the Norm template (Matmul scenario) and the MDL template are used, if IterateOrder is set to ORDER_M, the value of stepN in the TCubeTiling structure must be greater than 1. If IterateOrder is set to ORDER_N, the value of stepM in the TCubeTiling structure must be greater than 1.

For details about how to use this parameter, see Matmul operator sample for pipeline parallelism in the M and N directions.

For Atlas A3 training products / Atlas A3 inference products, this parameter is supported.

For Atlas A2 training products / Atlas A2 inference products, this parameter is supported.

For Atlas inference product's AI Core, this parameter is not supported.

For Atlas 200I/500 A2 inference products, this parameter is not supported.
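
The stepM/stepN requirement stated in the iterateOrder note can be expressed as a small predicate. This is a hedged sketch: IterateOrderSketch mirrors the enum shown above so that the block compiles on its own, and StepsAllowOrder is a hypothetical helper, not an Ascend C function.

```cpp
#include <cassert>
#include <cstdint>

// Local mirror of the IterateOrder enum so this sketch is self-contained.
enum class IterateOrderSketch { ORDER_M = 0, ORDER_N, UNDEF };

// Checks the TCubeTiling step constraint for the Norm (Matmul scenario) and MDL templates:
// ORDER_M requires stepN > 1, ORDER_N requires stepM > 1.
constexpr bool StepsAllowOrder(IterateOrderSketch order, uint32_t stepM, uint32_t stepN) {
    switch (order) {
        case IterateOrderSketch::ORDER_M:
            return stepN > 1;
        case IterateOrderSketch::ORDER_N:
            return stepM > 1;
        default:
            return false;  // UNDEF is currently invalid
    }
}

static_assert(StepsAllowOrder(IterateOrderSketch::ORDER_M, 1, 2));
static_assert(!StepsAllowOrder(IterateOrderSketch::ORDER_N, 1, 2));
```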

scheduleType

ScheduleType

Matmul data movement mode. Values:

ScheduleType::INNER_PRODUCT (default): performs MTE1 cyclic movement in the K direction.
ScheduleType::OUTER_PRODUCT: performs MTE1 cyclic movement in the M or N direction. After being enabled, this parameter must be used together with IterateOrder.
Its configuration takes effect only in the Norm template (BatchMatmul and Matmul scenarios) and the MDL template.
- If the value of IterateOrder is set to ORDER_M, cyclic movement is performed in the N direction, that is, data in matrix B is moved in parallel using MTE1. (The performance may be improved when the value of singleCoreN is greater than that of baseN.)
- If the value of IterateOrder is set to ORDER_N, cyclic movement is performed in the M direction, that is, data in matrix A is moved in parallel using MTE1. (The performance may be improved when the value of singleCoreM is greater than that of baseM.)
- The cyclic movement in the M direction and N direction cannot be enabled at the same time.

Note:

In the Norm template (BatchMatmul scenario) or the MDL template, when singleCoreK is greater than baseK, ScheduleType::OUTER_PRODUCT cannot be enabled and the default mode must be used.
In the Matmul scenario of the Norm or MDL template, ScheduleType::OUTER_PRODUCT can be configured only in CUBE_ONLY mode (with only Cube computation).
This parameter can be set to ScheduleType::OUTER_PRODUCT only when the MDL template calls IterateAll for computation.
This parameter can be set to ScheduleType::OUTER_PRODUCT only when matrix C is output to GM.

For Atlas A3 training products / Atlas A3 inference products, this parameter is supported.

For Atlas A2 training products / Atlas A2 inference products, this parameter is supported.

For Atlas inference product's AI Core, this parameter is not supported.

For Atlas 200I/500 A2 inference products, this parameter is not supported.

enableReuse

bool

Whether the dataPtr set by SetSelfDefineData directly passes the computation data. If SetSelfDefineData is not called to set dataPtr, this parameter can only keep its default value, true. Values:

true: passes computation data. Only a single value is supported.
false: passes data address information stored on GM.

enableUBReuse

bool

Whether to enable Unified Buffer reuse. When the Unified Buffer has sufficient capacity (its size is greater than four times the value of transLength of TCubeTiling), enabling this parameter divides the Unified Buffer into two non-overlapping regions. These two regions store the data for two consecutive Matmul iterations. With Unified Buffer reuse enabled, the data of the next iteration can be loaded into the second region. It no longer needs to wait for the previous iteration's Unified Buffer region to be released. This optimizes pipeline and improves overall performance. Values:

true: enables Unified Buffer reuse.
false: disables Unified Buffer reuse.

For Atlas A3 training products / Atlas A3 inference products, this parameter is not supported.

For Atlas A2 training products / Atlas A2 inference products, this parameter is not supported.

For Atlas inference product's AI Core, this parameter is supported.

For Atlas 200I/500 A2 inference products, this parameter is not supported.
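
The capacity condition for enableUBReuse, and the idea of alternating between two non-overlapping regions, can be sketched as follows. Both helpers are hypothetical, and the layout (two equal halves, selected by iteration parity) is an assumption made for illustration; the description above does not specify how the regions are actually placed.

```cpp
#include <cassert>
#include <cstdint>

// Capacity condition from the description: UB size must exceed 4 x transLength.
// Assumption: ubBytes and transLength are expressed in the same unit.
constexpr bool UbReuseFits(uint64_t ubBytes, uint64_t transLength) {
    return ubBytes > 4 * transLength;
}

// Hypothetical layout: even iterations use the first half of the Unified Buffer,
// odd iterations use the second half, so consecutive iterations never overlap.
constexpr uint64_t RegionOffset(uint64_t iteration, uint64_t ubBytes) {
    return (iteration % 2) * (ubBytes / 2);
}

// Example sizes only; real Unified Buffer sizes depend on the product.
static_assert(UbReuseFits(256 * 1024, 32 * 1024));
static_assert(!UbReuseFits(128 * 1024, 32 * 1024));  // exactly 4x does not qualify
```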

isPartialOutput

bool

Whether to enable the PartialOutput function. This parameter controls how Matmul computes and outputs base blocks along the K axis. In other words, this parameter determines whether to accumulate the partial results along the K axis when Matmul runs one Iterate step. Values:

true: enables the PartialOutput function. The K-axis partial results computed in a single Iterate computation are not accumulated. Each Matmul iteration outputs a local matrix fragment of size baseM × baseN that corresponds to the current baseK slice.
false: disables the PartialOutput function. The K-axis partial results computed in a single Iterate computation are accumulated. Each Matmul iteration outputs a matrix fragment of size baseM × baseN that corresponds to the whole singleCoreK range.

For Atlas A3 training products / Atlas A3 inference products, this parameter is supported.

For Atlas A2 training products / Atlas A2 inference products, this parameter is supported.

For Atlas inference product's AI Core, this parameter is not supported.

For Atlas 200I/500 A2 inference products, this parameter is not supported.
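
The difference between the two isPartialOutput settings can be illustrated on the host with ordinary C++. The sketch below does not use the Matmul API: the sizes, MulSlice, PartialOutputs, and AccumulatedOutput are all hypothetical and exist only to show that the per-baseK fragments, when summed, give the accumulated baseM × baseN result.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Tiny example sizes: one baseM x baseN block, with K split into baseK slices.
constexpr std::size_t kM = 2, kN = 2, kK = 4, kBaseK = 2;
using Block = std::array<float, kM * kN>;

// Multiplies A[:, k0 : k0 + kBaseK] by B[k0 : k0 + kBaseK, :] (row-major inputs).
Block MulSlice(const float* a, const float* b, std::size_t k0) {
    Block c{};
    for (std::size_t i = 0; i < kM; ++i) {
        for (std::size_t j = 0; j < kN; ++j) {
            for (std::size_t k = k0; k < k0 + kBaseK; ++k) {
                c[i * kN + j] += a[i * kK + k] * b[k * kN + j];
            }
        }
    }
    return c;
}

// Analogue of isPartialOutput = true: one fragment per baseK slice, not accumulated.
std::vector<Block> PartialOutputs(const float* a, const float* b) {
    std::vector<Block> fragments;
    for (std::size_t k0 = 0; k0 < kK; k0 += kBaseK) {
        fragments.push_back(MulSlice(a, b, k0));
    }
    return fragments;
}

// Analogue of isPartialOutput = false: the slices are accumulated into one fragment.
Block AccumulatedOutput(const float* a, const float* b) {
    Block acc{};
    for (const Block& part : PartialOutputs(a, b)) {
        for (std::size_t idx = 0; idx < acc.size(); ++idx) {
            acc[idx] += part[idx];
        }
    }
    return acc;
}
```

With partial output, the caller receives kK / kBaseK fragments per block and is responsible for combining them; without it, a single accumulated fragment is produced.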

isA2B2Shared

bool

Whether to enable the global management of A2 and B2, that is, whether all Matmul objects share the double buffering mechanism of A2 and B2. As this is a global configuration, the parameter values for all Matmul objects must be the same. When it is enabled, the base block sizes of matrix A and matrix B cannot exceed 32 KB.

Values:

true: enabled
false (default): disabled

For Atlas A3 training products / Atlas A3 inference products, this parameter is supported.

For Atlas A2 training products / Atlas A2 inference products, this parameter is supported.

For Atlas inference product's AI Core, this parameter is not supported.

When this parameter is set to true, you are advised to also set enUnitFlag to true so that data transfer and computation run in parallel in the pipeline, improving performance. For a usage example, see the global management sample of Matmul A2 and B2.
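
The 32 KB limit on base block sizes can be verified from the base block dimensions before isA2B2Shared is enabled. This is a minimal sketch under stated assumptions: BaseBlocksFitA2B2Shared is a hypothetical helper, the base block of matrix A is taken as baseM × baseK elements and that of matrix B as baseK × baseN elements, and element sizes are passed in bytes.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kA2B2SharedLimitBytes = 32 * 1024;  // 32 KB limit from the description

// Hypothetical helper: true when both base blocks stay within the 32 KB limit.
constexpr bool BaseBlocksFitA2B2Shared(uint32_t baseM, uint32_t baseN, uint32_t baseK,
                                       uint32_t aElemBytes, uint32_t bElemBytes) {
    const uint64_t aBlockBytes = static_cast<uint64_t>(baseM) * baseK * aElemBytes;
    const uint64_t bBlockBytes = static_cast<uint64_t>(baseK) * baseN * bElemBytes;
    return aBlockBytes <= kA2B2SharedLimitBytes && bBlockBytes <= kA2B2SharedLimitBytes;
}

// half inputs (2 bytes): A block = 128 * 64 * 2 = 16 KB, B block = 64 * 128 * 2 = 16 KB.
static_assert(BaseBlocksFitA2B2Shared(128, 128, 64, 2, 2));
// B block = 128 * 256 * 2 = 64 KB, which exceeds the limit.
static_assert(!BaseBlocksFitA2B2Shared(128, 256, 128, 2, 2));
```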

isEnableChannelSplit

bool

Whether to enable the channel_split function. In normal cases, the fractal size of the matrix C in CubeFormat::NZ format computed by Matmul is 16 × 16. Assume that the number of fractals is x. The channel_split function is used to obtain the fractal size of matrix C as 16 × 8, and the number of fractals changes to 2x. Note that this parameter can be enabled only when the format of matrix C computed by Matmul is CubeFormat::NZ, the type is float, and the output is to the global memory. Values:

false (default): The channel_split function is disabled, and the output fractal size is 16 × 16.
true: The channel_split function is enabled, and the output fractal size is 16 × 8.

For Atlas A3 training products / Atlas A3 inference products, this parameter is supported.

For Atlas A2 training products / Atlas A2 inference products, this parameter is supported.

For Atlas inference product's AI Core, this parameter is not supported.

For Atlas 200I/500 A2 inference products, this parameter is not supported.
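
The change from x fractals to 2x fractals under channel_split can be checked with a short calculation. The helper below is hypothetical, and it assumes that the 16-element side of each fractal lies along M while the side that shrinks from 16 to 8 lies along N; the description above gives only the fractal sizes, not their orientation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper for illustration; not part of the Matmul API.
constexpr uint32_t CeilDiv(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

// Number of NZ fractals needed to cover an m x n matrix C.
// Assumption: fractals are 16 (along M) x 16 or 8 (along N).
constexpr uint32_t FractalCount(uint32_t m, uint32_t n, bool channelSplit) {
    const uint32_t fractalCols = channelSplit ? 8 : 16;
    return CeilDiv(m, 16) * CeilDiv(n, fractalCols);
}

// A 32 x 32 float matrix C: 4 fractals of 16 x 16, or 8 fractals of 16 x 8.
static_assert(FractalCount(32, 32, false) == 4);
static_assert(FractalCount(32, 32, true) == 8);
```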

enableKdimReorderLoad

bool

Whether to enable staggered loading of data on the K axis. During Matmul computation based on the same tiling parameters, if the left or right matrices of multiple cores are the same and stored in the global memory, multiple cores may access the same address at the same time to load matrix data, causing access conflicts and affecting performance. After this parameter is enabled, during multi-core Matmul computation, the multiple cores try to access different global memory addresses at the same time to reduce the probability of address access conflicts and improve performance. This parameter is supported only for the MDL template. You are advised to enable this parameter when the K axis is large and the left and right matrices are not fully loaded. For details, see operator sample for staggered data loading along the K axis. Values:

false (default): disables the staggered data loading function on the K axis.
true: enables the staggered data loading function on the K axis.

For Atlas A3 training products / Atlas A3 inference products, this parameter is supported.

For Atlas A2 training products / Atlas A2 inference products, this parameter is supported.

For Atlas inference product's AI Core, this parameter is not supported.

For Atlas 200I/500 A2 inference products, this parameter is not supported.

Returns

MatmulConfig structure

Restrictions

None

Example

      
// Start from the Norm template.
constexpr static MatmulConfigMode configMode = MatmulConfigMode::CONFIG_NORM;
// singleCoreM, singleCoreN, singleCoreK, basicM, basicN, basicK
constexpr static MatmulShapeParams shapeParams = {128, 128, 128, 64, 64, 64};
// isPerTensor = false (per-channel quantization for matrix B), hasAntiQuantOffset = false.
constexpr static MatmulQuantParams quantParams = {false, false};
// isNBatch = false: disable multi-batch input and output.
constexpr static MatmulBatchParams batchParams{false};
// intrinsicsLimit = false (no cyclic move-in), enVecND2NZ = true (ND2NZ using vector).
constexpr static MatmulFuncParams funcParams{false, true};
// Build the customized MatmulConfig at compile time.
constexpr static MatmulConfig mmConfig = GetMMConfig<configMode>(shapeParams, quantParams, batchParams, funcParams);
