MatmulConfig
Configures the Matmul template and related parameters. If the MatmulConfig template parameter is not set, the Norm template is enabled by default. For details, see Table 2. MatmulConfig can be defined in the following ways:
- It can be set to one of the preset values: CFG_NORM, CFG_MDL, CFG_IBSHARE_NORM, or MM_CFG_BB, which correspond to the default Norm, MDL, IBShare, and BasicBlock templates, respectively. For details about each template, see Table 1.
- It can be customized by calling one of the template-getter APIs: GetNormalConfig, GetMDLConfig, GetSpecialMDLConfig, GetIBShareNormConfig, GetBasicConfig, or GetSpecialBasicConfig.
- It can also be assembled from level-2 sub-configs (MatmulShapeParams, MatmulQuantParams, MatmulBatchParams, and MatmulFuncParams): call the GetMMConfig API with the required level-2 sub-configs and a MatmulConfigMode to flexibly obtain a custom MatmulConfig.
Table 1 Matmul templates

| Template | Implementation | Advantages | Recommended Usage |
|---|---|---|---|
| Norm | L1 can cache multiple base blocks. MTE2 moves base blocks from GM to L1 multiple times, one base block per move, and the moved blocks are not cleared. For example, if depthA1 is set to 6, six base blocks of matrix A are moved to L1 one at a time, so MTE2 performs six moves. | The MTE1 pipeline can start early, because MTE1 can begin its subsequent computation as soon as one base block has been moved. | Enabled by default. |
| MDL, SpecialMDL | L1 can cache multiple base blocks. MTE2 moves data from GM to L1 in a single "large-packet" transfer. For example, if depthA1 is set to 6, six base blocks of matrix A are moved to L1 at once, so MTE2 performs one move. | In common large-shape scenarios, this reduces cyclic MTE2 movement and improves performance. | Large-shape scenarios. |
| IBShare | In the MIX scenario, when multiple AIVs share the same GM address for matrix A or matrix B, the L1 Buffer is shared to reduce MTE2 movement. | Reduces MTE2 movement and improves performance. | Multiple AIVs share the same GM address for matrix A or matrix B in the MIX scenario. Note: to use the IBShare template, the matrix A or matrix B reused by multiple AIVs must be fully loaded into the L1 Buffer. |
| BasicBlock | When there is no tail block and the base block size is fixed, the GetBasicConfig API can configure the input base block size, fixing the size of the matrix moved by MTE1 and the size of the matrix computed by MMAD each time, which reduces the parameter computation workload. | Reduces the parameter computation overhead of MTE1 matrix movement and MMAD matrix computation. | There is no tail block, and the base block size (baseM, baseN) is fixed. |
Table 2 MatmulConfig parameters

| Parameter | Description | Supported Templates (Norm, MDL, SpecialMDL, IBShare, or BasicBlock) |
|---|---|---|
| doNorm | Whether to enable the Norm template. If no value is specified, the Norm template is enabled by default. | Norm |
| doBasicBlock | Whether to enable the BasicBlock template. When GetBasicConfig is called to obtain the BasicBlock template, this parameter is set to true. | BasicBlock |
| doMultiDataLoad | Whether to enable the MDL template. | MDL |
| doSpecialMDL | Whether to enable the SpecialMDL template, which is essentially an MDL template. When the MDL template is not fully loaded in the Matmul K direction (singleCoreK/baseK > stepKb), stepN can only be set to 1; if this parameter is true, stepN can be set to 2. | SpecialMDL |
| doIBShareNorm | Whether to enable the IBShare template, which reuses the same matrix A or B data on L1. Once enabled, repeated data movement to L1 is avoided. | IBShare |
| intrinsicsCheck | Whether to enable cyclic data move-in when the inner (last) axis of the left or right matrix on a single core is greater than or equal to 65535. For example, for the left matrix A [M, K], if the inner-axis size singleCoreK exceeds 65535 and this parameter is true, the API moves the data in cyclically. | All templates |
| isNBatch | Whether to enable multi-batch input and output. Valid only for BatchMatmul. | All templates |
| enVecND2NZ | Whether to perform ND2NZ conversion (ND format to NZ format) using the vector unit. To enable this function, SetLocalWorkspace must be set. | All templates |
| enableInit | Whether to enable the Init function. Disabling Init improves constant propagation and can optimize performance. Enabled by default. | All templates |
| batchMode | Relationship between the total multi-batch data volume of input matrices A and B in a BatchMatmul operation and the L1 Buffer size when the layout mode is NORMAL. | Norm, IBShare |
| enUnitFlag | Whether to enable the unitflag function, which allows computation and data movement to run in parallel for better performance. Enabled by default for the Norm and IBShare templates; disabled by default for the MDL template. | MDL, Norm, IBShare |
| isPerTensor | Whether quantization of matrix B is per tensor (true) or per channel (false) when matrix A's input type is half and matrix B's input type is int8. | MDL, SpecialMDL |
| hasAntiQuantOffset | Whether to use the offset coefficient when matrix B quantization is enabled (matrix A input type half, matrix B input type int8). | MDL, SpecialMDL |
| doMTE2Preload | Whether to enable preloading in the M/N direction when the MTE2 pipeline gap and the M/N value are large. Enabling it reduces the MTE2 pipeline gap and improves performance. Valid only for the MDL template. Note: when preloading in the M/N direction is enabled, the data must be fully loaded in the K direction and double buffering must be enabled in the M/N direction. | MDL, SpecialMDL |
| enableReuse | Whether dataPtr in the callback function set by SetSelfDefineData passes computation data directly. | Norm, MDL |
| enableUBReuse | Whether to enable Unified Buffer reuse. | MDL |
| enableL1CacheUB | Whether to cache Unified Buffer computing blocks in the L1 Buffer. To do so, SetMatmulConfigParams must be called in the tiling implementation to configure the related information. | MDL |
| enableDoubleCache | Whether to cache two blocks in the L1 Buffer when the IBShare template is enabled. The base block size must be controlled so that the two blocks do not exceed the L1 Buffer size limit. | IBShare |
| IterateOrder | Iteration order for Matmul matrix computation; same meaning as iterateOrder in Table 1. Valid only when ScheduleType is set to ScheduleType::OUTER_PRODUCT (1). Note: when the Norm template (Matmul scenario) or the MDL template is used, ORDER_M requires stepN > 1 in the TCubeTiling structure, and ORDER_N requires stepM > 1. | Norm, MDL |
| ScheduleType | Matmul data movement mode. | Norm, MDL |
| enableStaticPadZeros | Whether to automatically pad zeros based on the sizes of singleM, singleN, singleK, baseM, baseN, and baseK when static tiling parameters are used and the left and right matrices are moved to the L1 Buffer (see GetMatmulApiTiling for the static tiling parameters). Only ND2NZ-format GM input supports automatic padding; in other scenarios, pad zeros manually. | Norm |
| isBiasBatch | Whether the bias size involves batch axes in the BatchMatmul scenario. | Norm |
| basicM | Equivalent to baseM: length of the M axis of a base block during Matmul computation, in elements. | BasicBlock |
| basicN | Equivalent to baseN: length of the N axis of a base block during Matmul computation, in elements. | BasicBlock |
| basicK | Equivalent to baseK: length of the K axis of a base block during Matmul computation, in elements. | BasicBlock |
| enableSetBias | Whether to compute bias. Can be used to optimize performance. | MDL |
| enableEnd | Whether the End function is called during Matmul computation. Can be used to optimize performance. | All templates |
| enableGetTensorC | Whether the GetTensorC function is called during Matmul computation. Can be used to optimize performance. | All templates |
| enableSetOrgShape | Whether the SetOrgShape function is called during Matmul computation. Can be used to optimize performance. | All templates |
| enableSetTail | Whether the SetTail function is called during Matmul computation. Can be used to optimize performance. | All templates |
| enableQuantVector | Whether the SetQuantVector and SetQuantScalar functions are called during Matmul computation. Can be used to optimize performance. | All templates |
| enableSetDefineData | Whether to enable setting information such as the computation data required by the callback function, or the data address stored on GM, when MatmulCallBack (a custom callback function) is enabled. | MDL |
| iterateMode | Iteration mode, used to reduce Matmul computation overhead on the separated architecture through the Iterate APIs (Iterate, IterateAll, IterateBatch, and IterateNBatch). When a mode is enabled, only the Iterate API corresponding to that mode is called during Matmul computation, and the code for the other Iterate APIs is removed at compile time to optimize performance. This parameter is of the IterateMode type. | All templates |
| intraBlockPartSum | Whether to accumulate the single compute results (matrix slices of size baseM x baseN) of two AIV cores on the L0C Buffer, in the case of fused vector and cube computation on the separated architecture. | Norm |
| isPartialOutput | Whether to enable the PartialOutput function, which controls the base-block computation mode of the sequential Matmul output in the K direction; that is, whether a single Iterate computation reduces over the K axis. | MDL |
| doSpecialBasicBlock | Whether to enable the SpecialBasicBlock template, a BasicBlock variant that eliminates scalar computation overhead. | Reserved |
| singleCoreM | Shape size of a single core on the M axis, in elements. | Reserved |
| singleCoreN | Shape size of a single core on the N axis, in elements. | Reserved |
| singleCoreK | Shape size of a single core on the K axis, in elements. | Reserved |
| stepM | A multiple of baseM, specifying how much of the left matrix is buffered in A1 along the M direction. | Reserved |
| stepN | A multiple of baseN, specifying how much of the right matrix is buffered in B1 along the N direction. | Reserved |
| baseMN | Size of baseM × baseN. | Reserved |
| singleCoreMN | Size of singleCoreM × singleCoreN. | Reserved |