Mmad
Supported Products
| Product | Supported/Unsupported (prototype without bias input) | Supported/Unsupported (prototype with bias input) |
|---|---|---|
| | √ | √ |
| | √ | √ |
| | √ | √ |
| | √ | x |
| | x | x |
| | √ | x |
Function Usage
Performs the matrix multiplication and addition (C += A * B) operation. The matrices A, B, and C are data in A2, B2, and CO1, respectively.
- The data formats of matrices A, B, and C are ZZ, ZN, and NZ, respectively.
In the following figure, each square represents a fractal matrix. The black line in the Z shape represents the data arrangement sequence, which starts in the upper left corner and ends in the lower right corner.
Matrix A: The row-major order is used in each fractal matrix and between fractal matrices. This is called ZZ format. The fractal shape is 16 x (32B/sizeof(AType)), and the size is 512 bytes.
Matrix B: The column-major order is used in each fractal matrix, while the row-major order is used between fractal matrices. This is called ZN format. The fractal shape is (32B/sizeof(BType)) x 16, and the size is 512 bytes.
Matrix C: The row-major order is used in each fractal matrix, while the column-major order is used between fractal matrices. This is called NZ format. The fractal shape is 16 x 16, and the size is 256 elements.
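The fractal shapes above follow directly from the element size. The following is a minimal sketch in standard C++ that computes them. The names FractalShape, FractalShapeA, FractalShapeB, and FractalBytes are hypothetical helpers for illustration and are not part of the Ascend C API. Because half is not a standard C++ type, a 2-byte integer stands in for it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type for illustration; not part of the Ascend C API.
struct FractalShape {
    std::size_t rows;
    std::size_t cols;
};

// Fractal shape of matrix A: 16 x (32B / sizeof(AType)).
template <typename AType>
constexpr FractalShape FractalShapeA() {
    static_assert(32 % sizeof(AType) == 0, "element size must divide 32 bytes");
    return FractalShape{16, 32 / sizeof(AType)};
}

// Fractal shape of matrix B: (32B / sizeof(BType)) x 16.
template <typename BType>
constexpr FractalShape FractalShapeB() {
    static_assert(32 % sizeof(BType) == 0, "element size must divide 32 bytes");
    return FractalShape{32 / sizeof(BType), 16};
}

// Size in bytes of one fractal whose elements have type T.
template <typename T>
std::size_t FractalBytes(const FractalShape& s) {
    assert(s.rows > 0 && s.cols > 0);
    return s.rows * s.cols * sizeof(T);
}

// A 2-byte element (the size of half) gives a 16 x 16 fractal for matrix A.
static_assert(FractalShapeA<std::uint16_t>().cols == 16, "half-sized elements: 16 x 16");
```

For a 1-byte element such as int8_t, the matrix A fractal is 16 x 32. For a 4-byte element such as float, the matrix B fractal is 8 x 16. In both cases one fractal occupies 512 bytes.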

The following is a simple example. It is assumed that the size of a fractal matrix is 2 x 2 (which does not comply with an actual situation and is merely used as an example), and sizes of the matrices A, B, and C are all 4 x 4.
| 0 | 1 | 2 | 3 |
|---|---|---|---|
| 4 | 5 | 6 | 7 |
| 8 | 9 | 10 | 11 |
| 12 | 13 | 14 | 15 |
Arrangement order of matrix A: 0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15.
Arrangement order of matrix B: 0, 4, 1, 5, 2, 6, 3, 7, 8, 12, 9, 13, 10, 14, 11, 15.
Arrangement order of matrix C: 0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15.
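The three arrangement orders above can be reproduced with a short program. The following is a sketch in standard C++, assuming a row-major logical matrix whose dimensions are multiples of the fractal size. FractalOrder is a hypothetical helper for illustration and is not part of the Ascend C API.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper for illustration; not part of the Ascend C API.
// Returns the order in which the elements of a row-major rows x cols matrix
// are stored after it is split into fr x fc fractals.
//   fractalsRowMajor: traversal between fractals (true = row-major, false = column-major)
//   innerRowMajor:    traversal inside a fractal (true = row-major, false = column-major)
std::vector<int> FractalOrder(int rows, int cols, int fr, int fc,
                              bool fractalsRowMajor, bool innerRowMajor) {
    assert(fr > 0 && fc > 0 && rows % fr == 0 && cols % fc == 0);
    const int br = rows / fr;  // number of fractals along the row dimension
    const int bc = cols / fc;  // number of fractals along the column dimension
    std::vector<int> order;
    order.reserve(static_cast<std::size_t>(rows) * cols);
    for (int b = 0; b < br * bc; ++b) {
        // Position of the b-th fractal in the fractal grid.
        const int bi = fractalsRowMajor ? b / bc : b % br;
        const int bj = fractalsRowMajor ? b % bc : b / br;
        for (int e = 0; e < fr * fc; ++e) {
            // Position of the e-th element inside the fractal.
            const int i = innerRowMajor ? e / fc : e % fr;
            const int j = innerRowMajor ? e % fc : e / fr;
            order.push_back((bi * fr + i) * cols + (bj * fc + j));
        }
    }
    return order;
}
```

With rows = cols = 4 and 2 x 2 fractals, FractalOrder(4, 4, 2, 2, true, true) gives the order of matrix A, FractalOrder(4, 4, 2, 2, true, false) gives the order of matrix B, and FractalOrder(4, 4, 2, 2, false, true) gives the order of matrix C.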
Prototype
- Without bias input:
template <typename T, typename U, typename S> __aicore__ inline void Mmad(const LocalTensor<T>& dst, const LocalTensor<U>& fm, const LocalTensor<S>& filter, const MmadParams& mmadParams)
- With bias input:
template <typename T, typename U, typename S, typename V> __aicore__ inline void Mmad(const LocalTensor<T>& dst, const LocalTensor<U>& fm, const LocalTensor<S>& filter, const LocalTensor<V>& bias, const MmadParams& mmadParams)
Parameters
Table 1 Template parameters

| Parameter | Description |
|---|---|
| T | Data type of the destination operand. |
| U | Data type of the left matrix. |
| S | Data type of the right matrix. |
| V | Data type of the bias matrix. |
Table 2 Parameters

| Parameter | Input/Output | Meaning |
|---|---|---|
| dst | Output | Destination operand; result matrix. Type: LocalTensor. Supported TPosition: CO1. The start address of LocalTensor must be 256-element-aligned. |
| fm | Input | Source operand; left matrix a. Type: LocalTensor. Supported TPosition: A2. The start address of LocalTensor must be 512-byte aligned. |
| filter | Input | Source operand; right matrix b. Type: LocalTensor. Supported TPosition: B2. The start address of LocalTensor must be 512-byte aligned. |
| bias | Input | Source operand; bias matrix. Type: LocalTensor. Supported TPosition: C2 and CO1. The start address of LocalTensor must be 128-byte aligned. |
| mmadParams | Input | Matrix multiplication parameters. For the definition of this structure, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_mm.h. Replace ${INSTALL_DIR} with the actual path where the CANN software is installed. For details about the MmadParams parameters, see Table 3. |
Table 3 MmadParams structure parameters

| Parameter | Meaning |
|---|---|
| m | Height of the left matrix. Value range: m ∈ [0, 4095]. The default value is 0. |
| n | Width of the right matrix. Value range: n ∈ [0, 4095]. The default value is 0. |
| k | Width of the left matrix and height of the right matrix. Value range: k ∈ [0, 4095]. The default value is 0. |
| cmatrixInitVal | Whether the initial value of matrix C is 0. The default value is true. |
| cmatrixSource | Whether the initial value of matrix C comes from C2 (the hardware buffer that stores the bias). The default value is false. Note: This parameter is invalid for the API with bias input. The system determines whether the initial value of matrix C comes from CO1 or C2 based on the position of the bias input. |
| isBias | Whether the initial matrix needs to be added up. The default value is false. This parameter is deprecated; do not use it in new development. To add up the initial matrix, use the API that takes a bias tensor (biasLocal), or use the cmatrixInitVal and cmatrixSource parameters to configure the initial value source of matrix C. The API with the bias tensor is recommended because it is easier to configure than cmatrixInitVal and cmatrixSource. |
| unitFlag | Controls fine-grained parallelism between the MMAD and Fixpipe instructions. When this function is enabled, the hardware moves out the computation result each time a fractal has been computed. This function does not apply to scenarios where accumulation is performed in the L0C buffer. The options are as follows. 0: reserved value. 2: unitFlag is enabled, and the hardware does not disable it after executing the instruction. 3: unitFlag is enabled, and the hardware disables it after executing the instruction. When this function is used, set unitFlag of the MMAD instruction to 3 for the last fractal and to 2 for all other fractals. This parameter is supported only on specific models. |
| fmOffset | Reserved for future functions. Use the default value. |
| enSsparse | |
| enWinogradA | |
| enWinogradB | |
| kDirectionAlign | |
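To make the roles of m, n, k, and cmatrixInitVal concrete, the following is a host-side reference sketch in standard C++. It is an assumption-laden illustration, not the device implementation. MmadParamsSketch and ReferenceMmad are hypothetical names, the struct is not the real MmadParams from kernel_struct_mm.h, and the buffers are plain row-major arrays without any fractal layout. The zero-size rule comes from the Restrictions section of this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for four of the fields described above.
// This is NOT the real MmadParams structure.
struct MmadParamsSketch {
    std::uint16_t m = 0;         // height of the left matrix
    std::uint16_t n = 0;         // width of the right matrix
    std::uint16_t k = 0;         // width of the left matrix, height of the right matrix
    bool cmatrixInitVal = true;  // true: the initial value of matrix C is 0
};

// Reference semantics on plain row-major buffers: a is m x k, b is k x n, c is m x n.
// Returns false when nothing is computed.
bool ReferenceMmad(std::vector<float>& c, const std::vector<float>& a,
                   const std::vector<float>& b, const MmadParamsSketch& p) {
    // Values outside the documented range [0, 4095] are rejected in this sketch.
    if (p.m > 4095 || p.n > 4095 || p.k > 4095) return false;
    // If any of m, n, and k is 0, the instruction is not executed.
    if (p.m == 0 || p.n == 0 || p.k == 0) return false;
    assert(a.size() == static_cast<std::size_t>(p.m) * p.k);
    assert(b.size() == static_cast<std::size_t>(p.k) * p.n);
    assert(c.size() == static_cast<std::size_t>(p.m) * p.n);
    for (std::size_t i = 0; i < p.m; ++i) {
        for (std::size_t j = 0; j < p.n; ++j) {
            // Start from 0 or from the existing content of C.
            float acc = p.cmatrixInitVal ? 0.0f : c[i * p.n + j];
            for (std::size_t kk = 0; kk < p.k; ++kk) {
                acc += a[i * p.k + kk] * b[kk * p.n + j];
            }
            c[i * p.n + j] = acc;
        }
    }
    return true;
}
```

With cmatrixInitVal set to true, the previous content of c is overwritten (C = A * B). With cmatrixInitVal set to false, the product is accumulated onto the existing content (C += A * B).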
| Left matrix fm type | Right matrix filter type | Result matrix dst type |
|---|---|---|
| uint8_t | uint8_t | uint32_t |
| int8_t | int8_t | int32_t |
| uint8_t | int8_t | int32_t |
| half | half | half. NOTE: The precision of this type combination cannot reach double 1‰, and later processor versions do not support this type conversion. You are advised to use half input and float output. Double 1‰ means that the error between each actual value and the true value does not exceed 1‰, and the number of values whose error exceeds 1‰ does not exceed 1‰ of the total number of values. |
| half | half | float |
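The double 1‰ criterion mentioned in the NOTE above can be checked numerically when comparing a result against a reference. The following is a sketch in standard C++ under stated assumptions: the text does not say whether the error is relative or absolute, so this sketch uses the relative error and falls back to the absolute error when the reference value is 0. MeetsDoubleOneThousandth is a hypothetical helper and is not part of the Ascend C API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration; not part of the Ascend C API.
// Counts the values whose error exceeds 1 per mille and accepts the result
// if that count is at most 1 per mille of the total number of values.
// Assumption: relative error, with absolute error used for a zero reference.
bool MeetsDoubleOneThousandth(const std::vector<float>& actual,
                              const std::vector<float>& expected) {
    assert(actual.size() == expected.size());
    const double tol = 1e-3;
    std::size_t bad = 0;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        const double err = std::fabs(static_cast<double>(actual[i]) -
                                     static_cast<double>(expected[i]));
        const double ref = std::fabs(static_cast<double>(expected[i]));
        const double limit = ref > 0.0 ? tol * ref : tol;
        if (err > limit) ++bad;
    }
    // bad / total <= 1/1000, written without floating-point division.
    return bad * 1000 <= actual.size();
}
```

For example, with 1000 values, one value outside the tolerance still passes, while two values outside the tolerance fail.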
| Left matrix fm type | Right matrix filter type | Result matrix dst type |
|---|---|---|
| int8_t | int8_t | int32_t |
| uint8_t | int8_t | int32_t |
| uint8_t | uint8_t | int32_t |
| half | half | half. NOTE: The precision of this type combination cannot reach double 1‰, and later processor versions do not support this type conversion. You are advised to use half input and float output. Double 1‰ means that the error between each actual value and the true value does not exceed 1‰, and the number of values whose error exceeds 1‰ does not exceed 1‰ of the total number of values. |
| half | half | float |
| int4b_t | int4b_t | int32_t |
| Left matrix fm type | Right matrix filter type | Result matrix dst type |
|---|---|---|
| int8_t | int8_t | int32_t |
| half | half | float |
| float | float | float |
| bfloat16_t | bfloat16_t | float |
| int4b_t | int4b_t | int32_t |
| Left matrix fm type | Right matrix filter type | Bias type | Result matrix dst type |
|---|---|---|---|
| int8_t | int8_t | int32_t | int32_t |
| half | half | float | float |
| float | float | float | float |
| bfloat16_t | bfloat16_t | float | float |
Restrictions
- dst can only be placed in CO1, fm can only be placed in A2, and filter can only be placed in B2.
- If any of M, K, and N is 0, the instruction is not executed.
- When M = 1, the General Matrix-Vector Multiplication (GEMV) function is enabled by default. In this case, the Mmad API reads data from the L0A Buffer in ND format instead of ZZ format. Therefore, the left matrix needs to be directly arranged in ND format.
- For details about the operand address alignment requirements, see General Address Alignment Restrictions.
- The following uses an example to describe the arrangement of invalid and valid data.
The data type is half, with M = 30, K = 70, and N = 40. A2 then holds 2 x 5 fractal matrices of 16 x 16 elements each, B2 holds 5 x 3 such matrices, and CO1 holds 2 x 3 such matrices. Because M, K, and N are not multiples of 16, the fractal matrices on the lower and right edges are only partly filled. For example, the fractal matrix in the lower right corner of A2 contains only 14 x 6 valid elements, but it still occupies the space of a full 16 x 16 matrix. The invalid data is ignored during computation. In a 16 x 16 fractal data block, the arrangement of invalid and valid data is as follows.
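The fractal counts and the valid region of the last fractal in the example above follow from a ceiling division. The following is a small sketch in standard C++, assuming 16 x 16 fractals as in the half example. FractalGrid, CeilDiv, Tail, and MakeGrid are hypothetical helpers for illustration and are not part of the Ascend C API.

```cpp
#include <cassert>

// Hypothetical helper type for illustration; not part of the Ascend C API.
struct FractalGrid {
    int fractalRows;  // number of fractals along the row dimension
    int fractalCols;  // number of fractals along the column dimension
    int tailRows;     // valid rows in the last row of fractals
    int tailCols;     // valid columns in the last column of fractals
};

// Ceiling division for positive integers.
inline int CeilDiv(int a, int b) {
    assert(a > 0 && b > 0);
    return (a + b - 1) / b;
}

// Number of valid elements in the last fractal along a dimension of length len.
inline int Tail(int len, int fractal) {
    return len - (CeilDiv(len, fractal) - 1) * fractal;
}

// Fractal grid of a rows x cols matrix split into fr x fc fractals.
inline FractalGrid MakeGrid(int rows, int cols, int fr = 16, int fc = 16) {
    return FractalGrid{CeilDiv(rows, fr), CeilDiv(cols, fc),
                       Tail(rows, fr), Tail(cols, fc)};
}
```

For M = 30, K = 70, and N = 40, MakeGrid(30, 70) describes matrix A (2 x 5 fractals, with 14 x 6 valid elements in the lower right fractal), MakeGrid(70, 40) describes matrix B (5 x 3 fractals), and MakeGrid(30, 40) describes matrix C (2 x 3 fractals).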
