Basics

This section describes how to use high-level APIs to perform matrix multiplication. For the product models that the high-level APIs support in cube programming, see the API Reference.

Overview of Matrix Multiplication

Matmul formula: C = A × B + bias.

A and B are the source operands. A is a left matrix with shape [M, K], and B is a right matrix with shape [K, N].
C is the destination operand, which is a matrix that stores the matrix multiplication result. Its shape is [M, N].
bias is the matrix multiplication bias, whose shape is [1, N]. It is added to each row of the A × B result matrix.

Figure 1 Matmul matrix multiplication
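
The formula above can be checked with a small reference implementation. The following is a minimal sketch in plain Python, not the high-level API itself; the function name matmul_bias and the sample values are made up for illustration. It shows how a bias of shape [1, N] is added to every row of A × B.

```python
def matmul_bias(a, b, bias):
    """Reference C = A x B + bias for lists of lists.

    a: M x K, b: K x N, bias: 1 x N (one row, added to every row of A x B).
    """
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "A must be M x K"
    assert len(bias) == 1 and len(bias[0]) == n, "bias must be 1 x N"
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # The same bias element bias[0][j] is used for every row i.
            c[i][j] = acc + bias[0][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]      # M = 2, K = 3
b = [[1, 0], [0, 1], [1, 1]]    # K = 3, N = 2
bias = [[10, 20]]               # 1 x N
c = matmul_bias(a, b, bias)     # shape M x N = 2 x 2
```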

Matrix Multiplication Data Flow

Before learning about the matrix multiplication data flow, you need to review several important concepts of logical storage locations.

Storage location of the copy-in data: A1, used to store the entire matrix A, which is similar to the level-2 cache in the multi-level cache of the CPU.
Storage location of the copy-in data: B1, used to store the entire matrix B, which is similar to the level-2 cache in the multi-level cache of the CPU.
Storage location of the copy-in data: C1, used to store the entire matmul bias matrix, which is similar to the level-2 cache in the multi-level cache of the CPU.
Storage location of the copy-in data: A2, used to store the split smaller matrix A, which is similar to the level-1 cache in the multi-level cache of the CPU.
Storage location of the copy-in data: B2, used to store the split smaller matrix B, which is similar to the level-1 cache in the multi-level cache of the CPU.
Storage location of the copy-in data: C2, used to store the split smaller matmul bias matrix, which is similar to the level-1 cache in the multi-level cache of the CPU.
Storage location of the result data: CO1, used to store the smaller result matrix C, which can be considered as Cube Out.
Storage location of the result data: CO2, used to store the entire result matrix C, which can be considered as Cube Out.
Storage location of the copy-in data: VECCALC, used when temporary variables are required for the computation.

Matrix multiplication data flow refers to the flow direction of the input and output of the matrix multiplication between storage locations. The following figure shows the data flow of logical locations. (To simplify the description, bias is not listed.)

Data flow from an input location of matrix A to A2 is as follows (the input location may be GM or VECOUT): GM->A2, GM->A1->A2, or VECOUT->A1->A2.
Because A1 has a larger space than A2, data can be first moved from GM or VECOUT to A1 for buffering. Before Cube computation is performed on the data, the data is directly moved from A1 to A2. In this way, the waiting time before computation can be reduced when a large amount of data is moved, improving performance. The data flow of GM->A2 is used only when a small amount of data is moved.
Data flow from an input location of matrix B to B2 is as follows (the input location may be GM or VECOUT): GM->B2, GM->B1->B2, or VECOUT->B1->B2.
Because B1 has a larger space than B2, data can be first moved from GM or VECOUT to B1 for buffering. Before Cube computation is performed on the data, the data is directly moved from B1 to B2. In this way, the waiting time before computation can be reduced when a large amount of data is moved, improving performance. The data flow of GM->B2 is used only when a small amount of data is moved.
Complete the operation of A2 × B2 = CO1.
CO1 data is aggregated to CO2: CO1->CO2.
Data flow from CO2 to the output location (GM or VECIN): CO2->GM or CO2->VECIN.
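
The routes listed above can be written down as a small table of allowed hops. The following sketch is only an illustration of the logical data flow (bias omitted, as in the description); the names ALLOWED_HOPS and is_valid_route are hypothetical and are not part of any API.

```python
# Allowed hops between logical locations, transcribed from the routes above.
ALLOWED_HOPS = {
    "GM": {"A1", "A2", "B1", "B2"},
    "VECOUT": {"A1", "B1"},
    "A1": {"A2"},
    "B1": {"B2"},
    "A2": {"CO1"},
    "B2": {"CO1"},
    "CO1": {"CO2"},
    "CO2": {"GM", "VECIN"},
}


def is_valid_route(route):
    """Return True if every consecutive pair of locations is an allowed hop."""
    return all(
        dst in ALLOWED_HOPS.get(src, set())
        for src, dst in zip(route, route[1:])
    )


buffered_a = ["GM", "A1", "A2", "CO1", "CO2", "GM"]   # large input, buffered in A1
direct_b = ["GM", "B2", "CO1", "CO2", "VECIN"]        # small input, moved directly
skips_a1 = ["VECOUT", "A2"]                           # not listed: VECOUT must pass A1
```

Here is_valid_route accepts buffered_a and direct_b and rejects skips_a1, because VECOUT data reaches A2 only through A1.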

Data Formats

Matmul mainly involves two data formats: ND and NZ.

ND: common format, N-dimensional tensor.
NZ: This special format is introduced to meet the high-performance computing requirements of the Cube Unit in the AI Core.
ND-to-NZ conversion:
```
(..., N, H, W)->pad->(..., N, H1*H0, W1*W0)->reshape->(..., N, H1, H0, W1, W0)->transpose->(..., N, W1, H1, H0, W0)
```
As shown in the following figure, the (H, W) matrix is divided into (H1 × W1) fractals, which are arranged in column-major order, tracing the shape of the letter N. Each fractal has (H0 × W0) elements, which are arranged in row-major order, tracing the shape of the letter Z. Therefore, the data format is called NZ (large N, small Z) format.

The following example illustrates the differences between the ND and NZ data layouts. Assume that the fractal size is 2 × 2 and the matrix is the 4 × 4 matrix shown in the following figure. With the ND (1, 4, 4) and NZ (1, 2, 2, 2, 2) layouts, the data is arranged in memory as follows:

ND: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

NZ: 0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15

For details about the example of converting the ND format to the NZ format, see Matmul operator sample for converting the ND format of the input matrix to the NZ format.
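
The pad, reshape, and transpose steps can also be traced in a few lines of code. The following is a minimal sketch in plain Python for a single 2-D matrix; the function name nd_to_nz and the pad_value parameter are assumptions made for illustration, and the real conversion is done by the operator sample referenced above. It reproduces the 4 × 4 example and shows how padding fills incomplete fractals.

```python
def nd_to_nz(matrix, h0, w0, pad_value=0):
    """Flatten an H x W row-major matrix in NZ order (W1, H1, H0, W0)."""
    h, w = len(matrix), len(matrix[0])
    h1 = -(-h // h0)  # number of fractals along H (ceiling division)
    w1 = -(-w // w0)  # number of fractals along W (ceiling division)
    # pad: extend the matrix to (H1*H0) x (W1*W0)
    padded = [
        [matrix[r][c] if r < h and c < w else pad_value
         for c in range(w1 * w0)]
        for r in range(h1 * h0)
    ]
    flat = []
    for wi in range(w1):              # fractals: column-major (large N)
        for hi in range(h1):
            for r in range(h0):       # inside a fractal: row-major (small Z)
                for c in range(w0):
                    flat.append(padded[hi * h0 + r][wi * w0 + c])
    return flat


nd = [[4 * r + c for c in range(4)] for r in range(4)]   # values 0..15
nz = nd_to_nz(nd, 2, 2)

# A 3 x 3 matrix is padded to 4 x 4 before it is rearranged.
nz_padded = nd_to_nz([[1, 2, 3], [4, 5, 6], [7, 8, 9]], 2, 2)
```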

Data Tiling

Multi-core tiling
To implement multi-core parallelism, matrix data needs to be tiled and allocated across cores for processing. The tiling strategies are as follows:
- Matrix A is tiled into multiple tiles of singleCoreM along the M axis. A single core processes singleCoreM × K data.
- Matrix B is tiled into multiple tiles of singleCoreN along the N axis. A single core processes K × singleCoreN data.
- For matrix C, matrix A with the size of singleCoreM × K is multiplied by matrix B with the size of K × singleCoreN to obtain matrix C with the size of singleCoreM × singleCoreN, the size of matrix C output on a single core.
As shown in the following figure, eight cores participate in the computation. Matrix A is tiled into four blocks along the M axis, and matrix B is tiled into two blocks along the N axis. A single core processes one block only (for example, the green part in the figure is the data computed on core3). The matrix A block with the size of singleCoreM × K is multiplied by the matrix B block with the size of K × singleCoreN to obtain the matrix C block with the size of singleCoreM × singleCoreN.

In addition, the length of the K axis processed on a single core is singleCoreK. When the K axis is long, it can be tiled into multiple tiles of singleCoreK. For details, see Matmul High-level API Enabling K-axis Tiling of Matrix Data in Multi-core Parallel Computation.
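
The multi-core strategy above can be sketched as a function that lists the block each core works on. This is an illustration only: the function name multi_core_blocks, the core numbering (M block first, then N block), the handling of tail blocks, and the sample sizes are all assumptions, and the K axis is left untiled.

```python
def multi_core_blocks(m, n, k, single_core_m, single_core_n):
    """List the A rows, B columns, and C block shape handled by each core."""
    m_blocks = -(-m // single_core_m)   # ceiling division
    n_blocks = -(-n // single_core_n)
    blocks = []
    for mi in range(m_blocks):
        for ni in range(n_blocks):
            row_start = mi * single_core_m
            col_start = ni * single_core_n
            rows = min(single_core_m, m - row_start)   # tail block may be smaller
            cols = min(single_core_n, n - col_start)
            blocks.append({
                "core": mi * n_blocks + ni,            # hypothetical numbering
                "a_rows": (row_start, row_start + rows),
                "b_cols": (col_start, col_start + cols),
                "a_shape": (rows, k),                  # singleCoreM x K
                "b_shape": (k, cols),                  # K x singleCoreN
                "c_shape": (rows, cols),               # singleCoreM x singleCoreN
            })
    return blocks


# Four blocks along M and two along N give eight single-core blocks.
blocks = multi_core_blocks(m=512, n=256, k=128, single_core_m=128, single_core_n=128)
```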
Intra-core tiling
In most cases, the local memory cannot hold the complete operator input and output. Therefore, the input must be passed in part by part over multiple passes until the complete final result is obtained. This process is intra-core tiling. The tiling strategies are as follows:
- For matrix A, singleCoreM is tiled along the M axis into multiple tiles of baseM, the number of which corresponds to mIter in the figure, or tiled along the K axis into multiple tiles of baseK.
- For matrix B, singleCoreN is tiled along the N axis into multiple tiles of baseN, the number of which corresponds to nIter in the figure, or tiled along the K axis into multiple tiles of baseK.
- For matrix C, a block with the size of baseM × baseK in matrix A is multiplied by a block with the size of baseK × baseN in matrix B, and accumulation is performed to obtain a block with the size of baseM × baseN in a corresponding location in matrix C. For example, the green matrix block 5 in the result matrix in the figure is obtained by using the following accumulation process: a × a + b × b + c × c + d × d + e × e + f × f.
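
The accumulation over base blocks can be sketched as a blocked matrix multiplication. The following plain-Python sketch is an illustration under simplifying assumptions (no bias, no buffering, tail blocks handled by clamping the ranges); the function name tiled_matmul and the sample matrices are made up. It shows that summing the baseM × baseK by baseK × baseN partial products along K gives the same result as an untiled multiplication.

```python
def tiled_matmul(a, b, base_m, base_n, base_k):
    """Compute A x B block by block, accumulating partial products along K."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for m0 in range(0, m, base_m):            # mIter blocks along M
        for n0 in range(0, n, base_n):        # nIter blocks along N
            for k0 in range(0, k, base_k):    # accumulate along K
                for i in range(m0, min(m0 + base_m, m)):
                    for j in range(n0, min(n0 + base_n, n)):
                        acc = 0
                        for p in range(k0, min(k0 + base_k, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] += acc        # add this K block's contribution
    return c


a = [[6 * i + j for j in range(6)] for i in range(4)]      # 4 x 6
b = [[(i + j) % 3 for j in range(4)] for i in range(6)]    # 6 x 4
reference = [
    [sum(a[i][p] * b[p][j] for p in range(6)) for j in range(4)]
    for i in range(4)
]
tiled = tiled_matmul(a, b, base_m=2, base_n=2, base_k=2)
```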
In addition to the basic block shapes baseM, baseN, and baseK, there are several common tiling parameters, which are described as follows:
- iterateOrder: A matrix C tile with the size of [baseM, baseN] is computed in one iteration. After one iteration, Matmul automatically offsets the output location of matrix C for the next iteration. iterateOrder indicates the automatic offset sequence.
  - The value 0 indicates that offsetting is performed along the M-axis direction first and then along the N-axis direction.
  - The value 1 indicates that offsetting is performed along the N-axis direction first and then along the M-axis direction.
  In the preceding figure, iterateOrder is set to 0.
- depthA1 and depthB1: number of full A2-sized and B2-sized blocks held by the matrix tiles stored in A1 and B1. The sizes of A2 and B2 are baseM × baseK and baseK × baseN, respectively. That is, depthA1 is the number of blocks with the size of baseM × baseK contained in the A1 matrix tiles, and depthB1 is the number of blocks with the size of baseK × baseN contained in the B1 matrix tiles.
- stepM and stepN: stepM is the multiple of baseM that the left matrix buffered in A1 spans in the M direction. stepN is the multiple of baseN that the right matrix buffered in B1 spans in the N direction.
- stepKa and stepKb: stepKa is the multiple of baseK that the left matrix buffered in A1 spans in the K direction. stepKb is the multiple of baseK that the right matrix buffered in B1 spans in the K direction.
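
The effect of iterateOrder and of the step parameters can be sketched as follows. This is one reading of the descriptions above, not the library's implementation: the function names iterate_offsets and a1_tile_shape are hypothetical, block indices stand in for real memory offsets, and "M-axis first" is taken to mean that the M index changes fastest.

```python
def iterate_offsets(m_iter, n_iter, iterate_order):
    """Return the (m_index, n_index) of the C block computed in each iteration."""
    if iterate_order == 0:
        # Offset along M first: the M index changes fastest.
        return [(mi, ni) for ni in range(n_iter) for mi in range(m_iter)]
    if iterate_order == 1:
        # Offset along N first: the N index changes fastest.
        return [(mi, ni) for mi in range(m_iter) for ni in range(n_iter)]
    raise ValueError("iterate_order must be 0 or 1")


def a1_tile_shape(base_m, base_k, step_m, step_ka):
    """Shape of the left-matrix tile buffered in A1 and its base block count."""
    return (step_m * base_m, step_ka * base_k), step_m * step_ka


order_m = iterate_offsets(m_iter=2, n_iter=3, iterate_order=0)
order_n = iterate_offsets(m_iter=2, n_iter=3, iterate_order=1)
shape, block_count = a1_tile_shape(base_m=64, base_k=32, step_m=2, step_ka=4)
```

With these sample values, the tile buffered in A1 spans 128 × 128 elements and contains eight baseM × baseK blocks.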
