Basics

This section describes how to use the high-level APIs to perform matrix multiplication. For the product models that support the high-level APIs in cube programming, see the API Reference.

Overview of Matrix Multiplication

Matmul formula: C = A × B + bias.

  • A and B are the source operands: A is the left matrix with shape [M, K], and B is the right matrix with shape [K, N].
  • C is the destination operand, the matrix that stores the multiplication result. Its shape is [M, N].
  • bias is the matrix multiplication bias with shape [1, N]; it is added to each row of the A × B result matrix.
Figure 1 Matmul matrix multiplication
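
The following plain C++ sketch is only meant to make the shapes and the bias broadcast concrete. It is a naive reference implementation, not the high-level Matmul API; the function name MatmulRef and the row-major layout are assumptions.

#include <cstddef>
#include <vector>

// Reference sketch of C = A x B + bias.
// A is [M, K], B is [K, N], bias is [1, N] and is added to every row of A x B.
std::vector<float> MatmulRef(const std::vector<float>& A,     // M x K, row-major
                             const std::vector<float>& B,     // K x N, row-major
                             const std::vector<float>& bias,  // 1 x N
                             std::size_t M, std::size_t K, std::size_t N) {
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[m * K + k] * B[k * N + n];
            }
            C[m * N + n] = acc + bias[n];  // bias is broadcast along the M axis
        }
    }
    return C;
}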

Matrix Multiplication Data Flow

Before looking at the matrix multiplication data flow, review the following logical storage locations.

  • Storage location of the move-in data: A1, used to store the entire matrix A, which is similar to the level-2 cache in the multi-level cache of the CPU.
  • Storage location of the move-in data: B1, used to store the entire matrix B, which is similar to the level-2 cache in the multi-level cache of the CPU.
  • Storage location of the move-in data: A2, used to store the tiled smaller matrix A, which is similar to the level-1 cache in the multi-level cache of the CPU.
  • Storage location of the move-in data: B2, used to store the tiled smaller matrix B, which is similar to the level-1 cache in the multi-level cache of the CPU.
  • Storage location of the result data: CO1, used to store the small-block result matrix C, which can be considered as Cube Out.
  • Storage location of the result data: CO2, used to store the entire result matrix C, which can be considered as Cube Out.
  • Storage location of temporary data: VECCALC, used when temporary variables are required during computation.

Matrix multiplication data flow refers to how the inputs and outputs of the matrix multiplication move between storage locations. The following figure shows the data flow between logical locations. (To simplify the description, bias is not shown.)

  • Data flow from an input location of matrix A (GM or VECOUT) to A2: GM->A2 or GM->A1->A2; VECOUT->A1->A2.
  • Data flow from an input location of matrix B (GM or VECOUT) to B2: GM->B2 or GM->B1->B2; VECOUT->B1->B2.
  • The A2 x B2 operation is performed: A2 x B2 = CO1.
  • CO1 data is aggregated to CO2: CO1->CO2.
  • Data flow from CO2 to the output location (GM or VECIN): CO2->GM or CO2->VECIN.

Data Format

Matmul involves two data formats: ND and NZ.

  • ND: common format, N-dimensional tensor.
  • NZ: This special format is introduced to meet the high-performance computing requirements of the Cube Unit in the AI Core.

    ND-to-NZ conversion:

    (..., N, H, W) -> pad -> (..., N, H1 x H0, W1 x W0) -> reshape -> (..., N, H1, H0, W1, W0) -> transpose -> (..., N, W1, H1, H0, W0)

    As shown in the following figure, the (W, H) matrix is divided into H1 x W1 fractals, which are arranged in column-major order, tracing the shape of the letter N. Each fractal contains H0 x W0 elements, which are arranged in row-major order, tracing the shape of the letter Z. The data format is therefore called NZ (big N, little Z).

    The following example helps you understand the difference between the ND and NZ data layouts. Assume that the fractal size is 2 x 2 and the matrix is the 4 x 4 matrix shown in the following figure. The in-memory layouts of the ND and NZ formats are as follows:

    ND: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

    NZ: 0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15
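
    The following plain C++ sketch reproduces this rearrangement for a single H x W matrix. It only illustrates the (W1, H1, H0, W0) output order; it assumes H and W are multiples of the fractal size (padding omitted), and the function name NdToNz is an assumption, not a library routine.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Rearrange a row-major (ND) H x W matrix into NZ order with H0 x W0 fractals.
    std::vector<int> NdToNz(const std::vector<int>& nd, std::size_t H, std::size_t W,
                            std::size_t H0, std::size_t W0) {
        const std::size_t H1 = H / H0;
        const std::size_t W1 = W / W0;
        std::vector<int> nz;
        nz.reserve(H * W);
        // Output order follows (W1, H1, H0, W0): fractal columns first (big "N"),
        // then the row-major elements inside each fractal (little "Z").
        for (std::size_t w1 = 0; w1 < W1; ++w1) {
            for (std::size_t h1 = 0; h1 < H1; ++h1) {
                for (std::size_t h0 = 0; h0 < H0; ++h0) {
                    for (std::size_t w0 = 0; w0 < W0; ++w0) {
                        nz.push_back(nd[(h1 * H0 + h0) * W + (w1 * W0 + w0)]);
                    }
                }
            }
        }
        return nz;
    }

    int main() {
        std::vector<int> nd(16);
        for (int i = 0; i < 16; ++i) nd[i] = i;            // ND: 0, 1, 2, ..., 15
        for (int v : NdToNz(nd, 4, 4, 2, 2)) std::printf("%d ", v);
        std::printf("\n");                                  // NZ: 0 1 4 5 8 9 12 13 2 3 6 7 10 11 14 15
        return 0;
    }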

Data Tiling

  • Multi-core tiling

    To implement multi-core parallelism, matrix data needs to be tiled and allocated to different cores for processing. The following figure shows the tiling policy.

    • Matrix A is tiled into multiple tiles of singleCoreM along the M axis. A single core processes singleCoreM x K data.
    • Matrix B is tiled into multiple tiles of singleCoreN along the N axis. A single core processes K x singleCoreN data.
    • For matrix C, the singleCoreM x K tile of matrix A is multiplied by the K x singleCoreN tile of matrix B to obtain a singleCoreM x singleCoreN tile of matrix C, which is the output of a single core.

    As shown in the following figure, eight cores participate in the computation. Matrix A is tiled into four blocks along the M axis, and matrix B is tiled into two blocks along the N axis. A single core processes only one block (for example, the green part in the figure is the data computed on core 3). The matrix A block with the size of singleCoreM x K is multiplied by the matrix B block with the size of K x singleCoreN to obtain the matrix C block with the size of singleCoreM x singleCoreN.
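
    The following sketch shows how such a split can be expressed in plain C++. It is a hypothetical illustration of the offset arithmetic only: the names blockIdx, mBlocks, and nBlocks, and the row-by-row assignment of cores to C tiles, are assumptions rather than the library's actual scheduling.

    #include <cstddef>

    struct CoreTile {
        std::size_t mOffset;  // starting row of this core's A / C tile
        std::size_t nOffset;  // starting column of this core's B / C tile
    };

    // Map a core index onto the grid of singleCoreM x singleCoreN tiles of C.
    CoreTile GetCoreTile(std::size_t blockIdx,     // core index, 0 .. mBlocks * nBlocks - 1
                         std::size_t mBlocks,      // tiles along the M axis (4 in the figure)
                         std::size_t nBlocks,      // tiles along the N axis (2 in the figure)
                         std::size_t singleCoreM,
                         std::size_t singleCoreN) {
        const std::size_t mIdx = blockIdx / nBlocks;  // which M tile this core owns
        const std::size_t nIdx = blockIdx % nBlocks;  // which N tile this core owns
        // The core then computes A[mOffset .. mOffset + singleCoreM, 0 .. K]
        // multiplied by B[0 .. K, nOffset .. nOffset + singleCoreN].
        return { mIdx * singleCoreM, nIdx * singleCoreN };
    }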

  • Intra-core tiling

    In most cases, the local memory cannot hold the entire operator input and output at once. Therefore, the inputs must be moved in piece by piece over multiple passes until the complete result is obtained. This process is intra-core tiling. The tiling policies are as follows:

    • Matrix A can be tiled along the M axis into multiple tiles of baseM, or tiled along the K axis into multiple tiles of baseK.
    • Matrix B can be tiled along the N axis into multiple tiles of baseN, or tiled along the K axis into multiple tiles of baseK.
    • For matrix C, a baseM x baseK block of matrix A is multiplied by a baseK x baseN block of matrix B, and the partial products are accumulated into the corresponding baseM x baseN block of matrix C (see the sketch after the parameter list below). For example, the blue matrix block 5 in the result matrix in the figure is obtained by the following accumulation: a x a + b x b + c x c + d x d + e x e + f x f.

    In addition to baseM, baseN, and baseK, there are some common tiling parameters. The meanings of these parameters are as follows:

    • iterateOrder: A matrix C tile with the size of [baseM, baseN] is computed in one iteration. After one iteration, Matmul automatically offsets the output position in matrix C for the next iteration. iterateOrder indicates the order of this automatic offsetting.
      • The value 0 indicates that offsetting is performed along the M-axis direction first and then along the N-axis direction.
      • The value 1 indicates that offsetting is performed along the N-axis direction first and then along the M-axis direction.
    • depthA1 and depthB1: the number of fully loaded A2 and B2 tiles that are kept in A1 and B1, respectively. An A2 tile has the size baseM x baseK, and a B2 tile has the size baseN x baseK.
    • stepM and stepN: stepM is the size, in multiples of baseM, of the left-matrix data buffered in A1 along the M direction. stepN is the size, in multiples of baseN, of the right-matrix data buffered in B1 along the N direction.
    • stepKa and stepKb: stepKa is the size, in multiples of baseK, of the left-matrix data buffered in A1 along the K direction. stepKb is the size, in multiples of baseK, of the right-matrix data buffered in B1 along the K direction.
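
    The sketch below, referenced above, illustrates the intra-core loop on a single core in plain C++. It is not the library implementation: it assumes singleCoreM, singleCoreN, and K are exact multiples of baseM, baseN, and baseK, that all data is row-major in flat buffers, and that C is zero-initialized by the caller; the function name SingleCoreMatmul is an assumption.

    #include <cstddef>
    #include <vector>

    // Blocked matmul on one core: accumulate baseM x baseK by baseK x baseN
    // partial products into [baseM, baseN] tiles of C. iterateOrder selects
    // whether the output tile offset walks the M axis first (0) or the N axis
    // first (1).
    void SingleCoreMatmul(const std::vector<float>& A,  // singleCoreM x K
                          const std::vector<float>& B,  // K x singleCoreN
                          std::vector<float>& C,        // singleCoreM x singleCoreN, zero-initialized
                          std::size_t singleCoreM, std::size_t singleCoreN, std::size_t K,
                          std::size_t baseM, std::size_t baseN, std::size_t baseK,
                          int iterateOrder) {
        const std::size_t mIter = singleCoreM / baseM;
        const std::size_t nIter = singleCoreN / baseN;
        const std::size_t kIter = K / baseK;
        const std::size_t outer = (iterateOrder == 0) ? nIter : mIter;
        const std::size_t inner = (iterateOrder == 0) ? mIter : nIter;
        for (std::size_t o = 0; o < outer; ++o) {
            for (std::size_t i = 0; i < inner; ++i) {
                const std::size_t mi = (iterateOrder == 0) ? i : o;  // 0: offset along M first
                const std::size_t ni = (iterateOrder == 0) ? o : i;  // 1: offset along N first
                // One iteration: kIter partial products accumulated into the
                // C tile at (mi * baseM, ni * baseN).
                for (std::size_t ki = 0; ki < kIter; ++ki) {
                    for (std::size_t m = 0; m < baseM; ++m) {
                        for (std::size_t n = 0; n < baseN; ++n) {
                            float acc = 0.0f;
                            for (std::size_t k = 0; k < baseK; ++k) {
                                acc += A[(mi * baseM + m) * K + ki * baseK + k] *
                                       B[(ki * baseK + k) * singleCoreN + ni * baseN + n];
                            }
                            C[(mi * baseM + m) * singleCoreN + ni * baseN + n] += acc;
                        }
                    }
                }
            }
        }
    }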