matmul

Description

Multiplies tensor a by tensor b and outputs a result tensor.

For details about the data type restrictions, see Table 2.

Prototype

matmul(dst, a, b, m, k, n, init_l1out=True)

Parameters

Table 1 Parameter description

dst (Output)

Start element of the destination operand. For details about the data type restrictions, see Table 2. The scope is the L1OUT Buffer.

A tensor in the format [N1, M, N0], where N = N1 * N0:

  • N1 = Ceiling(n/N0). Ceiling indicates rounding up.
  • M = Ceiling(m/M0) * M0, where M0 = 16.
  • N0 = 16.

a (Input)

Source operand, tensor of the left matrix. For details about the data type restrictions, see Table 2. The scope is the L1 Buffer.

A tensor in the format [K1, M, K0], where K = K1 * K0:

  • K1 = Ceiling(k/K0). Ceiling indicates rounding up.
  • M = Ceiling(m/M0) * M0, where M0 = 16.
  • K0: An immediate of type int. If tensor a is of type float16, K0 = 16; if tensor a is of type int8, K0 = 32.

b (Input)

Source operand, tensor of the right matrix. For details about the data type restrictions, see Table 2. The scope is the L1 Buffer.

A tensor in the format [K1, N, K0], where K = K1 * K0:

  • K1 = Ceiling(k/K0). Ceiling indicates rounding up.
  • K0: An immediate of type int. If tensor b is of type float16, K0 = 16; if tensor b is of type int8, K0 = 32.
  • N = N1 * N0, where N1 = Ceiling(n/N0) and N0 = 16.

m (Input)

An immediate of type int specifying the valid height of the left matrix. Must be in the range [1, 4096].

Note: The m argument does not need to be rounded up to a multiple of 16.

k (Input)

An immediate of type int specifying the valid width of the left matrix and the valid height of the right matrix.

  • If tensor a is of type float16, the value range is [1, 16384].
  • If tensor a is of type int8, the value range is [1, 32768].

Note: The k argument does not need to be rounded up to a multiple of 16.

n (Input)

An immediate of type int specifying the valid width of the right matrix. Must be in the range [1, 4096].

Note: The n argument does not need to be rounded up to a multiple of 16.

init_l1out (Input)

A bool specifying whether to initialize dst. Defaults to True.

  • True: The initial contents of dst are overwritten by the compute result.
  • False: dst holds the previous matmul result, and the new matmul result is accumulated onto it.
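The shape formulas in Table 1 can be collected into a small host-side helper. This is an illustrative sketch, not part of the TIK API; the function name matmul_operand_shapes is hypothetical.

```python
import math

def matmul_operand_shapes(m, k, n, a_dtype):
    """Compute the required shapes of a, b, and dst from the valid sizes m, k, n.

    a_dtype is "float16" or "int8"; it selects K0 as described in Table 1.
    """
    M0 = 16
    N0 = 16
    K0 = 32 if a_dtype == "int8" else 16
    M = math.ceil(m / M0) * M0           # m rounded up to a multiple of 16
    K1 = math.ceil(k / K0)
    N1 = math.ceil(n / N0)
    a_shape = [K1, M, K0]                # left matrix, scope L1 Buffer
    b_shape = [K1, N1 * N0, K0]          # right matrix, scope L1 Buffer
    dst_shape = [N1, M, N0]              # result, scope L1OUT Buffer
    return a_shape, b_shape, dst_shape
```

For example, with m = 30, k = 64, n = 160 and int8 inputs (the values used in the Example section), this yields a = [2, 32, 32], b = [2, 160, 32], and dst = [10, 32, 16].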

Table 2 Data type combinations of a, b, and dst

a.dtype    b.dtype    dst.dtype
int8       int8       int32
float16    float16    float32
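One way to read Table 1 together with the shape formulas is as a layout contract: logical element (i, p) of the left matrix lives at a[p // K0][i][p % K0], and logical element (p, j) of the right matrix lives at b[p // K0][j][p % K0]. The pure-Python sketch below models that contract for checking index math on the host. This is an assumed reading of the documented layout, not the hardware implementation.

```python
def fractal_matmul(a, b, m, k, n, a_dtype="float16"):
    """Reference model of matmul over the fractal layouts in Table 1.

    a: [K1][M][K0] nested lists; b: [K1][N][K0] nested lists.
    Returns dst as [N1][M][N0] nested lists, zero-padded as in the hardware output.
    """
    M0, N0 = 16, 16
    K0 = 32 if a_dtype == "int8" else 16
    M = -(-m // M0) * M0                 # m rounded up to a multiple of 16
    N1 = -(-n // N0)                     # n dimension split into N0-wide blocks
    dst = [[[0] * N0 for _ in range(M)] for _ in range(N1)]
    for i in range(m):                   # valid rows of the left matrix
        for j in range(n):               # valid columns of the right matrix
            acc = 0
            for p in range(k):           # reduction over the shared dimension
                acc += a[p // K0][i][p % K0] * b[p // K0][j][p % K0]
            dst[j // N0][i][j % N0] = acc
    return dst
```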

Applicability

Atlas 200/300/500 Inference Product

Atlas Training Series Product

Restrictions

  • Single-step debugging takes a long time, and is therefore not recommended.
  • The tensor[immediate] or tensor[scalar] format denotes a 1-element tensor. To specify a computation start offset, use the tensor[immediate:] or tensor[scalar:] slice format.
  • For the Atlas 200/300/500 Inference Product, the start addresses of the source operands a and b of the instruction must be 512-byte aligned. For example, when tensor slices are input and the source operand is of type float16, tensor[256:] can be used. However, tensor[2:] does not meet the alignment requirement, and an unknown error may occur.
  • The start address of the destination operand dst must be 1024-byte aligned. For example, when tensor slices are input and the destination operand is of type int32, tensor[256:] can be used. However, tensor[2:] does not meet the alignment requirement, and an unknown error may occur.
  • This instruction is mutually exclusive with Vector instructions.
  • The m, k, and n arguments themselves do not need to be multiples of 16. However, due to hardware restrictions, the shapes of the operands dst, a, and b must meet the alignment requirements in Table 1: the m and n dimensions are rounded up to multiples of 16 elements, and the k dimension is rounded up to a multiple of 16 or 32 elements, depending on the operand data type.
  • When n is not a multiple of 16, the invalid data in the n dimension of dst must be handled by the user. When m is not a multiple of 16, the invalid data in the m dimension of dst can be removed by the fixpipe instruction.
  • This instruction should be used together with the fixpipe instruction.
  • For details about the alignment requirements of the operand address offset, see General Restrictions.
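The address-alignment rules above can be verified on the host before building the kernel. The helper below is a hypothetical sketch, not a TIK API: it takes operand start offsets in elements (the value before the colon in tensor[256:]) and element sizes in bytes, and assumes the buffer base addresses themselves are aligned.

```python
def matmul_offsets_aligned(a_off, b_off, dst_off, src_elem_size, dst_elem_size):
    """Return True if the operand start offsets satisfy the matmul alignment rules.

    a_off, b_off, dst_off: offsets in elements (e.g. the 256 in tensor[256:]).
    src_elem_size, dst_elem_size: bytes per element (float16 -> 2, int32 -> 4).
    Sources a and b need 512-byte alignment; dst needs 1024-byte alignment.
    """
    return ((a_off * src_elem_size) % 512 == 0
            and (b_off * src_elem_size) % 512 == 0
            and (dst_off * dst_elem_size) % 1024 == 0)
```

For float16 sources and an int32 destination, an offset of 256 elements passes on all three operands, while an offset of 2 elements fails.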

Returns

None

Example

  • Example: a and b are of type int8, dst is of type int32, and ReLU is implemented using fixpipe.
    from tbe import tik
    tik_instance = tik.Tik()
    # Define the tensors.
    a_gm = tik_instance.Tensor("int8", [2, 32, 32], name='a_gm', scope=tik.scope_gm)
    b_gm = tik_instance.Tensor("int8", [2, 160, 32], name='b_gm', scope=tik.scope_gm)
    # For matmul, m = 30. The fixpipe instruction deletes invalid data from dst_l1out. Therefore, set the m dimension of dst_gm to 30.
    dst_gm = tik_instance.Tensor("int32", [10, 30, 16], name='dst_gm', scope=tik.scope_gm)
    a_l1 = tik_instance.Tensor("int8", [2, 32, 32], name='a_l1', scope=tik.scope_cbuf)
    b_l1 = tik_instance.Tensor("int8", [2, 160, 32], name='b_l1', scope=tik.scope_cbuf)
    dst_l1out = tik_instance.Tensor("int32", [10, 32, 16], name='dst_l1out', scope=tik.scope_cbuf_out)
    # Move data to the source operand.
    tik_instance.data_move(a_l1, a_gm, 0, 1, 64, 0, 0)
    tik_instance.data_move(b_l1, b_gm, 0, 1, 320, 0, 0)
    # Perform matmul. The m, k, and n arguments are 30, 64, and 160, respectively. The m dimension of dst_l1out is a multiple of 16 and rounded up to 32.
    tik_instance.matmul(dst_l1out, a_l1, b_l1, 30, 64, 160)
    # Move data to dst_gm, where burst_len = 30 * 16 * dst_l1out_dtype_size//32 = 60.
    tik_instance.fixpipe(dst_gm, dst_l1out, 10, 60, 0, 0, extend_params={"relu": True})
    tik_instance.BuildCCE(kernel_name="matmul", inputs=[a_gm, b_gm], outputs=[dst_gm])

    Result example:

    Input:
    a_l1 = [[[-1, -1, -1, ..., -1, -1, -1]
             ...
             [-1, -1, -1, ..., -1, -1, -1]]
            [[-1, -1, -1, ..., -1, -1, -1]
             ...
             [-1, -1, -1, ..., -1, -1, -1]]]
    b_l1 = [[[1, 1, 1, ..., 1, 1, 1]
             ...
             [1, 1, 1, ..., 1, 1, 1]]
            [[1, 1, 1, ..., 1, 1, 1]
             ...
             [1, 1, 1, ..., 1, 1, 1]]]
    Output:
    dst_gm = [[[0, 0, 0, ..., 0, 0, 0]
               ...
               [0, 0, 0, ..., 0, 0, 0]]
              ...
              [[0, 0, 0, ..., 0, 0, 0]
               ...
               [0, 0, 0, ..., 0, 0, 0]]]
  • Example: an end-to-end call of the matmul API, in which the inputs a and b support only static shapes of [16, 64] and [64, 1024], respectively. For the full code, see matmul Sample.