matmul
Description
Multiplies tensor a by tensor b and outputs a result tensor.
For details about the data type restrictions, see Table 2.
Prototype
matmul(dst, a, b, m, k, n, init_l1out=True)
Parameters
| Parameter | Input/Output | Description |
|---|---|---|
| dst | Output | Start element of the destination operand. For details about the data type restrictions, see Table 2. The scope is the L1OUT Buffer. A tensor in the format [N1, M, N0], where N = N1 * N0. |
| a | Input | Source operand, tensor of the left matrix. For details about the data type restrictions, see Table 2. The scope is the L1 Buffer. A tensor in the format [K1, M, K0], where K = K1 * K0. |
| b | Input | Source operand, tensor of the right matrix. For details about the data type restrictions, see Table 2. The scope is the L1 Buffer. A tensor in the format [K1, N, K0], where K = K1 * K0. |
| m | Input | An immediate of type int specifying the valid height of the left matrix. Must be in the range [1, 4096]. Note: m does not need to be rounded up to a multiple of 16. |
| k | Input | An immediate of type int specifying the valid width of the left matrix and the valid height of the right matrix. If tensor a is of type float16, the value range is [1, 16384]; if tensor a is of type int8, the value range is [1, 32768]. Note: k does not need to be rounded up to a multiple of 16. |
| n | Input | An immediate of type int specifying the valid width of the right matrix. Must be in the range [1, 4096]. Note: n does not need to be rounded up to a multiple of 16. |
| init_l1out | Input | A bool specifying whether to initialize dst. Defaults to True. |
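The fractal layouts [K1, M, K0], [K1, N, K0], and [N1, M, N0] can be derived from m, k, n, and the input data type. The following is a hedged sketch, not part of the TIK API: the helper name is hypothetical, and the lane sizes (K0 = 32 for int8, K0 = 16 for float16, N0 = 16) are inferred from the int8 example in the Example section.

```python
import math

def fractal_shapes(m, k, n, a_dtype):
    # Lane sizes: K0 is 32 for int8 and 16 for float16 (inferred from the
    # int8 example in the Example section); N0 is always 16.
    k0 = 32 if a_dtype == "int8" else 16
    n0 = 16
    m_aligned = math.ceil(m / 16) * 16   # M rounded up to a multiple of 16
    k1 = math.ceil(k / k0)               # K = K1 * K0
    n1 = math.ceil(n / n0)               # N = N1 * N0
    a_shape = [k1, m_aligned, k0]        # left matrix,  [K1, M, K0]
    b_shape = [k1, n1 * n0, k0]          # right matrix, [K1, N, K0]
    dst_shape = [n1, m_aligned, n0]      # output,      [N1, M, N0]
    return a_shape, b_shape, dst_shape
```

For the int8 example in the Example section (m = 30, k = 64, n = 160), this yields a = [2, 32, 32], b = [2, 160, 32], and dst = [10, 32, 16], matching the tensor definitions there.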
Applicability
Restrictions
- Single-step debugging takes a long time, and is therefore not recommended.
- The tensor[immediate] or tensor[scalar] format indicates a 1-element tensor. To specify the computation start (with offset), use the tensor[immediate:] or tensor[scalar:] format.
- For the Atlas 200/300/500 Inference Product, the start addresses of the source operands a and b of the instruction must be 512-byte aligned. For example, when tensor slices are input and the source operand is of type float16, tensor[256:] can be used. However, tensor[2:] does not meet the alignment requirement, and an unknown error may occur.
- The start address of the destination operand dst must be 1024-byte aligned. For example, when tensor slices are input and the destination operand is of type int32, tensor[256:] can be used. However, tensor[2:] does not meet the alignment requirement, and an unknown error may occur.
- This instruction is mutually exclusive with Vector instructions.
- The m, k, and n arguments themselves do not need to be multiples of 16. However, due to hardware restrictions, the shapes of operands dst, a, and b must meet the following alignment requirements: the M and N dimensions must be rounded up to multiples of 16 elements, and the K dimension must be rounded up to a multiple of 16 or 32 elements, depending on the operand data type (16 for float16, 32 for int8).
- When n is not a multiple of 16, the invalid data in the n dimension of dst must be handled by the user. When m is not a multiple of 16, the invalid data in the m dimension of dst can be removed by the fixpipe instruction. The following figure shows the implementation diagram of the matmul API. The rightmost data block is the output result after dst is processed by the fixpipe API.

- This instruction should be used together with the fixpipe instruction.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
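When pairing matmul with fixpipe, the burst count and burst length can be computed from m, n, and the dst element size. The helper below is a sketch, not a TIK API: its name and signature are hypothetical, and the formula is taken from the comment in the Example section, where burst_len = 30 * 16 * 4 // 32 = 60.

```python
def fixpipe_burst_params(m, n, dst_dtype_size, n0=16):
    # One burst per [M, N0] plane of dst, so the burst count equals N1.
    n1 = (n + n0 - 1) // n0
    # Burst length in 32-byte units, keeping only the m valid rows
    # (this is how fixpipe drops the invalid data in the m dimension).
    burst_len = m * n0 * dst_dtype_size // 32
    return n1, burst_len
```

For the int8 example (m = 30, n = 160, int32 output with 4-byte elements), this gives 10 bursts of length 60, matching the fixpipe call in the Example section.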
Returns
None
Example
- Example: a and b are of type int8, dst is of type int32, and ReLU is implemented using fixpipe.

```python
from tbe import tik

tik_instance = tik.Tik()
# Define the tensors.
a_gm = tik_instance.Tensor("int8", [2, 32, 32], name='a_gm', scope=tik.scope_gm)
b_gm = tik_instance.Tensor("int8", [2, 160, 32], name='b_gm', scope=tik.scope_gm)
# For matmul, m = 30. The fixpipe instruction deletes the invalid data from
# dst_l1out. Therefore, set the m dimension of dst_gm to 30.
dst_gm = tik_instance.Tensor("int32", [10, 30, 16], name='dst_gm', scope=tik.scope_gm)
a_l1 = tik_instance.Tensor("int8", [2, 32, 32], name='a_l1', scope=tik.scope_cbuf)
b_l1 = tik_instance.Tensor("int8", [2, 160, 32], name='b_l1', scope=tik.scope_cbuf)
dst_l1out = tik_instance.Tensor("int32", [10, 32, 16], name='dst_l1out', scope=tik.scope_cbuf_out)
# Move data to the source operands.
tik_instance.data_move(a_l1, a_gm, 0, 1, 64, 0, 0)
tik_instance.data_move(b_l1, b_gm, 0, 1, 320, 0, 0)
# Perform matmul. The m, k, and n arguments are 30, 64, and 160, respectively.
# The m dimension of dst_l1out is rounded up to 32, a multiple of 16.
tik_instance.matmul(dst_l1out, a_l1, b_l1, 30, 64, 160)
# Move data to dst_gm, where burst_len = 30 * 16 * dst_l1out_dtype_size // 32 = 60.
tik_instance.fixpipe(dst_gm, dst_l1out, 10, 60, 0, 0, extend_params={"relu": True})
tik_instance.BuildCCE(kernel_name="matmul", inputs=[a_gm, b_gm], outputs=[dst_gm])
```

Result example:

```
Input:
a_l1 = [[[-1, -1, -1, ..., -1, -1, -1]
         ...
         [-1, -1, -1, ..., -1, -1, -1]]
        [[-1, -1, -1, ..., -1, -1, -1]
         ...
         [-1, -1, -1, ..., -1, -1, -1]]]
b_l1 = [[[1, 1, 1, ..., 1, 1, 1]
         ...
         [1, 1, 1, ..., 1, 1, 1]]
        [[1, 1, 1, ..., 1, 1, 1]
         ...
         [1, 1, 1, ..., 1, 1, 1]]]
Output:
dst_gm = [[[0, 0, 0, ..., 0, 0, 0]
           ...
           [0, 0, 0, ..., 0, 0, 0]]
          ...
          [[0, 0, 0, ..., 0, 0, 0]
           ...
           [0, 0, 0, ..., 0, 0, 0]]]
```

Each output element is the sum over k = 64 of (-1) x 1 = -64, and the ReLU applied by fixpipe clamps the negative values to 0, so dst_gm is all zeros.

- Example: See the end-to-end call example of the matmul API. In this example, the inputs a and b support only static shapes, which are [16, 64] and [64, 1024], respectively. For the call example, see matmul Sample.