fixpipe

Description

Processes the matrix compute result, for example, by adding a bias offset to the result and quantizing it, and moves the data from the L1OUT Buffer to the Global Memory.

Prototype

fixpipe(dst, src, cburst_num, burst_len, dst_stride, src_stride, extend_params=None)

Parameters

Table 1 Parameter description

Parameter

Input/Output

Description

dst

Output

A Tensor of type float16, float32, or int32, for the start element of the destination operand. For details about the data type restrictions, see Table 2. The scope is the Global Memory.

In addition to the bias and quantization operations, fixpipe removes the extra padding data allocated during matrix computation.

If this API is used to process the conv2d result, the format is [cout_blocks, howo, 16].

If this API is used to process the matmul result, the format is [N1, m, N0].

Note: For the meanings of cout_blocks and howo, see the parameter description of conv2d in Parameters.

For the meanings of N1, m, and N0, see the parameter description of matmul in Parameters.

src

Input

A Tensor of type float32 or int32, for the start element of the source operand. For details about the data type restrictions, see Table 2. The scope is the L1OUT Buffer.

The source operand is the result of matrix computation.

If this API is used to process the conv2d result, the format is [cout_blocks, round_howo, 16].

If this API is used to process the matmul result, the format is [N1, M, N0].

Note: For the meanings of cout_blocks and round_howo, see the parameter description of conv2d in Parameters.

For the meanings of N1, M, and N0, see the parameter description of matmul in Parameters.

cburst_num

Input

An immediate of type int specifying the number of bursts. Must be in the range of [1, 4095].

If this API is used to process the conv2d result, the format is [cout_blocks, round_howo, 16], where cburst_num is set to cout_blocks.

If this API is used to process the matmul result, the format is [N1, M, N0], where cburst_num is set to N1.

Note: For the meanings of cout_blocks and round_howo, see the parameter description of conv2d in Parameters.

For the meanings of N1, M, and N0, see the parameter description of matmul in Parameters.

burst_len

Input

Burst length of contiguous data transfer, in the unit of 32 bytes. The value is in the range [1, 65535].

Must be an immediate of type int.

For src, the valid data segment length of each burst is as follows:

  • If this API is used to process the conv2d result, the size is calculated as follows: howo * 16 * src_dtype_size/32 (unit: 32 bytes)
  • If this API is used to process the matmul result, the size is calculated as follows: m * N0 * src_dtype_size/32 (unit: 32 bytes)
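As a sketch under hypothetical shapes (a conv2d result with cout_blocks = 2 and howo = 9 and an int32 src, and a matmul result with N1 = 4, M = 16, N0 = 16 and a float32 src; none of these values come from a specific kernel), cburst_num and burst_len could be derived as follows:

```python
# Hypothetical example shapes; element sizes are in bytes.
src_dtype_size = {"int32": 4, "float32": 4}

# conv2d case: src format is [cout_blocks, round_howo, 16].
cout_blocks, howo = 2, 9
cburst_num_conv2d = cout_blocks
burst_len_conv2d = howo * 16 * src_dtype_size["int32"] // 32  # 32-byte units

# matmul case: src format is [N1, M, N0].
n1, m, n0 = 4, 16, 16
cburst_num_matmul = n1
burst_len_matmul = m * n0 * src_dtype_size["float32"] // 32  # 32-byte units

print(cburst_num_conv2d, burst_len_conv2d)  # 2 18
print(cburst_num_matmul, burst_len_matmul)  # 4 32
```

Note the integer division: burst_len must be an immediate of type int, so the byte count must divide evenly into 32-byte units.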

dst_stride

Input

Tail-to-header stride between adjacent bursts of the dst operand tensor, in the unit of 32 bytes. Must be in the range [0, 65535]. Must be an immediate of type int.

src_stride

Input

Tail-to-header stride between adjacent bursts of the src operand tensor, in the unit of 256 elements. Must be in the range [0, 65535]. Must be an immediate of type int.

This parameter is reserved. To ensure data accuracy, pass 0.

extend_params

Input

A dictionary of extended parameters. Defaults to None. Currently, three keys are supported: bias, quantize_params, and relu, which are described as follows:

1. Key: "bias"

Value: defaults to None, indicating bias disabled.

To enable bias, specify the value as the start element of the bias operand. Has the same data type as src (a Tensor of type int32 or float32). Has shape [Cout,].

Cout: number of convolution kernels if src is the output of conv2d; or the length in the N dimension if src is the output of matmul.

The scope is the L1 Buffer.

2. Key: "quantize_params"

Value: defaults to None, indicating quantization disabled.

If quantization is enabled, the value is a dictionary of two keys: "mode" and "mode_param".

The value of "mode" is a string, for the quantization mode:

  • "int322fp16": int32 to float16 quantization
  • "fp322fp16": float32 to float16 quantization

The value of "mode_param" can be:

  • A Scalar of type float16 or an immediate of type float, for a single scale factor (supported only when "mode" is "int322fp16").
  • A Tensor of type float16, with shape [16] and scope L1 Buffer. Applies to the 16 channels of cout (supported only when bias is disabled and "mode" is "int322fp16").
  • If "mode" is set to "fp322fp16", pass None.

3. Key: "relu"

Value: defaults to False. A bool. False indicates the ReLU function is disabled. True indicates that the ReLU function is enabled.

Notes:

  • ReLU is supported only when the bias function is disabled.
  • ReLU is not supported when quantization is enabled, "mode" is set to "int322fp16", and the "mode_param" argument is a Scalar of type float16 or an immediate of type float.
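To summarize the constraints above, here is a minimal sketch of two extend_params dictionaries that respect the documented rules (the scale factor 0.5 is illustrative, not from a real kernel):

```python
# fp32 -> fp16 quantization with ReLU: "mode_param" must be None for
# "fp322fp16", and ReLU requires bias to be disabled.
params_fp32_relu = {
    "bias": None,
    "quantize_params": {"mode": "fp322fp16", "mode_param": None},
    "relu": True,
}

# int32 -> fp16 quantization with a single immediate scale factor:
# ReLU must stay False when "mode_param" is an immediate of type float.
params_int32_scale = {
    "bias": None,
    "quantize_params": {"mode": "int322fp16", "mode_param": 0.5},
    "relu": False,
}

print(params_fp32_relu["quantize_params"]["mode"])    # fp322fp16
print(params_int32_scale["quantize_params"]["mode"])  # int322fp16
```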

Table 2 Data type combination of src and dst

src.dtype | dst.dtype | extend_params["quantize_params"]
--------- | --------- | --------------------------------
float32   | float16   | "fp322fp16"
int32     | float16   | "int322fp16"
float32   | float32   | None
int32     | int32     | None
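Table 2 amounts to a lookup from the (src.dtype, dst.dtype) pair to the required quantization mode. A small validation sketch (the helper name and dictionary are hypothetical, not part of the TIK API):

```python
# Valid dtype pairs and the matching quantization mode, per Table 2.
VALID_FIXPIPE_COMBOS = {
    ("float32", "float16"): "fp322fp16",
    ("int32", "float16"): "int322fp16",
    ("float32", "float32"): None,
    ("int32", "int32"): None,
}

def required_quant_mode(src_dtype, dst_dtype):
    """Return the quantization mode Table 2 requires, or raise if unsupported."""
    key = (src_dtype, dst_dtype)
    if key not in VALID_FIXPIPE_COMBOS:
        raise ValueError("unsupported fixpipe dtype pair: %s -> %s" % key)
    return VALID_FIXPIPE_COMBOS[key]

print(required_quant_mode("int32", "float16"))    # int322fp16
print(required_quant_mode("float32", "float32"))  # None
```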

Applicability

Atlas 200/300/500 Inference Product

Atlas Training Series Product

Restrictions

  • Single-step debugging takes a long time and is therefore not recommended.
  • The functions enabled in extend_params are executed in the sequence in which they are described above: bias, then quantization, then ReLU.
  • This instruction is mutually exclusive with Vector instructions.
  • For details about the alignment requirements of the operand address offset, see General Restrictions.

Returns

None

Example

  • Example: src is of type int32 and dst is of type float16, bias is disabled, and mode_param is a Tensor.
    from tbe import tik
    tik_instance = tik.Tik()
    dtype_size = {
        "int8": 1,
        "uint8": 1,
        "int16": 2,
        "uint16": 2,
        "float16": 2,
        "int32": 4,
        "uint32": 4,
        "float32": 4,
        "int64": 8,
    }
    fm_dtype = "uint8"
    ker_dtype = "int8"
    deq_dtype = "float16"
    dst_dtype = "int32"
    fm_shape = [1, 4, 4, 32]
    kernel_shape = [1, 2, 2, 32, 32]
    dst_shape = [2, 9, 16]
    dst_l1_shape = [2, 16, 16]
    deq_shape = [16]
    # Convolution stride, [stride_h, stride_w]
    stride = [1, 1]
    # Padding factors, in the format of [pad_left, pad_right, pad_top, pad_bottom]
    pad = [0, 0, 0, 0]
    # Convolution dilation factors, in the format of [dilation_h, dilation_w]
    dilation = [1, 1]
    # Padding value
    pad_value = 0
    # Define the tensors.
    feature_map_gm = tik_instance.Tensor(fm_dtype, fm_shape, name='feature_map_gm', scope=tik.scope_gm)
    weight_gm = tik_instance.Tensor(ker_dtype, kernel_shape, name='weight_gm', scope=tik.scope_gm)
    deqscale_gm = tik_instance.Tensor(deq_dtype, deq_shape, name='deqscale_gm', scope=tik.scope_gm)
    dst_gm = tik_instance.Tensor(deq_dtype, dst_shape, name='dst_gm', scope=tik.scope_gm)
    feature_map = tik_instance.Tensor(fm_dtype, fm_shape, name='feature_map', scope=tik.scope_cbuf)
    weight = tik_instance.Tensor(ker_dtype, kernel_shape, name='weight', scope=tik.scope_cbuf)
    deqscale = tik_instance.Tensor(deq_dtype, deq_shape, name='deqscale', scope=tik.scope_cbuf)
    dst_l1out = tik_instance.Tensor(dst_dtype, dst_l1_shape, name='dst_l1out', scope=tik.scope_cbuf_out)
    # Move data from the Global Memory to the source operand tensor.
    tik_instance.data_move(feature_map, feature_map_gm, 0, 1, 16, 0, 0)
    tik_instance.data_move(weight, weight_gm, 0, 1, 128, 0, 0)
    tik_instance.data_move(deqscale, deqscale_gm, 0, 1, 1, 0, 0)
    # Perform convolution.
    tik_instance.conv2d(dst_l1out, feature_map, weight, fm_shape, kernel_shape, stride, pad, dilation, pad_value)
    # Perform quantization using fixpipe.
    # Number of transferred data segments. When conv2d data is processed, cburst_num is set to cout_blocks. When matmul data is processed, cburst_num is set to N1.
    cburst_num = dst_l1_shape[0]
    # Length of a contiguously transferred data segment. When conv2d data is processed, the length is howo*16*src_dtype_size/32. When matmul data is processed, the length is m*N0*src_dtype_size/32.
    burst_len = dst_l1_shape[1] * 16 * dtype_size[dst_dtype] // 32
    # dst_stride and src_stride: interval between adjacent bursts, that is, the distance between the tail of one burst and the header of the next. The value 0 is used as an example.
    dst_stride, src_stride = 0, 0
    
    tik_instance.fixpipe(dst_gm, dst_l1out, cburst_num, burst_len, dst_stride, src_stride, extend_params={"bias": None, "quantize_params": {"mode": "int322fp16", "mode_param": deqscale}})
    tik_instance.BuildCCE(kernel_name="fixpipe", inputs=[feature_map_gm, weight_gm, deqscale_gm], outputs=[dst_gm])

    Result example:

    Input:
    feature_map_gm:
    [[[[3, 2, 4, 2, ..., 4, 3]]]]
    weight_gm:
    [[[[[0, -5, -3, ..., -4, -2]]]]]
    deqscale_gm:
    [ 0.1214, -0.2238, ..., 0.4883, 0.2788]
    Output:
    dst_gm:
    [[[-13.48, 39.38, -114.8, 30.38, ..., 9.766, -24.81]]]
  • Example: src is of type float32 and dst is of type float16, bias is enabled, and mode_param is None.
    from tbe import tik
    tik_instance = tik.Tik()
    # Define the tensors.
    feature_map_gm = tik_instance.Tensor("float16", [2, 4, 4, 16], name='feature_map_gm', scope=tik.scope_gm)
    weight_gm = tik_instance.Tensor("float16", [2, 2, 2, 16, 16], name='weight_gm', scope=tik.scope_gm)
    bias_gm = tik_instance.Tensor("float32", (16,), name='bias_gm', scope=tik.scope_gm)
    dst_gm = tik_instance.Tensor("float16", [1, 4, 16], name='dst_gm', scope=tik.scope_gm)
    feature_map = tik_instance.Tensor("float16", [2, 4, 4, 16], name='feature_map', scope=tik.scope_cbuf)
    weight = tik_instance.Tensor("float16", [2, 2, 2, 16, 16], name='weight', scope=tik.scope_cbuf)
    bias = tik_instance.Tensor("float32", (16,), name='bias', scope=tik.scope_cbuf)
    dst_l1out = tik_instance.Tensor("float32", [1, 16, 16], name='dst_l1out', scope=tik.scope_cbuf_out)
    # Move data from the Global Memory to the source operand tensor.
    tik_instance.data_move(feature_map, feature_map_gm, 0, 1, 32, 0, 0)
    tik_instance.data_move(weight, weight_gm, 0, 1, 128, 0, 0)
    tik_instance.data_move(bias, bias_gm, 0, 1, 2, 0, 0)
    # Perform convolution.
    tik_instance.conv2d(dst_l1out, feature_map, weight, [2, 4, 4, 16], [2, 2, 2, 16, 16], [1, 1], [0, 0, 0, 0], [2, 2], 0)
    # Perform bias and quantization using fixpipe.
    tik_instance.fixpipe(dst_gm, dst_l1out, 1, 8, 0, 0, extend_params={"bias": bias, "quantize_params": {"mode": "fp322fp16", "mode_param": None}})
    tik_instance.BuildCCE(kernel_name="conv2d", inputs=[feature_map_gm, weight_gm, bias_gm], outputs=[dst_gm])

    Result example:

    Inputs:
    feature_map_gm:
    [[[[0.0, 0.01, 0.02, 0.03, 0.04, ..., 5.09, 5.1, 5.11]]]]
    weight_gm:
    [[[[[0.0, 0.01, 0.02, 0.03, 0.04, ..., 20.46, 20.47]]]]]
    bias_gm:
    [0.0, 1.0, 2.0, 3.0, ..., 14.0, 15.0]
    
    Output:
    dst_gm:
    [[[3568., 3614., 3660., 3704., 3750., 3794., 3840., 3884., 3930.,
       3976., 4020., 4066., 4110., 4156., 4200., 4250.],
      [3754., 3802., 3850., 3900., 3948., 3996., 4044., 4094., 4140.,
       4188., 4240., 4290., 4336., 4384., 4430., 4480.],
      [4308., 4370., 4424., 4484., 4544., 4600., 4660., 4716., 4776.,
       4830., 4892., 4950., 5010., 5068., 5124., 5184.],
      [4496., 4556., 4616., 4680., 4740., 4804., 4864., 4924., 4988.,
       5050., 5108., 5172., 5230., 5296., 5356., 5416.]]]