vec_reduce_min
Description
Obtains the minimum value among the input data and, optionally, its corresponding index.

If there are multiple minimum values, refer to Restrictions to determine which one is returned.
Prototype
vec_reduce_min(mask, dst, src, work_tensor, repeat_times, src_rep_stride, cal_index=False)
Parameters
| Parameter | Input/Output | Description |
|---|---|---|
| mask | Input | For details, see the description of the mask parameter in Table 1. |
| dst | Input | Start element of the destination tensor operand. The start address must be 4-byte aligned. The tensor scope is the Unified Buffer. |
| src | Input | Start element of the source tensor operand. For details about the alignment requirements on the start address, see General Restrictions. The tensor scope is the Unified Buffer. |
| work_tensor | Input | A tensor that stores intermediate results during instruction execution. Pay attention to its required size; for details, see the restrictions of each instruction. |
| repeat_times | Input | Number of repeats (iterations). Must be a Scalar of type int32, an immediate of type int, or an Expr of type int32. An immediate is recommended because it provides higher performance. |
| src_rep_stride | Input | Repeat stride of the source operand between the corresponding blocks of successive iterations. Must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. |
| cal_index | Input | A bool specifying whether to also obtain the index of the extreme value (supported only by vec_reduce_max and vec_reduce_min). Defaults to False. |

dst, src, and work_tensor must have the same data type.
Returns
None
Applicability
Restrictions
- The argument of repeat_times is a Scalar of type int32, an immediate of type int, or an Expr of type int32.
- When cal_index is set to False, repeat_times ∈ [1, 4095].
- When cal_index is set to True:
  - For the Atlas 200/300/500 Inference Product, repeat_times ∈ [1, 511]. The index can be at most 65504, the maximum value representable in float16; therefore, a maximum of 511 iterations is supported.
  - For the Atlas Training Series Product, repeat_times ∈ [1, 511], for the same reason.
- src_rep_stride ∈ [0, 65535]. Must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
- The storage sequence of the dst result is: minimum value, then its corresponding index. The index is stored as an integer. For example, if dst is defined as float16, the index data is actually uint16, so reading it in float16 format produces an incorrect value. Call the reinterpret_cast_to() method to convert the float16-typed index data to the corresponding integer type.
- Restrictions for the work_tensor space are as follows:
- If cal_index is set to False, at least (repeat_times x 2) elements are required. For example, when repeat_times is set to 120, work_tensor must hold at least 240 elements.
- When cal_index is set to True, the space size is calculated by using the following formula.
# DTYPE_SIZE indicates the data type size, in bytes. For example, float16 occupies 2 bytes.
elements_per_block = 32 // DTYPE_SIZE[dtype]    # Number of elements required by each block.
elements_per_repeat = 256 // DTYPE_SIZE[dtype]  # Number of elements required for each repeat.
it1_output_count = 2 * repeat_times             # Number of elements generated in the first iteration.

def ceil_div(a_value, b_value):
    # Perform division and round up the result.
    return (a_value + b_value - 1) // b_value

it2_align_start = ceil_div(it1_output_count, elements_per_block) * elements_per_block  # Offset of the start position of the second iteration.
it2_output_count = ceil_div(it1_output_count, elements_per_repeat) * 2                 # Number of elements generated in the second iteration.
it3_align_start = ceil_div(it2_output_count, elements_per_block) * elements_per_block  # Offset of the start position of the third iteration.
it3_output_count = ceil_div(it2_output_count, elements_per_repeat) * 2                 # Number of elements generated in the third iteration.
it4_align_start = ceil_div(it3_output_count, elements_per_block) * elements_per_block  # Offset of the start position of the fourth iteration.
it4_output_count = ceil_div(it3_output_count, elements_per_repeat) * 2                 # Number of elements generated in the fourth iteration.
final_work_tensor_need_size = it2_align_start + it3_align_start + it4_align_start + it4_output_count  # Finally required work_tensor size.
- Address overlapping between dst and work_tensor is not allowed.
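As a sanity check, the sizing formula above can be evaluated directly in plain Python. The snippet below is a standalone sketch: DTYPE_SIZE is assumed here to be a simple dict (in TIK it is provided by the framework), and the function wraps the formula for the cal_index=True case.

```python
# Standalone sketch of the work_tensor sizing formula (cal_index=True case).
# DTYPE_SIZE is assumed as a plain dict here; TIK provides its own mapping.
DTYPE_SIZE = {"float16": 2, "float32": 4}

def ceil_div(a_value, b_value):
    # Perform division and round up the result.
    return (a_value + b_value - 1) // b_value

def work_tensor_size(dtype, repeat_times):
    elements_per_block = 32 // DTYPE_SIZE[dtype]
    elements_per_repeat = 256 // DTYPE_SIZE[dtype]
    it1_output_count = 2 * repeat_times
    it2_align_start = ceil_div(it1_output_count, elements_per_block) * elements_per_block
    it2_output_count = ceil_div(it1_output_count, elements_per_repeat) * 2
    it3_align_start = ceil_div(it2_output_count, elements_per_block) * elements_per_block
    it3_output_count = ceil_div(it2_output_count, elements_per_repeat) * 2
    it4_align_start = ceil_div(it3_output_count, elements_per_block) * elements_per_block
    it4_output_count = ceil_div(it3_output_count, elements_per_repeat) * 2
    return it2_align_start + it3_align_start + it4_align_start + it4_output_count

print(work_tensor_size("float16", 120))  # 240 + 16 + 16 + 2 = 274
```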
Example
from tbe import tik
tik_instance = tik.Tik()
src_gm = tik_instance.Tensor("float16", (256,), name="src_gm", scope=tik.scope_gm)
dst_gm = tik_instance.Tensor("float16", (16,), name="dst_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float16", (256,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (16,), name="dst_ub", scope=tik.scope_ubuf)
work_tensor_ub = tik_instance.Tensor("float16", (18,), tik.scope_ubuf, "work_tensor_ub")
# Move the user input from the Global Memory to the Unified Buffer.
tik_instance.data_move(src_ub, src_gm, 0, 1, 16, 0, 0)
# Assign 0 to dst_ub as its initial value to show the difference between the inputs and outputs more clearly.
tik_instance.vec_dup(16, dst_ub, 0, 1, 1)
tik_instance.vec_reduce_min(128, dst_ub, src_ub, work_tensor_ub, 2, 8, cal_index=True)
# Move the compute result from the Unified Buffer to the Global Memory.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 1, 0, 0)
tik_instance.BuildCCE(kernel_name="vec_reduce_min", inputs=[src_gm], outputs=[dst_gm])
Result example:
Input (src_gm):
[ 6.344 1.954 -3.246 9.555 9.29 -6.95 7.586 5.67 3.2 -3.234 0.4087 -5.777 -7.156 -4.71 -1.587 4.023
  2.242 3.72 3.064 6.21 -1.627 -5.69 1.263 -0.577 0.1646 1.894 -9.945 -7.18 3.72 1.8955 -9.53 -0.0693
  -4.836 9.13 1.122 4.766 -7.957 -5.07 -6.93 9.47 2.912 5.938 -2.617 -9.97 3.99 2.805 1.9375 9.04
  0.03117 0.6147 -2.37 -6.297 -4.715 -5.89 3.137 -0.976 5.79 0.02997 -2.07 -9.17 -1.8 3.498 -9.266 4.562
  7.023 -8.78 -1.974 7.508 -2.205 2.162 -2.775 4.1 6.633 -8.1 3.334 -4.52 6.26 -5.516 -7.223 -8.586
  5. -3.787 3.836 -1.833 -1.063 -6.188 -8.36 -5.91 6.812 6.137 -7.746 9.2 -5.492 2.799 -7.613 9.695
  -4.074 3.188 7.145 -7.477 -7.01 6.137 -4.77 8.42 4.336 9.836 -6.86 3.363 1.32 -4.36 2.629 -3.09
  9.53 -9.24 8.99 3.355 9.77 1.465 2.826 9.82 8.94 1.122 9.516 -2.146 -4.566 -4.414 -1.057 -7.97
  0.4263 0.171 -4.254 1.046 -3.191 9.36 3.287 -0.6216 -4.957 -3.605 -9.35 -6.863 1.472 5.06 -9.75 -9.54
  -3.707 6.92 5.88 -4.855 -8.21 2.697 8.17 7.836 6.008 -4.402 -1.721 6.39 -3.992 -5.33 6.56 -9.305
  0.7114 -5.58 8.93 -1.598 -8.164 -5.805 0.4302 -8.766 -9.414 8.81 4.6 -9.83 6.5 7.38 -0.9526 8.04
  5.008 4.863 -8.36 0.4224 8.68 -8.484 -0.9536 8.164 0.8613 -4.027 -4.848 -2.875 6.67 -7.75 -6.133 0.08655
  -9.96 7.266 6.082 5.883 -8.32 -8.87 -8.945 4.363 -1.022 -8.57 6.66 -2.553 0.0873 8.89 -4.336 5.01
  8.24 -2.953 3.93 9.695 6.227 8.625 -7.926 9.51 6.27 -3.28 0.911 -7.938 -7.48 -9.94 7.57 -9.62
  3.654 -5.867 4.45 4.492 -3.129 4.523 1.69 3.947 0.9663 -9.766 9.48 -4.652 -1.876 -2.516 2.34 7.16
  5.223 5.473 1.373 1.489 1.538 -0.3293 -8.61 1.894 -2.56 -2.379 8.48 -4.81 -0.877 -7.85 -2.148 1.938 ]
Output (dst_gm):
[-9.97e+00 2.56e-06 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00]
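The second output element (2.56e-06) is the index of the minimum, stored as uint16 bits but misread here in float16 format, as described in Restrictions. Outside TIK, the reinterpretation that reinterpret_cast_to() performs can be simulated with the standard struct module — this is a sketch of the concept, not a TIK API:

```python
import struct

# The index is stored as 16 raw bits; read back as float16 it prints as
# 2.56e-06. Reinterpreting the same 2 bytes as uint16 recovers the index.
bits = struct.pack("<e", 2.56e-06)    # pack the value as an IEEE 754 float16
(index,) = struct.unpack("<H", bits)  # reinterpret those bytes as uint16
print(index)  # 43: position of the minimum value (-9.97) in src_gm
```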
For more examples, see vec_reduce_max.
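As a rough behavioral sketch (not the hardware implementation), the reduction with cal_index=True can be modeled in plain Python: each of repeat_times repeats reads mask consecutive elements, and successive repeats are offset by src_rep_stride blocks of 16 float16 elements. The tie-breaking rule used here (first occurrence wins) is an assumption for illustration only.

```python
def vec_reduce_min_model(mask, src, repeat_times, src_rep_stride,
                         elements_per_block=16):
    # Plain-Python sketch of the reduction semantics. float16 is assumed,
    # so one 32-byte block holds 16 elements. Returns (min_value, index).
    best_val, best_idx = None, None
    for r in range(repeat_times):
        base = r * src_rep_stride * elements_per_block
        for i in range(mask):
            v = src[base + i]
            if best_val is None or v < best_val:  # keep first occurrence on ties (assumption)
                best_val, best_idx = v, base + i
    return best_val, best_idx

# Two repeats of 128 elements with a stride of 8 blocks (contiguous layout),
# matching the mask/repeat_times/src_rep_stride used in the example above.
data = list(range(128, 0, -1)) + list(range(128))  # minimum 0 sits at index 128
print(vec_reduce_min_model(128, data, 2, 8))  # (0, 128)
```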