vec_reduce_min
Description
Obtains the minimum value among the input data and, optionally, its corresponding index.

If there are multiple minimum values, refer to Restrictions to determine which one is returned.
Prototype
vec_reduce_min(mask, dst, src, work_tensor, repeat_times, src_rep_stride, cal_index=False)
Parameters
| Parameter | Input/Output | Description |
|---|---|---|
| mask | Input | For details, see the description of the mask parameter in Table 1. |
| dst | Input | Start element of the destination tensor operand. The start address must be 4-byte aligned. The tensor scope is the Unified Buffer. |
| src | Input | Start element of the source tensor operand. For details about the alignment requirements on the start address, see General Restrictions. The tensor scope is the Unified Buffer. |
| work_tensor | Input | A tensor that stores intermediate results during instruction execution. Pay attention to its required size; for details, see the restrictions of each instruction. |
| repeat_times | Input | Number of repeats (iterations). Must be a Scalar of type int32, an immediate of type int, or an Expr of type int32. An immediate is recommended because it provides higher performance. |
| src_rep_stride | Input | Repeat stride of the source operand between the corresponding blocks of successive iterations. Must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. |
| cal_index | Input | A bool specifying whether to also obtain the index of the extreme value (supported only by vec_reduce_max and vec_reduce_min). Defaults to False. |

dst, src, and work_tensor must have the same data type.
Returns
None
Applicability
Restrictions
- The argument of repeat_times is a Scalar of type int32, an immediate of type int, or an Expr of type int32.
- When cal_index is set to False, repeat_times ∈ [1, 4095].
- When cal_index is set to True:
  - For the Atlas 200/300/500 Inference Product, repeat_times ∈ [1, 511]. The index can be at most 65504, the maximum value representable in float16; therefore, a maximum of 511 iterations is supported.
  - For the Atlas Training Series Product, repeat_times ∈ [1, 511], for the same reason.
- src_rep_stride ∈ [0, 65535]. Must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
- The storage sequence of the dst result is: minimum value, then its corresponding index. The index is stored as an integer. For example, if dst is defined as float16, the index data is actually uint16, so reading it in float16 format produces an incorrect value. Call the reinterpret_cast_to() method to convert the float16-typed index data to the corresponding integer type.
- Restrictions for the work_tensor space are as follows:
- If cal_index is set to False, at least (repeat_times x 2) elements are required. For example, when repeat_times is set to 120, work_tensor must hold at least 240 elements.
- When cal_index is set to True, the space size is calculated by using the following formula.
# DTYPE_SIZE indicates the data type size, in bytes. For example, float16 occupies 2 bytes.
elements_per_block = 32 // DTYPE_SIZE[dtype]    # Number of elements required by each block.
elements_per_repeat = 256 // DTYPE_SIZE[dtype]  # Number of elements required for each repeat.
it1_output_count = 2 * repeat_times             # Number of elements generated in the first iteration.

def ceil_div(a_value, b_value):
    # Perform division and round up the result.
    return (a_value + b_value - 1) // b_value

it2_align_start = ceil_div(it1_output_count, elements_per_block) * elements_per_block  # Offset of the start position of the second iteration.
it2_output_count = ceil_div(it1_output_count, elements_per_repeat) * 2                 # Number of elements generated in the second iteration.
it3_align_start = ceil_div(it2_output_count, elements_per_block) * elements_per_block  # Offset of the start position of the third iteration.
it3_output_count = ceil_div(it2_output_count, elements_per_repeat) * 2                 # Number of elements generated in the third iteration.
it4_align_start = ceil_div(it3_output_count, elements_per_block) * elements_per_block  # Offset of the start position of the fourth iteration.
it4_output_count = ceil_div(it3_output_count, elements_per_repeat) * 2                 # Number of elements generated in the fourth iteration.
final_work_tensor_need_size = it2_align_start + it3_align_start + it4_align_start + it4_output_count  # Finally required work_tensor size.
- Address overlapping between dst and work_tensor is not allowed.
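As a sanity check, the sizing formula above can be evaluated directly in plain Python. The snippet below is a standalone sketch: DTYPE_SIZE is assumed here to be a simple dict (in TIK it is provided by the framework), and the function wraps the formula for the cal_index=True case.

```python
# Standalone sketch of the work_tensor sizing formula (cal_index=True case).
# DTYPE_SIZE is assumed as a plain dict here; TIK provides its own mapping.
DTYPE_SIZE = {"float16": 2, "float32": 4}

def ceil_div(a_value, b_value):
    # Perform division and round up the result.
    return (a_value + b_value - 1) // b_value

def work_tensor_size(dtype, repeat_times):
    elements_per_block = 32 // DTYPE_SIZE[dtype]
    elements_per_repeat = 256 // DTYPE_SIZE[dtype]
    it1_output_count = 2 * repeat_times
    it2_align_start = ceil_div(it1_output_count, elements_per_block) * elements_per_block
    it2_output_count = ceil_div(it1_output_count, elements_per_repeat) * 2
    it3_align_start = ceil_div(it2_output_count, elements_per_block) * elements_per_block
    it3_output_count = ceil_div(it2_output_count, elements_per_repeat) * 2
    it4_align_start = ceil_div(it3_output_count, elements_per_block) * elements_per_block
    it4_output_count = ceil_div(it3_output_count, elements_per_repeat) * 2
    return it2_align_start + it3_align_start + it4_align_start + it4_output_count

print(work_tensor_size("float16", 120))  # 240 + 16 + 16 + 2 = 274
```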
Example
from tbe import tik
tik_instance = tik.Tik()
src_gm = tik_instance.Tensor("float16", (256,), name="src_gm", scope=tik.scope_gm)
dst_gm = tik_instance.Tensor("float16", (16,), name="dst_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float16", (256,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (16,), name="dst_ub", scope=tik.scope_ubuf)
work_tensor_ub = tik_instance.Tensor("float16", (18,), tik.scope_ubuf, "work_tensor_ub")
# Move the user input from the Global Memory to the Unified Buffer.
tik_instance.data_move(src_ub, src_gm, 0, 1, 16, 0, 0)
# Assign 0 to dst_ub as its initial value to show the difference between the inputs and outputs more clearly.
tik_instance.vec_dup(16, dst_ub, 0, 1, 1)
tik_instance.vec_reduce_min(128, dst_ub, src_ub, work_tensor_ub, 2, 8, cal_index=True)
# Move the compute result from the Unified Buffer to the Global Memory.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 1, 0, 0)
tik_instance.BuildCCE(kernel_name="vec_reduce_min", inputs=[src_gm], outputs=[dst_gm])
Result example:
Input (src_gm):
[ 6.344 1.954 -3.246 9.555 9.29 -6.95 7.586 5.67 3.2 -3.234 0.4087 -5.777 -7.156 -4.71 -1.587 4.023
  2.242 3.72 3.064 6.21 -1.627 -5.69 1.263 -0.577 0.1646 1.894 -9.945 -7.18 3.72 1.8955 -9.53 -0.0693
  -4.836 9.13 1.122 4.766 -7.957 -5.07 -6.93 9.47 2.912 5.938 -2.617 -9.97 3.99 2.805 1.9375 9.04
  0.03117 0.6147 -2.37 -6.297 -4.715 -5.89 3.137 -0.976 5.79 0.02997 -2.07 -9.17 -1.8 3.498 -9.266 4.562
  7.023 -8.78 -1.974 7.508 -2.205 2.162 -2.775 4.1 6.633 -8.1 3.334 -4.52 6.26 -5.516 -7.223 -8.586
  5. -3.787 3.836 -1.833 -1.063 -6.188 -8.36 -5.91 6.812 6.137 -7.746 9.2 -5.492 2.799 -7.613 9.695
  -4.074 3.188 7.145 -7.477 -7.01 6.137 -4.77 8.42 4.336 9.836 -6.86 3.363 1.32 -4.36 2.629 -3.09
  9.53 -9.24 8.99 3.355 9.77 1.465 2.826 9.82 8.94 1.122 9.516 -2.146 -4.566 -4.414 -1.057 -7.97
  0.4263 0.171 -4.254 1.046 -3.191 9.36 3.287 -0.6216 -4.957 -3.605 -9.35 -6.863 1.472 5.06 -9.75 -9.54
  -3.707 6.92 5.88 -4.855 -8.21 2.697 8.17 7.836 6.008 -4.402 -1.721 6.39 -3.992 -5.33 6.56 -9.305
  0.7114 -5.58 8.93 -1.598 -8.164 -5.805 0.4302 -8.766 -9.414 8.81 4.6 -9.83 6.5 7.38 -0.9526 8.04
  5.008 4.863 -8.36 0.4224 8.68 -8.484 -0.9536 8.164 0.8613 -4.027 -4.848 -2.875 6.67 -7.75 -6.133 0.08655
  -9.96 7.266 6.082 5.883 -8.32 -8.87 -8.945 4.363 -1.022 -8.57 6.66 -2.553 0.0873 8.89 -4.336 5.01
  8.24 -2.953 3.93 9.695 6.227 8.625 -7.926 9.51 6.27 -3.28 0.911 -7.938 -7.48 -9.94 7.57 -9.62
  3.654 -5.867 4.45 4.492 -3.129 4.523 1.69 3.947 0.9663 -9.766 9.48 -4.652 -1.876 -2.516 2.34 7.16
  5.223 5.473 1.373 1.489 1.538 -0.3293 -8.61 1.894 -2.56 -2.379 8.48 -4.81 -0.877 -7.85 -2.148 1.938 ]
Output (dst_gm):
[-9.97e+00 2.56e-06 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00]
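The second output element (2.56e-06) is the index of the minimum, stored as uint16 bits but misread here in float16 format, as described in Restrictions. Outside TIK, the reinterpretation that reinterpret_cast_to() performs can be simulated with the standard struct module — this is a sketch of the concept, not a TIK API:

```python
import struct

# The index is stored as 16 raw bits; read back as float16 it prints as
# 2.56e-06. Reinterpreting the same 2 bytes as uint16 recovers the index.
bits = struct.pack("<e", 2.56e-06)    # pack the value as an IEEE 754 float16
(index,) = struct.unpack("<H", bits)  # reinterpret those bytes as uint16
print(index)  # 43: position of the minimum value (-9.97) in src_gm
```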
For more examples, see vec_reduce_max.
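As a rough behavioral sketch (not the hardware implementation), the reduction with cal_index=True can be modeled in plain Python: each of repeat_times repeats reads mask consecutive elements, and successive repeats are offset by src_rep_stride blocks of 16 float16 elements. The tie-breaking rule used here (first occurrence wins) is an assumption for illustration only.

```python
def vec_reduce_min_model(mask, src, repeat_times, src_rep_stride,
                         elements_per_block=16):
    # Plain-Python sketch of the reduction semantics. float16 is assumed,
    # so one 32-byte block holds 16 elements. Returns (min_value, index).
    best_val, best_idx = None, None
    for r in range(repeat_times):
        base = r * src_rep_stride * elements_per_block
        for i in range(mask):
            v = src[base + i]
            if best_val is None or v < best_val:  # keep first occurrence on ties (assumption)
                best_val, best_idx = v, base + i
    return best_val, best_idx

# Two repeats of 128 elements with a stride of 8 blocks (contiguous layout),
# matching the mask/repeat_times/src_rep_stride used in the example above.
data = list(range(128, 0, -1)) + list(range(128))  # minimum 0 sits at index 128
print(vec_reduce_min_model(128, data, 2, 8))  # (0, 128)
```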