vec_reduce_max
Description
Obtains the maximum value and its corresponding index position among the input data.

If there are multiple maximum values, see Restrictions to determine which one is returned.
Prototype
```python
vec_reduce_max(mask, dst, src, work_tensor, repeat_times, src_rep_stride, cal_index=False)
```
Parameters
| Parameter | Input/Output | Description |
|---|---|---|
| mask | Input | For details, see the description of the mask parameter in Table 1. |
| dst | Output | Start element of the destination tensor operand. The start address must be 4-byte aligned. The scope of the tensor is the Unified Buffer. |
| src | Input | Start element of the source tensor operand. For details about the alignment requirements on the start address, see General Restrictions. The scope of the tensor is the Unified Buffer. |
| work_tensor | Input | A tensor that stores intermediate results during execution. Pay attention to its required size; for details, see Restrictions. |
| repeat_times | Input | Number of iterations (repeats). Must be a Scalar of type int32, an immediate of type int, or an Expr of type int32. An immediate is recommended because it offers higher performance. |
| src_rep_stride | Input | Repeat stride of the source operand between corresponding blocks of successive iterations. Must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of those types. |
| cal_index | Input | A bool that specifies whether to also obtain the index of the maximum value. Defaults to False. |

dst, src, and work_tensor must have the same data type.
Returns
None
Restrictions
- The argument of repeat_times is a Scalar of type int32, an immediate of type int, or an Expr of type int32.
- When cal_index is set to False, repeat_times ∈ [1, 4095].
- When cal_index is set to True, the largest possible index must still be representable in the operand data type:
- If the operands are of type int16, the largest representable value is 32767, so at most 255 iterations are supported: repeat_times ∈ [1, 255].
- If the operands are of type float16, the largest representable value is 65504, so at most 511 iterations are supported: repeat_times ∈ [1, 511].
- If the operands are of type float32, repeat_times ∈ [1, 4095].
- src_rep_stride ∈ [0, 65535]. Must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
- The storage sequence in dst is: maximum value first, then its index. The index is stored as raw integer bits. For example, if dst is defined as float16 while the index data is actually uint16, reading the index in float16 format yields a meaningless value. Call the reinterpret_cast_to() method to reinterpret the float16-typed index data as the corresponding integer type.
- Address overlapping between dst and work_tensor is not allowed.
- Restrictions for the work_tensor space are as follows:
- If cal_index is set to False, at least (repeat_times x 2) elements are required. For example, if repeat_times = 120, work_tensor must contain at least 240 elements.
- When cal_index is set to True, the space size is calculated by using the following formula. For more examples, see Example.
```python
# DTYPE_SIZE indicates the data type size in bytes; for example, float16 occupies 2 bytes.
# elements_per_block is the number of elements per block (32 bytes).
elements_per_block = 32 // DTYPE_SIZE[dtype]
# elements_per_repeat is the number of elements processed per repeat (256 bytes).
elements_per_repeat = 256 // DTYPE_SIZE[dtype]
# Number of elements generated in the first iteration.
it1_output_count = 2 * repeat_times

def ceil_div(a_value, b_value):
    # Divide and round up the result.
    return (a_value + b_value - 1) // b_value

# Offset of the start position of the second iteration.
it2_align_start = ceil_div(it1_output_count, elements_per_block) * elements_per_block
# Number of elements generated in the second iteration.
it2_output_count = ceil_div(it1_output_count, elements_per_repeat) * 2
# Offset of the start position of the third iteration.
it3_align_start = ceil_div(it2_output_count, elements_per_block) * elements_per_block
# Number of elements generated in the third iteration.
it3_output_count = ceil_div(it2_output_count, elements_per_repeat) * 2
# Offset of the start position of the fourth iteration.
it4_align_start = ceil_div(it3_output_count, elements_per_block) * elements_per_block
# Number of elements generated in the fourth iteration.
it4_output_count = ceil_div(it3_output_count, elements_per_repeat) * 2
# Finally required work_tensor size.
final_work_tensor_need_size = it2_align_start + it3_align_start + it4_align_start + it4_output_count
```
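The formula above can be wrapped as a small runnable helper (the function name is ours, not part of the TIK API). It always reserves space for the full four iterations, which matches the Scalar repeat_times cases in the examples below:

```python
DTYPE_SIZE = {"int16": 2, "float16": 2, "float32": 4}

def ceil_div(a_value, b_value):
    # Divide and round up the result.
    return (a_value + b_value - 1) // b_value

def work_tensor_need_size(dtype, repeat_times):
    """Elements needed in work_tensor when cal_index=True (four-iteration formula)."""
    elements_per_block = 32 // DTYPE_SIZE[dtype]
    elements_per_repeat = 256 // DTYPE_SIZE[dtype]
    output_count = 2 * repeat_times  # output of the first iteration
    total = 0
    for _ in range(3):  # iterations 2, 3 and 4
        # Each further iteration starts at a block-aligned offset ...
        total += ceil_div(output_count, elements_per_block) * elements_per_block
        # ... and produces 2 elements per full repeat consumed.
        output_count = ceil_div(output_count, elements_per_repeat) * 2
    return total + output_count

print(work_tensor_need_size("float16", 65))  # 178
print(work_tensor_need_size("float32", 65))  # 154
```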
Example
- [Example 1]
src, work_tensor, and dst are Tensors of type float16. src has shape (65, 128), and repeat_times of vec_reduce_max and vec_reduce_min is 65.
The following is an API call example:
```python
tik_instance.vec_reduce_max(128, dst, src, work_tensor, 65, 8, cal_index=True)
```
The space of work_tensor is calculated as follows:
```python
elements_per_block = 16
elements_per_repeat = 128
it1_output_count = 2 * 65  # 130 elements

def ceil_div(a_value, b_value):
    return (a_value + b_value - 1) // b_value

it2_align_start = ceil_div(130, 16) * 16   # 144 elements
it2_output_count = ceil_div(130, 128) * 2  # 4 elements
it3_align_start = ceil_div(4, 16) * 16     # 16 elements
it3_output_count = ceil_div(4, 128) * 2    # 2 elements
```
The final maximum value and its index are obtained after three iterations. The required work_tensor space = it2_align_start + it3_align_start + it3_output_count = 144 + 16 + 2 = 162 elements.
- [Example 2]
src, work_tensor, and dst are Tensors of type float16. src has shape (65, 128), and repeat_times of vec_reduce_max and vec_reduce_min is a Scalar with the value 65. If repeat_times is a Scalar or contains a Scalar, four iterations of calculation are required.
The following is an API call example:
```python
scalar = tik_instance.Scalar(init_value=65, dtype="int32")
tik_instance.vec_reduce_max(128, dst, src, work_tensor, scalar, 8, cal_index=True)
```
The space of work_tensor is calculated as follows:
```python
elements_per_block = 16
elements_per_repeat = 128
it1_output_count = 2 * 65  # 130 elements

def ceil_div(a_value, b_value):
    return (a_value + b_value - 1) // b_value

it2_align_start = ceil_div(130, 16) * 16   # 144 elements
it2_output_count = ceil_div(130, 128) * 2  # 4 elements
it3_align_start = ceil_div(4, 16) * 16     # 16 elements
it3_output_count = ceil_div(4, 128) * 2    # 2 elements
it4_align_start = ceil_div(2, 16) * 16     # 16 elements
it4_output_count = ceil_div(2, 128) * 2    # 2 elements
```
When repeat_times is a Scalar or contains a Scalar, the result is actually ready after the third repeat, but the Scalar value is unknown at Python compilation time, so space for one more repeat must be reserved. The required work_tensor space = it2_align_start + it3_align_start + it4_align_start + it4_output_count = 144 + 16 + 16 + 2 = 178 elements.
- [Example 3]
src, work_tensor, and dst are Tensors of type float32. src has shape (65, 64), and repeat_times of vec_reduce_max and vec_reduce_min is 65.
The following is an API call example:
```python
tik_instance.vec_reduce_max(64, dst, src, work_tensor, 65, 8, cal_index=True)
```
The space of work_tensor is calculated as follows:
```python
elements_per_block = 8
elements_per_repeat = 64
it1_output_count = 2 * 65  # 130 elements

def ceil_div(a_value, b_value):
    return (a_value + b_value - 1) // b_value

it2_align_start = ceil_div(130, 8) * 8    # 136 elements
it2_output_count = ceil_div(130, 64) * 2  # 6 elements
it3_align_start = ceil_div(6, 8) * 8      # 8 elements
it3_output_count = ceil_div(6, 64) * 2    # 2 elements
```
The final maximum value and its index are obtained after three iterations. The required work_tensor space = it2_align_start + it3_align_start + it3_output_count = 136 + 8 + 2 = 146 elements.
- [Example 4]
src, work_tensor, and dst are float32 tensors. The shape of src is (65, 64), and repeat_times of vec_reduce_max and vec_reduce_min is a Scalar with the value 65. If repeat_times is a Scalar or contains a Scalar, four iterations of calculation are required.
The following is an API call example:
```python
scalar = tik_instance.Scalar(init_value=65, dtype="int32")
tik_instance.vec_reduce_max(64, dst, src, work_tensor, scalar, 8, cal_index=True)
```
The space of work_tensor is calculated as follows:
```python
elements_per_block = 8
elements_per_repeat = 64
it1_output_count = 2 * 65  # 130 elements

def ceil_div(a_value, b_value):
    return (a_value + b_value - 1) // b_value

it2_align_start = ceil_div(130, 8) * 8    # 136 elements
it2_output_count = ceil_div(130, 64) * 2  # 6 elements
it3_align_start = ceil_div(6, 8) * 8      # 8 elements
it3_output_count = ceil_div(6, 64) * 2    # 2 elements
it4_align_start = ceil_div(2, 8) * 8      # 8 elements
it4_output_count = ceil_div(2, 64) * 2    # 2 elements
```
When repeat_times is a Scalar or contains a Scalar, the result is actually ready after the third repeat, but the Scalar value is unknown at Python compilation time, so space for one more repeat must be reserved. The required work_tensor space = it2_align_start + it3_align_start + it4_align_start + it4_output_count = 136 + 8 + 8 + 2 = 154 elements.
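When repeat_times is an immediate, as in Examples 1 and 3, the reduction can stop as soon as only one (value, index) pair remains. A minimal sketch (the helper name is ours, not a TIK API) that reproduces the sizes derived above:

```python
DTYPE_SIZE = {"float16": 2, "float32": 4}

def ceil_div(a_value, b_value):
    # Divide and round up the result.
    return (a_value + b_value - 1) // b_value

def work_tensor_size_immediate(dtype, repeat_times):
    """Elements needed in work_tensor when cal_index=True and repeat_times is an immediate."""
    elements_per_block = 32 // DTYPE_SIZE[dtype]
    elements_per_repeat = 256 // DTYPE_SIZE[dtype]
    output_count = 2 * repeat_times  # output of the first iteration
    total = 0
    while output_count > 2:  # iterate until one (value, index) pair remains
        # Each further iteration starts at a block-aligned offset ...
        total += ceil_div(output_count, elements_per_block) * elements_per_block
        # ... and shrinks the data to 2 elements per full repeat consumed.
        output_count = ceil_div(output_count, elements_per_repeat) * 2
    return total + output_count

print(work_tensor_size_immediate("float16", 65))  # 162 (Example 1)
print(work_tensor_size_immediate("float32", 65))  # 146 (Example 3)
```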
- Example 1
```python
from tbe import tik

tik_instance = tik.Tik()
src_gm = tik_instance.Tensor("float16", (256,), name="src_gm", scope=tik.scope_gm)
dst_gm = tik_instance.Tensor("float16", (16,), name="dst_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float16", (256,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (16,), name="dst_ub", scope=tik.scope_ubuf)
work_tensor_ub = tik_instance.Tensor("float16", (18,), tik.scope_ubuf, "work_tensor_ub")
# Move the user input from the Global Memory to the Unified Buffer.
tik_instance.data_move(src_ub, src_gm, 0, 1, 16, 0, 0)
# Initialize the destination Unified Buffer to 0 so the difference between inputs and outputs is clear.
tik_instance.vec_dup(16, dst_ub, 0, 1, 1)
tik_instance.vec_reduce_max(128, dst_ub, src_ub, work_tensor_ub, 2, 8, cal_index=True)
# Move the compute result from the Unified Buffer to the Global Memory.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 1, 0, 0)
tik_instance.BuildCCE(kernel_name="vec_reduce_max", inputs=[src_gm], outputs=[dst_gm])
```
Result example:
```
Input (src_gm):
[-3.326 -6.883 3.607 -0.969 -0.179 2.254 -3.957 3.242 6.133 -3.559 3.656 -9.88 2.19 4.707
 -7.027 -3.598 -3.264 4.44 6.04 -6.35 0.525 -6.492 0.341 -4.477 1.375 6.484 -7.957 -1.243
 -9.586 -2.871 -6.688 2.088 5. -1.808 -5.62 9.47 1.311 2.69 8.58 9.3 5.754 -6.25
 4.516 -6.6 -0.331 -8.586 4.844 9.81 7.695 -0.332 -7.137 -2.79 2.66 5.316 8.72 1.954
 5.043 -7.816 1.207 2.508 -5.06 -1.697 8.5 -6.637 -0.647 -1.211 -3.229 -3.074 7.89 5.043
 -3.059 -0.7544 9.484 -2.809 -7.145 -1.051 9.45 7.688 6.695 -2.318 -0.3562 -0.674 1.736 2.994
 -2.018 -2.605 -7.113 6.09 -1.766 6.574 -4.47 7.367 -7.93 6.88 7.83 6.527 5.816 -3.135
 6.195 -6.734 -8.85 1.705 -5.023 5.992 6.062 -3.342 8.03 -0.748 0.9883 3.191 2.75 8.39
 9.17 -5.887 1.378 -8.77 -9.05 -3.11 -7.203 9.79 9.64 3.945 9.32 7.812 7.066 0.664
 5.234 -4.61 -3.559 -7.73 1.441 5.434 8.23 4.785 -1.231 8.03 0.293 -0.1658 -5.48 -3.293
 8.89 -7.926 -9.66 1.597 0.5396 9.25 -6.74 7.086 -0.954 8.96 2.318 -2.395 -9.19 -6.176
 -4.297 -7.812 -1.787 -5.39 6.5 9.055 -0.9556 2.4 2.092 7.35 0.7017 1.548 -2.637 -5.145
 -2.938 5.617 -3.451 7.5 -5.426 -7.62 7.535 -9.14 -8.7 -3.436 2.283 -6.18 2.836 5.707
 -1.356 8.664 1.625 -3.717 1.478 -6.67 -4.023 2.652 4.805 -8.25 2.63 -1.394 -3.227 1.595
 7.49 7.574 -3.053 -1.841 -7.06 0.4524 -5.71 5.37 8.72 8.51 4.836 -5.05 -7.043 5.188
 -5.332 5.62 -0.6465 5.773 8.53 7.793 -4.215 7.47 -2.451 8.18 5.543 -7.367 7.105 -0.10364
 4.465 0.3362 0.9287 2.447 -9.87 7.844 2.084 4.527 7.582 -3.217 -5.695 -6.375 0.627 2.24
 6.625 -9.55 -5.613 7.055 9.48 -6.613 5.49 5.066 4.117 9.516 -4.594 -0.781 2.102 9.94
 6.49 -7.82 0.11975 3.146]
Output (dst_gm):
[9.938e+00 1.496e-05 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00]
```
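To see why reinterpret_cast_to() is needed, note that the index 1.496e-05 in the output above is the float16 reading of the raw uint16 bit pattern 251, the position of the maximum 9.94. NumPy's view() performs the same bit-level reinterpretation off-device; this is a conceptual sketch, not TIK code:

```python
import numpy as np

# The raw index 251 written into a float16 slot, bit for bit.
raw_index = np.array([251], dtype=np.uint16)
dst = np.zeros(16, dtype=np.float16)
dst[0] = 9.94                          # the maximum value, readable as-is
dst[1:2] = raw_index.view(np.float16)  # reads as ~1.496e-05, meaningless as float16

# Reinterpret the bits as uint16 to recover the true index.
index = int(dst[1:2].view(np.uint16)[0])
print(index)  # 251
```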
- Example 2
```python
from tbe import tik

def ceil_div(a_value, b_value):
    # Divide and round up the result.
    return (a_value + b_value - 1) // b_value

tik_instance = tik.Tik()
dtype_size = {
    "int8": 1,
    "uint8": 1,
    "int16": 2,
    "uint16": 2,
    "float16": 2,
    "int32": 4,
    "uint32": 4,
    "float32": 4,
    "int64": 8,
}
# Tensor shapes.
src_shape = (3, 128)
dst_shape = (64,)
# Data volume.
src_elements = 3 * 128
dst_elements = 64
# Data type.
dtype = "float16"
src_gm = tik_instance.Tensor(dtype, src_shape, name="src_gm", scope=tik.scope_gm)
dst_gm = tik_instance.Tensor(dtype, dst_shape, name="dst_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor(dtype, src_shape, name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor(dtype, dst_shape, name="dst_ub", scope=tik.scope_ubuf)
# Copy the user input to the source Unified Buffer.
# Number of segments to move.
nburst = 1
# Length of each moved segment, in units of 32 bytes.
burst = src_elements * dtype_size[dtype] // 32 // nburst
dst_burst = dst_elements * dtype_size[dtype] // 32 // nburst
# Stride between the previous burst tail and the next burst head, in units of 32 bytes.
dst_stride, src_stride = 0, 0
tik_instance.data_move(src_ub, src_gm, 0, nburst, burst, dst_stride, src_stride)
# Assign the initial value 0 to the destination Unified Buffer. For details about vec_dup, see the corresponding section.
tik_instance.vec_dup(64, dst_ub, 0, 1, 1)
# mask indicates the number of source operands in each iteration. The value range varies with the
# data type; see the corresponding section. In this example, 34 operands are processed per iteration.
mask = 34
# cal_index specifies whether to obtain the index of the maximum value. Here, True is used as an example.
cal_index = True
# Configure the iterations based on your actual requirements. Here, six iterations are used as an example.
repeat_times = tik_instance.Scalar(dtype="int32", init_value=6)
elements_per_block = 32 // dtype_size[dtype]    # 16
elements_per_repeat = 256 // dtype_size[dtype]  # 128
it1_output_count = 2 * 6  # 12
# Offset of the start position of the second repeat; ceil_div divides and rounds up. Here, 16.
it2_align_start = ceil_div(it1_output_count, elements_per_block) * elements_per_block
# Number of elements generated in the second repeat. Here, 2.
it2_output_count = ceil_div(it1_output_count, elements_per_repeat) * 2
# Offset of the start position of the third repeat. Here, 16.
it3_align_start = ceil_div(it2_output_count, elements_per_block) * elements_per_block
# Number of elements generated in the third repeat. Here, 2.
it3_output_count = ceil_div(it2_output_count, elements_per_repeat) * 2
# Offset of the start position of the fourth repeat. Here, 16.
it4_align_start = ceil_div(it3_output_count, elements_per_block) * elements_per_block
# Number of elements generated in the fourth repeat. Here, 2.
it4_output_count = ceil_div(it3_output_count, elements_per_repeat) * 2
# Required space: 50 elements.
final_work_tensor_need_size = it2_align_start + it3_align_start + it4_align_start + it4_output_count
cal_shape = (final_work_tensor_need_size,)
work_tensor_ub = tik_instance.Tensor(dtype, cal_shape, tik.scope_ubuf, "work_tensor_ub")
tik_instance.vec_dup(final_work_tensor_need_size, work_tensor_ub, 0, 1, 1)
# Stride between operand heads of adjacent iterations, in blocks. Here, the head of the first
# iteration is three blocks away from the head of the second.
src_rep_stride = 3
tik_instance.vec_reduce_max(mask, dst_ub, src_ub, work_tensor_ub, repeat_times, src_rep_stride, cal_index=cal_index)
# In this example, work_tensor_ub holds [3.30e+01 1.97e-06 8.10e+01 1.97e-06 1.29e+02 1.97e-06 1.77e+02 1.97e-06
# 2.25e+02 1.97e-06 2.73e+02 1.97e-06 0. 0. ...], the results of the six iterations.
# Copy the compute result to the destination Global Memory.
tik_instance.data_move(dst_gm, dst_ub, 0, nburst, dst_burst, dst_stride, src_stride)
tik_instance.BuildCCE(kernel_name="vec_reduce_max", inputs=[src_gm], outputs=[dst_gm])
```
Result example:
Input (src_gm):
[[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13.
14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27.
28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41.
42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55.
56. 57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69.
70. 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83.
84. 85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95. 96. 97.
98. 99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111.
112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125.
126. 127.]
[128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139. 140. 141.
142. 143. 144. 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155.
156. 157. 158. 159. 160. 161. 162. 163. 164. 165. 166. 167. 168. 169.
170. 171. 172. 173. 174. 175. 176. 177. 178. 179. 180. 181. 182. 183.
184. 185. 186. 187. 188. 189. 190. 191. 192. 193. 194. 195. 196. 197.
198. 199. 200. 201. 202. 203. 204. 205. 206. 207. 208. 209. 210. 211.
212. 213. 214. 215. 216. 217. 218. 219. 220. 221. 222. 223. 224. 225.
226. 227. 228. 229. 230. 231. 232. 233. 234. 235. 236. 237. 238. 239.
240. 241. 242. 243. 244. 245. 246. 247. 248. 249. 250. 251. 252. 253.
254. 255.]
[256. 257. 258. 259. 260. 261. 262. 263. 264. 265. 266. 267. 268. 269.
270. 271. 272. 273. 274. 275. 276. 277. 278. 279. 280. 281. 282. 283.
284. 285. 286. 287. 288. 289. 290. 291. 292. 293. 294. 295. 296. 297.
298. 299. 300. 301. 302. 303. 304. 305. 306. 307. 308. 309. 310. 311.
312. 313. 314. 315. 316. 317. 318. 319. 320. 321. 322. 323. 324. 325.
326. 327. 328. 329. 330. 331. 332. 333. 334. 335. 336. 337. 338. 339.
340. 341. 342. 343. 344. 345. 346. 347. 348. 349. 350. 351. 352. 353.
354. 355. 356. 357. 358. 359. 360. 361. 362. 363. 364. 365. 366. 367.
368. 369. 370. 371. 372. 373. 374. 375. 376. 377. 378. 379. 380. 381.
382. 383.]]
Output (dst_gm):
[2.73e+02 1.63e-05 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00
0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00
0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00
0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00
0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00
0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00
0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00
0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00]