General Definition
Description
This is a generic form (instruction template) for an instruction with two source operands: a vector (src) and a Scalar (scalar).
Note that this section describes only the general format; it does not correspond to any out-of-the-box instruction.
Prototype
instruction (mask, dst, src, scalar, repeat_times, dst_rep_stride, src_rep_stride, mask_mode="normal")
Parameters
| Parameter | Input/Output | Description |
|---|---|---|
| instruction | Input | A string specifying the instruction name. Only lowercase letters are supported in TIK DSL. |
| mask | Input | Specifies which elements of each repeat participate in computation. There are two modes, selected by mask_mode. In normal mode, mask is either a single continuous count (the first n elements of each repeat are computed) or a bitwise list of two 64-bit integers [mask_h, mask_l], where each set bit enables the corresponding element. In counter mode, mask specifies the total number of elements to compute. |
| dst | Output | Destination Vector operand, which points to the start element of the tensor. The supported data types vary depending on the specific instruction. The scope of the tensor is the Unified Buffer. |
| src | Input | Source Vector operand, which points to the start element of the tensor. The supported data types vary depending on the specific instruction. The scope of the tensor is the Unified Buffer. |
| scalar | Input | A Scalar or an immediate, serving as the source Scalar operand. |
| repeat_times | Input | Number of repeat iterations. |
| dst_rep_stride | Input | Repeat stride of the destination Vector operand, that is, the stride between the corresponding blocks of successive iterations. |
| src_rep_stride | Input | Repeat stride of the source Vector operand, that is, the stride between the corresponding blocks of successive iterations. |
| mask_mode | Input | A string specifying the mask mode. Value range: "normal" (default) and "counter". With "normal", mask is interpreted in normal mode; with "counter", mask is interpreted in counter mode. |
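To make the two mask modes concrete, the following pure-Python sketch (not TIK API code; the lane count of 128 per repeat assumes 16-bit operands such as float16) models which element lanes a given mask enables:

```python
def enabled_lanes(mask, mask_mode="normal", lanes_per_repeat=128):
    """Model of which lanes a mask enables within one repeat.

    Illustrative only, not the TIK API. Assumes 16-bit operands
    (128 lanes per repeat).
    """
    if mask_mode == "counter":
        # Counter mode: mask is the total number of elements to compute;
        # within one repeat, at most lanes_per_repeat lanes are enabled.
        return list(range(min(mask, lanes_per_repeat)))
    if isinstance(mask, (list, tuple)):
        # Bitwise mode: [mask_h, mask_l], one bit per lane.
        bits = (mask[0] << 64) | mask[1]
        return [i for i in range(lanes_per_repeat) if (bits >> i) & 1]
    # Continuous mode: the first `mask` lanes are computed.
    return list(range(mask))

# The two mask forms used in the examples below are equivalent for float16:
assert enabled_lanes(128) == enabled_lanes([2**64 - 1, 2**64 - 1])
```

The final assertion mirrors the comment in the first example below: for float16, a continuous mask of 128 and the bitwise mask [2**64-1, 2**64-1] enable the same lanes.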
Restrictions
- repeat_times ∈ [0, 255]. Must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int (other than 0), or an Expr of type int16/int32/int64/uint16/uint32/uint64.
- dst_rep_stride and src_rep_stride are within the range of [0, 255], in the unit of 32 bytes. Must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
- The addresses of dst and src must not overlap.
- The argument of the scalar parameter is a Scalar of type int/float or an immediate of type int/float.
- To save memory space, you can define a tensor reused by the source and destination operands (which means they have overlapped addresses). The general instruction restrictions are as follows.
- In the event of a single repeat (repeat_times = 1), the source operand must completely overlap the destination operand.
- In the event of multiple repeats (repeat_times > 1), if there is a dependency between the source operand and the destination operand, that is, the destination operand of the Nth iteration is the source operand of the (N+1)th iteration, address overlapping is not allowed.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
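The addressing implied by repeat_times and the rep_stride parameters can be sketched in plain Python (an illustrative model, not TIK API code; it assumes float16 data, i.e. 16 elements per 32-byte block):

```python
def touched_indices(mask, repeat_times, rep_stride, elems_per_block=16):
    """Element offsets an operand touches across all iterations.

    Illustrative only, not the TIK API. rep_stride is in 32-byte blocks,
    measured between the starts of successive repeats; mask is a
    continuous element count.
    """
    out = []
    for r in range(repeat_times):
        base = r * rep_stride * elems_per_block
        out.extend(base + i for i in range(mask))
    return out

# Contiguous case from the first example below: mask=128 with a stride of
# 8 blocks means each repeat starts right where the previous one ended.
assert touched_indices(128, 4, 8) == list(range(512))
```

With mask=32 and rep_stride=4 (the discontiguous example below), each repeat of 32 elements is followed by a 32-element gap, which produces the interleaved output layout shown there.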
Example
- Example of contiguous data operations
""" Add 2 to all source operands and place them back to the destination operation space. """ tik_instance = tik.Tik() dtype_size = { "int8": 1, "uint8": 1, "int16": 2, "uint16": 2, "float16": 2, "int32": 4, "uint32": 4, "float32": 4, "int64": 8, } shape = (2, 256) dtype = "float16" elements = 2 * 256 # Number of operations per iteration, which is 128 in the current example. In bitwise mode, mask can be represented as [2**64-1, 2**64-1]. mask = 128 # Number of iterations, which is 4 in the current example. You can adjust the number of iterations as required. repeat_times = 4 # Iteration stride between the previous repeat header and the next repeat header of the destination operand. The unit is 32 bytes. In the current example, the destination operand is placed contiguously. If data does not need to be processed contiguously, adjust the corresponding parameter. dst_rep_stride = 8 src_rep_stride = 8 src_gm = tik_instance.Tensor(dtype, shape, name="src_gm", scope=tik.scope_gm) dst_gm = tik_instance.Tensor(dtype, shape, name="dst_gm", scope=tik.scope_gm) src_ub = tik_instance.Tensor(dtype, shape, name="src_ub", scope=tik.scope_ubuf) dst_ub = tik_instance.Tensor(dtype, shape, name="dst_ub", scope=tik.scope_ubuf) # Initialize and define scalar data. scalar = tik_instance.Scalar(dtype=dtype, init_value=2) # Number of moved segments. nburst = 1 # Length of the moved segment each time, in 32 bytes. burst = elements * dtype_size[dtype] // 32 // nburst # Stride between the previous burst tail and the next burst header, in 32 bytes. dst_stride, src_stride = 0, 0 # Copy the user input to the source Unified Buffer. tik_instance.data_move(src_ub, src_gm, 0, nburst, burst, src_stride, dst_stride) tik_instance.vec_adds(mask, dst_ub, src_ub, scalar, repeat_times, dst_rep_stride, src_rep_stride) # Copy the compute result to the destination Global Memory. 
tik_instance.data_move(dst_gm, dst_ub, 0, nburst, burst, src_stride, dst_stride) tik_instance.BuildCCE(kernel_name="vec_adds", inputs=[src_gm], outputs=[dst_gm]) Result example: Input (src_gm): [[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69. 70. 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83. 84. 85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95. 96. 97. 98. 99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111. 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125. 126. 127. 128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139. 140. 141. 142. 143. 144. 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155. 156. 157. 158. 159. 160. 161. 162. 163. 164. 165. 166. 167. 168. 169. 170. 171. 172. 173. 174. 175. 176. 177. 178. 179. 180. 181. 182. 183. 184. 185. 186. 187. 188. 189. 190. 191. 192. 193. 194. 195. 196. 197. 198. 199. 200. 201. 202. 203. 204. 205. 206. 207. 208. 209. 210. 211. 212. 213. 214. 215. 216. 217. 218. 219. 220. 221. 222. 223. 224. 225. 226. 227. 228. 229. 230. 231. 232. 233. 234. 235. 236. 237. 238. 239. 240. 241. 242. 243. 244. 245. 246. 247. 248. 249. 250. 251. 252. 253. 254. 255.] [256. 257. 258. 259. 260. 261. 262. 263. 264. 265. 266. 267. 268. 269. 270. 271. 272. 273. 274. 275. 276. 277. 278. 279. 280. 281. 282. 283. 284. 285. 286. 287. 288. 289. 290. 291. 292. 293. 294. 295. 296. 297. 298. 299. 300. 301. 302. 303. 304. 305. 306. 307. 308. 309. 310. 311. 312. 313. 314. 315. 316. 317. 318. 319. 320. 321. 322. 323. 324. 325. 326. 327. 328. 329. 330. 331. 332. 333. 334. 335. 336. 337. 338. 339. 340. 341. 342. 343. 344. 345. 346. 347. 348. 349. 350. 351. 352. 353. 354. 355. 356. 357. 358. 359. 360. 361. 362. 363. 364. 365. 366. 367. 368. 369. 370. 371. 372. 373. 374. 375. 376. 377. 378. 379. 380. 381. 
382. 383. 384. 385. 386. 387. 388. 389. 390. 391. 392. 393. 394. 395. 396. 397. 398. 399. 400. 401. 402. 403. 404. 405. 406. 407. 408. 409. 410. 411. 412. 413. 414. 415. 416. 417. 418. 419. 420. 421. 422. 423. 424. 425. 426. 427. 428. 429. 430. 431. 432. 433. 434. 435. 436. 437. 438. 439. 440. 441. 442. 443. 444. 445. 446. 447. 448. 449. 450. 451. 452. 453. 454. 455. 456. 457. 458. 459. 460. 461. 462. 463. 464. 465. 466. 467. 468. 469. 470. 471. 472. 473. 474. 475. 476. 477. 478. 479. 480. 481. 482. 483. 484. 485. 486. 487. 488. 489. 490. 491. 492. 493. 494. 495. 496. 497. 498. 499. 500. 501. 502. 503. 504. 505. 506. 507. 508. 509. 510. 511.]] Output (dst_gm): [[ 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69. 70. 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83. 84. 85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95. 96. 97. 98. 99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111. 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125. 126. 127. 128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139. 140. 141. 142. 143. 144. 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155. 156. 157. 158. 159. 160. 161. 162. 163. 164. 165. 166. 167. 168. 169. 170. 171. 172. 173. 174. 175. 176. 177. 178. 179. 180. 181. 182. 183. 184. 185. 186. 187. 188. 189. 190. 191. 192. 193. 194. 195. 196. 197. 198. 199. 200. 201. 202. 203. 204. 205. 206. 207. 208. 209. 210. 211. 212. 213. 214. 215. 216. 217. 218. 219. 220. 221. 222. 223. 224. 225. 226. 227. 228. 229. 230. 231. 232. 233. 234. 235. 236. 237. 238. 239. 240. 241. 242. 243. 244. 245. 246. 247. 248. 249. 250. 251. 252. 253. 254. 255. 256. 257.] [258. 259. 260. 261. 262. 263. 264. 265. 266. 267. 268. 269. 270. 271. 272. 273. 274. 275. 276. 277. 278. 279. 280. 281. 282. 283. 284. 285. 286. 287. 
288. 289. 290. 291. 292. 293. 294. 295. 296. 297. 298. 299. 300. 301. 302. 303. 304. 305. 306. 307. 308. 309. 310. 311. 312. 313. 314. 315. 316. 317. 318. 319. 320. 321. 322. 323. 324. 325. 326. 327. 328. 329. 330. 331. 332. 333. 334. 335. 336. 337. 338. 339. 340. 341. 342. 343. 344. 345. 346. 347. 348. 349. 350. 351. 352. 353. 354. 355. 356. 357. 358. 359. 360. 361. 362. 363. 364. 365. 366. 367. 368. 369. 370. 371. 372. 373. 374. 375. 376. 377. 378. 379. 380. 381. 382. 383. 384. 385. 386. 387. 388. 389. 390. 391. 392. 393. 394. 395. 396. 397. 398. 399. 400. 401. 402. 403. 404. 405. 406. 407. 408. 409. 410. 411. 412. 413. 414. 415. 416. 417. 418. 419. 420. 421. 422. 423. 424. 425. 426. 427. 428. 429. 430. 431. 432. 433. 434. 435. 436. 437. 438. 439. 440. 441. 442. 443. 444. 445. 446. 447. 448. 449. 450. 451. 452. 453. 454. 455. 456. 457. 458. 459. 460. 461. 462. 463. 464. 465. 466. 467. 468. 469. 470. 471. 472. 473. 474. 475. 476. 477. 478. 479. 480. 481. 482. 483. 484. 485. 486. 487. 488. 489. 490. 491. 492. 493. 494. 495. 496. 497. 498. 499. 500. 501. 502. 503. 504. 505. 506. 507. 508. 509. 510. 511. 512. 513.]] - Example of discontiguous data operations
""" Add 2 to 128 source operands using vadds, and then place the destination data at an interval of 32 operands for every 32 operands. """ tik_instance = tik.Tik() dtype_size = { "int8": 1, "uint8": 1, "int16": 2, "uint16": 2, "float16": 2, "int32": 4, "uint32": 4, "float32": 4, "int64": 8, } src_shape = (4, 32) dst_shape = (8, 32) dtype = "float16" src_elements = 4 * 32 dst_elements = 8 * 32 # Number of operations per iteration, which is 32 in the current example. In bitwise mode, mask can be represented as [0, 2**32-1]. mask = 32 # Number of iterations, which is 4 in the current example. You can adjust the number of iterations as required. repeat_times = 4 # Iteration stride between the previous repeat header and the next repeat header of the destination operand. The unit is 32 bytes. Because there are 32 operations in each iteration and every 16 operands are interspaced with another 16 operands, the destination operand needs to be placed at an iteration interval of four blocks. dst_rep_stride = 4 src_rep_stride = 2 src_gm = tik_instance.Tensor(dtype, src_shape, name="src_gm", scope=tik.scope_gm) dst_gm = tik_instance.Tensor(dtype, dst_shape, name="dst_gm", scope=tik.scope_gm) src_ub = tik_instance.Tensor(dtype, src_shape, name="src_ub", scope=tik.scope_ubuf) dst_ub = tik_instance.Tensor(dtype, dst_shape, name="dst_ub", scope=tik.scope_ubuf) scalar = tik_instance.Scalar(dtype=dtype, init_value=2) # Number of moved segments. nburst = 1 # Length of the moved segment each time, in 32 bytes. burst = src_elements * dtype_size[dtype] // 32 // nburst dst_burst = dst_elements * dtype_size[dtype] // 32 // nburst # Stride between the previous burst tail and the next burst header, in 32 bytes. dst_stride, src_stride = 0, 0 # Copy the user input to the source Unified Buffer. tik_instance.data_move(src_ub, src_gm, 0, nburst, burst, src_stride, dst_stride) # Set dst_ub to 0. For details about this parameter, see the description of the related instruction. 
tik_instance.vec_dup(128, dst_ub, 0, 2, 8) tik_instance.vec_adds(mask, dst_ub, src_ub, scalar, repeat_times, dst_rep_stride, src_rep_stride) # Copy the compute result to the destination Global Memory. tik_instance.data_move(dst_gm, dst_ub, 0, nburst, dst_burst, src_stride, dst_stride) tik_instance.BuildCCE(kernel_name="vec_adds", inputs=[src_gm], outputs=[dst_gm]) Result example: Input (src_gm): [[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.] [ 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. 58. 59. 60. 61. 62. 63.] [ 64. 65. 66. 67. 68. 69. 70. 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83. 84. 85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95.] [ 96. 97. 98. 99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111. 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125. 126. 127.]] Output (dst_gm): [[ 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33.] [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [ 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. 58. 59. 60. 61. 62. 63. 64. 65.] [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [ 66. 67. 68. 69. 70. 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83. 84. 85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95. 96. 97.] [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [ 98. 99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111. 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125. 126. 127. 128. 129.] [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
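The interleaved layout of the second example can be reproduced off-device with a small NumPy model of the vec_adds access pattern (an illustrative sketch, not the TIK API; it assumes float16, i.e. 16 elements per 32-byte block):

```python
import numpy as np

def simulate_vec_adds(dst, src, scalar, mask, repeat_times,
                      dst_rep_stride, src_rep_stride, elems_per_block=16):
    """Plain-NumPy model of the vec_adds access pattern (float16 assumed).

    Strides are in 32-byte blocks, measured between the starts of
    successive repeats; mask is a continuous element count.
    """
    dst_flat = dst.reshape(-1)  # views share memory with dst
    src_flat = src.reshape(-1)
    for r in range(repeat_times):
        d = r * dst_rep_stride * elems_per_block
        s = r * src_rep_stride * elems_per_block
        dst_flat[d:d + mask] = src_flat[s:s + mask] + scalar

src = np.arange(128, dtype=np.float16).reshape(4, 32)
dst = np.zeros((8, 32), dtype=np.float16)
simulate_vec_adds(dst, src, 2, mask=32, repeat_times=4,
                  dst_rep_stride=4, src_rep_stride=2)
# Even rows of dst hold src + 2; odd rows remain zero, matching the
# result example above.
```

Running this model reproduces the alternating computed/zero rows of dst_gm shown in the discontiguous result example, which can help sanity-check stride parameters before building the kernel.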