General Definition
Description
This is a generic format for a pair-reduce instruction, which uniformly processes adjacent pairs of source elements in each block of the current iteration. Note that it does not provide an out-of-the-box instruction code.
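As an illustrative sketch of the pair-reduce semantics (the function name is hypothetical, not part of the TIK API), one repeat reduces each adjacent pair of source elements into a single result element:

```python
import numpy as np

def pair_reduce_repeat(src, mask, op=np.add):
    """Illustrative model of one repeat: reduce each adjacent pair
    of the first `mask` source elements into one result element."""
    pairs = np.asarray(src[:mask]).reshape(-1, 2)
    return op.reduce(pairs, axis=1)

# 8 source elements -> 4 pair sums
print(pair_reduce_repeat(np.arange(8, dtype=np.float32), mask=8))
# → [ 1.  5.  9. 13.]
```

With `op=np.maximum` the same model describes a pair-max reduction; the generic format covers any such pairwise operation.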
Prototype
instruction (mask, dst, src, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
| Parameter | Input/Output | Description |
|---|---|---|
| instruction | Input | A string specifying the instruction name. Only lowercase letters are supported in TIK DSL. |
| mask | Input | For details, see the description of the mask parameter in Table 1. |
| dst | Output | Start element of the destination Tensor operand. The scope of the tensor is the Unified Buffer. |
| src | Input | Start element of the source Tensor operand. The scope of the tensor is the Unified Buffer. |
| repeat_times | Input | Number of repeats (iterations). |
| dst_rep_stride | Input | Repeat stride of the destination operand between the corresponding blocks of successive iterations. |
| src_rep_stride | Input | Repeat stride of the source operand between the corresponding blocks of successive iterations. |
Restrictions
- repeat_times ∈ [0, 255]. Must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. If repeat_times is an immediate, 0 is not supported.
- The parallelism degree in each repeat depends on the data precision and SoC version. The following uses PAR to represent the parallelism degree.
- dst_rep_stride and src_rep_stride
  - Must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
  - For the Atlas 200/300/500 Inference Product, if dst_rep_stride is set to 0, the value 1 is used.
  - For the Atlas Training Series Product, if dst_rep_stride is set to 0, the value 1 is used.
  - Note that dst_rep_stride is implemented differently for this instruction: its unit is 128 bytes rather than one 32-byte block.
- To save memory space, you can define a tensor reused by the source and destination operands (which means they have overlapped addresses). The general instruction restrictions are as follows.
- In the event of a single repeat (repeat_times = 1), the source operand must completely overlap the destination operand.
- In the event of multiple repeats (repeat_times > 1), if there is a dependency between the source operand and the destination operand, that is, the destination operand of the Nth iteration is the source operand of the (N+1)th iteration, address overlapping is not allowed.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
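Because dst_rep_stride and src_rep_stride use different units for this instruction (128 bytes versus one 32-byte block), the element offset of each repeat can be sketched as follows (the helper name is hypothetical, for illustration only):

```python
DST_STRIDE_UNIT = 128  # bytes: dst_rep_stride unit, specific to this instruction
SRC_STRIDE_UNIT = 32   # bytes: src_rep_stride unit, one block

def rep_offsets(n, dst_rep_stride, src_rep_stride, dtype_bytes):
    """Element offsets of the n-th repeat (0-based) for dst and src."""
    dst_off = n * dst_rep_stride * DST_STRIDE_UNIT // dtype_bytes
    src_off = n * src_rep_stride * SRC_STRIDE_UNIT // dtype_bytes
    return dst_off, src_off

# float16 (2 bytes per element), dst_rep_stride=2, src_rep_stride=3:
# the second repeat starts 128 elements into dst and 48 elements into src.
print(rep_offsets(1, 2, 3, 2))  # → (128, 48)
```

These are exactly the offsets observed in the result example below: the second repeat writes to dst element 128 (the third row of the 3 x 64 destination) and reads from src element 48.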
Example
from te import tik

tik_instance = tik.Tik()
dtype_size = {
"int8": 1,
"uint8": 1,
"int16": 2,
"uint16": 2,
"float16": 2,
"int32": 4,
"uint32": 4,
"float32": 4,
"int64": 8,
}
src_shape = (2, 128)
dst_shape = (3, 64)
dtype = "float16"
src_elements = 2 * 128
dst_elements = 3 * 64
# Number of source elements processed per iteration: 32 in this example, producing 16 pair sums.
mask = 32
# Number of iterations, which is 2 in the current example. You can adjust the number of iterations as required.
repeat_times = 2
# Stride between the corresponding block headers of adjacent iterations, in blocks.
# Here dst_rep_stride is 2 and src_rep_stride is 3. Note that for dst_rep_stride, one block is 128 bytes.
dst_rep_stride = 2
src_rep_stride = 3
src_gm = tik_instance.Tensor(dtype, src_shape, name="src_gm", scope=tik.scope_gm)
dst_gm = tik_instance.Tensor(dtype, dst_shape, name="dst_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor(dtype, src_shape, name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor(dtype, dst_shape, name="dst_ub", scope=tik.scope_ubuf)
# Move the user input from the Global Memory to the Unified Buffer.
# Number of moved segments.
nburst = 1
# Length of each moved segment, in units of 32 bytes.
burst = src_elements * dtype_size[dtype] // 32 // nburst
dst_burst = dst_elements * dtype_size[dtype] // 32 // nburst
# Stride between the previous burst tail and the next burst header, in units of 32 bytes.
dst_stride, src_stride = 0, 0
tik_instance.data_move(src_ub, src_gm, 0, nburst, burst, dst_stride, src_stride)
# Assign the initial value 0 to dst_ub. For details about vec_dup, see the corresponding section.
tik_instance.vec_dup(64, dst_ub, 0, 3, 4)
tik_instance.vec_cpadd(mask, dst_ub, src_ub, repeat_times, dst_rep_stride, src_rep_stride)
# Move the compute result from the Unified Buffer to the Global Memory.
tik_instance.data_move(dst_gm, dst_ub, 0, nburst, dst_burst, dst_stride, src_stride)
tik_instance.BuildCCE(kernel_name="vec_cpadd", inputs=[src_gm], outputs=[dst_gm])
Result example:
Input (src_gm):
[[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13.
14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27.
28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41.
42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55.
56. 57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69.
70. 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83.
84. 85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95. 96. 97.
98. 99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111.
112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125.
126. 127.]
[128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139. 140. 141.
142. 143. 144. 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155.
156. 157. 158. 159. 160. 161. 162. 163. 164. 165. 166. 167. 168. 169.
170. 171. 172. 173. 174. 175. 176. 177. 178. 179. 180. 181. 182. 183.
184. 185. 186. 187. 188. 189. 190. 191. 192. 193. 194. 195. 196. 197.
198. 199. 200. 201. 202. 203. 204. 205. 206. 207. 208. 209. 210. 211.
212. 213. 214. 215. 216. 217. 218. 219. 220. 221. 222. 223. 224. 225.
226. 227. 228. 229. 230. 231. 232. 233. 234. 235. 236. 237. 238. 239.
240. 241. 242. 243. 244. 245. 246. 247. 248. 249. 250. 251. 252. 253.
254. 255.]]
Output (dst_gm):
[[ 1. 5. 9. 13. 17. 21. 25. 29. 33. 37. 41. 45. 49. 53.
57. 61. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0.]
[ 97. 101. 105. 109. 113. 117. 121. 125. 129. 133. 137. 141. 145. 149.
153. 157. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0.]]
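To check the printed result, the vec_cpadd call above can be modeled in plain NumPy (a sketch of the pair-add semantics and stride arithmetic, not the TIK implementation):

```python
import numpy as np

mask, repeat_times = 32, 2
dst_rep_stride, src_rep_stride = 2, 3
elem = 2  # bytes per float16 element

src = np.arange(256, dtype=np.float32)  # same values as src_gm
dst = np.zeros(192, dtype=np.float32)   # dst_ub, pre-filled with 0 by vec_dup

for r in range(repeat_times):
    s = r * src_rep_stride * 32 // elem    # src stride unit: one 32-byte block
    d = r * dst_rep_stride * 128 // elem   # dst stride unit: 128 bytes
    # each repeat sums mask source elements pairwise into mask/2 results
    dst[d:d + mask // 2] = src[s:s + mask].reshape(-1, 2).sum(axis=1)

print(dst[:16])      # first row of dst_gm: 1, 5, ..., 61
print(dst[128:144])  # third row of dst_gm: 97, 101, ..., 157
```

The second row of dst_gm stays all zero because dst_rep_stride = 2 (256 bytes = 128 float16 elements) skips it entirely, matching the printed output above.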