vec_sel
Description
Selects elements from src0 or src1 bit by bit according to sel: if a sel bit is 1, the corresponding element of src0 is selected; if it is 0, the corresponding element of src1 is selected. The selection result is recorded as dst_temp and then filtered by mask: elements whose mask bit is 1 are written to dst, and elements whose mask bit is 0 keep the original value of dst.
For example, src0 is [1,2,3,4,5,6,7,8], src1 is [9,10,11,12,13,14,15,16], sel is [0,0,0,0,1,1,1,1], mask is [1,1,1,1,0,0,0,0], and the original value of dst is [-1,-2,-3,-4,-5,-6,-7,-8]. After bitwise selection based on sel, dst_temp [9,10,11,12,5,6,7,8] is obtained. The result dst filtered by mask is [9,10,11,12,-5,-6,-7,-8].
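The two-step behavior in the example above (select by sel, then filter by mask) can be modeled in plain Python. This is only an illustrative sketch: `vec_sel_reference` is a hypothetical helper, not part of the TIK API, and it takes sel and mask as per-element lists of 0/1 rather than packed bits.

```python
def vec_sel_reference(src0, src1, sel, mask, dst):
    """Pure-Python model of the select-then-mask behavior (hypothetical helper)."""
    # Step 1: sel bit 1 picks the element from src0, sel bit 0 picks it from src1.
    dst_temp = [a if s == 1 else b for a, b, s in zip(src0, src1, sel)]
    # Step 2: mask bit 1 takes dst_temp, mask bit 0 keeps the original dst value.
    return [t if m == 1 else d for t, m, d in zip(dst_temp, mask, dst)]


src0 = [1, 2, 3, 4, 5, 6, 7, 8]
src1 = [9, 10, 11, 12, 13, 14, 15, 16]
sel = [0, 0, 0, 0, 1, 1, 1, 1]
mask = [1, 1, 1, 1, 0, 0, 0, 0]
dst = [-1, -2, -3, -4, -5, -6, -7, -8]

result = vec_sel_reference(src0, src1, sel, mask, dst)
print(result)  # [9, 10, 11, 12, -5, -6, -7, -8]
```

The printed list matches the dst value given in the example.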
Prototype
vec_sel(mask, mode, dst, sel, src0, src1, repeat_times, dst_rep_stride=0, src0_rep_stride=0, src1_rep_stride=0)
Pipe: Vector
Parameters
| Parameter | Input/Output | Description |
|---|---|---|
| mask | Input | For details, see the description of the mask parameter in Table 1. |
| mode | Input | Instruction mode, selected from: 0: Select between two tensors based on sel. Multiple iterations are supported. Element selection in each iteration is determined by the first 128 bits (if the destination operand is of type float16) or 64 bits (if the destination operand is of type float32) of sel. 1: Select between a tensor and a scalar bitwise based on sel. Multiple iterations are supported. 2: Select between two tensors bitwise based on sel. Multiple iterations are supported. |
| dst | Output | Start element of the destination Tensor operand. The scope of the tensor is the Unified Buffer. |
| sel | Input | Selection mask. If a bit is 1, the corresponding element in src0 is selected; if a bit is 0, the corresponding element in src1 is selected. In mode 0, 1, or 2, sel is a Tensor of type uint8/uint16/uint32/uint64. |
| src0 | Input | Start element of the source Tensor operand 0. The scope of the tensor is the Unified Buffer. Note: dst must have the same data type as src0 and src1. |
| src1 | Input | Start element of the source Tensor operand 1. The scope of the tensor is the Unified Buffer. In mode 0 or 2, the argument is a tensor. In mode 1, the argument is a Scalar of type int/float or an immediate of type int/float. Note: dst must have the same data type as src0 and src1. |
| repeat_times | Input | Repeat times (or iterations). |
| dst_rep_stride | Input | Repeat stride size for the destination operand between the corresponding blocks of successive iterations. |
| src0_rep_stride | Input | Repeat stride size for source operand 0 between the corresponding blocks of successive iterations. |
| src1_rep_stride | Input | Repeat stride size for source operand 1 between the corresponding blocks of successive iterations. Note: This parameter is invalid in mode 1. |
Returns
None
Applicability
Restrictions
- The mode argument must be an immediate.
- repeat_times ∈ [0, 255]. Must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int (other than 0), or an Expr of type int16/int32/int64/uint16/uint32/uint64.
- dst_rep_stride, src0_rep_stride, and src1_rep_stride are in the unit of 32 bytes. Each must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
- dst and src0 must be different tensors or the same element of the same tensor, rather than different elements of the same tensor. This also applies to dst and src1.
- To save memory space, you can define a tensor reused by the source and destination operands (which means they have overlapped addresses). The general instruction restrictions are as follows.
  - In the event of a single iteration repeat (repeat_times = 1), the source operand must completely overlap the destination operand.
  - In the event of multiple iteration repeats (repeat_times > 1), if there is a dependency between the source operand and the destination operand, that is, the destination operand of the Nth iteration is the source operand of the (N+1)th iteration, address overlapping is not allowed.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
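Because the repeat strides are expressed in 32-byte units, it can help to convert them to element offsets when reasoning about where each iteration starts. The sketch below is a plain-Python illustration: `iteration_offsets` and the `DTYPE_SIZE` table are hypothetical names, not TIK APIs, and the sketch assumes that iteration i starts i × rep_stride blocks after the first one.

```python
# Bytes per element for the two destination data types named in this section.
DTYPE_SIZE = {"float16": 2, "float32": 4}
BLOCK_BYTES = 32  # unit of the *_rep_stride parameters


def iteration_offsets(rep_stride, dtype, repeat_times):
    """Element offset at which each iteration starts (hypothetical helper)."""
    elements_per_block = BLOCK_BYTES // DTYPE_SIZE[dtype]
    return [i * rep_stride * elements_per_block for i in range(repeat_times)]


print(iteration_offsets(8, "float16", 2))  # [0, 128]
print(iteration_offsets(6, "float16", 2))  # [0, 96]
print(iteration_offsets(4, "float16", 2))  # [0, 64]
```

With float16 data, a stride of 8 gives contiguous 128-element iterations, while a stride of 6 places the second iteration 96 elements after the first.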
Example
- mode = 0
```python
# mode 0
from tbe import tik

tik_instance = tik.Tik()
src0_gm = tik_instance.Tensor("float16", (256,), name="src0_gm", scope=tik.scope_gm)
src1_gm = tik_instance.Tensor("float16", (256,), name="src1_gm", scope=tik.scope_gm)
src0_ub = tik_instance.Tensor("float16", (256,), name="src0_ub", scope=tik.scope_ubuf)
src1_ub = tik_instance.Tensor("float16", (256,), name="src1_ub", scope=tik.scope_ubuf)
dst_gm = tik_instance.Tensor("float16", (256,), name="dst_gm", scope=tik.scope_gm)
dst_ub = tik_instance.Tensor("float16", (256,), name="dst_ub", scope=tik.scope_ubuf)
# Copy the user input to the source Unified Buffer.
tik_instance.data_move(src0_ub, src0_gm, 0, 1, 16, 0, 0)
tik_instance.data_move(src1_ub, src1_gm, 0, 1, 16, 0, 0)
is_le = tik_instance.Tensor("uint16", (16,), name="is_le", scope=tik.scope_ubuf)
tik_instance.vec_cmpv_le(is_le, src0_ub, src1_ub, 2, 8, 8)
tik_instance.vec_sel(128, 0, dst_ub, is_le, src0_ub, src1_ub, 2, 8, 8, 8)
# Copy the compute result to the destination Global Memory.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 16, 0, 0)
tik_instance.BuildCCE(kernel_name="vec_sel", inputs=[src0_gm, src1_gm], outputs=[dst_gm])
```

Result example:
Input (float16): src0_gm = [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69. 70. 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83. 84. 85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95. 96. 97. 98. 99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111. 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125. 126. 127. 128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139. 140. 141. 142. 143. 144. 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155. 156. 157. 158. 159. 160. 161. 162. 163. 164. 165. 166. 167. 168. 169. 170. 171. 172. 173. 174. 175. 176. 177. 178. 179. 180. 181. 182. 183. 184. 185. 186. 187. 188. 189. 190. 191. 192. 193. 194. 195. 196. 197. 198. 199. 200. 201. 202. 203. 204. 205. 206. 207. 208. 209. 210. 211. 212. 213. 214. 215. 216. 217. 218. 219. 220. 221. 222. 223. 224. 225. 226. 227. 228. 229. 230. 231. 232. 233. 234. 235. 236. 237. 238. 239. 240. 241. 242. 243. 244. 245. 246. 247. 248. 249. 250. 251. 252. 253. 254. 255.] src1_gm = [2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.] 
Output (in the second iteration, element selection is also determined by the first 128 bits of is_le): dst_gm = [ 0. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 128. 129. 130. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
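The mode 0 result above can be reproduced with a small pure-Python model, which makes the reuse of the first 128 sel bits in every iteration explicit. This is a sketch under stated assumptions: `vec_sel_mode0_reference` is a hypothetical helper, the iterations are treated as contiguous (rep_stride = 8 for float16), all 128 elements per iteration are enabled by the mask, and the list comprehension stands in for vec_cmpv_le.

```python
def vec_sel_mode0_reference(src0, src1, sel_bits, repeat_times, elems_per_iter=128):
    """Model of mode 0 for float16 with contiguous iterations (hypothetical helper)."""
    dst = []
    for it in range(repeat_times):
        base = it * elems_per_iter
        for i in range(elems_per_iter):
            # Every iteration re-reads sel_bits[0:elems_per_iter].
            bit = sel_bits[i]
            dst.append(src0[base + i] if bit else src1[base + i])
    return dst


src0 = [float(v) for v in range(256)]
src1 = [2.0] * 256
# Stand-in for vec_cmpv_le: bit i is 1 where src0[i] <= src1[i].
is_le = [1 if a <= b else 0 for a, b in zip(src0, src1)]

dst = vec_sel_mode0_reference(src0, src1, is_le, repeat_times=2)
print(dst[:4], dst[128:132])  # [0.0, 1.0, 2.0, 2.0] [128.0, 129.0, 130.0, 2.0]
```

Only the first three sel bits are 1, so the second iteration picks 128, 129, and 130 from src0 even though those values are greater than 2.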
- mode = 1
```python
# mode 1
from tbe import tik

tik_instance = tik.Tik()
src0_gm = tik_instance.Tensor("float16", (128,), name="src0_gm", scope=tik.scope_gm)
sel_gm = tik_instance.Tensor("uint16", (16,), name="sel_gm", scope=tik.scope_gm)
dst_gm = tik_instance.Tensor("float16", (128,), name="dst_gm", scope=tik.scope_gm)
src0_ub = tik_instance.Tensor("float16", (128,), name="src0_ub", scope=tik.scope_ubuf)
sel_ub = tik_instance.Tensor("uint16", (16,), name="sel_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
# Copy the user input to the source Unified Buffer.
tik_instance.data_move(src0_ub, src0_gm, 0, 1, 8, 0, 0)
tik_instance.data_move(sel_ub, sel_gm, 0, 1, 1, 0, 0)
src1 = tik_instance.Scalar(dtype="float16", init_value=5.2)
tik_instance.vec_sel(128, 1, dst_ub, sel_ub, src0_ub, src1, 1, 8, 8, 8)
# Copy the compute result to the destination Global Memory.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 8, 0, 0)
tik_instance.BuildCCE(kernel_name="vec_sel", inputs=[src0_gm, sel_gm], outputs=[dst_gm])
```

Result example:
Input (src0_gm): [ 7.047 -4.61 -2.754 -8.84 0.724 -7.03 -7.41 3.191 -5.715 5.375 7.97 -9.04 -7.51 5.297 -4.668 -2.234 -6.406 -4.133 3.457 6.25 6.332 2.072 4.99 5.32 -8.53 -5.195 2.773 3.496 3.469 1.08 8.55 7.95 -8.01 0.355 5.715 -8.13 1.624 -6.242 1.843 4.48 -7.227 -9.08 4.043 5.066 9.92 -0.2786 1.384 2.338 -2.158 5.65 -2.76 -3.553 9.125 -4.727 4.836 1.339 1.57 5.67 -9.89 5.16 2.68 3.041 -0.7236 -9.83 -2.875 3.049 -2.734 5.13 -6.285 -4.58 -1.953 -3.678 -1.075 2.342 -6.2 -1.131 -0.581 -4.168 4.375 -1.106 3.5 6.32 8.625 -0.0924 2.592 8.586 -5.125 -2.816 -0.981 -7.402 -1.569 -3.521 -5.89 0.1954 -3.088 -8.39 -7.53 -8.04 -3.55 0.6875 -7.797 -0.898 -0.806 1.412 0.04767 -7.867 7.44 -0.4033 5.625 -0.4336 -7.85 -6.535 1.921 -8.984 3.691 -0.612 -6.465 1.917 -2.377 9.086 6.625 -1.057 5.23 -4.594 3.834 -7.27 6.39 -8.81 ] Input (sel_gm): [34801 47741 38465 50533 40608 34912 29182 41277 8651 53727 22558 7409 50749 47716 41029 56293] Output (dst_gm): [ 7.047 5.2 5.2 5.2 0.724 -7.03 -7.41 3.191 -5.715 5.375 7.97 5.2 5.2 5.2 5.2 -2.234 -6.406 5.2 3.457 6.25 6.332 2.072 4.99 5.2 5.2 -5.195 5.2 3.496 3.469 1.08 5.2 7.95 -8.01 5.2 5.2 5.2 5.2 5.2 1.843 5.2 5.2 -9.08 4.043 5.2 9.92 5.2 5.2 2.338 -2.158 5.2 -2.76 5.2 5.2 -4.727 4.836 5.2 1.57 5.2 -9.89 5.2 5.2 5.2 -0.7236 -9.83 5.2 5.2 5.2 5.2 5.2 -4.58 5.2 -3.678 5.2 2.342 -6.2 -1.131 -0.581 5.2 5.2 -1.106 5.2 5.2 5.2 5.2 5.2 8.586 -5.125 5.2 5.2 5.2 5.2 -3.521 5.2 5.2 5.2 -8.39 5.2 -8.04 -3.55 0.6875 -7.797 -0.898 -0.806 1.412 0.04767 5.2 5.2 5.2 5.625 -0.4336 -7.85 5.2 1.921 5.2 3.691 -0.612 -6.465 1.917 5.2 5.2 6.625 5.2 5.2 5.2 5.2 -7.27 5.2 -8.81 ]
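To see how the uint16 sel words in the mode 1 result map to elements, the first word can be unpacked by hand. Judging from the data above, bit 0 (the least significant bit) of each word appears to control the first of its 16 elements. The sketch below checks that reading against the first 16 output values; `unpack_sel_bits` is a hypothetical helper, not a TIK API.

```python
def unpack_sel_bits(words, bits_per_word=16):
    """Expand sel words into a flat list of bits, least significant bit first."""
    return [(w >> b) & 1 for w in words for b in range(bits_per_word)]


# First sel word and first 16 src0 values from the mode 1 example.
sel_word = 34801
src0 = [7.047, -4.61, -2.754, -8.84, 0.724, -7.03, -7.41, 3.191,
        -5.715, 5.375, 7.97, -9.04, -7.51, 5.297, -4.668, -2.234]
scalar = 5.2  # value of the src1 Scalar in the example

bits = unpack_sel_bits([sel_word])
# Bit 1 keeps the src0 element, bit 0 substitutes the scalar.
dst = [a if bit else scalar for a, bit in zip(src0, bits)]
print(bits)
print(dst)
```

The resulting list agrees with the first 16 entries of dst_gm in the example output.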
- mode = 2
```python
# mode 2
from tbe import tik

tik_instance = tik.Tik()
src0_gm = tik_instance.Tensor("float16", (128,), name="src0_gm", scope=tik.scope_gm)
src1_gm = tik_instance.Tensor("float16", (128,), name="src1_gm", scope=tik.scope_gm)
sel_gm = tik_instance.Tensor("uint16", (16,), name="sel_gm", scope=tik.scope_gm)
dst_gm = tik_instance.Tensor("float16", (128,), name="dst_gm", scope=tik.scope_gm)
src0_ub = tik_instance.Tensor("float16", (128,), name="src0_ub", scope=tik.scope_ubuf)
src1_ub = tik_instance.Tensor("float16", (128,), name="src1_ub", scope=tik.scope_ubuf)
sel_ub = tik_instance.Tensor("uint16", (16,), name="sel_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
# Copy the user input to the source Unified Buffer.
tik_instance.data_move(src0_ub, src0_gm, 0, 1, 8, 0, 0)
tik_instance.data_move(src1_ub, src1_gm, 0, 1, 8, 0, 0)
tik_instance.data_move(sel_ub, sel_gm, 0, 1, 1, 0, 0)
tik_instance.vec_sel(128, 2, dst_ub, sel_ub, src0_ub, src1_ub, 1, 8, 8, 8)
# Copy the compute result to the destination Global Memory.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 8, 0, 0)
tik_instance.BuildCCE(kernel_name="vec_sel", inputs=[src0_gm, src1_gm, sel_gm], outputs=[dst_gm])
```

Result example:
Input (src0_gm): [-2.555 -7.137 9.93 5.52 -4.195 -9.07 -8.266 0.1783 -3.87 -5.79 -7.863 -6.137 -0.766 -6.12 8.05 5.273 -5.97 -6.73 -7.797 -6.492 -2.367 9.76 5.523 -7.477 3.629 -9.24 2.078 -7.43 3.941 8.9 0.981 -1.694 6.7 -8.734 -3.82 -2.44 -8.64 -3.736 -5.23 -6.348 -5.86 7.6 0.2576 -3.514 6.043 0.0805 -2.475 0.4766 -1.011 7.66 5.87 0.924 7.734 -2.246 3.477 6.703 5.438 -3.555 -1.588 0.3542 7.46 -3.89 -8.98 9.97 -6.195 2.178 -9.54 -9.2 -0.0888 0.625 9.16 -1.165 -8.87 3.057 -2.197 4.81 -3.875 -8.07 -1.478 -1.128 -1. 2.316 -1.426 1.563 -3.62 1.21 5.49 4.86 3.428 4.406 9.98 9.55 3.342 -3.312 -6.234 7.082 -7.13 -9.61 -6.676 4.758 -3.373 4.13 -8.5 -9.67 0.2986 -8.984 6.758 1.481 -9. -5.84 3.895 5.164 -2.203 -0.2065 4.645 0.812 5.28 -9.42 -2.527 0.0811 -8.484 2.807 -0.178 4.258 0.1595 8.28 9.53 2.85 ] Input (src1_gm): [ 8.98 7.25 7.016 8.83 -1.67 2.062 4.71 -7.613 -4.023 6.9 0.5405 -4.277 -6.766 6.555 -9.445 -1.903 -9.445 -0.826 6.02 -2.701 5.883 5.695 3.201 2.83 9.99 7.68 -7.254 5.855 -6.188 9.89 -2.97 4.703 5.332 7.63 -6.938 -5.273 -4.19 7.76 -4.133 -4.582 1.795 8.945 7.902 9.31 1.126 -2.088 -5.78 -8.82 -6.203 8.01 -7.64 -7.703 -7.43 -7.414 -5.523 6.207 -5.785 5.38 4.82 7.605 9.016 -0.77 1.106 -1.48 8.625 2.41 -5.598 -4.02 2.88 -6.11 4.29 -6.32 -8.42 6.105 -1.016 0.834 0.8794 5.184 6.98 -3.557 -9.91 8.336 1.978 -3.084 -5.523 3.527 9.28 6.89 7.55 3.988 -4.633 0.02034 -2.785 -5.453 3.363 -5.215 -0.5625 5.703 -0.4485 1.616 5.883 5.637 4.914 2.947 5.715 2.455 -9.82 -0.541 7.15 7.21 0.768 1.718 -0.04523 -1.464 -3.38 4.926 4.258 -9.14 -3.22 -3.625 2.277 -2.277 3.557 -6.855 8.77 6.895 2.064 -9.57 ] Input (sel_gm): [25595 40859 53116 59877 7219 22186 18639 30224 65041 61683 61785 4586 43795 50906 26534 38912] Output (dst_gm): [-2.555 -7.137 7.016 5.52 -4.195 -9.07 -8.266 0.1783 -3.87 -5.79 0.5405 -4.277 -6.766 -6.12 8.05 -1.903 -5.97 -6.73 6.02 -6.492 -2.367 5.695 3.201 -7.477 3.629 -9.24 2.078 -7.43 3.941 9.89 -2.97 -1.694 5.332 7.63 -3.82 -2.44 -8.64 -3.736 -5.23 -4.582 -5.86 
7.6 0.2576 -3.514 1.126 -2.088 -2.475 0.4766 -1.011 8.01 5.87 -7.703 -7.43 -2.246 3.477 6.703 5.438 5.38 4.82 0.3542 9.016 -3.89 -8.98 9.97 -6.195 2.178 -5.598 -4.02 -0.0888 0.625 4.29 -6.32 -8.42 6.105 -2.197 4.81 -3.875 5.184 6.98 -3.557 -9.91 2.316 1.978 1.563 -5.523 1.21 9.28 4.86 7.55 4.406 9.98 0.02034 3.342 -5.453 -6.234 -5.215 -7.13 -9.61 -6.676 4.758 5.883 5.637 -8.5 -9.67 5.715 2.455 -9.82 1.481 7.15 7.21 3.895 1.718 -0.04523 -1.464 -3.38 4.926 5.28 -9.14 -3.22 -3.625 2.277 2.807 -0.178 -6.855 0.1595 8.28 9.53 -9.57 ]
- Example 2

```python
"""Select from two 128-element source operands using vec_sel. Each iteration processes
48 elements, and the destination leaves a 48-element gap after every 48 elements written."""
from tbe import tik

tik_instance = tik.Tik()
dtype_size = {
    "int8": 1,
    "uint8": 1,
    "int16": 2,
    "uint16": 2,
    "float16": 2,
    "int32": 4,
    "uint32": 4,
    "float32": 4,
    "int64": 8,
}
shape = (2, 64)
dst_shape = (12, 16)
dtype = "float16"
elements = 2 * 64
dst_element = 12 * 16
# Number of elements processed per iteration, which is 48 in the current example.
mask = 48
# Stride between the start of one iteration and the start of the next, in units of 32 bytes.
dst_rep_stride = 6
src0_rep_stride = 4
src1_rep_stride = 4
# Number of iterations, which is 2 in the current example. You can adjust the number of iterations as required.
repeat_times = 2
# Instruction mode, in the range [0, 2]. In the current example, 0 is used.
mode = 0
src0_gm = tik_instance.Tensor(dtype, shape, name="src0_gm", scope=tik.scope_gm)
src1_gm = tik_instance.Tensor(dtype, shape, name="src1_gm", scope=tik.scope_gm)
src0_ub = tik_instance.Tensor(dtype, shape, name="src0_ub", scope=tik.scope_ubuf)
src1_ub = tik_instance.Tensor(dtype, shape, name="src1_ub", scope=tik.scope_ubuf)
dst_gm = tik_instance.Tensor(dtype, dst_shape, name="dst_gm", scope=tik.scope_gm)
dst_ub = tik_instance.Tensor(dtype, dst_shape, name="dst_ub", scope=tik.scope_ubuf)
# Number of moved segments.
nburst = 1
# Length of the moved segment each time, in units of 32 bytes.
burst = elements * dtype_size[dtype] // 32 // nburst
dst_burst = dst_element * dtype_size[dtype] // 32 // nburst
# Stride between the previous burst tail and the next burst header, in units of 32 bytes.
dst_stride, src_stride = 0, 0
# Copy the user input to the source Unified Buffer.
tik_instance.data_move(src0_ub, src0_gm, 0, nburst, burst, dst_stride, src_stride)
tik_instance.data_move(src1_ub, src1_gm, 0, nburst, burst, dst_stride, src_stride)
# For details about how to use vec_cmpv_le and vec_dup, see related sections.
is_le = tik_instance.Tensor("uint16", (16,), name="is_le", scope=tik.scope_ubuf)
tik_instance.vec_cmpv_le(is_le, src0_ub, src1_ub, 1, 8, 8)
tik_instance.vec_dup(64, dst_ub, 0, 3, 4)
# The mask is 48, that is, 48 elements are processed in each iteration.
# The first 11 bits of is_le are 1 (src0 values 0 through 10 are <= 10), so those
# elements are selected from src0. All other elements are selected from src1.
tik_instance.vec_sel(mask, mode, dst_ub, is_le, src0_ub, src1_ub, repeat_times,
                     dst_rep_stride, src0_rep_stride, src1_rep_stride)
# Copy the compute result to the destination Global Memory.
tik_instance.data_move(dst_gm, dst_ub, 0, nburst, dst_burst, dst_stride, src_stride)
tik_instance.BuildCCE(kernel_name="vec_sel", inputs=[src0_gm, src1_gm], outputs=[dst_gm])
```

Result example:
Input (src0_gm):
[[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. 58. 59. 60. 61. 62. 63.]
 [ 64. 65. 66. 67. 68. 69. 70. 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83. 84. 85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95. 96. 97. 98. 99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111. 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125. 126. 127.]]
Input (src1_gm):
[[10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.]
 [10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.]]
Output (dst_gm):
[[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 10. 10. 10. 10. 10.]
 [10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.]
 [10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [64. 65. 66. 67. 68. 69. 70. 71. 72. 73. 74. 10. 10. 10. 10. 10.]
 [10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.]
 [10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
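The Example 2 output layout can be cross-checked without hardware using a pure-Python model. This is only a sketch: `emulate_example2` is a hypothetical function, it assumes that an integer mask of 48 enables the first 48 elements of each iteration, and it replaces vec_cmpv_le and vec_dup with list operations.

```python
def emulate_example2():
    """Pure-Python model of the Example 2 data layout (float16, mode 0)."""
    elems_per_block = 16                     # 32 bytes / 2 bytes per float16
    mask = 48                                # elements processed per iteration
    repeat_times = 2
    dst_rep_stride, src_rep_stride = 6, 4    # in 32-byte blocks

    src0 = [float(v) for v in range(128)]
    src1 = [10.0] * 128
    # Stand-in for vec_cmpv_le over the 128 source elements.
    is_le = [1 if a <= b else 0 for a, b in zip(src0, src1)]
    dst = [0.0] * 192                        # stand-in for the vec_dup zero fill

    for it in range(repeat_times):
        src_base = it * src_rep_stride * elems_per_block
        dst_base = it * dst_rep_stride * elems_per_block
        for i in range(mask):
            # Mode 0: each iteration reuses the leading sel bits.
            if is_le[i]:
                dst[dst_base + i] = src0[src_base + i]
            else:
                dst[dst_base + i] = src1[src_base + i]
    # Reshape to (12, 16) to match dst_shape.
    return [dst[r * 16:(r + 1) * 16] for r in range(12)]


rows = emulate_example2()
print(rows[0])
print(rows[6])
```

Rows 0 to 2 and 6 to 8 hold the selected values, and rows 3 to 5 and 9 to 11 keep the zeros written beforehand, which matches the dst_gm output of Example 2.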