vec_sel

Description

Selects elements from src0 or src1 bit by bit according to sel. If a sel bit is 1, the corresponding element of src0 is selected; if it is 0, the corresponding element of src1 is selected. The selected elements form an intermediate result, dst_temp, which is then filtered by mask: where a mask bit is 1, the dst_temp element is written to dst; where a mask bit is 0, dst retains its original value.

For example, src0 is [1,2,3,4,5,6,7,8], src1 is [9,10,11,12,13,14,15,16], sel is [0,0,0,0,1,1,1,1], mask is [1,1,1,1,0,0,0,0], and the original value of dst is [-1,-2,-3,-4,-5,-6,-7,-8]. After bitwise selection based on sel, dst_temp [9,10,11,12,5,6,7,8] is obtained. The result dst filtered by mask is [9,10,11,12,-5,-6,-7,-8].
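The two-step behavior in this example can be modeled in plain Python. The following is a minimal host-side sketch, not TIK code. The function name vec_sel_reference and the use of one list entry per bit are illustrative assumptions, chosen only to reproduce the values above.

```python
def vec_sel_reference(sel_bits, mask_bits, src0, src1, dst):
    """Element-wise model of the description: select by sel, then filter by mask."""
    # Step 1: a sel bit of 1 picks the src0 element, a sel bit of 0 picks the src1 element.
    dst_temp = [a if s == 1 else b for s, a, b in zip(sel_bits, src0, src1)]
    # Step 2: a mask bit of 1 writes dst_temp, a mask bit of 0 keeps the original dst value.
    result = [t if m == 1 else d for m, t, d in zip(mask_bits, dst_temp, dst)]
    return dst_temp, result


src0 = [1, 2, 3, 4, 5, 6, 7, 8]
src1 = [9, 10, 11, 12, 13, 14, 15, 16]
sel = [0, 0, 0, 0, 1, 1, 1, 1]
mask = [1, 1, 1, 1, 0, 0, 0, 0]
dst = [-1, -2, -3, -4, -5, -6, -7, -8]

dst_temp, result = vec_sel_reference(sel, mask, src0, src1, dst)
print(dst_temp)  # [9, 10, 11, 12, 5, 6, 7, 8]
print(result)    # [9, 10, 11, 12, -5, -6, -7, -8]
```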

Prototype

vec_sel(mask, mode, dst, sel, src0, src1, repeat_times, dst_rep_stride=0, src0_rep_stride=0, src1_rep_stride=0)

Pipe: Vector

Parameters

Table 1 Parameter description

Parameter

Input/Output

Description

mask

Input

For details, see the description of the mask parameter in Table 1.

mode

Input

Instruction mode, selected from:

0: Select between two tensors based on sel. Multiple iterations are supported. Element selection in each iteration is determined by the first 128 bits (if the destination operand is of type float16) or 64 bits (if the destination operand is of type float32) of sel.

1: Select between a tensor and a scalar bitwise based on sel. Multiple iterations are supported.

2: Select between two tensors bitwise based on sel. Multiple iterations are supported.

Atlas 200/300/500 Inference Product supports only mode 0.

Atlas Training Series Product supports only mode 0.

dst

Output

Start element of the destination Tensor operand.

The scope of the tensor is the Unified Buffer.

Atlas 200/300/500 Inference Product: Tensor of type float16

Atlas Training Series Product: Tensor of type float16/float32

sel

Input

Selection bitmask (distinct from the mask parameter). If a bit is 1, the corresponding element of src0 is selected; if a bit is 0, the corresponding element of src1 is selected.

In mode 0, 1, or 2, sel is a Tensor of type uint8/uint16/uint32/uint64.

  • When mode = 0, element selection in each iteration is determined by the first 128 bits (if the destination operand is of type float16) or 64 bits (if the destination operand is of type float32) of sel. For details, see the call example.
  • In mode 1 or 2, elements are consumed continuously between iterations.

src0

Input

Start element of the source Tensor operand 0.

The scope of the tensor is the Unified Buffer.

Note: dst must have the same data type as src0 and src1.

src1

Input

Start element of the source Tensor operand 1.

The scope of the tensor is the Unified Buffer.

In mode 0 or 2, the argument is a tensor. In mode 1, the argument is a Scalar of type int/float or an immediate of type int/float.

Note: dst must have the same data type as src0 and src1.

repeat_times

Input

Repeat times (or iterations).

dst_rep_stride

Input

Repeat stride size for the destination operand between the corresponding blocks of successive iterations.

src0_rep_stride

Input

Repeat stride size for source operand 0 between the corresponding blocks of successive iterations.

src1_rep_stride

Input

Repeat stride size for source operand 1 between the corresponding blocks of successive iterations.

Note: This parameter has no effect in mode 1.
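How sel is consumed across iterations in the three modes can be sketched in plain Python as follows. This is a host-side illustration only. The helper names unpack_bits and sel_bits_for_iteration are hypothetical, and the least-significant-bit-first order is an assumption inferred from the mode 1 call example, where the first sel value 34801 selects elements 0, 4 to 10, and 15 from src0.

```python
def unpack_bits(words, word_bits=16):
    """Expand unsigned integers into a flat list of bits, least significant bit first."""
    return [(w >> i) & 1 for w in words for i in range(word_bits)]


def sel_bits_for_iteration(bits, mode, iteration, elems_per_iter):
    """Return the sel bits that drive one iteration.

    Mode 0 reuses the first elems_per_iter bits in every iteration.
    Modes 1 and 2 consume the bits continuously from one iteration to the next.
    """
    start = 0 if mode == 0 else iteration * elems_per_iter
    return bits[start:start + elems_per_iter]


# sel values taken from the mode 1 call example (16 uint16 words = 256 bits).
sel_words = [34801, 47741, 38465, 50533, 40608, 34912, 29182, 41277,
             8651, 53727, 22558, 7409, 50749, 47716, 41029, 56293]
bits = unpack_bits(sel_words)

print(bits[:16])  # [1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1]

# float16 processes 128 elements per iteration.
mode0_second = sel_bits_for_iteration(bits, 0, 1, 128)  # same bits as iteration 0
mode2_second = sel_bits_for_iteration(bits, 2, 1, 128)  # bits 128..255
```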

Returns

None

Applicability

Atlas 200/300/500 Inference Product

Atlas Training Series Product

Restrictions

  • The mode argument must be an immediate.
  • repeat_times ∈ [0, 255]. Must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int (other than 0), or an Expr of type int16/int32/int64/uint16/uint32/uint64.
  • dst_rep_stride, src0_rep_stride, and src1_rep_stride are in units of 32 bytes. Each must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
  • dst and src0 must be different tensors or the same element of the same tensor, rather than different elements of the same tensor. This also applies to dst and src1.
  • To save memory space, you can define a tensor reused by the source and destination operands (which means they have overlapped addresses). The general instruction restrictions are as follows.
    • For a single iteration (repeat_times = 1), the source operand must completely overlap the destination operand.
    • For multiple iterations (repeat_times > 1), address overlapping is not allowed if the source and destination operands are dependent, that is, if the destination operand of the Nth iteration is the source operand of the (N+1)th iteration.
  • For details about the alignment requirements of the operand address offset, see General Restrictions.
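Because the repeat strides are expressed in 32-byte blocks, the element offset at which each iteration starts depends on the data type. A minimal sketch of that arithmetic follows; the helper name iteration_offset is hypothetical and the code runs on the host, not on the device.

```python
BLOCK_BYTES = 32  # unit of dst_rep_stride, src0_rep_stride, and src1_rep_stride


def iteration_offset(iteration, rep_stride, dtype_bytes):
    """Element index at which the given iteration starts for one operand."""
    return iteration * rep_stride * BLOCK_BYTES // dtype_bytes


# float16 (2 bytes): a stride of 8 blocks advances 128 elements per iteration.
print(iteration_offset(1, 8, 2))  # 128
# float16 with a stride of 6 blocks: the second iteration starts at element 96.
print(iteration_offset(1, 6, 2))  # 96
# float32 (4 bytes): a stride of 8 blocks advances 64 elements per iteration.
print(iteration_offset(1, 8, 4))  # 64
```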

Example

  • mode = 0
    # mode 0
    from tbe import tik
    tik_instance = tik.Tik()
    src0_gm = tik_instance.Tensor("float16", (256,), name="src0_gm", scope=tik.scope_gm)
    src1_gm = tik_instance.Tensor("float16", (256,), name="src1_gm", scope=tik.scope_gm)
    src0_ub = tik_instance.Tensor("float16", (256,), name="src0_ub", scope=tik.scope_ubuf)
    src1_ub = tik_instance.Tensor("float16", (256,), name="src1_ub", scope=tik.scope_ubuf)
    dst_gm = tik_instance.Tensor("float16", (256,), name="dst_gm", scope=tik.scope_gm)
    dst_ub = tik_instance.Tensor("float16", (256,), name="dst_ub", scope=tik.scope_ubuf)
    # Copy the user input to the source Unified Buffer.
    tik_instance.data_move(src0_ub, src0_gm, 0, 1, 16, 0, 0)
    tik_instance.data_move(src1_ub, src1_gm, 0, 1, 16, 0, 0)
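    is_le = tik_instance.Tensor("uint16", (16,), name="is_le", scope=tik.scope_ubuf)
    # Compare src0 <= src1 over 2 iterations; the result bits are stored in is_le (256 bits in total).
    tik_instance.vec_cmpv_le(is_le, src0_ub, src1_ub, 2, 8, 8)
    # Mode 0: both iterations select using the first 128 bits of is_le.
    tik_instance.vec_sel(128, 0, dst_ub, is_le, src0_ub, src1_ub, 2, 8, 8, 8)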
    # Copy the compute result to the destination Global Memory.
    tik_instance.data_move(dst_gm, dst_ub, 0, 1, 16, 0, 0)
    
    tik_instance.BuildCCE(kernel_name="vec_sel", inputs=[src0_gm, src1_gm], outputs=[dst_gm])
    

    Result example:

    Input (float16):
    src0_gm = 
    [  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
      14.  15.  16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.
      28.  29.  30.  31.  32.  33.  34.  35.  36.  37.  38.  39.  40.  41.
      42.  43.  44.  45.  46.  47.  48.  49.  50.  51.  52.  53.  54.  55.
      56.  57.  58.  59.  60.  61.  62.  63.  64.  65.  66.  67.  68.  69.
      70.  71.  72.  73.  74.  75.  76.  77.  78.  79.  80.  81.  82.  83.
      84.  85.  86.  87.  88.  89.  90.  91.  92.  93.  94.  95.  96.  97.
      98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111.
     112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125.
     126. 127. 128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139.
     140. 141. 142. 143. 144. 145. 146. 147. 148. 149. 150. 151. 152. 153.
     154. 155. 156. 157. 158. 159. 160. 161. 162. 163. 164. 165. 166. 167.
     168. 169. 170. 171. 172. 173. 174. 175. 176. 177. 178. 179. 180. 181.
     182. 183. 184. 185. 186. 187. 188. 189. 190. 191. 192. 193. 194. 195.
     196. 197. 198. 199. 200. 201. 202. 203. 204. 205. 206. 207. 208. 209.
     210. 211. 212. 213. 214. 215. 216. 217. 218. 219. 220. 221. 222. 223.
     224. 225. 226. 227. 228. 229. 230. 231. 232. 233. 234. 235. 236. 237.
     238. 239. 240. 241. 242. 243. 244. 245. 246. 247. 248. 249. 250. 251.
     252. 253. 254. 255.]
    src1_gm = 
    [2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
     2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
     2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
     2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
     2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
     2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
     2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
     2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
     2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
     2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
     2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
    
    Output (in the second iteration, element selection is also determined by the first 128 bits of is_le):
    dst_gm = 
    [  0.   1.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.
       2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.
       2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.
       2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.
       2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.
       2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.
       2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.
       2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.
       2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.
       2.   2. 128. 129. 130.   2.   2.   2.   2.   2.   2.   2.   2.   2.
       2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.
       2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.
       2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.
       2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.
       2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.
       2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.
       2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.
       2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.   2.
       2.   2.   2.   2.]
  • mode = 1
    # mode 1
    from tbe import tik
    tik_instance = tik.Tik()
    src0_gm = tik_instance.Tensor("float16", (128,), name="src0_gm", scope=tik.scope_gm)
    sel_gm = tik_instance.Tensor("uint16", (16,), name="sel_gm", scope=tik.scope_gm)
    dst_gm = tik_instance.Tensor("float16", (128,), name="dst_gm", scope=tik.scope_gm)
    src0_ub = tik_instance.Tensor("float16", (128,), name="src0_ub", scope=tik.scope_ubuf)
    sel_ub = tik_instance.Tensor("uint16", (16,), name="sel_ub", scope=tik.scope_ubuf)
    dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
    # Copy the user input to the source Unified Buffer.
    tik_instance.data_move(src0_ub, src0_gm, 0, 1, 8, 0, 0)
    tik_instance.data_move(sel_ub, sel_gm, 0, 1, 1, 0, 0)
    src1 = tik_instance.Scalar(dtype="float16", init_value=5.2)
    tik_instance.vec_sel(128, 1, dst_ub, sel_ub, src0_ub, src1, 1, 8, 8, 8)
    # Copy the compute result to the destination Global Memory.
    tik_instance.data_move(dst_gm, dst_ub, 0, 1, 8, 0, 0)
    tik_instance.BuildCCE(kernel_name="vec_sel", inputs=[src0_gm, sel_gm], outputs=[dst_gm])

    Result example:

    Input (src0_gm):
    [ 7.047   -4.61    -2.754   -8.84     0.724   -7.03    -7.41     3.191
     -5.715    5.375    7.97    -9.04    -7.51     5.297   -4.668   -2.234
     -6.406   -4.133    3.457    6.25     6.332    2.072    4.99     5.32
     -8.53    -5.195    2.773    3.496    3.469    1.08     8.55     7.95
     -8.01     0.355    5.715   -8.13     1.624   -6.242    1.843    4.48
     -7.227   -9.08     4.043    5.066    9.92    -0.2786   1.384    2.338
     -2.158    5.65    -2.76    -3.553    9.125   -4.727    4.836    1.339
      1.57     5.67    -9.89     5.16     2.68     3.041   -0.7236  -9.83
     -2.875    3.049   -2.734    5.13    -6.285   -4.58    -1.953   -3.678
     -1.075    2.342   -6.2     -1.131   -0.581   -4.168    4.375   -1.106
      3.5      6.32     8.625   -0.0924   2.592    8.586   -5.125   -2.816
     -0.981   -7.402   -1.569   -3.521   -5.89     0.1954  -3.088   -8.39
     -7.53    -8.04    -3.55     0.6875  -7.797   -0.898   -0.806    1.412
      0.04767 -7.867    7.44    -0.4033   5.625   -0.4336  -7.85    -6.535
      1.921   -8.984    3.691   -0.612   -6.465    1.917   -2.377    9.086
      6.625   -1.057    5.23    -4.594    3.834   -7.27     6.39    -8.81   ]
    
    Input (sel_gm):
    [34801 47741 38465 50533 40608 34912 29182 41277  8651 53727 22558  7409
     50749 47716 41029 56293]
    
    Output (dst_gm):
    [ 7.047    5.2      5.2      5.2      0.724   -7.03    -7.41     3.191
     -5.715    5.375    7.97     5.2      5.2      5.2      5.2     -2.234
     -6.406    5.2      3.457    6.25     6.332    2.072    4.99     5.2
      5.2     -5.195    5.2      3.496    3.469    1.08     5.2      7.95
     -8.01     5.2      5.2      5.2      5.2      5.2      1.843    5.2
      5.2     -9.08     4.043    5.2      9.92     5.2      5.2      2.338
     -2.158    5.2     -2.76     5.2      5.2     -4.727    4.836    5.2
      1.57     5.2     -9.89     5.2      5.2      5.2     -0.7236  -9.83
      5.2      5.2      5.2      5.2      5.2     -4.58     5.2     -3.678
      5.2      2.342   -6.2     -1.131   -0.581    5.2      5.2     -1.106
      5.2      5.2      5.2      5.2      5.2      8.586   -5.125    5.2
      5.2      5.2      5.2     -3.521    5.2      5.2      5.2     -8.39
      5.2     -8.04    -3.55     0.6875  -7.797   -0.898   -0.806    1.412
      0.04767  5.2      5.2      5.2      5.625   -0.4336  -7.85     5.2
      1.921    5.2      3.691   -0.612   -6.465    1.917    5.2      5.2
      6.625    5.2      5.2      5.2      5.2     -7.27     5.2     -8.81   ]
  • mode = 2
    # mode 2
    from tbe import tik
    tik_instance = tik.Tik()
    src0_gm = tik_instance.Tensor("float16", (128,), name="src0_gm", scope=tik.scope_gm)
    src1_gm = tik_instance.Tensor("float16", (128,), name="src1_gm", scope=tik.scope_gm)
    sel_gm = tik_instance.Tensor("uint16", (16,), name="sel_gm", scope=tik.scope_gm)
    dst_gm = tik_instance.Tensor("float16", (128,), name="dst_gm", scope=tik.scope_gm)
    src0_ub = tik_instance.Tensor("float16", (128,), name="src0_ub", scope=tik.scope_ubuf)
    src1_ub = tik_instance.Tensor("float16", (128,), name="src1_ub", scope=tik.scope_ubuf)
    sel_ub = tik_instance.Tensor("uint16", (16,), name="sel_ub", scope=tik.scope_ubuf)
    dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
    # Copy the user input to the source Unified Buffer.
    tik_instance.data_move(src0_ub, src0_gm, 0, 1, 8, 0, 0)
    tik_instance.data_move(src1_ub, src1_gm, 0, 1, 8, 0, 0)
    tik_instance.data_move(sel_ub, sel_gm, 0, 1, 1, 0, 0)
    tik_instance.vec_sel(128, 2, dst_ub, sel_ub, src0_ub, src1_ub, 1, 8, 8, 8)
    # Copy the compute result to the destination Global Memory.
    tik_instance.data_move(dst_gm, dst_ub, 0, 1, 8, 0, 0)
    tik_instance.BuildCCE(kernel_name="vec_sel", inputs=[src0_gm, src1_gm, sel_gm], outputs=[dst_gm])

    Result example:

    Input (src0_gm):
    [-2.555  -7.137   9.93    5.52   -4.195  -9.07   -8.266   0.1783 -3.87
     -5.79   -7.863  -6.137  -0.766  -6.12    8.05    5.273  -5.97   -6.73
     -7.797  -6.492  -2.367   9.76    5.523  -7.477   3.629  -9.24    2.078
     -7.43    3.941   8.9     0.981  -1.694   6.7    -8.734  -3.82   -2.44
     -8.64   -3.736  -5.23   -6.348  -5.86    7.6     0.2576 -3.514   6.043
      0.0805 -2.475   0.4766 -1.011   7.66    5.87    0.924   7.734  -2.246
      3.477   6.703   5.438  -3.555  -1.588   0.3542  7.46   -3.89   -8.98
      9.97   -6.195   2.178  -9.54   -9.2    -0.0888  0.625   9.16   -1.165
     -8.87    3.057  -2.197   4.81   -3.875  -8.07   -1.478  -1.128  -1.
      2.316  -1.426   1.563  -3.62    1.21    5.49    4.86    3.428   4.406
      9.98    9.55    3.342  -3.312  -6.234   7.082  -7.13   -9.61   -6.676
      4.758  -3.373   4.13   -8.5    -9.67    0.2986 -8.984   6.758   1.481
     -9.     -5.84    3.895   5.164  -2.203  -0.2065  4.645   0.812   5.28
     -9.42   -2.527   0.0811 -8.484   2.807  -0.178   4.258   0.1595  8.28
      9.53    2.85  ]
    
    Input (src1_gm):
    [ 8.98     7.25     7.016    8.83    -1.67     2.062    4.71    -7.613
     -4.023    6.9      0.5405  -4.277   -6.766    6.555   -9.445   -1.903
     -9.445   -0.826    6.02    -2.701    5.883    5.695    3.201    2.83
      9.99     7.68    -7.254    5.855   -6.188    9.89    -2.97     4.703
      5.332    7.63    -6.938   -5.273   -4.19     7.76    -4.133   -4.582
      1.795    8.945    7.902    9.31     1.126   -2.088   -5.78    -8.82
     -6.203    8.01    -7.64    -7.703   -7.43    -7.414   -5.523    6.207
     -5.785    5.38     4.82     7.605    9.016   -0.77     1.106   -1.48
      8.625    2.41    -5.598   -4.02     2.88    -6.11     4.29    -6.32
     -8.42     6.105   -1.016    0.834    0.8794   5.184    6.98    -3.557
     -9.91     8.336    1.978   -3.084   -5.523    3.527    9.28     6.89
      7.55     3.988   -4.633    0.02034 -2.785   -5.453    3.363   -5.215
     -0.5625   5.703   -0.4485   1.616    5.883    5.637    4.914    2.947
      5.715    2.455   -9.82    -0.541    7.15     7.21     0.768    1.718
     -0.04523 -1.464   -3.38     4.926    4.258   -9.14    -3.22    -3.625
      2.277   -2.277    3.557   -6.855    8.77     6.895    2.064   -9.57   ]
    
    Input (sel_gm):
    [25595 40859 53116 59877  7219 22186 18639 30224 65041 61683 61785  4586
     43795 50906 26534 38912]
    
    Output (dst_gm):
    [-2.555   -7.137    7.016    5.52    -4.195   -9.07    -8.266    0.1783
     -3.87    -5.79     0.5405  -4.277   -6.766   -6.12     8.05    -1.903
     -5.97    -6.73     6.02    -6.492   -2.367    5.695    3.201   -7.477
      3.629   -9.24     2.078   -7.43     3.941    9.89    -2.97    -1.694
      5.332    7.63    -3.82    -2.44    -8.64    -3.736   -5.23    -4.582
     -5.86     7.6      0.2576  -3.514    1.126   -2.088   -2.475    0.4766
     -1.011    8.01     5.87    -7.703   -7.43    -2.246    3.477    6.703
      5.438    5.38     4.82     0.3542   9.016   -3.89    -8.98     9.97
     -6.195    2.178   -5.598   -4.02    -0.0888   0.625    4.29    -6.32
     -8.42     6.105   -2.197    4.81    -3.875    5.184    6.98    -3.557
     -9.91     2.316    1.978    1.563   -5.523    1.21     9.28     4.86
      7.55     4.406    9.98     0.02034  3.342   -5.453   -6.234   -5.215
     -7.13    -9.61    -6.676    4.758    5.883    5.637   -8.5     -9.67
      5.715    2.455   -9.82     1.481    7.15     7.21     3.895    1.718
     -0.04523 -1.464   -3.38     4.926    5.28    -9.14    -3.22    -3.625
      2.277    2.807   -0.178   -6.855    0.1595   8.28     9.53    -9.57   ]
  • Example 2
    """Select between two groups of 128 source elements using vec_sel, writing 48 elements per iteration and leaving a gap of 48 elements between the results of successive iterations."""
    from tbe import tik
    tik_instance = tik.Tik()
    dtype_size = {
        "int8": 1,
        "uint8": 1,
        "int16": 2,
        "uint16": 2,
        "float16": 2,
        "int32": 4,
        "uint32": 4,
        "float32": 4,
        "int64": 8,
    }
    shape = (2, 64)
    dst_shape = (12, 16)
    dtype = "float16"
    elements = 2 * 64
    dst_element = 12 * 16
    # Number of elements processed per iteration, which is 48 in the current example.
    mask = 48
    # Repeat strides: distance between the start of one iteration and the start of the next for each operand, in units of 32 bytes.
    dst_rep_stride = 6
    src0_rep_stride = 4
    src1_rep_stride = 4
    # Number of iterations, which is 2 in the current example. You can adjust the number of iterations as required.
    repeat_times = 2
    # Instruction mode, in the range [0, 2]. In the current example, 0 is used.
    mode = 0
    src0_gm = tik_instance.Tensor(dtype, shape, name="src0_gm", scope=tik.scope_gm)
    src1_gm = tik_instance.Tensor(dtype, shape, name="src1_gm", scope=tik.scope_gm)
    src0_ub = tik_instance.Tensor(dtype, shape, name="src0_ub", scope=tik.scope_ubuf)
    src1_ub = tik_instance.Tensor(dtype, shape, name="src1_ub", scope=tik.scope_ubuf)
    dst_gm = tik_instance.Tensor(dtype, dst_shape, name="dst_gm", scope=tik.scope_gm)
    dst_ub = tik_instance.Tensor(dtype, dst_shape, name="dst_ub", scope=tik.scope_ubuf)
    # Number of moved segments.
    nburst = 1
    # Length of the moved segment each time, in 32 bytes.
    burst = elements * dtype_size[dtype] // 32 // nburst
    dst_burst = dst_element * dtype_size[dtype] // 32 // nburst
    # Stride between the previous burst tail and the next burst header, in 32 bytes.
    dst_stride, src_stride = 0, 0
    # Copy the user input to the source Unified Buffer.
    
    tik_instance.data_move(src0_ub, src0_gm, 0, nburst, burst, dst_stride, src_stride)
    tik_instance.data_move(src1_ub, src1_gm, 0, nburst, burst, dst_stride, src_stride)
    # For details about how to use vec_cmpv_le and vec_dup, see related sections.
    is_le = tik_instance.Tensor("uint16", (16,), name="is_le", scope=tik.scope_ubuf)
    tik_instance.vec_cmpv_le(is_le, src0_ub, src1_ub, 1, 8, 8)
    
    tik_instance.vec_dup(64, dst_ub, 0, 3, 4)
    # The mask is 48, that is, 48 elements are processed in each iteration. The first 11 bits of is_le are 1 (src0 values 0 to 10 are <= 10), so those elements are selected from src0. All other elements are selected from src1.
    tik_instance.vec_sel(mask, mode, dst_ub, is_le, src0_ub, src1_ub, repeat_times, dst_rep_stride, src0_rep_stride, src1_rep_stride)
    # Copy the compute result to the destination Global Memory.
    tik_instance.data_move(dst_gm, dst_ub, 0, nburst, dst_burst, dst_stride, src_stride)
    
    tik_instance.BuildCCE(kernel_name="vec_sel", inputs=[src0_gm, src1_gm], outputs=[dst_gm])
    Result example:
    Input (src0_gm):
    [[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
       14.  15.  16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.
       28.  29.  30.  31.  32.  33.  34.  35.  36.  37.  38.  39.  40.  41.
       42.  43.  44.  45.  46.  47.  48.  49.  50.  51.  52.  53.  54.  55.
       56.  57.  58.  59.  60.  61.  62.  63.]
     [ 64.  65.  66.  67.  68.  69.  70.  71.  72.  73.  74.  75.  76.  77.
       78.  79.  80.  81.  82.  83.  84.  85.  86.  87.  88.  89.  90.  91.
       92.  93.  94.  95.  96.  97.  98.  99. 100. 101. 102. 103. 104. 105.
      106. 107. 108. 109. 110. 111. 112. 113. 114. 115. 116. 117. 118. 119.
      120. 121. 122. 123. 124. 125. 126. 127.]]
    Input (src1_gm):
    [[10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.
      10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.
      10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.
      10. 10. 10. 10. 10. 10. 10. 10. 10. 10.]
     [10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.
      10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.
      10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.
      10. 10. 10. 10. 10. 10. 10. 10. 10. 10.]]
    Output (dst_gm):
    [[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 10. 10. 10. 10. 10.]
     [10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.]
     [10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.]
     [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
     [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
     [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
     [64. 65. 66. 67. 68. 69. 70. 71. 72. 73. 74. 10. 10. 10. 10. 10.]
     [10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.]
     [10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.]
     [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
     [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
     [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]
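
The output of Example 2 can be cross-checked with a plain-Python model of mode 0. This is a sketch under stated assumptions, not the hardware implementation: emulate_vec_sel_mode0 is a hypothetical helper, an integer mask is taken to enable the first mask elements of each iteration, and the list comprehension that builds sel_bits stands in for vec_cmpv_le.

```python
def emulate_vec_sel_mode0(mask_count, dst, sel_bits, src0, src1, repeat_times,
                          dst_rep_stride, src0_rep_stride, src1_rep_stride,
                          dtype_bytes=2):
    """Host-side model of vec_sel in mode 0 for flat Python lists."""
    elems_per_block = 32 // dtype_bytes  # strides are given in 32-byte blocks
    out = list(dst)
    for it in range(repeat_times):
        d0 = it * dst_rep_stride * elems_per_block
        a0 = it * src0_rep_stride * elems_per_block
        b0 = it * src1_rep_stride * elems_per_block
        # Mode 0: every iteration reuses the sel bits starting from bit 0.
        for i in range(mask_count):
            out[d0 + i] = src0[a0 + i] if sel_bits[i] else src1[b0 + i]
    return out


src0 = [float(v) for v in range(128)]
src1 = [10.0] * 128
# Stand-in for vec_cmpv_le: bit i is 1 when src0[i] <= src1[i].
sel_bits = [1 if a <= b else 0 for a, b in zip(src0, src1)]
dst = [0.0] * 192  # zero-filled, as vec_dup does in the example

out = emulate_vec_sel_mode0(48, dst, sel_bits, src0, src1, 2, 6, 4, 4)
rows = [out[r * 16:(r + 1) * 16] for r in range(12)]
for row in rows:
    print(row)
```

Rows 0 to 2 and 6 to 8 of the printed result hold the selected values, and rows 3 to 5 and 9 to 11 keep the zeros written before the call, which matches the dst_gm listing above.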