General Definition

Description

This is a generic format for an instruction with two source operands.

Note that this section describes only the general format; it does not correspond to a specific out-of-the-box instruction.

Prototype

instruction (mask, dst, src0, src1, repeat_times, dst_rep_stride, src0_rep_stride, src1_rep_stride)

Parameters

Table 1 Parameter description

  • instruction (Input): A string specifying the instruction name. Only lowercase letters are supported in TIK DSL.
  • mask (Input): For details, see the general description of the mask parameter.
  • dst (Output): Destination operand, which points to the start element of the tensor. The supported data types vary depending on the specific instruction. The scope of the tensor is the Unified Buffer.
  • src0 (Input): Source operand 0, which points to the start element of the tensor. The supported data types vary depending on the specific instruction. The scope of the tensor is the Unified Buffer.
  • src1 (Input): Source operand 1, which points to the start element of the tensor. The supported data types vary depending on the specific instruction. The scope of the tensor is the Unified Buffer.
  • repeat_times (Input): Number of iterations (repeats).
  • dst_rep_stride (Input): Repeat stride of the destination operand between the corresponding blocks of successive iterations.
  • src0_rep_stride (Input): Repeat stride of source operand 0 between the corresponding blocks of successive iterations.
  • src1_rep_stride (Input): Repeat stride of source operand 1 between the corresponding blocks of successive iterations.
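The comments in the examples below show full bitwise masks such as [2**64-1, 2**64-1]. As an illustrative sketch (plain Python, not TIK code), a bitwise mask enabling only the first N of the 128 float16 lanes in a repeat can be built as a [mask_h, mask_l] pair of 64-bit integers; the 128-lane count and the high/low ordering are assumptions taken from the example comments:

```python
def bitwise_mask(active_lanes, total_lanes=128):
    """Return [mask_h, mask_l] enabling the first `active_lanes` lanes.

    Lane i participates when bit i is set: bits 0-63 live in mask_l,
    bits 64-127 in mask_h (high word first, as in the example comments).
    """
    if not 0 < active_lanes <= total_lanes:
        raise ValueError("active_lanes out of range")
    full = (1 << active_lanes) - 1       # active_lanes low bits set
    mask_l = full & ((1 << 64) - 1)      # low 64 lanes
    mask_h = full >> 64                  # high 64 lanes
    return [mask_h, mask_l]

print(bitwise_mask(128))   # all 128 lanes enabled
print(bitwise_mask(32))    # first 32 lanes only
```

For example, bitwise_mask(128) yields [2**64-1, 2**64-1] and bitwise_mask(32) yields [0, 2**32-1], matching the mask comments in the contiguous and discontiguous examples.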

Restrictions

  • repeat_times ∈ [0, 255]. Must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int (other than 0), or an Expr of type int16/int32/int64/uint16/uint32/uint64.
  • The parallelism degree in each repeat depends on the data precision and SoC version. The following uses PAR to represent the parallelism degree.
  • dst_rep_stride, src0_rep_stride, and src1_rep_stride are in units of 32 bytes. Each must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
  • To save memory space, you can define a tensor reused by the source and destination operands (which means they have overlapped addresses). The general instruction restrictions are as follows.
    • In the event of a single repeat (repeat_times = 1), the source operand must completely overlap the destination operand.
    • In the event of multiple repeats (repeat_times > 1), if there is a dependency between the source operand and the destination operand, that is, the destination operand of the Nth iteration is the source operand of the (N+1)th iteration, address overlapping is not allowed. For vec_add, vec_sub, vec_mul, vec_max, vec_min, vec_and, and vec_or, address overlapping is supported in the following cases: (1) the data type is float16, int32, or float32, and the destination operand completely overlaps the second source operand; (2) src1_rep_stride = dst_rep_stride = 0; (3) src0 and src1 do not overlap.
  • For details about the alignment requirements of the operand address offset, see General Restrictions.
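To see how mask, repeat_times, and the rep_stride parameters fit together for contiguously stored data, the following sketch (plain Python, not TIK code) splits a workload into full repeats. PAR = 128 elements per repeat and a 32-byte block are assumptions based on float16; tail handling for element counts that are not a multiple of PAR is deliberately out of scope:

```python
BYTES_PER_BLOCK = 32  # one block is 32 bytes

def plan_repeats(elements, dtype_bytes=2, par=128):
    """Return (mask, repeat_times, rep_stride) for a contiguous workload
    whose element count is a multiple of `par` (128 lanes for float16)."""
    if elements % par != 0:
        raise ValueError("tail handling not covered by this sketch")
    mask = par                    # process a full repeat's worth of lanes
    repeat_times = elements // par
    # Contiguous layout: each repeat advances by exactly par elements.
    rep_stride = par * dtype_bytes // BYTES_PER_BLOCK
    return mask, repeat_times, rep_stride

print(plan_repeats(2 * 512))   # the 1024-element float16 workload
```

For the 1024-element float16 workload of the contiguous example this yields mask = 128, repeat_times = 8, and a rep_stride of 8 blocks, matching the values used there.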

Example

  • Example of contiguous data operations
    from tbe import tik
    tik_instance = tik.Tik()
    dtype_size = {
        "int8": 1,
        "uint8": 1,
        "int16": 2,
        "uint16": 2,
        "float16": 2,
        "int32": 4,
        "uint32": 4,
        "float32": 4,
        "int64": 8,
    }
    shape = (2, 512)
    dtype = "float16"
    elements = 2 * 512
    # Number of operations per iteration, which is 128 in the current example. In bitwise mode, mask can be represented as [2**64-1, 2**64-1].
    mask = 128
    # Number of iterations, which is 8 in the current example. You can adjust the number of iterations as required.
    repeat_times = 8
    # Iteration stride between the previous repeat header and the next repeat header of the destination operand. The unit is 32 bytes. In the current example, the destination operand is placed contiguously. If data does not need to be processed contiguously, adjust the corresponding parameter.
    dst_rep_stride = 8
    # Iteration stride between the previous repeat header and the next repeat header of the source operand. The unit is 32 bytes. In the current example, the source operand is read contiguously. If data does not need to be processed contiguously, adjust the corresponding parameter.
    src0_rep_stride = 8
    src1_rep_stride = 8
    src0_gm = tik_instance.Tensor(dtype, shape, name="src0_gm", scope=tik.scope_gm)
    src1_gm = tik_instance.Tensor(dtype, shape, name="src1_gm", scope=tik.scope_gm)
    dst_gm = tik_instance.Tensor(dtype, shape, name="dst_gm", scope=tik.scope_gm)
    src0_ub = tik_instance.Tensor(dtype, shape, name="src0_ub", scope=tik.scope_ubuf)
    src1_ub = tik_instance.Tensor(dtype, shape, name="src1_ub", scope=tik.scope_ubuf)
    dst_ub = tik_instance.Tensor(dtype, shape, name="dst_ub", scope=tik.scope_ubuf)
    # Number of moved segments.
    nburst = 1
    # Length of each moved segment, in units of 32 bytes.
    burst = elements * dtype_size[dtype] // 32 // nburst
    # Stride between the previous burst tail and the next burst header, in units of 32 bytes.
    dst_stride, src_stride = 0, 0
    # Copy the user input to the source Unified Buffer.
    tik_instance.data_move(src0_ub, src0_gm, 0, nburst, burst, src_stride, dst_stride)
    tik_instance.data_move(src1_ub, src1_gm, 0, nburst, burst, src_stride, dst_stride)
    tik_instance.vec_add(mask, dst_ub, src0_ub, src1_ub, repeat_times, dst_rep_stride, src0_rep_stride, src1_rep_stride)
    # Copy the compute result to the destination Global Memory.
    tik_instance.data_move(dst_gm, dst_ub, 0, nburst, burst, src_stride, dst_stride)
    tik_instance.BuildCCE(kernel_name="vec_add", inputs=[src0_gm, src1_gm], outputs=[dst_gm])
    
    Result example:
    Input (src0_gm):
    [[0.000e+00 1.000e+00 2.000e+00 3.000e+00 4.000e+00 5.000e+00 6.000e+00
      7.000e+00 8.000e+00 9.000e+00 1.000e+01 1.100e+01 1.200e+01 1.300e+01
      1.400e+01 1.500e+01 1.600e+01 1.700e+01 1.800e+01 1.900e+01 2.000e+01
      2.100e+01 2.200e+01 2.300e+01 2.400e+01 2.500e+01 2.600e+01 2.700e+01
      2.800e+01 2.900e+01 3.000e+01 3.100e+01 3.200e+01 3.300e+01 3.400e+01
      3.500e+01 3.600e+01 3.700e+01 3.800e+01 3.900e+01 4.000e+01 4.100e+01
      4.200e+01 4.300e+01 4.400e+01 4.500e+01 4.600e+01 4.700e+01 4.800e+01
      4.900e+01 5.000e+01 5.100e+01 5.200e+01 5.300e+01 5.400e+01 5.500e+01
      5.600e+01 5.700e+01 5.800e+01 5.900e+01 6.000e+01 6.100e+01 6.200e+01
      6.300e+01 6.400e+01 6.500e+01 6.600e+01 6.700e+01 6.800e+01 6.900e+01
      7.000e+01 7.100e+01 7.200e+01 7.300e+01 7.400e+01 7.500e+01 7.600e+01
      7.700e+01 7.800e+01 7.900e+01 8.000e+01 8.100e+01 8.200e+01 8.300e+01
      8.400e+01 8.500e+01 8.600e+01 8.700e+01 8.800e+01 8.900e+01 9.000e+01
      9.100e+01 9.200e+01 9.300e+01 9.400e+01 9.500e+01 9.600e+01 9.700e+01
      ...
      1.009e+03 1.010e+03 1.011e+03 1.012e+03 1.013e+03 1.014e+03 1.015e+03
      1.016e+03 1.017e+03 1.018e+03 1.019e+03 1.020e+03 1.021e+03 1.022e+03
      1.023e+03]]
    Input (src1_gm):
    [[0.000e+00 1.000e+00 2.000e+00 3.000e+00 4.000e+00 5.000e+00 6.000e+00
      7.000e+00 8.000e+00 9.000e+00 1.000e+01 1.100e+01 1.200e+01 1.300e+01
      1.400e+01 1.500e+01 1.600e+01 1.700e+01 1.800e+01 1.900e+01 2.000e+01
      2.100e+01 2.200e+01 2.300e+01 2.400e+01 2.500e+01 2.600e+01 2.700e+01
      2.800e+01 2.900e+01 3.000e+01 3.100e+01 3.200e+01 3.300e+01 3.400e+01
      3.500e+01 3.600e+01 3.700e+01 3.800e+01 3.900e+01 4.000e+01 4.100e+01
      4.200e+01 4.300e+01 4.400e+01 4.500e+01 4.600e+01 4.700e+01 4.800e+01
      4.900e+01 5.000e+01 5.100e+01 5.200e+01 5.300e+01 5.400e+01 5.500e+01
      5.600e+01 5.700e+01 5.800e+01 5.900e+01 6.000e+01 6.100e+01 6.200e+01
      6.300e+01 6.400e+01 6.500e+01 6.600e+01 6.700e+01 6.800e+01 6.900e+01
      7.000e+01 7.100e+01 7.200e+01 7.300e+01 7.400e+01 7.500e+01 7.600e+01
      7.700e+01 7.800e+01 7.900e+01 8.000e+01 8.100e+01 8.200e+01 8.300e+01
      8.400e+01 8.500e+01 8.600e+01 8.700e+01 8.800e+01 8.900e+01 9.000e+01
      9.100e+01 9.200e+01 9.300e+01 9.400e+01 9.500e+01 9.600e+01 9.700e+01
      ...
      1.009e+03 1.010e+03 1.011e+03 1.012e+03 1.013e+03 1.014e+03 1.015e+03
      1.016e+03 1.017e+03 1.018e+03 1.019e+03 1.020e+03 1.021e+03 1.022e+03
      1.023e+03]]
    Output (dst_gm):
    [[0.000e+00 2.000e+00 4.000e+00 6.000e+00 8.000e+00 1.000e+01 1.200e+01
      1.400e+01 1.600e+01 1.800e+01 2.000e+01 2.200e+01 2.400e+01 2.600e+01
      2.800e+01 3.000e+01 3.200e+01 3.400e+01 3.600e+01 3.800e+01 4.000e+01
      4.200e+01 4.400e+01 4.600e+01 4.800e+01 5.000e+01 5.200e+01 5.400e+01
      5.600e+01 5.800e+01 6.000e+01 6.200e+01 6.400e+01 6.600e+01 6.800e+01
      7.000e+01 7.200e+01 7.400e+01 7.600e+01 7.800e+01 8.000e+01 8.200e+01
      8.400e+01 8.600e+01 8.800e+01 9.000e+01 9.200e+01 9.400e+01 9.600e+01
      9.800e+01 1.000e+02 1.020e+02 1.040e+02 1.060e+02 1.080e+02 1.100e+02
      1.120e+02 1.140e+02 1.160e+02 1.180e+02 1.200e+02 1.220e+02 1.240e+02
      1.260e+02 1.280e+02 1.300e+02 1.320e+02 1.340e+02 1.360e+02 1.380e+02
      1.400e+02 1.420e+02 1.440e+02 1.460e+02 1.480e+02 1.500e+02 1.520e+02
      1.540e+02 1.560e+02 1.580e+02 1.600e+02 1.620e+02 1.640e+02 1.660e+02
      1.680e+02 1.700e+02 1.720e+02 1.740e+02 1.760e+02 1.780e+02 1.800e+02
      1.820e+02 1.840e+02 1.860e+02 1.880e+02 1.900e+02 1.920e+02 1.940e+02
      ...
      2.018e+03 2.020e+03 2.022e+03 2.024e+03 2.026e+03 2.028e+03 2.030e+03
      2.032e+03 2.034e+03 2.036e+03 2.038e+03 2.040e+03 2.042e+03 2.044e+03
      2.046e+03]]
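As a quick sanity check on the arithmetic in the contiguous example above (plain Python, same values as the example): the burst length covers the whole tensor in one data_move, and mask × repeat_times touches every element exactly once.

```python
# Recompute the contiguous example's transfer and repeat parameters.
elements = 2 * 512                     # float16 elements in the tensor
dtype_bytes = 2                        # sizeof(float16)
burst = elements * dtype_bytes // 32   # data_move length in 32-byte blocks
mask, repeat_times = 128, 8            # as set in the example

print(burst)                 # 64 blocks of 32 bytes = 2048 bytes
print(mask * repeat_times)   # 1024 lanes processed = every element once
```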
  • Example of discontiguous data operations
    """
    Add the 160 source operands in the two groups using vec_add, and place the data at an interval of 32 operands for every 32 operands.
    """
    tik_instance = tik.Tik()
    dtype_size = {
        "int8": 1,
        "uint8": 1,
        "int16": 2,
        "uint16": 2,
        "float16": 2,
        "int32": 4,
        "uint32": 4,
        "float32": 4,
        "int64": 8,
    }
    
    src_shape = (5, 32)
    dst_shape = (10, 32)
    dtype = "float16"
    elements = 5 * 32
    dst_elements = 10 * 32
    # Number of operations per iteration, which is 32 in the current example. In bitwise mode, mask can be represented as [0, 2**32-1].
    mask = 32
    # Number of iterations, which is 5 in the current example. You can adjust the number of iterations as required.
    repeat_times = 5
    # Iteration stride between the previous repeat header and the next repeat header of the destination operand. The unit is 32 bytes. Because there are 32 operations in each iteration and every 32 operands are interspaced with another 32 operands, the destination operand needs to be placed at an iteration interval of four blocks.
    dst_rep_stride = 4
    src0_rep_stride = 2
    src1_rep_stride = 2
    src0_gm = tik_instance.Tensor(dtype, src_shape, name="src0_gm", scope=tik.scope_gm)
    src1_gm = tik_instance.Tensor(dtype, src_shape, name="src1_gm", scope=tik.scope_gm)
    dst_gm = tik_instance.Tensor(dtype, dst_shape, name="dst_gm", scope=tik.scope_gm)
    src0_ub = tik_instance.Tensor(dtype, src_shape, name="src0_ub", scope=tik.scope_ubuf)
    src1_ub = tik_instance.Tensor(dtype, src_shape, name="src1_ub", scope=tik.scope_ubuf)
    dst_ub = tik_instance.Tensor(dtype, dst_shape, name="dst_ub", scope=tik.scope_ubuf)
    # Number of moved segments.
    nburst = 1
    # Length of each moved segment, in units of 32 bytes.
    burst = elements * dtype_size[dtype] // 32 // nburst
    dst_burst = dst_elements * dtype_size[dtype] // 32 // nburst
    # Stride between the previous burst tail and the next burst header, in units of 32 bytes.
    dst_stride, src_stride = 0, 0
    # Copy the user input to the source Unified Buffer.
    tik_instance.data_move(src0_ub, src0_gm, 0, nburst, burst, src_stride, dst_stride)
    tik_instance.data_move(src1_ub, src1_gm, 0, nburst, burst, src_stride, dst_stride)
    # Initialize dst_ub to 0 with vec_dup so that the skipped destination blocks hold zeros. For details, see the vec_dup instruction description.
    tik_instance.vec_dup(64, dst_ub, 0, 5, 4)
    tik_instance.vec_add(mask, dst_ub, src0_ub, src1_ub, repeat_times, dst_rep_stride, src0_rep_stride, src1_rep_stride)
    # Copy the compute result to the destination Global Memory.
    tik_instance.data_move(dst_gm, dst_ub, 0, nburst, dst_burst, src_stride, dst_stride)
    tik_instance.BuildCCE(kernel_name="vec_add", inputs=[src0_gm, src1_gm], outputs=[dst_gm])
    
    
    Result example:
    Input (src0_gm):
    [[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
       14.  15.  16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.
       28.  29.  30.  31.]
     [ 32.  33.  34.  35.  36.  37.  38.  39.  40.  41.  42.  43.  44.  45.
       46.  47.  48.  49.  50.  51.  52.  53.  54.  55.  56.  57.  58.  59.
       60.  61.  62.  63.]
     [ 64.  65.  66.  67.  68.  69.  70.  71.  72.  73.  74.  75.  76.  77.
       78.  79.  80.  81.  82.  83.  84.  85.  86.  87.  88.  89.  90.  91.
       92.  93.  94.  95.]
     [ 96.  97.  98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109.
      110. 111. 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123.
      124. 125. 126. 127.]
     [128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139. 140. 141.
      142. 143. 144. 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155.
      156. 157. 158. 159.]]
    Input (src1_gm):
    [[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
       14.  15.  16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.
       28.  29.  30.  31.]
     [ 32.  33.  34.  35.  36.  37.  38.  39.  40.  41.  42.  43.  44.  45.
       46.  47.  48.  49.  50.  51.  52.  53.  54.  55.  56.  57.  58.  59.
       60.  61.  62.  63.]
     [ 64.  65.  66.  67.  68.  69.  70.  71.  72.  73.  74.  75.  76.  77.
       78.  79.  80.  81.  82.  83.  84.  85.  86.  87.  88.  89.  90.  91.
       92.  93.  94.  95.]
     [ 96.  97.  98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109.
      110. 111. 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123.
      124. 125. 126. 127.]
     [128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139. 140. 141.
      142. 143. 144. 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155.
      156. 157. 158. 159.]]
    Output (dst_gm):
    [[  0.   2.   4.   6.   8.  10.  12.  14.  16.  18.  20.  22.  24.  26.
       28.  30.  32.  34.  36.  38.  40.  42.  44.  46.  48.  50.  52.  54.
       56.  58.  60.  62.]
     [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
        0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
        0.   0.   0.   0.]
     [ 64.  66.  68.  70.  72.  74.  76.  78.  80.  82.  84.  86.  88.  90.
       92.  94.  96.  98. 100. 102. 104. 106. 108. 110. 112. 114. 116. 118.
      120. 122. 124. 126.]
     [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
        0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
        0.   0.   0.   0.]
     [128. 130. 132. 134. 136. 138. 140. 142. 144. 146. 148. 150. 152. 154.
      156. 158. 160. 162. 164. 166. 168. 170. 172. 174. 176. 178. 180. 182.
      184. 186. 188. 190.]
     [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
        0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
        0.   0.   0.   0.]
     [192. 194. 196. 198. 200. 202. 204. 206. 208. 210. 212. 214. 216. 218.
      220. 222. 224. 226. 228. 230. 232. 234. 236. 238. 240. 242. 244. 246.
      248. 250. 252. 254.]
     [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
        0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
        0.   0.   0.   0.]
     [256. 258. 260. 262. 264. 266. 268. 270. 272. 274. 276. 278. 280. 282.
      284. 286. 288. 290. 292. 294. 296. 298. 300. 302. 304. 306. 308. 310.
      312. 314. 316. 318.]
     [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
        0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
        0.   0.   0.   0.]]
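The interleaved placement produced by the discontiguous example can be modeled in plain Python (no TIK, no hardware) to check the expected output pattern. The 16-element block size is an assumption based on float16 (32 bytes per block); `model_vec_add` is a hypothetical reference model, not a TIK API:

```python
BLOCK_ELEMS = 16  # one 32-byte block holds 16 float16 elements

def model_vec_add(mask, dst, src0, src1, repeat_times,
                  dst_rep_stride, src0_rep_stride, src1_rep_stride):
    """Mutate `dst` the way vec_add places results across repeats:
    each repeat r starts at r * rep_stride * BLOCK_ELEMS in each operand
    and processes `mask` consecutive elements."""
    for r in range(repeat_times):
        d = r * dst_rep_stride * BLOCK_ELEMS
        a = r * src0_rep_stride * BLOCK_ELEMS
        b = r * src1_rep_stride * BLOCK_ELEMS
        for i in range(mask):
            dst[d + i] = src0[a + i] + src1[b + i]

src = list(range(160))        # same values as src0_gm and src1_gm
dst = [0.0] * 320             # pre-zeroed, like the vec_dup call
model_vec_add(32, dst, src, src, 5, 4, 2, 2)
print(dst[:32])    # first result group: 0, 2, ..., 62
print(dst[32:64])  # the skipped gap: 32 zeros
```

With dst_rep_stride = 4 (64 elements) and src rep strides of 2 (32 elements), each group of 32 sums is followed by a 32-element gap of zeros, reproducing the alternating pattern shown in the output above.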