data_move

Description

Moves data from src to dst. Both src and dst are Tensors; they may reside in different memory scopes (see Table 2).

Atlas 200/300/500 Inference Product: UB->UB/UB->OUT/OUT->UB/OUT->L1

Atlas Training Series Product: UB->UB/UB->OUT/OUT->UB/OUT->L1

Prototype

data_move(dst, src, sid, nburst, burst, src_stride, dst_stride, *args, **argv)

Parameters

Table 1 Parameter description

dst (Output)
    Destination operand. For details about the data type restrictions, see Table 2.
    If the scope of dst is the L1 Buffer or Unified Buffer, the address offset must be 32-byte aligned.

src (Input)
    Source operand. For details about the data type restrictions, see Table 2.
    If the scope of src is the L1 Buffer or Unified Buffer, the address offset must be 32-byte aligned.

sid (Input)
    SMMU ID, which is hardware-related and reserved. Must be a Scalar (int32), an immediate (int32), or an Expr (int32), in the range [0, 15]. The value 0 is recommended.

nburst (Input)
    Number of bursts to move. Must be a Scalar (int32), an immediate (int32), or an Expr (int32), in the range [1, 4095].

burst (Input)
    Burst length of a contiguous data transfer, in units of 32 bytes, in the range [1, 65535]. Must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.

src_stride (Input)
    Tail-to-header stride between adjacent bursts of the source tensor, in the range [0, 65535]. Same type requirements as burst.

dst_stride (Input)
    Tail-to-header stride between adjacent bursts of the destination tensor, in the range [0, 65535]. Same type requirements as burst.

*args (Input)
    Extended arguments.

**argv (Input)
    Extended keyword arguments.
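Because burst is expressed in 32-byte units rather than elements, it is easy to pass an element count by mistake. The following helper is illustrative only (not part of the TIK API); it converts an element count and element size into a burst length and rejects totals that are not 32-byte aligned:

```python
def burst_len(num_elems: int, dtype_bytes: int, block_bytes: int = 32) -> int:
    """Burst length in 32-byte blocks for a contiguous transfer of
    num_elems elements of dtype_bytes bytes each."""
    total = num_elems * dtype_bytes
    if total % block_bytes != 0:
        raise ValueError(f"{total} bytes is not {block_bytes}-byte aligned")
    return total // block_bytes

# 512 float16 elements -> 512 * 2 / 32 = 32 blocks of 32 bytes
print(burst_len(512, 2))  # 32
```

This matches the burst = 32 used for the 512-element float16 transfer in Example 1 below.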

Table 2 Data types, scopes, and parameter units related to data_move

All supported transfer paths share the same dtype list and the same parameter units:

Supported paths (src.scope -> dst.scope): OUT -> L1, L1 -> OUT, OUT -> UB, UB -> OUT, UB -> UB

dtype (src and dst must have the same dtype): uint8, int8, float16, uint16, int16, float32, int32, uint32, uint64, int64

Unit of burst, src_stride, and dst_stride: 32 bytes
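The strides are tail-to-header: the gap between bursts starts where the previous burst ends. As an illustrative sketch (not part of the TIK API), the byte offset at which each burst begins can be modeled as:

```python
def burst_offsets(nburst: int, burst: int, stride: int, block_bytes: int = 32) -> list:
    """Byte offset of the start of each burst, where burst and the
    tail-to-header stride are both given in 32-byte blocks."""
    step = (burst + stride) * block_bytes
    return [i * step for i in range(nburst)]

# 3 bursts of 2 blocks each, with a 1-block gap after each burst:
print(burst_offsets(3, 2, 1))  # [0, 96, 192]
```

With stride 0 the bursts are contiguous, which is the mode used in all three examples below.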

Applicability

Atlas 200/300/500 Inference Product

Atlas Training Series Product

Restrictions

None

Returns

None

Examples

  • Example 1: All data is transferred at once.
    from tbe import tik
    tik_instance = tik.Tik()
    src_gm = tik_instance.Tensor("float16", (512,), name="src_gm", scope=tik.scope_gm)
    dst_gm = tik_instance.Tensor("float16", (512,), name="dst_gm", scope=tik.scope_gm)
    tensor_ub = tik_instance.Tensor("float16", (512,), name="tensor_ub", scope=tik.scope_ubuf)
    # Move the user input from the Global Memory to the Unified Buffer.
# nburst is the number of bursts; burst is the data length per burst, in units of
# 32 bytes; src_stride and dst_stride are the tail-to-header strides between adjacent bursts.
# For better performance, reduce the number of bursts (nburst) and increase the data
# length per burst (burst) as far as possible.
# Here nburst = 1 transfers all data at once; burst = 512 x 2 / 32 = 32, i.e. 32 blocks
# of 32 bytes; src_stride = 0 and dst_stride = 0 mean the data is contiguous.
    tik_instance.data_move(tensor_ub, src_gm, 0, 1, 32, 0, 0)
    # Copy the compute result to the destination Global Memory.
    tik_instance.data_move(dst_gm, tensor_ub, 0, 1, 32, 0, 0)
    
    tik_instance.BuildCCE(kernel_name="data_move", inputs=[src_gm], outputs=[dst_gm])

    Result example:

    Input (src_gm):
    [-1.539    1.418   -6.418   -9.55    -9.336   -6.484    4.117    7.914
     -2.012    3.201   -5.375    7.32     4.       9.99     3.502    8.27
     -8.125   -9.33    -6.812   -8.695    9.87    -4.914   -5.992    1.233
     -3.662    3.477   -4.9     -3.924    6.438    8.266    7.31     8.97
      6.06    -3.646    9.695    0.623    9.84    -5.234    4.715   -8.07
      ...
      2.19     3.709   -3.611   -6.97    -0.772   -0.5938   2.953    7.043
      6.63     8.55     1.873    0.1703  -0.715   -5.35    -4.52     7.31
      4.336   -7.113   -5.875    9.44    -2.812   -6.5     -0.742   -6.703
      3.297   -7.605    0.3582  -1.62     2.578   -6.35    -2.166    9.95
      4.57     2.746    9.88    -3.354    5.645   -6.434   -2.32     2.59   ]
    
    Output (dst_gm):
    [-1.539    1.418   -6.418   -9.55    -9.336   -6.484    4.117    7.914
     -2.012    3.201   -5.375    7.32     4.       9.99     3.502    8.27
     -8.125   -9.33    -6.812   -8.695    9.87    -4.914   -5.992    1.233
     -3.662    3.477   -4.9     -3.924    6.438    8.266    7.31     8.97
      6.06    -3.646    9.695    0.623    9.84    -5.234    4.715   -8.07
      ...
      2.19     3.709   -3.611   -6.97    -0.772   -0.5938   2.953    7.043
      6.63     8.55     1.873    0.1703  -0.715   -5.35    -4.52     7.31
      4.336   -7.113   -5.875    9.44    -2.812   -6.5     -0.742   -6.703
      3.297   -7.605    0.3582  -1.62     2.578   -6.35    -2.166    9.95
      4.57     2.746    9.88    -3.354    5.645   -6.434   -2.32     2.59   ]
  • Example 2: The input is not 32-byte aligned.
    from tbe import tik
    tik_instance = tik.Tik()
    src_gm = tik_instance.Tensor("float16", (23, ), name="src_gm", scope=tik.scope_gm)
    dst_gm = tik_instance.Tensor("float16", (23, ), name="dst_gm", scope=tik.scope_gm)
    src_ub = tik_instance.Tensor("float16", (32, ), name="src_ub", scope=tik.scope_ubuf)
    dst_ub = tik_instance.Tensor("float16", (32, ), name="dst_ub", scope=tik.scope_ubuf)
    tik_instance.vec_dup(32, src_ub, 0, 1, 1)
    tik_instance.vec_dup(32, dst_ub, 0, 1, 1)
    with tik_instance.for_range(0, 2) as i:
    # The move is performed twice. The first move copies the first 16 elements (32 bytes,
    # aligned). The second move rolls the source address back to element 7 so that the
    # remaining 16 elements also form an aligned 32-byte block; the 9 overlapping
    # elements are simply loaded twice.
        tik_instance.data_move(src_ub[i*16], src_gm[i*(23-16)], 0, 1, 1, 0, 0)
    tik_instance.vec_add(32, dst_ub, src_ub, src_ub, 1, 1, 1, 1)
# The move from the Unified Buffer to Global Memory works the same way: the first store
# writes an aligned 32-byte block, and the second rolls the Global Memory address back so
# the remaining data is also stored as an aligned block; the overlap is rewritten with
# identical values, so the result stays correct.
    with tik_instance.for_range(0, 2) as i:
        tik_instance.data_move(dst_gm[i*(23-16)], dst_ub[i*16], 0, 1, 1, 0, 0)
    tik_instance.BuildCCE(kernel_name="data_move", inputs=[src_gm], outputs=[dst_gm])

    Input (src_gm):
    [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
     18. 19. 20. 21. 22.]

    Output (dst_gm):
    [ 0.  2.  4.  6.  8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28. 30. 32. 34.
     36. 38. 40. 42. 44.]
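The overlapping-move trick in Example 2 can be checked on the host. The following NumPy sketch (an illustration, not TIK code) mimics the two aligned loads, the add, and the two aligned stores:

```python
import numpy as np

n, block = 23, 16              # 16 float16 elements = 32 bytes
src = np.arange(n, dtype=np.float16)
ub = np.zeros(2 * block, dtype=np.float16)

# Two aligned loads: the second rolls the source index back to n - block = 7,
# so the last 16 elements land 32-byte aligned in the buffer (overlapping
# elements 7..15 of the first load).
for i in range(2):
    ub[i * block:(i + 1) * block] = src[i * (n - block):i * (n - block) + block]

out_ub = ub + ub               # the vec_add step

# Two aligned stores; the second rewrites the overlap with identical values.
dst = np.zeros(n, dtype=np.float16)
for i in range(2):
    dst[i * (n - block):i * (n - block) + block] = out_ub[i * block:(i + 1) * block]

print(dst.tolist()[:4])        # [0.0, 2.0, 4.0, 6.0]
```

The final dst equals 2 * src, matching the dst_gm output shown above.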

  • Example 3: The size of the input exceeds that of the Unified Buffer.
    In this example, the source operands and destination operand of vec_add use the same Tensor, that is, the addresses overlap completely. Assume that the available space of the Unified Buffer is 248 KB. The code example is as follows:
    from tbe import tik
    tik_instance = tik.Tik()
    src_gm = tik_instance.Tensor("float16", (126976, 2), name="src_gm", scope=tik.scope_gm)
    dst_gm = tik_instance.Tensor("float16", (126976, 2), name="dst_gm", scope=tik.scope_gm)
    dst_ub = tik_instance.Tensor("float16", (126976, ), name="dst_ub", scope=tik.scope_ubuf)
    with tik_instance.for_range(0, 2) as i:
        # If the data in the Global Memory exceeds the maximum memory of the Unified Buffer, move a segment to the Unified Buffer for calculation. After the calculation is complete, move the data back to the Global Memory. The process can be repeated multiple times.
        tik_instance.data_move(dst_ub, src_gm[i*126976], 0, 1, 7936, 0, 0)
        with tik_instance.for_range(0, 3) as j:
            # The maximum value of repeat_times is 255. If all data cannot be calculated at once, multiple calculations are recommended. To save space, both src and dst are the same Unified Buffer.
            tik_instance.vec_add(128, dst_ub[j*128*255], dst_ub[j*128*255], dst_ub[j*128*255], 255, 8, 8, 8)     
        tik_instance.vec_add(128, dst_ub[3 * 128 * 255], dst_ub[3 * 128 * 255], dst_ub[3 * 128 * 255], 227, 8, 8, 8)
        # Move the calculated data back to the Global Memory and then calculate the remaining data.
        tik_instance.data_move(dst_gm[i*126976], dst_ub, 0, 1, 7936, 0, 0)
    tik_instance.BuildCCE(kernel_name="data_move", inputs=[src_gm], outputs=[dst_gm])

    Input (src_gm):

    [2. 2. 2. ... 2. 2. 2.]

    Output (dst_gm):

    [4. 4. 4. ... 4. 4. 4.]
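The tile sizes in Example 3 follow directly from the stated 248 KB of usable Unified Buffer. The arithmetic can be reproduced in plain Python (the constants below restate the assumptions from the example):

```python
UB_BYTES   = 248 * 1024          # assumed usable Unified Buffer, per the example
DTYPE_SIZE = 2                   # float16

tile_elems = UB_BYTES // DTYPE_SIZE          # elements per tile
burst      = tile_elems * DTYPE_SIZE // 32   # burst length in 32-byte blocks

# vec_add processes 128 float16 lanes per repeat, at most 255 repeats per
# call, so each tile splits into full calls plus a remainder call.
MASK, MAX_REPEAT = 128, 255
full_calls, rem = divmod(tile_elems // MASK, MAX_REPEAT)
print(tile_elems, burst, full_calls, rem)    # 126976 7936 3 227
```

These are exactly the 126976-element tiles, burst of 7936, three 255-repeat vec_add calls, and the final 227-repeat call used in the code above.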