data_move
Description
Moves data from src to dst. Both src and dst can be Tensors.
Prototype
data_move(dst, src, sid, nburst, burst, src_stride, dst_stride, *args, **argv)
Parameters
| Parameter | Input/Output | Description |
|---|---|---|
| dst | Output | Destination operand. For details about the data type restrictions, see Table 2. If the scope of dst is the L1 Buffer or Unified Buffer, the address offset must be 32-byte aligned. |
| src | Input | Source operand. For details about the data type restrictions, see Table 2. If the scope of src is the L1 Buffer or Unified Buffer, the address offset must be 32-byte aligned. |
| sid | Input | SMMU ID, which is hardware-related and reserved. Must be a Scalar (int32), an immediate (int32), or an Expr (int32). Must be in the range of [0, 15]. The value 0 is recommended. |
| nburst | Input | Number of bursts to move. Must be a Scalar (int32), an immediate (int32), or an Expr (int32). Must be in the range of [1, 4095]. |
| burst | Input | Length of each burst of contiguous data, in units of 32 bytes. Must be in the range of [1, 65535]. Must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. |
| src_stride | Input | Tail-to-header stride between adjacent bursts of the source tensor. Must be in the range of [0, 65535]. Must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. |
| dst_stride | Input | Tail-to-header stride between adjacent bursts of the destination tensor. Must be in the range of [0, 65535]. Must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64. |
| `*args` | Input | Number of extended arguments |
| `**argv` | Input | Extended arguments |
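When the arguments are plain Python immediates, the ranges listed above can be checked on the host before the instruction is issued. The following is a minimal sketch in plain Python. `check_data_move_args` and `burst_for` are hypothetical helper names, not part of TIK, and the check does not apply to Scalar or Expr arguments whose values are known only at run time.

```python
# Hypothetical host-side helpers (not part of the TIK API).
BLOCK_BYTES = 32  # unit of burst, src_stride and dst_stride


def check_data_move_args(sid, nburst, burst, src_stride, dst_stride):
    """Raise ValueError if an immediate argument is outside its documented range."""
    limits = {
        "sid": (sid, 0, 15),
        "nburst": (nburst, 1, 4095),
        "burst": (burst, 1, 65535),
        "src_stride": (src_stride, 0, 65535),
        "dst_stride": (dst_stride, 0, 65535),
    }
    for name, (value, low, high) in limits.items():
        if not isinstance(value, int) or not low <= value <= high:
            raise ValueError(f"{name}={value!r} is outside [{low}, {high}]")


def burst_for(num_elements, dtype_bytes):
    """Burst length in 32-byte units covering num_elements contiguous elements, rounded up."""
    total_bytes = num_elements * dtype_bytes
    return (total_bytes + BLOCK_BYTES - 1) // BLOCK_BYTES


burst = burst_for(512, 2)  # 512 float16 elements, 2 bytes each
check_data_move_args(0, 1, burst, 0, 0)
print(burst)  # 32
```

Because `burst_for` rounds up, a length that is not a multiple of 32 bytes yields a burst that covers bytes past the end of the data, so both buffers must be large enough for the rounded-up size.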
Table 2 Data type restrictions and units for src and dst

| src.scope | dst.scope | dtype (src and dst have the same dtype) | burst unit | src_stride unit | dst_stride unit |
|---|---|---|---|---|---|
| OUT | L1 | uint8, int8, float16, uint16, int16, float32, int32, uint32, uint64, int64 | 32 bytes | 32 bytes | 32 bytes |
| L1 | OUT | uint8, int8, float16, uint16, int16, float32, int32, uint32, uint64, int64 | 32 bytes | 32 bytes | 32 bytes |
| OUT | UB | uint8, int8, float16, uint16, int16, float32, int32, uint32, uint64, int64 | 32 bytes | 32 bytes | 32 bytes |
| UB | OUT | uint8, int8, float16, uint16, int16, float32, int32, uint32, uint64, int64 | 32 bytes | 32 bytes | 32 bytes |
| UB | UB | uint8, int8, float16, uint16, int16, float32, int32, uint32, uint64, int64 | 32 bytes | 32 bytes | 32 bytes |
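Because the strides are measured from the tail of one burst to the head of the next, the head-to-head distance between adjacent bursts is burst + stride, all in 32-byte blocks. The sketch below models this addressing on flat Python lists in which each element stands for one 32-byte block. `burst_offsets` and `simulate_data_move` are hypothetical names used for illustration only; this is a host-side model of the addressing, not the hardware implementation.

```python
def burst_offsets(nburst, burst, stride):
    """Start offset (in 32-byte blocks) of each burst for a tail-to-header stride."""
    return [k * (burst + stride) for k in range(nburst)]


def simulate_data_move(dst, src, nburst, burst, src_stride, dst_stride):
    """Copy nburst bursts of `burst` blocks from src to dst (lists of blocks), in place."""
    src_starts = burst_offsets(nburst, burst, src_stride)
    dst_starts = burst_offsets(nburst, burst, dst_stride)
    for s, d in zip(src_starts, dst_starts):
        dst[d:d + burst] = src[s:s + burst]
    return dst


src = list(range(12))   # 12 source blocks, labelled 0..11
dst = [None] * 6        # 6 destination blocks
# Three bursts of 2 blocks: skip 2 blocks between bursts in src, pack them densely in dst.
simulate_data_move(dst, src, nburst=3, burst=2, src_stride=2, dst_stride=0)
print(dst)  # [0, 1, 4, 5, 8, 9]
```

With both strides set to 0, the bursts are adjacent on both sides and the transfer is one contiguous copy of nburst * burst blocks.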
Applicability
Restrictions
None
Returns
None
Examples
- Example 1: All data is transferred at once.
    from tbe import tik

    tik_instance = tik.Tik()
    src_gm = tik_instance.Tensor("float16", (512,), name="src_gm", scope=tik.scope_gm)
    dst_gm = tik_instance.Tensor("float16", (512,), name="dst_gm", scope=tik.scope_gm)
    tensor_ub = tik_instance.Tensor("float16", (512,), name="tensor_ub", scope=tik.scope_ubuf)
    # Move the user input from the Global Memory to the Unified Buffer.
    # nburst indicates the transfer times of blocks, equal to the number of bursts.
    # burst indicates the data length per burst, in the unit of 32 bytes.
    # src_stride and dst_stride indicate the burst-to-burst strides.
    # To improve the running performance, it is advisable to reduce the transfer times (nburst)
    # and increase the data length per burst (burst) as far as possible.
    # In this example, nburst = 1 indicates that all data is transferred at once.
    # burst = 512 * 2 // 32 = 32 indicates that the burst length is 32.
    # src_stride = 0 and dst_stride = 0 indicate that the data is transferred in a contiguous mode.
    tik_instance.data_move(tensor_ub, src_gm, 0, 1, 32, 0, 0)
    # Copy the compute result to the destination Global Memory.
    tik_instance.data_move(dst_gm, tensor_ub, 0, 1, 32, 0, 0)
    tik_instance.BuildCCE(kernel_name="data_move", inputs=[src_gm], outputs=[dst_gm])
Result example:
Input (src_gm):
[-1.539 1.418 -6.418 -9.55 -9.336 -6.484 4.117 7.914 -2.012 3.201 -5.375 7.32 4. 9.99 3.502 8.27 -8.125 -9.33 -6.812 -8.695 9.87 -4.914 -5.992 1.233 -3.662 3.477 -4.9 -3.924 6.438 8.266 7.31 8.97 6.06 -3.646 9.695 0.623 9.84 -5.234 4.715 -8.07 ... 2.19 3.709 -3.611 -6.97 -0.772 -0.5938 2.953 7.043 6.63 8.55 1.873 0.1703 -0.715 -5.35 -4.52 7.31 4.336 -7.113 -5.875 9.44 -2.812 -6.5 -0.742 -6.703 3.297 -7.605 0.3582 -1.62 2.578 -6.35 -2.166 9.95 4.57 2.746 9.88 -3.354 5.645 -6.434 -2.32 2.59 ]
Output (dst_gm):
[-1.539 1.418 -6.418 -9.55 -9.336 -6.484 4.117 7.914 -2.012 3.201 -5.375 7.32 4. 9.99 3.502 8.27 -8.125 -9.33 -6.812 -8.695 9.87 -4.914 -5.992 1.233 -3.662 3.477 -4.9 -3.924 6.438 8.266 7.31 8.97 6.06 -3.646 9.695 0.623 9.84 -5.234 4.715 -8.07 ... 2.19 3.709 -3.611 -6.97 -0.772 -0.5938 2.953 7.043 6.63 8.55 1.873 0.1703 -0.715 -5.35 -4.52 7.31 4.336 -7.113 -5.875 9.44 -2.812 -6.5 -0.742 -6.703 3.297 -7.605 0.3582 -1.62 2.578 -6.35 -2.166 9.95 4.57 2.746 9.88 -3.354 5.645 -6.434 -2.32 2.59 ]
- Example 2: The input size (23 float16 elements, 46 bytes) is not a multiple of 32 bytes.
    from tbe import tik

    tik_instance = tik.Tik()
    src_gm = tik_instance.Tensor("float16", (23, ), name="src_gm", scope=tik.scope_gm)
    dst_gm = tik_instance.Tensor("float16", (23, ), name="dst_gm", scope=tik.scope_gm)
    src_ub = tik_instance.Tensor("float16", (32, ), name="src_ub", scope=tik.scope_ubuf)
    dst_ub = tik_instance.Tensor("float16", (32, ), name="dst_ub", scope=tik.scope_ubuf)
    tik_instance.vec_dup(32, src_ub, 0, 1, 1)
    tik_instance.vec_dup(32, dst_ub, 0, 1, 1)
    with tik_instance.for_range(0, 2) as i:
        # The data is moved in two bursts of 32 bytes (16 float16 elements) each.
        # The first burst moves elements 0 to 15 from the start of the Global Memory tensor.
        # The second burst starts at Global Memory offset 23 - 16 = 7, so that it ends at the
        # last element, and lands in the second 32-byte block of the Unified Buffer.
        tik_instance.data_move(src_ub[i*16], src_gm[i*(23-16)], 0, 1, 1, 0, 0)
    tik_instance.vec_add(32, dst_ub, src_ub, src_ub, 1, 1, 1, 1)
    # Movement from the Unified Buffer to the Global Memory works the same way.
    # The first burst writes the first 32-byte block to the start of the Global Memory tensor.
    # For the second burst, the Global Memory address is rolled back to offset 7 so that
    # the remaining block in the Unified Buffer ends at the last element.
    with tik_instance.for_range(0, 2) as i:
        tik_instance.data_move(dst_gm[i*(23-16)], dst_ub[i*16], 0, 1, 1, 0, 0)
    tik_instance.BuildCCE(kernel_name="data_move", inputs=[src_gm], outputs=[dst_gm])
Input (src_gm):
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17.
18. 19. 20. 21. 22.]
Output (dst_gm):
[ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28. 30. 32. 34.
36. 38. 40. 42. 44.]
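The overlap technique in Example 2 extends to any length of at least one block: move full blocks from the start, then move one final block whose start is pulled back so that it ends exactly at the last element. A sketch of the offset calculation in plain Python follows, assuming 16 float16 elements per 32-byte block. `aligned_move_offsets` is a hypothetical helper, not a TIK function.

```python
def aligned_move_offsets(num_elements, elems_per_block=16):
    """Element offsets of the one-block moves that together cover num_elements elements.

    Full blocks are taken from the start. If a tail remains, the last move is
    pulled back to start at num_elements - elems_per_block, so it stays inside
    the tensor and overlaps the previous block.
    """
    if num_elements < elems_per_block:
        raise ValueError("tensor is shorter than one 32-byte block")
    offsets = list(range(0, num_elements - elems_per_block + 1, elems_per_block))
    last = num_elements - elems_per_block
    if offsets[-1] != last:
        offsets.append(last)
    return offsets


print(aligned_move_offsets(23))  # [0, 7]
print(aligned_move_offsets(40))  # [0, 16, 24]
print(aligned_move_offsets(32))  # [0, 16]
```

For 23 elements the result matches the Global Memory offsets `i*(23-16)` used in Example 2. The overlapping elements are written twice with identical values, which is harmless for an element-wise operation such as vec_add.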
- Example 3: The size of the input exceeds that of the Unified Buffer. In this example, the source operands and the destination operand of vec_add use the same Tensor, that is, their addresses overlap completely. Assume that the available space of the Unified Buffer is 248 KB. The code example is as follows:
    from tbe import tik

    tik_instance = tik.Tik()
    src_gm = tik_instance.Tensor("float16", (126976, 2), name="src_gm", scope=tik.scope_gm)
    dst_gm = tik_instance.Tensor("float16", (126976, 2), name="dst_gm", scope=tik.scope_gm)
    dst_ub = tik_instance.Tensor("float16", (126976, ), name="dst_ub", scope=tik.scope_ubuf)
    with tik_instance.for_range(0, 2) as i:
        # If the data in the Global Memory exceeds the capacity of the Unified Buffer,
        # move one segment to the Unified Buffer for calculation. After the calculation is
        # complete, move the data back to the Global Memory. Repeat as many times as needed.
        tik_instance.data_move(dst_ub, src_gm[i*126976], 0, 1, 7936, 0, 0)
        with tik_instance.for_range(0, 3) as j:
            # The maximum value of repeat_times is 255. If all data cannot be calculated
            # in one call, split the calculation into multiple calls.
            # To save space, src and dst use the same Unified Buffer tensor.
            tik_instance.vec_add(128, dst_ub[j*128*255], dst_ub[j*128*255], dst_ub[j*128*255], 255, 8, 8, 8)
        tik_instance.vec_add(128, dst_ub[3 * 128 * 255], dst_ub[3 * 128 * 255], dst_ub[3 * 128 * 255], 227, 8, 8, 8)
        # Move the calculated data back to the Global Memory, then process the remaining data.
        tik_instance.data_move(dst_gm[i*126976], dst_ub, 0, 1, 7936, 0, 0)
    tik_instance.BuildCCE(kernel_name="data_move", inputs=[src_gm], outputs=[dst_gm])
Input (src_gm):
[2. 2. 2. ... 2. 2. 2.]
Output (dst_gm):
[4. 4. 4. ... 4. 4. 4.]
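The loop bounds and the burst length in Example 3 follow from integer arithmetic: each vec_add call there processes at most 255 repeats of 128 elements, and each data_move covers one Unified Buffer segment of 126976 float16 elements. A sketch of that calculation follows. `split_repeats` is a hypothetical helper, and the mask of 128 and the repeat limit of 255 are taken from the example above.

```python
def split_repeats(num_elements, mask=128, max_repeat=255):
    """Return (full_calls, tail_repeats) needed to cover num_elements elements.

    full_calls is the number of calls that use max_repeat repeats;
    tail_repeats is the repeat count of the final, shorter call (0 if none).
    """
    if num_elements % mask != 0:
        raise ValueError("this sketch assumes num_elements is a multiple of mask")
    total_repeats = num_elements // mask
    return divmod(total_repeats, max_repeat)


segment_elements = 126976                 # float16 elements per Unified Buffer segment
full_calls, tail_repeats = split_repeats(segment_elements)
print(full_calls, tail_repeats)           # 3 227
print(segment_elements * 2 // 32)         # 7936 (burst length in 32-byte units)
print(segment_elements * 2 // 1024)       # 248 (segment size in KB)
```

The three full calls correspond to the inner `for_range(0, 3)` loop, and the tail of 227 repeats corresponds to the separate vec_add call that starts at offset `3 * 128 * 255`.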