data_move

Description

Moves data from src to dst. Both src and dst are Tensors; they may reside in different memory scopes (see Table 2).

Atlas 200/300/500 Inference Product: UB->UB/UB->OUT/OUT->UB/OUT->L1

Atlas Training Series Product: UB->UB/UB->OUT/OUT->UB/OUT->L1

Prototype

data_move(dst, src, sid, nburst, burst, src_stride, dst_stride, *args, **argv)

Parameters

Table 1 Parameter description

dst (Output)
    Destination operand. For details about the data type restrictions, see Table 2.
    If the scope of dst is the L1 Buffer or Unified Buffer, the address offset must be 32-byte aligned.

src (Input)
    Source operand. For details about the data type restrictions, see Table 2.
    If the scope of src is the L1 Buffer or Unified Buffer, the address offset must be 32-byte aligned.

sid (Input)
    SMMU ID, which is hardware-related and reserved. Must be a Scalar (int32), an immediate (int32), or an Expr (int32), in the range [0, 15]. The value 0 is recommended.

nburst (Input)
    Number of bursts to move. Must be a Scalar (int32), an immediate (int32), or an Expr (int32), in the range [1, 4095].

burst (Input)
    Burst length of a contiguous data transfer, in units of 32 bytes, in the range [1, 65535]. Must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.

src_stride (Input)
    Tail-to-header stride between adjacent bursts of the source tensor, in the range [0, 65535]. Same type requirements as burst.

dst_stride (Input)
    Tail-to-header stride between adjacent bursts of the destination tensor, in the range [0, 65535]. Same type requirements as burst.

*args (Input)
    Extended arguments.

**argv (Input)
    Extended keyword arguments.
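Because burst is expressed in 32-byte units rather than elements, it is easy to pass an element count by mistake. The following helper is illustrative only (not part of the TIK API); it converts an element count and element size into a burst length and rejects totals that are not 32-byte aligned:

```python
def burst_len(num_elems: int, dtype_bytes: int, block_bytes: int = 32) -> int:
    """Burst length in 32-byte blocks for a contiguous transfer of
    num_elems elements of dtype_bytes bytes each."""
    total = num_elems * dtype_bytes
    if total % block_bytes != 0:
        raise ValueError(f"{total} bytes is not {block_bytes}-byte aligned")
    return total // block_bytes

# 512 float16 elements -> 512 * 2 / 32 = 32 blocks of 32 bytes
print(burst_len(512, 2))  # 32
```

This matches the burst = 32 used for the 512-element float16 transfer in Example 1 below.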

Table 2 Data types, scopes, and parameter units related to data_move

All supported transfer paths share the same dtype list and the same parameter units:

Supported paths (src.scope -> dst.scope): OUT -> L1, L1 -> OUT, OUT -> UB, UB -> OUT, UB -> UB

dtype (src and dst must have the same dtype): uint8, int8, float16, uint16, int16, float32, int32, uint32, uint64, int64

Unit of burst, src_stride, and dst_stride: 32 bytes
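The strides are tail-to-header: the gap between bursts starts where the previous burst ends. As an illustrative sketch (not part of the TIK API), the byte offset at which each burst begins can be modeled as:

```python
def burst_offsets(nburst: int, burst: int, stride: int, block_bytes: int = 32) -> list:
    """Byte offset of the start of each burst, where burst and the
    tail-to-header stride are both given in 32-byte blocks."""
    step = (burst + stride) * block_bytes
    return [i * step for i in range(nburst)]

# 3 bursts of 2 blocks each, with a 1-block gap after each burst:
print(burst_offsets(3, 2, 1))  # [0, 96, 192]
```

With stride 0 the bursts are contiguous, which is the mode used in all three examples below.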

Applicability

Atlas 200/300/500 Inference Product

Atlas Training Series Product

Restrictions

None

Returns

None

Examples

  • Example 1: All data is transferred at once.
    from tbe import tik
    tik_instance = tik.Tik()
    src_gm = tik_instance.Tensor("float16", (512,), name="src_gm", scope=tik.scope_gm)
    dst_gm = tik_instance.Tensor("float16", (512,), name="dst_gm", scope=tik.scope_gm)
    tensor_ub = tik_instance.Tensor("float16", (512,), name="tensor_ub", scope=tik.scope_ubuf)
    # Move the user input from the Global Memory to the Unified Buffer.
# nburst is the number of bursts; burst is the data length per burst, in units of
# 32 bytes; src_stride and dst_stride are the tail-to-header strides between adjacent bursts.
# For better performance, reduce the number of bursts (nburst) and increase the data
# length per burst (burst) as far as possible.
# Here nburst = 1 transfers all data at once; burst = 512 x 2 / 32 = 32, i.e. 32 blocks
# of 32 bytes; src_stride = 0 and dst_stride = 0 mean the data is contiguous.
    tik_instance.data_move(tensor_ub, src_gm, 0, 1, 32, 0, 0)
    # Copy the compute result to the destination Global Memory.
    tik_instance.data_move(dst_gm, tensor_ub, 0, 1, 32, 0, 0)
    
    tik_instance.BuildCCE(kernel_name="data_move", inputs=[src_gm], outputs=[dst_gm])

    Result example:

    Input (src_gm):
    [-1.539    1.418   -6.418   -9.55    -9.336   -6.484    4.117    7.914
     -2.012    3.201   -5.375    7.32     4.       9.99     3.502    8.27
     -8.125   -9.33    -6.812   -8.695    9.87    -4.914   -5.992    1.233
     -3.662    3.477   -4.9     -3.924    6.438    8.266    7.31     8.97
      6.06    -3.646    9.695    0.623    9.84    -5.234    4.715   -8.07
      ...
      2.19     3.709   -3.611   -6.97    -0.772   -0.5938   2.953    7.043
      6.63     8.55     1.873    0.1703  -0.715   -5.35    -4.52     7.31
      4.336   -7.113   -5.875    9.44    -2.812   -6.5     -0.742   -6.703
      3.297   -7.605    0.3582  -1.62     2.578   -6.35    -2.166    9.95
      4.57     2.746    9.88    -3.354    5.645   -6.434   -2.32     2.59   ]
    
    Output (dst_gm):
    [-1.539    1.418   -6.418   -9.55    -9.336   -6.484    4.117    7.914
     -2.012    3.201   -5.375    7.32     4.       9.99     3.502    8.27
     -8.125   -9.33    -6.812   -8.695    9.87    -4.914   -5.992    1.233
     -3.662    3.477   -4.9     -3.924    6.438    8.266    7.31     8.97
      6.06    -3.646    9.695    0.623    9.84    -5.234    4.715   -8.07
      ...
      2.19     3.709   -3.611   -6.97    -0.772   -0.5938   2.953    7.043
      6.63     8.55     1.873    0.1703  -0.715   -5.35    -4.52     7.31
      4.336   -7.113   -5.875    9.44    -2.812   -6.5     -0.742   -6.703
      3.297   -7.605    0.3582  -1.62     2.578   -6.35    -2.166    9.95
      4.57     2.746    9.88    -3.354    5.645   -6.434   -2.32     2.59   ]
  • Example 2: The input is not 32-byte aligned.
    from tbe import tik
    tik_instance = tik.Tik()
    src_gm = tik_instance.Tensor("float16", (23, ), name="src_gm", scope=tik.scope_gm)
    dst_gm = tik_instance.Tensor("float16", (23, ), name="dst_gm", scope=tik.scope_gm)
    src_ub = tik_instance.Tensor("float16", (32, ), name="src_ub", scope=tik.scope_ubuf)
    dst_ub = tik_instance.Tensor("float16", (32, ), name="dst_ub", scope=tik.scope_ubuf)
    tik_instance.vec_dup(32, src_ub, 0, 1, 1)
    tik_instance.vec_dup(32, dst_ub, 0, 1, 1)
    with tik_instance.for_range(0, 2) as i:
    # The move is performed twice. The first move copies the first 16 elements (32 bytes,
    # aligned). The second move rolls the source address back to element 7 so that the
    # remaining 16 elements also form an aligned 32-byte block; the 9 overlapping
    # elements are simply loaded twice.
        tik_instance.data_move(src_ub[i*16], src_gm[i*(23-16)], 0, 1, 1, 0, 0)
    tik_instance.vec_add(32, dst_ub, src_ub, src_ub, 1, 1, 1, 1)
# The move from the Unified Buffer to Global Memory works the same way: the first store
# writes an aligned 32-byte block, and the second rolls the Global Memory address back so
# the remaining data is also stored as an aligned block; the overlap is rewritten with
# identical values, so the result stays correct.
    with tik_instance.for_range(0, 2) as i:
        tik_instance.data_move(dst_gm[i*(23-16)], dst_ub[i*16], 0, 1, 1, 0, 0)
    tik_instance.BuildCCE(kernel_name="data_move", inputs=[src_gm], outputs=[dst_gm])

    Input (src_gm):
    [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
     18. 19. 20. 21. 22.]

    Output (dst_gm):
    [ 0.  2.  4.  6.  8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28. 30. 32. 34.
     36. 38. 40. 42. 44.]
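The overlapping-move trick in Example 2 can be checked on the host. The following NumPy sketch (an illustration, not TIK code) mimics the two aligned loads, the add, and the two aligned stores:

```python
import numpy as np

n, block = 23, 16              # 16 float16 elements = 32 bytes
src = np.arange(n, dtype=np.float16)
ub = np.zeros(2 * block, dtype=np.float16)

# Two aligned loads: the second rolls the source index back to n - block = 7,
# so the last 16 elements land 32-byte aligned in the buffer (overlapping
# elements 7..15 of the first load).
for i in range(2):
    ub[i * block:(i + 1) * block] = src[i * (n - block):i * (n - block) + block]

out_ub = ub + ub               # the vec_add step

# Two aligned stores; the second rewrites the overlap with identical values.
dst = np.zeros(n, dtype=np.float16)
for i in range(2):
    dst[i * (n - block):i * (n - block) + block] = out_ub[i * block:(i + 1) * block]

print(dst.tolist()[:4])        # [0.0, 2.0, 4.0, 6.0]
```

The final dst equals 2 * src, matching the dst_gm output shown above.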

  • Example 3: The size of the input exceeds that of the Unified Buffer.
    In this example, the source operands and destination operand of vec_add use the same Tensor, that is, the addresses overlap completely. Assume that the available space of the Unified Buffer is 248 KB. The code example is as follows:
    from tbe import tik
    tik_instance = tik.Tik()
    src_gm = tik_instance.Tensor("float16", (126976, 2), name="src_gm", scope=tik.scope_gm)
    dst_gm = tik_instance.Tensor("float16", (126976, 2), name="dst_gm", scope=tik.scope_gm)
    dst_ub = tik_instance.Tensor("float16", (126976, ), name="dst_ub", scope=tik.scope_ubuf)
    with tik_instance.for_range(0, 2) as i:
        # If the data in the Global Memory exceeds the maximum memory of the Unified Buffer, move a segment to the Unified Buffer for calculation. After the calculation is complete, move the data back to the Global Memory. The process can be repeated multiple times.
        tik_instance.data_move(dst_ub, src_gm[i*126976], 0, 1, 7936, 0, 0)
        with tik_instance.for_range(0, 3) as j:
            # The maximum value of repeat_times is 255. If all data cannot be calculated at once, multiple calculations are recommended. To save space, both src and dst are the same Unified Buffer.
            tik_instance.vec_add(128, dst_ub[j*128*255], dst_ub[j*128*255], dst_ub[j*128*255], 255, 8, 8, 8)     
        tik_instance.vec_add(128, dst_ub[3 * 128 * 255], dst_ub[3 * 128 * 255], dst_ub[3 * 128 * 255], 227, 8, 8, 8)
        # Move the calculated data back to the Global Memory and then calculate the remaining data.
        tik_instance.data_move(dst_gm[i*126976], dst_ub, 0, 1, 7936, 0, 0)
    tik_instance.BuildCCE(kernel_name="data_move", inputs=[src_gm], outputs=[dst_gm])

    Input (src_gm):

    [2. 2. 2. ... 2. 2. 2.]

    Output (dst_gm):

    [4. 4. 4. ... 4. 4. 4.]
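The tile sizes in Example 3 follow directly from the stated 248 KB of usable Unified Buffer. The arithmetic can be reproduced in plain Python (the constants below restate the assumptions from the example):

```python
UB_BYTES   = 248 * 1024          # assumed usable Unified Buffer, per the example
DTYPE_SIZE = 2                   # float16

tile_elems = UB_BYTES // DTYPE_SIZE          # elements per tile
burst      = tile_elems * DTYPE_SIZE // 32   # burst length in 32-byte blocks

# vec_add processes 128 float16 lanes per repeat, at most 255 repeats per
# call, so each tile splits into full calls plus a remainder call.
MASK, MAX_REPEAT = 128, 255
full_calls, rem = divmod(tile_elems // MASK, MAX_REPEAT)
print(tile_elems, burst, full_calls, rem)    # 126976 7936 3 227
```

These are exactly the 126976-element tiles, burst of 7936, three 255-repeat vec_add calls, and the final 227-repeat call used in the code above.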