TIK Data Movement
API Prototype
For vector computation, data is stored in the Unified Buffer and then computed. The data flow is Global Memory > Unified Buffer > Global Memory. TIK provides the data_move API to implement data movement between the Global Memory and Unified Buffer. The function prototype is as follows.
data_move(dst, src, sid, nburst, burst, src_stride, dst_stride, *args, **argv)
In the data_move function prototype, pay attention to six parameters: dst, src, nburst, burst, src_stride, and dst_stride.
dst indicates the destination operand, that is, the destination address of data movement. src indicates the source operand, that is, the start address of data movement. nburst indicates the number of bursts (movement times). burst indicates the burst length of one movement, in units of 32-byte blocks. src_stride and dst_stride indicate the burst-to-burst strides of the source operand and destination operand, respectively, also in units of 32-byte blocks.
data_move supports both contiguous and noncontiguous data movement.
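To see how these parameters interact, the following pure-Python sketch (not TIK API code; the helper name is illustrative) computes the flat element offsets that a data_move with the given parameters would read and write, assuming a 32-byte block granularity:

```python
BLOCK_SIZE_BYTE = 32

def moved_offsets(nburst, burst, src_stride, dst_stride, dtype_bytes):
    """Return (src_offsets, dst_offsets): the flat element offsets read and
    written by one data_move, assuming 32-byte blocks. Strides are counted
    in blocks, between the end of one burst and the start of the next."""
    elems_per_block = BLOCK_SIZE_BYTE // dtype_bytes
    src, dst = [], []
    src_pos = dst_pos = 0
    for _ in range(nburst):
        src.extend(range(src_pos, src_pos + burst * elems_per_block))
        dst.extend(range(dst_pos, dst_pos + burst * elems_per_block))
        # Advance past this burst plus the burst-to-burst stride.
        src_pos += (burst + src_stride) * elems_per_block
        dst_pos += (burst + dst_stride) * elems_per_block
    return src, dst
```

A contiguous move of 256 int32 elements (nburst=1, burst=32, both strides 0) touches elements 0–255 on both sides; the noncontiguous examples later in this section can be checked the same way.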

Contiguous Movement
In TIK operator development, most movements of data or addresses happen in contiguous mode.
from tbe import tik
# Instantiate a tik_instance object.
tik_instance = tik.Tik()
# Define a tensor in the GM scope.
data_input_gm = tik_instance.Tensor("int32", (256,), name="data_input_gm", scope=tik.scope_gm)
# Define a tensor in the UB scope.
data_input_ub = tik_instance.Tensor("int32", (256,), name="data_input_ub", scope=tik.scope_ubuf)
# Call data_move to transfer the input tensor from GM to UB.
tik_instance.data_move(data_input_ub, data_input_gm, 0, 1, 32, 0, 0)
# Issue instructions on UB.
.............
# Move the data out.
This example allocates a 256-element int32 tensor in GM and UB, respectively, and then moves the tensor data from GM to UB. The arguments passed to the data_move call are described as follows in sequence.
tik_instance.data_move(data_input_ub, data_input_gm, 0, 1, 32, 0, 0)
- dst = data_input_ub: It is the destination tensor.
- src = data_input_gm: It is the source tensor.
- sid = 0: Pass 0 in normal cases.
- nburst = 1: There are 256 int32 elements, each 4 bytes, so the total size is 1024 bytes (256 * 4), which is far smaller than the UB size. Therefore, a single burst is enough.
- burst = 32: It is the length of the continuously transferred data. A burst (or a block) is 32 bytes, and the size of the source data is 1024 bytes. Therefore, a total of 32 blocks (1024/32) are transferred.
- src_stride = 0: The data is read contiguously and there is only one burst, so no burst-to-burst stride is needed.
- dst_stride = 0: Likewise, the single burst is written contiguously, so no stride is needed.
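The burst arithmetic in the list above can be checked with a few lines of plain Python (illustrative, not TIK code):

```python
ELEMS = 256           # number of int32 elements to move
DTYPE_BYTES = 4       # size of one int32
BLOCK_BYTES = 32      # one block (the burst unit)

total_bytes = ELEMS * DTYPE_BYTES             # 1024 bytes in total
burst = total_bytes // BLOCK_BYTES            # 32 blocks in one burst
elems_per_block = BLOCK_BYTES // DTYPE_BYTES  # 8 int32 per block
nburst = 1                                    # everything fits in one burst

print(total_bytes, burst, elems_per_block)    # 1024 32 8
```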
The following figure shows the movement of 256 int32 data elements. Each block is 32-byte long, which can store eight int32 data elements.

If the input data is too large to fit into the UB, the data in the GM must be moved to the UB and computed in multiple passes, and the results returned to the GM likewise. Assume that the available storage space of the UB is 248 KB. The code example is as follows:
from tbe import tik
tik_instance = tik.Tik()
src_gm = tik_instance.Tensor("float16", (126976, 2), name="src_gm", scope=tik.scope_gm)
dst_gm = tik_instance.Tensor("float16", (126976, 2), name="dst_gm", scope=tik.scope_gm)
dst_ub = tik_instance.Tensor("float16", (126976, ), name="dst_ub", scope=tik.scope_ubuf)
with tik_instance.for_range(0, 2) as i:
    # If the data in the Global Memory exceeds the available Unified Buffer space, move one segment to the Unified Buffer, compute it, and move it back to the Global Memory. Repeat as needed.
    tik_instance.data_move(dst_ub, src_gm[i*126976], 0, 1, 7936, 0, 0)
    with tik_instance.for_range(0, 3) as j:
        # The maximum repeat_times is 255. If all data cannot be computed at once, split the computation. To save space, src and dst share the same Unified Buffer tensor.
        tik_instance.vec_add(128, dst_ub[j*128*255], dst_ub[j*128*255], dst_ub[j*128*255], 255, 8, 8, 8)
    tik_instance.vec_add(128, dst_ub[3 * 128 * 255], dst_ub[3 * 128 * 255], dst_ub[3 * 128 * 255], 227, 8, 8, 8)
    # Move the computed data back to the Global Memory, then process the remaining data.
    tik_instance.data_move(dst_gm[i*126976], dst_ub, 0, 1, 7936, 0, 0)
tik_instance.BuildCCE(kernel_name="data_move", inputs=[src_gm], outputs=[dst_gm])
- In this example, the source and destination operands of vec_add use the same tensor in the UB, that is, their addresses overlap completely.
- TIK data movement and calculation are performed based on one-dimensional data.
- The size of the input data in the GM is 496 KB (126976 * 2 elements * 2 bytes / 1024), which exceeds the available space of the UB. Therefore, two passes are required (496 KB / 248 KB = 2). The statement is as follows.
with tik_instance.for_range(0, 2) as i:
- Each pass moves 248 KB of contiguous data, so a single burst is enough: set nburst to 1 in the data_move instruction. The movement unit is a block, that is, 32 bytes, so the number of blocks per pass is 248 * 1024 / 32 = 7936. Set burst to 7936 as follows.
tik_instance.data_move(dst_ub, src_gm[i*126976], 0, 1, 7936, 0, 0)
- The Vector Unit can process at most 256 bytes per repeat, and the maximum repeat number is 255, so at most 65280 bytes (256 * 255) can be processed by one vec_add call. To cover all the data, the number of calls is 248 * 1024 / 65280 ≈ 3.89. The first three calls each process the maximum 65280 bytes. The last call processes the remaining 58112 bytes (253952 − 3 * 65280), which takes 227 repeats (58112 / 256 = 227).
with tik_instance.for_range(0, 3) as j:
    # In the first three iterations, each iteration processes 256 * 255 bytes, that is, 128 * 255 float16. To minimize the memory footprint, src and dst use the same UB tensor.
    tik_instance.vec_add(128, dst_ub[j*128*255], dst_ub[j*128*255], dst_ub[j*128*255], 255, 8, 8, 8)
# All the remaining data is processed by the last call, which repeats 227 times.
tik_instance.vec_add(128, dst_ub[3 * 128 * 255], dst_ub[3 * 128 * 255], dst_ub[3 * 128 * 255], 227, 8, 8, 8)
- Finally, the computed data is moved back to the GM, and the next outer iteration processes the remaining data.
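The tiling arithmetic above can be reproduced in plain Python (the constants are illustrative; this is not TIK code):

```python
UB_BYTES = 248 * 1024         # assumed available UB space (248 KB)
TOTAL_BYTES = 126976 * 2 * 2  # 126976 x 2 float16, 2 bytes each
VEC_BYTES = 256               # bytes one vec_add repeat processes
MAX_REPEAT = 255              # maximum repeat_times of vec_add

outer_loops = TOTAL_BYTES // UB_BYTES     # GM-to-UB passes needed
bytes_per_call = VEC_BYTES * MAX_REPEAT   # 65280 bytes per full vec_add call
full_calls = UB_BYTES // bytes_per_call   # full-size vec_add calls per pass
tail_bytes = UB_BYTES - full_calls * bytes_per_call
tail_repeat = tail_bytes // VEC_BYTES     # repeats for the last vec_add call

print(outer_loops, full_calls, tail_repeat)  # 2 3 227
```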
Noncontiguous Movement
In the preceding example, contiguous data movement was implemented by setting both src_stride and dst_stride to 0. Noncontiguous data movement is slightly more involved.
from tbe import tik
tik_instance = tik.Tik()
data_input_gm = tik_instance.Tensor("int32", (256,), name="data_input_gm", scope=tik.scope_gm)
data_input_ub = tik_instance.Tensor("int32", (176,), name="data_input_ub", scope=tik.scope_ubuf)
# Noncontiguous data movement
tik_instance.data_move(data_input_ub, data_input_gm, 0, 4, 4, 4, 2)
.............
- nburst = 4: Four bursts are needed.
- burst = 4: Four blocks (32 int32 elements) are transferred per burst.
- src_stride = 4: Keep a 4-block (32 int32 elements) stride between bursts of the source tensor.
- dst_stride = 2: Keep a 2-block (16 int32 elements) stride between bursts of the destination tensor.
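A quick way to sanity-check a noncontiguous move is to compute the footprint each side occupies. The following plain-Python sketch (not TIK code) reproduces the numbers of this example:

```python
def footprint_blocks(nburst, burst, stride):
    """Blocks spanned from the start address to the end of the last burst,
    with a burst-to-burst stride counted in blocks."""
    return nburst * burst + (nburst - 1) * stride

ELEMS_PER_BLOCK = 8  # a 32-byte block holds 8 int32 elements

src_blocks = footprint_blocks(4, 4, 4)  # blocks read from GM
dst_blocks = footprint_blocks(4, 4, 2)  # blocks written to UB
print(src_blocks * ELEMS_PER_BLOCK, dst_blocks * ELEMS_PER_BLOCK)  # 224 176
```

The destination footprint of 176 int32 elements is exactly why data_input_ub in the example is declared with shape (176,).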
The following figure shows the movement process.

Noncontiguous address movement is rarely used. In practice, it is common to set burst to a formula of the form "element_size_to_move * DATA_TYPE_SIZE/BLOCK_SIZE_BYTE" rather than a hard-coded integer.
tik_instance.data_move(data_input_ub, data_input_gm, SID, DEFAULT_NBURST,
element_size_to_move * DATA_TYPE_SIZE // BLOCK_SIZE_BYTE,
STRIDE_ZERO, STRIDE_ZERO)
where DEFAULT_NBURST = 1, BLOCK_SIZE_BYTE = 32, and STRIDE_ZERO = 0. This convention also makes the TIK operator code easier for other readers to follow.
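A minimal sketch of that convention (the constant and function names are illustrative, not part of the TIK API):

```python
SID = 0
DEFAULT_NBURST = 1
BLOCK_SIZE_BYTE = 32
STRIDE_ZERO = 0

def burst_for(element_size_to_move, dtype_size):
    """Burst length in 32-byte blocks for a contiguous move.
    The byte count (elements x dtype size) should be a multiple
    of 32 bytes, or the tail bytes would be truncated."""
    return element_size_to_move * dtype_size // BLOCK_SIZE_BYTE

print(burst_for(256, 4))  # 256 int32 -> 32 blocks
print(burst_for(128, 2))  # 128 float16 -> 8 blocks
```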
Contiguous Movement with Offset
In some situations, you need to start a transfer from a specific offset within a tensor.
from tbe import tik
tik_instance = tik.Tik()
data_input_gm = tik_instance.Tensor("int32", (256,), name="data_input_gm", scope=tik.scope_gm)
data_input_ub = tik_instance.Tensor("int32", (256,), name="data_input_ub", scope=tik.scope_ubuf)
# Contiguous data movement with offset
tik_instance.data_move(data_input_ub[8], data_input_gm[16], 0, 1, 30, 0, 0)
.............
As shown in the figure, a total of 30 blocks are transferred. The transfer starts from the third block of the src operand (element index 16). The dst operand receives the 30 blocks starting from its second block (element index 8).

- Due to architecture differences, the alignment restrictions on the address vary depending on the Ascend AI Processor version. For details, see Table 2. In some versions, data read from or written to the UB must be 32-byte aligned, that is, the start offset in elements must be a multiple of 32 bytes divided by the element size. For example, for int32 (eight elements per block), the UB tensor start address must be a multiple of 8 elements.
- The GM, in contrast, has no such 32-byte alignment restriction, regardless of the Ascend AI Processor version. If the input data is not 32-byte aligned, you can roll the start address back so that the move meets the 32-byte alignment requirement, then move the data to the UB. After the computation is complete, move the result back to the GM in the same way, which preserves the accuracy of the compute result.
As shown in Figure 2, the input data consists of 23 float16.
GM-to-UB movement:
- For the first movement, move 16 float16 to the UB, that is, move GM[0]–GM[15] to UB[0]–UB[15].
- For the second movement, roll the start address back by 9 elements to meet the 32-byte alignment requirement, that is, move GM[7]–GM[22] to UB[16]–UB[31].
Obviously, GM[7]–GM[15] are moved to the UB twice, that is, data of UB[7]–UB[15] is the same as data of UB[16]–UB[24], as shown in the gray data blocks in the following figure.
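The rollback indices for the 23-float16 case can be derived mechanically. This plain-Python sketch (not TIK code) computes the start offset of the second movement and the size of the overlapping range:

```python
TOTAL = 23   # total float16 elements in GM
BLOCK = 16   # float16 elements per 32-byte block

first_start = 0                  # first movement covers GM[0]..GM[15]
second_start = TOTAL - BLOCK     # roll back so the second block ends at GM[22]
overlap = first_start + BLOCK - second_start  # elements moved twice

print(second_start, overlap)  # 7 9
```

The overlap of 9 elements corresponds to the gray data blocks in the figure: GM[7]–GM[15] appear in the UB twice.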
UB-to-GM movement: A code example is provided as follows:
from tbe import tik
tik_instance = tik.Tik()
src_gm = tik_instance.Tensor("float16", (23, ), name="src_gm", scope=tik.scope_gm)
dst_gm = tik_instance.Tensor("float16", (23, ), name="dst_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float16", (32, ), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (32, ), name="dst_ub", scope=tik.scope_ubuf)
tik_instance.vec_dup(32, src_ub, 0, 1, 1)
tik_instance.vec_dup(32, dst_ub, 0, 1, 1)
with tik_instance.for_range(0, 2) as i:
    # Two movements are performed. The first is already 32-byte aligned; for the second, the start address is rolled back so the move to the Unified Buffer is 32-byte aligned.
    tik_instance.data_move(src_ub[i*16], src_gm[i*(23-16)], 0, 1, 1, 0, 0)
tik_instance.vec_add(32, dst_ub, src_ub, src_ub, 1, 1, 1, 1)
# Movement from the Unified Buffer to the Global Memory works the same way: the first move writes the 32-byte-aligned data to the Global Memory; for the second, the Global Memory address is rolled back so that, with 32-byte alignment met, the remaining Unified Buffer data is stored.
with tik_instance.for_range(0, 2) as i:
    tik_instance.data_move(dst_gm[i*(23-16)], dst_ub[i*16], 0, 1, 1, 0, 0)
tik_instance.BuildCCE(kernel_name="data_move", inputs=[src_gm], outputs=[dst_gm])
Exercise
Develop a basic TIK operator to implement the basic data movement (in and out) function as follows:
- Create a tensor space in GM and UB, respectively. Keep the space as small as possible to avoid memory waste.
- Move-in part: Transfer 129 float16 elements from the GM to the UB. Read from the GM at address GM[2] contiguously. Write to the UB at a stride of 16 float16 elements between every 16 float16 elements.
- Move-out part: Transfer 127 int32 elements from the UB to the GM. Read from the UB at address UB[32], in bursts of 16 int32 elements with a stride of 32 int32 elements between bursts. Write to the GM contiguously.
Note: Assume that the start address of the UB needs to be 32-byte aligned.
[Key]
from tbe import tik
tik_instance = tik.Tik()
data_input_gm = tik_instance.Tensor("float16", (146,), name="data_input_gm", scope=tik.scope_gm)
data_input_ub = tik_instance.Tensor("float16", (272,), name="data_input_ub", scope=tik.scope_ubuf)
tik_instance.data_move(data_input_ub, data_input_gm[2], 0, 9, 1, 0, 1)
.............
data_output_gm = tik_instance.Tensor("int32", (128,), name="data_output_gm", scope=tik.scope_gm)
data_output_ub = tik_instance.Tensor("int32", (384,), name="data_output_ub", scope=tik.scope_ubuf)
tik_instance.data_move(data_output_gm, data_output_ub[32], 0, 8, 2, 4, 0)
[Explanation]
GM to UB movement: There are 129 float16 elements to move, and each block holds 16 float16 elements, so nine blocks are needed. Data is read contiguously from the GM starting at element index 2; the GM has no address alignment restriction, so the GM tensor needs 146 float16 elements (9 x 16 + 2). To write to the UB with a stride of 16 elements between every 16 elements, a UB space of 272 float16 elements (9 x (16 + 16) – 16) is needed. One block is moved per burst (burst=1), and nine blocks in total, so nine bursts are needed (nburst=9). The source side is contiguous (src_stride=0), and the destination side has a stride of 16 float16 elements, that is, one block (dst_stride=1). Therefore, the arguments in data_move are 9, 1, 0, and 1.
UB to GM movement: There are 127 int32 elements to move, and each block holds eight int32 elements, so 16 blocks are needed; the GM tensor needs 128 int32 elements (16 x 8). To read the UB starting from element 32, in 16-element bursts with a stride of 32 elements between bursts, and considering the 32-byte start-address alignment requirement, a UB space of 384 int32 elements (8 x (16 + 32) – 32 + 32) is needed. Each burst moves 16 int32 elements, that is, two blocks (burst=2), and 16 blocks in total, so eight bursts are needed (nburst=8). The source side has a stride of 32 int32 elements, that is, four blocks (src_stride=4), and the destination side is contiguous (dst_stride=0). Therefore, the arguments in data_move are 8, 2, 4, and 0.
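The sizing arithmetic in the key can be verified with plain Python (illustrative, not TIK code):

```python
# Move-in: 129 float16, read contiguously from GM[2],
# written to the UB with 16-element gaps between 16-element blocks.
F16_PER_BLOCK = 16
nburst_in = -(-129 // F16_PER_BLOCK)          # ceil(129/16) = 9 blocks
gm_in_elems = 2 + nburst_in * F16_PER_BLOCK   # 146-element GM tensor
ub_in_elems = nburst_in * (16 + 16) - 16      # 272-element UB tensor

# Move-out: 127 int32, read from UB[32] in 16-element bursts with
# 32-element gaps, written contiguously to the GM.
I32_PER_BLOCK = 8
blocks_out = -(-127 // I32_PER_BLOCK)         # ceil(127/8) = 16 blocks
gm_out_elems = blocks_out * I32_PER_BLOCK     # 128-element GM tensor
nburst_out = blocks_out // 2                  # 2 blocks per burst -> 8 bursts
ub_out_elems = nburst_out * (16 + 32) - 32 + 32  # 384-element UB tensor

print(gm_in_elems, ub_in_elems, gm_out_elems, ub_out_elems)  # 146 272 128 384
```

These are exactly the tensor shapes declared in the key: (146,) and (272,) for float16, (128,) and (384,) for int32.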
