TIK Vector Computation
API Prototype
TIK provides a rich set of APIs for scheduling the compute resources of the Vector Unit, and setting their parameters properly is essential. The single-input and multi-input vector compute APIs show more similarities than differences, so the following uses the single-input vector compute APIs as an example to illustrate the basic principles of an API call. The function prototype of single-input vector computing is as follows:
instruction(mask, dst, src, repeat_times, dst_rep_stride, src_rep_stride)
- The mask parameter specifies whether each element participates in vector computing. The Vector Unit is able to compute up to eight blocks (256 bytes) in parallel. To use this full width, set mask to the maximum value of the corresponding data type. If the data to be computed occupies fewer than eight blocks, which means the Vector Unit cannot be fully utilized, set mask based on the actual data volume. Note that TIK provides two modes for assigning the value of mask: contiguous mode and bitwise mode. The contiguous mode is easier to control, while the bitwise mode is more flexible but also more sophisticated.
- Parameters dst and src indicate the destination operand and source operand, respectively. They also specify where Vector Unit's access starts.
- Parameters repeat_times, dst_rep_stride, and src_rep_stride are especially important in compute API calls. In the TIK API of the current version, the Vector Unit reads 256 contiguous bytes for computation each time. To read the complete data for processing, the Vector Unit needs to read the input data in multiple repeat times.
- The repeat_times parameter indicates the number of repeats performed by a single API call. Since each additional API call adds latency, batching these repeats into a single call greatly reduces unnecessary execution overhead and improves overall execution efficiency. The maximum value of repeat_times is 255 due to limitations of the Ascend AI Processor hardware.
- As shown in Figure 1, src_rep_stride indicates the start address stride (in blocks) between repeat times of the source operand and dst_rep_stride indicates that of the destination operand. The two parameters are collectively referred to as *_rep_stride for convenience in the following description.
Assume that the same tensor is used for both the destination and source operands (which means that the destination and source operands have overlapping addresses), and *_rep_stride is set to 8. The Vector Unit then reads eight contiguous blocks in the first repeat and the next eight contiguous blocks in the second repeat, so it can complete the computation of all input data in multiple repeats. For details, see Contiguous Address Computation.
Note that *_rep_stride is assigned with special values in special scenarios.
- When repeat_times > 1 and *_rep_stride > 8 (for example, 10), the Vector Unit reads data from noncontiguous addresses in adjacent repeats. See Noncontiguous Address Computation for more information.
- When repeat_times > 1 and *_rep_stride = 0, the first eight blocks are repeatedly read and computed in the vector operation.
- When repeat_times > 1 and 0 < *_rep_stride < 8, data of two adjacent repeat times is repeatedly read and computed in the vector operation. This scenario is generally not involved.
In conclusion, you need to set parameters based on the data access mode of the operator to obtain the correct computation result.
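As a sanity check on these rules, the following plain-Python sketch models which element indices a single-input instruction touches in each repeat under the contiguous mask mode. This is an illustrative model only, not the TIK API; it assumes float32 data and 32-byte blocks (8 elements per block).

```python
# Illustrative model (plain Python, NOT the TIK API) of vector addressing:
# which float32 element indices one single-input instruction reads per repeat,
# assuming 32-byte blocks (8 float32 elements each) in contiguous mask mode.
ELEMS_PER_BLOCK = 8  # float32 elements in one 32-byte block

def touched_indices(mask, repeat_times, rep_stride):
    """Return, per repeat, the element indices accessed (contiguous mask mode)."""
    out = []
    for r in range(repeat_times):
        start = r * rep_stride * ELEMS_PER_BLOCK  # stride counted start to start
        out.append(list(range(start, start + mask)))
    return out

# Contiguous case: rep_stride = 8, so adjacent repeats abut exactly.
print(touched_indices(64, 2, 8)[1][0])  # -> 64 (second repeat starts at element 64)
# Overlapping case: rep_stride = 0, so every repeat re-reads the first blocks.
print(touched_indices(64, 2, 0)[1][0])  # -> 0
```

With rep_stride > 8, the same function shows the gaps between repeats that the noncontiguous example below relies on.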
Contiguous Address Computation
from tbe import tik
tik_instance = tik.Tik()
data_input_gm = tik_instance.Tensor("float32", (256,), name="data_input_gm", scope=tik.scope_gm)
data_input_ub = tik_instance.Tensor("float32", (256,), name="data_input_ub", scope=tik.scope_ubuf)
tik_instance.data_move(data_input_ub, data_input_gm, 0, 1, 32, 0, 0)
# Call vec_abs to perform single-input operations on data_input_ub.
tik_instance.vec_abs(64, data_input_ub, data_input_ub, 256//64, 8, 8)
# Move the data out.
vec_abs computes the absolute value of the input element-wise. In this example, the absolute values of 256 float32 elements are computed.
- mask = 64: mask is a 128-bit parameter. If a bit is set to 0, the corresponding element is masked out of the vector computation; if a bit is set to 1, the corresponding element is computed. It can also be used to specify a run of contiguous valid elements. The Vector Unit is able to compute up to 256 bytes in parallel, which is 128 16-bit elements (mask <= 128) or 64 32-bit elements (mask <= 64). For example, mask = 16 indicates that the first 16 elements are computed. mask applies to the source operand in every repeat. To enable all the float32 elements, set mask to 64.
- repeat_times = 4: The Vector Unit is able to compute 64 float32 elements (256 bytes) in parallel. To finish the computation of 256 float32 elements, four repeats are needed.
- dst_rep_stride = 8: To implement contiguous address computation, set the start address stride (in blocks) between repeat times of the destination operand to the maximum that the Vector Unit can compute in parallel, which is 8 blocks (256/32).
- src_rep_stride = 8: the same as dst_rep_stride.
The following gives a computation diagram.

A block holds 8 float32 elements. To compute 256 float32 elements, 32 blocks are needed. The Vector Unit is able to compute up to 256 bytes, or eight blocks (as outlined in red), in parallel. It takes four repeats to compute all data. To implement contiguous address computation, rep_stride must be set to 8.
Note: In this example, the dst and src operands are the same tensor. Refer to the API reference for details.
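To make the addressing concrete, here is a host-side NumPy model of the vec_abs call above. This is a sketch of the addressing semantics, not TIK code, and the input values are assumed for illustration.

```python
import numpy as np

# NumPy model (NOT TIK) of the contiguous vec_abs call:
# mask = 64, repeat_times = 4, dst_rep_stride = src_rep_stride = 8.
data = np.arange(-128, 128, dtype=np.float32)  # assumed sample input, 256 elems
out = data.copy()                              # dst and src are the same tensor
for r in range(4):                             # repeat_times = 4
    s = r * 8 * 8                              # rep_stride (8 blocks) * 8 elems/block
    out[s:s + 64] = np.abs(out[s:s + 64])      # mask = 64 contiguous elements
assert np.array_equal(out, np.abs(data))       # all 256 elements are covered
```

Because the stride equals the eight-block parallel width, the four repeats tile the 256 elements exactly once with no gaps or overlaps.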
Noncontiguous Address Computation
from tbe import tik
tik_instance = tik.Tik()
data_input_gm = tik_instance.Tensor("float32", (256,), name="data_input_gm", scope=tik.scope_gm)
data_input_ub = tik_instance.Tensor("float32", (256,), name="data_input_ub", scope=tik.scope_ubuf)
data_output_ub = tik_instance.Tensor("float32", (272,), name="data_output_ub", scope=tik.scope_ubuf)
tik_instance.data_move(data_input_ub, data_input_gm, 0, 1, 32, 0, 0)
# Call vec_abs to perform single-input operations on data_input_ub.
tik_instance.vec_abs(32, data_output_ub, data_input_ub, 2, 18, 16)
# Move the data out.
vec_abs computes the absolute value of the input element-wise.
- mask = 32: It is a 128-bit parameter. The Vector Unit is able to compute 64 32-bit elements in parallel. mask = 32 indicates that the first 32 float32 elements, that is, four blocks, are computed in each repeat.
- repeat_times = 2: two repeats are required.
- dst_rep_stride = 18: start address stride (in blocks) between repeats of the destination operand. The Vector Unit is able to compute up to 256 bytes, or eight blocks, in parallel, so a stride of 18 blocks leaves a gap of 10 blocks between the eight-block windows of adjacent repeats, as shown in Figure 2.
- src_rep_stride = 16: start address stride (in blocks) between repeats of the source operand. The next repeat starts 16 blocks after the previous one, leaving a gap of 8 blocks after each eight-block window, as shown in Figure 2.
Assume that the eight float32 elements of each block in the source operand are identical. For example, the first block in the source operand holds eight elements valued –1. The blocks in each red box show the data involved in one repeat, and the gray parts indicate data that is not changed. D in the destination operand indicates a default value, that is, the original value stored in the tensor.
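The same kind of host-side NumPy model (again a sketch with assumed input values, not the TIK API) shows which destination elements the noncontiguous call writes and which keep their default value D:

```python
import numpy as np

# NumPy model (NOT TIK) of the noncontiguous vec_abs call:
# mask = 32, repeat_times = 2, src_rep_stride = 16, dst_rep_stride = 18.
src = -np.arange(1, 257, dtype=np.float32)     # assumed sample input, 256 elems
dst = np.full(272, np.nan, dtype=np.float32)   # NaN stands for "D" (default value)
for r in range(2):                             # repeat_times = 2
    s = r * 16 * 8                             # src start: 16 blocks * 8 elems/block
    d = r * 18 * 8                             # dst start: 18 blocks * 8 elems/block
    dst[d:d + 32] = np.abs(src[s:s + 32])      # mask = 32 elements per repeat
# dst[0:32] and dst[144:176] are written; everything in between keeps "D".
```

Tracing the indices this way makes it easy to predict which destination blocks a given *_rep_stride leaves untouched.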
Note: If there is an offset in the UB, refer to Contiguous Movement with Offset with the data_move API. The main concern is whether the AI Processor in use requires 32-byte alignment in the UB.
Exercise
What is the expected result of the following TIK operator?
from tbe import tik
tik_instance = tik.Tik()
data_input_gm_1 = tik_instance.Tensor("int32", (256,), name="data_input_gm_1", scope=tik.scope_gm)
data_input_ub_1 = tik_instance.Tensor("int32", (192,), name="data_input_ub_1", scope=tik.scope_ubuf)
data_input_gm_2 = tik_instance.Tensor("int32", (288,), name="data_input_gm_2", scope=tik.scope_gm)
data_input_ub_2 = tik_instance.Tensor("int32", (256,), name="data_input_ub_2", scope=tik.scope_ubuf)
data_output_ub = tik_instance.Tensor("int32", (256,), name="data_output_ub", scope=tik.scope_ubuf)
data_output_gm = tik_instance.Tensor("int32", (192,), name="data_output_gm", scope=tik.scope_gm)
tik_instance.vec_dup(64, data_input_ub_1, 0, 3, 8)
tik_instance.data_move(data_input_ub_1, data_input_gm_1, 0, 4, 4, 4, 2)
tik_instance.data_move(data_input_ub_2, data_input_gm_2[32], 0, 1, 32, 8, 8)
tik_instance.vec_add(64, data_output_ub, data_input_ub_1, data_input_ub_2, 3, 8, 8, 8)
tik_instance.data_move(data_output_gm, data_output_ub, 0, 1, 24, 8, 8)
Inputs (int32):
data_input_gm_1 = {1,2,3,...,256}
data_input_gm_2 = {1,2,3,...,288}
1. What are the first 40 element values transferred to data_output_gm?
2. What are the values of all 192 elements of data_output_gm?
[Key]
# Fill all 192 elements of data_input_ub_1 with 0.
tik_instance.vec_dup(64, data_input_ub_1, 0, 3, 8)
# The first 40 values of data_input_ub_1 are {1,2,3,4, ...,31,32,0,0,0,0,0,0,0,0}.
tik_instance.data_move(data_input_ub_1, data_input_gm_1, 0, 4, 4, 4, 2)
# The first 40 values of data_input_ub_2 are {33,34,35,36, ...,70,71,72}.
tik_instance.data_move(data_input_ub_2, data_input_gm_2[32], 0, 1, 32, 8, 8)
# Add 192 values contiguously. All elements are enabled by the mask.
tik_instance.vec_add(64, data_output_ub, data_input_ub_1, data_input_ub_2, 3, 8, 8, 8)
# Transfer all data contiguously. The first 40 values are {34,36,38,40, ...,92,94,96,65,66,67,68,69,70,71,72}.
tik_instance.data_move(data_output_gm, data_output_ub, 0, 1, 24, 8, 8)
# The values of the 192 elements are: {34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,146,148,150,152,154,156,158,160,162,164,166,168,170,172,174,176,178,180,182,184,186,188,190,192,194,196,198,200,202,204,206,208,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,258,260,262,264,266,268,270,272,274,276,278,280,282,284,286,288,290,292,294,296,298,300,302,304,306,308,310,312,314,316,318,320,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,370,372,374,376,378,380,382,384,386,388,390,392,394,396,398,400,402,404,406,408,410,412,414,416,418,420,422,424,426,428,430,432,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224}
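The key above can be double-checked with a host-side NumPy simulation. This is a sketch of the data-movement semantics described in this section, not TIK code; the burst/stride arithmetic in the comments follows the data_move parameters of the exercise.

```python
import numpy as np

# Host-side NumPy simulation (NOT TIK) of the exercise.
gm_1 = np.arange(1, 257, dtype=np.int32)
gm_2 = np.arange(1, 289, dtype=np.int32)
ub_1 = np.zeros(192, dtype=np.int32)          # vec_dup fills data_input_ub_1 with 0
# data_move(ub_1, gm_1, 0, 4, 4, 4, 2): 4 bursts of 4*32 B = 32 int32 each;
# src skips 4 units (32 elems) and dst skips 2 units (16 elems) between bursts.
for b in range(4):
    ub_1[b * 48 : b * 48 + 32] = gm_1[b * 64 : b * 64 + 32]
# data_move(ub_2, gm_2[32], 0, 1, 32, 8, 8): a single burst of 256 int32,
# so the strides do not apply.
ub_2 = gm_2[32 : 32 + 256].copy()
# vec_add(64, out, ub_1, ub_2, 3, 8, 8, 8): 3 repeats of 64 contiguous adds.
out = ub_1[:192] + ub_2[:192]
print(out[:4])     # [34 36 38 40]
print(out[32:36])  # [65 66 67 68]
```

Comparing `out` against the 192-element list in the key confirms the answer, including the runs of unchanged ub_2 values that land in the zero-filled gaps of ub_1.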

