vec_reduce_add

Description

Adds all input elements. Elements are summed pairwise in binary-tree fashion.

Assume the source operand is 256 float16 elements [data0, data1, data2, ..., data255]. The computation completes in two repeats, as follows:

  1. [data0, data1, data2, ..., data127] is the source operand of the first repeat. result01 is obtained through the following steps:
    1. Add data0 and data1 to obtain data00, add data2 and data3 to obtain data01, ..., add data124 and data125 to obtain data62, and add data126 and data127 to obtain data63.
    2. Add data00 and data01 to obtain data000, add data02 and data03 to obtain data001, ..., and add data62 and data63 to obtain data031.
    3. This pairwise addition continues until a single value, result01, is obtained.
  2. [data128, data129, data130, ..., data255] is the source operand of the second repeat. result02 is obtained in the same way.
  3. Add result01 and result02 to obtain the destination operand [data], a single float16 value.
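
The pairwise order can be mirrored in plain Python. The following is an illustrative sketch (not TIK code; float16 rounding is ignored) of the binary-tree reduction described above:

    def tree_reduce_add(data):
        # Reduce one repeat by adding adjacent pairs level by level.
        level = list(data)
        while len(level) > 1:
            nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
            if len(level) % 2:          # carry an odd trailing element over
                nxt.append(level[-1])
            level = nxt
        return level[0]

    src = [1.0] * 256
    result01 = tree_reduce_add(src[:128])    # first repeat
    result02 = tree_reduce_add(src[128:])    # second repeat
    print(result01 + result02)               # 256.0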

Prototype

vec_reduce_add(mask, dst, src, work_tensor, repeat_times, src_rep_stride)

Parameters

Parameter | Input/Output | Description
--------- | ------------ | -----------
mask | Input | For details, see the description of the mask parameter in Table 1.
dst | Output | Start element of the destination tensor operand. The tensor must reside in the Unified Buffer.
src | Input | Start element of the source tensor operand. The tensor must reside in the Unified Buffer.
work_tensor | Input | A tensor that stores intermediate results during instruction execution. Pay attention to its required size; see the Restrictions section.
repeat_times | Input | Number of iterations (repeats).
src_rep_stride | Input | Stride, in blocks, between the corresponding blocks of the source operand in successive iterations.

dst, src, and work_tensor must be of the same data type:

  • Atlas 200/300/500 Inference Product: tensors of type float16/float32
  • Atlas Training Series Product: tensors of type float16/float32
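
Because src_rep_stride is measured in blocks of 32 bytes, the element offset between successive repeats depends on the data type. A quick sanity check in plain Python (illustrative only; BLOCK_BYTES and repeat_starts are hypothetical helper names):

    BLOCK_BYTES = 32

    def repeat_starts(src_rep_stride, dtype_bytes, repeat_times):
        # Element index at which each repeat begins reading the source.
        step = src_rep_stride * BLOCK_BYTES // dtype_bytes
        return [i * step for i in range(repeat_times)]

    # float16 (2 bytes each): a stride of 3 blocks puts 48 elements between starts.
    print(repeat_starts(3, 2, 6))   # [0, 48, 96, 144, 192, 240]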

Returns

None

Applicability

Atlas 200/300/500 Inference Product

Atlas Training Series Product

Restrictions

  • Space requirements for work_tensor: on both the Atlas 200/300/500 Inference Product and the Atlas Training Series Product, work_tensor requires at least repeat_times elements. For example, if repeat_times = 120, work_tensor must contain at least 120 elements.
  • repeat_times is within the range [1, 4095]. It must be a Scalar of type int32, an immediate of type int, or an Expr of type int32.
  • src_rep_stride is within the range [0, 65535]. It must be a Scalar of type int16/int32/int64/uint16/uint32/uint64, an immediate of type int, or an Expr of type int16/int32/int64/uint16/uint32/uint64.
  • If a pairwise addition overflows during the computation, there are two processing modes: return the defined maximum value, or return inf/nan. The mode is selected by the inf/nan control bit. In max-value mode, a float16 sum greater than 65504 is clamped to 65504. For example, with the source operand [60000, 60000, –30000, 100]: 60000 + 60000 > 65504, so the first pair overflows and the maximum value 65504 is used as its result; the second pair gives –30000 + 100 = –29900; the final result is then 65504 – 29900 = 35604. See the sketch after this list.
  • src, dst, and work_tensor must not overlap with each other.
  • For details about the alignment requirements of the operand address offset, see General Restrictions.
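
The max-value overflow mode can be modeled in plain Python. The sketch below is illustrative only: it clamps out-of-range sums to the float16 maximum and ignores float16 rounding of intermediate values, reproducing the arithmetic of the overflow example above:

    FP16_MAX = 65504.0

    def sat_add(a, b):
        # Max-value mode: clamp to +/-65504 instead of returning inf.
        s = float(a) + float(b)
        return max(-FP16_MAX, min(FP16_MAX, s))

    pair0 = sat_add(60000, 60000)   # 120000 overflows -> clamped to 65504
    pair1 = sat_add(-30000, 100)    # -29900, no overflow
    print(sat_add(pair0, pair1))    # 65504 - 29900 = 35604.0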

Example

  • Example 1
    from tbe import tik
    tik_instance = tik.Tik()
    src_gm = tik_instance.Tensor("float16", (256,), name="src_gm", scope=tik.scope_gm)
    dst_gm = tik_instance.Tensor("float16", (32,), name="dst_gm", scope=tik.scope_gm)
    src_ub = tik_instance.Tensor("float16", (256,), name="src_ub", scope=tik.scope_ubuf)
    dst_ub = tik_instance.Tensor("float16", (32,), name="dst_ub", scope=tik.scope_ubuf)
    work_tensor_ub = tik_instance.Tensor("float16", (32,), name="work_tensor_ub", scope=tik.scope_ubuf)
    # Copy the user input to the source Unified Buffer.
    tik_instance.data_move(src_ub, src_gm, 0, 1, 16, 0, 0)
    # Initialize the destination Unified Buffer to 0 so that the output change is easy to observe.
    tik_instance.vec_dup(32, dst_ub, 0, 1, 1)
    tik_instance.vec_reduce_add(128, dst_ub, src_ub, work_tensor_ub, 2, 8)
    # Copy the compute result to the destination Global Memory.
    tik_instance.data_move(dst_gm, dst_ub, 0, 1, 2, 0, 0)
    
    tik_instance.BuildCCE(kernel_name="vec_reduce_add", inputs=[src_gm], outputs=[dst_gm])

    Result example:

    Input:
    src_gm=[1,1,1,...,1]
    Output:
    dst_gm=[256,0,0,...,0]
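
    The expected result can be cross-checked with NumPy (an illustrative check, not TIK code): the sum of 256 float16 ones is 256, which lands in dst element 0 while the remaining elements keep their vec_dup initial value of 0.

    import numpy as np
    src = np.ones(256, dtype=np.float16)
    print(src.sum(dtype=np.float32))   # 256.0, the value written to dst element 0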
  • Example 2
    from tbe import tik
    tik_instance = tik.Tik()
    dtype_size = {
        "int8": 1,
        "uint8": 1,
        "int16": 2,
        "uint16": 2,
        "float16": 2,
        "int32": 4,
        "uint32": 4,
        "float32": 4,
        "int64": 8,
    }
    # Tensor shape
    src_shape = (3, 128)
    dst_shape = (64,)
    # Data volume
    src_elements = 3 * 128
    dst_elements = 64
    # Data type
    dtype = "float16"
    
    src_gm = tik_instance.Tensor(dtype, src_shape, name="src_gm", scope=tik.scope_gm)
    dst_gm = tik_instance.Tensor(dtype, dst_shape, name="dst_gm", scope=tik.scope_gm)
    src_ub = tik_instance.Tensor(dtype, src_shape, name="src_ub", scope=tik.scope_ubuf)
    dst_ub = tik_instance.Tensor(dtype, dst_shape, name="dst_ub", scope=tik.scope_ubuf)
    work_tensor_ub = tik_instance.Tensor(dtype, dst_shape, name="work_tensor_ub", scope=tik.scope_ubuf)
    # Copy the user input to the source Unified Buffer.
    # Number of moved segments.
    nburst = 1
    # Length of each moved segment, in units of 32 bytes (one block).
    burst = src_elements * dtype_size[dtype] // 32 // nburst
    dst_burst = dst_elements * dtype_size[dtype] // 32 // nburst
    # Stride between the tail of one burst and the head of the next, in units of 32 bytes.
    dst_stride, src_stride = 0, 0
    tik_instance.data_move(src_ub, src_gm, 0, nburst, burst, dst_stride, src_stride)
    # Assign the initial value 0 to dst ubuf. For details about vec_dup, see the corresponding section.
    tik_instance.vec_dup(64, dst_ub, 0, 1, 1)
    tik_instance.vec_dup(64, work_tensor_ub, 0, 1, 1)
    # Number of source elements processed per iteration. The valid range depends on the data type; see the corresponding section. Here, 34 elements are processed per iteration.
    mask = 34
    # Configure iterations based on your actual requirements. Here, six iterations are used as an example.
    repeat_times = 6
    # Stride, in blocks, between the operand start addresses of adjacent iterations. Here, each iteration starts three blocks (48 float16 elements) after the previous one.
    src_rep_stride = 3
    tik_instance.vec_reduce_add(mask, dst_ub, src_ub, work_tensor_ub, repeat_times, src_rep_stride)
    # In this example, work_tensor_ub holds [34. 34. 36. 68. 68. 86. 0. 0. ...]: the partial sums of the six iterations.
    # Copy the compute result to the destination Global Memory.
    tik_instance.data_move(dst_gm, dst_ub, 0, nburst, dst_burst, dst_stride, src_stride)
    
    tik_instance.BuildCCE(kernel_name="vec_reduce_add", inputs=[src_gm], outputs=[dst_gm])
    
    Result example:
    Input (src_gm):
    [[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
      1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
      1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
      1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
      1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
      1. 1. 1. 1. 1. 1. 1. 1.]
     [2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
      2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
      2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
      2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
      2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
      2. 2. 2. 2. 2. 2. 2. 2.]
     [3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
      3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
      3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
      3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
      3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
      3. 3. 3. 3. 3. 3. 3. 3.]]
    
    Output (dst_gm):
    [326.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
       0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
       0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
       0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
       0.   0.   0.   0.   0.   0.   0.   0.]
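
    The partial sums in work_tensor_ub and the final value in dst can be cross-checked with NumPy (illustrative only; it reproduces the mask/stride addressing used above):

    import numpy as np

    src = np.concatenate([np.full(128, v, dtype=np.float16) for v in (1, 2, 3)])
    mask, repeat_times, src_rep_stride = 34, 6, 3
    step = src_rep_stride * 32 // 2    # 3 blocks * 32 bytes / 2 bytes per float16 = 48

    partials = [float(src[i * step : i * step + mask].sum(dtype=np.float32))
                for i in range(repeat_times)]
    print(partials)        # [34.0, 34.0, 36.0, 68.0, 68.0, 86.0]
    print(sum(partials))   # 326.0, the value stored in dst element 0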