vec_rec_high_preci

Description

Computes the element-wise reciprocal of the source operand. This API has higher precision than vec_rec.

Prototype

vec_rec_high_preci(mask, dst, src, work_tensor, repeat_times, dst_rep_stride, src_rep_stride)

Parameters

For details, see Parameters. The following describes only the dst, src, and work_tensor parameters.

dst and src are tensors of type float16 or float32 and must have the same data type. work_tensor is a tensor of type float32.

  • If the source operand tensor has an offset, the supported passing formats are as follows: tensor[offset1:offset2] starts from offset1 and ends at offset2; tensor[offset1:] starts from offset1 and runs to the end. The tensor[offset] format passes only a single element; in this case the tensor cannot be sliced and a runtime error is reported, so this format is not allowed.
  • If the source operand tensor does not have an offset, the tensor can be passed directly.
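The passing formats above can be sketched in plain Python, using a list in place of a TIK tensor (illustration only; the variable names are not part of the API):

```python
# Plain-Python analogy for the source-operand passing formats.
tensor = list(range(128))

# tensor[offset1:offset2]: starts from offset1 and ends at offset2.
part = tensor[16:32]

# tensor[offset1:]: starts from offset1 and runs to the end.
tail = tensor[64:]

# tensor[offset]: passes a single element rather than a slice.
# For vec_rec_high_preci this format is not allowed, because the
# tensor cannot be sliced this way and a runtime error is reported.
single = tensor[8]
```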

work_tensor:

work_tensor is a user-defined temporary buffer for storing intermediate results. It must be allocated in scope_ubuf and is used only for internal computation.

work_tensor space calculation:

  1. Calculate the minimum buffer space required for src computation based on repeat_times, mask, and src_rep_stride as follows: src_extent_size = (repeat_times - 1) * src_rep_stride * block_len + mask_len

    When the source operand is of type float16, block_len is 16.

    When the source operand is of type float32, block_len is 8.

    In contiguous mask mode, mask_len is the mask value itself.

    In bitwise mask mode, mask_len is the position (1-based) of the most significant set bit of the mask value.

  2. Round up the minimum buffer space required for src computation to a multiple of 32 bytes: wk_size_unit = ((src_extent_size + block_len - 1)//block_len) * block_len
  3. Calculate the size of work_tensor as follows:

    When the source operand is of type float16, the size of work_tensor is 4 * wk_size_unit.

    When the source operand is of type float32, the size of work_tensor is 2 * wk_size_unit.
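The mask_len value used in step 1 depends on the mask mode. A plain-Python sketch (the helper name mask_len_of is illustrative, not part of the TIK API):

```python
def mask_len_of(mask):
    """Return mask_len for the src_extent_size formula.

    Contiguous mask mode: mask is an integer, and mask_len is the mask
    value itself. Bitwise mask mode: mask is a [mask_h, mask_l] pair,
    and mask_len is the 1-based position of the most significant set
    bit of the combined 128-bit value.
    """
    if isinstance(mask, int):
        return mask  # contiguous mask mode
    mask_h, mask_l = mask
    return ((mask_h << 64) | mask_l).bit_length()  # bitwise mask mode
```

For example, a contiguous mask of 128 gives mask_len 128, while the bitwise mask [0, 2**64 - 1] sets bits 1 through 64 and gives mask_len 64.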

Example of work_tensor space calculation:

  1. If src is of type float16, mask is 128, repeat_times is 2, and src_rep_stride is 8, then block_len is 16, mask_len is 128, and src_extent_size = (2 - 1) * 8 * 16 + 128 = 256. Round up src_extent_size to a multiple of 32 bytes: wk_size_unit = ((256 + 16 - 1)//16) * 16 = 256. Therefore, the size of work_tensor is 4 * 256 = 1024.
  2. If src is of type float32, mask is 64, repeat_times is 2, and src_rep_stride is 8, then block_len is 8, mask_len is 64, and src_extent_size = (2 - 1) * 8 * 8 + 64 = 128. Round up src_extent_size to a multiple of 32 bytes: wk_size_unit = ((128 + 8 - 1)//8) * 8 = 128. Therefore, the size of work_tensor is 2 * 128 = 256.
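The three-step sizing rule can be written as a small helper that reproduces the two worked examples above (the function name vec_rec_work_size is illustrative, not part of the TIK API):

```python
def vec_rec_work_size(dtype, mask_len, repeat_times, src_rep_stride):
    """Sketch of the work_tensor sizing rule for vec_rec_high_preci."""
    block_len = 16 if dtype == "float16" else 8
    # Step 1: minimum src extent, in elements.
    src_extent_size = (repeat_times - 1) * src_rep_stride * block_len + mask_len
    # Step 2: round up to a multiple of 32 bytes (block_len elements).
    wk_size_unit = ((src_extent_size + block_len - 1) // block_len) * block_len
    # Step 3: float16 needs 4x the rounded unit; float32 needs 2x.
    factor = 4 if dtype == "float16" else 2
    return factor * wk_size_unit

print(vec_rec_work_size("float16", 128, 2, 8))  # 1024, as in example 1
print(vec_rec_work_size("float32", 64, 2, 8))   # 256, as in example 2
```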

Restrictions

  • dst, src, and work_tensor must be declared in scope_ubuf.
  • The space of the dst, src, and work_tensor tensors must not overlap.
  • If any input value is 0, the result is undefined.
  • For the Atlas 200/300/500 Inference Product, the compute result using this API with float16 or float32 input has higher precision than using the vec_rec API.
  • For the Atlas Training Series Product, the compute result using this API with float16 or float32 input has higher precision than using the vec_rec API.
  • For other restrictions, see Restrictions.

Returns

None

Example 1

This example processes a small amount of data that can be moved in one shot, to help you understand the API functions. For more complex samples with a large amount of data, see Example.

from tbe import tik
# Define a container.
tik_instance = tik.Tik()
# Define the tensors.
src_gm = tik_instance.Tensor("float32", (128,), name="src_gm", scope=tik.scope_gm)
dst_gm = tik_instance.Tensor("float32", (128,), name="dst_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float32", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float32", (128,), name="dst_ub", scope=tik.scope_ubuf)
# Move data from the Global Memory to the Unified Buffer.
tik_instance.data_move(src_ub, src_gm, 0, 1, 128*4 // 32, 0, 0)
# Calculate the size of work_tensor.
mask = [0, 2**64 - 1]  # bitwise mask mode: all 64 low bits set
mask_len = 64  # most significant set bit of the mask is bit 64
repeat_times = 2
dst_rep_stride = 8
src_rep_stride = 8
block_len = 8  # src dtype is float32
src_extent_size = (repeat_times - 1)*src_rep_stride*block_len + mask_len
wk_size_unit = ((src_extent_size + block_len - 1)//block_len)*block_len
wk_size = 2*wk_size_unit
# Define work_tensor.
work_tensor_ub = tik_instance.Tensor("float32", (wk_size,), name="work_tensor_ub", scope=tik.scope_ubuf)
# If work_tensor has an offset, pass it in the work_tensor[offset:] format.
tik_instance.vec_rec_high_preci(mask_len, dst_ub, src_ub, work_tensor_ub[0:], repeat_times, dst_rep_stride, src_rep_stride)
# Move data from the Unified Buffer to the Global Memory.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 128*4 // 32, 0, 0)
tik_instance.BuildCCE(kernel_name="test_vec_rec_high_preci", inputs=[src_gm], outputs=[dst_gm])

Result example:

Input:
[-6.9427586 -3.5300326 1.176882 ... -6.196793 9.0379095]
Output:
[-0.14403497 -0.2832835 0.8497028 ... -0.16137381 0.11064506]

Example 2

This example processes a small amount of data that can be moved in one shot, to help you understand the API functions. For more complex samples with a large amount of data, see Example.

from tbe import tik
# Define a container.
tik_instance = tik.Tik()
# Define the tensors.
src_gm = tik_instance.Tensor("float16", (128,), name="src_gm", scope=tik.scope_gm)
dst_gm = tik_instance.Tensor("float16", (128,), name="dst_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
# Move data from the Global Memory to the Unified Buffer.
tik_instance.data_move(src_ub, src_gm, 0, 1, 128*2 // 32, 0, 0)
# Calculate the size of work_tensor.
mask = 128
mask_len = mask
repeat_times = 1
dst_rep_stride = 8
src_rep_stride = 8
block_len = 16  # src dtype is float16
src_extent_size = (repeat_times - 1)*src_rep_stride*block_len + mask_len
wk_size_unit = ((src_extent_size + block_len - 1)//block_len)*block_len
wk_size = 4*wk_size_unit
# Define work_tensor.
work_tensor_ub = tik_instance.Tensor("float32", (wk_size,), name="work_tensor_ub", scope=tik.scope_ubuf)
# If work_tensor has an offset, pass it in the work_tensor[offset:] format.
tik_instance.vec_rec_high_preci(mask_len, dst_ub, src_ub, work_tensor_ub[0:], repeat_times, dst_rep_stride, src_rep_stride)
# Move data from the Unified Buffer to the Global Memory.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 128*2 // 32, 0, 0)
tik_instance.BuildCCE(kernel_name="test_vec_rec_high_preci", inputs=[src_gm], outputs=[dst_gm])

Result example:

Input:
[-7.08 -4.434 1.294 ... 8.82 -2.854]
Output:
[-0.1412 -0.2256 0.773 ... 0.1134 -0.3503]