vec_rsqrt_high_preci
Description
Computes the element-wise reciprocal of the square root: dst_i = 1 / sqrt(src_i).
This API provides higher precision than vec_rsqrt.
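Conceptually, each element of dst is the reciprocal of the square root of the corresponding element of src. A minimal NumPy sketch of the reference semantics (NumPy is used here only for illustration; it is not part of the TIK API):

```python
import numpy as np

def rsqrt_reference(src):
    """Reference semantics of vec_rsqrt_high_preci: 1 / sqrt(src), element-wise."""
    src = np.asarray(src, dtype=np.float32)
    return 1.0 / np.sqrt(src)

print(rsqrt_reference([4.0, 16.0, 0.25]))  # → [0.5  0.25 2.  ]
```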
Prototype
vec_rsqrt_high_preci(mask, dst, src, work_tensor, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
For details, see Parameters. The following describes only the dst, src, and work_tensor parameters.
dst and src must have the same data type: Tensors of type float16/float32. work_tensor is a Tensor of type float32.
- If the source operand tensor has an offset, the following passing formats are supported: tensor[offset1:offset2] (starts at offset1 and ends at offset2) and tensor[offset1:] (starts at offset1). The format tensor[offset], which passes only one element, is not allowed: the tensor cannot be sliced this way and a runtime error will be reported.
- If the source operand tensor does not have an offset, the tensor can be passed directly.
work_tensor:
work_tensor is a user-defined temporary buffer space for storing the intermediate result. The space is limited to scope_ubuf and is used for internal computation only.
work_tensor space calculation:
- Calculate the minimum buffer space required for src computation based on repeat_times, mask, and src_rep_stride as follows: src_extent_size = (repeat_times - 1) * src_rep_stride * block_len + mask_len
When the source operand is of type float16, block_len is 16.
When the source operand is of type float32, block_len is 8.
In contiguous mask mode, mask_len is the mask value itself.
In bitwise mask mode, mask_len is the mask value corresponding to the most significant bit.
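In bitwise mask mode the mask is passed as two 64-bit values [mask_h, mask_l], and mask_len is the position of the most significant set bit in the combined 128-bit mask. A small helper illustrating this rule (the function name is illustrative, not part of the TIK API):

```python
def bitwise_mask_len(mask_h, mask_l):
    """Return mask_len: the 1-based position of the most significant set bit
    in the 128-bit mask formed by (mask_h << 64) | mask_l."""
    combined = (mask_h << 64) | mask_l
    if combined == 0:
        raise ValueError("mask must have at least one bit set")
    return combined.bit_length()

print(bitwise_mask_len(0, 2**64 - 1))  # → 64 (the mask used in Example 2 below)
```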
- Round up the minimum buffer space required for src computation to a multiple of 32 bytes: wk_size_unit = ((src_extent_size + block_len - 1) // block_len) * block_len
- Calculate the size of work_tensor as follows:
For the Atlas 200/300/500 Inference Product: when the source operand is of type float16, work_tensor size = 6 * wk_size_unit; when the source operand is of type float32, work_tensor size = 4 * wk_size_unit. For example, if src is of type float16, mask is 128, repeat_times is 2, and src_rep_stride is 8, then block_len is 16, mask_len is 128, and src_extent_size = (2 - 1) * 8 * 16 + 128 = 256. Rounding src_extent_size up to a multiple of 32 bytes gives wk_size_unit = 256, so the work_tensor size = 6 * 256 = 1536.
For the Atlas Training Series Product: when the source operand is of type float16, work_tensor size = 5 * wk_size_unit; when the source operand is of type float32, work_tensor size = 3 * wk_size_unit. For example, with the same parameters as above (src of type float16, mask 128, repeat_times 2, src_rep_stride 8), wk_size_unit = 256 and the work_tensor size = 5 * 256 = 1280.
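The sizing rules above can be collected into one helper. This is an illustrative sketch, not a TIK API: the function name is invented, and the multiplier argument encodes the per-product factors stated above (6/4 for the Atlas 200/300/500 Inference Product, 5/3 for the Atlas Training Series Product).

```python
def work_tensor_size(dtype, mask_len, repeat_times, src_rep_stride, multiplier):
    """Compute the required work_tensor element count.

    multiplier: 6 (float16) or 4 (float32) on the Atlas 200/300/500
    Inference Product; 5 (float16) or 3 (float32) on the Atlas
    Training Series Product.
    """
    block_len = 16 if dtype == "float16" else 8
    src_extent_size = (repeat_times - 1) * src_rep_stride * block_len + mask_len
    # Round up to a multiple of block_len elements (i.e., 32 bytes).
    wk_size_unit = ((src_extent_size + block_len - 1) // block_len) * block_len
    return multiplier * wk_size_unit

# The worked example from the text: float16, mask 128, repeat_times 2, stride 8.
print(work_tensor_size("float16", 128, 2, 8, 6))  # → 1536
print(work_tensor_size("float16", 128, 2, 8, 5))  # → 1280
```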
Returns
None
Applicability
Restrictions
- dst, src, and work_tensor must be declared in scope_ubuf.
- The space of the dst, src, and work_tensor tensors must not overlap.
- If any element of src is zero or negative, the result is undefined.
- The compute result using this API has higher accuracy than using the vec_rsqrt API.
- For other restrictions, see Restrictions.
Example 1
from tbe import tik
tik_instance = tik.Tik()
# Define the tensors.
dst_gm = tik_instance.Tensor("float16", (128,), name="dst_gm", scope=tik.scope_gm)
src_gm = tik_instance.Tensor("float16", (128,), name="src_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
# Move the input data from the Global Memory to the Unified Buffer.
tik_instance.data_move(src_ub, src_gm, 0, 1, 128*2 // 32, 0, 0)
mask = 128
mask_len = mask # In contiguous mask mode, mask_len is the mask value itself.
repeat_times = 1
dst_rep_stride = 8
src_rep_stride = 8
block_len = 16 # src dtype is float16
src_extent_size = (repeat_times - 1)*src_rep_stride*block_len + mask_len
wk_size_unit = ((src_extent_size + block_len - 1) // block_len)*block_len
wk_size = 6*wk_size_unit # Obtain the size of work_tensor.
# Define work_tensor.
work_tensor = tik_instance.Tensor("float32", (wk_size,), name="work_tensor", scope=tik.scope_ubuf)
# When passing a tensor with an index offset, use the slice form with a colon (for example, work_tensor[0:]); passing a single element such as work_tensor[0] reports an error.
tik_instance.vec_rsqrt_high_preci(mask, dst_ub, src_ub, work_tensor[0:], repeat_times, dst_rep_stride, src_rep_stride)
# Move the compute result from the Unified Buffer to the destination Global Memory.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 128*2 // 32, 0, 0)
tik_instance.BuildCCE("test_vec_rsqrt_high_preci", inputs=[src_gm], outputs=[dst_gm])
Result example:
Input: src_gm= [6.996 1.381 5.996 7.902 ... 5.113 5.78 1.672 5.418 ] Output: dst_gm: [0.3782 0.851 0.4084 0.3557 ... 0.4421 0.416 0.7734 0.4297]
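The sample output can be sanity-checked against a NumPy reference (illustrative only; the input values below are the first four elements of the sample run above):

```python
import numpy as np

# First few input values from the sample run above.
src = np.array([6.996, 1.381, 5.996, 7.902], dtype=np.float16)
# Reference rsqrt computed in float32, then cast back to float16.
ref = (1.0 / np.sqrt(src.astype(np.float32))).astype(np.float16)
print(ref)  # close to [0.3782 0.851 0.4084 0.3557]
```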
Example 2
from tbe import tik
tik_instance = tik.Tik()
# Define the tensors.
dst_gm = tik_instance.Tensor("float32", (128,), name="dst_gm", scope=tik.scope_gm)
src_gm = tik_instance.Tensor("float32", (128,), name="src_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float32", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float32", (128,), name="dst_ub", scope=tik.scope_ubuf)
# Move the input data from the Global Memory to the Unified Buffer.
tik_instance.data_move(src_ub, src_gm, 0, 1, 128*4 // 32, 0, 0)
mask = [0, 2**64 - 1]
mask_len = 64 # In bitwise mask mode, mask_len is the mask value corresponding to the most significant bit.
repeat_times = 2
dst_rep_stride = 8
src_rep_stride = 8
block_len = 8 # src dtype is float32
src_extent_size = (repeat_times - 1)*src_rep_stride*block_len + mask_len
wk_size_unit = ((src_extent_size + block_len - 1)//block_len)*block_len
wk_size = 4*wk_size_unit # Obtain the size of work_tensor.
# Define work_tensor.
work_tensor = tik_instance.Tensor("float32", (wk_size,), name="work_tensor", scope=tik.scope_ubuf)
# When passing a tensor with an index offset, use the slice form with a colon (for example, work_tensor[0:]); passing a single element such as work_tensor[0] reports an error.
tik_instance.vec_rsqrt_high_preci(mask, dst_ub, src_ub, work_tensor[0:], repeat_times, dst_rep_stride, src_rep_stride)
# Copy the compute result from the Unified Buffer to the destination Global Memory.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 128*4 // 32, 0, 0)
tik_instance.BuildCCE("test_vec_rsqrt_high_preci", inputs=[src_gm], outputs=[dst_gm])
Result example:
Input: src_gm= [5.349619, 0.4301902, 4.7152824, 9.539162, ..., 5.7243876, 4.4785686, 7.030495, 7.489954] Output: dst_gm: [0.43235308, 1.5246484, 0.46051747, 0.32377616, ..., 0.41796073, 0.47253108, 0.37714386, 0.36539316]