vec_rsqrt_high_preci
Description
Computes the element-wise reciprocal of the square root: dst_i = 1 / sqrt(src_i).
This API provides higher precision than vec_rsqrt.
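Conceptually, each element of dst is the reciprocal of the square root of the corresponding element of src. A minimal NumPy sketch of the reference semantics (NumPy is used here only for illustration; it is not part of the TIK API):

```python
import numpy as np

def rsqrt_reference(src):
    """Reference semantics of vec_rsqrt_high_preci: 1 / sqrt(src), element-wise."""
    src = np.asarray(src, dtype=np.float32)
    return 1.0 / np.sqrt(src)

print(rsqrt_reference([4.0, 16.0, 0.25]))  # → [0.5  0.25 2.  ]
```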
Prototype
vec_rsqrt_high_preci(mask, dst, src, work_tensor, repeat_times, dst_rep_stride, src_rep_stride)
Parameters
For details, see Parameters. The following describes only the dst, src, and work_tensor parameters.
dst and src must have the same data type: Tensors of type float16/float32. work_tensor is a Tensor of type float32.
- If the source operand tensor has an offset, the following passing formats are supported: tensor[offset1:offset2] (starts at offset1 and ends at offset2) and tensor[offset1:] (starts at offset1). The format tensor[offset], which passes only one element, is not allowed: the tensor cannot be sliced this way and a runtime error will be reported.
- If the source operand tensor does not have an offset, the tensor can be passed directly.
work_tensor:
work_tensor is a user-defined temporary buffer space for storing the intermediate result. The space is limited to scope_ubuf and is used for internal computation only.
work_tensor space calculation:
- Calculate the minimum buffer space required for src computation based on repeat_times, mask, and src_rep_stride as follows: src_extent_size = (repeat_times - 1) * src_rep_stride * block_len + mask_len
When the source operand is of type float16, block_len is 16.
When the source operand is of type float32, block_len is 8.
In contiguous mask mode, mask_len is the mask value itself.
In bitwise mask mode, mask_len is the mask value corresponding to the most significant bit.
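In bitwise mask mode the mask is passed as two 64-bit values [mask_h, mask_l], and mask_len is the position of the most significant set bit in the combined 128-bit mask. A small helper illustrating this rule (the function name is illustrative, not part of the TIK API):

```python
def bitwise_mask_len(mask_h, mask_l):
    """Return mask_len: the 1-based position of the most significant set bit
    in the 128-bit mask formed by (mask_h << 64) | mask_l."""
    combined = (mask_h << 64) | mask_l
    if combined == 0:
        raise ValueError("mask must have at least one bit set")
    return combined.bit_length()

print(bitwise_mask_len(0, 2**64 - 1))  # → 64 (the mask used in Example 2 below)
```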
- Round up the minimum buffer space required for src computation to a multiple of 32 bytes: wk_size_unit = ((src_extent_size + block_len - 1) // block_len) * block_len
- Calculate the size of work_tensor as follows:
For the Atlas 200/300/500 Inference Product: when the source operand is of type float16, work_tensor size = 6 * wk_size_unit; when the source operand is of type float32, work_tensor size = 4 * wk_size_unit. For example, if src is of type float16, mask is 128, repeat_times is 2, and src_rep_stride is 8, then block_len is 16, mask_len is 128, and src_extent_size = (2 - 1) * 8 * 16 + 128 = 256. Rounding src_extent_size up to a multiple of 32 bytes gives wk_size_unit = 256, so the work_tensor size = 6 * 256 = 1536.
For the Atlas Training Series Product: when the source operand is of type float16, work_tensor size = 5 * wk_size_unit; when the source operand is of type float32, work_tensor size = 3 * wk_size_unit. For example, with the same parameters as above (src of type float16, mask 128, repeat_times 2, src_rep_stride 8), wk_size_unit = 256 and the work_tensor size = 5 * 256 = 1280.
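The sizing rules above can be collected into one helper. This is an illustrative sketch, not a TIK API: the function name is invented, and the multiplier argument encodes the per-product factors stated above (6/4 for the Atlas 200/300/500 Inference Product, 5/3 for the Atlas Training Series Product).

```python
def work_tensor_size(dtype, mask_len, repeat_times, src_rep_stride, multiplier):
    """Compute the required work_tensor element count.

    multiplier: 6 (float16) or 4 (float32) on the Atlas 200/300/500
    Inference Product; 5 (float16) or 3 (float32) on the Atlas
    Training Series Product.
    """
    block_len = 16 if dtype == "float16" else 8
    src_extent_size = (repeat_times - 1) * src_rep_stride * block_len + mask_len
    # Round up to a multiple of block_len elements (i.e., 32 bytes).
    wk_size_unit = ((src_extent_size + block_len - 1) // block_len) * block_len
    return multiplier * wk_size_unit

# The worked example from the text: float16, mask 128, repeat_times 2, stride 8.
print(work_tensor_size("float16", 128, 2, 8, 6))  # → 1536
print(work_tensor_size("float16", 128, 2, 8, 5))  # → 1280
```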
Returns
None
Applicability
Restrictions
- dst, src, and work_tensor must be declared in scope_ubuf.
- The space of the dst, src, and work_tensor tensors must not overlap.
- If any element of src is zero or negative, the result is undefined.
- The compute result using this API has higher accuracy than using the vec_rsqrt API.
- For other restrictions, see Restrictions.
Example 1
from tbe import tik
tik_instance = tik.Tik()
# Define the tensors.
dst_gm = tik_instance.Tensor("float16", (128,), name="dst_gm", scope=tik.scope_gm)
src_gm = tik_instance.Tensor("float16", (128,), name="src_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float16", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float16", (128,), name="dst_ub", scope=tik.scope_ubuf)
# Move the input data from the Global Memory to the Unified Buffer.
tik_instance.data_move(src_ub, src_gm, 0, 1, 128*2 // 32, 0, 0)
mask = 128
mask_len = mask # In contiguous mask mode, mask_len is the mask value itself.
repeat_times = 1
dst_rep_stride = 8
src_rep_stride = 8
block_len = 16 # src dtype is float16
src_extent_size = (repeat_times - 1)*src_rep_stride*block_len + mask_len
wk_size_unit = ((src_extent_size + block_len - 1) // block_len)*block_len
wk_size = 6*wk_size_unit # Obtain the size of work_tensor.
# Define work_tensor.
work_tensor = tik_instance.Tensor("float32", (wk_size,), name="work_tensor", scope=tik.scope_ubuf)
# When passing a tensor with an index offset, use the slice form with a colon (for example, work_tensor[0:]); passing a single element such as work_tensor[0] reports an error.
tik_instance.vec_rsqrt_high_preci(mask, dst_ub, src_ub, work_tensor[0:], repeat_times, dst_rep_stride, src_rep_stride)
# Move the compute result from the Unified Buffer to the destination Global Memory.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 128*2 // 32, 0, 0)
tik_instance.BuildCCE("test_vec_rsqrt_high_preci", inputs=[src_gm], outputs=[dst_gm])
Result example:
Input: src_gm= [6.996 1.381 5.996 7.902 ... 5.113 5.78 1.672 5.418 ] Output: dst_gm: [0.3782 0.851 0.4084 0.3557 ... 0.4421 0.416 0.7734 0.4297]
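The sample output can be sanity-checked against a NumPy reference (illustrative only; the input values below are the first four elements of the sample run above):

```python
import numpy as np

# First few input values from the sample run above.
src = np.array([6.996, 1.381, 5.996, 7.902], dtype=np.float16)
# Reference rsqrt computed in float32, then cast back to float16.
ref = (1.0 / np.sqrt(src.astype(np.float32))).astype(np.float16)
print(ref)  # close to [0.3782 0.851 0.4084 0.3557]
```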
Example 2
from tbe import tik
tik_instance = tik.Tik()
# Define the tensors.
dst_gm = tik_instance.Tensor("float32", (128,), name="dst_gm", scope=tik.scope_gm)
src_gm = tik_instance.Tensor("float32", (128,), name="src_gm", scope=tik.scope_gm)
src_ub = tik_instance.Tensor("float32", (128,), name="src_ub", scope=tik.scope_ubuf)
dst_ub = tik_instance.Tensor("float32", (128,), name="dst_ub", scope=tik.scope_ubuf)
# Move the input data from the Global Memory to the Unified Buffer.
tik_instance.data_move(src_ub, src_gm, 0, 1, 128*4 // 32, 0, 0)
mask = [0, 2**64 - 1]
mask_len = 64 # In bitwise mask mode, mask_len is the mask value corresponding to the most significant bit.
repeat_times = 2
dst_rep_stride = 8
src_rep_stride = 8
block_len = 8 # src dtype is float32
src_extent_size = (repeat_times - 1)*src_rep_stride*block_len + mask_len
wk_size_unit = ((src_extent_size + block_len - 1)//block_len)*block_len
wk_size = 4*wk_size_unit # Obtain the size of work_tensor.
# Define work_tensor.
work_tensor = tik_instance.Tensor("float32", (wk_size,), name="work_tensor", scope=tik.scope_ubuf)
# When passing a tensor with an index offset, use the slice form with a colon (for example, work_tensor[0:]); passing a single element such as work_tensor[0] reports an error.
tik_instance.vec_rsqrt_high_preci(mask, dst_ub, src_ub, work_tensor[0:], repeat_times, dst_rep_stride, src_rep_stride)
# Copy the compute result from the Unified Buffer to the destination Global Memory.
tik_instance.data_move(dst_gm, dst_ub, 0, 1, 128*4 // 32, 0, 0)
tik_instance.BuildCCE("test_vec_rsqrt_high_preci", inputs=[src_gm], outputs=[dst_gm])
Result example:
Input: src_gm= [5.349619, 0.4301902, 4.7152824, 9.539162, ..., 5.7243876, 4.4785686, 7.030495, 7.489954] Output: dst_gm: [0.43235308, 1.5246484, 0.46051747, 0.32377616, ..., 0.41796073, 0.47253108, 0.37714386, 0.36539316]