vec_ln_high_preci

Description

Computes the natural logarithm element-wise: dst_i = ln(src_i). This API offers higher precision than vec_ln.

Prototype

vec_ln_high_preci(mask, dst, src, work_tensor, repeat_times, dst_rep_stride, src_rep_stride)

Parameters

For details, see Parameters. The following describes only the dst, src, and work_tensor parameters.

dst, src, and work_tensor are Tensors of type float16.

  • If the source operand tensor has an offset, the supported passing formats are as follows: tensor[offset1:offset2] means starting from offset1 and ending at offset2. tensor[offset1:] means starting from offset1. tensor[offset] means that only one element is passed. (In this case, the tensor cannot be sliced and a runtime error is reported. Therefore, this format is not allowed.)
  • If the source operand tensor does not have an offset, the tensor can be passed directly.

work_tensor:

work_tensor is a user-defined temporary buffer space for storing the intermediate result. The space is limited to scope_ubuf and is used for internal computation only.

work_tensor space calculation:

  1. Calculate the minimum buffer space (in elements) required for src computation based on repeat_times, mask, and src_rep_stride as follows: src_extent_size = (repeat_times - 1) * src_rep_stride * 16 + mask_len

    In contiguous mask mode, mask_len is the mask value itself. In bitwise mask mode, mask_len is the position of the most significant set bit of the mask (that is, the index of the highest element selected).

  2. Round the minimum space required for src computation up to a multiple of 32 bytes (16 float16 elements): wk_size_unit = (src_extent_size + 15)//16 * 16
  3. Calculate the required work_tensor size (in elements) as follows: work_tensor size = 10 * wk_size_unit

Example of work_tensor space calculation:

  1. If mask = 128, rep_times = 2, and src_rep_stride = 8, then mask_len = 128, src_extent_size = (2 - 1) * 8 * 16 + mask_len = 256, and wk_size_unit = (src_extent_size + 15)//16 * 16 = 256. Therefore, work_tensor = 10 * wk_size_unit = 2560.
  2. If mask = [3, 2**64-1], rep_times = 2, and src_rep_stride = 8, then mask_len = 66: the high 64-bit segment of the mask is 3 (binary 11), whose most significant set bit is at position 2, so mask_len = 64 + 2 = 66. Then src_extent_size = (2 - 1) * 8 * 16 + mask_len = 194 and wk_size_unit = (src_extent_size + 15)//16 * 16 = 208. Therefore, work_tensor = 10 * wk_size_unit = 2080.
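The two worked examples above can be reproduced with a small helper in plain Python (the names mask_len_of and work_tensor_size are illustrative, not part of the TIK API):

```python
def mask_len_of(mask):
    """Effective mask length: the value itself in contiguous mode,
    or the position of the most significant set bit in bitwise mode."""
    if isinstance(mask, list):  # bitwise mode: [high 64 bits, low 64 bits]
        high, low = mask
        return 64 + high.bit_length() if high else low.bit_length()
    return mask  # contiguous mode

def work_tensor_size(mask, repeat_times, src_rep_stride):
    """Minimum work_tensor size in float16 elements for vec_ln_high_preci."""
    src_extent_size = (repeat_times - 1) * src_rep_stride * 16 + mask_len_of(mask)
    wk_size_unit = (src_extent_size + 15) // 16 * 16  # round up to 32 bytes
    return 10 * wk_size_unit

print(work_tensor_size(128, 2, 8))             # 2560 (contiguous-mask example)
print(work_tensor_size([3, 2**64 - 1], 2, 8))  # 2080 (bitwise-mask example)
```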

Returns

None

Applicability

Atlas 200/300/500 Inference Product

Atlas Training Series Product

Restrictions

  • dst, src, and work_tensor must be declared in scope_ubuf.
  • The space of the dst, src, and work_tensor tensors must not overlap.
  • If any element of src is not positive, the result is undefined.
  • For the Atlas 200/300/500 Inference Product, the compute result of this API has higher accuracy in certain value ranges than that of vec_ln.
  • For the Atlas Training Series Product, this API has the same effect as vec_ln.
  • For other restrictions, see Restrictions.

Example

  • In contiguous mask mode
    from tbe import tik
    tik_instance = tik.Tik()
    # Define the tensors.
    src_gm = tik_instance.Tensor("float16", (2, 128), tik.scope_gm, "src_gm")
    src_ub = tik_instance.Tensor("float16", (2, 128), tik.scope_ubuf, "src_ub")
    dst_ub = tik_instance.Tensor("float16", (2, 128), tik.scope_ubuf, "dst_ub")
    # The size of work_tensor is 10 times that of src.
    work_tensor = tik_instance.Tensor("float16", (10, 2, 128), tik.scope_ubuf, "work_tensor")
    dst_gm = tik_instance.Tensor("float16", (2, 128), tik.scope_gm, "dst_gm")
    # Move the input data from the Global Memory to the Unified Buffer.
    tik_instance.data_move(src_ub, src_gm, 0, 1, 16, 0, 0)
    tik_instance.vec_dup(128, dst_ub, 0.0, 2, 8)
    mask = 128
    rep_times = 2
    src_rep_stride = 8
    dst_rep_stride = 8
    # If work_tensor is passed with an offset, use the work_tensor[offset:] format.
    tik_instance.vec_ln_high_preci(mask, dst_ub, src_ub, work_tensor[0:], rep_times, dst_rep_stride, src_rep_stride)
    # Move the compute result from the Unified Buffer to the Global Memory.
    tik_instance.data_move(dst_gm, dst_ub, 0, 1, 16, 0, 0)
    tik_instance.BuildCCE(kernel_name="vln_rep_8_src_rep_8", inputs=[src_gm], outputs=[dst_gm])

    Result example:

    Input:
    [
      [1, 2, 3, 4, ......, 128],
      [1, 2, 3, 4, ......, 128]
    ]
    
    Output:
    [
      [0, 0.6931, 1.0986, 1.3863, ......, 4.8520], 
      [0, 0.6931, 1.0986, 1.3863, ......, 4.8520]
    ]
  • In bitwise mask mode
    from tbe import tik
    tik_instance = tik.Tik()
    kernel_name = "vln_rep_8_src_rep_8"
    # Define the tensors.
    src_gm = tik_instance.Tensor("float16", (2, 128), tik.scope_gm, "src_gm")
    src_ub = tik_instance.Tensor("float16", (2, 128), tik.scope_ubuf, "src_ub")
    dst_ub = tik_instance.Tensor("float16", (2, 128), tik.scope_ubuf, "dst_ub")
    # Calculate the size of work_tensor.
    rep_times = 2
    src_rep_stride = 8
    dst_rep_stride = 8
    mask = [3, 2**64-1]
    mask_len = 66  # 64 (low segment fully set) + 2 (high segment is 0b11)
    src_extent_size = (rep_times - 1)*src_rep_stride*16 + mask_len
    wk_size_unit = (src_extent_size + 15)//16*16
    wk_size = 10*wk_size_unit
    work_tensor = tik_instance.Tensor("float16", (wk_size, ), tik.scope_ubuf, "work_tensor")
    dst_gm = tik_instance.Tensor("float16", (2, 128,), tik.scope_gm, "dst_gm")
    # Copy the user input to the source Unified Buffer.
    tik_instance.data_move(src_ub, src_gm, 0, 1, 16, 0, 0)
    # Initialize the destination Unified Buffer.
    tik_instance.vec_dup(128, dst_ub, 0.0, 2, 8)
    tik_instance.vec_ln_high_preci(mask, dst_ub, src_ub, work_tensor, rep_times, dst_rep_stride, src_rep_stride)
    # Copy the compute result to the destination Global Memory.
    tik_instance.data_move(dst_gm, dst_ub, 0, 1, 16, 0, 0)
    tik_instance.BuildCCE(kernel_name=kernel_name, inputs=[src_gm], outputs=[dst_gm])

    Result example:

    Input:
    [
      [1, 2, 3, 4, ......, 65, 66, 67, ......, 128],
      [1, 2, 3, 4, ......, 65, 66, 67, ......, 128]
    ]
    
    Output:
    [
      [0, 0.6931, 1.0986, 1.3863, ......, 4.1744, 4.1897, 0, ......, 0], 
      [0, 0.6931, 1.0986, 1.3863, ......, 4.1744, 4.1897, 0, ......, 0]
    ]