For Beginners

In the following sample, 128 data elements of type float16 are read from addresses A and B in Global Memory, added element by element, and the result is written to address C in Global Memory.

from tbe import tik
import tbe.common.platform as tbe_platform

def simple_add():
    # Set this parameter based on the Ascend AI Processor version.
    soc_version = "xxx"
    tbe_platform.set_current_compile_soc_info(soc_version, core_type="AiCore")
    tik_instance = tik.Tik()

    data_A = tik_instance.Tensor("float16", (128,), name="data_A", scope=tik.scope_gm)
    data_B = tik_instance.Tensor("float16", (128,), name="data_B", scope=tik.scope_gm)
    data_C = tik_instance.Tensor("float16", (128,), name="data_C", scope=tik.scope_gm)
    data_A_ub = tik_instance.Tensor("float16", (128,), name="data_A_ub", scope=tik.scope_ubuf)
    data_B_ub = tik_instance.Tensor("float16", (128,), name="data_B_ub", scope=tik.scope_ubuf)
    data_C_ub = tik_instance.Tensor("float16", (128,), name="data_C_ub", scope=tik.scope_ubuf)

    tik_instance.data_move(data_A_ub, data_A, 0, 1, 128 // 16, 0, 0)
    tik_instance.data_move(data_B_ub, data_B, 0, 1, 128 // 16, 0, 0)
    tik_instance.vec_add(128, data_C_ub[0], data_A_ub[0], data_B_ub[0], 1, 8, 8, 8)
    tik_instance.data_move(data_C, data_C_ub, 0, 1, 128 // 16, 0, 0)
    tik_instance.BuildCCE(kernel_name="simple_add", inputs=[data_A, data_B], outputs=[data_C])

    return tik_instance

The preceding code is detailed as follows.

  1. Import the Python module.
    from tbe import tik

    tbe.tik: provides all TIK-related Python functions. For details, see python/site-packages/tbe/tik in the CANN component directory.

  2. Create a TIK DSL container.
    Create a TIK DSL container by using TIK Constructor.
    tik_instance = tik.Tik()
    
  3. Define data.

    Define the input tensors data_A and data_B and the output tensor data_C in Global Memory by using Tensor. Each consists of 128 data elements of type float16.

    Define data_A_ub, data_B_ub, and data_C_ub in the Unified Buffer by using Tensor. Each consists of 128 data elements of type float16.

    • [API Definition] Tensor(dtype, shape, scope, name)
    • [Parameter Analysis]
      • dtype: data type of a tensor object.
      • shape: shape of a tensor object.
      • scope: buffer space where the tensor object is located. scope_gm indicates the data in the Global Memory. scope_ubuf indicates the data in the Unified Buffer.
      • name: tensor name, which must be unique.
    • [Example]
      # Define the input data data_A and data_B and output data data_C in the Global Memory. Each consists of 128 data elements of type float16.
      data_A = tik_instance.Tensor("float16", (128,), name="data_A", scope=tik.scope_gm)
      data_B = tik_instance.Tensor("float16", (128,), name="data_B", scope=tik.scope_gm)
      data_C = tik_instance.Tensor("float16", (128,), name="data_C", scope=tik.scope_gm)
      # Define data_A_ub, data_B_ub, and data_C_ub in the Unified Buffer. Each consists of 128 data elements of type float16.
      data_A_ub = tik_instance.Tensor("float16", (128,), name="data_A_ub", scope=tik.scope_ubuf)
      data_B_ub = tik_instance.Tensor("float16", (128,), name="data_B_ub", scope=tik.scope_ubuf)
      data_C_ub = tik_instance.Tensor("float16", (128,), name="data_C_ub", scope=tik.scope_ubuf)
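    As a quick sanity check on the tensor definitions above, the Unified Buffer footprint can be computed by hand. The following sketch (an illustration, not part of the TIK API; it uses NumPy for the byte-size arithmetic, and the 256 KB capacity figure is the one quoted later in this document) confirms that the three Unified Buffer tensors fit comfortably:

```python
import numpy as np

UB_CAPACITY = 256 * 1024  # Unified Buffer size in bytes, per this document

def ub_bytes_needed(shape, dtype):
    """Return the number of bytes a tensor of the given shape/dtype occupies."""
    return int(np.prod(shape)) * np.dtype(dtype).itemsize

# The three Unified Buffer tensors defined above: 128 float16 elements each.
total = 3 * ub_bytes_needed((128,), np.float16)
print(total, total <= UB_CAPACITY)  # prints: 768 True
```

At 768 bytes in total, the three buffers use well under 1% of the Unified Buffer, so no tiling is needed in this sample.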
  4. Move data in Global Memory to Unified Buffer.
    Data movement is implemented by using data_move: data in data_A is moved to data_A_ub, and data in data_B is moved to data_B_ub.
    • [API Definition] data_move (dst, src, sid, nburst, burst, src_stride, dst_stride, *args, **argv)
    • [Parameter Analysis]
      • src/dst: source address/destination address
      • sid: SIM ID, which is fixed to 0
      • burst/nburst: burst indicates the size of data moved each time (in units of 32 bytes), and nburst indicates the number of data movements. The data to be moved consists of 128 data elements of type float16, that is, 128 x 2 = 256 bytes, which is far less than the size of the Unified Buffer (256 KB). Therefore, the input data can be moved to the Unified Buffer at once (nburst = 1). Since one burst is 32 bytes, the burst length of each movement is 128 x 2 / 32 = 8 (burst = 8).
      • src_stride/dst_stride: strides of the source and destination addresses respectively. These two parameters need to be set when the data is moved with a specified interval. In the following example, both parameters are set to 0, which indicates the data is moved consecutively.
    • [Example]
      tik_instance.data_move(data_A_ub, data_A, 0, 1, 128 // 16, 0, 0)
      tik_instance.data_move(data_B_ub, data_B, 0, 1, 128 // 16, 0, 0)
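    The burst-length arithmetic in the parameter analysis above can be written out as a small helper. This is an illustrative sketch, not a TIK API; burst_len is a hypothetical name:

```python
def burst_len(num_elems, bytes_per_elem, block_bytes=32):
    """Number of 32-byte bursts needed to move num_elems elements.

    data_move transfers whole 32-byte blocks, so the total byte count
    must be a multiple of 32 in this sketch.
    """
    total_bytes = num_elems * bytes_per_elem
    assert total_bytes % block_bytes == 0, "data must fill whole blocks"
    return total_bytes // block_bytes

# 128 float16 elements, 2 bytes each -> 256 bytes -> 8 bursts,
# matching the 128 // 16 expression used in the data_move calls above.
print(burst_len(128, 2))  # prints: 8
```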
  5. Perform the vec_add operation on the data loaded to data_A_ub and data_B_ub and write the compute result to data_C_ub.

    Before implementing the computation, it helps to understand the basic operation units involved in TIK instructions.

    For TIK Vector instructions, 256 bytes of data can be processed per clock cycle. The masking function is provided to skip certain elements in the computation, and the iteration function is provided for repeated data computation.

    TIK instructions are processed in the space and time dimensions, supporting up to 256-byte data (that is, 128 float16/uint16/int16 elements, 64 float32/uint32/int32 elements, or 256 int8/uint8 elements) in the space dimension, and supporting the repeat operation in the time dimension. The data to be computed in a repeat operation is determined by the mask parameter. For float16 data, the Vector Unit computes 128 elements at once. For example, if mask is set to 128, the first 128 data elements of type float16 are computed.

    The computation is implemented based on the Add operator by using vec_add.

    • [API Definition] vec_add(mask, dst, src0, src1, repeat_times, dst_rep_stride, src0_rep_stride, src1_rep_stride)
    • [Parameter Analysis]
      • src0/src1/dst: source operand 0, source operand 1, and destination operand, which are data_A_ub, data_B_ub, and data_C_ub, respectively.
      • repeat_times: number of iteration repeats. Based on the preceding TIK instruction, the computation of 128 float16 elements can be completed in one iteration. Therefore, the value of repeat_times is 1.
      • dst_rep_stride/src0_rep_stride/src1_rep_stride: block-to-block stride between the destination operand/source operand 0/source operand 1 in adjacent iterations, in units of 32 bytes. In the following example, they are set to 8, meaning adjacent iterations are 8 x 32 = 256 bytes apart, which matches the 128 float16 elements processed per iteration.
      • mask: data operation validity indicator. The value 128 indicates that all elements are computed.
    • [Example]
      tik_instance.vec_add(128, data_C_ub[0], data_A_ub[0], data_B_ub[0], 1, 8, 8, 8)
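    The behavior described above, up to 128 float16 elements per repeat with mask selecting the leading elements, can be mirrored by a host-side NumPy reference model. This is an illustrative sketch, not the TIK API; vec_add_reference is a hypothetical name, and mask is assumed to be given as an element count, as in this example:

```python
import numpy as np

def vec_add_reference(mask, src0, src1, repeat_times, elems_per_repeat=128):
    """NumPy model of vec_add for float16: each repeat processes up to
    elems_per_repeat elements; mask selects how many leading elements
    within each repeat are computed."""
    dst = np.zeros_like(src0)
    for r in range(repeat_times):
        lo = r * elems_per_repeat
        dst[lo:lo + mask] = src0[lo:lo + mask] + src1[lo:lo + mask]
    return dst

a = np.arange(128, dtype=np.float16)
b = np.ones(128, dtype=np.float16)
c = vec_add_reference(128, a, b, repeat_times=1)
print(c[:4])  # prints: [1. 2. 3. 4.]
```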
  6. Move the compute result in data_C_ub to data_C by using data_move.
    tik_instance.data_move(data_C, data_C_ub, 0, 1, 128 // 16, 0, 0)
  7. Build the statements in the TIK DSL container into code that can run on the Ascend AI Processor.

    Build the TIK DSL container into an executable binary file for the Ascend AI Processor by using BuildCCE.

    • [API Definition] BuildCCE(kernel_name, inputs, outputs, output_files_path=None, enable_l2=False)
    • [Parameter Analysis]
      • kernel_name: indicates the kernel name of the AI Core operator in the generated binary code.
      • inputs: stores the input tensor to the program file in Global Memory.
      • outputs: stores the output tensor to the program file in Global Memory.
      • output_files_path: specifies the path to store files generated in the build. Defaults to ./kernel_meta.
      • enable_l2: This parameter does not take effect currently.
    • [Example]
      tik_instance.BuildCCE(kernel_name="simple_add", inputs=[data_A, data_B], outputs=[data_C])
  8. Return a TIK instance.
    return tik_instance
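Since running the built kernel requires an Ascend device, a common practice is to validate the device output against a host-side golden model. The following is a minimal NumPy sketch of what simple_add should produce; simple_add_golden is a hypothetical name, not part of TIK:

```python
import numpy as np

def simple_add_golden(data_a, data_b):
    """Host-side golden model of the simple_add kernel: C = A + B in float16."""
    return data_a.astype(np.float16) + data_b.astype(np.float16)

# Random test vectors shaped like the kernel's inputs.
a = np.random.rand(128).astype(np.float16)
b = np.random.rand(128).astype(np.float16)
expected = simple_add_golden(a, b)

# After running the built binary on a device, the contents of data_C
# should match `expected` element for element.
assert expected.shape == (128,) and expected.dtype == np.float16
```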