TIK Operator Generalization

Overview

Introduction to TIK gives an example operator in which both the data type and the number of data inputs are fixed. In addition, all tensors have a fixed size, and the TIK API parameters are set to fixed values. However, to write an operator that supports any valid data type, data shape, and format, operator generalization is required. The current TIK version is format-insensitive, which means that only the data type and the data shape need to be considered during operator generalization.

You can generalize a TIK operator in either of the following ways:
  1. Method 1: Obtain the input shape and data type of the operator from the method declaration of the operator. In this way, the tensor size to be allocated, number of loops, and instruction execution configuration can be automatically calculated based on the input shape and data type, enabling operators to adapt to different data types and shapes. At build time, the corresponding shape and data type are passed as arguments. Different .o files are built from the same code based on different inputs with determinate shapes and data types. At run time, no additional runtime parameter is required, and only the input and output addresses are needed.
  2. Method 2: Allocate memory space based on the maximum requirement. During build, the shape range is passed as a build argument. The generated .o file is shape-oriented. At runtime, the input and output shapes need to be passed in addition to the input and output addresses.
The two implementation methods are compared as follows:
  1. Method 1: .o files built in static build mode support only static-shape computation. Because the tensor sizes, loop counts, and instruction parameters are fixed at build time, the available space is fully used and unnecessary scalar operations are avoided, achieving optimal performance.

    TIK operators developed using this method can run in a network. In addition, a single operator can be called using AscendCL APIs.

  2. Method 2: A small number of .o files covers all shapes. Decisions such as branch selection are made at run time based on the runtime parameters. However, these decisions require scalar operations, which degrades performance.

    TIK operators developed using this method cannot be used in a network, but a single operator can still be called using AscendCL APIs.

You can select the build and running methods that best suit your needs.

The following describes only the precautions for operator generalization using method 1. For details about method 2, see TIK Custom Operator with Dynamic Shape.

Implementation

The following uses the Vadd operator running on the AI Core as an example to describe the precautions during operator generalization.

  1. Declare an operator.

    To enable the operator to support all valid data types and shapes, such information must be obtained from the operator declaration. The operator declaration is defined as follows:

    def vadd_sample(input_x, input_y, output_z, kernel_name):

    input_x and input_y are Vadd's inputs in dictionary format, including shape, ori_shape, format, ori_format, and dtype. output_z is the output in dictionary format and is reserved.

    The names, quantities, and sequence of the input and output tensors must be the same as those in the Operator Prototype Definition. Optional inputs also need to be defined in this step; the compute logic determines whether data is passed in and processes it accordingly.
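    As an illustration only, the argument dictionaries passed to vadd_sample might look like the following; the values are hypothetical and depend on the network, but the field names follow the description above.

```python
# Hypothetical example arguments for vadd_sample. The shape, format, and
# dtype values below are assumptions for illustration.
input_x = {
    "shape": (16, 1024),        # runtime shape
    "ori_shape": (16, 1024),    # original (framework) shape
    "format": "ND",             # runtime format
    "ori_format": "ND",         # original format
    "dtype": "float16",
}
# For Vadd, the second input must have the same dtype and shape.
input_y = dict(input_x)
# The output dict is reserved; here it mirrors the input layout.
output_z = dict(input_x)
```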

  2. Obtain the space occupied by data types and the data shapes, and allocate space in the memory (Global Memory) for input and output tensors.

    The supported TIK data types include uint8, int8, uint16, int16, float16, uint32, int32, float32, uint64, and int64. The space occupied by each data type can be derived from its type-name string. Therefore, define a get_bit_len function that computes the width (in bits) of a data type from its name. The function is implemented as follows:

    def get_bit_len(dtype):
        index = 0
        for i in dtype:
            if i.isdigit():
                break
            index += 1
        return int(dtype[index:])
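    A quick standalone check of the function above: it scans past the alphabetic prefix ("float", "uint", ...) and parses the trailing digits as the bit width.

```python
def get_bit_len(dtype):
    # Skip the alphabetic prefix of the type name.
    index = 0
    for i in dtype:
        if i.isdigit():
            break
        index += 1
    # Parse the remaining digits as the bit width.
    return int(dtype[index:])

print(get_bit_len("float16"))      # 16
print(get_bit_len("uint8"))        # 8
print(get_bit_len("int64") // 8)   # 8 bytes
```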

    Dynamically obtain the input shape and data type from the inputs. If more parameters are required during the operator's computation, obtain them from the inputs in the same way.

    class Vadd():
        def __init__(self, input_x, input_y, output_z, kernel_name="vadd_sample"):
            # Obtain the shape and data type of the input_x input tensor. For the Vadd operator, you must ensure that the dtype and shape of the two inputs are the same.
            self.shape_x = input_x.get("shape")
            self.dtype_x = input_x.get("dtype")
            self.shape_y = input_y.get("shape")
            self.dtype_y = input_y.get("dtype")
            self.shape_z = output_z.get("shape")
            self.dtype_z = output_z.get("dtype")
            self.kernel_name = kernel_name
            # Set this parameter based on the Ascend AI Processor version in use.
            soc_version="xxx"
            # The target core type is the AI Core by default.
            tbe_platform.set_current_compile_soc_info(soc_version)
            self.tik_instance = tik.Tik()

    Allocate memory space for the input and output tensors.

            self.input_x_gm = self.tik_instance.Tensor(
                self.dtype_x, self.shape_x, name="input_x_gm", scope=tik.scope_gm)
            self.input_y_gm = self.tik_instance.Tensor(
                self.dtype_y, self.shape_y, name="input_y_gm", scope=tik.scope_gm)
            self.output_z_gm = self.tik_instance.Tensor(
                self.dtype_z, self.shape_z, name="output_z_gm", scope=tik.scope_gm)
  3. The input data size is dynamically obtained from the operator declaration. Therefore, the number of loops and the parameters of the compute instructions are not fixed constants. Before the computation in the Unified Buffer, the related values must be computed, for example, the number of AI Cores, the number of elements that can be stored in the Unified Buffer, and the number of elements to be computed per loop.
            # Obtain the number of AI Cores.
            self.aicore_num = tbe_platform.get_soc_spec("CORE_NUM")
     
            # The data read and write on the UB must be 32-byte aligned. This parameter is used to compute the tensor division and data movement instructions.
            block_byte_size = 32
            # Obtain the UB size in bytes.
            ub_size_bytes = tbe_platform.get_soc_spec("UB_SIZE")
     
            # Compute the number of bytes corresponding to the input data type.
            dtype_bytes_size = get_bit_len(self.dtype_x) // 8
            # Compute the number of elements per block based on the input data type.
            self.data_each_block = block_byte_size // dtype_bytes_size
     
            # Compute ub_size_bytes//dtype_bytes_size to obtain the maximum number of elements that can be stored in the UB. Because two input tensors need to be stored, the maximum number of elements that can be stored of each tensor is ub_size_bytes//dtype_bytes_size//2.
            # The data read from and written to the UB must be 32-byte aligned. Therefore, align the data on top of the result obtained from the previous step. Assume that the obtained result is data_per_tensor. Perform alignment according to data_per_tensor//data_each_block * data_each_block, obtaining the number of elements of each tensor in the UB.
            self.ub_tensor_size = (
                ub_size_bytes // dtype_bytes_size // 2 // self.data_each_block * self.data_each_block)
     
            # Compute the total number of input elements. functools_reduce is functools.reduce imported under an alias.
            self.input_num = functools_reduce(lambda x, y: x * y, self.shape_x)
     
            # Compute the number of data elements scheduled to each AI Core. This sample assumes the total is evenly divisible by the number of cores; otherwise the tail block must be 32-byte aligned and handled separately.
            self.data_num_each_core = self.input_num // self.aicore_num
     
            # Compute the number of elements that can be computed in each repeat. Each repeat of the Vector instruction computes a maximum of eight blocks (256 bytes). In this case, the mask value is the maximum value of the corresponding data type.
            self.vector_mask_max = 8 * self.data_each_block
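    The sizing arithmetic in this step can be sketched standalone in plain Python. The concrete values below are assumptions for illustration: a 248 KB Unified Buffer, 2 AI Cores, and a float16 input of shape (16, 1024); on real hardware the UB size and core count come from get_soc_spec().

```python
from functools import reduce

# Assumed values for illustration; real values come from get_soc_spec().
ub_size_bytes = 248 * 1024
aicore_num = 2
shape_x = (16, 1024)
dtype_bytes_size = 16 // 8                 # float16 -> 2 bytes
block_byte_size = 32                       # UB access is 32-byte aligned

# Elements per 32-byte block for this dtype.
data_each_block = block_byte_size // dtype_bytes_size
# Half the UB per input tensor, rounded down to a 32-byte boundary.
ub_tensor_size = (ub_size_bytes // dtype_bytes_size // 2
                  // data_each_block * data_each_block)
# Total number of input elements and the share of each AI Core.
input_num = reduce(lambda x, y: x * y, shape_x)
data_num_each_core = input_num // aicore_num
# One Vector repeat processes at most 8 blocks (256 bytes).
vector_mask_max = 8 * data_each_block
```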
  4. Enable multi-core computing.

    To make full use of the AI Cores, use the for_range function to enable multi-core parallel computing. The tensors stored in the Unified Buffer must be defined inside the for_range multi-core loop, as shown below:

    def vadd_compute(self):
        with self.tik_instance.for_range(0, self.aicore_num, block_num=self.aicore_num) as index:
            # Create a tensor on the Unified Buffer. The shape is ub_tensor_size obtained in the previous step.
            self.input_x_ub = self.tik_instance.Tensor(
                self.dtype_x, (self.ub_tensor_size,), name="input_x_ub", scope=tik.scope_ubuf)
            self.input_y_ub = self.tik_instance.Tensor(
                self.dtype_y, (self.ub_tensor_size,), name="input_y_ub", scope=tik.scope_ubuf)
  5. Move the data in the Global Memory to the Unified Buffer and perform computation. Note that the offset of each movement is the number of processed data elements.
            # index is the index number of the AI Core.
            move_offset = index * self.data_num_each_core
            # Each AI Core is responsible for its own data tiles.
            self.vadd_compute_each_core(move_offset, self.data_num_each_core)

    The following describes the compute logic of each AI Core, that is, the vadd_compute_each_core function.

    The implementation of this function is one of the main difficulties in operator generalization, because the UB stores at most 248 KB of data, that is, up to 124 KB for tensor A and 124 KB for tensor B. In addition, the amount of data computed by one vec_add instruction is limited, so multiple loops are usually required.

    def vadd_compute_each_core(self, core_move_offset, move_num):
        # Compute the number of full UB-sized loops.
        loop_time = move_num // self.ub_tensor_size
        move_offset = core_move_offset
        if loop_time > 0:
            # Typical for_range loop
            with self.tik_instance.for_range(0, loop_time) as loop_index:
                # move_offset is updated in every iteration, which is equivalent to moving a pointer in the memory. Note that the per-core base offset core_move_offset must be preserved.
                move_offset = core_move_offset + loop_index * self.ub_tensor_size
                # Pass the offset and computation amount to the loop function.
                self.vadd_compute_each_loop(move_offset, self.ub_tensor_size)
            move_offset = core_move_offset + loop_time * self.ub_tensor_size
        # Process the remainder that is not enough to fill a whole UB tensor.
        last_num = move_num % self.ub_tensor_size
        if last_num > 0:
            self.vadd_compute_each_loop(move_offset, last_num)
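    The tiling performed by vadd_compute_each_core can be sketched in plain Python: split move_num elements into full UB-sized chunks plus one remainder chunk. The ub_tensor_size value in the example call is an assumption for illustration.

```python
# Sketch of the per-core tiling: returns (offset, size) pairs covering
# move_num elements starting at core_offset.
def tile(core_offset, move_num, ub_tensor_size):
    chunks = []
    loop_time = move_num // ub_tensor_size
    for loop_index in range(loop_time):
        # Full UB-sized chunk; the per-core base offset is preserved.
        chunks.append((core_offset + loop_index * ub_tensor_size, ub_tensor_size))
    last_num = move_num % ub_tensor_size
    if last_num > 0:
        # Remainder that does not fill a whole UB tensor.
        chunks.append((core_offset + loop_time * ub_tensor_size, last_num))
    return chunks

# Example: 8192 elements on the second core, 3000-element UB tensors.
print(tile(8192, 8192, 3000))   # [(8192, 3000), (11192, 3000), (14192, 2192)]
```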

    The loop function vadd_compute_each_loop is implemented as follows:

    def vadd_compute_each_loop(self, move_offset, move_num):
        # Move data from the memory to the UB.
        burst_len = math.ceil(move_num / self.data_each_block)
        self.tik_instance.data_move(self.input_x_ub, self.input_x_gm[move_offset], 0, 1, burst_len, 0, 0)
        self.tik_instance.data_move(self.input_y_ub, self.input_y_gm[move_offset], 0, 1, burst_len, 0, 0)
     
        # Call vec_add to execute the computation task.
        # Compute the number of vec_add instructions that the total data volume can fill.
        vadd_loop = move_num // (self.vector_mask_max * 255)
        add_offset = 0
        if vadd_loop > 0:
            # Loop over the vec_add instruction.
            with self.tik_instance.for_range(0, vadd_loop) as add_index:
                add_offset = add_index * self.vector_mask_max * 255
                self.tik_instance.vec_add(self.vector_mask_max, 
                                           self.input_x_ub[add_offset], 
                                           self.input_x_ub[add_offset], 
                                           self.input_y_ub[add_offset],  
                                           255, 8, 8, 8)
            add_offset = vadd_loop * self.vector_mask_max * 255
     
        # Compute the number of loops that the remaining data in the previous step can fill and call vec_add for computation.
        repeat_time = (move_num % (self.vector_mask_max * 255) // self.vector_mask_max)
        if repeat_time > 0:
            self.tik_instance.vec_add(self.vector_mask_max,
                                       self.input_x_ub[add_offset],
                                       self.input_x_ub[add_offset],
                                       self.input_y_ub[add_offset], 
                                       repeat_time, 8, 8, 8)
            add_offset += repeat_time * self.vector_mask_max
     
        # Compute the amount of data that is not processed after the previous step and pass it as the mask value to call vec_add for the last time.
        last_num = move_num % self.vector_mask_max
        if last_num > 0:
            self.tik_instance.vec_add(last_num, 
                                       self.input_x_ub[add_offset],
                                       self.input_x_ub[add_offset],
                                       self.input_y_ub[add_offset], 
                                       1, 8, 8, 8)
     
        # Move the compute result from the UB back to the memory.
        self.tik_instance.data_move(self.output_z_gm[move_offset],
                                        self.input_x_ub, 0, 1, burst_len, 0, 0)
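    The three-stage decomposition in vadd_compute_each_loop can be checked with a plain-Python sketch: full 255-repeat batches, then the remaining full repeats, then one masked call for the tail. For float16, vector_mask_max is 8 blocks of 16 elements, that is, 128.

```python
# Sketch of how move_num elements are decomposed into vec_add calls.
# Each returned tuple is (element offset, mask, repeat count).
def plan_vec_add(move_num, vector_mask_max=128, max_repeat=255):
    calls = []
    offset = 0
    batch = vector_mask_max * max_repeat
    # Stage 1: batches that saturate the 255-repeat limit.
    for _ in range(move_num // batch):
        calls.append((offset, vector_mask_max, max_repeat))
        offset += batch
    # Stage 2: remaining full repeats (fewer than 255).
    repeat_time = move_num % batch // vector_mask_max
    if repeat_time > 0:
        calls.append((offset, vector_mask_max, repeat_time))
        offset += repeat_time * vector_mask_max
    # Stage 3: tail handled by shrinking the mask in a single repeat.
    last_num = move_num % vector_mask_max
    if last_num > 0:
        calls.append((offset, last_num, 1))
    return calls

print(plan_vec_add(63488))   # [(0, 128, 255), (32640, 128, 241)]
```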

    At this point, the Vadd operator has been generalized. For the complete sample code, see For Advanced.