Data Tiling for Computation
The Unified Buffer has limited space. If the input data is too large, the input data and output results cannot be stored in it as a whole. In this case, the input data must be tiled before being moved to the AI Core for computation. Pay attention to the following points during data tiling:
- Use the Unified Buffer efficiently to reduce the number of data movements and improve performance.
- The data storage must be 32-byte aligned due to physical limitations of the Unified Buffer.
- Data to be processed by the same instruction should be stored contiguously, so that as many repeat iterations as possible can be performed by a single instruction, improving the utilization of the Vector Unit.
- Pay attention to the processing of the remainder elements. A single Vector instruction supports a maximum of 255 repeats, and each repeat processes up to eight blocks, for example, 128 elements of type float16. The Vector computation can therefore be divided into three steps:
- If the number of elements is greater than 255 x 128, set repeat = 255 and mask = 128, and call the instruction in a loop until fewer than 255 x 128 elements remain.
- If the number of remaining elements is at least 128 but less than 255 x 128, set mask = 128, calculate the required repeat value, and process the data with one instruction.
- If the number of remaining elements is less than 128, set mask to the number of remaining elements and process them with one instruction.
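The three-step split above can be sketched in plain Python. The element count 40000 is a hypothetical example, and float16 is assumed so that one full-mask repeat covers 128 elements (8 blocks x 16 elements per 32-byte block):

```python
# Sketch of the three-step split for float16 (128 elements per full-mask repeat).
MASK_MAX = 128          # 8 blocks x 16 float16 elements per 32-byte block
MAX_REPEAT = 255        # maximum repeat times of one Vector instruction

total = 40000           # hypothetical number of elements to process
full_loops = total // (MASK_MAX * MAX_REPEAT)   # step 1: calls with repeat = 255
rest = total % (MASK_MAX * MAX_REPEAT)
repeat_time = rest // MASK_MAX                  # step 2: repeats of one full-mask call
last_num = rest % MASK_MAX                      # step 3: mask of the final call

print(full_loops, repeat_time, last_num)        # prints: 1 57 64
```

The three parts always add up to the total: 1 x 255 x 128 + 57 x 128 + 64 = 40000, so every element is processed exactly once.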
Taking tensor addition as an example, the following code shows the compute tiling solution for a large amount of data.
# data_each_block indicates the number of elements of the given data type that fit in one block (32 bytes).
# A Vector instruction processes up to eight blocks per repeat, so vector_mask_max is the maximum value of mask.
vector_mask_max = 8 * data_each_block
# move_num indicates the number of elements to be moved.
# Step 1: calculate the number of vec_add calls in a loop when repeat_times = 255.
vadd_loop = move_num // (vector_mask_max * 255)
add_offset = 0
if vadd_loop > 0:
    with tik_instance.for_range(0, vadd_loop) as add_index:
        add_offset = add_index * vector_mask_max * 255
        tik_instance.vec_add(vector_mask_max,
                             input_x_ub[add_offset],
                             input_x_ub[add_offset],
                             input_y_ub[add_offset], 255, 8, 8, 8)
# Step 2: for the remainder, calculate the number of repeats of a single vec_add call with a full mask.
repeat_time = move_num % (vector_mask_max * 255) // vector_mask_max
# Set the offset unconditionally so that step 3 is correct even when repeat_time is 0.
add_offset = vadd_loop * vector_mask_max * 255
if repeat_time > 0:
    tik_instance.vec_add(vector_mask_max,
                         input_x_ub[add_offset],
                         input_x_ub[add_offset],
                         input_y_ub[add_offset], repeat_time, 8, 8, 8)
# Step 3: fewer than vector_mask_max elements remain; process them with mask = last_num.
last_num = move_num % vector_mask_max
if last_num > 0:
    add_offset += repeat_time * vector_mask_max
    tik_instance.vec_add(last_num,
                         input_x_ub[add_offset],
                         input_x_ub[add_offset],
                         input_y_ub[add_offset], 1, 8, 8, 8)
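To check that the three branches together cover every element exactly once, the same control flow can be emulated in plain NumPy. This is only a verification sketch: slice-wise additions stand in for vec_add, and the array sizes are hypothetical.

```python
import numpy as np

MASK_MAX = 128      # full mask, assuming float16-sized elements
MAX_REPEAT = 255    # maximum repeat times of one Vector instruction

def tiled_add(x, y):
    # Emulate the three-step tiling; each slice addition stands in for one vec_add call.
    out = x.copy()
    move_num = x.size
    chunk = MASK_MAX * MAX_REPEAT
    vadd_loop = move_num // chunk
    for i in range(vadd_loop):                  # step 1: calls with repeat = 255
        off = i * chunk
        out[off:off + chunk] += y[off:off + chunk]
    repeat_time = move_num % chunk // MASK_MAX
    off = vadd_loop * chunk
    if repeat_time > 0:                         # step 2: one call with a full mask
        n = repeat_time * MASK_MAX
        out[off:off + n] += y[off:off + n]
    last_num = move_num % MASK_MAX
    if last_num > 0:                            # step 3: one call with mask = last_num
        off += repeat_time * MASK_MAX
        out[off:off + last_num] += y[off:off + last_num]
    return out

x = np.arange(40000, dtype=np.float32)
y = np.ones(40000, dtype=np.float32)
assert np.array_equal(tiled_add(x, y), x + y)
```

The emulation exercises all three branches for a 40000-element input (one full 255-repeat call, a 57-repeat call, and a final 64-element call) and confirms the result matches a direct element-wise addition.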