Tiling Principles

In general, the tiling policy should be designed according to the specific tensor shapes and the hardware capabilities of the Ascend AI Processor. Specifically, you need to try to:
  1. Fully exploit the hardware's vector computation capability.

    For Tensor Iterator Kernel (TIK) vector instructions, 256 bytes of data can be processed per clock cycle. A masking function is provided to skip certain elements in the computation, and an iteration (repeat) function is provided for repeated data computation. Data storage must be 32-byte aligned due to physical limitations of the Unified Buffer. Data to be processed by the same instruction should be stored contiguously so that as many iteration repeats as possible can be issued. For example, for data in NCHW format, if the C dimension is far greater than the product of the H and W dimensions, convert the CHW layout into HWC so that data in the C dimension is stored contiguously.
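
    The two points above can be sketched in plain Python. The helper names below are hypothetical, and the element counts assume float16 data (2 bytes per element): one 256-byte vector repeat then covers 128 elements, and one 32-byte block covers 16.

```python
# Sketch with assumed float16 data: one TIK vector repeat covers
# 256 bytes (128 fp16 elements); Unified Buffer blocks are 32 bytes
# (16 fp16 elements). Helper names are illustrative, not TIK API.
ELEMS_PER_REPEAT = 256 // 2
BLOCK_ELEMS = 32 // 2

def repeats_and_tail(num_elems):
    """Full 256-byte repeats, plus a tail to be handled via the mask."""
    return num_elems // ELEMS_PER_REPEAT, num_elems % ELEMS_PER_REPEAT

def nchw_to_nhwc(flat, n, c, h, w):
    """Reorder a flat NCHW buffer into NHWC so the C axis is contiguous."""
    out = []
    for ni in range(n):
        for hi in range(h):
            for wi in range(w):
                for ci in range(c):
                    out.append(flat[((ni * c + ci) * h + hi) * w + wi])
    return out
```

    With the C axis contiguous, a single instruction can sweep C elements in long runs of full repeats instead of strided accesses across the H*W plane.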

  2. Reduce data movement to and from the AI Cores as much as possible.

    Data is moved to and from the AI Cores over the CPU data bus, and frequent movements compromise performance. For small tensors, move the entire NCHW data to the Unified Buffer in one pass. For larger tensors, tile the data so that each tile utilizes the Unified Buffer as fully as possible.
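
    A minimal tiling sketch, under assumptions: a 248 KiB Unified Buffer (the exact capacity varies by Ascend chip), float16 data, and the 32-byte alignment constraint from point 1. The helper names are illustrative.

```python
# Assumed sizes: Unified Buffer capacity is chip-dependent; 248 KiB is
# used here only as an example. Data is float16 (2 bytes per element).
UB_BYTES = 248 * 1024
DTYPE_BYTES = 2
ALIGN_ELEMS = 32 // DTYPE_BYTES   # 32-byte alignment, in elements

def max_tile_elems(buffers=2):
    """Largest 32-byte-aligned tile when `buffers` tiles (e.g. one
    input and one output) must coexist in the Unified Buffer."""
    per_buffer = UB_BYTES // buffers // DTYPE_BYTES
    return per_buffer // ALIGN_ELEMS * ALIGN_ELEMS

def num_tiles(total_elems, buffers=2):
    """How many bus transfers a tensor needs at that tile size."""
    tile = max_tile_elems(buffers)
    return (total_elems + tile - 1) // tile   # ceiling division
```

    Maximizing the tile size minimizes the number of transfers, which is exactly the "utilize the Unified Buffer as much as possible" guidance above.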

  3. Fully exploit multi-core parallelism and pipelining.

    More than one AI Core is provided on the Ascend AI Processor. Try to allocate the TIK computation evenly among the AI Cores to maximize their aggregate computing power, and reserve the necessary buffer space for the Cube and Vector units of each AI Core. Data movement between the Unified Buffer of each AI Core and external memory is performed independently, so try to minimize the wait time between vector instructions by adopting double buffering, also known as ping-pong buffering, in the design.
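
    An even multi-core split can be sketched as follows. The core count is chip-dependent (different Ascend chips expose different numbers of AI Cores), so it is a parameter here, and each core's share is kept 32-byte aligned per the constraint in point 1.

```python
def split_across_cores(total_elems, num_cores=4, align_elems=16):
    """Split work evenly over AI Cores (core count is chip-dependent).
    Each share is rounded down to a 32-byte boundary (16 fp16 elements);
    the last core absorbs the remainder."""
    per_core = total_elems // num_cores // align_elems * align_elems
    shares = [per_core] * num_cores
    shares[-1] += total_elems - per_core * num_cores
    return shares
```

    A more balanced variant would spread the remainder across several cores, but the principle is the same: no core should sit idle while another processes a disproportionate share.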

    Take the BatchNorm operator as an example. The Unified Buffer is divided into two halves. While a vector instruction is processing data in one half, the input data of the next vector instruction can be moved into the other half in advance, hiding the wait time between vector instructions.
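
    The resulting schedule can be modeled as below. This is only an illustration of the instruction ordering, not TIK code; in TIK itself, double buffering can typically be enabled by setting the `thread_num` parameter of `for_range` to 2.

```python
def double_buffer_trace(num_tiles):
    """Event order for a ping-pong pipeline over two buffer halves:
    the copy-in of tile i+1 is issued before the compute of tile i,
    so data movement overlaps vector computation."""
    events = []
    for i in range(num_tiles):
        events.append(("copy_in", i, i % 2))          # fill one half
        if i > 0:
            events.append(("compute", i - 1, (i - 1) % 2))  # work on the other
    events.append(("compute", num_tiles - 1, (num_tiles - 1) % 2))
    return events
```

    In the trace, every compute on half 0 is preceded by a copy-in targeting half 1 (and vice versa), so the vector unit never waits on the transfer of its own operands.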