Design Guidelines
This section uses the BatchNorm operator as an example to describe how to develop an operator in TIK mode.
Operator Analysis
The BatchNorm operator consists of the following two operations:
- Normalizes the input: xnorm = (x – μ)/σ, where μ and σ are the mean and standard deviation, respectively.
- Scales and shifts the normalized input: y = γ * xnorm + β
This sample focuses on the dynamic shape implementation. Therefore, only the normalization part is described.
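The normalization step above can be sketched in plain Python. This is only an illustration of the math (the function name and `eps` guard are illustrative, not part of the TIK API); a real TIK kernel expresses the same computation with vector instructions on the AI Core.

```python
import math

def batchnorm_normalize(x, eps=1e-5):
    """Normalize one channel's pixels: xnorm = (x - mu) / sigma.

    Pure-Python sketch of the normalization step; mu is the mean and
    sigma the standard deviation of the input. eps (illustrative) guards
    against division by zero when the variance is zero.
    """
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    sigma = math.sqrt(var + eps)
    return [(v - mu) / sigma for v in x]
```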
Operator Specifications
| Specification | Description |
|---|---|
| Framework | Caffe |
| Format | NCHW (given the single-operator execution scenario) |
| Input data type | (User-defined) float16 |
| N | (User-defined) N = 1 |
| C | (User-defined) 0–1024 channels |
| H | (User-defined) 0–1024 pixels |
| W | (User-defined) 0–1024 pixels |
| Shape | Any shape |
Tiling Policy Design
| Scenario | Policy |
|---|---|
| Channel parallelism first (C > H * W) | Tiling_1: Compute BatchNorm along the channels pixel-wise in parallel. Keep the channels 32-byte aligned. Apply ping-pong buffering and AI Core parallelism. |
| HW parallelism first (C < H * W) | Tiling_2: For a single feature map, if H * W * C > 112 KB (C = 2), data of no more than one channel can be moved at a time. Therefore, tile the feature map along each channel for BatchNorm. Apply ping-pong buffering between the Global Memory and the Unified Buffer. Note that the 112 KB benchmark is half of the Unified Buffer space, with some buffer space reserved and ping-pong buffering taken into account.<br>Tiling_3: For a single feature map, if H * W * C <= 112 KB (C > 1), data of more than one channel can be moved at a time. Apply ping-pong buffering between the Global Memory and the Unified Buffer, and AI Core parallelism between channels. |

The following gives a detailed description of each policy:
- Channel parallelism first
In the C > H * W scenario, for example, NCHW = 1 x 1024 x 8 x 6, channel parallelism is preferred. Therefore, BatchNorm should be computed along the channels pixel-wise in parallel. Also consider ping-pong buffering and AI Core parallelism to improve Unified Buffer utilization.
- If N * C * H * W <= 112 KB, the data can be moved from the Global Memory to the Unified Buffer, processed, and moved back to the Global Memory all at once. Therefore, tiling along the HW plane is not needed.
- If N * C * H * W > 112 KB, the data needs to be tiled along the HW plane for the Global Memory > Unified Buffer > Global Memory movement. If H * W or C is not 32-byte aligned, it is rounded up.

Note: In the preceding figure, M is a multiple of 16, used in this example to round the float16 data up to the nearest multiple of 32 bytes (16 elements x 2 bytes).
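The two conditions above, together with the rounding-up rule, can be sketched as follows. This is an illustrative sketch, assuming the 112 KB comparison applies to the data size in bytes for float16 (2 bytes per element); the function names are hypothetical, not TIK API.

```python
UB_HALF_BYTES = 112 * 1024   # half the Unified Buffer, as described above
FLOAT16_BYTES = 2            # assumed element size for the float16 input

def needs_hw_tiling(n, c, h, w):
    """Channel-parallelism-first scenario: decide whether the data must be
    tiled along the HW plane before the GM > UB > GM movement."""
    return n * c * h * w * FLOAT16_BYTES > UB_HALF_BYTES

def round_up(n, multiple=16):
    """Round n up to a multiple of 16 float16 elements (= 32 bytes),
    matching the 32-byte alignment requirement above."""
    return (n + multiple - 1) // multiple * multiple
```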
- Single-channel HW parallelism first
- If H * W < 112 KB and H * W * 2 > 112 KB, pixel-wise BatchNorm is performed per feature map.
- If H * W > 112 KB, tiling of the feature map needs to be considered.
Consider ping-pong buffering along the Global Memory > Unified Buffer > Global Memory movement to save the wait time between vector instructions.
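The ping-pong scheme above alternates between two halves of the Unified Buffer so that the copy-in of one tile can overlap the compute and copy-out of the previous tile. A minimal sketch of the resulting schedule (buffer names and the function are illustrative, not real GM/UB moves):

```python
def pingpong_tiles(num_tiles):
    """Assign each tile to one of two Unified Buffer halves ("ping"/"pong")
    so adjacent tiles use different buffers and their data movement and
    vector computation can overlap."""
    return [(i, "ping" if i % 2 == 0 else "pong") for i in range(num_tiles)]
```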

- Multi-channel HW parallelism first
Given the Unified Buffer size, if H * W * C <= 112 KB (C > 1), data of more than one channel can be moved at a time. Consider using multiple AI Cores for optimization between channels.
Consider ping-pong buffering along the Global Memory > Unified Buffer > Global Memory movement to save the wait time between vector instructions.
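Putting the three policies together, the tiling choice from the table above can be sketched as a simple dispatch. This is an illustrative sketch assuming the 112 KB comparison applies to the float16 data size in bytes; the function name is hypothetical.

```python
UB_HALF_BYTES = 112 * 1024   # half the Unified Buffer, per the table above
FLOAT16_BYTES = 2            # assumed element size for the float16 input

def select_tiling(c, h, w):
    """Pick a tiling policy (names Tiling_1/2/3 as in the table)."""
    if c > h * w:
        return "Tiling_1"    # channel parallelism first
    if h * w * c * FLOAT16_BYTES > UB_HALF_BYTES:
        return "Tiling_2"    # tile each feature map along HW
    return "Tiling_3"        # move several channels at once, AI Core parallel
```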
