Data Layout Formats

In deep learning, a data layout format describes how a multi-dimensional tensor is stored in memory.

Common data formats include ND, NHWC, and NCHW; the latter two assign a specific meaning (such as height, width, or channels) to each axis of a tensor.

In addition to the NHWC and NCHW formats, there are several special proprietary data formats, such as FRACTAL_NZ (NZ for short), NC1HWC0, FRACTAL_Z, NDC1HWC0, and FRACTAL_Z_3D. These formats are introduced to meet the high-performance computing requirements of the Cube Unit in the AI Core. By optimizing the memory layout, these formats can improve the computational efficiency. Their specific applications can be observed when you develop related operators using matrix multiplication and convolution APIs.

Common Formats

ND, NHWC, and NCHW
Data layout formats were originally used to represent how images are stored in memory. Common formats include ND, NHWC, and NCHW. In general, every tensor is n-dimensional (ND). NHWC and NCHW go further and assign a specific meaning, such as height, width, or channels, to each axis of a four-dimensional tensor.

The main difference between NHWC and NCHW lies in the placement of the channel dimension:
- In NHWC format, the channel dimension is located at the last position.
- In NCHW format, the channel dimension is located before the height and width dimensions.

The meaning of each axis is described as follows:
- N: batch size, indicating the number of images.
- H: height of the image, that is, the number of pixels in the vertical direction.
- W: width of the image, that is, the number of pixels in the horizontal direction.
- C: number of channels. For example, an RGB image has 3 channels.

As shown in Figure 1, for an RGB image, the pixel values of each channel are clustered in sequence as RRRRRRGGGGGGBBBBBB with the NCHW format. However, with the NHWC format, the pixel values are interleaved as RGBRGBRGBRGBRGBRGB.

Figure 1 Example of NCHW and NHWC storage

Although the stored data is the same, data access characteristics vary with the storage order. As a result, compute performance can differ even for the same operation.
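The difference between the two storage orders can be reproduced with a small NumPy sketch. The 1 x 2 x 3 image size and the variable names are chosen only for illustration; the letters stand in for pixel values.

```python
import numpy as np

# Hypothetical batch: 1 image, 2 x 3 pixels, 3 channels (R, G, B).
n, h, w, c = 1, 2, 3, 3

# NHWC: the channel axis is last, so each pixel's R, G, B values are adjacent.
nhwc = np.tile(np.array(["R", "G", "B"]), (n, h, w, 1))

# NCHW: move the channel axis in front of H and W, then materialize the
# new memory order with ascontiguousarray.
nchw = np.ascontiguousarray(nhwc.transpose(0, 3, 1, 2))

# Flatten each array in memory order.
nhwc_flat = "".join(nhwc.ravel())
nchw_flat = "".join(nchw.ravel())
print(nhwc_flat)  # RGBRGBRGBRGBRGBRGB
print(nchw_flat)  # RRRRRRGGGGGGBBBBBB
```

Both arrays hold the same 18 values; only the order in which they sit in memory differs.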

NDHWC and NCDHW
NDHWC and NCDHW are 5D tensor formats that extend NHWC and NCHW with a depth dimension D. D represents the extent of the data along the depth axis, such as the time steps of a video or the depth layers of a medical image. These formats therefore facilitate convolution operations along that axis, for example in the time dimension. The following figure shows the data format of NDHWC.

Special Formats for Matrix Multiplication

When the Mmad basic API is used to perform matrix multiplication, the input and output data formats of the matrices must meet certain requirements. As shown in the following figures, matrix A (in L0A Buffer) must be in FRACTAL_ZZ format, matrix B (in L0B Buffer) must be in FRACTAL_ZN format, and matrix C (in L0C Buffer) must be in FRACTAL_NZ format. These formats divide a matrix into fractals (Fractal Matrix), aligning with the hardware feature that the Cube Unit reads (16, 16) x (16, 16) data for computation each time (using the half data type as an example), thereby improving the matrix computation efficiency. The size of a fractal is related to the data type and the storage location. For details, please refer to the following description.

FRACTAL_NZ/NZ

The FRACTAL_NZ (NZ for short) format is obtained by performing padding, reshaping, and transposing operations on the two lowest dimensions of a tensor (where higher-order dimensions are on the left and lower-order dimensions on the right). The following shows the conversion process:

A matrix of size (M, N) is divided into M1 x N1 fractals, which are arranged in column-major order, forming an N-shaped pattern. Each fractal contains M0 x N0 elements, which are arranged in row-major order, forming a Z-shaped pattern. That is why this data format is called the NZ format. (M0, N0) indicates the size of a fractal.

The formula is as follows:

(..., B, M, N)->pad->(..., B, M1 * M0, N1 * N0)->reshape->(..., B, M1, M0, N1, N0)->transpose->(..., B, N1, M1, M0, N0)

The NZ format is typically used in both L0C Buffer and L1 Buffer, but for different purposes and with different fractal sizes.

In L0C Buffer, the NZ format is used to store the results of matrix multiplication. Each fractal has a shape of 16 x 16, containing 256 elements. This structure is ideal for the Cube Unit to perform efficient matrix multiplication.
In L1 Buffer, the NZ format is adopted to facilitate the conversion to the corresponding ZZ and ZN formats when data is moved to L0A Buffer and L0B Buffer. In this case, the fractal shape is 16 x (32B / sizeof(Datatype)), with a size of 512 bytes.

Therefore, when data is moved from L0C Buffer to L1 Buffer, the fractal size may change.
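The following sketch works through the fractal sizes stated above. It assumes only the two rules given in this section (16 x 16 elements in L0C Buffer, 16 rows of 32 bytes in L1 Buffer); the helper names are hypothetical, and the float32 case is shown purely as arithmetic, not as a statement about which data types a given move supports.

```python
import numpy as np

L0C_FRACTAL = (16, 16)  # 16 x 16 elements, regardless of data type


def l1_fractal_shape(dtype):
    """Fractal shape in L1 Buffer: 16 rows, each holding 32 bytes."""
    return (16, 32 // np.dtype(dtype).itemsize)


def fractal_bytes(shape, dtype):
    """Size in bytes of one fractal of the given shape and data type."""
    return shape[0] * shape[1] * np.dtype(dtype).itemsize


for dtype in ("float16", "float32"):
    l1 = l1_fractal_shape(dtype)
    print(dtype,
          "L0C:", L0C_FRACTAL, fractal_bytes(L0C_FRACTAL, dtype), "bytes;",
          "L1:", l1, fractal_bytes(l1, dtype), "bytes")
```

For a 16-bit type the two fractal shapes coincide (16 x 16, 512 bytes), whereas for a 32-bit type the L1 fractal is 16 x 8 (512 bytes) while a 16 x 16 fractal occupies 1024 bytes.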

The following example describes how the ND format is converted to the NZ format.

The shape of the original tensor is (20, 28).

import numpy as np

# Fill a (20, 28) float16 tensor with 0, 1, ..., 559 in row-major order.
tensor_a = np.arange(20 * 28, dtype="float16").reshape((20, 28))
print(tensor_a)

The original tensor data is printed as follows:

[[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
   14.  15.  16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.]
 [ 28.  29.  30.  31.  32.  33.  34.  35.  36.  37.  38.  39.  40.  41.
   42.  43.  44.  45.  46.  47.  48.  49.  50.  51.  52.  53.  54.  55.]
 [ 56.  57.  58.  59.  60.  61.  62.  63.  64.  65.  66.  67.  68.  69.
   70.  71.  72.  73.  74.  75.  76.  77.  78.  79.  80.  81.  82.  83.]
 [ 84.  85.  86.  87.  88.  89.  90.  91.  92.  93.  94.  95.  96.  97.
   98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111.]
 [112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125.
  126. 127. 128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139.]
 [140. 141. 142. 143. 144. 145. 146. 147. 148. 149. 150. 151. 152. 153.
  154. 155. 156. 157. 158. 159. 160. 161. 162. 163. 164. 165. 166. 167.]
 [168. 169. 170. 171. 172. 173. 174. 175. 176. 177. 178. 179. 180. 181.
  182. 183. 184. 185. 186. 187. 188. 189. 190. 191. 192. 193. 194. 195.]
 [196. 197. 198. 199. 200. 201. 202. 203. 204. 205. 206. 207. 208. 209.
  210. 211. 212. 213. 214. 215. 216. 217. 218. 219. 220. 221. 222. 223.]
 [224. 225. 226. 227. 228. 229. 230. 231. 232. 233. 234. 235. 236. 237.
  238. 239. 240. 241. 242. 243. 244. 245. 246. 247. 248. 249. 250. 251.]
 [252. 253. 254. 255. 256. 257. 258. 259. 260. 261. 262. 263. 264. 265.
  266. 267. 268. 269. 270. 271. 272. 273. 274. 275. 276. 277. 278. 279.]
 [280. 281. 282. 283. 284. 285. 286. 287. 288. 289. 290. 291. 292. 293.
  294. 295. 296. 297. 298. 299. 300. 301. 302. 303. 304. 305. 306. 307.]
 [308. 309. 310. 311. 312. 313. 314. 315. 316. 317. 318. 319. 320. 321.
  322. 323. 324. 325. 326. 327. 328. 329. 330. 331. 332. 333. 334. 335.]
 [336. 337. 338. 339. 340. 341. 342. 343. 344. 345. 346. 347. 348. 349.
  350. 351. 352. 353. 354. 355. 356. 357. 358. 359. 360. 361. 362. 363.]
 [364. 365. 366. 367. 368. 369. 370. 371. 372. 373. 374. 375. 376. 377.
  378. 379. 380. 381. 382. 383. 384. 385. 386. 387. 388. 389. 390. 391.]
 [392. 393. 394. 395. 396. 397. 398. 399. 400. 401. 402. 403. 404. 405.
  406. 407. 408. 409. 410. 411. 412. 413. 414. 415. 416. 417. 418. 419.]
 [420. 421. 422. 423. 424. 425. 426. 427. 428. 429. 430. 431. 432. 433.
  434. 435. 436. 437. 438. 439. 440. 441. 442. 443. 444. 445. 446. 447.]
 [448. 449. 450. 451. 452. 453. 454. 455. 456. 457. 458. 459. 460. 461.
  462. 463. 464. 465. 466. 467. 468. 469. 470. 471. 472. 473. 474. 475.]
 [476. 477. 478. 479. 480. 481. 482. 483. 484. 485. 486. 487. 488. 489.
  490. 491. 492. 493. 494. 495. 496. 497. 498. 499. 500. 501. 502. 503.]
 [504. 505. 506. 507. 508. 509. 510. 511. 512. 513. 514. 515. 516. 517.
  518. 519. 520. 521. 522. 523. 524. 525. 526. 527. 528. 529. 530. 531.]
 [532. 533. 534. 535. 536. 537. 538. 539. 540. 541. 542. 543. 544. 545.
  546. 547. 548. 549. 550. 551. 552. 553. 554. 555. 556. 557. 558. 559.]]

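The conversion process is expressed by the following NumPy code, which reuses tensor_a from the example above:

M0 = 16                        # fractal height
N0 = 16                        # fractal width
M1 = (20 + M0 - 1) // M0       # number of fractals along M (rounded up)
N1 = (28 + N0 - 1) // N0       # number of fractals along N (rounded up)
pad_m = M1 * M0 - 20
pad_n = N1 * N0 - 28
# Zero-pad both axes up to a whole number of fractals.
tensor_b = np.pad(tensor_a, [[0, pad_m], [0, pad_n]])
# Split each axis into (fractal index, offset within the fractal).
tensor_b = tensor_b.reshape((M1, M0, N1, N0))
# Reorder to (N1, M1, M0, N0): fractals column-major, elements row-major.
tensor_b = tensor_b.transpose((2, 0, 1, 3))
print(tensor_b)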

The following figure shows the conversion process.

After the conversion, the tensor is printed as follows:

[[[[  0.   1.   2. ...  13.  14.  15.]
   [ 28.  29.  30. ...  41.  42.  43.]
   [ 56.  57.  58. ...  69.  70.  71.]
   ...
   [364. 365. 366. ... 377. 378. 379.]
   [392. 393. 394. ... 405. 406. 407.]
   [420. 421. 422. ... 433. 434. 435.]]

  [[448. 449. 450. ... 461. 462. 463.]
   [476. 477. 478. ... 489. 490. 491.]
   [504. 505. 506. ... 517. 518. 519.]
   ...
   [  0.   0.   0. ...   0.   0.   0.]
   [  0.   0.   0. ...   0.   0.   0.]
   [  0.   0.   0. ...   0.   0.   0.]]]


 [[[ 16.  17.  18. ...   0.   0.   0.]
   [ 44.  45.  46. ...   0.   0.   0.]
   [ 72.  73.  74. ...   0.   0.   0.]
   ...
   [380. 381. 382. ...   0.   0.   0.]
   [408. 409. 410. ...   0.   0.   0.]
   [436. 437. 438. ...   0.   0.   0.]]

  [[464. 465. 466. ...   0.   0.   0.]
   [492. 493. 494. ...   0.   0.   0.]
   [520. 521. 522. ...   0.   0.   0.]
   ...
   [  0.   0.   0. ...   0.   0.   0.]
   [  0.   0.   0. ...   0.   0.   0.]
   [  0.   0.   0. ...   0.   0.   0.]]]]
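The same steps can be wrapped in a reusable function that also handles leading batch dimensions, together with the inverse conversion. This is a minimal sketch of the formula above, assuming zero padding and a 16 x 16 fractal; the function names nd_to_nz and nz_to_nd are illustrative and are not part of any API.

```python
import numpy as np


def nd_to_nz(x, m0=16, n0=16):
    """(..., M, N) -> (..., N1, M1, M0, N0), zero-padding M and N as needed."""
    *batch, m, n = x.shape
    nb = len(batch)
    m1, n1 = -(-m // m0), -(-n // n0)  # ceiling division
    pad = [(0, 0)] * nb + [(0, m1 * m0 - m), (0, n1 * n0 - n)]
    x = np.pad(x, pad).reshape(*batch, m1, m0, n1, n0)
    # Move N1 in front of M1; keep the (M0, N0) fractal row-major.
    return x.transpose(*range(nb), nb + 2, nb, nb + 1, nb + 3)


def nz_to_nd(x, m, n):
    """Inverse of nd_to_nz: (..., N1, M1, M0, N0) -> (..., m, n)."""
    *batch, n1, m1, m0, n0 = x.shape
    nb = len(batch)
    # Back to (..., M1, M0, N1, N0), then merge the axes and drop the padding.
    x = x.transpose(*range(nb), nb + 1, nb + 2, nb, nb + 3)
    return x.reshape(*batch, m1 * m0, n1 * n0)[..., :m, :n]


tensor_a = np.arange(20 * 28, dtype="float16").reshape(20, 28)
tensor_nz = nd_to_nz(tensor_a)
print(tensor_nz.shape)  # (2, 2, 16, 16)
print(np.array_equal(nz_to_nd(tensor_nz, 20, 28), tensor_a))  # True
```

The round trip recovers the original tensor because padding only appends zeros, which the final slice removes.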

FRACTAL_ZZ/ZZ
The FRACTAL_ZZ (ZZ for short) format is obtained by performing padding, reshaping, and transposing operations on the two lowest dimensions of a tensor (where higher-order dimensions are on the left and lower-order dimensions on the right). The following shows the conversion process:

A matrix of size (M, K) is divided into M1 x K1 fractals, which are arranged in row-major order, forming a Z-shaped pattern. Each fractal contains M0 x K0 elements, which are also arranged in row-major order, forming a Z-shaped pattern. That is why this data format is called the ZZ format. (M0, K0) indicates the size of a fractal. The fractal shape is 16 x (32B/sizeof(Datatype)), with a size of 512 bytes.

The conversion process is expressed by the following formula:
```
(..., B, M, K)->pad->(..., B, M1 * M0, K1 * K0)->reshape->(..., B, M1, M0, K1, K0)->transpose->(..., B, M1, K1, M0, K0)
```
The values of M0 and K0 vary depending on the data type.
- For a data type with 4-bit width, M0 = 16 and K0 = 64.
- For a data type with 8-bit width, M0 = 16 and K0 = 32.
- For a data type with 16-bit width, M0 = 16 and K0 = 16.
- For a data type with 32-bit width, M0 = 16 and K0 = 8.
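As a sketch of the formula above, the ND-to-ZZ conversion can be written in NumPy as follows. It assumes zero padding and derives K0 from the element size as 32 bytes / sizeof(Datatype); NumPy has no 4-bit data type, so only the 8-, 16-, and 32-bit cases are covered. The function name nd_to_zz is illustrative only.

```python
import numpy as np


def nd_to_zz(x, m0=16):
    """(..., M, K) -> (..., M1, K1, M0, K0), zero-padding M and K as needed."""
    k0 = 32 // x.dtype.itemsize  # each fractal row holds 32 bytes
    *batch, m, k = x.shape
    nb = len(batch)
    m1, k1 = -(-m // m0), -(-k // k0)  # ceiling division
    pad = [(0, 0)] * nb + [(0, m1 * m0 - m), (0, k1 * k0 - k)]
    x = np.pad(x, pad).reshape(*batch, m1, m0, k1, k0)
    # Fractals stay row-major (M1, K1); each fractal is row-major (M0, K0).
    return x.transpose(*range(nb), nb, nb + 2, nb + 1, nb + 3)


a_half = np.arange(20 * 28, dtype="float16").reshape(20, 28)
print(nd_to_zz(a_half).shape)                               # (2, 2, 16, 16)
print(nd_to_zz(np.zeros((20, 40), dtype="int8")).shape)     # (2, 2, 16, 32)
print(nd_to_zz(np.zeros((20, 28), dtype="float32")).shape)  # (2, 4, 16, 8)
```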

FRACTAL_ZN/ZN
The FRACTAL_ZN (ZN for short) format is obtained by performing padding, reshaping, and transposing operations on the two lowest dimensions of a tensor (where higher-order dimensions are on the left and lower-order dimensions on the right). The following shows the conversion process:

A matrix of size (K, N) is divided into K1 x N1 fractals, which are arranged in row-major order, forming a Z-shaped pattern. Each fractal contains K0 x N0 elements, which are arranged in column-major order, forming an N-shaped pattern. That is why this data format is called the ZN format. (K0, N0) indicates the size of a fractal. The fractal shape is (32B/sizeof(Datatype)) x 16, with a size of 512 bytes.

The conversion process is expressed by the following formula:
```
(..., B, K, N)->pad->(..., B, K1 * K0, N1 * N0)->reshape->(..., B, K1, K0, N1, N0)->transpose->(..., B, K1, N1, N0, K0)
```
The values of K0 and N0 vary depending on the data type.
- For a data type with 4-bit width, K0 = 64 and N0 = 16.
- For a data type with 8-bit width, K0 = 32 and N0 = 16.
- For a data type with 16-bit width, K0 = 16 and N0 = 16.
- For a data type with 32-bit width, K0 = 8 and N0 = 16.
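The ND-to-ZN conversion differs from ZZ only in the final transpose, which stores each fractal column-major. The sketch below makes the same assumptions as before (zero padding, K0 = 32 bytes / sizeof(Datatype), no 4-bit types); nd_to_zn is a hypothetical name.

```python
import numpy as np


def nd_to_zn(x, n0=16):
    """(..., K, N) -> (..., K1, N1, N0, K0), zero-padding K and N as needed."""
    k0 = 32 // x.dtype.itemsize  # 32 bytes along K per fractal column
    *batch, k, n = x.shape
    nb = len(batch)
    k1, n1 = -(-k // k0), -(-n // n0)  # ceiling division
    pad = [(0, 0)] * nb + [(0, k1 * k0 - k), (0, n1 * n0 - n)]
    x = np.pad(x, pad).reshape(*batch, k1, k0, n1, n0)
    # Fractals stay row-major (K1, N1); inside a fractal, K0 becomes the
    # innermost axis, so elements of one column are contiguous.
    return x.transpose(*range(nb), nb, nb + 2, nb + 3, nb + 1)


b_half = np.arange(20 * 28, dtype="float16").reshape(20, 28)
b_zn = nd_to_zn(b_half)
print(b_zn.shape)        # (2, 2, 16, 16)
print(b_zn[0, 0, 0, :3])  # first column of the matrix: [ 0. 28. 56.]
```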

Special Formats for Convolution

NC1HWC0
To improve data access efficiency of General Matrix Multiply (GEMM) data blocks, the tensor data on Ascend AI Processor is stored in NC1HWC0, a 5D format. C0, closely related to the micro architecture, is the size of the Cube Unit in the AI Core.

C1 = (C + C0 - 1)/C0, where the division is an integer division whose result is rounded down. This is equivalent to rounding C/C0 up to the next integer.

Steps of NHWC/NCHW -> NC1HWC0 conversion: Tile data into C1 pieces of NHWC0/NC0HW along the C dimension, and arrange them in the memory into NC1HWC0, as shown in the following figure.
- Formula for NHWC -> NC1HWC0 conversion:
```
Tensor.reshape( [N, H, W, C1, C0]).transpose( [0, 3, 1, 2, 4] )
```
- Formula for NCHW -> NC1HWC0 conversion:
```
Tensor.reshape( [N, C1, C0, H, W]).transpose( [0, 1, 3, 4, 2] )
```
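The two formulas can be checked against each other in NumPy. The sketch below assumes C0 = 16 and adds a zero-padding step along C so that the reshape also works when C is not a multiple of C0; that padding is an assumption of this example, not something the formulas above state. The function names are illustrative.

```python
import numpy as np


def nhwc_to_nc1hwc0(x, c0=16):
    n, h, w, c = x.shape
    c1 = (c + c0 - 1) // c0
    # Assumed step: zero-pad C up to C1 * C0 before reshaping.
    x = np.pad(x, [(0, 0), (0, 0), (0, 0), (0, c1 * c0 - c)])
    return x.reshape(n, h, w, c1, c0).transpose(0, 3, 1, 2, 4)


def nchw_to_nc1hwc0(x, c0=16):
    n, c, h, w = x.shape
    c1 = (c + c0 - 1) // c0
    x = np.pad(x, [(0, 0), (0, c1 * c0 - c), (0, 0), (0, 0)])
    return x.reshape(n, c1, c0, h, w).transpose(0, 1, 3, 4, 2)


nhwc = np.arange(2 * 4 * 5 * 20).reshape(2, 4, 5, 20)  # C = 20, so C1 = 2
nchw = nhwc.transpose(0, 3, 1, 2)

from_nhwc = nhwc_to_nc1hwc0(nhwc)
from_nchw = nchw_to_nc1hwc0(nchw)
print(from_nhwc.shape)                       # (2, 2, 4, 5, 16)
print(np.array_equal(from_nhwc, from_nchw))  # True
```

Both source layouts produce the same NC1HWC0 tensor, in which the C0 values of one pixel are contiguous.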

FRACTAL_Z
FRACTAL_Z is a format for convolution weights and is converted from the Filter Matrix. The weights are passed to the Cube Unit as a 4D tensor with shape (C1HW, N1, N0, C0).

The data is tiled into two layers, as shown in the following figure.

The data of the first layer, whose size is related to the cube size, is stored contiguously in column-major order (n format). The data of the second layer, whose size is related to the matrix size, is stored contiguously in row-major order (Z format).

For example, HWCN = (2, 2, 32, 32) can be reshaped into FRACTAL_Z(C1HW, N1, N0, C0) = (8, 2, 16, 16).

HWCN-to-FRACTAL_Z conversion:
```
Tensor.padding([ [0,0], [0,0], [0,(C0-C%C0)%C0], [0,(N0-N%N0)%N0] ]).reshape( [H, W, C1, C0, N1, N0]).transpose( [2, 0, 1, 4, 5, 3] ).reshape( [C1*H*W, N1, N0, C0])
```
NCHW-to-FRACTAL_Z conversion:
```
Tensor.padding([ [0,(N0-N%N0)%N0], [0,(C0-C%C0)%C0], [0,0], [0,0] ]).reshape( [N1, N0, C1, C0, H, W]).transpose( [2, 4, 5, 0, 1, 3] ).reshape( [C1*H*W, N1, N0, C0])
```
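The two conversions above translate directly to NumPy. The following sketch assumes a 16-bit data type (C0 = 16, N0 = 16) and zero padding; the function names are illustrative. It reproduces the HWCN = (2, 2, 32, 32) example and checks that both source layouts give the same result.

```python
import numpy as np


def hwcn_to_fractal_z(x, c0=16, n0=16):
    h, w, c, n = x.shape
    c1, n1 = -(-c // c0), -(-n // n0)  # ceiling division
    x = np.pad(x, [(0, 0), (0, 0), (0, c1 * c0 - c), (0, n1 * n0 - n)])
    x = x.reshape(h, w, c1, c0, n1, n0).transpose(2, 0, 1, 4, 5, 3)
    return x.reshape(c1 * h * w, n1, n0, c0)


def nchw_to_fractal_z(x, c0=16, n0=16):
    n, c, h, w = x.shape
    c1, n1 = -(-c // c0), -(-n // n0)
    x = np.pad(x, [(0, n1 * n0 - n), (0, c1 * c0 - c), (0, 0), (0, 0)])
    x = x.reshape(n1, n0, c1, c0, h, w).transpose(2, 4, 5, 0, 1, 3)
    return x.reshape(c1 * h * w, n1, n0, c0)


hwcn = np.arange(2 * 2 * 32 * 32).reshape(2, 2, 32, 32)
nchw_weights = hwcn.transpose(3, 2, 0, 1)

fz = hwcn_to_fractal_z(hwcn)
print(fz.shape)                                             # (8, 2, 16, 16)
print(np.array_equal(fz, nchw_to_fractal_z(nchw_weights)))  # True
```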

NDC1HWC0
To improve the access efficiency of the data block for matrix multiplication, the NDHWC format is converted to the NDC1HWC0 format. C0, closely related to the micro architecture, is the size of the Cube Unit in the AI Core. C0 is 16 for float16_t or 32 for int8_t. C0 needs to be stored contiguously.

C1 = (C + C0 - 1)/C0, where the division is an integer division whose result is rounded down. This is equivalent to rounding C/C0 up to the next integer.

Steps of NDHWC -> NDC1HWC0 conversion: Tile data into C1 pieces of NDHWC0 along the C dimension, and arrange them in the memory into NDC1HWC0, as shown in the following figure.

FRACTAL_Z_3D
FRACTAL_Z_3D is a format to define 3D convolution weights. For example, the Conv3D operator uses this format to represent the weights of 3D convolution.

The NDHWC -> FRACTAL_Z_3D conversion process is expressed by the following formula:
```
(..., N, D, H, W, C)->pad->(..., N1 * N0, D, H, W, C1 * C0)->reshape->(..., N1, N0, D, H, W, C1, C0)->transpose->(..., D, C1, H, W, N1, N0, C0)->reshape->(..., D * C1 * H * W, N1, N0, C0)
```
The values of C0 and N0 vary depending on the data type.
- For a data type with 4-bit width, C0 = 64 and N0 = 16.
- For a data type with 8-bit width, C0 = 32 and N0 = 16.
- For a data type with 16-bit width, C0 = 16 and N0 = 16.
- For a data type with 32-bit width, C0 = 8 and N0 = 16.

For example, consider an input tensor in NDHWC format with shape (48, 2, 2, 2, 32).

After the conversion, the FRACTAL_Z_3D format is as follows:
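The conversion formula can be sketched in NumPy as shown below. The sketch assumes a 16-bit data type (C0 = 16, N0 = 16), zero padding, and no leading dimensions before N; the function name ndhwc_to_fractal_z_3d is illustrative only. For the (48, 2, 2, 2, 32) input, N1 = 3 and C1 = 2, so the result has shape (D * C1 * H * W, N1, N0, C0) = (16, 3, 16, 16).

```python
import numpy as np


def ndhwc_to_fractal_z_3d(x, c0=16, n0=16):
    n, d, h, w, c = x.shape
    n1, c1 = -(-n // n0), -(-c // c0)  # ceiling division
    x = np.pad(x, [(0, n1 * n0 - n), (0, 0), (0, 0), (0, 0), (0, c1 * c0 - c)])
    x = x.reshape(n1, n0, d, h, w, c1, c0)
    # (N1, N0, D, H, W, C1, C0) -> (D, C1, H, W, N1, N0, C0)
    x = x.transpose(2, 5, 3, 4, 0, 1, 6)
    return x.reshape(d * c1 * h * w, n1, n0, c0)


weights = np.arange(48 * 2 * 2 * 2 * 32, dtype="int16").reshape(48, 2, 2, 2, 32)
fz3d = ndhwc_to_fractal_z_3d(weights)
print(fz3d.shape)  # (16, 3, 16, 16)
```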

Formats Relating to Matmul High-Level APIs

BSH/SBH: B indicates the batch size, S indicates the sequence length, and H = N × D, where N is the number of heads and D is the head size. BSH/SBH is usually used for matrix multiplication (Matmul). The following figures show these two data layout formats.
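The relationship between BSH, SBH, and the head split H = N × D can be sketched in NumPy. The sizes below are arbitrary, and the sketch assumes that H is laid out as N consecutive groups of D elements; that ordering is an assumption of this example.

```python
import numpy as np

# Hypothetical sizes: batch 2, sequence length 4, 3 heads of size 5.
b, s, n, d = 2, 4, 3, 5
h = n * d

bsh = np.arange(b * s * h).reshape(b, s, h)

# SBH holds the same data with the sequence axis outermost.
sbh = np.ascontiguousarray(bsh.transpose(1, 0, 2))

# Splitting H exposes the per-head data (assumed N-major ordering within H).
bsnd = bsh.reshape(b, s, n, d)

print(bsh.shape, sbh.shape, bsnd.shape)  # (2, 4, 15) (4, 2, 15) (2, 4, 3, 5)
```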

BMNK: It is a general data format. B indicates the batch size. M, N, and K indicate the dimensions of the matrix multiplication [M, K] × [K, N]. The following figure shows this data layout format.

BSNGD: It is the reshaped form of the original BSH shape. S and D indicate the M axis (or N axis) and K axis of matrix multiplication for a single batch. Each SD block represents the computation data of a batch. This format is usually used for matrix multiplication (Matmul). The following figure shows this data layout format.

SBNGD: It is the reshaped form of the original SBH shape. S and D indicate the M axis (or N axis) and K axis of matrix multiplication for a single batch. Each SD block represents the computation data of a batch. This format is usually used for matrix multiplication (Matmul). The following figure shows this data layout format.

BNGS1S2: It is usually the matrix multiplication output of the two preceding data layout formats (BSNGD and SBNGD). The S1S2 data is stored contiguously and each S1S2 block corresponds to the computation data of a batch. This format is usually used for matrix multiplication (Matmul). The following figure shows this data layout format.

ND_ALIGN: It is a transformed data format derived from the ND data format. When the result matrix C of matrix multiplication is output, ND_ALIGN is used to configure the layout of matrix C based on the 32-byte alignment rule along the N direction.
The following figure shows how ND is transformed to ND_ALIGN. Assume that the result matrix C has data type int32_t and is output to VECOUT. If the original matrix is not 32-byte aligned in the N direction, ND_ALIGN adds zeros to the end of the matrix to ensure 32-byte alignment.
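The alignment rule can be illustrated with a short NumPy sketch. It assumes that the zeros are appended at the end of each row along the N direction; the function name nd_to_nd_align is hypothetical. For int32_t, 32 bytes correspond to 8 elements, so N is rounded up to a multiple of 8.

```python
import numpy as np


def nd_to_nd_align(c, align_bytes=32):
    """Zero-pad the last (N) axis so that each row occupies a multiple of align_bytes."""
    elems = align_bytes // c.dtype.itemsize  # elements per 32-byte unit
    pad_n = -c.shape[-1] % elems             # elements missing to reach alignment
    pad = [(0, 0)] * (c.ndim - 1) + [(0, pad_n)]
    return np.pad(c, pad)


c = np.arange(3 * 10, dtype="int32").reshape(3, 10)  # N = 10 is 40 bytes per row
c_aligned = nd_to_nd_align(c)
print(c_aligned.shape)  # (3, 16)
print(c_aligned[0])     # first row followed by six zeros
```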
VECTOR: It is a data format used in the General Matrix-Vector Multiply (GEMV) scenario. When a matrix is configured to use the VECTOR data layout format, the input data is represented as a vector.

Figure 2 Input matrix A in VECTOR format in the GEMV scenario
