Data Layout Formats

In deep learning, the data layout format describes how a multi-dimensional tensor is stored in memory.

Common data formats include ND, NHWC, and NCHW. NHWC and NCHW assign a specific meaning to each axis of a tensor.

In addition to the NHWC and NCHW formats, there are some special proprietary data formats, such as FRACTAL_NZ (also called NZ), NC1HWC0, FRACTAL_Z, NDC1HWC0, and FRACTAL_Z_3D. These formats are introduced to meet the high-performance computing requirements of the Cube Unit in the AI Core. By optimizing the memory layout, these formats can improve the computing efficiency. You can see the specific application of these formats when developing related operators using matrix multiplication and convolution APIs.

Common formats

  • ND, NHWC, and NCHW

    The data layout format was originally used to describe how images are stored in memory. Common formats include ND, NHWC, and NCHW. ND is the general case: any tensor is N-dimensional. NHWC and NCHW assign a specific meaning to each axis of a four-dimensional tensor, such as height, width, and number of channels.

    The main difference between NHWC and NCHW is the position of the channel dimension.

    • In NHWC format, the channel dimension is located at the last position.
    • In NCHW format, the channel dimension is located before the height and width.

    The meaning of each axis is described as follows:

    • N: Batch size, that is, the number of images.
    • H: Height, that is, the number of pixels in the vertical direction.
    • W: Width, that is, the number of pixels in the horizontal direction.
    • C: Number of channels, for example, 3 for an RGB image.

    As shown in Figure 1, for an RGB image, the pixel values of each channel are clustered in sequence as RRRRRRGGGGGGBBBBBB with the NCHW layout. However, with the NHWC layout, the pixel values are interleaved as RGBRGBRGBRGBRGBRGB.

    Figure 1 Example of NCHW and NHWC storage

    Although the stored data is the same, data access characteristics vary with the storage order. As a result, compute performance can differ even for the same operation.
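
    The difference between the two layouts can be reproduced with a minimal NumPy sketch. The 2 x 3 image size and the variable names are assumptions chosen only to match the 18-element strings above.

```python
import numpy as np

# One 2x3 RGB image; each element stores its channel letter so the order is visible.
n, c, h, w = 1, 3, 2, 3
nchw = np.empty((n, c, h, w), dtype="<U1")
for ci, name in enumerate("RGB"):
    nchw[:, ci, :, :] = name

# NHWC holds the same data with the channel axis moved to the last position.
nhwc = np.ascontiguousarray(nchw.transpose(0, 2, 3, 1))

# Flattening in row-major order gives the order of the elements in memory.
nchw_flat = "".join(nchw.reshape(-1))
nhwc_flat = "".join(nhwc.reshape(-1))
print(nchw_flat)  # RRRRRRGGGGGGBBBBBB
print(nhwc_flat)  # RGBRGBRGBRGBRGBRGB
```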

  • NDHWC and NCDHW

    NDHWC and NCDHW are formats for five-dimensional tensors. Compared with NHWC and NCHW, they have one more dimension, D. D indicates the feature depth, which represents the extension of data in the depth direction, such as the time steps of a video or the depth layers of a medical image. These formats therefore facilitate convolution operations along the time or depth dimension. The following figure shows the data format of NDHWC.
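
    As an illustration, the following sketch computes the linear memory offset of an element in a contiguous NDHWC tensor and converts the tensor to NCDHW. The function name and the tensor sizes are hypothetical.

```python
import numpy as np

def ndhwc_offset(idx, shape):
    """Linear offset of element (n, d, h, w, c) in a contiguous NDHWC tensor."""
    n, d, h, w, c = idx
    _, D, H, W, C = shape
    return (((n * D + d) * H + h) * W + w) * C + c

shape = (2, 4, 3, 5, 8)  # N, D, H, W, C (hypothetical sizes)
# Each element stores its own linear offset, so the formula can be checked directly.
ndhwc = np.arange(np.prod(shape)).reshape(shape)

# NCDHW holds the same values with the channel axis moved before D, H, and W.
ncdhw = np.ascontiguousarray(ndhwc.transpose(0, 4, 1, 2, 3))
```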

Special formats related to matrix multiplication

When the Mmad basic API is used to perform matrix multiplication, the input and output data formats of the matrices must meet certain requirements. As shown in the following figure, matrix A (in the L0A buffer) must be in FRACTAL_ZZ format, matrix B (in the L0B buffer) must be in FRACTAL_ZN format, and matrix C (in the L0C buffer) must be in FRACTAL_NZ format. These formats divide a matrix into fractals (Fractal Matrix), which adapts to the hardware feature that the Cube Unit reads (16, 16) x (16, 16) data for computation each time (using the half data type as an example), thereby improving the matrix computation efficiency. The size of a fractal is related to the data type and the storage location. For details, see the following description.

  • FRACTAL_NZ/NZ

    The FRACTAL_NZ format is obtained by performing padding, reshaping, and transposing operations on the two lowest dimensions of a tensor. (All dimensions of a tensor are considered, with the rightmost being the lowest dimension and the leftmost being the highest dimension.) The conversion process is as follows:

    A matrix of size (M, N) is divided into M1 x N1 fractals, which are arranged in column-major order, forming an N-shaped pattern. Each fractal contains M0 x N0 elements, which are arranged in row-major order, forming a Z-shaped pattern. Therefore, this data format is called the NZ format. (M0, N0) indicates the size of a fractal.

    The formula is as follows:

    (..., B, M, N)->pad->(..., B, M1 * M0, N1 * N0)->reshape->(..., B, M1, M0, N1, N0)->transpose->(..., B, N1, M1, M0, N0)

    The NZ format is used differently in the L0C Buffer and the L1 Buffer:

    • In the L0C Buffer, the NZ format is used to store the result of matrix multiplication. Its fractal shape is 16 x 16, containing 256 elements. This structure is very suitable for the Cube Unit to perform efficient matrix multiplication.
    • In the L1 Buffer, the NZ format is used so that the data can be easily converted to the corresponding ZZ and ZN formats when being moved to the L0A Buffer and L0B Buffer. In this case, the fractal shape is 16 x (32B / sizeof(Datatype)), and the size is 512 bytes.

    Therefore, when data is moved from the L0C Buffer to the L1 Buffer, the fractal size may change.
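
    The following sketch works through the fractal sizes stated above for two data types. It is plain arithmetic on the stated shapes, not a hardware API, and the function name is hypothetical.

```python
import numpy as np

def l1_fractal_shape(dtype):
    """Fractal shape 16 x (32 B / sizeof(dtype)) used for NZ data in the L1 Buffer."""
    return 16, 32 // np.dtype(dtype).itemsize

for dtype in (np.float16, np.float32):
    itemsize = np.dtype(dtype).itemsize
    rows, cols = l1_fractal_shape(dtype)
    l1_bytes = rows * cols * itemsize   # always 512 bytes
    l0c_bytes = 16 * 16 * itemsize      # 16 x 16 fractal in the L0C Buffer
    print(np.dtype(dtype).name, (rows, cols), l1_bytes, l0c_bytes)
# float16 (16, 16) 512 512
# float32 (16, 8) 512 1024
```

    For float32, the 16 x 16 fractal in the L0C Buffer holds twice as many bytes as the 16 x 8 fractal in the L1 Buffer, which is why the fractal size may change during the move.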

    The following example describes how to convert the ND format to the NZ format.

    The shape of the original tensor is (20, 28).

    import numpy as np

    # Fill a (20, 28) float16 tensor with the values 0 to 559.
    tensor_a = np.arange(20 * 28, dtype="float16").reshape((20, 28))
    print(tensor_a)
    

    The original tensor data is printed as follows:

    [[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
       14.  15.  16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.]
     [ 28.  29.  30.  31.  32.  33.  34.  35.  36.  37.  38.  39.  40.  41.
       42.  43.  44.  45.  46.  47.  48.  49.  50.  51.  52.  53.  54.  55.]
     [ 56.  57.  58.  59.  60.  61.  62.  63.  64.  65.  66.  67.  68.  69.
       70.  71.  72.  73.  74.  75.  76.  77.  78.  79.  80.  81.  82.  83.]
     [ 84.  85.  86.  87.  88.  89.  90.  91.  92.  93.  94.  95.  96.  97.
       98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111.]
     [112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125.
      126. 127. 128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139.]
     [140. 141. 142. 143. 144. 145. 146. 147. 148. 149. 150. 151. 152. 153.
      154. 155. 156. 157. 158. 159. 160. 161. 162. 163. 164. 165. 166. 167.]
     [168. 169. 170. 171. 172. 173. 174. 175. 176. 177. 178. 179. 180. 181.
      182. 183. 184. 185. 186. 187. 188. 189. 190. 191. 192. 193. 194. 195.]
     [196. 197. 198. 199. 200. 201. 202. 203. 204. 205. 206. 207. 208. 209.
      210. 211. 212. 213. 214. 215. 216. 217. 218. 219. 220. 221. 222. 223.]
     [224. 225. 226. 227. 228. 229. 230. 231. 232. 233. 234. 235. 236. 237.
      238. 239. 240. 241. 242. 243. 244. 245. 246. 247. 248. 249. 250. 251.]
     [252. 253. 254. 255. 256. 257. 258. 259. 260. 261. 262. 263. 264. 265.
      266. 267. 268. 269. 270. 271. 272. 273. 274. 275. 276. 277. 278. 279.]
     [280. 281. 282. 283. 284. 285. 286. 287. 288. 289. 290. 291. 292. 293.
      294. 295. 296. 297. 298. 299. 300. 301. 302. 303. 304. 305. 306. 307.]
     [308. 309. 310. 311. 312. 313. 314. 315. 316. 317. 318. 319. 320. 321.
      322. 323. 324. 325. 326. 327. 328. 329. 330. 331. 332. 333. 334. 335.]
     [336. 337. 338. 339. 340. 341. 342. 343. 344. 345. 346. 347. 348. 349.
      350. 351. 352. 353. 354. 355. 356. 357. 358. 359. 360. 361. 362. 363.]
     [364. 365. 366. 367. 368. 369. 370. 371. 372. 373. 374. 375. 376. 377.
      378. 379. 380. 381. 382. 383. 384. 385. 386. 387. 388. 389. 390. 391.]
     [392. 393. 394. 395. 396. 397. 398. 399. 400. 401. 402. 403. 404. 405.
      406. 407. 408. 409. 410. 411. 412. 413. 414. 415. 416. 417. 418. 419.]
     [420. 421. 422. 423. 424. 425. 426. 427. 428. 429. 430. 431. 432. 433.
      434. 435. 436. 437. 438. 439. 440. 441. 442. 443. 444. 445. 446. 447.]
     [448. 449. 450. 451. 452. 453. 454. 455. 456. 457. 458. 459. 460. 461.
      462. 463. 464. 465. 466. 467. 468. 469. 470. 471. 472. 473. 474. 475.]
     [476. 477. 478. 479. 480. 481. 482. 483. 484. 485. 486. 487. 488. 489.
      490. 491. 492. 493. 494. 495. 496. 497. 498. 499. 500. 501. 502. 503.]
     [504. 505. 506. 507. 508. 509. 510. 511. 512. 513. 514. 515. 516. 517.
      518. 519. 520. 521. 522. 523. 524. 525. 526. 527. 528. 529. 530. 531.]
     [532. 533. 534. 535. 536. 537. 538. 539. 540. 541. 542. 543. 544. 545.
      546. 547. 548. 549. 550. 551. 552. 553. 554. 555. 556. 557. 558. 559.]]
    

    The conversion process can be expressed in NumPy as follows, continuing from tensor_a above:

    N0 = 16
    N1 = (28 + N0 - 1) // N0   # number of fractals along N, rounded up
    pad_n = N1 * N0 - 28
    M0 = 16
    M1 = (20 + M0 - 1) // M0   # number of fractals along M, rounded up
    pad_m = M1 * M0 - 20
    tensor_b = np.pad(tensor_a, [[0, pad_m], [0, pad_n]])   # zero-pad to (32, 32)
    tensor_b = tensor_b.reshape((M1, M0, N1, N0))
    tensor_b = tensor_b.transpose((2, 0, 1, 3))             # (N1, M1, M0, N0)
    print(tensor_b)

    The following figure shows the conversion process.

    After the conversion, the tensor is printed as follows:

    [[[[  0.   1.   2. ...  13.  14.  15.]
       [ 28.  29.  30. ...  41.  42.  43.]
       [ 56.  57.  58. ...  69.  70.  71.]
       ...
       [364. 365. 366. ... 377. 378. 379.]
       [392. 393. 394. ... 405. 406. 407.]
       [420. 421. 422. ... 433. 434. 435.]]
    
      [[448. 449. 450. ... 461. 462. 463.]
       [476. 477. 478. ... 489. 490. 491.]
       [504. 505. 506. ... 517. 518. 519.]
       ...
       [  0.   0.   0. ...   0.   0.   0.]
       [  0.   0.   0. ...   0.   0.   0.]
       [  0.   0.   0. ...   0.   0.   0.]]]
    
    
     [[[ 16.  17.  18. ...   0.   0.   0.]
       [ 44.  45.  46. ...   0.   0.   0.]
       [ 72.  73.  74. ...   0.   0.   0.]
       ...
       [380. 381. 382. ...   0.   0.   0.]
       [408. 409. 410. ...   0.   0.   0.]
       [436. 437. 438. ...   0.   0.   0.]]
    
      [[464. 465. 466. ...   0.   0.   0.]
       [492. 493. 494. ...   0.   0.   0.]
       [520. 521. 522. ...   0.   0.   0.]
       ...
       [  0.   0.   0. ...   0.   0.   0.]
       [  0.   0.   0. ...   0.   0.   0.]
       [  0.   0.   0. ...   0.   0.   0.]]]]
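
    The same conversion can be generalized to tensors with leading batch dimensions and checked with an inverse conversion. This is a sketch only: the function names nd_to_nz and nz_to_nd are hypothetical helpers, not part of any API.

```python
import numpy as np

def nd_to_nz(x, m0=16, n0=16):
    """(..., M, N) -> (..., N1, M1, M0, N0) by padding, reshaping, and transposing."""
    m, n = x.shape[-2:]
    m1, n1 = -(-m // m0), -(-n // n0)  # ceiling division
    pad = [(0, 0)] * (x.ndim - 2) + [(0, m1 * m0 - m), (0, n1 * n0 - n)]
    x = np.pad(x, pad)
    x = x.reshape(x.shape[:-2] + (m1, m0, n1, n0))
    return np.moveaxis(x, -2, -4)      # move N1 in front of M1

def nz_to_nd(x, m, n):
    """Inverse of nd_to_nz: restore (..., M, N) and drop the zero padding."""
    n1, m1, m0, n0 = x.shape[-4:]
    x = np.moveaxis(x, -4, -2)         # back to (..., M1, M0, N1, N0)
    x = x.reshape(x.shape[:-4] + (m1 * m0, n1 * n0))
    return x[..., :m, :n]

a = np.arange(20 * 28, dtype=np.float16).reshape(20, 28)
nz = nd_to_nz(a)
print(nz.shape)  # (2, 2, 16, 16)
```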
  • FRACTAL_ZZ/ZZ

    FRACTAL_ZZ is a format obtained by performing padding, reshaping, and transposing operations on the two lowest dimensions of a tensor. (All dimensions of a tensor are considered, with the rightmost being the lowest dimension and the leftmost being the highest dimension.) The conversion process is as follows:

    A matrix of size (M, K) is divided into M1 x K1 fractals, which are arranged in row-major order, and the shape is like a Z. Each fractal contains M0 x K0 elements, which are also arranged in row-major order, and the shape is like a Z. Therefore, this data format is called ZZ format. (M0, K0) indicates the size of a fractal. The fractal shape is 16 x (32B/sizeof(Datatype)), and the size is 512 bytes.

    The conversion process is expressed by the following formula:

    (..., B, M, K)->pad->(..., B, M1 * M0, K1 * K0)->reshape->(..., B, M1, M0, K1, K0)->transpose->(..., B, M1, K1, M0, K0)

    The values of M0 and K0 vary with the data type:

    • For the data type with 4-bit width, M0 = 16 and K0 = 64.
    • For the data type with 8-bit width, M0 = 16 and K0 = 32.
    • For the data type with 16-bit width, M0 = 16 and K0 = 16.
    • For the data type with 32-bit width, M0 = 16 and K0 = 8.
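
    The formula above can be sketched in NumPy as follows. The helper name nd_to_zz is hypothetical, and K0 is derived from the element size as 32 B / sizeof(Datatype). NumPy has no 4-bit data type, so K0 would have to be passed explicitly in that case.

```python
import numpy as np

def nd_to_zz(x, k0=None):
    """(..., M, K) -> (..., M1, K1, M0, K0), with M0 = 16 and K0 = 32 B / sizeof(dtype)."""
    m0 = 16
    if k0 is None:
        k0 = 32 // x.dtype.itemsize
    m, k = x.shape[-2:]
    m1, k1 = -(-m // m0), -(-k // k0)  # ceiling division
    pad = [(0, 0)] * (x.ndim - 2) + [(0, m1 * m0 - m), (0, k1 * k0 - k)]
    x = np.pad(x, pad)
    x = x.reshape(x.shape[:-2] + (m1, m0, k1, k0))
    return np.swapaxes(x, -3, -2)      # swap M0 and K1

a = np.arange(20 * 28, dtype=np.float16).reshape(20, 28)
zz = nd_to_zz(a)
print(zz.shape)  # (2, 2, 16, 16)
```

    Unlike the NZ format, the fractals themselves are ordered row by row: zz[0, 1] holds columns 16 to 31 of the first 16 rows.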
  • FRACTAL_ZN/ZN

    FRACTAL_ZN is a format obtained by performing padding, reshaping, and transposing operations on the two lowest dimensions of a tensor. (All dimensions of a tensor are considered, with the rightmost being the lowest dimension and the leftmost being the highest dimension.) The conversion process is as follows:

    A matrix of size (K, N) is divided into K1 x N1 fractals, which are arranged in row-major order, and the shape is like a Z. Each fractal contains K0 x N0 elements, which are arranged in column-major order, and the shape is like an N. Therefore, this data format is called ZN format. (K0, N0) indicates the size of a fractal. The fractal shape is (32B/sizeof(Datatype)) x 16, and the size is 512 bytes.

    The conversion process is expressed by the following formula:

    (..., B, K, N)->pad->(..., B, K1 * K0, N1 * N0)->reshape->(..., B, K1, K0, N1, N0)->transpose->(..., B, K1, N1, N0, K0)

    The values of K0 and N0 vary according to the data type.

    • For the data type with 4-bit width, K0 = 64 and N0 = 16.
    • For the data type with 8-bit width, K0 = 32 and N0 = 16.
    • For the data type with 16-bit width, K0 = 16 and N0 = 16.
    • For the data type with 32-bit width, K0 = 8 and N0 = 16.
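
    A NumPy sketch of the formula above follows. The helper name nd_to_zn is hypothetical, and K0 is derived from the element size as 32 B / sizeof(Datatype).

```python
import numpy as np

def nd_to_zn(x, k0=None):
    """(..., K, N) -> (..., K1, N1, N0, K0), with N0 = 16 and K0 = 32 B / sizeof(dtype)."""
    n0 = 16
    if k0 is None:
        k0 = 32 // x.dtype.itemsize
    k, n = x.shape[-2:]
    k1, n1 = -(-k // k0), -(-n // n0)  # ceiling division
    pad = [(0, 0)] * (x.ndim - 2) + [(0, k1 * k0 - k), (0, n1 * n0 - n)]
    x = np.pad(x, pad)
    x = x.reshape(x.shape[:-2] + (k1, k0, n1, n0))
    return np.moveaxis(x, -3, -1)      # move K0 to the innermost position

b = np.arange(20 * 28, dtype=np.float16).reshape(20, 28)
zn = nd_to_zn(b)
print(zn.shape)  # (2, 2, 16, 16)
```

    Because K0 becomes the innermost axis, each fractal is the transpose of the corresponding 16 x 16 block of the original matrix, which is the column-major (N-shaped) order described above.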

Convolution-related special formats

  • NC1HWC0

    To improve data access efficiency of General Matrix Multiply (GEMM) data blocks, the tensor data on Ascend AI Processor is stored in NC1HWC0, a 5D format. C0, closely related to the micro architecture, is the size of the Cube Unit in the AI Core.

    C1 = (C + C0 - 1) / C0, where the division is rounded down to an integer. This is equivalent to rounding C / C0 up.

    Steps of NHWC/NCHW -> NC1HWC0 conversion: Tile data into C1 pieces of NHWC0/NC0HW along the C dimension, and arrange them in the memory into NC1HWC0, as shown in the following figure.

    • Formula for NHWC -> NC1HWC0 conversion:
      Tensor.reshape( [N, H, W, C1, C0]).transpose( [0, 3, 1, 2, 4] )
    • Formula for NCHW -> NC1HWC0 conversion:
      Tensor.reshape( [N, C1, C0, H, W]).transpose( [0, 1, 3, 4, 2] )
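
    The two formulas can be sketched in NumPy as follows. The function names are hypothetical, C0 = 16 is assumed, and zero-padding the C dimension up to C1 * C0 is an assumption added so that the reshape also works when C is not a multiple of C0.

```python
import numpy as np

def nhwc_to_nc1hwc0(x, c0=16):
    n, h, w, c = x.shape
    c1 = (c + c0 - 1) // c0
    # Assumption: pad the channel axis with zeros up to C1 * C0.
    x = np.pad(x, [(0, 0), (0, 0), (0, 0), (0, c1 * c0 - c)])
    return x.reshape(n, h, w, c1, c0).transpose(0, 3, 1, 2, 4)

def nchw_to_nc1hwc0(x, c0=16):
    n, c, h, w = x.shape
    c1 = (c + c0 - 1) // c0
    x = np.pad(x, [(0, 0), (0, c1 * c0 - c), (0, 0), (0, 0)])
    return x.reshape(n, c1, c0, h, w).transpose(0, 1, 3, 4, 2)

nchw = np.arange(2 * 20 * 3 * 4).reshape(2, 20, 3, 4)  # C = 20, so C1 = 2
nhwc = nchw.transpose(0, 2, 3, 1)
from_nchw = nchw_to_nc1hwc0(nchw)
from_nhwc = nhwc_to_nc1hwc0(nhwc)
print(from_nchw.shape)  # (2, 2, 3, 4, 16)
```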
  • FRACTAL_Z

    FRACTAL_Z is a format to define convolution weights, which is converted from the Filter Matrix. It is transferred to the Cube Unit in 4D format of "C1HW,N1,N0,C0".

    The data is tiled into two layers, as shown in the following figure.

    The data of the first layer, which is related to the cube size, is stored contiguously in column-major order (n format). The data of the second layer, which is related to the matrix size, is stored contiguously in row-major order (Z format).

    For example, HWCN = (2, 2, 32, 32) can be reshaped into FRACTAL_Z (C1HW, N1, N0, C0) = (8, 2, 16, 16).

    HWCN-to-FRACTAL_Z conversion:

    Tensor.padding([ [0,0], [0,0], [0,(C0-C%C0)%C0], [0,(N0-N%N0)%N0] ]).reshape( [H, W, C1, C0, N1, N0]).transpose( [2, 0, 1, 4, 5, 3] ).reshape( [C1*H*W, N1, N0, C0])

    NCHW-to-FRACTAL_Z conversion:

    Tensor.padding([ [0,(N0-N%N0)%N0], [0,(C0-C%C0)%C0], [0,0], [0,0] ]).reshape( [N1, N0, C1, C0, H, W]).transpose( [2, 4, 5, 0, 1, 3] ).reshape( [C1*H*W, N1, N0, C0])
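
    Both conversions can be sketched in NumPy and checked against the HWCN = (2, 2, 32, 32) example. The function names are hypothetical, and C0 = N0 = 16 is assumed.

```python
import numpy as np

def hwcn_to_fractal_z(x, c0=16, n0=16):
    h, w, c, n = x.shape
    c1, n1 = -(-c // c0), -(-n // n0)  # ceiling division
    x = np.pad(x, [(0, 0), (0, 0), (0, (c0 - c % c0) % c0), (0, (n0 - n % n0) % n0)])
    x = x.reshape(h, w, c1, c0, n1, n0).transpose(2, 0, 1, 4, 5, 3)
    return x.reshape(c1 * h * w, n1, n0, c0)

def nchw_to_fractal_z(x, c0=16, n0=16):
    n, c, h, w = x.shape
    c1, n1 = -(-c // c0), -(-n // n0)
    x = np.pad(x, [(0, (n0 - n % n0) % n0), (0, (c0 - c % c0) % c0), (0, 0), (0, 0)])
    x = x.reshape(n1, n0, c1, c0, h, w).transpose(2, 4, 5, 0, 1, 3)
    return x.reshape(c1 * h * w, n1, n0, c0)

hwcn = np.arange(2 * 2 * 32 * 32).reshape(2, 2, 32, 32)
fz = hwcn_to_fractal_z(hwcn)
print(fz.shape)  # (8, 2, 16, 16)

# The same weights given in NCHW order produce an identical FRACTAL_Z tensor.
nchw = hwcn.transpose(3, 2, 0, 1)
same = np.array_equal(nchw_to_fractal_z(nchw), fz)
```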
  • NDC1HWC0

    To improve the access efficiency of the data block for matrix multiplication, the NDHWC format is converted to the NDC1HWC0 format. C0, closely related to the micro architecture, is the size of the Cube Unit in the AI Core. C0 is 16 for float16_t or 32 for int8_t. C0 needs to be stored contiguously.

    C1 = (C + C0 - 1) / C0, where the division is rounded down to an integer. This is equivalent to rounding C / C0 up.

    The NDHWC -> NDC1HWC0 conversion is as follows: Split the data along the C dimension into C1 pieces of NDHWC0, and then arrange the C1 pieces in memory as NDC1HWC0. The following figure shows the format conversion.
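
    A minimal NumPy sketch of this conversion follows. The function name is hypothetical, C0 = 16 is assumed, and zero-padding the C dimension up to C1 * C0 is an assumption made for the case where C is not a multiple of C0.

```python
import numpy as np

def ndhwc_to_ndc1hwc0(x, c0=16):
    n, d, h, w, c = x.shape
    c1 = (c + c0 - 1) // c0
    # Assumption: pad the channel axis with zeros up to C1 * C0.
    x = np.pad(x, [(0, 0)] * 4 + [(0, c1 * c0 - c)])
    # (N, D, H, W, C1, C0) -> (N, D, C1, H, W, C0); C0 stays contiguous.
    return x.reshape(n, d, h, w, c1, c0).transpose(0, 1, 4, 2, 3, 5)

x = np.arange(1 * 2 * 3 * 4 * 20).reshape(1, 2, 3, 4, 20)  # C = 20, so C1 = 2
y = ndhwc_to_ndc1hwc0(x)
print(y.shape)  # (1, 2, 2, 3, 4, 16)
```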

  • FRACTAL_Z_3D

    FRACTAL_Z_3D is a 3D convolution weight format. For example, the Conv3D operator uses this format to express the weights of 3D convolution.

    The transformation from NDHWC to FRACTAL_Z_3D is expressed by the following formula:

    (..., N, D, H, W, C)->pad->(..., N1 * N0, D, H, W, C1 * C0)->reshape->(..., N1, N0, D, H, W, C1, C0)->transpose->(..., D, C1, H, W, N1, N0, C0)->reshape->(..., D * C1 * H * W, N1, N0, C0)

    The sizes of C0 and N0 vary with the data type.

    • For the data type with 4-bit width, C0 = 64 and N0 = 16.
    • For the data type with 8-bit width, C0 = 32 and N0 = 16.
    • For the data type with 16-bit width, C0 = 16 and N0 = 16.
    • For the data type with 32-bit width, C0 = 8 and N0 = 16.

    Input an NDHWC tensor with shape (48, 2, 2, 2, 32).

    After the conversion, the FRACTAL_Z_3D format is as follows:
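
    The resulting shape for this example can be checked with a NumPy sketch of the formula. The function name is hypothetical, and a 16-bit data type is assumed, so C0 = 16 and N0 = 16.

```python
import numpy as np

def ndhwc_to_fractal_z_3d(x, c0=16, n0=16):
    n, d, h, w, c = x.shape
    n1, c1 = -(-n // n0), -(-c // c0)  # ceiling division
    x = np.pad(x, [(0, n1 * n0 - n), (0, 0), (0, 0), (0, 0), (0, c1 * c0 - c)])
    x = x.reshape(n1, n0, d, h, w, c1, c0)
    x = x.transpose(2, 5, 3, 4, 0, 1, 6)   # (D, C1, H, W, N1, N0, C0)
    return x.reshape(d * c1 * h * w, n1, n0, c0)

x = np.arange(48 * 2 * 2 * 2 * 32).reshape(48, 2, 2, 2, 32)
fz3d = ndhwc_to_fractal_z_3d(x)
print(fz3d.shape)  # (16, 3, 16, 16)
```

    Here N1 = 48 / 16 = 3 and C1 = 32 / 16 = 2, so the first dimension is D * C1 * H * W = 2 * 2 * 2 * 2 = 16.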

Matmul high-level API formats

  • BSH/SBH: B indicates the batch size, S indicates the sequence length, and H = N x D, where N indicates the number of heads and D indicates the head size. This format is usually used for Matmul matrix multiplication. The following figure shows the data layout format.
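
    The relationship H = N x D and the difference between BSH and SBH can be illustrated with a short NumPy sketch. All sizes and variable names are hypothetical.

```python
import numpy as np

b, s, n, d = 2, 4, 3, 8          # batch size, sequence length, heads, head size
h = n * d                        # H = N x D

bsh = np.arange(b * s * h).reshape(b, s, h)

# SBH stores the same values with the sequence axis outermost.
sbh = np.ascontiguousarray(bsh.transpose(1, 0, 2))

# Splitting H exposes the per-head data without moving anything in memory.
heads = bsh.reshape(b, s, n, d)
print(sbh.shape, heads.shape)  # (4, 2, 24) (2, 4, 3, 8)
```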

  • BMNK: a general data format. B indicates the batch size, and M, N, and K are the dimensions of the matrix multiplication [M, K] * [K, N]. The data layout is as follows:

  • BSNGD: reshape of the original BSH shape. S and D are the M (or N) and K axes of single-batch matrix multiplication. One SD is the computation data of one batch. This format is usually used for Matmul matrix multiplication. The following figure shows the data layout format.

  • SBNGD: reshape of the original SBH shape. S and D are the M (or N) and K axes of single-batch matrix multiplication. One SD is the computation data of one batch. This format is usually used for Matmul matrix multiplication. The following figure shows the data layout format.

  • BNGS1S2: typically the output of matrix multiplication whose inputs use the preceding two data layouts (BSNGD and SBNGD). The S1S2 data is stored contiguously, and each S1S2 block is the computation data of one batch. This format is usually used for Matmul matrix multiplication. The following figure shows the data layout format.

  • ND_ALIGN: a variant of the ND format. It is used when the result matrix C of matrix multiplication is output, so that the output is 32-byte aligned in the N direction.

    The following figure shows the ND -> ND_ALIGN transformation process. Assume that the result matrix C has the int32_t data type and is output to VECOUT. The original matrix is not 32-byte aligned in the N direction. After ND_ALIGN is set, zeros are appended in the N direction so that the matrix is aligned to 32 bytes.
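
    The alignment arithmetic can be sketched in NumPy as follows. The function name nd_align and the 3 x 10 matrix are hypothetical. The sketch only models the zero padding in the N direction, not the data movement to VECOUT.

```python
import numpy as np

def nd_align(c, align_bytes=32):
    """Zero-pad the last (N) dimension so that each row is a multiple of align_bytes."""
    elems = align_bytes // c.dtype.itemsize   # 8 elements for int32
    n = c.shape[-1]
    n_aligned = -(-n // elems) * elems        # round N up to a multiple of elems
    return np.pad(c, [(0, 0)] * (c.ndim - 1) + [(0, n_aligned - n)])

c = np.arange(3 * 10, dtype=np.int32).reshape(3, 10)  # 10 * 4 B = 40 B per row
aligned = nd_align(c)
print(aligned.shape)  # (3, 16)
```

    Each row grows from 10 to 16 elements, that is, from 40 bytes to 64 bytes.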

  • VECTOR: a data format used in the GEMV (general matrix-vector multiply) scenario. If the matrix is configured to the vector data layout format, the input data is a vector.
    Figure 2 Input matrix A in vector format in the GEMV scenario