Avoiding Bank Conflicts in the Unified Buffer

[Priority] High

This performance optimization guide applies to the following product models:

  • Atlas A3 training products / Atlas A3 inference products
  • Atlas A2 training products / Atlas A2 inference products

[Description] To improve data access efficiency and throughput, the Unified Buffer is organized into banks (memory modules of equal size). The Unified Buffer has a total size of 192 KB and is divided into 48 banks. Each bank consists of 128 rows of 32 bytes each, that is, 4 KB per bank. The 48 banks are further organized into 16 bank groups of three banks each. For example, bank 15, bank 31, and bank 47 form a bank group.

Figure 1 Bank structure (the arrow direction indicates the memory layout sequence)

Each bank can read and write data independently, so multiple data requests can be processed at the same time. However, when multiple read/write operations attempt to access the same bank or bank group at the same time, hardware resource limits force them to wait in a queue. This is a bank conflict, and it degrades performance.

Specifically, the Vector Unit can read or write one row of data from or to each bank group per cycle (instruction cycle). If multiple operations in the same API attempt to access the same bank or bank group at the same time, the Vector Unit cannot serve all requests in one cycle, so the requests are queued. The queuing increases data access latency and reduces overall system performance.
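The sizes quoted above can be cross-checked with a few compile-time constants. The following is a minimal sketch in plain C++, not Ascend C code: all names are hypothetical, and the rule that bank group g holds banks g, g + 16, and g + 32 is an assumption generalized from the single example in the text (banks 15, 31, and 47).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Geometry quoted in the description (constant names are illustrative only).
constexpr std::uint32_t kRowBytes      = 32;   // bytes per row of a bank
constexpr std::uint32_t kRowsPerBank   = 128;
constexpr std::uint32_t kBankCount     = 48;
constexpr std::uint32_t kBankGroups    = 16;
constexpr std::uint32_t kBanksPerGroup = kBankCount / kBankGroups;

constexpr std::uint32_t kBankBytes = kRowBytes * kRowsPerBank;  // 4 KB per bank
constexpr std::uint32_t kUbBytes   = kBankBytes * kBankCount;   // whole Unified Buffer

static_assert(kBanksPerGroup == 3, "three banks per bank group");
static_assert(kBankBytes == 4 * 1024, "each bank holds 4 KB");
static_assert(kUbBytes == 192 * 1024, "48 banks x 4 KB = 192 KB");

// Assumption: group g holds banks g, g + 16 and g + 32. This is generalized
// from the one example given in the text (banks 15, 31 and 47).
std::array<std::uint32_t, kBanksPerGroup> banksInGroup(std::uint32_t group) {
    assert(group < kBankGroups);
    return {group, group + kBankGroups, group + 2 * kBankGroups};
}
```

For instance, banksInGroup(15) yields banks 15, 31, and 47, which matches the example in the description.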

Typical Scenarios of Bank Conflicts

Bank conflicts can be classified into the following three scenarios:

  • Read/Write conflict: A read operation and a write operation attempt to access the same bank at the same time.
  • Write/Write conflict: Multiple write operations attempt to access the same bank group at the same time.
  • Read/Read conflict: Multiple read operations attempt to access the same bank group at the same time.

The following provides some specific examples. Assume that 0x10000 is on bank 16, 0x10020 is on bank 17, and 0x20020 is on bank 33, as shown in the following figure.

Figure 2 Address allocation
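The three sample addresses suggest how an address maps to a bank. The following sketch assumes that consecutive 32-byte rows rotate across the 16 bank groups and that each 64 KB region selects one of the three banks in a group. This mapping is inferred from the sample addresses and the bank geometry, not taken from a hardware specification, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kRowBytes    = 32;          // one row of a bank
constexpr std::uint32_t kBankGroups  = 16;
constexpr std::uint32_t kRegionBytes = 64 * 1024;   // 16 banks x 4 KB
constexpr std::uint32_t kUbBytes     = 192 * 1024;  // Unified Buffer size

// Assumed mapping: consecutive 32-byte rows rotate across the 16 bank groups.
std::uint32_t bankGroupOf(std::uint32_t addr) {
    assert(addr < kUbBytes);
    return (addr / kRowBytes) % kBankGroups;
}

// Assumed mapping: each 64 KB region holds one bank of every bank group.
std::uint32_t bankOf(std::uint32_t addr) {
    assert(addr < kUbBytes);
    return (addr / kRegionBytes) * kBankGroups + bankGroupOf(addr);
}

bool sameBank(std::uint32_t a, std::uint32_t b) {
    return bankOf(a) == bankOf(b);
}

bool sameBankGroup(std::uint32_t a, std::uint32_t b) {
    return bankGroupOf(a) == bankGroupOf(b);
}
```

Under this assumption, bankOf returns 16, 17, and 33 for 0x10000, 0x10020, and 0x20020, which agrees with the addresses stated above. Addresses 0x10020 and 0x20020 are in different banks but in the same bank group.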
  • Example of a read/write conflict
  • Example of a write/write conflict
    When eight data blocks (block0 to block7) corresponding to the destination operand dst of the Vector instruction are written to the same bank group, a write/write conflict occurs. The details are as follows:
    Table 1 Example of a write/write conflict

    | No.       | dst addr | blk_stride | block0_addr | block1_addr | block2_addr | ... | Conclusion |
    |-----------|----------|------------|-------------|-------------|-------------|-----|------------|
    | Example 1 | 0x1FE00  | 16         | 0x1FE00     | 0x20000     | 0x20200     | ... | All eight data blocks are in the same bank group. Therefore, conflicts occur. One repeat is written in eight cycles. |
    | Example 2 | 0x1FE00  | 8          | 0x1FE00     | 0x1FF00     | 0x20000     | ... | block0 and block2 are in the same bank group and conflict with each other. One repeat is written in four cycles. |

  • Example of a read/read conflict
    • When multiple source operands of a Vector instruction are read from the same bank group at the same time, a read/read conflict occurs. The analysis is as follows:
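      Table 2 Example of a read/read conflict with two source addresses

      | No.       | src0 addr | src1 addr | bank                 | bank group                         | Conclusion  |
      |-----------|-----------|-----------|----------------------|------------------------------------|-------------|
      | Example 1 | 0x10020   | 0x20020   | bank_id0 != bank_id1 | bank_group_id0 == bank_group_id1   | Conflict    |
      | Example 2 | 0x10020   | 0x10000   | bank_id0 != bank_id1 | bank_group_id0 != bank_group_id1   | No conflict |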

    • When the eight data blocks (block0 to block7) corresponding to a source operand of a Vector instruction are read from the same bank group, a read/read conflict occurs. The analysis is as follows:
      Table 3 Example of a read/read conflict with a single source address

      | No.       | src addr | blk_stride | block0_addr | block1_addr | block2_addr | ... | Conclusion |
      |-----------|----------|------------|-------------|-------------|-------------|-----|------------|
      | Example 1 | 0x1FE00  | 16         | 0x1FE00     | 0x20000     | 0x20200     | ... | All eight data blocks are in the same bank group. Therefore, conflicts occur. One repeat is read in eight cycles. |
      | Example 2 | 0x1FE00  | 8          | 0x1FE00     | 0x1FF00     | 0x20000     | ... | block0 and block2 are in the same bank group. Therefore, a conflict occurs. One repeat is read in four cycles. |
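The cycle counts in Table 1 and Table 3 can be reproduced with a small model. The sketch below is plain C++ with hypothetical names and rests on two assumptions inferred from the tables: blk_stride is counted in 32-byte data blocks, and one repeat takes as many cycles as the largest number of its eight data blocks that fall into a single bank group.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kBlockBytes      = 32;  // one data block = one bank row
constexpr std::uint32_t kBankGroups      = 16;
constexpr std::uint32_t kBlocksPerRepeat = 8;   // block0 to block7

// Hypothetical helper: estimated cycles for one repeat of a single operand.
// Assumption: the cost equals the largest number of data blocks that land in
// the same bank group, because each group serves one row per cycle.
std::uint32_t cyclesPerRepeat(std::uint32_t startAddr, std::uint32_t blkStride) {
    assert(startAddr % kBlockBytes == 0);
    std::array<std::uint32_t, kBankGroups> hits{};  // blocks per bank group
    for (std::uint32_t b = 0; b < kBlocksPerRepeat; ++b) {
        const std::uint32_t addr = startAddr + b * blkStride * kBlockBytes;
        ++hits[(addr / kBlockBytes) % kBankGroups];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

With a start address of 0x1FE00, this model gives eight cycles for a block stride of 16 and four cycles for a block stride of 8, matching both tables. A block stride of 1 spreads the eight blocks over eight bank groups and gives one cycle.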

The msProf tool can be used to collect profile data related to the resource conflict ratio.

For details about how to use the tool, see msProf (Operator Tuning). For details about the profile data file, see ResourceConflictRatio (Resource Conflict Ratio).

How to Avoid Bank Conflicts

There are two methods to avoid bank conflicts: optimizing the computation logic and optimizing address allocation.

  • Optimizing the computation logic

    Original implementation: strided (skipping) read, contiguous write

    The eight data blocks read in the same repeat are in the same bank group, causing a read/read conflict.

    Sample code:

        uint64_t mask = 128;                 // elements processed per repeat
        AscendC::UnaryRepeatParams params;
        params.dstBlkStride = 1;             // contiguous write
        params.srcBlkStride = 16;            // strided read: source blocks are 16 data blocks apart
        for (uint32_t i = 0; i < 16; i++) {
            AscendC::Adds(dstLocal[i * 128], srcLocal[i * 16], 0, mask, 1, params);
        }

    Optimized implementation: contiguous read, strided (skipping) write

    The eight data blocks read in the same repeat are not in the same bank group, avoiding a read/read conflict.

    Sample code:

        uint64_t mask = 128;                 // elements processed per repeat
        AscendC::UnaryRepeatParams params;
        params.dstBlkStride = 8;             // strided write: destination blocks are 8 data blocks apart
        params.srcBlkStride = 1;             // contiguous read
        for (uint32_t i = 0; i < 8; i++) {
            AscendC::Adds(dstLocal[i * 16], srcLocal[i * 256], 0, mask, 2, params);
        }

  • Optimizing address allocation

    This example adds 4096 contiguous float elements (z = x + y). Allocating extra memory ensures that, within a repeat, x and y do not fall into the same bank group and x/y and z do not fall into the same bank. For the complete example, see sample of avoiding bank conflicts.

    Original implementation

    No address optimization is performed. InitBuffer is used directly to allocate memory. The tensor addresses are as follows:

    x: The start address is 0x0, and the tensor length is 4096 x sizeof(float) bytes.

    y: The start address is 0x4000, and the tensor length is 4096 x sizeof(float) bytes.

    z: The start address is 0x8000, and the tensor length is 4096 x sizeof(float) bytes.

    In a repeat, x and y are read from the same bank group at the same time, and x/y and z are read from and written to the same bank at the same time.

    Sample code:

        pipe.InitBuffer(inQueueX, 1, 4096 * sizeof(float));
        pipe.InitBuffer(inQueueY, 1, 4096 * sizeof(float));
        pipe.InitBuffer(outQueueZ, 1, 4096 * sizeof(float));

    Optimized implementation

    The addresses are optimized by allocating extra memory when InitBuffer is called. The tensor addresses are as follows:

    x: The start address is 0x0, and the tensor length is (4096 x sizeof(float) + 256) bytes.

    y: The start address is 0x4100, and the tensor length is (64 x 1024 - (4096 x sizeof(float) + 256)) bytes.

    z: The start address is 0x10000, and the tensor length is 4096 x sizeof(float) bytes.

    The extra 256 bytes allocated for x prevent x and y from reading the same bank group in a repeat. The extra space allocated for y ensures that z does not fall into the same bank as x or y.

    Sample code:

        pipe.InitBuffer(inQueueX, 1, 4096 * sizeof(float) + 256); // Allocate 256 bytes more.
        // Allocate more space so that z does not fall into the same bank as x/y. 64 * 1024 is the
        // space of 16 bank groups, and 4096 * sizeof(float) + 256 is the space occupied by x.
        pipe.InitBuffer(inQueueY, 1, 64 * 1024 - (4096 * sizeof(float) + 256));
        pipe.InitBuffer(outQueueZ, 1, 4096 * sizeof(float));
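
The effect of the two address layouts can be checked offline with a small model. The sketch below is plain C++, not Ascend C code, and all names are hypothetical. It assumes the address-to-bank mapping implied by the earlier examples (32-byte rows rotate across 16 bank groups, and each 64 KB region holds one bank of every group). It also assumes that each repeat touches eight consecutive 32-byte data blocks of x, y, and z at the same offset, and that all blocks of one repeat are accessed together.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

constexpr std::uint32_t kBlockBytes  = 32;
constexpr std::uint32_t kBankGroups  = 16;
constexpr std::uint32_t kRegionBytes = 64 * 1024;        // 16 banks x 4 KB
constexpr std::uint32_t kRepeatBytes = 8 * kBlockBytes;  // eight blocks per repeat

// Assumed mapping, inferred from the sample addresses in this guide.
std::uint32_t bankGroupOf(std::uint32_t addr) {
    return (addr / kBlockBytes) % kBankGroups;
}
std::uint32_t bankOf(std::uint32_t addr) {
    return (addr / kRegionBytes) * kBankGroups + bankGroupOf(addr);
}

struct Layout { std::uint32_t x, y, z; };            // start addresses
struct Conflicts { bool readRead; bool readWrite; };

// Hypothetical checker: reports whether any repeat has x and y in the same
// bank group (read/read) or z in the same bank as x or y (read/write).
Conflicts checkLayout(const Layout& layout, std::uint32_t tensorBytes) {
    assert(tensorBytes % kRepeatBytes == 0);
    Conflicts result{false, false};
    for (std::uint32_t off = 0; off < tensorBytes; off += kRepeatBytes) {
        std::set<std::uint32_t> xGroups;     // bank groups read by x in this repeat
        std::set<std::uint32_t> inputBanks;  // banks read by x or y in this repeat
        for (std::uint32_t b = 0; b < kRepeatBytes; b += kBlockBytes) {
            xGroups.insert(bankGroupOf(layout.x + off + b));
            inputBanks.insert(bankOf(layout.x + off + b));
            inputBanks.insert(bankOf(layout.y + off + b));
        }
        for (std::uint32_t b = 0; b < kRepeatBytes; b += kBlockBytes) {
            if (xGroups.count(bankGroupOf(layout.y + off + b)) != 0) {
                result.readRead = true;
            }
            if (inputBanks.count(bankOf(layout.z + off + b)) != 0) {
                result.readWrite = true;
            }
        }
    }
    return result;
}
```

Under these assumptions, the original layout (0x0, 0x4000, 0x8000) reports both conflict types, and the optimized layout (0x0, 0x4100, 0x10000) reports neither. Profile data collected with msProf remains the authoritative check on real hardware.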