Bank Allocation Optimization to Improve Read and Write Performance

[Priority] High

[Description] The following figure shows the bank structure of the Unified Buffer (UB). Assuming a UB size of 192 KB, the buffer is divided into 48 banks. Each bank consists of 128 rows (bank_depth = 128), and each row is C0 = 32 bytes long, that is, a 2D structure of bank_depth x C0 (4 KB per bank). The banks are organized into 16 bank groups (bank_group = 16) of three banks each (bank_num = 3), that is, a 2D structure of bank_group x bank_num.

Figure 1 Bank structure

The Vector unit can read one row of data (C0 = 32 bytes) from each bank group in each cycle (one cycle is an instruction cycle). Therefore, the maximum amount of data that can be read in each cycle is 16 bank groups x C0 (32 bytes) = 512 bytes. The same limit applies to writes: at most 16 bank groups x C0 (32 bytes) = 512 bytes can be written in each cycle. When multiple operations attempt to access the same bank or bank group at the same time, a bank conflict may occur. The conflicting accesses are then queued, and performance deteriorates.

Bank conflicts occur in the following scenarios:

  • Read/Write conflict: occurs when the same bank is read and written at the same time.
  • Write/Write conflict: occurs when multiple writes target the same bank group at the same time.
  • Read/Read conflict: occurs when multiple reads target the same bank group at the same time.

Assume that 0x10000 is in bank 16, 0x10020 is in bank 17, and 0x20020 is in bank 33, as shown in the following figure.

Figure 2 Address allocation
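
The address-to-bank mapping implied by these three addresses can be sketched in a few lines of C++. This is only a model inferred from the description above, not an official formula: it assumes that consecutive 32-byte rows rotate through the 16 bank groups and that each 64 KB range (16 bank groups x 128 rows x 32 bytes) corresponds to one bank per group. The helper names BankGroupId and BankId are hypothetical.

```cpp
#include <cstdint>

// Assumed UB geometry, taken from the description above.
constexpr uint32_t kBlockBytes = 32;   // C0: bytes per bank row
constexpr uint32_t kBankGroups = 16;   // bank_group
constexpr uint32_t kBankDepth  = 128;  // rows per bank (bank_depth)
// Bytes covered by one bank from every group: 16 * 128 * 32 = 64 KB.
constexpr uint32_t kLayerBytes = kBankGroups * kBankDepth * kBlockBytes;

// Hypothetical helper: bank group selected by a UB byte address.
constexpr uint32_t BankGroupId(uint32_t addr) {
    return (addr / kBlockBytes) % kBankGroups;
}

// Hypothetical helper: bank ID, numbered so that banks 0 to 15 cover the first 64 KB.
constexpr uint32_t BankId(uint32_t addr) {
    return (addr / kLayerBytes) * kBankGroups + BankGroupId(addr);
}

// The three addresses quoted in the text.
static_assert(BankId(0x10000) == 16, "0x10000 should map to bank 16");
static_assert(BankId(0x10020) == 17, "0x10020 should map to bank 17");
static_assert(BankId(0x20020) == 33, "0x20020 should map to bank 33");
```

Under this model, two addresses share a bank group whenever their 32-byte row indexes are equal modulo 16, and they share a bank only if they also lie in the same 64 KB range.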
  • Read/Write conflict: When a bank is read and written at the same time by the source and destination addresses of a Vector instruction, a read/write conflict occurs. The following is an example.
    Table 1 Example of a read/write conflict

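    | No.       | src0 addr | src1 addr | bank                 | bank_group                       | Conclusion  |
    |-----------|-----------|-----------|----------------------|----------------------------------|-------------|
    | Example 1 | 0x10020   | 0x10000   | bank_id0 != bank_id1 | bank_group_id0 != bank_group_id1 | No conflict |
    | Example 2 | 0x10020   | 0x10E20   | bank_id0 == bank_id1 | bank_group_id0 == bank_group_id1 | Conflict    |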

  • Write/Write conflict: A Vector instruction can write eight data blocks at the same time in a cycle. If data blocks 0 to 7 are written to the same bank_group at the same time, a write/write conflict occurs. The following is an example.
    Table 2 Example of a write/write conflict

    | No.       | dst addr | blk_stride | block0_addr | block1_addr | block2_addr | ... | Conclusion |
    |-----------|----------|------------|-------------|-------------|-------------|-----|------------|
    | Example 1 | 0x1FE00  | 16         | 0x1FE00     | 0x20000     | 0x20200     | ... | Eight data blocks conflict with each other, and one repeat is completed in eight cycles. |
    | Example 2 | 0x1FE00  | 8          | 0x1FE00     | 0x1FF00     | 0x20000     | ... | Data block 0 and data block 2 conflict, and one repeat is completed in four cycles. |

  • Read/Read conflict
    • Two source addresses: for an instruction with two inputs, such as Add, src0 and src1 occupy two separate read ports. If src0 and src1 read the same bank_group at the same time, a read/read conflict occurs. The following is an example.
      Table 3 Example of a read/read conflict with two source addresses

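      | No.       | src0 addr | src1 addr | bank                 | bank_group                       | Conclusion  |
      |-----------|-----------|-----------|----------------------|----------------------------------|-------------|
      | Example 1 | 0x10020   | 0x20020   | bank_id0 != bank_id1 | bank_group_id0 == bank_group_id1 | Conflict    |
      | Example 2 | 0x10020   | 0x10000   | bank_id0 != bank_id1 | bank_group_id0 != bank_group_id1 | No conflict |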

    • One source address: a Vector instruction can read eight data blocks at the same time in a cycle. If data blocks 0 to 7 are read from the same bank group at the same time, a read/read conflict occurs. The following is an example.
      Table 4 Example of a read/read conflict with a single source address

      | No.       | src addr | blk_stride | block0_addr | block1_addr | block2_addr | ... | Conclusion |
      |-----------|----------|------------|-------------|-------------|-------------|-----|------------|
      | Example 1 | 0x1FE00  | 16         | 0x1FE00     | 0x20000     | 0x20200     | ... | Eight data blocks conflict with each other, and one repeat is completed in eight cycles. |
      | Example 2 | 0x1FE00  | 8          | 0x1FE00     | 0x1FF00     | 0x20000     | ... | Data block 0 and data block 2 conflict, and one repeat is completed in four cycles. |
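
The conflict rules in the tables above can be checked with a small host-side model. The sketch below is an illustration under stated assumptions, not hardware documentation: it assumes that the bank group is the 32-byte row index modulo 16, that each 64 KB range holds one bank per group, and that accesses hitting the same bank group are served one per cycle. The helper names (BankGroupId, BankId, CyclesPerRepeat) are hypothetical.

```cpp
#include <cstdint>

constexpr uint32_t kBlockBytes      = 32;   // one data block (C0)
constexpr uint32_t kBankGroups      = 16;   // bank_group
constexpr uint32_t kLayerBytes      = 16 * 128 * 32;  // one bank per group: 64 KB
constexpr uint32_t kBlocksPerRepeat = 8;    // data blocks handled by one repeat

constexpr uint32_t BankGroupId(uint32_t addr) {
    return (addr / kBlockBytes) % kBankGroups;
}

constexpr uint32_t BankId(uint32_t addr) {
    return (addr / kLayerBytes) * kBankGroups + BankGroupId(addr);
}

// Cycles needed for one repeat, modeled as the largest number of the eight
// data blocks that land in the same bank group. blkStride is in data blocks.
constexpr uint32_t CyclesPerRepeat(uint32_t startAddr, uint32_t blkStride) {
    uint32_t hits[kBankGroups] = {};
    uint32_t worst = 0;
    for (uint32_t b = 0; b < kBlocksPerRepeat; ++b) {
        const uint32_t group = BankGroupId(startAddr + b * blkStride * kBlockBytes);
        ++hits[group];
        if (hits[group] > worst) {
            worst = hits[group];
        }
    }
    return worst;
}

// Table 1, Example 2: same bank, so a read/write conflict.
static_assert(BankId(0x10020) == BankId(0x10E20), "expected the same bank");
// Table 3, Example 1: different banks but the same bank group, so a read/read conflict.
static_assert(BankId(0x10020) != BankId(0x20020), "expected different banks");
static_assert(BankGroupId(0x10020) == BankGroupId(0x20020), "expected the same bank group");
// Tables 2 and 4: blk_stride 16 needs eight cycles, blk_stride 8 needs four.
static_assert(CyclesPerRepeat(0x1FE00, 16) == 8, "stride 16 should serialize all eight blocks");
static_assert(CyclesPerRepeat(0x1FE00, 8) == 4, "stride 8 should need four cycles");
```

In this model a block stride of 1 gives CyclesPerRepeat(0x1FE00, 1) == 1, because the eight blocks of a repeat fall in eight different bank groups.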

The msProf tool can be used to collect profile data related to the resource conflict ratio.

For details about how to use the tool, see Tool Usage. For details about the profile data file, see ResourceConflictRatio (Resource Conflict Ratio).

[Negative Example]

For an input or output tensor, the high-dimensional tensor tiling API is used to implement skipping (strided) reads and writes. When the value of dataBlockStride is an integer multiple of 16, all data blocks of a repeat map to the same bank group, and a conflict occurs. Assume that you want to perform a (1, 0, 2) transpose on an input with shape (8, 16, 16). The output shape is then (16, 8, 16).

The following code uses skipping read and continuous write. The eight data blocks read in the same repeat are in the same bank group, causing a read/read conflict.

uint64_t mask = 128;
UnaryRepeatParams params;
params.dstBlkStride = 1;   // continuous write
params.srcBlkStride = 16;  // skipping read: every 16th data block, all in one bank group
for (uint32_t i = 0; i < 16; i++) {
    AscendC::Adds(dstLocal[i * 128], srcLocal[i * 16], 0, mask, 1, params);
}
Figure 3 Skipping read and continuous write

[Positive Example]

Change the access pattern to "continuous read and skipping write" to avoid the read/read conflict. The sample code is as follows:

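uint64_t mask = 128;
UnaryRepeatParams params;
params.dstBlkStride = 8;   // skipping write: every 8th data block
params.srcBlkStride = 1;   // continuous read: eight adjacent data blocks, eight bank groups
// Repeat strides written out explicitly (assumed to be required here): each repeat reads
// 8 adjacent blocks and writes 8 blocks spaced 8 apart, so the next repeat starts 64 blocks later.
params.srcRepStride = 8;
params.dstRepStride = 64;
for (uint32_t i = 0; i < 8; i++) {
    AscendC::Adds(dstLocal[i * 16], srcLocal[i * 256], 0, mask, 2, params);
}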
Figure 4 Continuous read and skipping write
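
To see that the two access patterns compute the same (1, 0, 2) transpose, the block-stride addressing can be replayed on the host. The sketch below is a plain C++ model, not Ascend C code: StridedCopy is a hypothetical stand-in for Adds with a scalar of 0, int elements replace half-precision values, and the repeat strides (8 data blocks for the source, 64 for the destination in the two-repeat variant) are assumptions of this sketch.

```cpp
#include <cassert>
#include <vector>

constexpr int kBlockElems = 16;      // elements per 32-byte data block (half precision)
constexpr int kBlocksPerRepeat = 8;  // data blocks processed by one repeat

// Strides in units of data blocks.
struct Strides {
    int dstBlk;
    int srcBlk;
    int dstRep;
    int srcRep;
};

// Hypothetical model of a strided unary copy. Offsets are in elements.
void StridedCopy(std::vector<int>& dst, int dstOff, const std::vector<int>& src,
                 int srcOff, int repeats, Strides s) {
    for (int r = 0; r < repeats; ++r) {
        for (int b = 0; b < kBlocksPerRepeat; ++b) {
            const int d = dstOff + (r * s.dstRep + b * s.dstBlk) * kBlockElems;
            const int q = srcOff + (r * s.srcRep + b * s.srcBlk) * kBlockElems;
            for (int e = 0; e < kBlockElems; ++e) {
                dst[d + e] = src[q + e];
            }
        }
    }
}

// Input of shape (8, 16, 16) filled with its own linear index.
std::vector<int> MakeInput() {
    std::vector<int> v(8 * 16 * 16);
    for (int i = 0; i < static_cast<int>(v.size()); ++i) {
        v[i] = i;
    }
    return v;
}

// Negative example: skipping read, continuous write.
std::vector<int> SkippingRead(const std::vector<int>& src) {
    std::vector<int> dst(src.size(), -1);
    for (int i = 0; i < 16; ++i) {
        StridedCopy(dst, i * 128, src, i * 16, 1, Strides{1, 16, 8, 8});
    }
    return dst;
}

// Positive example: continuous read, skipping write.
std::vector<int> ContinuousRead(const std::vector<int>& src) {
    std::vector<int> dst(src.size(), -1);
    for (int i = 0; i < 8; ++i) {
        StridedCopy(dst, i * 16, src, i * 256, 2, Strides{8, 1, 64, 8});
    }
    return dst;
}

// Reference check: dst[b][a][c] == src[a][b][c] for shapes (16, 8, 16) and (8, 16, 16).
bool IsTranspose102(const std::vector<int>& src, const std::vector<int>& dst) {
    for (int a = 0; a < 8; ++a) {
        for (int b = 0; b < 16; ++b) {
            for (int c = 0; c < 16; ++c) {
                if (dst[(b * 8 + a) * 16 + c] != src[(a * 16 + b) * 16 + c]) {
                    return false;
                }
            }
        }
    }
    return true;
}
```

Calling IsTranspose102(in, SkippingRead(in)) and IsTranspose102(in, ContinuousRead(in)) with in = MakeInput() checks both loops against the reference. Only the order in which the UB is read and written differs between them.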

[Positive Example]

If the allocated workBuffer holds two input vectors, the start addresses of the two vectors must not be in the same bank group. You can allocate an extra 32 bytes in the LocalTensor so that the second input is shifted by 32 bytes relative to the first.

For example, when z = x + y is calculated, x starts at the workBuffer0 address and has a length of 8 KB, and y starts at the 8 KB offset and also has a length of 8 KB. In this case, the physical addresses of x and y fall within the same bank. Shift the start address of y by an extra 32 bytes to avoid the bank conflict. The following shows the sample code and address allocation.

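LocalTensor<float> srcLocal;
LocalTensor<float> dstLocal;
uint64_t mask = 64;          // assumed: 64 float elements (256 bytes) per repeat
BinaryRepeatParams params;   // Add has two sources, so it takes BinaryRepeatParams
// y starts 32 bytes past the 8 KB boundary, so x and y begin in different bank groups.
AscendC::Add(dstLocal, srcLocal[0], srcLocal[(8 * 1024 + 32) / sizeof(float)], mask, (8 * 1024) / 256, params);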
Figure 5 Address allocation
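
The effect of the extra 32 bytes can be verified with simple offset arithmetic. The sketch below assumes that the bank group is the 32-byte row index modulo 16 and works with byte offsets relative to the start of the work buffer. BankGroupId and SecondOperandIndex are hypothetical helper names.

```cpp
#include <cstdint>

constexpr uint32_t kBlockBytes = 32;  // one data block (C0)
constexpr uint32_t kBankGroups = 16;  // bank_group

// Bank group of a byte offset, relative to a 512-byte-aligned buffer start (assumed).
constexpr uint32_t BankGroupId(uint32_t byteOffset) {
    return (byteOffset / kBlockBytes) % kBankGroups;
}

// Element index at which the second operand starts inside the shared buffer.
constexpr uint32_t SecondOperandIndex(uint32_t firstBytes, uint32_t padBytes,
                                      uint32_t elemBytes) {
    return (firstBytes + padBytes) / elemBytes;
}

// Without padding, x (offset 0) and y (offset 8 KB) start in the same bank group.
static_assert(BankGroupId(0) == BankGroupId(8 * 1024),
              "unpadded operands are expected to share a bank group");
// With 32 bytes of padding, y moves to the next bank group.
static_assert(BankGroupId(0) != BankGroupId(8 * 1024 + 32),
              "padded operands are expected to start in different bank groups");
// Index used for y in the sample: (8 * 1024 + 32) / sizeof(float).
static_assert(SecondOperandIndex(8 * 1024, 32, sizeof(float)) == 2056,
              "y is expected to start at float element 2056");
```

Any padding that is a multiple of 32 bytes but not of 512 bytes moves y to a different bank group in this model. 32 bytes is the smallest such value.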