Bank Allocation Optimization to Improve Read and Write Performance
[Priority] High
[Description] The following figure shows the bank structure in the Unified Buffer (UB). Assuming a total UB size of 192 KB, the UB is divided into 48 banks. Each bank consists of 128 rows (bank_depth = 128) of C0 = 32 bytes each, that is, a 2D structure of bank_depth x C0 that holds 4 KB. The banks are organized into 16 bank groups (bank_group = 16) of three banks each (bank_num = 3), that is, a 2D structure of bank_group x bank_num.
The Vector unit can read one data block (C0 = 32 bytes) from each bank group in each cycle (one cycle is an instruction cycle). Therefore, the maximum amount of data that can be read in each cycle is 16 bank groups x C0 (32 bytes) = 512 bytes. Likewise, the maximum amount of data that can be written in each cycle is 16 bank groups x C0 (32 bytes) = 512 bytes. When multiple operations attempt to access the same bank or bank group at the same time, a bank conflict may occur. The conflicting accesses are then queued and performance deteriorates.
Bank conflicts occur in the following scenarios:
- Read/Write conflict: occurs when the same bank is read and written at the same time.
- Write/Write conflict: occurs when several writes target the same bank group at the same time.
- Read/Read conflict: occurs when several reads target the same bank group at the same time.
Assume that 0x10000 is in bank 16, 0x10020 is in bank 17, and 0x20020 is in bank 33, as shown in the following figure.
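The three addresses above are consistent with a simple interleaved mapping, sketched below. This is an assumption inferred from the examples in this section, not a documented hardware formula: consecutive 32-byte data blocks rotate through the 16 bank groups, and each 64 KB range (16 banks x 128 rows x 32 bytes) selects one of the three banks in a group. The names BankGroupOf, BankOf, and CheckDocumentedAddresses are hypothetical helpers for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Layout constants taken from the description (192 KB UB).
constexpr std::uint32_t kBlockBytes = 32;   // C0: bytes per bank row
constexpr std::uint32_t kBankGroups = 16;   // bank_group
constexpr std::uint32_t kBankDepth  = 128;  // bank_depth: rows per bank
// Bytes covered by one layer of 16 banks: 16 * 128 * 32 = 64 KB.
constexpr std::uint32_t kLayerBytes = kBankGroups * kBankDepth * kBlockBytes;

// ASSUMPTION: consecutive 32-byte blocks rotate through the 16 bank groups.
constexpr std::uint32_t BankGroupOf(std::uint32_t addr) {
    return (addr / kBlockBytes) % kBankGroups;
}

// ASSUMPTION: each 64 KB range maps to one bank per group, and bank IDs
// are numbered layer by layer (0..15, 16..31, 32..47).
constexpr std::uint32_t BankOf(std::uint32_t addr) {
    return (addr / kLayerBytes) * kBankGroups + BankGroupOf(addr);
}

// Checks the mapping against the addresses quoted in this section.
inline void CheckDocumentedAddresses() {
    assert(BankOf(0x10000) == 16);
    assert(BankOf(0x10020) == 17);
    assert(BankOf(0x20020) == 33);
}
```

Under this mapping, two addresses are in the same bank group when BankGroupOf returns the same value for both, and in the same bank when BankOf does.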
- Read/Write conflict: When a bank is read and written at the same time by the source and destination addresses of a Vector instruction, a read/write conflict occurs. The following is an example.
Table 1 Example of a read/write conflict

| No. | src0 addr | src1 addr | bank | bank_group | Conclusion |
|---|---|---|---|---|---|
| Example 1 | 0x10020 | 0x10000 | bank_id0 != bank_id1 | bank_group_id0 != bank_group_id1 | No conflict |
| Example 2 | 0x10020 | 0x10E20 | bank_id0 == bank_id1 | bank_group_id0 == bank_group_id1 | Conflict |
- Write/Write conflict: A Vector instruction can write eight data blocks at the same time in a cycle. If data blocks 0 to 7 are written to the same bank_group at the same time, a write/write conflict occurs. The following is an example.
Table 2 Example of a write/write conflict

| No. | dst addr | blk_stride | block0_addr | block1_addr | block2_addr | ... | Conclusion |
|---|---|---|---|---|---|---|---|
| Example 1 | 0x1FE00 | 16 | 0x1FE00 | 0x20000 | 0x20200 | ... | Eight data blocks conflict with each other, and one repeat is completed in eight cycles. |
| Example 2 | 0x1FE00 | 8 | 0x1FE00 | 0x1FF00 | 0x20000 | ... | Data block 0 and data block 2 conflict, and one repeat is completed in four cycles. |
- Read/Read conflict
  - Two source addresses: for an instruction with two inputs, such as Add, src0 and src1 use two separate read ports. If src0 and src1 read the same bank_group at the same time, a read/read conflict occurs. The following is an example.
Table 3 Example of a read/read conflict with two source addresses

| No. | src0 addr | src1 addr | bank | bank_group | Conclusion |
|---|---|---|---|---|---|
| Example 1 | 0x10020 | 0x20020 | bank_id0 != bank_id1 | bank_group_id0 == bank_group_id1 | Conflict |
| Example 2 | 0x10020 | 0x10000 | bank_id0 != bank_id1 | bank_group_id0 != bank_group_id1 | No conflict |
  - One source address: a Vector instruction can read eight data blocks at the same time in a cycle. If data blocks 0 to 7 are read from the same bank group at the same time, a read/read conflict occurs. The following is an example.
Table 4 Example of a read/read conflict with a single source address

| No. | src addr | blk_stride | block0_addr | block1_addr | block2_addr | ... | Conclusion |
|---|---|---|---|---|---|---|---|
| Example 1 | 0x1FE00 | 16 | 0x1FE00 | 0x20000 | 0x20200 | ... | Eight data blocks conflict with each other, and one repeat is completed in eight cycles. |
| Example 2 | 0x1FE00 | 8 | 0x1FE00 | 0x1FF00 | 0x20000 | ... | Data block 0 and data block 2 conflict, and one repeat is completed in four cycles. |
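The cycle counts quoted for the blk_stride examples (start address 0x1FE00 with blk_stride 16 and 8) can be reproduced with a small host-side estimate. The sketch below assumes that consecutive 32-byte data blocks rotate through the 16 bank groups (inferred from the example addresses in this section, not from a hardware specification) and that one repeat takes as many cycles as the largest number of its eight data blocks that share a bank group. BankGroupOf and CyclesPerRepeat are hypothetical helper names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kBlockBytes = 32;      // C0: bytes per data block
constexpr std::uint32_t kBankGroups = 16;      // bank_group
constexpr std::uint32_t kBlocksPerRepeat = 8;  // data blocks handled per repeat

// ASSUMPTION: consecutive 32-byte blocks rotate through the 16 bank groups.
inline std::uint32_t BankGroupOf(std::uint32_t addr) {
    return (addr / kBlockBytes) % kBankGroups;
}

// Estimates the cycles needed for one repeat: the eight data blocks are
// binned by bank group, and blocks in the same group are served one per cycle.
// blkStride is given in units of 32-byte data blocks.
inline std::uint32_t CyclesPerRepeat(std::uint32_t startAddr,
                                     std::uint32_t blkStride) {
    std::array<std::uint32_t, kBankGroups> hits{};  // zero-initialized counters
    for (std::uint32_t b = 0; b < kBlocksPerRepeat; ++b) {
        ++hits[BankGroupOf(startAddr + b * blkStride * kBlockBytes)];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Calling CyclesPerRepeat(0x1FE00, 16) and CyclesPerRepeat(0x1FE00, 8) gives the eight-cycle and four-cycle figures from the tables, while a stride of 1 spreads the eight blocks over eight different bank groups.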
[Negative Example]
For an input or output tensor, the high-dimensional tensor tiling API can be used to read or write data blocks with a stride. When the value of dataBlockStride is an integer multiple of 16, all eight data blocks of a repeat fall in the same bank group, so a read/read conflict occurs. Assume that you want to perform a (1, 0, 2) transpose on an input with shape (8, 16, 16). The output shape is then (16, 8, 16).
The following code uses strided read and contiguous write. The eight data blocks read in the same repeat are in the same bank group, causing a read/read conflict.
    uint64_t mask = 128;
    UnaryRepeatParams params;
    params.dstBlkStride = 1;   // contiguous write
    params.srcBlkStride = 16;  // strided read: a multiple of 16, so all eight blocks hit one bank group
    for (uint32_t i = 0; i < 16; i++)
    {
        AscendC::Adds(dstLocal[i * 128], srcLocal[i * 16], 0, mask, 1, params);
    }
[Positive Example]
Switch to contiguous read and strided write to avoid the read/read conflict. The sample code is as follows:
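    uint64_t mask = 128;
    UnaryRepeatParams params;
    params.dstBlkStride = 8;   // strided write: consecutive data blocks land 8 blocks apart
    params.srcBlkStride = 1;   // contiguous read
    // One repeat writes 8 blocks spaced 8 blocks apart, so the second repeat must start
    // 64 blocks later (assuming dstRepStride is counted in 32-byte blocks and defaults to 8).
    params.dstRepStride = 64;
    for (uint32_t i = 0; i < 8; i++)
    {
        AscendC::Adds(dstLocal[i * 16], srcLocal[i * 256], 0, mask, 2, params);
    }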
[Positive Example]
If the allocated workBuffer holds two input vectors, the start addresses of the two vectors must not be in the same bank group. You can allocate an extra 32-byte LocalTensor to ensure that the two inputs in the workBuffer are offset by 32 bytes.
For example, when z = x + y is calculated, x starts from the workBuffer0 address and has a length of 8 KB, and y starts from the 8 KB address and has a length of 8 KB. In this case, the start addresses of x and y fall within the same bank. Offset the start address of y by an extra 32 bytes to avoid the bank conflict. The following shows the sample code and address allocation.
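    // x occupies the first 8 KB of srcLocal; y starts 32 bytes after the end of x.
    LocalTensor<float> srcLocal;
    LocalTensor<float> dstLocal;
    uint64_t mask = 64;         // assumed full mask for float: 256 bytes per repeat / sizeof(float)
    BinaryRepeatParams params;  // Add has two sources, so it takes BinaryRepeatParams
    AscendC::Add(dstLocal, srcLocal[0], srcLocal[(8 * 1024 + 32) / sizeof(float)], mask, (8 * 1024) / 256, params);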
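The 32-byte offset can be sanity-checked on the host before the buffer layout is fixed. The sketch below is illustrative only: it assumes that consecutive 32-byte data blocks rotate through the 16 bank groups (inferred from the example addresses in this section) and that the byte offsets are relative to a buffer base aligned to 512 bytes. BankGroupOf and PaddingToSeparate are hypothetical helper names, not part of the Ascend C API.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kBlockBytes = 32;  // C0: bytes per data block
constexpr std::uint32_t kBankGroups = 16;  // bank_group

// ASSUMPTION: consecutive 32-byte blocks rotate through the 16 bank groups.
inline std::uint32_t BankGroupOf(std::uint32_t byteOffset) {
    return (byteOffset / kBlockBytes) % kBankGroups;
}

// Returns the number of padding bytes to insert before the second input so
// that the two inputs start in different bank groups. Shifting by one data
// block moves the start to the next bank group, so 32 bytes is enough.
inline std::uint32_t PaddingToSeparate(std::uint32_t src0Offset,
                                       std::uint32_t src1Offset) {
    return BankGroupOf(src0Offset) == BankGroupOf(src1Offset) ? kBlockBytes : 0;
}
```

For the layout in this example, PaddingToSeparate(0, 8 * 1024) reports that 32 bytes are needed, and no further padding is needed once y starts at 8 * 1024 + 32.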