Inter-core Load Balancing
[Priority] Medium
[Description] The number of physical cores of the AI Processor is fixed. After the data is tiled to fit the L2 cache, computation tailing may occur on some cores: the number of data blocks in each tile (the tile size divided by the amount of data each core processes at a time) is not evenly divisible by the number of cores. As a result, some cores, called tail cores, must compute the leftover tail blocks. While the tail cores are computing, the remaining cores sit idle, which degrades operator performance. As shown in Figure 1, assume that the total data size is TotalSize and that L2 cache tiling splits it into two parts of TotalSize/2 each. Each core processes TotalSize/2/25 at a time, so each part consists of 25 blocks and would need 25 cores to finish in a single round. Because the AI Processor has only 20 cores, cores 1 to 5 each have to compute one extra block in every pass, so tailing occurs. Over the two passes, cores 1 to 5 each compute four blocks while cores 6 to 20 each compute two.
[Negative Example]
[Positive Example]
For the foregoing tiling strategy, the global load can be balanced by relocating the tail cores, as shown in Figure 2, so that the tail blocks of the two parts land on different cores. When all computations are complete, cores 1 to 10 have each computed only one more data block than cores 11 to 20, which is the optimal global load.
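The effect of relocating the tail cores can be checked with a small simulation. The following Python sketch is illustrative only and is not Ascend C code: the function per_core_blocks and its parameters are hypothetical names, and rotating the start core of each pass by the tail size is an assumed way to realize the relocation shown in Figure 2. It counts the blocks each core processes over two passes of 25 blocks on 20 cores. Core indices are 0-based here, so index 0 corresponds to core 1 in the text.

```python
def per_core_blocks(num_cores, blocks_per_pass, num_passes, rotate_tail):
    """Count the blocks each core processes over all L2 tiling passes."""
    load = [0] * num_cores
    tail = blocks_per_pass % num_cores  # blocks left over after full rounds
    start = 0  # core that receives the first block of the current pass
    for _ in range(num_passes):
        # Hand out the blocks of this pass round-robin, beginning at `start`.
        for block in range(blocks_per_pass):
            load[(start + block) % num_cores] += 1
        if rotate_tail:
            # Assumed relocation: begin the next pass where this tail ended,
            # so the next tail lands on different cores.
            start = (start + tail) % num_cores
    return load


naive = per_core_blocks(20, 25, 2, rotate_tail=False)
balanced = per_core_blocks(20, 25, 2, rotate_tail=True)
print(max(naive), max(balanced))  # prints "4 3"
```

Without rotation, the first five cores end up with four blocks each and the rest with two. With rotation, the first ten cores end up with three blocks each and the rest with two, which matches the distribution described for Figure 2.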

