Avoiding Same-Address Access

[Priority] High

This performance optimization guide applies to the following product models:

Atlas A3 training products / Atlas A3 inference products
Atlas A2 training products / Atlas A2 inference products

[Description] When units such as MTE2, MTE3, and Scalar access data in the global memory, their address requests are aligned to 512 bytes before being processed. When several requests access the global memory at the same time and their addresses fall within the same consecutive 512-byte range, the requests are processed serially to preserve data consistency, which reduces data movement efficiency.

The current operator execution mechanism ensures that the addresses of the kernel input parameters (including workspace and tiling) are 512-byte aligned. Therefore, you only need to determine whether two addresses fall within a consecutive 512-byte range based on the address offset.
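The offset check described above can be sketched on the host side as follows. This is a minimal illustration, not part of the Ascend C API: the names kConflictRangeBytes, RangeIndex, and InSameRange are hypothetical, and the sketch assumes that the 512-byte ranges are the aligned windows counted from a 512-byte-aligned kernel argument address.

```cpp
#include <cstdint>

// Assumed size of the address range that is treated as a single request target.
constexpr std::uint64_t kConflictRangeBytes = 512;

// Index of the 512-byte range that contains a byte offset, where the offset is
// measured from a 512-byte-aligned base address (for example, a kernel input).
constexpr std::uint64_t RangeIndex(std::uint64_t byteOffset) {
    return byteOffset / kConflictRangeBytes;
}

// True when two byte offsets from the same aligned base fall within the same
// 512-byte range, that is, when simultaneous accesses could be serialized.
constexpr bool InSameRange(std::uint64_t offsetA, std::uint64_t offsetB) {
    return RangeIndex(offsetA) == RangeIndex(offsetB);
}
```

For example, under these assumptions offsets 0 and 511 share a range, while offsets 511 and 512 do not.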

As shown in the following figure, each core in the AI Core sends read and write requests to the global memory at the same time. Although addr0 to addr5 are different addresses, they fall within the same consecutive 512-byte range and are therefore treated as requests to the same address. These requests are processed serially, which reduces data access efficiency. The impact of same-address access depends on how many cores access the same address at the same time: the more cores, the more severe the performance degradation caused by serial processing.

Same-address access can be avoided in two ways: adjusting the data access sequence and modifying the tiling policy. For complete examples, see the sample on avoiding same-address access.

Adjusting the data access sequence

The following uses a float input with the shape of (8192, 128) as an example to describe how to compute Adds.

To expose the impact of same-address conflicts, this scenario is designed so that each row holds 512 bytes of data (128 floats). In each round of compute, each core processes a 512 x 8 block of float elements and then performs full-core synchronization (which is not required in actual scenarios). A core can proceed to the next round only after all cores have completed the compute of the current data block.
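The sizes in this scenario can be checked with a little arithmetic. The sketch below is illustrative only: the constant names are hypothetical, loosely mirroring the tileM, tileN, n, and loopOneCore fields used in the sample code, and it assumes the column split across 16 cores with 4-byte floats, where the per-round block is tileM x tileN floats as the InitBuffer calls in the sample code indicate.

```cpp
#include <cstdint>

constexpr std::uint32_t kFloatBytes = 4;
static_assert(sizeof(float) == kFloatBytes, "this sketch assumes 4-byte floats");

constexpr std::uint32_t kRows = 8192;   // m: rows of the input
constexpr std::uint32_t kCols = 128;    // n: floats per row
constexpr std::uint32_t kCores = 16;    // cores used for compute
constexpr std::uint32_t kTileM = 512;   // rows processed per round

// Column split: each core owns kCols / kCores columns.
constexpr std::uint32_t kTileN = kCols / kCores;

// One row occupies exactly one 512-byte range.
constexpr std::uint32_t kRowBytes = kCols * kFloatBytes;

// Number of rounds each core needs to cover all rows.
constexpr std::uint32_t kLoopOneCore = kRows / kTileM;

// Elements handled by one core in total, and bytes moved per round.
constexpr std::uint32_t kElemsPerCore = kRows * kTileN;
constexpr std::uint32_t kTileBytes = kTileM * kTileN * kFloatBytes;
```

With these values, a row is 512 bytes, each core owns 8 columns, runs 16 rounds, and moves 16384 bytes per round.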

The following compares the original implementation with the optimized implementation.

Implementation method

Original implementation: Use 16 cores for compute and tile the data by column. Each core computes 8192 x 8 elements in total and executes the loop 16 times, processing a 512 x 8 block of float elements each time. In each round, multiple cores access the same rows of data (512 bytes each) at the same time, causing address conflicts.

Optimized implementation: The tiling is unchanged, but each core starts from a different data block. The addresses processed by the cores in each round are in different rows, so no consecutive 512-byte range is accessed by more than one core at the same time. Therefore, same-address access conflicts do not occur.

Diagram

[Figure: loop sequence of each core for the original and optimized implementations. Numbers 0 to 15 in the column direction indicate the execution order of the data blocks on each core.]

Sample code

          
Original implementation:

for (int32_t i = 0; i < tiling->loopOneCore; i++) {
    AscendC::SyncAll();
    CopyIn(i);
    Compute();
    AscendC::SyncAll();
    CopyOut(i);
}

          
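Optimized implementation:

for (int32_t i = 0; i < tiling->loopOneCore; i++) {
    // Rotate the block index by the core ID so that, in the same round,
    // different cores work on different rows.
    int32_t newProgress = (i + AscendC::GetBlockIdx()) % tiling->loopOneCore;
    AscendC::SyncAll();
    CopyIn(newProgress);
    Compute();
    AscendC::SyncAll();
    CopyOut(newProgress);
}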
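The effect of rotating the loop index can be illustrated with a small host-side model. This is a sketch under stated assumptions, not device code: MaxCoresPerRowBlock is a hypothetical helper, and it assumes that all cores sharing the same progress index in a round touch the same 512-byte rows, as in the column split described above.

```cpp
#include <algorithm>
#include <vector>

// Returns the largest number of cores that work on the same row block
// (that is, the same set of 512-byte rows) within any single round.
// rotate == false models the original sequence (progress = round);
// rotate == true models the rotated sequence ((round + core) % loops).
inline int MaxCoresPerRowBlock(int numCores, int loops, bool rotate) {
    int worst = 0;
    for (int round = 0; round < loops; ++round) {
        std::vector<int> hits(loops, 0);  // cores per row block in this round
        for (int core = 0; core < numCores; ++core) {
            const int progress = rotate ? (round + core) % loops : round;
            ++hits[progress];
        }
        worst = std::max(worst, *std::max_element(hits.begin(), hits.end()));
    }
    return worst;
}
```

In this model, with 16 cores and 16 rounds, the original sequence puts all 16 cores on the same row block in every round, while the rotated sequence puts at most one core on each row block. If there were fewer rounds than cores, some sharing would remain.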

Modifying the tiling policy

The following uses a float input with the shape of (8192, 128) as an example to describe how to compute Adds.

To expose the impact of same-address conflicts, this scenario is designed so that each row holds 512 bytes of data (128 floats). In each round of compute, each core processes a 512 x 8 block of float elements and then performs full-core synchronization (which is not required in actual scenarios). A core can proceed to the next round only after all cores have completed the compute of the current data block.

The following compares the original implementation with the optimized implementation.

Implementation method

Original implementation: The data is tiled by column. In each round, multiple cores access the same rows of data (512 bytes each) at the same time, causing address conflicts.

Optimized implementation: Use 16 cores for compute and tile the data by row. Each core computes 512 x 128 elements in total and executes the loop 16 times, processing a 512 x 8 block of float elements each time. The addresses processed by the cores in each round are in different rows, so no consecutive 512-byte range is accessed by more than one core at the same time. Therefore, same-address access conflicts do not occur.

Diagram

[Figure: loop sequence of each core (by row) for the original and optimized implementations, with blocks 0 to 15.]

Sample code
          
Original implementation:

__aicore__ inline void Init(GM_ADDR x, GM_ADDR z, AddsCustomTilingData* tilingPtr)
{
    tiling = tilingPtr;
    xGm.SetGlobalBuffer((__gm__ float *)x + AscendC::GetBlockIdx() * tiling->tileN);
    zGm.SetGlobalBuffer((__gm__ float *)z + AscendC::GetBlockIdx() * tiling->tileN);
    // we disable the L2 cache mode to highlight the influence of the gm address conflict
    xGm.SetL2CacheHint(AscendC::CacheMode::CACHE_MODE_DISABLE);
    zGm.SetL2CacheHint(AscendC::CacheMode::CACHE_MODE_DISABLE);
    pipe.InitBuffer(inQueueX, BUFFER_NUM, tiling->tileM * tiling->tileN * sizeof(float));
    pipe.InitBuffer(outQueueZ, BUFFER_NUM, tiling->tileM * tiling->tileN * sizeof(float));
}

          
Optimized implementation:

__aicore__ inline void Init(GM_ADDR x, GM_ADDR z, AddsCustomTilingData* tilingPtr)
{
    tiling = tilingPtr;
    // change the tile method from column split to row split
    xGm.SetGlobalBuffer((__gm__ float *)x + AscendC::GetBlockIdx() * tiling->tileM * tiling->n);
    zGm.SetGlobalBuffer((__gm__ float *)z + AscendC::GetBlockIdx() * tiling->tileM * tiling->n);
    // we disable the L2 cache mode to highlight the influence of the gm address conflict
    xGm.SetL2CacheHint(AscendC::CacheMode::CACHE_MODE_DISABLE);
    zGm.SetL2CacheHint(AscendC::CacheMode::CACHE_MODE_DISABLE);
    pipe.InitBuffer(inQueueX, BUFFER_NUM, tiling->tileM * tiling->tileN * sizeof(float));
    pipe.InitBuffer(outQueueZ, BUFFER_NUM, tiling->tileM * tiling->tileN * sizeof(float));
}
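The difference between the two SetGlobalBuffer offsets can be made concrete with a host-side sketch. The helper names below are hypothetical and only mirror the pointer arithmetic in the two Init functions; the sketch assumes 4-byte floats and a 512-byte-aligned base address.

```cpp
#include <cstdint>

constexpr std::uint64_t kRangeBytes = 512;

// Byte offset of a core's first element under the column split:
// base + blockIdx * tileN floats.
constexpr std::uint64_t ColumnSplitBaseBytes(std::uint64_t blockIdx, std::uint64_t tileN) {
    return blockIdx * tileN * sizeof(float);
}

// Byte offset of a core's first element under the row split:
// base + blockIdx * tileM * n floats.
constexpr std::uint64_t RowSplitBaseBytes(std::uint64_t blockIdx, std::uint64_t tileM,
                                          std::uint64_t n) {
    return blockIdx * tileM * n * sizeof(float);
}

// Index of the 512-byte range that contains a byte offset.
constexpr std::uint64_t RangeOf(std::uint64_t byteOffset) {
    return byteOffset / kRangeBytes;
}
```

With tileN = 8, the column split places the starting addresses of all 16 cores in range 0 (offsets 0, 32, ..., 480 bytes). With tileM = 512 and n = 128, the row split places core k at range k x 512, so the starting addresses of different cores never share a 512-byte range.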

Run the following msprof command to collect profile data for the preceding examples and compare the results.

msprof op --launch-count=3 --output=./prof ./execute_adds_op

In the PipeUtilization.csv file, compare the aiv_mte2_time(us) and aiv_mte3_time(us) values, which report the time consumed by the MTE2 and MTE3 transfer instructions.
