Setting a Proper L2 CacheMode

[Priority] High

This performance optimization guide applies to the following product models:

Atlas A3 training products / Atlas A3 inference products
Atlas A2 training products / Atlas A2 inference products

[Description] The L2 cache is used to cache frequently accessed data. The following figure shows the physical location of the L2 cache.

The bandwidth of the L2 cache is several times that of the global memory (GM), so data that hits the L2 cache is moved in a fraction of the time needed to fetch it from the GM. In general, a higher L2 cache hit ratio means better operator performance. In practice, set the L2 cache mode so that data that is read repeatedly stays in the L2 cache.
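The link between hit ratio and data movement time can be sketched with a simple first-order model. The function below is an illustration only: the name EstimateTransferMs and all bandwidth values passed to it are hypothetical, and the model ignores latency, overlap between transfers, and write-back traffic.

```cpp
#include <cassert>

// First-order estimate of the time needed to move `bytes` of data when a
// fraction `hitRatio` is served from the L2 cache and the rest from the GM.
// Bandwidths are given in bytes per millisecond and are assumptions supplied
// by the caller, not measured hardware figures.
double EstimateTransferMs(double bytes, double hitRatio,
                          double l2BytesPerMs, double gmBytesPerMs) {
    assert(hitRatio >= 0.0 && hitRatio <= 1.0);
    assert(l2BytesPerMs > 0.0 && gmBytesPerMs > 0.0);
    // Time is the hit portion at L2 speed plus the miss portion at GM speed.
    return bytes * (hitRatio / l2BytesPerMs + (1.0 - hitRatio) / gmBytesPerMs);
}
```

Under this model, if the L2 cache is assumed to be four times faster than the GM, raising the hit ratio from 0 to 1 cuts the estimated transfer time to one quarter.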

L2 Cache Access Principle and Cache Mode

When data is moved in through MTE2, the typical process of accessing the L2 cache is as follows:

When data is moved out through MTE3 or Fixpipe, the typical process of accessing the L2 cache is as follows:

As shown in the preceding process, when the total amount of data accessed exceeds the capacity of the L2 cache, the AI Core has to replace data that is already in the L2 cache. To keep the cache consistent, the old data is first written back to the GM, which consumes GM bandwidth. Only after the write-back completes can the new data enter the L2 cache.

You can set the cache mode for the data to be accessed. For global memory data that is accessed only once, you can set the cache mode so that the data does not enter the L2 cache. The L2 cache is then used more efficiently for data that is read repeatedly, and data that is accessed only once cannot evict data that is still useful.
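The eviction effect described above can be reproduced with a toy model. The sketch below assumes a fully associative cache with LRU replacement, which is an assumption made for illustration and not a statement about the actual L2 replacement policy. All names (ToyLruCache, CountReusedHits) and sizes are hypothetical. A small reused tensor is read in three passes, and each read is followed by two accesses to streaming data that is touched only once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

// Toy fully associative LRU cache that tracks line IDs only (no data).
class ToyLruCache {
public:
    explicit ToyLruCache(std::size_t capacityLines) : capacity_(capacityLines) {
        assert(capacityLines > 0);
    }

    // Returns true on a hit. On a miss the line is inserted, evicting the
    // least recently used line if the cache is full.
    bool Access(std::uint64_t lineId) {
        auto it = index_.find(lineId);
        if (it != index_.end()) {
            order_.splice(order_.begin(), order_, it->second);  // mark most recent
            return true;
        }
        if (order_.size() == capacity_) {
            index_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(lineId);
        index_[lineId] = order_.begin();
        return false;
    }

private:
    std::size_t capacity_;
    std::list<std::uint64_t> order_;  // front = most recently used
    std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> index_;
};

// Counts cache hits on the reused lines over three passes. When
// streamingBypassesCache is true, the one-time data never enters the cache.
int CountReusedHits(bool streamingBypassesCache) {
    const std::size_t kCapacityLines = 8;  // hypothetical cache size
    const int kReusedLines = 4;            // reused data fits in the cache
    const int kPasses = 3;
    ToyLruCache cache(kCapacityLines);
    std::uint64_t nextStreamId = 1000;     // every streaming line is unique
    int hits = 0;
    for (int pass = 0; pass < kPasses; ++pass) {
        for (int i = 0; i < kReusedLines; ++i) {
            if (cache.Access(static_cast<std::uint64_t>(i))) {
                ++hits;
            }
            for (int s = 0; s < 2; ++s) {  // one read-once line, one write-once line
                std::uint64_t id = nextStreamId++;
                if (!streamingBypassesCache) {
                    cache.Access(id);
                }
            }
        }
    }
    return hits;
}
```

In this model, when the streaming data is cached, 11 other distinct lines are touched between two reads of the same reused line, which exceeds the 8-line capacity, so every reread misses. When the streaming data bypasses the cache, only the first pass misses.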

Setting the L2 Cache Mode

Ascend C provides the SetL2CacheHint API on GlobalTensor, which lets you specify the CacheMode for the data that the tensor refers to.

Consider the following scenario: add two tensors whose shapes differ. The input x has shape (5120, 5120), the input y has shape (5120, 15360), and the output z has shape (5120, 15360). Because y is three times as wide as x, x is added to each of the three data blocks of y in turn. The scenario exists mainly to demonstrate the effect of CacheMode: the sample code deliberately transfers x repeatedly, which a real design would not need to do. For details, refer to the sample of setting a proper L2 CacheMode.
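As a host-side illustration of this scenario, the sketch below checks the tensor sizes and gives a reference version of the computation. It assumes float elements, row-major storage, and that the three data blocks of y are column blocks of width 5120; these are interpretations of the description, not details confirmed by the sample. The names TensorBytes and BlockwiseAddReference are hypothetical, and MB is taken to mean 2^20 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kMiB = 1024u * 1024u;

// Bytes occupied by a rows x cols tensor of float elements.
constexpr std::size_t TensorBytes(std::size_t rows, std::size_t cols) {
    return rows * cols * sizeof(float);
}

// Sizes used in the scenario: x is 100 MB, y and z are 300 MB each.
static_assert(TensorBytes(5120, 5120) == 100 * kMiB, "x should be 100 MB");
static_assert(TensorBytes(5120, 15360) == 300 * kMiB, "y and z should be 300 MB");

// Reference result, assuming row-major storage and column blocks:
// z[r][c] = x[r][c % xCols] + y[r][c].
std::vector<float> BlockwiseAddReference(const std::vector<float>& x,
                                         const std::vector<float>& y,
                                         std::size_t rows,
                                         std::size_t xCols,
                                         std::size_t yCols) {
    assert(xCols > 0 && yCols % xCols == 0);
    assert(x.size() == rows * xCols && y.size() == rows * yCols);
    std::vector<float> z(y.size());
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < yCols; ++c) {
            // The same element of x is reused once per column block of y.
            z[r * yCols + c] = x[r * xCols + c % xCols] + y[r * yCols + c];
        }
    }
    return z;
}
```

Each element of x is read once per column block of y, which is why x is the tensor that benefits from staying in the L2 cache, whereas each element of y and z is touched only once.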

Original Implementation

Implementation method

The total data amount is 700 MB, where x is 100 MB, y is 300 MB, and z is 300 MB.

Use 40 cores for compute and tile the data by column.

Set CacheMode of the GlobalTensors corresponding to x, y, and z to CACHE_MODE_NORMAL and pass through the L2 cache. The total data amount that needs to enter the L2 cache is 700 MB.

Sample code

    xGm.SetGlobalBuffer((__gm__ float *)x + AscendC::GetBlockIdx() * TILE_N);
    yGm.SetGlobalBuffer((__gm__ float *)y + AscendC::GetBlockIdx() * TILE_N);
    zGm.SetGlobalBuffer((__gm__ float *)z + AscendC::GetBlockIdx() * TILE_N);

Optimized Implementation

Implementation method

The total data amount is 700 MB, where x is 100 MB, y is 300 MB, and z is 300 MB.

Use 40 cores for compute and tile the data by column.

Set CacheMode of GlobalTensor corresponding to x to CACHE_MODE_NORMAL, and CacheMode of GlobalTensor corresponding to y and z to CACHE_MODE_DISABLE. Only x that needs to be frequently accessed is set to pass through the L2 cache. The total data amount that needs to enter the L2 cache is 100 MB.

Sample code

    xGm.SetGlobalBuffer((__gm__ float *)x + AscendC::GetBlockIdx() * TILE_N);
    yGm.SetGlobalBuffer((__gm__ float *)y + AscendC::GetBlockIdx() * TILE_N);
    zGm.SetGlobalBuffer((__gm__ float *)z + AscendC::GetBlockIdx() * TILE_N);
    // disable the L2 cache mode of y and z
    yGm.SetL2CacheHint(AscendC::CacheMode::CACHE_MODE_DISABLE);
    zGm.SetL2CacheHint(AscendC::CacheMode::CACHE_MODE_DISABLE);
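The 700 MB and 100 MB figures quoted for the two implementations can be reproduced with a small host-side calculation. This is a bookkeeping sketch only: ToyCacheMode is a hypothetical stand-in for AscendC::CacheMode, and TensorPlan and L2FootprintMB are illustrative names that are not part of Ascend C.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the cache mode of a tensor (not AscendC::CacheMode).
enum class ToyCacheMode { Normal, Disable };

// Size of one tensor in MB together with the cache mode chosen for it.
struct TensorPlan {
    std::size_t sizeMB;
    ToyCacheMode mode;
};

// Sums the sizes of all tensors that are allowed to enter the L2 cache.
std::size_t L2FootprintMB(const TensorPlan* plans, std::size_t count) {
    assert(plans != nullptr || count == 0);
    std::size_t total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (plans[i].mode == ToyCacheMode::Normal) {
            total += plans[i].sizeMB;
        }
    }
    return total;
}
```

With x, y, and z entered as 100 MB, 300 MB, and 300 MB, the function returns 700 when all three tensors use the normal mode and 100 when only x does.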

To compare the two implementations, run the following command to collect profile data for the preceding example with the msprof tool.

msprof op --launch-count=2 --output=./prof ./execute_add_op

In the Memory.csv file, compare the aiv_gm_to_ub_bw (GB/s) and aiv_main_mem_write_bw (GB/s) bandwidth rates of the two implementations.
