DataCacheCleanAndInvalid

Supported Products

| Product | Supported (prototypes with configurable dcciDst) | Supported (prototype without configurable dcciDst) |
|---|---|---|
| Atlas A3 training products/Atlas A3 inference products | √ | √ |
| Atlas A2 training products/Atlas A2 inference products | √ | √ |
| Atlas 200I/500 A2 inference products | √ | √ |
| Atlas inference product's AI Core | √ | x |
| Atlas inference product's Vector Core | x | x |
| Atlas training products | x | x |

Function Usage

In the AI Core, both the scalar unit and DMA unit may access the global memory.

Figure 1 Memory layers of the DataCache

As shown in the preceding figure:

  • The DMA unit reads and writes the global memory directly. Data moves between the local memory (such as the UB) and the global memory through APIs such as DataCopy, so this path has no cache consistency issue.
  • The scalar unit accesses the global memory through the data cache of its own core. The data cache and the global memory can therefore become inconsistent.

This API is used to refresh the cache to ensure cache consistency. The application scenarios are as follows:

  • Reading data from the global memory that may have been modified externally, for example by other cores. Call DataCacheCleanAndInvalid so that the read goes to the global memory and returns the latest data.
  • Writing data to the global memory immediately through the scalar unit. Call DataCacheCleanAndInvalid after the write so that the cached data reaches the global memory.
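
The two scenarios above can be illustrated with a toy host-side model. This is a sketch under stated assumptions, not Ascend C code and not a description of the hardware: it assumes a write-back cache with a single 64-byte line, and every name in it (ToyGlobalMemory, ToyScalarCache, CleanAndInvalidate, ReadAfterExternalUpdate, GlobalValueAfterScalarWrite) is hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy model only: one 64-byte line (8 x uint64_t) cached in front of "global memory".
constexpr std::size_t kLineElems = 8;

struct ToyGlobalMemory {
    std::array<std::uint64_t, kLineElems> data{};
};

struct ToyScalarCache {
    std::array<std::uint64_t, kLineElems> line{};
    bool valid = false;  // the line holds a copy of global memory
    bool dirty = false;  // the line holds writes that global memory has not seen

    // Scalar read: served from the cache once the line is valid.
    std::uint64_t Read(const ToyGlobalMemory& gm, std::size_t i) {
        assert(i < kLineElems);
        if (!valid) { line = gm.data; valid = true; }
        return line[i];
    }

    // Scalar write: lands in the cache first (assumed write-back policy).
    void Write(const ToyGlobalMemory& gm, std::size_t i, std::uint64_t v) {
        assert(i < kLineElems);
        if (!valid) { line = gm.data; valid = true; }
        line[i] = v;
        dirty = true;
    }

    // Clean (write dirty data back), then invalidate (drop the cached copy).
    void CleanAndInvalidate(ToyGlobalMemory& gm) {
        if (valid && dirty) { gm.data = line; }
        valid = false;
        dirty = false;
    }
};

// Scenario 1: global memory changes after this core has cached the line.
// Returns the value the scalar read sees, with or without a refresh first.
inline std::uint64_t ReadAfterExternalUpdate(bool refresh) {
    ToyGlobalMemory gm;
    ToyScalarCache cache;
    cache.Read(gm, 0);   // caches the line; element 0 is 0
    gm.data[0] = 42;     // external update, e.g. by another core
    if (refresh) { cache.CleanAndInvalidate(gm); }
    return cache.Read(gm, 0);
}

// Scenario 2: returns what global memory holds after a scalar write.
inline std::uint64_t GlobalValueAfterScalarWrite(bool refresh) {
    ToyGlobalMemory gm;
    ToyScalarCache cache;
    cache.Write(gm, 0, 7);
    if (refresh) { cache.CleanAndInvalidate(gm); }
    return gm.data[0];
}
```

In this model, ReadAfterExternalUpdate(false) returns the stale 0 while ReadAfterExternalUpdate(true) returns 42, and GlobalValueAfterScalarWrite returns 7 only when the refresh is performed.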

Prototype

  • Ensure the consistency between the data cache and GM storage by setting dcciDst.
    template <typename T, CacheLine entireType, DcciDst dcciDst>
    __aicore__ inline void DataCacheCleanAndInvalid(const GlobalTensor<T>& dst)
    
  • This API is reserved for future use.
    template <typename T, CacheLine entireType, DcciDst dcciDst>
    __aicore__ inline void DataCacheCleanAndInvalid(const LocalTensor<T>& dst)
    
  • dcciDst cannot be configured. Consistency is ensured between the data cache and the GM.
    template <typename T, CacheLine entireType>
    __aicore__ inline void DataCacheCleanAndInvalid(const GlobalTensor<T>& dst)
    

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Data type of dst.

entireType

Command operation mode. The options are as follows:

SINGLE_CACHE_LINE: Only the cache line that contains the input address is refreshed. If the address is not 64-byte aligned, only the span from the input address to the next 64-byte boundary is operated on.

ENTIRE_DATA_CACHE: The input address is ignored, and the entire data cache of the core is refreshed. This operation is slow, so use it with caution in performance-sensitive scenarios.

dcciDst

Storage with which the data cache is kept consistent. The type is DcciDst.

  • CACHELINE_ALL: The effect is the same as that of CACHELINE_OUT.
  • CACHELINE_UB: Reserved.
  • CACHELINE_OUT: Ensure the consistency between the data cache and the global memory.
  • CACHELINE_ATOMIC:
    • Atlas A3 training products/Atlas A3 inference products: This parameter is reserved and is not supported currently.
    • Atlas A2 training products/Atlas A2 inference products: This parameter is reserved and is not supported currently.
    • Atlas 200I/500 A2 inference products: This parameter is reserved and is not supported currently.
    • Atlas inference product's AI Core: This parameter is reserved and is not supported currently.

Table 2 Parameters

| Parameter | Input/Output | Description |
|---|---|---|
| dst | Input | Tensor for which the cache needs to be refreshed. |
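
The alignment rule for SINGLE_CACHE_LINE can be worked through with plain address arithmetic. The following is a minimal host-side sketch, assuming the 64-byte granularity described for entireType. The helper names BytesCoveredBySingleLine and CallsNeeded are hypothetical and are not part of Ascend C.

```cpp
#include <cassert>
#include <cstdint>

// Granularity taken from the entireType description: 64-byte cache lines.
constexpr std::uint64_t kCacheLineBytes = 64;

// Bytes that one SINGLE_CACHE_LINE call covers when it starts at addr:
// the span from addr up to the next 64-byte boundary.
constexpr std::uint64_t BytesCoveredBySingleLine(std::uint64_t addr) {
    return kCacheLineBytes - (addr % kCacheLineBytes);
}

// Number of cache lines touched by [addr, addr + lengthBytes), which is the
// number of SINGLE_CACHE_LINE calls needed to cover that range.
inline std::uint64_t CallsNeeded(std::uint64_t addr, std::uint64_t lengthBytes) {
    assert(lengthBytes > 0);
    const std::uint64_t firstLine = addr / kCacheLineBytes;
    const std::uint64_t lastLine = (addr + lengthBytes - 1) / kCacheLineBytes;
    return lastLine - firstLine + 1;
}
```

For example, a 64-byte buffer at 0x40 is covered by one call, whereas the same buffer at 0x20 is covered for only 32 bytes by the first call and needs a second call for the remainder.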

Returns

None

Constraints

None

Example

// Example 1: SINGLE_CACHE_LINE mode. Assume that mmAddr_ is 0x40 (64-byte aligned).
AscendC::GlobalTensor<uint64_t> global;
global.SetGlobalBuffer((__gm__ uint64_t*)mmAddr_ + AscendC::GetBlockIdx() * 1024);
for (int i = 0; i < 8; i++) {
    global.SetValue(i, AscendC::GetBlockIdx());
}
// Because the start address is 64-byte aligned, one call covers the whole cache line: all 8 elements (64 bytes) are updated immediately after DataCacheCleanAndInvalid is called.
AscendC::DataCacheCleanAndInvalid<uint64_t, AscendC::CacheLine::SINGLE_CACHE_LINE, AscendC::DcciDst::CACHELINE_OUT>(global);

// Example 2: SINGLE_CACHE_LINE mode. Assume that mmAddr_ is 0x20 (not 64-byte aligned).
AscendC::GlobalTensor<uint64_t> global;
global.SetGlobalBuffer((__gm__ uint64_t*)mmAddr_ + AscendC::GetBlockIdx() * 1024);
for (int i = 0; i < 8; i++) {
    global.SetValue(i, AscendC::GetBlockIdx());
}
// Because the start address is not 64-byte aligned, one call updates only the span from the start address to the next 64-byte boundary, that is, the first 4 elements (32 bytes).
AscendC::DataCacheCleanAndInvalid<uint64_t, AscendC::CacheLine::SINGLE_CACHE_LINE, AscendC::DcciDst::CACHELINE_OUT>(global);
// Call DataCacheCleanAndInvalid again, starting at element 4, to update the last 4 elements.
AscendC::DataCacheCleanAndInvalid<uint64_t, AscendC::CacheLine::SINGLE_CACHE_LINE, AscendC::DcciDst::CACHELINE_OUT>(global[4]);

// Example 3: SINGLE_CACHE_LINE mode in a multi-core scenario. Assume that mmAddr_ is 0x40 (64-byte aligned).
// This example only illustrates a usage restriction.
AscendC::GlobalTensor<uint64_t> global;
global.SetGlobalBuffer((__gm__ uint64_t*)mmAddr_);
global.SetValue(AscendC::GetBlockIdx(), AscendC::GetBlockIdx());
// The cores write to different addresses, but those addresses fall in the same cache line. As a result, data is overwritten unpredictably, unlike on a general-purpose CPU.
// After DataCacheCleanAndInvalid is called, the final result is nondeterministic because the cores execute at different times: the result of the core that executes later overwrites that of the earlier core.
AscendC::DataCacheCleanAndInvalid<uint64_t, AscendC::CacheLine::SINGLE_CACHE_LINE, AscendC::DcciDst::CACHELINE_OUT>(global);

// Example 4: ENTIRE_DATA_CACHE mode. Assume that mmAddr_ is 0x20 (not 64-byte aligned).
// This example only illustrates a usage restriction.
AscendC::GlobalTensor<uint64_t> global;
global.SetGlobalBuffer((__gm__ uint64_t*)mmAddr_ + AscendC::GetBlockIdx() * 1024);
for (int i = 0; i < 8; i++) {
    global.SetValue(i, AscendC::GetBlockIdx());
}
// Refresh the entire data cache. The input address is ignored in this mode, and the operation is slow.
AscendC::DataCacheCleanAndInvalid<uint64_t, AscendC::CacheLine::ENTIRE_DATA_CACHE, AscendC::DcciDst::CACHELINE_OUT>(global);
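
The overwrite in Example 3 can be reproduced with a toy host-side model. This is only a sketch: it assumes that each core holds a private copy of the full 64-byte line and writes the whole line back on a flush, which the text above does not state explicitly. ToyCore, Line, and RunTwoCores are hypothetical names, not Ascend C APIs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLineElems = 8;  // 8 x uint64_t = one 64-byte line
using Line = std::array<std::uint64_t, kLineElems>;

// Toy core: caches the whole line, updates one element, and later writes
// the whole line back (assumed behavior for this illustration).
struct ToyCore {
    Line cached{};
    void Load(const Line& gm) { cached = gm; }
    void SetValue(std::size_t i, std::uint64_t v) {
        assert(i < kLineElems);
        cached[i] = v;
    }
    void Flush(Line& gm) const { gm = cached; }
};

// Two cores write different elements of the same line, as in Example 3.
// core1FlushesLast selects the order in which the flushes reach global memory.
inline Line RunTwoCores(bool core1FlushesLast) {
    Line gm{};
    gm.fill(99);  // initial contents of global memory

    ToyCore core0, core1;
    core0.Load(gm);
    core1.Load(gm);
    core0.SetValue(0, 0);  // block 0 writes element 0
    core1.SetValue(1, 1);  // block 1 writes element 1

    if (core1FlushesLast) {
        core0.Flush(gm);
        core1.Flush(gm);  // element 0 reverts to 99: the write of core 0 is lost
    } else {
        core1.Flush(gm);
        core0.Flush(gm);  // element 1 reverts to 99: the write of core 1 is lost
    }
    return gm;
}
```

Whichever core flushes last determines the contents of the whole line in this model, so one of the two writes is always lost. Examples 1, 2, and 4 avoid the problem by giving each core its own offset (GetBlockIdx() * 1024 elements), which keeps the cores on separate cache lines.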