DataCacheCleanAndInvalid

Product Support

Product	Supported (Prototype Supporting dcciDst Configuration)	Supported (Prototype Not Supporting dcciDst Configuration)
Atlas A3 training products/Atlas A3 inference products	√	√
Atlas A2 training products/Atlas A2 inference products	√	√
Atlas 200I/500 A2 inference products	√	√
Atlas inference product's AI Core	x	√
Atlas inference product's Vector Core	x	x
Atlas training products	x	x

Function

In the AI Core, both the scalar unit and DMA unit may access the global memory.

Figure 1 Memory layers of the DataCache

As shown in the preceding figure:

The DMA unit reads and writes the global memory directly. Data moves between the local memory (such as the UB) and the global memory through APIs such as DataCopy, so no cache consistency issue arises on this path.
The scalar unit accesses the global memory through the data cache of its own core. The data cache and the global memory can therefore hold different values for the same address, which is a cache consistency issue.

This API refreshes the data cache to ensure cache consistency. The application scenarios are as follows:

The scalar unit reads data in the global memory that other cores may have modified. In this case, call DataCacheCleanAndInvalid before the read so that the read goes to the global memory and obtains the latest data.

The scalar unit writes data that must reach the global memory immediately. In this case, call DataCacheCleanAndInvalid after the write.
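The two scenarios can be pictured with a small host-side toy model. This is only a sketch under stated assumptions: it models a single 64-byte write-back cache line per core in plain C++, and ToyCore, ToyResult, and RunToyScenarios are hypothetical names that are not part of Ascend C or a description of the actual hardware.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Toy model only. ToyCore, ToyResult, and RunToyScenarios are hypothetical names
// introduced for illustration; they are not part of Ascend C.
constexpr std::size_t kLineBytes = 64;

struct ToyCore {
    std::uint8_t* gm;                    // shared array standing in for global memory
    std::uint8_t line[kLineBytes] = {};  // private copy of one cache line
    std::size_t base = 0;                // address of the cached line in gm
    bool valid = false;
    bool dirty = false;

    explicit ToyCore(std::uint8_t* globalMem) : gm(globalMem) {}

    // Write a dirty line back to gm, then drop the cached copy.
    void CleanAndInvalidate() {
        if (valid && dirty) std::memcpy(gm + base, line, kLineBytes);
        valid = false;
        dirty = false;
    }
    // Make sure the line that contains addr is cached.
    void Load(std::size_t addr) {
        std::size_t b = addr / kLineBytes * kLineBytes;
        if (valid && b == base) return;
        CleanAndInvalidate();
        std::memcpy(line, gm + b, kLineBytes);
        base = b;
        valid = true;
    }
    std::uint8_t ScalarRead(std::size_t addr) { Load(addr); return line[addr - base]; }
    void ScalarWrite(std::size_t addr, std::uint8_t v) { Load(addr); line[addr - base] = v; dirty = true; }
};

struct ToyResult { int staleRead, freshRead, gmBeforeClean, gmAfterClean; };

inline ToyResult RunToyScenarios() {
    std::uint8_t gm[2 * kLineBytes] = {};
    ToyCore core0(gm), core1(gm);
    ToyResult r{};

    // Scenario 1: core0 caches address 0, then core1 changes it in global memory.
    core0.ScalarRead(0);
    core1.ScalarWrite(0, 7);
    core1.CleanAndInvalidate();
    r.staleRead = core0.ScalarRead(0);   // served from core0's cached copy
    core0.CleanAndInvalidate();
    r.freshRead = core0.ScalarRead(0);   // reloaded from global memory

    // Scenario 2: a scalar write stays in the cache until the line is cleaned.
    core0.ScalarWrite(kLineBytes, 9);
    r.gmBeforeClean = gm[kLineBytes];
    core0.CleanAndInvalidate();
    r.gmAfterClean = gm[kLineBytes];
    return r;
}
```

In this model, RunToyScenarios() returns staleRead = 0 and freshRead = 7 for the first scenario, and gmBeforeClean = 0 and gmAfterClean = 9 for the second. The clean-and-invalidate step is what makes the cached view and the global memory agree.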

Prototype

Prototype that supports the dcciDst configuration. Set dcciDst to select the storage with which the data cache is kept consistent:

template <typename T, CacheLine entireType, DcciDst dcciDst>
__aicore__ inline void DataCacheCleanAndInvalid(const GlobalTensor<T>& dst)

The LocalTensor overload is reserved for future use:

template <typename T, CacheLine entireType, DcciDst dcciDst>
__aicore__ inline void DataCacheCleanAndInvalid(const LocalTensor<T>& dst)

Prototype that does not support the dcciDst configuration. Consistency is always ensured between the data cache and the global memory:

template <typename T, CacheLine entireType>
__aicore__ inline void DataCacheCleanAndInvalid(const GlobalTensor<T>& dst)

Parameters

**Table 1** Template parameters
Parameter	Description
T	Data type of dst.
entireType	Operation mode, of the CacheLine type. The options are as follows: SINGLE_CACHE_LINE: Refreshes only the cache line that contains the input address. If the address is not 64-byte aligned, the refresh covers only the data from the input address up to the next 64-byte boundary. ENTIRE_DATA_CACHE: Ignores the input address and refreshes the entire data cache of the core. This operation is slow; use it with caution in performance-sensitive scenarios.
dcciDst	Storage with which the data cache is kept consistent, of the DcciDst type. CACHELINE_ALL: Same effect as CACHELINE_OUT. CACHELINE_UB: Reserved parameter, which is not supported currently. CACHELINE_OUT: Ensures the consistency between the data cache and the global memory. CACHELINE_ATOMIC: Reserved parameter, which is not supported currently on Atlas A3 training products/Atlas A3 inference products, Atlas A2 training products/Atlas A2 inference products, Atlas 200I/500 A2 inference products, or the AI Core of Atlas inference products.
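The alignment rule for SINGLE_CACHE_LINE comes down to simple address arithmetic. The following host-side C++ sketch assumes the 64-byte cache line stated above and element-aligned addresses; CacheLineBase, ElementsToLineEnd, and RefreshOffsets are hypothetical helper names, not Ascend C APIs.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kCacheLineBytes = 64;  // cache line size stated in Table 1

// Base address of the cache line that contains addr.
constexpr std::uint64_t CacheLineBase(std::uint64_t addr) {
    return addr & ~(kCacheLineBytes - 1);
}

// Number of whole elements between addr and the next 64-byte boundary.
constexpr std::size_t ElementsToLineEnd(std::uint64_t addr, std::size_t elemBytes) {
    return static_cast<std::size_t>((CacheLineBase(addr) + kCacheLineBytes - addr) / elemBytes);
}

// Element offsets at which a single-cache-line refresh would have to be issued
// so that count elements starting at addr are all covered.
inline std::vector<std::size_t> RefreshOffsets(std::uint64_t addr, std::size_t elemBytes,
                                               std::size_t count) {
    std::vector<std::size_t> offsets;
    std::size_t i = 0;
    while (i < count) {
        offsets.push_back(i);
        std::size_t step = ElementsToLineEnd(addr + i * elemBytes, elemBytes);
        i += (step == 0 ? 1 : step);  // guard against elements that straddle a line boundary
    }
    return offsets;
}
```

For eight uint64_t elements, RefreshOffsets(0x40, 8, 8) yields a single offset 0, while RefreshOffsets(0x20, 8, 8) yields offsets 0 and 4, which means that two refresh calls would be needed for the unaligned start address.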

**Table 2** Parameters
Parameter	Input/Output	Description
dst	Input	Tensor for which the cache needs to be refreshed.

Returns

None

Restrictions

None

Example

// Example 1: SINGLE_CACHE_LINE mode. Assume that mmAddr_ is 0x40 (64-byte aligned).
AscendC::GlobalTensor<uint64_t> global;
global.SetGlobalBuffer((__gm__ uint64_t*)mmAddr_ + AscendC::GetBlockIdx() * 1024);
for (int i = 0; i < 8; i++) {
    global.SetValue(i, AscendC::GetBlockIdx());
}
// Because the start address is 64-byte aligned, the first 8 elements (64 bytes) are updated in the global memory immediately after DataCacheCleanAndInvalid is called.
AscendC::DataCacheCleanAndInvalid<uint64_t, AscendC::CacheLine::SINGLE_CACHE_LINE, AscendC::DcciDst::CACHELINE_OUT>(global);
// Example 2: SINGLE_CACHE_LINE mode. Assume that mmAddr_ is 0x20 (not 64-byte aligned).
AscendC::GlobalTensor<uint64_t> global;
global.SetGlobalBuffer((__gm__ uint64_t*)mmAddr_ + AscendC::GetBlockIdx() * 1024);
for (int i = 0; i < 8; i++) {
    global.SetValue(i, AscendC::GetBlockIdx());
}
// Because the start address is not 64-byte aligned, one call updates only the part from the start address to the next 64-byte boundary (the first 4 elements).
AscendC::DataCacheCleanAndInvalid<uint64_t, AscendC::CacheLine::SINGLE_CACHE_LINE, AscendC::DcciDst::CACHELINE_OUT>(global);
// DataCacheCleanAndInvalid must be called again to update the last 4 elements.
AscendC::DataCacheCleanAndInvalid<uint64_t, AscendC::CacheLine::SINGLE_CACHE_LINE, AscendC::DcciDst::CACHELINE_OUT>(global[4]);
// Example 3: SINGLE_CACHE_LINE mode in a multi-core scenario. Assume that mmAddr_ is 0x40 (64-byte aligned). (This example only helps developers understand a usage restriction.)
AscendC::GlobalTensor<uint64_t> global;
global.SetGlobalBuffer((__gm__ uint64_t*)mmAddr_);
global.SetValue(AscendC::GetBlockIdx(), AscendC::GetBlockIdx());
// Each core writes a different element, but all the elements share one cache line. Unlike on a general-purpose CPU, the cores overwrite one another's data within that cache line.
// After DataCacheCleanAndInvalid is called, the final result is nondeterministic because the cores run at different times. The result of a core that executes later overwrites that of an earlier core.
AscendC::DataCacheCleanAndInvalid<uint64_t, AscendC::CacheLine::SINGLE_CACHE_LINE, AscendC::DcciDst::CACHELINE_OUT>(global);
// Example 4: ENTIRE_DATA_CACHE mode. Assume that mmAddr_ is 0x20 (not 64-byte aligned).
// This example only helps developers understand a usage restriction.
AscendC::GlobalTensor<uint64_t> global;
global.SetGlobalBuffer((__gm__ uint64_t*)mmAddr_ + AscendC::GetBlockIdx() * 1024);
for (int i = 0; i < 8; i++) {
    global.SetValue(i, AscendC::GetBlockIdx());
}
// Refresh the entire data cache of the core. The input address is ignored, and the operation is slow.
AscendC::DataCacheCleanAndInvalid<uint64_t, AscendC::CacheLine::ENTIRE_DATA_CACHE, AscendC::DcciDst::CACHELINE_OUT>(global);
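The overwrite behavior in Example 3 can be reproduced with a host-side toy model. The sketch below assumes that each core holds a private copy of the whole 64-byte line and writes the whole line back when it is refreshed; this is a simplification for illustration, not a description of the actual hardware. SimulateWholeLineWriteBack is a hypothetical name, and each core writes its index plus 1 so that its value can be told apart from the initial zeros.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kElemsPerLine = 8;  // one 64-byte line holds 8 uint64_t elements

using Line = std::array<std::uint64_t, kElemsPerLine>;

// Hypothetical toy model: every core caches the initial line, writes its own element,
// and then writes the whole line back in the given order (a permutation of the core
// indices, with at most kElemsPerLine cores).
inline Line SimulateWholeLineWriteBack(const std::vector<std::size_t>& writeBackOrder) {
    Line gm{};                                           // line in global memory, all zeros
    std::vector<Line> cached(writeBackOrder.size(), gm); // private copy per core
    for (std::size_t core = 0; core < cached.size(); ++core) {
        cached[core][core] = core + 1;                   // each core touches only its own element
    }
    for (std::size_t core : writeBackOrder) {
        gm = cached[core];                               // whole-line write-back: last one wins
    }
    return gm;
}
```

With two cores, the order {0, 1} leaves element 0 at 0 and element 1 at 2, while the order {1, 0} leaves element 0 at 1 and element 1 at 0. In both cases one core's write is lost. Giving each core its own cache line, as the per-core offset of GetBlockIdx() * 1024 elements does in Examples 1 and 2, avoids sharing a line in this model.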
