DataCacheCleanAndInvalid
Supported Products
| Product | Supported (prototype with configurable dcciDst) | Supported (prototype without configurable dcciDst) |
|---|---|---|
| | √ | √ |
| | √ | √ |
| | √ | √ |
| | x | √ |
| | x | x |
| | x | x |
Function Usage
In the AI Core, both the scalar unit and the DMA unit can access the global memory.

As shown in the preceding figure:
- The DMA unit reads and writes the global memory. Data is exchanged between the local memory (such as the UB) and the global memory through APIs such as DataCopy, so no cache consistency issue arises.
- The scalar unit accesses the global memory through the data cache in each core, so the data cache and the global memory can become inconsistent.
This API refreshes the data cache to ensure cache consistency. The application scenarios are as follows:
- Reading data in the global memory that may have been modified externally, for example by other cores. Call DataCacheCleanAndInvalid so that the read obtains the latest data from the global memory rather than a stale cached copy.
- Writing data to the global memory through the scalar unit when the write must take effect immediately. Call DataCacheCleanAndInvalid after the write.
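The two scenarios above can be illustrated with a toy model in plain host C++. This is only a sketch under stated assumptions: `ToyDataCache`, `GlobalMemory`, and both scenario functions are hypothetical names invented for illustration, the cache holds a single line, and nothing here is Ascend C code or a description of the real hardware.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy model only: hypothetical types, not part of Ascend C.
constexpr std::size_t kWords = 8;  // one 64-byte line of uint64_t values
using GlobalMemory = std::array<std::uint64_t, kWords>;

// Hypothetical single-line, write-back data cache private to one core.
class ToyDataCache {
public:
    explicit ToyDataCache(GlobalMemory& gm) : gm_(gm) {}

    // Scalar read: served from the cached copy once the line is loaded.
    std::uint64_t Read(std::size_t i) {
        Load();
        return line_[i];
    }

    // Scalar write: updates only the cached copy and marks the line dirty.
    void Write(std::size_t i, std::uint64_t v) {
        Load();
        line_[i] = v;
        dirty_ = true;
    }

    // Clean (write back if dirty), then invalidate (drop the cached copy).
    void CleanAndInvalidate() {
        if (valid_ && dirty_) {
            gm_ = line_;
        }
        valid_ = false;
        dirty_ = false;
    }

private:
    void Load() {
        if (!valid_) {
            line_ = gm_;
            valid_ = true;
            dirty_ = false;
        }
    }

    GlobalMemory& gm_;
    GlobalMemory line_{};
    bool valid_ = false;
    bool dirty_ = false;
};

// Scenario 1: global memory changes externally after the line was cached.
inline bool StaleReadIsFixedByCleanAndInvalidate() {
    GlobalMemory gm{};
    ToyDataCache cache(gm);
    std::uint64_t first = cache.Read(0);  // caches the line, reads 0
    gm[0] = 42;                           // external update bypasses the cache
    std::uint64_t stale = cache.Read(0);  // still served from the cached copy
    cache.CleanAndInvalidate();
    std::uint64_t fresh = cache.Read(0);  // reloaded from global memory
    return first == 0 && stale == 0 && fresh == 42;
}

// Scenario 2: a scalar write stays in the cache until it is cleaned.
inline bool WriteReachesMemoryAfterCleanAndInvalidate() {
    GlobalMemory gm{};
    ToyDataCache cache(gm);
    cache.Write(3, 7);
    bool pendingBefore = (gm[3] == 0);  // global memory not yet updated
    cache.CleanAndInvalidate();
    return pendingBefore && gm[3] == 7;
}
```

In this model both functions return true: the second read stays stale until the cached copy is dropped, and the write becomes visible in global memory only after the clean step.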
Prototype
- Ensures consistency between the data cache and GM storage through the dcciDst setting.
    template <typename T, CacheLine entireType, DcciDst dcciDst>
    __aicore__ inline void DataCacheCleanAndInvalid(const GlobalTensor<T>& dst)
- Reserved for future use.
    template <typename T, CacheLine entireType, DcciDst dcciDst>
    __aicore__ inline void DataCacheCleanAndInvalid(const LocalTensor<T>& dst)
- dcciDst cannot be configured. Only consistency between the data cache and GM is ensured.
    template <typename T, CacheLine entireType>
    __aicore__ inline void DataCacheCleanAndInvalid(const GlobalTensor<T>& dst)
Parameters
| Parameter | Description |
|---|---|
| T | Data type of dst. |
| entireType | Command operation mode. The options are as follows: SINGLE_CACHE_LINE: only the cache line that contains the input address is refreshed. If the address is not 64-byte aligned, only the range from the input address up to the next 64-byte boundary is refreshed. ENTIRE_DATA_CACHE: the input address is ignored and the entire data cache in the core is refreshed. This operation takes a long time, so use it with caution in performance-sensitive scenarios. |
| dcciDst | Storage with which the data cache is kept consistent. The type is DcciDst. |

| Parameter | Input/Output | Description |
|---|---|---|
| dst | Input | Tensor for which the cache needs to be refreshed. |
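The alignment rule for SINGLE_CACHE_LINE can be worked through with a little address arithmetic. The sketch below is plain host C++, not Ascend C. `BytesCoveredBySingleLineOp` and `SingleLineCallIndices` are hypothetical helpers introduced only for illustration, and the code assumes a 64-byte line, a base address that is a multiple of the element size, and an element size that divides 64.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kCacheLineBytes = 64;  // alignment stated for SINGLE_CACHE_LINE

// Bytes from addr up to the next 64-byte boundary (a full line if addr is aligned).
constexpr std::uint64_t BytesCoveredBySingleLineOp(std::uint64_t addr) {
    return kCacheLineBytes - (addr % kCacheLineBytes);
}

// Element indices at which a single-cache-line call would be issued so that
// elements [0, count) are all covered.
// Assumes baseAddr is a multiple of elemBytes and elemBytes divides 64.
inline std::vector<std::size_t> SingleLineCallIndices(std::uint64_t baseAddr,
                                                      std::size_t elemBytes,
                                                      std::size_t count) {
    std::vector<std::size_t> indices;
    std::size_t i = 0;
    while (i < count) {
        indices.push_back(i);
        std::uint64_t addr = baseAddr + i * elemBytes;
        // Skip past every element that this call already covers.
        i += static_cast<std::size_t>(BytesCoveredBySingleLineOp(addr) / elemBytes);
    }
    return indices;
}
```

For eight uint64_t elements, a base address of 0x40 yields the single index 0, while a base address of 0x20 yields indices 0 and 4, because the first call covers only the 32 bytes up to the next boundary.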
Returns
None
Constraints
None
Example
    // Example 1: SINGLE_CACHE_LINE mode, assuming that mmAddr_ is 0x40 (64-byte aligned).
    AscendC::GlobalTensor<uint64_t> global;
    global.SetGlobalBuffer((__gm__ uint64_t*)mmAddr_ + AscendC::GetBlockIdx() * 1024);
    for (int i = 0; i < 8; i++) {
        global.SetValue(i, AscendC::GetBlockIdx());
    }
    // The start address is 64-byte aligned, so the first 8 elements are updated as soon as DataCacheCleanAndInvalid is called.
    AscendC::DataCacheCleanAndInvalid<uint64_t, AscendC::CacheLine::SINGLE_CACHE_LINE, AscendC::DcciDst::CACHELINE_OUT>(global);

    // Example 2: SINGLE_CACHE_LINE mode, assuming that mmAddr_ is 0x20 (not 64-byte aligned).
    AscendC::GlobalTensor<uint64_t> global;
    global.SetGlobalBuffer((__gm__ uint64_t*)mmAddr_ + AscendC::GetBlockIdx() * 1024);
    for (int i = 0; i < 8; i++) {
        global.SetValue(i, AscendC::GetBlockIdx());
    }
    // The start address is not 64-byte aligned, so one call updates only the range from the start address
    // up to the next 64-byte boundary (the first 4 elements).
    AscendC::DataCacheCleanAndInvalid<uint64_t, AscendC::CacheLine::SINGLE_CACHE_LINE, AscendC::DcciDst::CACHELINE_OUT>(global);
    // Call DataCacheCleanAndInvalid again to update the last 4 elements.
    AscendC::DataCacheCleanAndInvalid<uint64_t, AscendC::CacheLine::SINGLE_CACHE_LINE, AscendC::DcciDst::CACHELINE_OUT>(global[4]);

    // Example 3: SINGLE_CACHE_LINE mode in a multi-core scenario, assuming that mmAddr_ is 0x40 (64-byte aligned).
    // This example only illustrates a usage restriction.
    AscendC::GlobalTensor<uint64_t> global;
    global.SetGlobalBuffer((__gm__ uint64_t*)mmAddr_);
    global.SetValue(AscendC::GetBlockIdx(), AscendC::GetBlockIdx());
    // The cores write to different addresses, but all of the addresses fall in the same cache line.
    // As a result, data is overwritten unpredictably, unlike on a general-purpose CPU.
    // After DataCacheCleanAndInvalid is called, the final result is random because the cores finish at different times:
    // the core that executes later overwrites the result of the earlier core.
    AscendC::DataCacheCleanAndInvalid<uint64_t, AscendC::CacheLine::SINGLE_CACHE_LINE, AscendC::DcciDst::CACHELINE_OUT>(global);

    // Example 4: ENTIRE_DATA_CACHE mode, assuming that mmAddr_ is 0x20 (not 64-byte aligned).
    // This example only illustrates a usage restriction.
    AscendC::GlobalTensor<uint64_t> global;
    global.SetGlobalBuffer((__gm__ uint64_t*)mmAddr_ + AscendC::GetBlockIdx() * 1024);
    for (int i = 0; i < 8; i++) {
        global.SetValue(i, AscendC::GetBlockIdx());
    }
    // Refresh the entire data cache. The performance is poor.
    AscendC::DataCacheCleanAndInvalid<uint64_t, AscendC::CacheLine::ENTIRE_DATA_CACHE, AscendC::DcciDst::CACHELINE_OUT>(global);
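The overwrite described in Example 3 can be reproduced with a small host-side simulation. This is a toy model in plain C++, not Ascend C: it assumes that each core holds a private copy of the whole 64-byte line and writes the whole line back, and `SharedLineResult`, `PaddedSlotIndex`, and `kUnset` are hypothetical names introduced for illustration. Giving each core its own cache line, as the per-core offset in Example 1 effectively does, is shown as one possible layout rather than a documented requirement.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

constexpr std::size_t kWordsPerLine = 8;  // 64-byte line / 8-byte uint64_t
constexpr std::uint64_t kUnset = std::numeric_limits<std::uint64_t>::max();
using Line = std::array<std::uint64_t, kWordsPerLine>;

// Toy model: every core caches the same line, writes its own slot in its
// private copy, then writes the whole copy back in the given order.
// Core indices must be smaller than kWordsPerLine.
inline Line SharedLineResult(const std::vector<std::size_t>& writeBackOrder) {
    Line gm;
    gm.fill(kUnset);

    // All cores load the line before any write-back happens.
    std::vector<Line> privateCopies(writeBackOrder.size(), gm);
    for (std::size_t k = 0; k < writeBackOrder.size(); ++k) {
        std::size_t core = writeBackOrder[k];
        privateCopies[k][core] = core;  // each core updates only its own slot
    }
    // Whole-line write-backs: the last one replaces everything before it.
    for (const Line& copy : privateCopies) {
        gm = copy;
    }
    return gm;
}

// One possible layout (an assumption): one full cache line per core.
constexpr std::size_t PaddedSlotIndex(std::size_t blockIdx) {
    return blockIdx * kWordsPerLine;
}
```

In this model only the slot of the core that writes back last survives, which matches the order-dependent result described in Example 3. With `PaddedSlotIndex`, no two cores map to the same line, so their write-backs no longer touch each other's data.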