Common Data Movement
Function Usage
The common data movement APIs support both continuous and discontinuous data movements.
Prototype
- The source operand is GlobalTensor, and the destination operand is LocalTensor.
```cpp
// Continuous and discontinuous data movements
template <typename T>
__aicore__ inline void DataCopy(const LocalTensor<T>& dstLocal, const GlobalTensor<T>& srcGlobal,
                                const DataCopyParams& repeatParams);

// Continuous data movements
template <typename T>
__aicore__ inline void DataCopy(const LocalTensor<T>& dstLocal, const GlobalTensor<T>& srcGlobal,
                                const uint32_t calCount);
```
The prototype supports the following data paths and types:
Table 1 Data paths and types (GlobalTensor as the source operand and LocalTensor as the destination operand)

| Model | Data Path (Expressed Using TPosition) | Data Types of the Source and Destination Operands (Same) |
|---|---|---|
| Atlas Training Series Product | GM -> VECIN | int8_t / uint8_t / int16_t / uint16_t / int32_t / uint32_t / int64_t / uint64_t / half / float / double |
| Atlas Training Series Product | GM -> A1 / B1 | int8_t / uint8_t / int16_t / uint16_t / int32_t / uint32_t / int64_t / uint64_t / half / float / double |
- Both the source operand and destination operand are LocalTensor.
```cpp
// Continuous and discontinuous data movements
template <typename T>
__aicore__ inline void DataCopy(const LocalTensor<T>& dstLocal, const LocalTensor<T>& srcLocal,
                                const DataCopyParams& repeatParams);

// Continuous data movements
template <typename T>
__aicore__ inline void DataCopy(const LocalTensor<T>& dstLocal, const LocalTensor<T>& srcLocal,
                                const uint32_t calCount);
```
The prototype supports the following data paths and types:
Table 2 Data paths and types (LocalTensor as the source operand and destination operand)

| Model | Data Path (Expressed Using TPosition) | Data Types of the Source and Destination Operands (Same) |
|---|---|---|
| Atlas Training Series Product | VECIN -> VECCALC, VECCALC -> VECOUT | int8_t / uint8_t / int16_t / uint16_t / int32_t / uint32_t / int64_t / uint64_t / half / float / double |
- The source operand is LocalTensor, and the destination operand is GlobalTensor.
```cpp
// Continuous and discontinuous data movements
template <typename T>
__aicore__ inline void DataCopy(const GlobalTensor<T>& dstGlobal, const LocalTensor<T>& srcLocal,
                                const DataCopyParams& repeatParams);

// Continuous data movements
template <typename T>
__aicore__ inline void DataCopy(const GlobalTensor<T>& dstGlobal, const LocalTensor<T>& srcLocal,
                                const uint32_t calCount);
```
The prototype supports the following data paths and types:
Table 3 Data paths and types (LocalTensor as the source operand and GlobalTensor as the destination operand)

| Model | Data Path (Expressed Using TPosition) | Data Types of the Source and Destination Operands (Same) |
|---|---|---|
| Atlas Training Series Product | VECOUT -> GM | int8_t / uint8_t / int16_t / uint16_t / int32_t / uint32_t / int64_t / uint64_t / half / float / double |
Parameters
Table 4 Parameter description

| Parameter | Input/Output | Meaning |
|---|---|---|
| dstLocal, dstGlobal | Output | Destination operand, of type LocalTensor or GlobalTensor. When dstLocal is located in C2, the start address must be 64-byte-aligned; when dstLocal is located in C2PIPE2GM, the start address must be 128-byte-aligned; in other cases, the start address must be 32-byte-aligned. |
| srcLocal, srcGlobal | Input | Source operand, of type LocalTensor or GlobalTensor. |
| repeatParams | Input | Movement parameter, of type DataCopyParams. The structure is described below; the parameter details are provided in Table 5. |
| calCount | Input | Number of elements involved in the movement. NOTE: The data movement amount of DataCopy must be a multiple of 32 bytes. Therefore, when the continuous data movement API (the overload with the calCount parameter) is used, calCount x sizeof(T) must be 32-byte-aligned. Otherwise, the data movement amount is rounded down to the nearest multiple of 32 bytes. |
Table 5 DataCopyParams parameter description

| Parameter | Meaning |
|---|---|
| blockCount | Number of data chunks transmitted consecutively by the instruction. The value range is [1, 4095]. |
| blockLen | Length of each consecutively transmitted data chunk, in data blocks (32 bytes). The value range is [1, 65535]. In particular, when dstLocal is located in C2PIPE2GM, the unit is 128 bytes; when dstLocal is located in C2, the unit is 64 bytes. |
| srcStride | Gap between adjacent data chunks of the source operand (from the tail of one chunk to the header of the next), in data blocks (32 bytes). The data type is uint16_t; the value must not exceed the range of this type. |
| dstStride | Gap between adjacent data chunks of the destination operand (from the tail of one chunk to the header of the next), in data blocks (32 bytes). The data type is uint16_t; the value must not exceed the range of this type. In particular, when dstLocal is located in C2PIPE2GM, the unit is 128 bytes; when dstLocal is located in C2, the unit is 64 bytes. |
The following example shows how to use the DataCopyParams structure. Two consecutive data chunks are moved, each containing eight data blocks. There is no gap between adjacent data chunks of the source operand, while there is a one-data-block gap between the tail of one destination data chunk and the header of the next.
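The configuration described above can be sketched as follows. This is a fragment, not a complete kernel; the tensors dstLocal and srcGlobal are assumed to have been prepared elsewhere (for example, as in the kernel in the Example section):

```cpp
// Sketch: DataCopyParams for the movement described above (assumed tensors).
AscendC::DataCopyParams repeatParams;
repeatParams.blockCount = 2;  // two consecutive data chunks
repeatParams.blockLen = 8;    // each chunk spans 8 data blocks (8 x 32 bytes)
repeatParams.srcStride = 0;   // no gap between adjacent source chunks
repeatParams.dstStride = 1;   // one data block between adjacent destination chunks
AscendC::DataCopy(dstLocal, srcGlobal, repeatParams);
```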

Availability
Precautions
- Hardware uses data blocks as the basic unit during data movements (1 data block = 32 bytes). Therefore, processing data in integer multiples of 32 bytes per operation improves instruction execution efficiency.
- If multiple DataCopy instructions need to be executed and their destination addresses overlap, call PipeBarrier to insert synchronization instructions that serialize the instructions and prevent abnormal data. As shown in the following figure on the left, when two DataCopy instructions are executed and their destination GM addresses overlap, the MTE3 output pipeline must be synchronized between the two instructions by calling PipeBarrier<PIPE_MTE3>(). As shown in the following figure on the right, when the destination addresses in the Unified Buffer overlap, the MTE2 input pipeline must be synchronized between the two instructions by calling PipeBarrier<PIPE_MTE2>().
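The GM-destination case above can be sketched as follows. This is a fragment under assumptions, not a complete kernel: the tensors dstGlobal and srcLocal0/srcLocal1 are hypothetical names, and the two copies are assumed to write overlapping GM regions:

```cpp
// Sketch: serializing two DataCopy instructions whose destination GM
// addresses overlap (tensor names are hypothetical).
AscendC::DataCopy(dstGlobal, srcLocal0, 512);
// Synchronize the MTE3 output pipeline before the second, overlapping copy.
AscendC::PipeBarrier<PIPE_MTE3>();
AscendC::DataCopy(dstGlobal[256], srcLocal1, 512);  // overlaps the first copy
```

For the Unified Buffer case, the same pattern applies with PipeBarrier<PIPE_MTE2>() between the two copies.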

Example
```cpp
#include "kernel_operator.h"

class KernelDataCopy {
public:
    __aicore__ inline KernelDataCopy() {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ half*)src0Gm);
        src1Global.SetGlobalBuffer((__gm__ half*)src1Gm);
        dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, 512 * sizeof(half));
        pipe.InitBuffer(inQueueSrc1, 1, 512 * sizeof(half));
        pipe.InitBuffer(outQueueDst, 1, 512 * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> src0Local = inQueueSrc0.AllocTensor<half>();
        AscendC::LocalTensor<half> src1Local = inQueueSrc1.AllocTensor<half>();
        AscendC::DataCopy(src0Local, src0Global, 512);
        AscendC::DataCopy(src1Local, src1Global, 512);
        inQueueSrc0.EnQue(src0Local);
        inQueueSrc1.EnQue(src1Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> src0Local = inQueueSrc0.DeQue<half>();
        AscendC::LocalTensor<half> src1Local = inQueueSrc1.DeQue<half>();
        AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
        AscendC::Add(dstLocal, src0Local, src1Local, 512);
        outQueueDst.EnQue<half>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
        inQueueSrc1.FreeTensor(src1Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
        AscendC::DataCopy(dstGlobal, dstLocal, 512);
        outQueueDst.FreeTensor(dstLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc0, inQueueSrc1;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<half> src0Global, src1Global, dstGlobal;
};

extern "C" __global__ __aicore__ void data_copy_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm,
                                                       __gm__ uint8_t* dstGm)
{
    KernelDataCopy op;
    op.Init(src0Gm, src1Gm, dstGm);
    op.Process();
}
```
```
Input (src0Global):  [1 2 3 ... 512]
Input (src1Global):  [1 2 3 ... 512]
Output (dstGlobal):  [2 4 6 ... 1024]
```