Common Data Movement
Function Usage
The common data movement APIs support both continuous and discontinuous data movements.
Prototype
- The source operand is GlobalTensor, and the destination operand is LocalTensor.
```cpp
// Continuous and discontinuous data movements
template <typename T>
__aicore__ inline void DataCopy(const LocalTensor<T>& dstLocal, const GlobalTensor<T>& srcGlobal,
                                const DataCopyParams& repeatParams);

// Continuous data movements
template <typename T>
__aicore__ inline void DataCopy(const LocalTensor<T>& dstLocal, const GlobalTensor<T>& srcGlobal,
                                const uint32_t calCount);
```
The prototype supports the following data paths and types:
Table 1 Data paths and types (GlobalTensor as the source operand and LocalTensor as the destination operand)

| Model | Data Path (Expressed Using TPosition) | Data Types of the Source and Destination Operands (Same) |
|---|---|---|
| Atlas Training Series Product | GM -> VECIN | int8_t / uint8_t / int16_t / uint16_t / int32_t / uint32_t / int64_t / uint64_t / half / float / double |
| Atlas Training Series Product | GM -> A1 / B1 | int8_t / uint8_t / int16_t / uint16_t / int32_t / uint32_t / int64_t / uint64_t / half / float / double |
- Both the source operand and destination operand are LocalTensor.
```cpp
// Continuous and discontinuous data movements
template <typename T>
__aicore__ inline void DataCopy(const LocalTensor<T>& dstLocal, const LocalTensor<T>& srcLocal,
                                const DataCopyParams& repeatParams);

// Continuous data movements
template <typename T>
__aicore__ inline void DataCopy(const LocalTensor<T>& dstLocal, const LocalTensor<T>& srcLocal,
                                const uint32_t calCount);
```
The prototype supports the following data paths and types:
Table 2 Data paths and types (LocalTensor as the source operand and destination operand)

| Model | Data Path (Expressed Using TPosition) | Data Types of the Source and Destination Operands (Same) |
|---|---|---|
| Atlas Training Series Product | VECIN -> VECCALC, VECCALC -> VECOUT | int8_t / uint8_t / int16_t / uint16_t / int32_t / uint32_t / int64_t / uint64_t / half / float / double |
- The source operand is LocalTensor, and the destination operand is GlobalTensor.
```cpp
// Continuous and discontinuous data movements
template <typename T>
__aicore__ inline void DataCopy(const GlobalTensor<T>& dstGlobal, const LocalTensor<T>& srcLocal,
                                const DataCopyParams& repeatParams);

// Continuous data movements
template <typename T>
__aicore__ inline void DataCopy(const GlobalTensor<T>& dstGlobal, const LocalTensor<T>& srcLocal,
                                const uint32_t calCount);
```
The prototype supports the following data paths and types:
Table 3 Data paths and types (LocalTensor as the source operand and GlobalTensor as the destination operand)

| Model | Data Path (Expressed Using TPosition) | Data Types of the Source and Destination Operands (Same) |
|---|---|---|
| Atlas Training Series Product | VECOUT -> GM | int8_t / uint8_t / int16_t / uint16_t / int32_t / uint32_t / int64_t / uint64_t / half / float / double |
Parameters
Table 4 Parameter description

| Parameter | Input/Output | Meaning |
|---|---|---|
| dstLocal, dstGlobal | Output | Destination operand, of type LocalTensor or GlobalTensor. When dstLocal is located in C2, the start address must be 64-byte-aligned; when dstLocal is located in C2PIPE2GM, the start address must be 128-byte-aligned; in other cases, the start address must be 32-byte-aligned. |
| srcLocal, srcGlobal | Input | Source operand, of type LocalTensor or GlobalTensor. |
| repeatParams | Input | Movement parameter, of type DataCopyParams. The structure is described below; the parameter details are provided in Table 5. |
| calCount | Input | Number of elements involved in the movement. NOTE: The data movement amount of DataCopy must be a multiple of 32 bytes. Therefore, when the continuous data movement API (the overload with the calCount parameter) is used, calCount x sizeof(T) must be 32-byte-aligned. Otherwise, the data movement amount is rounded down to the nearest multiple of 32 bytes. |
Table 5 DataCopyParams parameter description

| Parameter | Meaning |
|---|---|
| blockCount | Number of data chunks transmitted consecutively by the instruction. The value range is [1, 4095]. |
| blockLen | Length of each consecutively transmitted data chunk, in data blocks (32 bytes). The value range is [1, 65535]. In particular, when dstLocal is located in C2PIPE2GM, the unit is 128 bytes; when dstLocal is located in C2, the unit is 64 bytes. |
| srcStride | Gap between adjacent data chunks of the source operand (from the tail of one chunk to the header of the next), in data blocks (32 bytes). The data type is uint16_t; the value must not exceed the range of this type. |
| dstStride | Gap between adjacent data chunks of the destination operand (from the tail of one chunk to the header of the next), in data blocks (32 bytes). The data type is uint16_t; the value must not exceed the range of this type. In particular, when dstLocal is located in C2PIPE2GM, the unit is 128 bytes; when dstLocal is located in C2, the unit is 64 bytes. |
The following example shows how to use the DataCopyParams structure. Two consecutive data chunks are moved, each containing eight data blocks. There is no gap between adjacent data chunks of the source operand, while there is a one-data-block gap between the tail of one destination data chunk and the header of the next.
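The configuration described above can be sketched as follows. This is a fragment, not a complete kernel; the tensors dstLocal and srcGlobal are assumed to have been prepared elsewhere (for example, as in the kernel in the Example section):

```cpp
// Sketch: DataCopyParams for the movement described above (assumed tensors).
AscendC::DataCopyParams repeatParams;
repeatParams.blockCount = 2;  // two consecutive data chunks
repeatParams.blockLen = 8;    // each chunk spans 8 data blocks (8 x 32 bytes)
repeatParams.srcStride = 0;   // no gap between adjacent source chunks
repeatParams.dstStride = 1;   // one data block between adjacent destination chunks
AscendC::DataCopy(dstLocal, srcGlobal, repeatParams);
```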

Availability
Precautions
- Hardware uses data blocks as the basic unit during data movements (1 data block = 32 bytes). Therefore, processing data in integer multiples of 32 bytes per operation improves instruction execution efficiency.
- If multiple DataCopy instructions need to be executed and their destination addresses overlap, call PipeBarrier to insert synchronization instructions that serialize the instructions and prevent abnormal data. As shown in the following figure on the left, when two DataCopy instructions are executed and their destination GM addresses overlap, the MTE3 output pipeline must be synchronized between the two instructions by calling PipeBarrier<PIPE_MTE3>(). As shown in the following figure on the right, when the destination addresses in the Unified Buffer overlap, the MTE2 input pipeline must be synchronized between the two instructions by calling PipeBarrier<PIPE_MTE2>().
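The GM-destination case above can be sketched as follows. This is a fragment under assumptions, not a complete kernel: the tensors dstGlobal and srcLocal0/srcLocal1 are hypothetical names, and the two copies are assumed to write overlapping GM regions:

```cpp
// Sketch: serializing two DataCopy instructions whose destination GM
// addresses overlap (tensor names are hypothetical).
AscendC::DataCopy(dstGlobal, srcLocal0, 512);
// Synchronize the MTE3 output pipeline before the second, overlapping copy.
AscendC::PipeBarrier<PIPE_MTE3>();
AscendC::DataCopy(dstGlobal[256], srcLocal1, 512);  // overlaps the first copy
```

For the Unified Buffer case, the same pattern applies with PipeBarrier<PIPE_MTE2>() between the two copies.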

Example
```cpp
#include "kernel_operator.h"

class KernelDataCopy {
public:
    __aicore__ inline KernelDataCopy() {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ half*)src0Gm);
        src1Global.SetGlobalBuffer((__gm__ half*)src1Gm);
        dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, 512 * sizeof(half));
        pipe.InitBuffer(inQueueSrc1, 1, 512 * sizeof(half));
        pipe.InitBuffer(outQueueDst, 1, 512 * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> src0Local = inQueueSrc0.AllocTensor<half>();
        AscendC::LocalTensor<half> src1Local = inQueueSrc1.AllocTensor<half>();
        AscendC::DataCopy(src0Local, src0Global, 512);
        AscendC::DataCopy(src1Local, src1Global, 512);
        inQueueSrc0.EnQue(src0Local);
        inQueueSrc1.EnQue(src1Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> src0Local = inQueueSrc0.DeQue<half>();
        AscendC::LocalTensor<half> src1Local = inQueueSrc1.DeQue<half>();
        AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
        AscendC::Add(dstLocal, src0Local, src1Local, 512);
        outQueueDst.EnQue<half>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
        inQueueSrc1.FreeTensor(src1Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
        AscendC::DataCopy(dstGlobal, dstLocal, 512);
        outQueueDst.FreeTensor(dstLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc0, inQueueSrc1;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<half> src0Global, src1Global, dstGlobal;
};

extern "C" __global__ __aicore__ void data_copy_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm,
                                                       __gm__ uint8_t* dstGm)
{
    KernelDataCopy op;
    op.Init(src0Gm, src1Gm, dstGm);
    op.Process();
}
```
```
Input (src0Global):  [1 2 3 ... 512]
Input (src1Global):  [1 2 3 ... 512]
Output (dstGlobal):  [2 4 6 ... 1024]
```