Common Data Movement

Function Usage

The common data movement APIs support both continuous and discontinuous data movement.

Prototype

  • The source operand is GlobalTensor, and the destination operand is LocalTensor.
    //Continuous and discontinuous data movements
    template <typename T>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dstLocal, const GlobalTensor<T>& srcGlobal, const DataCopyParams& repeatParams);
    
    //Continuous data movements
    template <typename T>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dstLocal, const GlobalTensor<T>& srcGlobal, const uint32_t calCount);
    

    The prototype supports the following data paths and types:

    Table 1 Data paths and types (GlobalTensor as the source operand and LocalTensor as the destination operand)

    | Model | Data Path (Expressed Using TPosition) | Data Types of the Source and Destination Operands (Same) |
    | --- | --- | --- |
    | Atlas Training Series Product | GM -> VECIN | int8_t / uint8_t / int16_t / uint16_t / int32_t / uint32_t / int64_t / uint64_t / half / float / double |
    | Atlas Training Series Product | GM -> A1 / B1 | int8_t / uint8_t / int16_t / uint16_t / int32_t / uint32_t / int64_t / uint64_t / half / float / double |

  • Both the source operand and destination operand are LocalTensor.
    //Continuous and discontinuous data movements
    template <typename T>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dstLocal, const LocalTensor<T>& srcLocal, const DataCopyParams& repeatParams);
    
    //Continuous data movements
    template <typename T>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dstLocal, const LocalTensor<T>& srcLocal, const uint32_t calCount);
    

    The prototype supports the following data paths and types:

    Table 2 Data paths and types (LocalTensor as the source operand and destination operand)

    | Model | Data Path (Expressed Using TPosition) | Data Types of the Source and Destination Operands (Same) |
    | --- | --- | --- |
    | Atlas Training Series Product | VECIN -> VECCALC, VECCALC -> VECOUT | int8_t / uint8_t / int16_t / uint16_t / int32_t / uint32_t / int64_t / uint64_t / half / float / double |

  • The source operand is LocalTensor, and the destination operand is GlobalTensor.
    //Continuous and discontinuous data movements
    template <typename T>
    __aicore__ inline void DataCopy(const GlobalTensor<T>& dstGlobal, const LocalTensor<T>& srcLocal, const DataCopyParams& repeatParams);
    
    //Continuous data movements
    template <typename T>
    __aicore__ inline void DataCopy(const GlobalTensor<T>& dstGlobal, const LocalTensor<T>& srcLocal, const uint32_t calCount);
    

    The prototype supports the following data paths and types:

    Table 3 Data paths and types (LocalTensor as the source operand and GlobalTensor as the destination operand)

    | Model | Data Path (Expressed Using TPosition) | Data Types of the Source and Destination Operands (Same) |
    | --- | --- | --- |
    | Atlas Training Series Product | VECOUT -> GM | int8_t / uint8_t / int16_t / uint16_t / int32_t / uint32_t / int64_t / uint64_t / half / float / double |

Parameters

Table 4 Parameters of the common data movement APIs

| Parameter | Input/Output | Meaning |
| --- | --- | --- |
| dstLocal, dstGlobal | Output | Destination operand of type LocalTensor or GlobalTensor. When dstLocal is located in C2, the start address must be 64-byte aligned; when dstLocal is located in C2PIPE2GM, it must be 128-byte aligned; in all other cases, it must be 32-byte aligned. |
| srcLocal, srcGlobal | Input | Source operand of type LocalTensor or GlobalTensor. |
| repeatParams | Input | Data movement parameters of type DataCopyParams. The structure is defined below; the fields are described in Table 5. |
| calCount | Input | Number of elements involved in the movement. |

struct DataCopyParams {
    uint16_t blockCount = 0;  // number of consecutive data chunks
    uint16_t blockLen = 0;    // length of each chunk, in 32-byte data blocks
    uint16_t srcStride = 0;   // gap between source chunks, in data blocks
    uint16_t dstStride = 0;   // gap between destination chunks, in data blocks
};

NOTE:

The data movement amount of DataCopy must be a multiple of 32 bytes. Therefore, when the continuous data movement API (the overload that takes the calCount parameter) is used, calCount x sizeof(T) must be a multiple of 32 bytes. Otherwise, the data movement amount is rounded down to the nearest multiple of 32 bytes.

Table 5 Parameters in the DataCopyParams structure

| Parameter | Meaning |
| --- | --- |
| blockCount | Number of data chunks transmitted consecutively by the instruction. The value range is [1, 4095]. |
| blockLen | Length of each data chunk, in data blocks (32 bytes). The value range is [1, 65535]. When dstLocal is located in C2PIPE2GM, the unit is 128 bytes; when dstLocal is located in C2, the unit is 64 bytes. |
| srcStride | Gap between adjacent data chunks of the source operand (from the tail of one chunk to the head of the next), in data blocks (32 bytes). The type is uint16_t; the value must not exceed the range of this type. |
| dstStride | Gap between adjacent data chunks of the destination operand (from the tail of one chunk to the head of the next), in data blocks (32 bytes). The type is uint16_t; the value must not exceed the range of this type. When dstLocal is located in C2PIPE2GM, the unit is 128 bytes; when dstLocal is located in C2, the unit is 64 bytes. |

The following example illustrates the DataCopyParams structure: two consecutive data chunks are moved, each containing eight data blocks. The source chunks are adjacent (no gap between them), while the destination chunks are separated by one data block.

Availability

Atlas Training Series Product

Precautions

  • Hardware moves data in units of data blocks (1 data block = 32 bytes). Processing an integer multiple of 32 bytes per instruction therefore improves instruction execution efficiency.
  • If multiple DataCopy instructions are executed and their destination addresses overlap, insert synchronization by calling PipeBarrier to serialize the instructions and prevent data corruption. For example, when two DataCopy instructions write to overlapping GM addresses, synchronize the MTE3 output pipeline between them by calling PipeBarrier<PIPE_MTE3>(); when the destination addresses overlap in the Unified Buffer, synchronize the MTE2 input pipeline by calling PipeBarrier<PIPE_MTE2>().
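A minimal device-side sketch of the GM-overlap case (Ascend C kernel fragment, not host-runnable; the tensor names and offsets are illustrative):

```cpp
// Two copy-out instructions whose destination GM regions overlap:
// serialize the MTE3 output pipeline between them.
AscendC::DataCopy(dstGlobal, src0Local, 512);
AscendC::PipeBarrier<PIPE_MTE3>();  // wait until the first copy-out completes
AscendC::DataCopy(dstGlobal[256], src1Local, 512);
```

Without the barrier, the two MTE3 instructions may execute out of order and the overlapping region may hold stale data.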

Example

#include "kernel_operator.h"
class KernelDataCopy {
public:
    __aicore__ inline KernelDataCopy() {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ half*)src0Gm);
        src1Global.SetGlobalBuffer((__gm__ half*)src1Gm);
        dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, 512 * sizeof(half));
        pipe.InitBuffer(inQueueSrc1, 1, 512 * sizeof(half));
        pipe.InitBuffer(outQueueDst, 1, 512 * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> src0Local = inQueueSrc0.AllocTensor<half>();
        AscendC::LocalTensor<half> src1Local = inQueueSrc1.AllocTensor<half>();
        AscendC::DataCopy(src0Local, src0Global, 512);
        AscendC::DataCopy(src1Local, src1Global, 512);
        inQueueSrc0.EnQue(src0Local);
        inQueueSrc1.EnQue(src1Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> src0Local = inQueueSrc0.DeQue<half>();
        AscendC::LocalTensor<half> src1Local = inQueueSrc1.DeQue<half>();
        AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
        AscendC::Add(dstLocal, src0Local, src1Local, 512);
        outQueueDst.EnQue<half>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
        inQueueSrc1.FreeTensor(src1Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
        AscendC::DataCopy(dstGlobal, dstLocal, 512);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc0, inQueueSrc1;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<half> src0Global, src1Global, dstGlobal;
};
extern "C" __global__ __aicore__ void data_copy_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
{
    KernelDataCopy op;
    op.Init(src0Gm, src1Gm, dstGm);
    op.Process();
}
Result example:
Input (src0Global): [1 2 3 ... 512]
Input (src1Global): [1 2 3 ... 512]
Output (dstGlobal): [2 4 6 ... 1024]