Basic Data Movement

Applicability

Product | Prototypes with the same source and destination data types | Prototypes with different source and destination data types
Atlas A3 training products / Atlas A3 inference products | Supported | Supported
Atlas A2 training products / Atlas A2 inference products | Supported | Supported
Atlas 200I/500 A2 inference products | Supported | Not supported
AI Core of Atlas inference products | Supported | Not supported
Vector Core of Atlas inference products | Supported | Not supported
Atlas training products | Supported | Not supported

Functions

Provides basic data movement capabilities. During the transfer, the format and content of the data remain unchanged. Both contiguous and non-contiguous transfers are supported.

Prototype

  • Global Memory -> Local Memory
    // Contiguous transfer
    template <typename T>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const GlobalTensor<T>& src, const uint32_t count)
    
    // Both contiguous and non-contiguous transfers are supported.
    template <typename T>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const GlobalTensor<T>& src, const DataCopyParams& repeatParams)
    
  • Local Memory -> Local Memory
    // Contiguous transfer
    template <typename T>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const LocalTensor<T>& src, const uint32_t count)
    
    // Both contiguous and non-contiguous transfers are supported.
    template <typename T>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const LocalTensor<T>& src, const DataCopyParams& repeatParams)
    
  • Local Memory -> Global Memory
    // Contiguous transfer
    template <typename T>
    __aicore__ inline void DataCopy(const GlobalTensor<T>& dst, const LocalTensor<T>& src, const uint32_t count)
    
    // Both contiguous and non-contiguous transfers are supported.
    template <typename T>
    __aicore__ inline void DataCopy(const GlobalTensor<T>& dst, const LocalTensor<T>& src, const DataCopyParams& repeatParams)
    
  • Local Memory -> Local Memory, where the source and destination operands have different data types
    // Both contiguous and non-contiguous transfers are supported.
    template <typename T, typename U>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const LocalTensor<U>& src, const DataCopyParams& repeatParams)
    

For details about the supported data paths and data types of each prototype, see Supported Channels and Data Types.

Parameters

Table 1 Parameters in the template

Parameter

Description

T, U

Operand data type. For details about the supported data types, see Supported Channels and Data Types.

Table 2 Parameters

Parameter

Input/Output

Meaning

dst

Output

Destination operand of the LocalTensor or GlobalTensor type.

If the LocalTensor is located in C2, the start address must be 64-byte aligned. If the LocalTensor is located in C2PIPE2GM, the start address must be 128-byte aligned. In other cases, the start address must be 32-byte aligned.

The start address of the GlobalTensor must be aligned to the size, in bytes, of the corresponding data type.

src

Input

Source operand of the LocalTensor or GlobalTensor type.

The start address of the LocalTensor must be 32-byte aligned.

The start address of the GlobalTensor must be aligned to the size, in bytes, of the corresponding data type.

repeatParams

Input

Transfer parameters, of the DataCopyParams type. Use this parameter to configure the size, number, and spacing of the data blocks to transfer. Both contiguous and non-contiguous transfers are supported.

For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_data_copy.h. Replace ${INSTALL_DIR} with the actual CANN component directory.

count

Input

Number of elements to transfer.

NOTE:

The value of count * sizeof(T) must be 32-byte aligned. If the value is not 32-byte aligned, the transfer volume is rounded down to the nearest multiple of 32 bytes.
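The rounding rule in the note above can be sketched on the host in plain C++. This is only an illustration of the arithmetic, not Ascend C code: EffectiveCopyBytes, IsCountAligned, and kBlockBytes are hypothetical names, and std::uint16_t stands in for the 2-byte half type.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical host-side helpers; they are NOT part of the Ascend C API.
// They only model the arithmetic described in the note on count.
constexpr std::size_t kBlockBytes = 32;  // alignment unit stated in the note

// Bytes actually transferred: count * sizeof(T), rounded down to a
// multiple of 32 bytes.
template <typename T>
constexpr std::size_t EffectiveCopyBytes(std::uint32_t count) {
    return (static_cast<std::size_t>(count) * sizeof(T)) / kBlockBytes * kBlockBytes;
}

// True when count * sizeof(T) is already 32-byte aligned, so nothing is lost.
template <typename T>
constexpr bool IsCountAligned(std::uint32_t count) {
    return (static_cast<std::size_t>(count) * sizeof(T)) % kBlockBytes == 0;
}

// std::uint16_t stands in for a 2-byte element type such as half.
static_assert(EffectiveCopyBytes<std::uint16_t>(512) == 1024, "512 * 2 bytes is aligned");
static_assert(EffectiveCopyBytes<std::uint16_t>(20) == 32, "40 bytes round down to 32");
static_assert(!IsCountAligned<float>(7), "28 bytes is not 32-byte aligned");
```

Under this model, a count of 20 two-byte elements requests 40 bytes but transfers only 32, so a check such as IsCountAligned before the call helps catch silently truncated transfers.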

Table 3 Parameters in the DataCopyParams structure

Field

Meaning

blockCount

Number of contiguous data blocks to transfer. Type: uint16_t. Value range: blockCount ∈ [1, 4095].

blockLen

Length of each contiguous data block to transfer, in units of DataBlock (32 bytes). Type: uint16_t. Value range: blockLen ∈ [1, 65535].

Exceptions: when dst is located in C2PIPE2GM, the unit is 128 bytes; when dst is located in C2, the unit is 64 bytes. In both cases, blockLen is the length of each contiguous data block of the source operand.

srcGap

Gap between adjacent contiguous data blocks of the source operand, measured from the end of one data block to the beginning of the next, in units of DataBlock (32 bytes). Type: uint16_t. The value of srcGap must be within the value range of this type.

In the L1 Buffer -> Fixpipe Buffer scenario, srcGap is instead measured from the beginning of one data block to the beginning of the next, in units of DataBlock (32 bytes). The type and value range are unchanged.

dstGap

Gap between adjacent contiguous data blocks of the destination operand, measured from the end of one data block to the beginning of the next, in units of DataBlock (32 bytes). Type: uint16_t. The value of dstGap must be within the value range of this type.

Exceptions: when dst is located in C2PIPE2GM, the unit is 128 bytes; when dst is located in C2, the unit is 64 bytes.

In the L1 Buffer -> Fixpipe Buffer scenario, dstGap is instead the gap between adjacent contiguous data blocks of the destination operand measured from the beginning of one data block to the beginning of the next, in units of DataBlock (32 bytes). The type and value range are unchanged.

The following example illustrates the DataCopyParams structure. Two contiguous data chunks are transferred, and each chunk contains eight DataBlocks (blockCount = 2, blockLen = 8). The source chunks are adjacent to each other (srcGap = 0), while the destination chunks are separated by one DataBlock between the tail of one chunk and the head of the next (dstGap = 1).
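The layout in this example can be checked with a small host-side model in standard C++. This is a sketch under stated assumptions, not the device implementation: CopyParamsModel, ChunkOffsetBytes, FootprintBytes, and SimulateDataCopy are hypothetical names, the unit is fixed at 32 bytes, and gaps are measured from tail to head (the general case in Table 3, not the L1 Buffer -> Fixpipe Buffer case).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for the four fields listed in Table 3.
// It is NOT the real AscendC::DataCopyParams type.
struct CopyParamsModel {
    std::uint16_t blockCount;  // number of contiguous chunks
    std::uint16_t blockLen;    // chunk length, in 32-byte DataBlocks
    std::uint16_t srcGap;      // tail-to-head gap on the source side
    std::uint16_t dstGap;      // tail-to-head gap on the destination side
};

constexpr std::size_t kDataBlockBytes = 32;  // assumed unit (general case)

// Byte offset at which chunk i starts when gaps are measured tail to head.
constexpr std::size_t ChunkOffsetBytes(std::uint16_t blockLen, std::uint16_t gap, std::size_t i) {
    return i * (static_cast<std::size_t>(blockLen) + gap) * kDataBlockBytes;
}

// Bytes spanned from the first to the last transferred byte (blockCount >= 1).
constexpr std::size_t FootprintBytes(std::uint16_t blockCount, std::uint16_t blockLen, std::uint16_t gap) {
    return (static_cast<std::size_t>(blockCount) * blockLen +
            (static_cast<std::size_t>(blockCount) - 1) * gap) * kDataBlockBytes;
}

// Models the strided copy on plain byte buffers; gap bytes in dst are untouched.
inline void SimulateDataCopy(std::vector<std::uint8_t>& dst,
                             const std::vector<std::uint8_t>& src,
                             const CopyParamsModel& p) {
    const std::size_t chunkBytes = static_cast<std::size_t>(p.blockLen) * kDataBlockBytes;
    for (std::size_t i = 0; i < p.blockCount; ++i) {
        const std::size_t s = ChunkOffsetBytes(p.blockLen, p.srcGap, i);
        const std::size_t d = ChunkOffsetBytes(p.blockLen, p.dstGap, i);
        assert(s + chunkBytes <= src.size() && d + chunkBytes <= dst.size());
        std::copy_n(src.begin() + static_cast<std::ptrdiff_t>(s), chunkBytes,
                    dst.begin() + static_cast<std::ptrdiff_t>(d));
    }
}
```

With blockCount = 2, blockLen = 8, srcGap = 0, and dstGap = 1, this model places the second chunk at byte 256 in the source and byte 288 in the destination, so the source spans 512 bytes and the destination spans 544 bytes.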

Returns

None

Restrictions

  • If multiple DataCopy instructions are executed and their destination addresses overlap, call PipeBarrier(ISASI) to insert synchronization between them so that the instructions are serialized and the data is not corrupted. When two DataCopy instructions write to overlapping GM addresses, synchronize the MTE3 output pipeline between the two instructions by calling PipeBarrier<PIPE_MTE3>(). When two DataCopy instructions write to overlapping Unified Buffer addresses, synchronize the MTE2 input pipeline between the two instructions by calling PipeBarrier<PIPE_MTE2>().

  • For the following product models:

    Atlas A2 training products / Atlas A2 inference products

    Atlas A3 training products / Atlas A3 inference products

    In the cross-device communication operator development scenario, DataCopy APIs support cross-device data transfer over HCCS physical links only. During development, pay attention to the physical links used for inter-card communication. You can run the npu-smi info -t topo command to query the HCCS physical links.
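The first restriction above amounts to an overlap test on the destination ranges of two transfers. The following host-side sketch in standard C++ shows one way to reason about it. All names (DstMemory, Barrier, ByteRange, Overlaps, BarrierBetween) are hypothetical and exist only for illustration; in kernel code the actual synchronization is the PipeBarrier<PIPE_MTE3>() or PipeBarrier<PIPE_MTE2>() call described in the restriction.

```cpp
#include <cstddef>

// Hypothetical names for illustration only; none of these are Ascend C types.
enum class DstMemory { Global, UnifiedBuffer };
enum class Barrier { None, Mte3, Mte2 };

// Half-open byte range [begin, end) written by one DataCopy call.
struct ByteRange {
    std::size_t begin;
    std::size_t end;
};

// Two half-open ranges overlap when each starts before the other ends.
constexpr bool Overlaps(ByteRange a, ByteRange b) {
    return a.begin < b.end && b.begin < a.end;
}

// Which pipeline to synchronize between two consecutive copies, following
// the restriction: overlapping GM destinations -> MTE3, overlapping
// Unified Buffer destinations -> MTE2, no overlap -> no barrier.
constexpr Barrier BarrierBetween(DstMemory where, ByteRange first, ByteRange second) {
    return !Overlaps(first, second)
               ? Barrier::None
               : (where == DstMemory::Global ? Barrier::Mte3 : Barrier::Mte2);
}

static_assert(BarrierBetween(DstMemory::Global, ByteRange{0, 1024}, ByteRange{512, 1536}) == Barrier::Mte3,
              "overlapping GM destinations need the MTE3 barrier");
static_assert(BarrierBetween(DstMemory::UnifiedBuffer, ByteRange{0, 1024}, ByteRange{1024, 2048}) == Barrier::None,
              "adjacent ranges do not overlap");
```

For non-contiguous transfers, a conservative choice is to use the whole span from the first to the last written byte, gaps included, as the range.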

Supported Channels and Data Types

The data paths in the following sections are expressed by using the logical location TPosition, and the corresponding physical paths are marked. For details about the mapping between TPosition and physical memory, see Table 1.

Table 4 Global Memory -> Local Memory data paths and supported data types

Product

Data Path

Data Types of the Source and Destination Operands (Same)

Atlas training products

  • GM -> VECIN (GM -> UB)
  • GM -> A1, B1 (GM -> L1 Buffer)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, float, double

AI Core of Atlas inference products

  • GM -> VECIN (GM -> UB)
  • GM -> A1, B1 (GM -> L1 Buffer)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, float, double

Vector Core of Atlas inference products

  • GM -> VECIN (GM -> UB)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, float, double

Atlas A2 training products / Atlas A2 inference products

  • GM -> VECIN (GM -> UB)
  • GM -> A1, B1, C1 (GM -> L1 Buffer)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double

Atlas A3 training products / Atlas A3 inference products

  • GM -> VECIN (GM -> UB)
  • GM -> A1, B1, C1 (GM -> L1 Buffer)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double

Atlas 200I/500 A2 inference products

  • GM -> VECIN (GM -> UB)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double

Table 5 Local Memory -> Local Memory data paths and supported data types

Product

Data Path

Data Types of the Source and Destination Operands (Same)

Atlas training products

  • VECIN -> VECCALC or VECCALC -> VECOUT (UB -> UB)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, float, double

AI Core of Atlas inference products

  • VECIN -> VECCALC or VECCALC -> VECOUT (UB -> UB)
  • VECIN, VECCALC, VECOUT -> A1, B1 (UB -> L1 Buffer)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, float, double

Atlas A2 training products / Atlas A2 inference products

  • VECIN -> VECCALC or VECCALC -> VECOUT (UB -> UB)
  • VECIN, VECCALC, VECOUT -> TSCM (UB -> L1 Buffer)
  • A1, B1, C1 -> C2PIPE2GM (L1 Buffer -> Fixpipe Buffer)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double

  • C1 -> C2 (L1 Buffer -> BiasTable Buffer)

int32_t, float

Atlas A3 training products / Atlas A3 inference products

  • VECIN -> VECCALC or VECCALC -> VECOUT (UB -> UB)
  • VECIN, VECCALC, VECOUT -> TSCM (UB -> L1 Buffer)
  • A1, B1, C1 -> C2PIPE2GM (L1 Buffer -> Fixpipe Buffer)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double

  • C1 -> C2 (L1 Buffer -> BiasTable Buffer)

int32_t, float

Table 6 Local Memory -> Global Memory data paths and supported data types

Product

Data Path

Data Types of the Source and Destination Operands (Same)

Atlas training products

  • VECOUT -> GM (UB -> GM)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, float, double

AI Core of Atlas inference products

  • VECOUT, CO2 -> GM (UB -> GM)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, float, double

Vector Core of Atlas inference products

  • VECOUT -> GM (UB -> GM)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, float, double

Atlas A2 training products / Atlas A2 inference products

  • VECOUT -> GM (UB -> GM)
  • A1, B1 -> GM (L1 Buffer -> GM)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double

Atlas A3 training products / Atlas A3 inference products

  • VECOUT -> GM (UB -> GM)
  • A1, B1 -> GM (L1 Buffer -> GM)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double

Atlas 200I/500 A2 inference products

  • VECOUT -> GM (UB -> GM)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double

Table 7 Local Memory -> Local Memory data paths and supported data types (source and destination data types can differ)

Product

Data Path

Source Operand Data Type

Destination Operand Data Type

Atlas A2 training products / Atlas A2 inference products

C1 -> C2 (L1 Buffer -> BiasTable Buffer)

half

float

Atlas A3 training products / Atlas A3 inference products

C1 -> C2 (L1 Buffer -> BiasTable Buffer)

half

float

Examples

AscendC::TPipe pipe;
AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
pipe.InitBuffer(inQueueSrc, 1, 512 * sizeof(half));
pipe.InitBuffer(outQueueDst, 1, 512 * sizeof(half));
AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
// Contiguous transfer: use the overload that takes the count parameter.
AscendC::DataCopy(srcLocal, srcGlobal, 512);
AscendC::DataCopy(dstLocal, srcLocal, 512);
AscendC::DataCopy(dstGlobal, dstLocal, 512);
// Contiguous or non-contiguous transfer: use the overload that takes a DataCopyParams parameter.
// AscendC::DataCopyParams intriParams;
// AscendC::DataCopy(srcLocal, srcGlobal, intriParams);
// AscendC::DataCopy(dstLocal, srcLocal, intriParams);
// AscendC::DataCopy(dstGlobal, dstLocal, intriParams);
Result example:
Input data srcGlobal: [1 2 3... 512]
Output data dstGlobal: [1 2 3... 512]