Basic Data Movement
Applicability
| Product | Prototypes with the same source and destination data type | Prototypes with different source and destination data types |
|---|---|---|
| | √ | √ |
| | √ | √ |
| | √ | x |
| | √ | x |
| | √ | x |
| | √ | x |
Functions
Provides basic data movement capabilities. During a transfer, the format and content of the data remain unchanged. Both contiguous and non-contiguous data movement are supported.
Prototype
- Global Memory -> Local Memory
    // Contiguous movement
    template <typename T>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const GlobalTensor<T>& src, const uint32_t count)

    // Both contiguous and non-contiguous movement are supported.
    template <typename T>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const GlobalTensor<T>& src, const DataCopyParams& repeatParams)
- Local Memory -> Local Memory
    // Contiguous movement
    template <typename T>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const LocalTensor<T>& src, const uint32_t count)

    // Both contiguous and non-contiguous movement are supported.
    template <typename T>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const LocalTensor<T>& src, const DataCopyParams& repeatParams)
- Local Memory -> Global Memory
    // Contiguous movement
    template <typename T>
    __aicore__ inline void DataCopy(const GlobalTensor<T>& dst, const LocalTensor<T>& src, const uint32_t count)

    // Both contiguous and non-contiguous movement are supported.
    template <typename T>
    __aicore__ inline void DataCopy(const GlobalTensor<T>& dst, const LocalTensor<T>& src, const DataCopyParams& repeatParams)
- Local Memory -> Local Memory, supporting the scenario where the data types of the source and destination operands are different.
    // Both contiguous and non-contiguous movement are supported.
    template <typename T, typename U>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const LocalTensor<U>& src, const DataCopyParams& repeatParams)
For details about the supported data paths and data types of each prototype, see Supported Channels and Data Types.
Parameters
| Parameter | Description |
|---|---|
| T, U | Operand data type. For details about the supported data types, see Supported Channels and Data Types. |

| Parameter | Input/Output | Meaning |
|---|---|---|
| dst | Output | Destination operand, of the LocalTensor or GlobalTensor type. If the LocalTensor is located in C2, the start address must be 64-byte aligned. If the LocalTensor is located in C2PIPE2GM, the start address must be 128-byte aligned. In all other cases, the start address of the LocalTensor must be 32-byte aligned. The start address of the GlobalTensor must be aligned to the size in bytes of the corresponding data type. |
| src | Input | Source operand, of the LocalTensor or GlobalTensor type. The start address of the LocalTensor must be 32-byte aligned. The start address of the GlobalTensor must be aligned to the size in bytes of the corresponding data type. |
| repeatParams | Input | Transfer parameters, of the DataCopyParams type. Use this parameter to configure the size, number, and spacing of the data blocks to be transferred. Both contiguous and non-contiguous transfer are supported. For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_data_copy.h. Replace ${INSTALL_DIR} with the actual CANN component directory. |
| count | Input | Number of elements to move. NOTE: The value of count * sizeof(T) must be 32-byte aligned. If it is not, the transfer volume is rounded down to the nearest multiple of 32 bytes. |

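The rounding rule for count can be checked on the host before launching a kernel. The following is a minimal sketch under that assumption, not part of the Ascend C API: `kBlockBytes`, `EffectiveCopyBytes`, and `IsCountAligned` are hypothetical names introduced here for illustration, and the code only restates the 32-byte rule described for the count parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side helpers (not Ascend C APIs). They model the rule
// stated for the count parameter: count * sizeof(T) should be a multiple of
// 32 bytes, otherwise the transfer volume is rounded down.
constexpr std::uint64_t kBlockBytes = 32;

// Number of bytes that would actually be moved for `count` elements of
// `elemSize` bytes each, after rounding down to a multiple of 32 bytes.
inline std::uint64_t EffectiveCopyBytes(std::uint32_t count, std::size_t elemSize) {
    assert(elemSize > 0);
    const std::uint64_t requested = static_cast<std::uint64_t>(count) * elemSize;
    return requested / kBlockBytes * kBlockBytes;
}

// True when the requested volume already satisfies the 32-byte rule.
inline bool IsCountAligned(std::uint32_t count, std::size_t elemSize) {
    assert(elemSize > 0);
    return (static_cast<std::uint64_t>(count) * elemSize) % kBlockBytes == 0;
}
```

For example, 20 elements of a 2-byte type request 40 bytes, of which only 32 bytes would be moved under this rule, so a caller would typically pad the count or handle the tail separately.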
| Field | Meaning |
|---|---|
| blockCount | Number of contiguous data blocks to be transferred. Type: uint16_t. Value range: blockCount ∈ [1, 4095]. |
| blockLen | Length of each contiguous data block to be transferred, in units of DataBlock (32 bytes). Type: uint16_t. Value range: blockLen ∈ [1, 65535]. When dst is located in C2PIPE2GM, the unit is 128 bytes. When dst is located in C2, the unit is 64 bytes, and the value is the length of each contiguous data block of the source operand. |
| srcGap | Gap between adjacent contiguous data blocks of the source operand, measured from the end of one data block to the beginning of the next, in units of DataBlock (32 bytes). Type: uint16_t; the value must be within the range of this type. In the L1 Buffer -> Fixpipe Buffer scenario, srcGap is instead measured from the beginning of one data block to the beginning of the next, in units of DataBlock (32 bytes). |
| dstGap | Gap between adjacent contiguous data blocks of the destination operand, measured from the end of one data block to the beginning of the next, in units of DataBlock (32 bytes). Type: uint16_t; the value must be within the range of this type. When dst is located in C2PIPE2GM, the unit is 128 bytes. When dst is located in C2, the unit is 64 bytes. In the L1 Buffer -> Fixpipe Buffer scenario, dstGap is instead measured from the beginning of one data block to the beginning of the next, in units of DataBlock (32 bytes). |

The following example shows how to use the DataCopyParams structure. In the example, two contiguous data chunks are moved, and each chunk contains eight DataBlocks. There is no gap between adjacent chunks of the source operand, while there is a gap of one DataBlock between the tail of one chunk and the head of the next in the destination operand; that is, blockCount = 2, blockLen = 8, srcGap = 0, and dstGap = 1.
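The layout described above can be modeled on the host to see where each byte lands. This is a rough sketch, not Ascend C kernel code: `DataCopyParamsModel`, `SpanBytes`, `SimulateDataCopy`, and `ExampleParams` are hypothetical names introduced here, the real structure is the one declared in kernel_struct_data_copy.h, and the model assumes the default 32-byte DataBlock unit with gaps measured from the tail of one chunk to the head of the next.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical host-side stand-in for DataCopyParams (illustration only).
struct DataCopyParamsModel {
    std::uint16_t blockCount;  // number of contiguous chunks
    std::uint16_t blockLen;    // chunk length, in 32-byte DataBlocks
    std::uint16_t srcGap;      // source gap, chunk tail to next chunk head, in DataBlocks
    std::uint16_t dstGap;      // destination gap, chunk tail to next chunk head, in DataBlocks
};

constexpr std::size_t kDataBlockBytes = 32;  // assumed default unit

// Bytes spanned on one side of the transfer: all chunks plus the gaps between them.
inline std::size_t SpanBytes(const DataCopyParamsModel& p, std::uint16_t gap) {
    assert(p.blockCount >= 1 && p.blockLen >= 1);
    const std::size_t chunks = static_cast<std::size_t>(p.blockCount) * p.blockLen;
    const std::size_t gaps = static_cast<std::size_t>(p.blockCount - 1) * gap;
    return (chunks + gaps) * kDataBlockBytes;
}

// Copies blockCount chunks from src to dst, skipping the configured gaps.
inline void SimulateDataCopy(std::vector<std::uint8_t>& dst,
                             const std::vector<std::uint8_t>& src,
                             const DataCopyParamsModel& p) {
    assert(src.size() >= SpanBytes(p, p.srcGap));
    assert(dst.size() >= SpanBytes(p, p.dstGap));
    const std::size_t chunkBytes = p.blockLen * kDataBlockBytes;
    const std::size_t srcStride = (static_cast<std::size_t>(p.blockLen) + p.srcGap) * kDataBlockBytes;
    const std::size_t dstStride = (static_cast<std::size_t>(p.blockLen) + p.dstGap) * kDataBlockBytes;
    for (std::size_t i = 0; i < p.blockCount; ++i) {
        std::memcpy(dst.data() + i * dstStride, src.data() + i * srcStride, chunkBytes);
    }
}

// Values from the example in the text: 2 chunks of 8 DataBlocks,
// no source gap, one DataBlock of destination gap.
inline DataCopyParamsModel ExampleParams() {
    return DataCopyParamsModel{2, 8, 0, 1};
}
```

With these values the source side spans 512 bytes and the destination side spans 544 bytes; the 32 bytes between the two destination chunks are left untouched by the copy.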

Returns
None
Restrictions
- If multiple DataCopy instructions are executed and their destination addresses overlap, call PipeBarrier(ISASI) to insert a synchronization instruction between them so that the instructions execute serially and the data is not corrupted. As shown in the following figure on the left, when the destination GM addresses of two DataCopy instructions overlap, synchronize the MTE3 output pipeline between the two instructions by calling PipeBarrier<PIPE_MTE3>(). As shown in the following figure on the right, when the destination Unified Buffer addresses overlap, synchronize the MTE2 input pipeline between the two instructions by calling PipeBarrier<PIPE_MTE2>().

- For the following product models, DataCopy APIs support cross-device data transfer in the cross-device communication operator development scenario:
  - Atlas A2 training products / Atlas A2 inference products
  - Atlas A3 training products / Atlas A3 inference products

  Only HCCS physical links are supported. During development, pay attention to the physical links used for inter-card communication. You can run the npu-smi info -t topo command to query the HCCS physical links.
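For the first restriction above, the condition that calls for a barrier is simply that two destination ranges intersect. The sketch below is a host-side illustration of that check, not an Ascend C API: `DstRange` and `DestinationsOverlap` are hypothetical names, and the ranges are assumed to be byte offsets within the same memory (both in GM, or both in the Unified Buffer).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a DataCopy destination: a byte offset and a
// byte length within one memory space (illustration only).
struct DstRange {
    std::uint64_t offsetBytes;
    std::uint64_t lengthBytes;
};

// True when the two half-open ranges [offset, offset + length) intersect.
// When this holds for two consecutive DataCopy calls, the text above asks for
// a barrier between them, along the lines of:
//   DataCopy(dstA, ...);
//   PipeBarrier<PIPE_MTE3>();  // GM destinations; PIPE_MTE2 for Unified Buffer
//   DataCopy(dstB, ...);
inline bool DestinationsOverlap(const DstRange& a, const DstRange& b) {
    assert(a.lengthBytes > 0 && b.lengthBytes > 0);
    return a.offsetBytes < b.offsetBytes + b.lengthBytes &&
           b.offsetBytes < a.offsetBytes + a.lengthBytes;
}
```

Ranges that merely touch, such as one ending at byte 1024 and the next starting at byte 1024, do not count as overlapping in this model.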
Supported Channels and Data Types
The data paths in the following sections are expressed by using the logical location TPosition, and the corresponding physical paths are marked. For details about the mapping between TPosition and physical memory, see Table 1.
| Internal Model | Datapath | Data Types of the Source and Destination Operands (Same) |
|---|---|---|
| | | int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, float, double |
| | | int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, float, double |
| | | int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, float, double |
| | | int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double |
| | | int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double |
| | | int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double |

|
Internal Model |
Datapath |
Data Types of the Source and Destination Operands (Same) |
|---|---|---|
|
|
|
int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, float, double |
|
|
|
int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, float, double |
|
|
|
int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double |
|
int32_t, float |
|
|
|
|
int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double |
|
int32_t, float |
| Internal Model | Datapath | Data Types of the Source and Destination Operands (Same) |
|---|---|---|
| | | int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, float, double |
| | | int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, float, double |
| | | int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, float, double |
| | | int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double |
| | | int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double |
| | | int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double |

| Internal Model | Datapath | Source Operand | Destination Operand |
|---|---|---|---|
| | C1 -> C2 (L1 Buffer -> BiasTable Buffer) | half | float |
| | C1 -> C2 (L1 Buffer -> BiasTable Buffer) | half | float |
Examples
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
    pipe.InitBuffer(inQueueSrc, 1, 512 * sizeof(half));
    pipe.InitBuffer(outQueueDst, 1, 512 * sizeof(half));
    AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
    AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
    // Use the overload that takes count to perform a contiguous transfer.
    AscendC::DataCopy(srcLocal, srcGlobal, 512);
    AscendC::DataCopy(dstLocal, srcLocal, 512);
    AscendC::DataCopy(dstGlobal, dstLocal, 512);
    // Use the overload that takes DataCopyParams for contiguous or non-contiguous transfer.
    // AscendC::DataCopyParams intriParams;
    // AscendC::DataCopy(srcLocal, srcGlobal, intriParams);
    // AscendC::DataCopy(dstLocal, srcLocal, intriParams);
    // AscendC::DataCopy(dstGlobal, dstLocal, intriParams);
Input data srcGlobal: [1 2 3... 512]
Output data dstGlobal: [1 2 3... 512]