Enhanced Data Transfer

Applicability

The enhanced data transfer function takes effect only on the CO1 -> CO2 (L0C buffer -> UB) path of the Atlas inference product's AI Core. On other product models and paths, the API can still be called, but the enhanced data transfer feature is disabled and the call behaves the same as basic data transfer.

Product	Supports Prototypes With Identical Data Types for Source and Destination Operands	Supports Prototypes With Different Data Types for Source and Destination Operands
Atlas A3 training products/Atlas A3 inference products	√	x
Atlas A2 training products/Atlas A2 inference products	√	x
Atlas 200I/500 A2 inference products	√	x
Atlas inference product's AI Core	√	√
Atlas inference product's Vector Core	√	x
Atlas training products	√	x

Function

Enhances the data transfer capability. Compared with basic data transfer APIs, the enhanced data transfer API adds on-the-fly computation over the CO1->CO2 path.

Prototype

Global Memory -> Local Memory

template <typename T>
__aicore__ inline void DataCopy(const LocalTensor<T>& dst, const GlobalTensor<T>& src, const DataCopyParams& intriParams, const DataCopyEnhancedParams& enhancedParams)

Local Memory -> Local Memory

template <typename T>
__aicore__ inline void DataCopy(const LocalTensor<T>& dst, const LocalTensor<T>& src, const DataCopyParams& intriParams, const DataCopyEnhancedParams& enhancedParams)

Local Memory -> Global Memory

template <typename T>
__aicore__ inline void DataCopy(const GlobalTensor<T>& dst, const LocalTensor<T>& src, const DataCopyParams& intriParams, const DataCopyEnhancedParams& enhancedParams)

Local Memory -> Local Memory: supporting different data types for source and destination operands

template <typename T, typename U>
__aicore__ inline void DataCopy(const LocalTensor<T>& dst, const LocalTensor<U>& src, const DataCopyParams& intriParams, const DataCopyEnhancedParams& enhancedParams)

For details about supported transfer paths and data types of each prototype, see Supported Paths and Data Types.

Parameters

**Table 1** Template parameters
Parameter	Description
T, U	Data type of the operand. For details about supported data types, see Supported Paths and Data Types.

**Table 2** Parameters
Parameter	Input/Output	Description
dst	Output	Destination operand, which is of the LocalTensor or GlobalTensor type.
src	Input	Source operand, which is of the LocalTensor or GlobalTensor type.
intriParams	Input	Transfer parameters, which are of the DataCopyParams type.
enhancedParams	Input	Enhanced information parameters, which are of the DataCopyEnhancedParams type. For the definition, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_data_copy.h. Replace ${INSTALL_DIR} with the CANN installation path.

**Table 3** Parameters in the DataCopyEnhancedParams structure
Parameter	Description
blockMode	Basic block shape for data transfer. This is an enumeration of type BlockMode, with the following options: BLOCK_MODE_NORMAL: The transfer unit is 32 bytes. Currently, this option is not supported. BLOCK_MODE_MATRIX: The transfer unit is a 16 × 16 cube block shape. BLOCK_MODE_VECTOR: The transfer unit is a 1 × 16 cube block shape. BLOCK_MODE_SMALL_CHANNEL: The transfer unit is a 16 × 4 cube block shape. Currently, this option is not supported. BLOCK_MODE_DEPTHWISE: The transfer unit is a 16 × 16 cube block shape, with the on-the-fly channel-split capability. Currently, this option is not supported. For details about the unit of parameters such as blockLen in each mode, see Table 4.
deqScale	Auxiliary parameter for on-the-fly precision conversion, namely the quantization mode. For available quantization modes and corresponding data types, see Table 5. For DEQ, DEQ8, and DEQ16 modes, you need to pass the quantization coefficient deqValue and configure the corresponding bits of deqValue. For VDEQ, VDEQ8, and VDEQ16 modes, you need to pass a quantization parameter vector consisting of 16 deqValue elements and configure the corresponding bits of deqTensorAddr. Meanwhile, ensure that each deqValue element of the dequantization parameter vector stored in DEQADDR conforms to expectations and usage restrictions. The length of the dequantization parameter vector is 32 bytes with 16 half elements in VDEQ mode, and 128 bytes with 16 64-bit dequantization elements in other modes.
deqValue	Quantization coefficient. For details about how to configure deqValue, see Table 6.
deqTensorAddr	Start address for storing the dequantization parameter vector in the UB. When deqScale is set to VDEQ, VDEQ8, or VDEQ16, the address of the parameter vector for dequantization computation must be passed in. The address must be 32-byte aligned. In VDEQ mode, this address points to a 32-byte dequantization parameter vector, where each element is 16 bits (half). In VDEQ8 and VDEQ16 modes, each element in the dequantization parameter vector is 64 bits. During the transfer, blockCount consecutive data blocks are transferred, and the length of each data block is blockLen. Each data block corresponds to a 128-byte dequantization vector. For the same data block, 16 elements in the dequantization parameter vector are continuously reused. Different data blocks correspond to different dequantization parameter vectors, and the address is offset by 128 bytes accordingly. For example, suppose the base address is A, the start address of the 128-byte dequantization parameter vector for the first data block is A, and the start address of the 128-byte dequantization parameter vector for the second data block is A + 128 bytes. The MCB flag bits of each element in the same dequantization parameter vector must be identical.
sidStoreMode	Storage mode when deqScale is DEQ8 or VDEQ8. It controls how the dequantization result is stored in the dst address. For details about the configuration, see Figure 1. 0: Data in dst is stored in the first half of each DataBlock, that is, the upper 16 bytes of every 32 bytes. 1: Data in dst is stored in the second half of each DataBlock, that is, the lower 16 bytes of every 32 bytes. 2: Data in dst is stored in a complete DataBlock, that is, the entire 32 bytes.
isRelu	Whether to perform an on-the-fly ReLU operation. When deqValue is configured: if this parameter is set to true, the ReLU flag of deqValue is set to 1; if it is set to false, the ReLU flag is left unchanged. When deqTensorAddr is configured, the ReLU flag in the elements of the dequantization parameter vector does not take effect, and the value of isRelu is used instead. If only isRelu is configured and no quantization mode is configured (that is, deqScale is set to DEQ_NONE), the following {src, dst} data type combinations are supported: {half, half}, {float, float}, {int32_t, int32_t}, and {float, half}. If both isRelu and quantization parameters are configured, see Table 5 for the supported data type combinations.
padMode	Reserved parameter, which is not supported currently.
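
The addressing rule described for deqTensorAddr (a 32-byte-aligned base address, with the vector for each subsequent data block offset by 128 bytes) can be sketched on the host as follows. This is only an illustration of the arithmetic for the 128-byte case: the helper names IsDeqAddrAligned and DeqVectorAddr are hypothetical and are not part of the Ascend C API.

```cpp
#include <cassert>
#include <cstdint>

// Byte stride between the dequantization parameter vectors of consecutive
// data blocks (16 elements x 64 bits), as described for deqTensorAddr.
constexpr uint64_t kDeqVectorStrideBytes = 128;
// Required alignment of the base address.
constexpr uint64_t kDeqAddrAlignBytes = 32;

// Hypothetical helper: true if addr satisfies the 32-byte alignment rule.
constexpr bool IsDeqAddrAligned(uint64_t addr) {
    return addr % kDeqAddrAlignBytes == 0;
}

// Hypothetical helper: start address of the parameter vector used by
// data block blockIdx (0-based), given the base address A.
inline uint64_t DeqVectorAddr(uint64_t baseAddr, uint32_t blockIdx) {
    assert(IsDeqAddrAligned(baseAddr));
    return baseAddr + static_cast<uint64_t>(blockIdx) * kDeqVectorStrideBytes;
}
```

With a base address A, DeqVectorAddr(A, 0) yields A and DeqVectorAddr(A, 1) yields A + 128, which matches the example in the deqTensorAddr description.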

**Table 4** Parameter units corresponding to different blockMode values
blockMode	src	dst	Data Type	blockLen Unit	srcStride Unit	dstStride Unit
BLOCK_MODE_NORMAL	GM	A1	int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double	32B	32B	32B
	GM	B1	int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double	32B	32B	32B
	GM	VECIN	int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double	32B	32B	32B
	VECOUT	GM	int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double	32B	32B	32B
	VECIN	VECOUT	int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double	32B	32B	32B
BLOCK_MODE_MATRIX	CO1	CO2	half, int16_t, uint16_t	512B	512B	32B
BLOCK_MODE_MATRIX	CO1	CO2	float, int32_t, uint32_t	1024B	1024B	32B
BLOCK_MODE_VECTOR	CO1	CO2	half, int16_t, uint16_t	32B	512B	32B
BLOCK_MODE_VECTOR	CO1	CO2	float, int32_t, uint32_t	64B	1024B	32B
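
As a rough aid for reading Table 4, the sketch below encodes the CO1 -> CO2 rows as a host-side lookup and derives the number of source bytes covered by a transfer. The enum Co1Co2BlockMode and the helpers UnitsFor and SourceBytes are hypothetical stand-ins written for this illustration; they are not Ascend C types or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Local stand-in for the two block modes that Table 4 lists on CO1 -> CO2.
enum class Co1Co2BlockMode { Matrix, Vector };

struct TransferUnits {
    uint32_t blockLenBytes;   // unit of blockLen
    uint32_t srcStrideBytes;  // unit of srcStride
    uint32_t dstStrideBytes;  // unit of dstStride
};

// elemBytes: 2 for half/int16_t/uint16_t, 4 for float/int32_t/uint32_t.
inline TransferUnits UnitsFor(Co1Co2BlockMode mode, size_t elemBytes) {
    assert(elemBytes == 2 || elemBytes == 4);
    const uint32_t wide = (elemBytes == 4) ? 2u : 1u;  // 4-byte types double the unit
    if (mode == Co1Co2BlockMode::Matrix) {
        return {512u * wide, 512u * wide, 32u};
    }
    return {32u * wide, 512u * wide, 32u};
}

// Source bytes covered by blockCount data blocks of blockLen units each.
inline uint64_t SourceBytes(Co1Co2BlockMode mode, size_t elemBytes,
                            uint32_t blockCount, uint32_t blockLen) {
    return uint64_t{blockCount} * blockLen * UnitsFor(mode, elemBytes).blockLenBytes;
}
```

For instance, 512 half elements occupy 1024 bytes, which in BLOCK_MODE_MATRIX corresponds to one data block with a blockLen of 2 (two 512-byte units).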

**Table 5** deqScale parameters
Quantization Mode	src.dtype	dst.dtype	Parameters That Are Used Together
DEQ	int32_t	half	Variable M in deqValue
DEQ	half	half	Variable M in deqValue
DEQ8	int32_t	int8_t	deqValue (variable M, variable N, MCB flag, offset, sign flag, ReLU flag), isRelu
DEQ8	int32_t	uint8_t	deqValue (variable M, variable N, MCB flag, offset, sign flag, ReLU flag), isRelu
DEQ16	int32_t	half	deqValue (variable M, variable N, MCB flag, ReLU flag), isRelu
DEQ16	int32_t	int16_t	deqValue (variable N, ReLU flag), isRelu
VDEQ	int32_t	half	deqTensorAddr, DEQADDR, ReLU flag, isRelu. For the fields that can be configured in each deqValue element of the dequantization parameter vector stored at deqTensorAddr, see the DEQ, DEQ8, and DEQ16 rows, respectively.
VDEQ8	int32_t	int8_t	Same as VDEQ.
VDEQ8	int32_t	uint8_t	Same as VDEQ.
VDEQ16	int32_t	half	Same as VDEQ.
VDEQ16	int32_t	int16_t	Same as VDEQ.
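
The split between scalar modes (which take deqValue) and vector modes (which take deqTensorAddr), together with the vector sizes given in the deqScale description, might be summarized as in the sketch below. QuantMode is a local mirror of the mode names used in this section, introduced only for illustration; it is not the SDK enumeration, and the helper functions are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Local mirror of the quantization mode names in Table 5 (not the SDK enum).
enum class QuantMode { None, Deq, Deq8, Deq16, VDeq, VDeq8, VDeq16 };

// Vector modes read a 16-element parameter vector from deqTensorAddr.
constexpr bool UsesDeqTensorAddr(QuantMode m) {
    return m == QuantMode::VDeq || m == QuantMode::VDeq8 || m == QuantMode::VDeq16;
}

// Scalar modes take a single quantization coefficient in deqValue.
constexpr bool UsesDeqValue(QuantMode m) {
    return m == QuantMode::Deq || m == QuantMode::Deq8 || m == QuantMode::Deq16;
}

// Size in bytes of one dequantization parameter vector: 16 half elements
// (32 bytes) in VDEQ, 16 64-bit elements (128 bytes) in VDEQ8 and VDEQ16,
// and 0 for modes that do not use a vector.
constexpr uint32_t DeqVectorBytes(QuantMode m) {
    return m == QuantMode::VDeq ? 16u * 2u
         : UsesDeqTensorAddr(m) ? 16u * 8u
         : 0u;
}
```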

**Table 6** deqValue configuration modes
Mode	Number of Bits	Variable	Description
DEQ8, VDEQ8, DEQ16, and VDEQ16	0–31	M	A 32-bit value, which is treated as a float and used as the multiplier for dequantization computation. The variable M does not take effect when src is int32_t and dst is int16_t.
	32–35	N	A 4-bit field with a value range of [1, 16] (binary 0000 corresponds to 1, binary 1111 corresponds to 16). For DEQ8 and VDEQ8 modes, the input value is right-shifted by N bits when the MCB flag is set to 1. For DEQ16 and VDEQ16 modes where the data type of dst is int16_t, the data is directly right-shifted by N bits regardless of the MCB flag.
	36	MCB flag	Mode control bit. If it is set to 0, the input int32_t data is directly converted to float. If it is set to 1, the input int32_t data is right-shifted by N bits, converted to int16_t, and then converted to float.
	37–45	Offset	A 9-bit integer value, which is added to the result of the dequantization computation src × M. This field is only used in DEQ8 and VDEQ8 modes. Set this field to 0 if the offset is not required.
	46	Sign flag	If it is set to 1, the dequantization result is signed (int8_t). If it is set to 0, the dequantization result is unsigned (uint8_t). This flag is only used in DEQ8 and VDEQ8 modes.
	47	ReLU flag	If it is set to 1, the ReLU operation is performed on the final result. If it is set to 0, no additional operation is performed. For the conversion from int32_t to int8_t with ReLU enabled, offset must be set to –128. For the conversion from int32_t to uint8_t with ReLU enabled, offset must be set to 0.
	48–63	-	Reserved.
DEQ and VDEQ	0–15	M	A 16-bit value, which is interpreted as a half and used as the multiplier for dequantization computation.
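
The bit layout in Table 6 for the DEQ8, VDEQ8, DEQ16, and VDEQ16 modes can be illustrated with a small host-side packing helper. This is a sketch under stated assumptions rather than an SDK utility: the DeqFields struct and the PackDeqValue function are hypothetical, and the 9-bit offset is assumed to be stored in two's complement (the table requires –128 to be representable but does not state the encoding). The 16-bit half layout used by DEQ and VDEQ is not covered.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical container for the fields listed in Table 6.
struct DeqFields {
    float m = 1.0f;         // multiplier M, bits 0-31 (raw float bits)
    uint32_t n = 1;         // shift N in [1, 16], bits 32-35, stored as N - 1
    bool mcb = false;       // MCB flag, bit 36
    int32_t offset = 0;     // 9-bit offset, bits 37-45 (assumed two's complement)
    bool isSigned = false;  // sign flag, bit 46
    bool relu = false;      // ReLU flag, bit 47
};

// Packs the fields into a 64-bit deqValue; bits 48-63 stay 0 (reserved).
inline uint64_t PackDeqValue(const DeqFields& f) {
    assert(f.n >= 1 && f.n <= 16);
    assert(f.offset >= -256 && f.offset <= 255);
    uint32_t mBits = 0;
    std::memcpy(&mBits, &f.m, sizeof(mBits));  // reinterpret the float as raw bits
    uint64_t v = mBits;
    v |= static_cast<uint64_t>(f.n - 1) << 32;  // binary 0000 means N = 1
    v |= static_cast<uint64_t>(f.mcb) << 36;
    v |= (static_cast<uint64_t>(static_cast<uint32_t>(f.offset)) & 0x1FFu) << 37;
    v |= static_cast<uint64_t>(f.isSigned) << 46;
    v |= static_cast<uint64_t>(f.relu) << 47;
    return v;
}
```

A caller would assign the returned value to the deqValue member of DataCopyEnhancedParams; for an int32_t to int8_t conversion with ReLU enabled, the offset field would be set to –128 as Table 6 requires.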

Figure 1 sidStoreMode configuration

Returns

None

Restrictions

You must ensure that the configuration of the isRelu parameter in DataCopyEnhancedParams matches that of the ReLU flag of the quantization coefficient deqValue or of the quantization parameter vector at deqTensorAddr.
If on-the-fly precision conversion is enabled for the CO1->CO2 path, the blockLen unit of operands over the UB path must be halved.
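
The first restriction can be checked on the host before a kernel is launched. The sketch below compares isRelu against bit 47 of a 64-bit deqValue, which is where Table 6 places the ReLU flag for the DEQ8, VDEQ8, DEQ16, and VDEQ16 layouts. The function names are hypothetical and are not part of the Ascend C API.

```cpp
#include <cassert>
#include <cstdint>

// Position of the ReLU flag in the 64-bit deqValue layout of Table 6.
constexpr unsigned kReluFlagBit = 47;

// Extracts the ReLU flag from a packed deqValue.
constexpr bool ReluFlagOf(uint64_t deqValue) {
    return ((deqValue >> kReluFlagBit) & 1ULL) != 0;
}

// True if isRelu agrees with the ReLU flag carried by deqValue.
constexpr bool ReluConfigConsistent(bool isRelu, uint64_t deqValue) {
    return ReluFlagOf(deqValue) == isRelu;
}

// Hypothetical guard: returns deqValue unchanged, asserting consistency first.
inline uint64_t CheckedDeqValue(bool isRelu, uint64_t deqValue) {
    assert(ReluConfigConsistent(isRelu, deqValue));
    return deqValue;
}
```

For a dequantization parameter vector, the same check would be applied to each of its 16 elements.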

Supported Paths and Data Types

The following data paths are expressed using logical positions TPosition, with the corresponding physical paths noted. For details about the mapping between TPosition and the physical memory, see Table 1.

**Table 7** Specific paths and supported data types of Local Memory -> Local Memory
Supported Model	Data Path	Data Types of the Source and Destination Operands (Same)
Atlas inference product's AI Core	CO1 -> CO2 (L0C Buffer -> UB)	half, float, int32_t, uint32_t

**Table 8** Specific paths and supported data types of Local Memory -> Local Memory (supporting different data types for source and destination operands)
Product Model	Data Path	Data Type of the Source Operand	Data Type of the Destination Operand
Atlas inference product's AI Core	CO1 -> CO2 (L0C Buffer -> UB)	float	half
Atlas inference product's AI Core	CO1 -> CO2 (L0C Buffer -> UB)	int32_t	int8_t, uint8_t, int16_t, half
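
The source and destination type pairs in Table 8 could be enforced at compile time with a small trait, as sketched below. This is an illustration only: HalfTag is a stand-in for the device half type, which host C++ does not provide, and IsCo1ToCo2ConvertPair is a hypothetical name that does not exist in the Ascend C headers.

```cpp
#include <cstdint>
#include <type_traits>

// Stand-in for the device `half` type (assumption for host-side illustration).
struct HalfTag {};

// Primary template: a {Src, Dst} pair is unsupported unless listed below.
template <typename Src, typename Dst>
struct IsCo1ToCo2ConvertPair : std::false_type {};

// Pairs listed in Table 8 for CO1 -> CO2 on the Atlas inference product's AI Core.
template <> struct IsCo1ToCo2ConvertPair<float, HalfTag> : std::true_type {};
template <> struct IsCo1ToCo2ConvertPair<int32_t, int8_t> : std::true_type {};
template <> struct IsCo1ToCo2ConvertPair<int32_t, uint8_t> : std::true_type {};
template <> struct IsCo1ToCo2ConvertPair<int32_t, int16_t> : std::true_type {};
template <> struct IsCo1ToCo2ConvertPair<int32_t, HalfTag> : std::true_type {};

static_assert(IsCo1ToCo2ConvertPair<int32_t, int8_t>::value, "listed in Table 8");
static_assert(!IsCo1ToCo2ConvertPair<float, int8_t>::value, "not listed in Table 8");
```

A wrapper around the two-type DataCopy prototype could place a static_assert on this trait so that an unsupported pair is rejected during compilation instead of at run time.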

**Table 9** Paths supported when enhancedParams does not take effect (In this case, the API can be called normally, while the enhanced data transfer feature is disabled. It behaves the same as the basic data transfer.)
Supported Model	Data Path
Atlas training products	GM -> VECIN; GM -> A1, B1; VECIN -> VECCALC or VECCALC -> VECOUT; VECOUT -> GM
Atlas inference product's AI Core	GM -> VECIN; GM -> A1, B1; VECIN -> VECCALC or VECCALC -> VECOUT; VECIN, VECCALC, VECOUT -> A1, B1; VECOUT, CO2 -> GM
Atlas inference product's Vector Core	GM -> VECIN; VECOUT -> GM
Atlas A2 training products/Atlas A2 inference products	GM -> VECIN; GM -> A1, B1; VECIN -> VECCALC or VECCALC -> VECOUT; VECIN, VECCALC, VECOUT -> TSCM; VECOUT -> GM; A1, B1 -> GM
Atlas A3 training products/Atlas A3 inference products	GM -> VECIN; GM -> A1, B1; VECIN -> VECCALC or VECCALC -> VECOUT; VECIN, VECCALC, VECOUT -> TSCM; VECOUT -> GM; A1, B1 -> GM
Atlas 200I/500 A2 inference products	GM -> VECIN; VECOUT -> GM

Example

AscendC::TPipe pipe;
// CO1 (L0C buffer) is the source position; CO2 (UB) is the destination position.
AscendC::TQue<AscendC::TPosition::CO1, 1> inQueueSrc;
AscendC::TQue<AscendC::TPosition::CO2, 1> outQueueDst;
...
AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
AscendC::DataCopyParams intriParams;
AscendC::DataCopyEnhancedParams enhancedParams;
// Only the block mode is set here: the transfer unit is a 16 x 16 cube block.
enhancedParams.blockMode = AscendC::BlockMode::BLOCK_MODE_MATRIX;
AscendC::DataCopy(dstLocal, srcLocal, intriParams, enhancedParams);
...

Result example:

Input (srcLocal): [1 2 3 ... 512]
Output (dstLocal): [1 2 3 ... 512]
