Enhanced Data Movement

Function Usage

Compared with the common data movement APIs, the enhanced data movement APIs add path-associated computation through the DataCopyEnhancedParams parameter in the CO1->CO2 path. In other paths, the DataCopyEnhancedParams parameter does not take effect and enhanced data movement APIs are equivalent to the common data movement APIs.

Prototype

The source operand is GlobalTensor, and the destination operand is LocalTensor.

template <typename T>
__aicore__ inline void DataCopy(const LocalTensor<T>& dstLocal, const GlobalTensor<T>& srcGlobal, const DataCopyParams& intriParams, const DataCopyEnhancedParams& enhancedParams)

The prototype supports the following data paths and types:

**Table 1** Data paths and types (GlobalTensor as the source operand and LocalTensor as the destination operand)
Model	Data Path	Data Types of the Source and Destination Operands (Same)	Does DataCopyEnhancedParams Take Effect
Atlas Training Series Product	GM -> VECIN	int8_t / uint8_t / int16_t / uint16_t / int32_t / uint32_t / int64_t / uint64_t / half / float / double	No
Atlas Training Series Product	GM -> A1/B1	int8_t / uint8_t / int16_t / uint16_t / int32_t / uint32_t / int64_t / uint64_t / half / float / double	No

Both the source operand and destination operand are LocalTensor.

template <typename T>
__aicore__ inline void DataCopy(const LocalTensor<T>& dstLocal, const LocalTensor<T>& srcLocal, const DataCopyParams& intriParams, const DataCopyEnhancedParams& enhancedParams)

The prototype supports the following data paths and types:

**Table 2** Data paths and types (LocalTensor as the source operand and destination operand)
Model	Data Path	Data Types of the Source and Destination Operands (Same)	Does DataCopyEnhancedParams Take Effect
Atlas Training Series Product	VECIN -> VECCALC, VECCALC->VECOUT	int8_t / uint8_t / int16_t / uint16_t / int32_t / uint32_t / int64_t / uint64_t / half / float / double	No

The source operand is LocalTensor, and the destination operand is GlobalTensor.

template <typename T>
__aicore__ inline void DataCopy(const GlobalTensor<T>& dstGlobal, const LocalTensor<T>& srcLocal, const DataCopyParams& intriParams, const DataCopyEnhancedParams& enhancedParams)

The prototype supports the following data paths and types:

**Table 3** Data paths and types (LocalTensor as the source operand and GlobalTensor as the destination operand)
Model	Data Path	Data Types of the Source and Destination Operands (Same)	Does DataCopyEnhancedParams Take Effect
Atlas Training Series Product	VECOUT -> GM	int8_t / uint8_t / int16_t / uint16_t / int32_t / uint32_t / int64_t / uint64_t / half / float / double	No

Parameters

**Table 4** Parameters of the enhanced data movement APIs
Parameter	Input/Output	Meaning
dstLocal, dstGlobal	Output	Destination operand of type LocalTensor or GlobalTensor. The supported data types are half, int16_t, uint16_t, float, int32_t, uint32_t, int8_t, and uint8_t.
srcLocal, srcGlobal	Input	Source operand of type LocalTensor or GlobalTensor. The supported data types are half, int16_t, uint16_t, float, int32_t, uint32_t, int8_t, and uint8_t.
intriParams	Input	Movement parameter. The type is DataCopyParams. For details about the structure definition of DataCopyParams, see Table 5.
enhancedParams	Input	Enhanced information parameter. DataCopyEnhancedParams type. The DataCopyEnhancedParams structure is defined as follows:

struct DataCopyEnhancedParams {
    BlockMode blockMode = BlockMode::BLOCK_MODE_NORMAL;
    DeqScale deqScale = DeqScale::DEQ_NONE;
    uint64_t deqValue = 0;
    uint8_t sidStoreMode = 0;
    bool isRelu = false;
    pad_t padMode = pad_t::PAD_NONE;
    uint64_t padValue = 0;
    uint64_t deqTensorAddr = 0;
};

For details about the parameters, see Table 5.

**Table 5** Parameters in the DataCopyEnhancedParams structure
Parameter	Meaning
blockMode	Fractal unit used for data movement, of the BlockMode enumeration type. The following configurations are supported: BLOCK_MODE_NORMAL: The movement unit is 32 bytes. Currently, this option is not supported. BLOCK_MODE_MATRIX: The movement unit is a 16 x 16 Cube fractal. BLOCK_MODE_VECTOR: The movement unit is a 1 x 16 Cube fractal. BLOCK_MODE_SMALL_CHANNEL: The movement unit is a 16 x 4 Cube fractal. Currently, this option is not supported. BLOCK_MODE_DEPTHWISE: The movement unit is a 16 x 16 Cube fractal, and the channel-split function is provided. Currently, this option is not supported. For details about the units of parameters such as blockLen in each mode, see Table 6.
deqScale	Auxiliary parameter for path-associated precision conversion, that is, the quantization mode. For details about the supported quantization modes and corresponding data types, see Table 7. In DEQ, DEQ8, and DEQ16 modes, the quantization coefficient deqValue must be passed, with the bits that apply to the mode set in deqValue. In VDEQ, VDEQ8, and VDEQ16 modes, a dequantization parameter vector containing 16 deqValue elements must be passed, and deqTensorAddr must be set accordingly. In addition, each deqValue element of the dequantization parameter vector stored at DEQADDR must meet the expectations and usage restrictions that apply to deqValue. In VDEQ mode, the dequantization parameter vector is 32 bytes long (16 half elements). In the other modes, it is 128 bytes long (16 64-bit elements).
deqValue	Quantization coefficient. For details about how to configure deqValue, see Table 8.
deqTensorAddr	Start address in the UB of the dequantization parameter vector. When deqScale is set to VDEQ, VDEQ8, or VDEQ16, this address must be passed, and it must be 32-byte aligned. In VDEQ mode, the address points to a 32-byte dequantization parameter vector in which each element is 16 bits (half). In VDEQ8 and VDEQ16 modes, each element of the dequantization parameter vector is 64 bits. During movement, blockCount consecutive data chunks are moved, each of length blockLen. Each data chunk corresponds to one 128-byte dequantization parameter vector. Within a data chunk, the 16 elements of the vector are reused repeatedly. Different data chunks use different dequantization parameter vectors, with an address offset of 128 bytes between them. For example, if the start address is A, the 128-byte vector for the first data chunk starts at A, and the 128-byte vector for the second data chunk starts at A + 128 bytes. All elements in the same dequantization parameter vector must have the same MCB flag.
sidStoreMode	Storage mode when deqScale is DEQ8 or VDEQ8. It controls how the dequantization result is stored at the dst address. For details about the configuration, see Figure 1. 0: The dst data is stored in the first half of each data block, that is, the upper 16 bytes of every 32 bytes. 1: The dst data is stored in the second half of each data block, that is, the lower 16 bytes of every 32 bytes. 2: The dst data is stored in a complete data block, that is, the entire 32 bytes.
isRelu	Whether to perform path-associated linear rectification (ReLU). When deqValue is configured: if this parameter is set to true, the ReLU flag of deqValue is updated to 1; if it is set to false, the ReLU flag is not modified. When deqTensorAddr is configured, the ReLU flag in the elements of the dequantization parameter vector does not take effect, and the value of isRelu takes effect instead. If only isRelu is configured and no quantization parameters are configured (that is, deqScale is set to DEQ_NONE), the supported data type combinations of src and dst are {half, half}, {float, float}, {int32_t, int32_t}, and {float, half}. If both isRelu and quantization parameters are configured, see Table 7 for the supported data type combinations.
padMode	Reserved.
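
The alignment and per-chunk addressing rules given for deqTensorAddr in Table 5 can be sketched as plain address arithmetic. This is a standalone illustration rather than Ascend C code: the helper names IsDeqTensorAddrAligned and DeqVectorAddrForChunk and both constants are hypothetical, and the 128-byte stride is assumed to be the VDEQ8/VDEQ16 case described above.

```cpp
#include <cstdint>

// Hypothetical constants derived from the deqTensorAddr description in Table 5.
constexpr uint64_t kDeqAddrAlignBytes = 32;  // deqTensorAddr must be 32-byte aligned
constexpr uint64_t kDeqVectorBytes = 128;    // 16 elements x 64 bits (VDEQ8 / VDEQ16)

// Returns true if the start address satisfies the 32-byte alignment rule.
constexpr bool IsDeqTensorAddrAligned(uint64_t addr) {
    return addr % kDeqAddrAlignBytes == 0;
}

// Start address of the dequantization parameter vector used by the data chunk
// with the given 0-based index, assuming consecutive 128-byte vectors.
constexpr uint64_t DeqVectorAddrForChunk(uint64_t startAddr, uint64_t chunkIndex) {
    return startAddr + chunkIndex * kDeqVectorBytes;
}

// The example from the table: the first chunk uses A, the second uses A + 128 bytes.
static_assert(DeqVectorAddrForChunk(0, 0) == 0, "first chunk starts at A");
static_assert(DeqVectorAddrForChunk(0, 1) == 128, "second chunk starts at A + 128B");
```

With blockCount data chunks, a caller would need blockCount x 128 bytes of parameter data starting at the aligned address, under the same assumptions.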

**Table 6** Parameter units corresponding to different blockMode values
blockMode	src	dst	Data Type	blockLen unit	srcStride Unit	dstStride Unit
BLOCK_MODE_NORMAL	GM	A1	half/bfloat16_t/int16_t/uint16_t/float/int32_t/uint32_t/int8_t/uint8_t/int64_t/uint64_t/double	32B	32B	32B
BLOCK_MODE_NORMAL	GM	B1	half/bfloat16_t/int16_t/uint16_t/float/int32_t/uint32_t/int8_t/uint8_t/int64_t/uint64_t/double	32B	32B	32B
BLOCK_MODE_NORMAL	GM	VECIN	half/bfloat16_t/int16_t/uint16_t/float/int32_t/uint32_t/int8_t/uint8_t/int64_t/uint64_t/double	32B	32B	32B
BLOCK_MODE_NORMAL	VECOUT	GM	half/bfloat16_t/int16_t/uint16_t/float/int32_t/uint32_t/int8_t/uint8_t/int64_t/uint64_t/double	32B	32B	32B
BLOCK_MODE_NORMAL	VECIN	VECOUT	half/bfloat16_t/int16_t/uint16_t/float/int32_t/uint32_t/int8_t/uint8_t/int64_t/uint64_t/double	32B	32B	32B
BLOCK_MODE_MATRIX	CO1	CO2	half/int16_t/uint16_t	512B	512B	32B
BLOCK_MODE_MATRIX	CO1	CO2	float/int32_t/uint32_t	1024B	1024B	32B
BLOCK_MODE_VECTOR	CO1	CO2	half/int16_t/uint16_t	32B	512B	32B
BLOCK_MODE_VECTOR	CO1	CO2	float/int32_t/uint32_t	64B	1024B	32B
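
The byte units in Table 6 follow from the fractal shape multiplied by the element size. The sketch below reproduces that arithmetic in standalone C++. The enum SketchBlockMode and the three helper functions are hypothetical stand-ins and are not part of the Ascend C API; uint16_t and int32_t stand in for the 2-byte and 4-byte types listed in the table.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the BlockMode values covered by Table 6.
enum class SketchBlockMode { Normal, Matrix, Vector };

constexpr std::size_t kDataBlockBytes = 32;  // one data block
constexpr std::size_t kFractalRows = 16;     // Cube fractal rows
constexpr std::size_t kFractalCols = 16;     // Cube fractal columns

// Unit of blockLen in bytes: a 16 x 16 fractal for MATRIX, a 1 x 16 row for
// VECTOR, and a 32-byte data block for NORMAL.
template <typename T>
constexpr std::size_t BlockLenUnitBytes(SketchBlockMode mode) {
    return mode == SketchBlockMode::Matrix ? kFractalRows * kFractalCols * sizeof(T)
         : mode == SketchBlockMode::Vector ? kFractalCols * sizeof(T)
                                           : kDataBlockBytes;
}

// Unit of srcStride in bytes: a full 16 x 16 fractal for both CO1 -> CO2 modes.
template <typename T>
constexpr std::size_t SrcStrideUnitBytes(SketchBlockMode mode) {
    return mode == SketchBlockMode::Normal ? kDataBlockBytes
                                           : kFractalRows * kFractalCols * sizeof(T);
}

// Unit of dstStride in bytes: 32 bytes in every row of Table 6.
constexpr std::size_t DstStrideUnitBytes() { return kDataBlockBytes; }

static_assert(BlockLenUnitBytes<uint16_t>(SketchBlockMode::Matrix) == 512, "2-byte types");
static_assert(BlockLenUnitBytes<int32_t>(SketchBlockMode::Vector) == 64, "4-byte types");
```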

**Table 7** deqScale parameters
Quantization Mode	src.dtype	dst.dtype	Parameters That Are Used Together
DEQ	int32_t	half	Variable M in deqValue
DEQ	half	half	Variable M in deqValue
DEQ8	int32_t	int8_t	deqValue (variable M, variable N, MCB flag, Offset, Sign flag, ReLU flag), isRelu
DEQ8	int32_t	uint8_t	deqValue (variable M, variable N, MCB flag, Offset, Sign flag, ReLU flag), isRelu
DEQ16	int32_t	half	deqValue (variable M, variable N, MCB flag, ReLU flag), isRelu
DEQ16	int32_t	int16_t	deqValue (variable N, ReLU flag), isRelu
VDEQ	int32_t	half	deqTensorAddr, DEQADDR, ReLU flag, isRelu. For details about the parameters that can be configured for the deqValue elements in the dequantization parameter vector stored at deqTensorAddr, see the descriptions of DEQ, DEQ8, and DEQ16, respectively.
VDEQ8	int32_t	int8_t	Same as VDEQ
VDEQ8	int32_t	uint8_t	Same as VDEQ
VDEQ16	int32_t	half	Same as VDEQ
VDEQ16	int32_t	int16_t	Same as VDEQ
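
The src/dst combinations in Table 7 can be encoded as a small lookup, for example to reject an unsupported configuration before a kernel is launched. The following is a minimal sketch under stated assumptions: SketchDeqScale, SketchDtype, and IsSupportedDeqCombo are hypothetical names introduced for illustration and do not exist in the Ascend C API.

```cpp
// Hypothetical stand-ins for the quantization modes and data types in Table 7.
enum class SketchDeqScale { Deq, Deq8, Deq16, VDeq, VDeq8, VDeq16 };
enum class SketchDtype { Int32, Half, Int8, UInt8, Int16 };

// Returns true if Table 7 lists the (mode, src, dst) combination.
inline bool IsSupportedDeqCombo(SketchDeqScale mode, SketchDtype src, SketchDtype dst) {
    // DEQ is the only mode in Table 7 that also accepts a half source.
    if (mode == SketchDeqScale::Deq && src == SketchDtype::Half) {
        return dst == SketchDtype::Half;
    }
    // Every other row of Table 7 uses an int32_t source.
    if (src != SketchDtype::Int32) {
        return false;
    }
    switch (mode) {
        case SketchDeqScale::Deq:
        case SketchDeqScale::VDeq:
            return dst == SketchDtype::Half;
        case SketchDeqScale::Deq8:
        case SketchDeqScale::VDeq8:
            return dst == SketchDtype::Int8 || dst == SketchDtype::UInt8;
        case SketchDeqScale::Deq16:
        case SketchDeqScale::VDeq16:
            return dst == SketchDtype::Half || dst == SketchDtype::Int16;
    }
    return false;
}
```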

**Table 8** deqValue configuration mode
Mode	Number of Bits	Variable	Function
DEQ8, VDEQ8, DEQ16, and VDEQ16	0~31	M	The 32-bit value is interpreted as the float type and is used as the multiplier for dequantization. Variable M does not take effect when the data types of src and dst are int32_t and int16_t, respectively.
DEQ8, VDEQ8, DEQ16, and VDEQ16	32~35	N	Four bits. The value range is [1, 16] (b'0000 indicates 1, and b'1111 indicates 16). When the mode is DEQ8 or VDEQ8 and the MCB flag is set to 1, the input value is shifted rightwards by N bits. When the mode is DEQ16 or VDEQ16 and the dst data type is int16_t, the input value is shifted rightwards by N bits regardless of the MCB flag setting.
DEQ8, VDEQ8, DEQ16, and VDEQ16	36	MCB flag	Mode Control Bit. If it is set to 0, the input int32_t is directly converted to float. If it is set to 1, the input int32_t is shifted rightwards by N bits and converted to int16_t, and then converted to float.
DEQ8, VDEQ8, DEQ16, and VDEQ16	37~45	Offset	9-bit integer. The offset is added to the dequantization result of src x M. It is used only in DEQ8 and VDEQ8 modes. If the offset is not used, set it to 0.
DEQ8, VDEQ8, DEQ16, and VDEQ16	46	Sign flag	If it is set to 1, the dequantization result is signed (int8_t). If it is set to 0, the dequantization result is unsigned (uint8_t). It is used only in DEQ8 and VDEQ8 modes.
DEQ8, VDEQ8, DEQ16, and VDEQ16	47	ReLU flag	If it is set to 1, ReLU computation is performed on the final result. If it is set to 0, no extra computation is performed. int32_t->int8_t: offset must be set to -128 when ReLU is configured. int32_t->uint8_t: offset must be set to 0 when ReLU is configured.
DEQ8, VDEQ8, DEQ16, and VDEQ16	48~63	-	Reserved
DEQ and VDEQ	0~15	M	The 16-bit value is interpreted as the half type and is used as the multiplier for dequantization.
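
The bit layout that Table 8 gives for DEQ8, VDEQ8, DEQ16, and VDEQ16 can be assembled with shifts and masks, as in the sketch below. It is standalone C++ rather than Ascend C code. The struct DeqFields and the function PackDeqValue are hypothetical, and two points are assumptions rather than statements from the table: the N field is stored as N - 1 (inferred from "b'0000 indicates 1"), and a negative offset such as -128 is stored in 9-bit two's complement.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical container for the Table 8 fields (DEQ8/VDEQ8/DEQ16/VDEQ16 layout).
struct DeqFields {
    float m;        // bits 0~31: multiplier, stored as its float bit pattern
    uint8_t n;      // bits 32~35: right-shift amount in [1, 16]
    bool mcb;       // bit 36: Mode Control Bit
    int16_t offset; // bits 37~45: 9-bit offset (DEQ8/VDEQ8 only)
    bool isSigned;  // bit 46: Sign flag (DEQ8/VDEQ8 only)
    bool relu;      // bit 47: ReLU flag
};

// Packs the fields into a 64-bit value. Bits 48~63 are reserved and left at 0.
inline uint64_t PackDeqValue(const DeqFields& f) {
    uint32_t mBits = 0;
    std::memcpy(&mBits, &f.m, sizeof(mBits));  // reinterpret float as raw bits

    uint64_t value = mBits;
    // Assumption: N is encoded as N - 1, so 1 -> b'0000 and 16 -> b'1111.
    value |= (static_cast<uint64_t>(f.n - 1) & 0xFULL) << 32;
    value |= static_cast<uint64_t>(f.mcb) << 36;
    // Assumption: the offset is kept in 9-bit two's complement.
    value |= (static_cast<uint64_t>(static_cast<uint16_t>(f.offset)) & 0x1FFULL) << 37;
    value |= static_cast<uint64_t>(f.isSigned) << 46;
    value |= static_cast<uint64_t>(f.relu) << 47;
    return value;
}
```

For the DEQ and VDEQ layout, only bits 0~15 are used and hold M as a half value; that conversion is not covered by this sketch.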

Figure 1 sidStoreMode configuration

Precautions

Developers must ensure that the isRelu parameter in DataCopyEnhancedParams is configured consistently with the ReLU flag of the quantization coefficient deqValue or of the dequantization parameter vector at deqTensorAddr.
If precision conversion is performed along CO1->CO2 data movement, the unit of blockLen of the operand for UB data movements needs to be halved.
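
The first precaution can be checked on the host side before the parameters are passed. The following is a minimal sketch, assuming the ReLU flag sits at bit 47 as in the DEQ8/VDEQ8/DEQ16/VDEQ16 layout of Table 8. The function name ReluFlagMatches and the constant are hypothetical and are not part of the Ascend C API.

```cpp
#include <cstdint>

// Bit position of the ReLU flag in deqValue, per Table 8.
constexpr unsigned kReluFlagBit = 47;

// Returns true if the ReLU flag in deqValue agrees with the isRelu setting.
constexpr bool ReluFlagMatches(uint64_t deqValue, bool isRelu) {
    return (((deqValue >> kReluFlagBit) & 1ULL) != 0) == isRelu;
}

static_assert(ReluFlagMatches(1ULL << 47, true), "flag set, isRelu true");
static_assert(!ReluFlagMatches(0, true), "flag clear, isRelu true: mismatch");
```

For a dequantization parameter vector, the same check would be applied to each of its 64-bit elements.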

Availability

Atlas Training Series Product
