Enhanced Data Movement

Applicability

The enhanced data movement function takes effect only on the CO1 -> CO2 (L0C Buffer -> UB) channel of the AI Core in Atlas inference products. For other products and channels that are listed as supported, the API can be called but the enhanced data movement function does not take effect; the call is equivalent to basic data movement.

| Product | Supported/Unsupported (prototype with consistent source and destination operand types) | Supported/Unsupported (prototype with inconsistent source and destination operand types) |
|---|---|---|
| Atlas A3 training products/Atlas A3 inference products | x | x |
| Atlas A2 training products/Atlas A2 inference products | x | x |
| Atlas 200I/500 A2 inference products | x | x |
| Atlas inference product's AI Core | √ | √ |
| Atlas inference product's Vector Core | x | x |
| Atlas training products | x | x |

Functions

Enhances the data movement capability. Compared with the basic data movement APIs, the enhanced data movement APIs additionally support in-line computation (such as precision conversion and ReLU) on the CO1 -> CO2 channel.

Prototype

  • Global Memory -> Local Memory
    template <typename T>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const GlobalTensor<T>& src, const DataCopyParams& intriParams, const DataCopyEnhancedParams& enhancedParams)
    
  • Local Memory -> Local Memory
    template <typename T>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const LocalTensor<T>& src, const DataCopyParams& intriParams, const DataCopyEnhancedParams& enhancedParams)
    
  • Local Memory -> Global Memory
    template <typename T>
    __aicore__ inline void DataCopy(const GlobalTensor<T>& dst, const LocalTensor<T>& src, const DataCopyParams& intriParams, const DataCopyEnhancedParams& enhancedParams)
    
  • Local Memory -> Local Memory, supporting inconsistent types of source and destination operands
    template <typename T, typename U>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const LocalTensor<U>& src, const DataCopyParams& intriParams, const DataCopyEnhancedParams& enhancedParams)
    

For details about the supported data paths and data types of each prototype, see Supported Channels and Data Types.

Parameters

Table 1 Parameters in the template

| Parameter | Description |
|---|---|
| T, U | Operand data type. For details about the supported data types, see Supported Channels and Data Types. |

Table 2 Parameters

| Parameter | Input/Output | Meaning |
|---|---|---|
| dst | Output | Destination operand of the LocalTensor or GlobalTensor type. |
| src | Input | Source operand of the LocalTensor or GlobalTensor type. |
| intriParams | Input | Movement parameter. It is of the DataCopyParams type. |
| enhancedParams | Input | Enhanced information parameter. It is of the DataCopyEnhancedParams type. For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_data_copy.h. Replace ${INSTALL_DIR} with the actual CANN component directory. |

Table 3 Parameters in the DataCopyEnhancedParams structure

Field

Meaning

blockMode

Basic fractal for moving data. BlockMode enumeration type. The following configurations are supported:

  • BLOCK_MODE_NORMAL: The movement unit is 32 bytes. Currently, this option is not supported.
  • BLOCK_MODE_MATRIX: a Cube fractal whose movement unit is 16 x 16.
  • BLOCK_MODE_VECTOR: a Cube fractal whose movement unit is 1 x 16.
  • BLOCK_MODE_SMALL_CHANNEL: a Cube fractal whose movement unit is 16 x 4. Currently, this option is not supported.
  • BLOCK_MODE_DEPTHWISE: a Cube fractal whose movement unit is 16 x 16. This configuration provides the channel-split function. Currently, this option is not supported.

For details about the unit of parameters such as blockLen in each mode, see Table 4.

deqScale

Auxiliary parameter for on-the-fly precision conversion, that is, the quantization mode. For details about the supported quantization modes and corresponding data types, see Table 5. In DEQ, DEQ8, and DEQ16 modes, the quantization coefficient deqValue must be passed and its corresponding bits must be set. In VDEQ, VDEQ8, and VDEQ16 modes, a dequantization parameter vector containing 16 deqValue elements must be passed and deqTensorAddr must be set accordingly. In addition, each deqValue element of the dequantization parameter vector stored at DEQADDR must meet the expected format and usage restrictions.

In VDEQ mode, the dequantization parameter vector is 32 bytes long (16 half elements). In other modes, it is 128 bytes long (16 64-bit elements).

deqValue

Quantization coefficient. For details about how to configure deqValue, see deqValue configuration mode.

deqTensorAddr

Start address in the UB of the dequantization parameter vector. When deqScale is set to VDEQ, VDEQ8, or VDEQ16, this address must be passed, and it must be 32-byte aligned.

In VDEQ mode, this address points to a 32-byte dequantization parameter vector in which each element is 16 bits (half).

In VDEQ8 and VDEQ16 modes, each element of the dequantization parameter vector is 64 bits. A movement transfers blockCount data chunks, each of length blockLen, and each data chunk has its own 128-byte dequantization parameter vector. Within one data chunk, the 16 elements of the vector are reused repeatedly. Different data chunks use different vectors, stored 128 bytes apart. For example, if the start address is A, the 128-byte vector for the first data chunk starts at A and the 128-byte vector for the second data chunk starts at A + 128B.

The MCB flag must be the same for all elements of a given dequantization parameter vector.

sidStoreMode

Storage mode when deqScale is DEQ8 or VDEQ8. It controls how the dequantization result is stored in the dst address. For details about the configuration, see sidStoreMode configuration.

  • 0: The dst data is stored in the first half of each data block, that is, the upper 16 bytes of each 32-byte data block.
  • 1: The dst data is stored in the second half of each data block, that is, the lower 16 bytes of each 32-byte data block.
  • 2: The dst data is stored in a complete DataBlock, that is, the entire 32 bytes.

isRelu

Whether to apply ReLU (linear rectification) during data movement. When deqValue is configured, setting this parameter to true updates the ReLU flag of deqValue to 1; setting it to false leaves the ReLU flag unchanged. When deqTensorAddr is configured, the ReLU flag in the elements of the dequantization parameter vector does not take effect, and the value of isRelu is used instead.

If only isRelu is configured and no quantization parameters are configured (that is, deqScale is set to DEQ_NONE), the supported data type combinations of src and dst are {half, half}, {float, float}, {int32_t, int32_t}, and {float, half}. If both isRelu and quantization parameters are configured, see Table 5 for the supported data type combinations.

padMode

Reserved.
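The addressing rule described for deqTensorAddr (32-byte alignment, a 32-byte vector in VDEQ mode, and one 128-byte vector per data chunk in VDEQ8 and VDEQ16 modes) can be sketched on the host side as follows. This is only an illustration in plain C++, not Ascend C kernel code: the enum VecDeqMode and the helper names are hypothetical and do not exist in the library.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mode tag used only by this sketch (not the library's enum).
enum class VecDeqMode { VDEQ, VDEQ8, VDEQ16 };

// Size in bytes of one dequantization parameter vector:
// 16 half elements (32 B) in VDEQ mode, 16 64-bit elements (128 B) otherwise.
constexpr uint32_t DeqVectorBytes(VecDeqMode mode) {
    return mode == VecDeqMode::VDEQ ? 16u * 2u : 16u * 8u;
}

// deqTensorAddr must be 32-byte aligned.
constexpr bool IsAligned32(uint64_t addr) {
    return (addr % 32u) == 0;
}

// Start address of the vector used by data chunk `chunkIndex` (0-based)
// in VDEQ8/VDEQ16 mode: consecutive chunks are 128 bytes apart.
inline uint64_t ChunkDeqVectorAddr(uint64_t base, uint32_t chunkIndex) {
    assert(IsAligned32(base));
    return base + static_cast<uint64_t>(chunkIndex) * 128u;
}
```

With a start address A, ChunkDeqVectorAddr(A, 0) yields A and ChunkDeqVectorAddr(A, 1) yields A + 128, matching the example in the deqTensorAddr description.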

Table 4 Parameter units corresponding to different blockMode values

| blockMode | src | dst | Data type | blockLen unit | srcStride unit | dstStride unit |
|---|---|---|---|---|---|---|
| BLOCK_MODE_NORMAL | GM | A1 | int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double | 32B | 32B | 32B |
| BLOCK_MODE_NORMAL | GM | B1 | int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double | 32B | 32B | 32B |
| BLOCK_MODE_NORMAL | GM | VECIN | int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double | 32B | 32B | 32B |
| BLOCK_MODE_NORMAL | VECOUT | GM | int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double | 32B | 32B | 32B |
| BLOCK_MODE_NORMAL | VECIN | VECOUT | int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, half, bfloat16_t, float, double | 32B | 32B | 32B |
| BLOCK_MODE_MATRIX | CO1 | CO2 | half, int16_t, uint16_t | 512B | 512B | 32B |
| BLOCK_MODE_MATRIX | CO1 | CO2 | float, int32_t, uint32_t | 1024B | 1024B | 32B |
| BLOCK_MODE_VECTOR | CO1 | CO2 | half, int16_t, uint16_t | 32B | 512B | 32B |
| BLOCK_MODE_VECTOR | CO1 | CO2 | float, int32_t, uint32_t | 64B | 1024B | 32B |
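The CO1 -> CO2 units in Table 4 follow from the fractal shape and the element size: a 16 x 16 fractal of 2-byte elements is 512 bytes, and a 1 x 16 fractal of 4-byte elements is 64 bytes. The host-side sketch below reproduces that arithmetic. SketchBlockMode and the function names are hypothetical helpers for illustration and are not part of the API.

```cpp
#include <cstddef>

// Hypothetical stand-in for the two supported CO1 -> CO2 block modes.
enum class SketchBlockMode { Matrix, Vector };

// Bytes covered by one blockLen unit: a 16 x 16 fractal in MATRIX mode,
// a 1 x 16 fractal in VECTOR mode, multiplied by the element size.
constexpr std::size_t BlockLenUnitBytes(SketchBlockMode mode, std::size_t elemSize) {
    return (mode == SketchBlockMode::Matrix ? 16u * 16u : 1u * 16u) * elemSize;
}

// Per Table 4, the srcStride unit is a full 16 x 16 fractal in both modes.
constexpr std::size_t SrcStrideUnitBytes(std::size_t elemSize) {
    return 16u * 16u * elemSize;
}

// Per Table 4, the dstStride unit is always 32 bytes.
constexpr std::size_t kDstStrideUnitBytes = 32;

// Number of source bytes described by `blockLen` units.
constexpr std::size_t BurstBytes(SketchBlockMode mode, std::size_t elemSize, std::size_t blockLen) {
    return blockLen * BlockLenUnitBytes(mode, elemSize);
}
```

For example, 512 half elements occupy 1024 bytes, which corresponds to a blockLen of 2 in BLOCK_MODE_MATRIX under this arithmetic.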

Table 5 deqScale parameters

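| Quantization mode | src.dtype | dst.dtype | Parameters That Are Used Together |
|---|---|---|---|
| DEQ | int32_t | half | Variable M in deqValue |
| DEQ | half | half | Variable M in deqValue |
| DEQ8 | int32_t | int8_t | deqValue (Variable M, Variable N, MCB flag, Offset, Sign flag, ReLU flag); isRelu |
| DEQ8 | int32_t | uint8_t | deqValue (Variable M, Variable N, MCB flag, Offset, Sign flag, ReLU flag); isRelu |
| DEQ16 | int32_t | half | deqValue (Variable M, Variable N, MCB flag, ReLU flag); isRelu |
| DEQ16 | int32_t | int16_t | deqValue (Variable N, ReLU flag); isRelu |
| VDEQ | int32_t | half | deqTensorAddr (DEQADDR, ReLU flag); isRelu |
| VDEQ8 | int32_t | int8_t | deqTensorAddr (DEQADDR, ReLU flag); isRelu |
| VDEQ8 | int32_t | uint8_t | deqTensorAddr (DEQADDR, ReLU flag); isRelu |
| VDEQ16 | int32_t | half | deqTensorAddr (DEQADDR, ReLU flag); isRelu |
| VDEQ16 | int32_t | int16_t | deqTensorAddr (DEQADDR, ReLU flag); isRelu |

For VDEQ, VDEQ8, and VDEQ16, the parameters that can be configured for each deqValue element in the dequantization parameter vector stored at the deqTensorAddr address are the same as those described for DEQ, DEQ8, and DEQ16, respectively.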
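The mode and data type combinations in Table 5 can be encoded as a small lookup, which may be handy as a host-side sanity check before launching a kernel. The sketch below is a minimal illustration: the enums SketchDeqMode and SketchDType and both functions are hypothetical and are not the library's own types.

```cpp
#include <cassert>

// Hypothetical enums for this sketch only (not the library's definitions).
enum class SketchDeqMode { DEQ, DEQ8, DEQ16, VDEQ, VDEQ8, VDEQ16 };
enum class SketchDType { Int32, Half, Int8, UInt8, Int16 };

// Returns true if (mode, src, dst) appears as a row of Table 5.
inline bool IsSupportedDeqCombo(SketchDeqMode mode, SketchDType src, SketchDType dst) {
    switch (mode) {
        case SketchDeqMode::DEQ:
            return (src == SketchDType::Int32 || src == SketchDType::Half) &&
                   dst == SketchDType::Half;
        case SketchDeqMode::VDEQ:
            return src == SketchDType::Int32 && dst == SketchDType::Half;
        case SketchDeqMode::DEQ8:
        case SketchDeqMode::VDEQ8:
            return src == SketchDType::Int32 &&
                   (dst == SketchDType::Int8 || dst == SketchDType::UInt8);
        case SketchDeqMode::DEQ16:
        case SketchDeqMode::VDEQ16:
            return src == SketchDType::Int32 &&
                   (dst == SketchDType::Half || dst == SketchDType::Int16);
    }
    return false;
}

// Debug-time guard: aborts if the combination is not listed in Table 5.
inline void CheckDeqCombo(SketchDeqMode mode, SketchDType src, SketchDType dst) {
    assert(IsSupportedDeqCombo(mode, src, dst));
}
```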

Table 6 deqValue configuration mode

| Mode | Bits | Variable | Function |
|---|---|---|---|
| DEQ8, VDEQ8, DEQ16, and VDEQ16 | 0~31 | M | The 32-bit value is interpreted as the float type and is used as the multiplier for dequantization. Variable M does not take effect when the data types of src and dst are int32_t and int16_t, respectively. |
| DEQ8, VDEQ8, DEQ16, and VDEQ16 | 32~35 | N | 4 bits. The value range is [1, 16] (b'0000 indicates 1, and b'1111 indicates 16). When the mode is DEQ8 or VDEQ8 and the MCB flag is set to 1, the input value is shifted right by N bits. When the mode is DEQ16 or VDEQ16 and the dst data type is int16_t, the input value is shifted right by N bits regardless of the MCB flag. |
| DEQ8, VDEQ8, DEQ16, and VDEQ16 | 36 | MCB flag | Mode control bit. If it is set to 0, the input int32_t is directly converted to float. If it is set to 1, the input int32_t is shifted right by N bits to convert it to int16_t, and is then converted to float. |
| DEQ8, VDEQ8, DEQ16, and VDEQ16 | 37~45 | Offset | 9-bit integer. The offset is added to the dequantized result of src x M. It is used only in DEQ8 and VDEQ8 modes. If the offset is not used, set it to 0. |
| DEQ8, VDEQ8, DEQ16, and VDEQ16 | 46 | Sign flag | If it is set to 1, the dequantization result is signed (int8). If it is set to 0, the dequantization result is unsigned (uint8). It is used only in DEQ8 and VDEQ8 modes. |
| DEQ8, VDEQ8, DEQ16, and VDEQ16 | 47 | ReLU flag | If it is set to 1, ReLU computation is performed on the final result. If it is set to 0, no extra computation is performed. For int32_t -> int8_t, offset must be set to -128 when ReLU is configured. For int32_t -> uint8_t, offset must be set to 0 when ReLU is configured. |
| DEQ8, VDEQ8, DEQ16, and VDEQ16 | 48~63 | - | Reserved |
| DEQ and VDEQ | 0~15 | M | The 16-bit value is interpreted as the half type and is used as the multiplier for dequantization. |
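The 64-bit layout that Table 6 gives for the DEQ8, VDEQ8, DEQ16, and VDEQ16 modes can be assembled with ordinary shifts and masks. The sketch below is a hedged illustration, not the library's own packing routine: PackDeqValue is a hypothetical helper, N is stored as N - 1 (following "b'0000 indicates 1"), the 9-bit offset is assumed to be two's complement (so that -128 is representable), and float is assumed to be 32-bit IEEE 754.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes a 32-bit float");

// Hypothetical helper: builds a 64-bit deqValue following the Table 6 layout.
inline uint64_t PackDeqValue(float m, unsigned n, bool mcb, int offset,
                             bool isSigned, bool relu) {
    assert(n >= 1 && n <= 16);                // N range from Table 6
    assert(offset >= -256 && offset <= 255);  // assumption: 9-bit two's complement

    uint32_t mBits = 0;
    std::memcpy(&mBits, &m, sizeof(mBits));   // reinterpret float M as raw bits

    uint64_t value = mBits;                                   // bits 0~31: M
    value |= static_cast<uint64_t>(n - 1) << 32;              // bits 32~35: N (0 means 1)
    value |= static_cast<uint64_t>(mcb) << 36;                // bit 36: MCB flag
    value |= (static_cast<uint64_t>(offset) & 0x1FFu) << 37;  // bits 37~45: Offset
    value |= static_cast<uint64_t>(isSigned) << 46;           // bit 46: Sign flag
    value |= static_cast<uint64_t>(relu) << 47;               // bit 47: ReLU flag
    return value;                                             // bits 48~63 stay 0 (reserved)
}
```

For an int32_t -> int8_t conversion with ReLU, a call such as PackDeqValue(scale, 1, false, -128, true, true) follows the offset rule stated for the ReLU flag; how the hardware interprets the packed value remains governed by the table itself.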

Figure 1 sidStoreMode configuration

Returns

None

Restrictions

  • Ensure that the isRelu setting in DataCopyEnhancedParams matches the ReLU flag of the quantization coefficient deqValue or of the elements in the dequantization parameter vector at deqTensorAddr.
  • If precision conversion is performed during CO1 -> CO2 data movement, the blockLen unit of the UB-side operand must be halved.
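The first restriction can be checked on the host before the parameters are handed to the kernel. The following sketch assumes the 64-bit deqValue layout in which bit 47 is the ReLU flag (the DEQ8, VDEQ8, DEQ16, and VDEQ16 layout); the constant and function names are hypothetical and are not part of the API.

```cpp
#include <cstdint>

// Bit 47 of a 64-bit deqValue holds the ReLU flag (DEQ8/DEQ16-style layout).
constexpr uint64_t kReluFlagBit = 1ULL << 47;

// Reads the ReLU flag from a packed deqValue.
constexpr bool ReluFlagOf(uint64_t deqValue) {
    return (deqValue & kReluFlagBit) != 0;
}

// True when isRelu and the ReLU flag of deqValue agree, as the restriction requires.
constexpr bool IsReluConsistent(uint64_t deqValue, bool isRelu) {
    return ReluFlagOf(deqValue) == isRelu;
}

// Returns a copy of deqValue whose ReLU flag is forced to match isRelu.
constexpr uint64_t WithReluFlag(uint64_t deqValue, bool isRelu) {
    return isRelu ? (deqValue | kReluFlagBit) : (deqValue & ~kReluFlagBit);
}
```

One possible usage is to call WithReluFlag on every element of a dequantization parameter vector so that all elements agree with isRelu before the vector is copied to the UB.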

Supported Channels and Data Types

The following data channels are expressed by using the logical position TPosition and the corresponding physical channels are also specified. For details about the mapping between TPosition and physical memory, see Table 1.

Table 7 Specific channels and supported data types for Local Memory -> Local Memory

| Model | Datapath | Data Types of the Source and Destination Operands (Same) |
|---|---|---|
| Atlas inference product's AI Core | CO1 -> CO2 (L0C Buffer -> UB) | half, float, int32_t, uint32_t |

Table 8 Specific channels and supported data types for Local Memory -> Local Memory (The source and destination operands can have different data types.)

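| Model | Datapath | Source Operand | Destination Operand |
|---|---|---|---|
| Atlas inference product's AI Core | CO1 -> CO2 (L0C Buffer -> UB) | float | half |
| Atlas inference product's AI Core | CO1 -> CO2 (L0C Buffer -> UB) | int32_t | int8_t, uint8_t, int16_t, half |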
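Because the prototypes are templates on T and U, the combinations in Tables 7 and 8 can also be expressed as a compile-time trait, so that an unsupported pair is rejected before the kernel is built. This is a minimal sketch under stated assumptions: HalfStandIn is a hypothetical placeholder for the device half type (which is not available in standard C++), and IsCo1ToCo2Supported is not part of the library.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for the device `half` type, used only on the host.
struct HalfStandIn { uint16_t bits; };

// Primary template: a (Src, Dst) pair is unsupported unless listed below.
template <typename Src, typename Dst>
struct IsCo1ToCo2Supported : std::false_type {};

// Table 7: source and destination types are the same.
template <> struct IsCo1ToCo2Supported<HalfStandIn, HalfStandIn> : std::true_type {};
template <> struct IsCo1ToCo2Supported<float, float> : std::true_type {};
template <> struct IsCo1ToCo2Supported<int32_t, int32_t> : std::true_type {};
template <> struct IsCo1ToCo2Supported<uint32_t, uint32_t> : std::true_type {};

// Table 8: source and destination types differ.
template <> struct IsCo1ToCo2Supported<float, HalfStandIn> : std::true_type {};
template <> struct IsCo1ToCo2Supported<int32_t, int8_t> : std::true_type {};
template <> struct IsCo1ToCo2Supported<int32_t, uint8_t> : std::true_type {};
template <> struct IsCo1ToCo2Supported<int32_t, int16_t> : std::true_type {};
template <> struct IsCo1ToCo2Supported<int32_t, HalfStandIn> : std::true_type {};
```

A wrapper around the copy call could then carry a static_assert on IsCo1ToCo2Supported<U, T>::value, where U is the source type and T the destination type, to fail early on pairs that neither table lists.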

Table 9 Channels supported when enhancedParams does not take effect. In this case, the API can be called, but enhanced data movement does not take effect. The function is equivalent to that of basic data movement.

| Model | Datapath |
|---|---|
| Atlas training products | GM -> VECIN; GM -> A1, B1; VECIN -> VECCALC or VECCALC -> VECOUT; VECOUT -> GM |
| Atlas inference product's AI Core | GM -> VECIN; GM -> A1, B1; VECIN -> VECCALC or VECCALC -> VECOUT; VECIN, VECCALC, VECOUT -> A1, B1; VECOUT, CO2 -> GM |
| Atlas inference product's Vector Core | GM -> VECIN; VECOUT -> GM |
| Atlas A2 training products/Atlas A2 inference products | GM -> VECIN; GM -> A1, B1; VECIN -> VECCALC or VECCALC -> VECOUT; VECIN, VECCALC, VECOUT -> TSCM; VECOUT -> GM; A1, B1 -> GM |
| Atlas A3 training products/Atlas A3 inference products | GM -> VECIN; GM -> A1, B1; VECIN -> VECCALC or VECCALC -> VECOUT; VECIN, VECCALC, VECOUT -> TSCM; VECOUT -> GM; A1, B1 -> GM |
| Atlas 200I/500 A2 inference products | GM -> VECIN; VECOUT -> GM |

Examples

AscendC::TPipe pipe;
AscendC::TQue<AscendC::TPosition::CO1, 1> inQueueSrc;
AscendC::TQue<AscendC::TPosition::CO2, 1> outQueueDst;
...
// Allocate the source tensor in CO1 (L0C Buffer) and the destination tensor in CO2 (UB).
AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
AscendC::DataCopyParams intriParams;
AscendC::DataCopyEnhancedParams enhancedParams;
// Move data in units of 16 x 16 Cube fractals.
enhancedParams.blockMode = AscendC::BlockMode::BLOCK_MODE_MATRIX;
AscendC::DataCopy(dstLocal, srcLocal, intriParams, enhancedParams);
...
Result example:
Input (srcLocal): [1 2 3 ... 512]
Output (dstLocal): [1 2 3 ... 512]