DataCopyPad(ISASI)
Function Usage
Enables data non-aligned movement. The supported channels are as follows:
GM->VECIN/VECOUT
VECIN/VECOUT->GM
VECIN/VECOUT->TSCM
During data movement from GM to VECIN/VECOUT, you can pad the data as required.
Prototype
- dataCopyParams is of the DataCopyExtParams type. It supports a larger value range for parameters such as the operand stride compared to the DataCopyParams type.
- Path: GM -> VECIN/VECOUT
1 2
template <typename T> __aicore__ inline void DataCopyPad(const LocalTensor<T> &dstLocal, const GlobalTensor<T> &srcGlobal, const DataCopyExtParams &dataCopyParams, const DataCopyPadExtParams<T> &padParams)
- Path: VECIN/VECOUT -> GM
1 2
template <typename T> __aicore__ inline void DataCopyPad(const GlobalTensor<T> &dstGlobal, const LocalTensor<T> &srcLocal, const DataCopyExtParams &dataCopyParams)
- Channel: VECIN/VECOUT-> TSCM. The actual channel is VECIN/VECOUT->GM->TSCM.
The
Atlas 200I/500 A2 inference products does not support this API prototype.1 2
template <typename T> __aicore__ inline void DataCopyPad(const LocalTensor<T> &dstLocal, const LocalTensor<T> &srcLocal, const DataCopyExtParams &dataCopyParams, const Nd2NzParams &nd2nzParams)
- Path: GM -> VECIN/VECOUT
- dataCopyParams is of the DataCopyParams type.
- Path: GM -> VECIN/VECOUT
1 2
template<typename T> __aicore__ inline void DataCopyPad(const LocalTensor<T>& dstLocal, const GlobalTensor<T>& srcGlobal, const DataCopyParams& dataCopyParams, const DataCopyPadParams& padParams)
- Path: VECIN/VECOUT -> GM
1 2
template<typename T> __aicore__ inline void DataCopyPad(const GlobalTensor<T>& dstGlobal, const LocalTensor<T>& srcLocal,const DataCopyParams& dataCopyParams)
- Channel: VECIN/VECOUT-> TSCM. The actual channel is VECIN/VECOUT->GM->TSCM.
The
Atlas 200I/500 A2 inference products does not support this API prototype.1 2
template<typename T> __aicore__ inline void DataCopyPad(const LocalTensor<T>& dstLocal, const LocalTensor<T>& srcLocal, const DataCopyParams& dataCopyParams, const Nd2NzParams& nd2nzParams)
- Path: GM -> VECIN/VECOUT
Parameters
|
Parameter |
Description |
|---|---|
|
T |
Data type of the operand and paddingValue (data value to be padded). For the For the For the |
|
Parameter |
Input/Output |
Description |
||||
|---|---|---|---|---|---|---|
|
dstLocal/dstGlobal |
Output |
Destination operand of type LocalTensor or GlobalTensor. The start address of LocalTensor must be 32-byte aligned. The start address of GlobalTensor does not require alignment. |
||||
|
srcLocal/srcGlobal |
Input |
Source operand of type LocalTensor or GlobalTensor. The start address of LocalTensor must be 32-byte aligned. The start address of GlobalTensor does not require alignment. |
||||
|
dataCopyParams |
Input |
Movement parameter.
|
||||
|
padParams |
Input |
When moving data from GM to VECIN/VECOUT, pad data on the left or right of the data to be moved as required. padParams is used to control data padding. It is of the DataCopyPadExtParams type and is defined as follows. For details about the parameters, see Table 5.
|
||||
|
nd2nzParams |
Input |
When moving data from VECIN/VECOUT to TSCM, you can convert the data format from ND to NZ. nd2nzParams is used to control data format conversion. It is of the Nd2NzParams type and is defined as follows. For details about the parameters, see Table 7.
Note: ndNum of Nd2NzParams can only be set to 1. |
|
Parameter |
Meaning |
|---|---|
|
blockCount |
Number of data chunks. The data type is uint16_t, and blockCount ∈ [1, 4095]. |
|
blockLen |
Length of each data chunk, in byte, which supports non-aligned movement. The length unit of each data chunk is byte. The data type is uint32_t, and blockLen ∈ [1, 2097151]. |
|
srcStride |
Source operand, that is, the interval between adjacent data chunks (the interval between the tail of the previous data chunk and the head of the next data chunk). If the logical position of the source operand is VECIN/VECOUT, the unit is data block (32 bytes). If the logical position of the source operand is GM, the unit is byte. The data type is uint32_t. The value of srcStride cannot exceed the value range of this data type. |
|
dstStride |
Destination operand, that is, the interval between adjacent data chunks (the interval between the tail of the previous data chunk and the head of the next data chunk). If the logical position of the destination operand is VECIN/VECOUT, the unit is data block (32 bytes). If the logical position of the destination operand is GM, the unit is byte. The data type is uint32_t. The value of dstStride cannot exceed the value range of this data type. |
|
rsv |
Reserved |
|
Parameter |
Meaning |
|---|---|
|
blockCount |
Number of data chunks. The data type is uint16_t, and blockCount ∈ [1, 4095]. |
|
blockLen |
Length of each data chunk, in byte, which supports non-aligned movement. The length unit of each data chunk is byte. The data type is uint16_t. The value of blockLen cannot exceed the value range of this data type. |
|
srcStride |
Source operand, that is, the interval between adjacent data chunks (the interval between the tail of the previous data chunk and the head of the next data chunk). If the logical position of the source operand is VECIN/VECOUT, the unit is data block (32 bytes). If the logical position of the source operand is GM, the unit is byte. The data type is uint16_t. The value of srcStride cannot exceed the value range of this data type. |
|
dstStride |
Destination operand, that is, the interval between adjacent data chunks (the interval between the tail of the previous data chunk and the head of the next data chunk). If the logical position of the destination operand is VECIN/VECOUT, the unit is data block (32 bytes). If the logical position of the destination operand is GM, the unit is byte. The data type is uint16_t. The value of dstStride cannot exceed the value range of this data type. |
|
Parameter |
Meaning |
|---|---|
|
isPad |
Whether to pad custom data. The options are as follows: true: Set a padding value. false: No padding value is specified, and a random value is used. |
|
leftPadding |
Range of the data to be padded on the left of the data chunk. The unit is the number of elements. The values of leftPadding and rightPadding cannot exceed 32 bytes. |
|
rightPadding |
Range of the data to be padded on the right of the data chunk. The unit is the number of elements. The values of leftPadding and rightPadding cannot exceed 32 bytes. |
|
paddingValue |
Values of the data to be padded on the left and right. The values must be within the byte range occupied by the data. The data type must be the same as that of the source operand. The data type is T. When the data type length is 64 bits, this parameter can only be set to 0. |
- GM -> VECIN/VECOUT
- Description
- When blockLen + leftPadding + rightPadding meets the 32-byte alignment requirement, if isPad is false, the values padded on the left and right sides are random numbers by default; if not, the values padded on the left and right sides are configured padding values.
- When blockLen + leftPadding + rightPadding does not meet the 32-byte alignment requirement, the framework pads some dummy data to ensure the 32-byte alignment. When both leftPadding and rightPadding are 0, the first element value of the data chunk to be transferred is padded by default. When leftPadding or rightPadding is not 0, if isPad is set to false, the data values padded on the left and right sides and dummy values are random numbers; if not, the data values padded on the left and right sides and dummy values are configured padding values.
- Configuration example 1
- The value of blockLen is 64, meaning each data chunk contains 64 bytes. The value of srcStride is 1. Because the logical position of the source operand is GM and the unit of srcStride is byte, there is 1-byte interval between adjacent data chunks of the source operand. The value of dstStride is 1. Because the logical position of the destination operand is VECIN/VECOUT and the unit of dstStride is a 32-byte data block, there is an interval of one data block between adjacent data chunks of the destination operand.
- When blockLen + leftPadding + rightPadding meets the 32-byte alignment requirement, if isPad is set to false, the data values padded on the left and right sides are random numbers by default; if not, the data values padded on the left and right sides are configured padding values. In this example, both leftPadding and rightPadding are 0, indicating that padding is not performed.
- When blockLen + leftPadding + rightPadding does not meet the 32-byte alignment requirement, the framework pads some dummy data to ensure the 32-byte alignment. When leftPadding or rightPadding is not 0, if isPad is set to false, the data values padded on the left and right sides and dummy values are random numbers; if not, the data values padded on the left and right sides are configured padding values.

- Configuration example 2
- The value of blockLen is 47, meaning each data chunk contains 47 bytes. The value of srcStride is 1, indicating that there is 1-byte interval between adjacent data chunks of the source operand. The value of dstStride is 1, indicating that there is an interval of one data block between adjacent data chunks of the destination operand.
- When blockLen + leftPadding + rightPadding does not meet the 32-byte alignment requirement, and both leftPadding and rightPadding are 0, the first element value of the data chunk to be transferred is padded with dummy values by default.
- When blockLen + leftPadding + rightPadding does not meet the 32-byte alignment requirement, and leftPadding or rightPadding is not 0, if isPad is false, the data values padded on the left and right sides and dummy values are random numbers; if not, the data values padded on the left and right sides and dummy values are configured padding values.

- Description
- VECIN/VECOUT -> GM
- When blockLen of each data chunk is 32-byte aligned, an example of DataCopyParams to be input is shown in the following figure. The value of blockLen is 64, meaning each data chunk contains 64 bytes. The value of srcStride is 1. Because the logical position of the source operand is VECIN/VECOUT and the unit of srcStride is a 32-byte data block, there is one data block between adjacent data blocks of the source operand. The value of dstStride is 1. Because the logical position of the destination operand is GM and the unit of dstStride is byte, there is 1-byte interval between adjacent data blocks of the destination operand.

- requires 32-byte alignment, but blockLen of each data chunk does not meet the requirement. In this case, when the data chunk is moved out, the framework automatically pads some dummy data to ensure alignment. When the data chunk is moved to GM, the dummy data is automatically deprecated. The following figure shows an example of DataCopyParams to be input in this scenario and the principle of padding dummy data. The value of blockLen is 47, meaning each data chunk contains 47 bytes, which is not 32-byte aligned. The value of srcStride is 1, indicating that there is an interval of one data block between adjacent data blocks of the source operand. The value of dstStride is 1, indicating that there is a 1-byte interval between adjacent data blocks of the destination operand. When data is moved out, the framework automatically pads 17-byte dummy data to ensure alignment. When data is moved to GM, the padded dummy data is automatically deprecated.

- When blockLen of each data chunk is 32-byte aligned, an example of DataCopyParams to be input is shown in the following figure. The value of blockLen is 64, meaning each data chunk contains 64 bytes. The value of srcStride is 1. Because the logical position of the source operand is VECIN/VECOUT and the unit of srcStride is a 32-byte data block, there is one data block between adjacent data blocks of the source operand. The value of dstStride is 1. Because the logical position of the destination operand is GM and the unit of dstStride is byte, there is 1-byte interval between adjacent data blocks of the destination operand.
- VECIN/VECOUT->TSCM
Note: The internal implementation involves the communication between the AIC and AIV, and the actual transfer path is VECIN/VECOUT -> GM -> TSCM. Sending communication messages causes overheads and affects the performance.
Figure 1 VECIN/VECOUT -> TSCM data transfer illustrates the process of transferring data from VECIN/VECOUT to GM and then to TSCM. In the example, the data type is half, with a single data block (32 bytes) containing 16 half elements, and the source operands A1 to A6, B1 to B6, and C1 to C6 indicate the data to be transferred.
- During the transfer from VECIN/VECOUT to GM, the data storage format remains unchanged as ND.
- blockCount indicates the number of data chunks. Set it to 3.
- blockLen indicates the size (in bytes) of a data chunk. Set it to 192 (which is 6 x 32).
- srcStride indicates the interval between adjacent data chunks of the source operand (the interval between the tail of the previous data chunk and the head of the next data chunk). The logical position of the source operand is VECIN/VECOUT, the unit is data block, and two data chunks (A1 to A6 and B1 to B6) are separated by one A7. Therefore, set srtStride to 1.
- dstStride indicates the interval between adjacent data chunks of the destination operand (the interval between the tail of the previous data chunk and the head of the next data chunk). The logical position of the destination operand is GM, the unit is byte, and two data chunks (A1 to A6 and B1 to B6) are separated by two empty data blocks. Therefore, set dstStride to 64Byte.
- During the transfer from GM to TSCM, the data storage format is converted from ND to NZ.
- ndNum is fixed to 1. That is, A1 to A6, B1 to B6, and C1 to C6 are considered as an entire ND matrix.
- nValue indicates the number of rows in the ND matrix, that is, 3.
- dValue indicates the number of elements in a row of the ND matrix, that is, 96 (which is 6 x 16).
- srcNdMatrixStride indicates the distance between adjacent ND matrices. Because only one ND matrix is involved, srcNdMatrixStride can be set to 0.
- srcDValue indicates a quantity of elements between the xth row and the (x+1)th row of the ND matrix. For example, a distance between A1 and B1 is eight data blocks, and there are 128 (which is 8 x 16) elements.
- dstNzC0Stride indicates the number of data blocks between adjacent data blocks in a same row of src in the NZ matrix, for example, a distance between A1 and A2, that is, seven data blocks (A1 + blank + B1 + blank + C1 + blank x 2).
- dstNzNStride indicates the number of data blocks between adjacent rows of src in the NZ matrix, for example, the distance between A1 and B1, that is, two data blocks (A1 + blank).
- dstNzMatrixStride indicates the number of elements between adjacent NZ matrices. Because only one NZ matrix is involved, dstNzMatrixStride can be set to 1.
- During the transfer from VECIN/VECOUT to GM, the data storage format remains unchanged as ND.
Returns
None
Availability
Constraints
The values of leftPadding and rightPadding cannot exceed 32 bytes.
Example
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 |
#include "kernel_operator.h" class TestDataCopyPad { public: __aicore__ inline TestDataCopyPad() {} __aicore__ inline void Init(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm) { srcGlobal.SetGlobalBuffer((__gm__ half *)srcGm); dstGlobal.SetGlobalBuffer((__gm__ half *)dstGm); pipe.InitBuffer(inQueueSrc, 1, 32 * sizeof(half)); pipe.InitBuffer(outQueueDst, 1, 32 * sizeof(half)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>(); AscendC::DataCopyExtParams copyParams{1, 20 * sizeof(half), 0, 0, 0}; // The last parameter of DataCopyExtParams is reserved. AscendC::DataCopyPadExtParams<half> padParams{true, 0, 2, 0}; AscendC::DataCopyPad(srcLocal, srcGlobal, copyParams, padParams); // Move 40 bytes from GM to VECIN. inQueueSrc.EnQue<half>(srcLocal); } __aicore__ inline void Compute() { AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>(); AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>(); AscendC::Adds(dstLocal, srcLocal, scalar, 20); outQueueDst.EnQue(dstLocal); inQueueSrc.FreeTensor(srcLocal); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>(); AscendC::DataCopyExtParams copyParams{1, 20 * sizeof(half), 0, 0, 0}; AscendC::DataCopyPad(dstGlobal, dstLocal, copyParams); // Move 40 bytes from VECIN to GM. outQueueDst.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc; AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst; AscendC::GlobalTensor<half> srcGlobal; AscendC::GlobalTensor<half> dstGlobal; AscendC::DataCopyPadExtParams<half> padParams; AscendC::DataCopyExtParams copyParams; half scalar = 0; }; extern "C" __global__ __aicore__ void kernel_data_copy_pad_kernel(__gm__ uint8_t* src_gm, __gm__ uint8_t* dst_gm) { TestDataCopyPad op; op.Init(src_gm, dst_gm); op.Process(); } |
Input data (src0Global): [1 2 3 ... 32] Output data (dstGlobal):[1 2 3 ... 20]
