ND2NZ transfer with channel conversion
Applicability
Product |
Supported/Unsupported Global Memory -> Local Memory |
Supported/Unsupported Local Memory -> Local Memory |
|---|---|---|
√ |
√ |
|
√ |
√ |
|
x |
x |
|
√ |
x |
|
x |
x |
|
x |
x |
Functions
Supports conversion from the ND format to the NZ format during data movement.
Prototype
- Global Memory -> Local Memory
1 2
template <typename T> __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const GlobalTensor<T>& src, const Nd2NzParams& intriParams)
- Local Memory -> Local Memory
1 2
template <typename T> __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const LocalTensor<T>& src, const Nd2NzParams& intriParams)
For details about the supported data paths and data types of each prototype, see Supported Channels and Data Types.
Parameters
Parameter |
Description |
|---|---|
T |
Data type of the source operand and destination operand. For details about the supported data types, see Supported Channels and Data Types. |
Parameter |
Input/Output |
Meaning |
|---|---|---|
dst |
Output |
Destination operand of the LocalTensor type. |
src |
Input |
Source operand of the LocalTensor or GlobalTensor type. |
intriParams |
Input |
Transfer parameter of the Nd2NzParams type. For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_data_copy.h. Replace ${INSTALL_DIR} with the actual CANN component directory. |
Field |
Meaning |
|---|---|
ndNum |
Number of ND matrices to be transferred. The value range is [0, 4095]. |
nValue |
Number of rows in the ND matrix. The value range is [0, 16384]. |
dValue |
Number of columns in the ND matrix. Value range: dValue ∈ [0, 65535]. |
srcNdMatrixStride |
Offset between the start addresses of adjacent ND matrices of the source operand. Value range: srcNdMatrixStride ∈ [0, 65535]. The unit is element. |
srcDValue |
Offset between the start addresses of adjacent rows in the same ND matrix of the source operand. Value range: srcDValue ∈ [1, 65535]. The unit is element. |
dstNzC0Stride |
After the ND format is converted to the NZ format, one row in the source operand is converted to multiple rows in the destination operand. dstNzC0Stride indicates the offset between the start addresses of adjacent rows of multiple lines of data from the same row of the source operand in the destination NZ matrix. dstNzC0Stride ∈ [1, 16384]. Unit: C0_SIZE (32 bytes). |
dstNzNStride |
Offset between the start addresses of adjacent rows in the destination NZ matrix in Z format. dstNzNStride ∈ [1, 16384]. Unit: C0_SIZE (32 bytes). |
dstNzMatrixStride |
Offset between the start addresses of adjacent NZ matrices in the destination NZ matrix. Value range: dstNzMatrixStride ∈ [1, 65535]. The unit is element. |
The following figure shows the ND2NZ conversion. The parameter settings in the example are described as follows:
- ndNum = 2 indicates that the number of ND matrices to be transferred is 2 (ND matrix 1 is A1 to A2 + B1 to B2, and ND matrix 2 is C1 to C2 + D1 to D2).
- nValue = 2 indicates the number of rows in the ND matrix, that is, the height of the matrix is 2.
- dValue = 24 indicates the number of columns in the ND matrix, that is, the width of the matrix is 24 elements. If dValue is not 32-byte aligned, the insufficient part in the destination operand is padded with 0s. For example, the blank part of the data block where A2 is located in the figure is padded with 0s.
- srcNdMatrixStride = 144 indicates the offset between the start addresses of adjacent ND matrices, that is, the distance between A1 and C1. The distance is 9 DataBlocks, that is, 9 x 16 = 144 elements.
- srcDValue = 48 indicates that the number of elements in a row, that is, the distance between A1 and B1, is 3 data blocks or 48 elements (3 × 16).
- dstNzC0Stride = 11. After ND is converted to the NZ format, a row in the source operand is converted to multiple rows in the destination operand. For example, A1 and A2 in src occupy one row, while A1 and A2 in dst occupy two rows. The offset between the start addresses of multiple lines of data is the offset of A1 and A2 in dst. The offset is 11 data blocks.
- dstNzNStride = 2 indicates the offset of the xth row and (x+1)th row in the source ND matrix in the destination operand after the matrix is converted to the NZ format. That is, the offset of A1 and B1 in the destination operand is two DataBlocks.
- dstNzMatrixStride = 96 indicates the offset between the start point of the xth ND matrix and that of the (x+1)th ND matrix in the destination operand. That is, the distance between A1 and C1 is six DataBlocks, and the number of elements is 96 (6 x 16).
Returns
None
Restrictions
For
Supported Channels and Data Types
The following data channels are expressed by the logical position TPosition, and the corresponding physical channels are also specified. For details about the mapping between TPosition and physical memory, see Table 1.
Internal Model |
Datapath |
Data Types of the Source and Destination Operands (Same) |
|---|---|---|
|
int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, float |
|
|
int16_t, uint16_t, int32_t, uint32_t, half, float |
|
|
int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float |
|
|
int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float |
Internal Model |
Datapath |
Data Types of the Source and Destination Operands (Same) |
|---|---|---|
VECIN, VECCALC, VECOUT -> TSCM (UB -> L1 Buffer) |
int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float |
|
VECIN, VECCALC, VECOUT -> TSCM (UB -> L1 Buffer) |
int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float |
Examples
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 | #include "kernel_operator.h" class KernelDataCopyGm2UbNd2Nz{ public: __aicore__ inline KernelDataCopyGm2UbNd2Nz() {} __aicore__ inline void Init(__gm__ uint8_t* dstGm, __gm__ uint8_t* srcGm) { AscendC::Nd2NzParams intriParamsIn{1, 32, 32, 0, 32, 32, 1, 0}; intriParams = intriParamsIn; srcGlobal.SetGlobalBuffer((__gm__ half *)srcGm); dstGlobal.SetGlobalBuffer((__gm__ half *)dstGm); pipe.InitBuffer(inQueueSrcVecIn, 1, intriParams.nValue * intriParams.dValue * sizeof(half)); pipe.InitBuffer(inQueueSrcVecOut, 1, intriParams.nValue * intriParams.dValue * sizeof(half)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<half> srcLocal = inQueueSrcVecIn.AllocTensor<half>(); AscendC::DataCopy(srcLocal, srcGlobal, intriParams); inQueueSrcVecIn.EnQue(srcLocal); } __aicore__ inline void Compute() { AscendC::LocalTensor<half> srcLocal = inQueueSrcVecIn.DeQue<half>(); AscendC::LocalTensor<half> dstLocal = inQueueSrcVecOut.AllocTensor<half>(); AscendC::DataCopy(dstLocal, srcLocal, intriParams.nValue * intriParams.dValue); inQueueSrcVecOut.EnQue(dstLocal); inQueueSrcVecIn.FreeTensor(srcLocal); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<half> dstLocal = inQueueSrcVecOut.DeQue<half>(); AscendC::DataCopy(dstGlobal, dstLocal, intriParams.nValue * intriParams.dValue); inQueueSrcVecOut.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrcVecIn; AscendC::TQue<AscendC::TPosition::VECOUT, 1> inQueueSrcVecOut; AscendC::GlobalTensor<half> srcGlobal; AscendC::GlobalTensor<half> dstGlobal; AscendC::Nd2NzParams intriParams; }; extern "C" __global__ __aicore__ void kernel_data_copy_nd2nz_ub2out(__gm__ uint8_t* src_gm, __gm__ uint8_t* dst_gm) { KernelDataCopyGm2UbNd2Nz op; op.Init(dst_gm, src_gm); op.Process(); } |
Result example:
Input (srcGlobal): [1 2 3 ... 1024] Output data (dstGlobal):[1 2 ... 15 16 33 34 ... 47 48 65 66 ... 79 80 97 98 ... 111 112 ... 1009 1010... 1023 1024]
