NZ2ND transfer with channel conversion
Applicability
|
Product |
Supported/Unsupported |
|---|---|
|
|
√ |
|
|
√ |
|
|
x |
|
|
√ |
|
|
x |
|
|
x |
Functions
Supports conversion from the NZ format to the ND format during data movement.
Prototype
1 2 |
template <typename T> __aicore__ inline void DataCopy(const GlobalTensor<T>& dst, const LocalTensor<T>& src, const Nz2NdParamsFull& intriParams) |
For details about the supported data paths and data types of each prototype, see Supported Channels and Data Types.
Parameters
|
Parameter |
Description |
|---|---|
|
T |
Data type of the source operand and destination operand. For details about the supported data types, see Supported Channels and Data Types. |
|
Parameter |
Input/Output |
Meaning |
|---|---|---|
|
dst |
Output |
|
|
src |
Input |
Source operand of the LocalTensor type. |
|
intriParams |
Input |
Transfer parameter. The type is Nz2NdParamsFull. For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_data_copy.h. Replace ${INSTALL_DIR} with the actual CANN component directory. |
|
Field |
Meaning |
|---|---|
|
ndNum |
Number of NZ matrices to be transferred. The value range is [0, 4095]. |
|
nValue |
Number of rows in the NZ matrix. The value range is [1, 8192]. |
|
dValue |
Number of columns in the NZ matrix. The value range is [1, 8192]. The value of dValue must be a multiple of 16. |
|
srcNdMatrixStride |
Head-to-head offset between adjacent source NZ matrices. srcNdMatrixStride ∈ [1, 512]. The unit is 256 (16 x 16) elements. |
|
srcNStride |
Offset of the adjacent Zs in the source NZ matrix (head to head). The value range is [0, 4096], in the unit of 16 elements. |
|
dstDStride |
Offset of the adjacent rows in the destination ND matrix (head to head). The value range is [1, 65535], in the unit of element. |
|
dstNdMatrixStride |
Offset of the source adjacent NZ matrix in the destination ND matrix (head to head). The value range is [1, 65535], in the unit of element. |
The following figure shows the NZ2ND conversion of the half type. The parameter values and descriptions are as follows:
- ndNum = 2 indicates that there are two source NZ matrices (NZ matrix 1 is A1 to A4 + B1 to B4, and NZ matrix 2 is C1 to C4 + D1 to D4).
- nValue = 4 indicates the number of rows in the NZ matrix, that is, the height of the matrix is 4.
- dValue = 32 indicates the number of columns in the NZ matrix, that is, the width of the matrix is 32 elements.
- srcNdMatrixStride = 1 indicates the offset between the start addresses of adjacent NZ matrices, that is, the distance between A1 and C1, which is 256 elements (16 DataBlocks x 16 elements).
- srcNStride = 4 indicates the offset between adjacent Zs in the same source NZ matrix, that is, the distance between A1 and B1, which is 64 elements (4 DataBlocks x 16 elements).
- dstDStride = 160, indicating the offset between adjacent rows of a destination ND matrix, that is, the distance between A1 and A2, which is 10 DataBlocks, that is, 10 x 16 = 160 elements.
- dstNdMatrixStride = 48 indicates that the offset between the start point of the xth destination ND matrix and the start point of the (x + 1)th destination ND matrix in dst, that is, the distance between A1 and C1, is 48 elements (3 data blocks x 16 elements per block).
The following figure shows the NZ2ND conversion of the float type. The parameter values and descriptions are as follows:
- ndNum = 2 indicates that there are two source NZ matrices (NZ matrix 1 is A1 to A8 + B1 to B8, and NZ matrix 2 is C1 to C8 + D1 to D8).
- nValue = 4 indicates the number of rows in the NZ matrix, that is, the height of the matrix is 4.
- dValue = 32 indicates the number of columns in the NZ matrix, that is, the width of the matrix is 32 elements.
- srcNdMatrixStride = 1 indicates the offset between the start addresses of adjacent NZ matrices, that is, the distance between A1 and C1, which is 256 elements (32 DataBlocks x 8 elements).
- srcNStride = 4 indicates the stride between adjacent Z rows in the same source NZ matrix, that is, the distance from A1 to B1, which is 64 elements (8 DataBlocks x 8 elements).
- dstDStride = 144 indicates the stride between adjacent rows in the same destination ND matrix, that is, the distance between A1 and A3, which is 18 DataBlocks, that is, 18 x 8 = 144 elements.
- dstNdMatrixStride = 40 indicates the stride between the start point of the xth destination ND matrix and the start point of the (x+1)th destination ND matrix in the dst, that is, the distance between A1 and C1, which is 5 DataBlocks, that is, 5 x 8 = 40 elements.
Returns
None
Restrictions
None
Supported Channels and Data Types
The following data channels are expressed by the logical position TPosition and the corresponding physical channels are also specified. For details about the mapping between TPosition and physical memory, see Table 1.
|
Internal Model |
Datapath |
Data Types of the Source and Destination Operands (Same) |
|---|---|---|
|
|
VECOUT, CO2 -> GM (UB -> GM) |
int16_t, uint16_t, int32_t, uint32_t, half, float |
|
|
VECOUT -> GM (UB -> GM) |
int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float |
|
|
VECOUT -> GM (UB -> GM) |
int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float |
Examples
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 |
#include "kernel_operator.h" class KernelDataCopyUb2GmNz2Nd { public: __aicore__ inline KernelDataCopyUb2GmNz2Nd() {} __aicore__ inline void Init(__gm__ uint8_t* dstGm, __gm__ uint8_t* srcGm) { AscendC::Nz2NdParamsFull intriParamsIn{1, 32, 32, 1, 32, 32, 1}; intriParams = intriParamsIn; srcGlobal.SetGlobalBuffer((__gm__ half *)srcGm); dstGlobal.SetGlobalBuffer((__gm__ half *)dstGm); pipe.InitBuffer(inQueueSrcVecIn, 1, intriParams.nValue * intriParams.dValue * sizeof(half)); pipe.InitBuffer(inQueueSrcVecOut, 1, intriParams.nValue * intriParams.dValue * sizeof(half)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<half> srcLocal = inQueueSrcVecIn.AllocTensor<half>(); AscendC::DataCopy(srcLocal, srcGlobal, intriParams.nValue * intriParams.dValue); inQueueSrcVecIn.EnQue(srcLocal); } __aicore__ inline void Compute() { AscendC::LocalTensor<half> dstLocal = inQueueSrcVecIn.DeQue<half>(); AscendC::LocalTensor<half> srcOutLocal = inQueueSrcVecOut.AllocTensor<half>(); AscendC::DataCopy(srcOutLocal, dstLocal, intriParams.nValue * intriParams.dValue); inQueueSrcVecOut.EnQue(srcOutLocal); inQueueSrcVecIn.FreeTensor(dstLocal); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<half> srcOutLocalDe = inQueueSrcVecOut.DeQue<half>(); AscendC::DataCopy(dstGlobal, srcOutLocalDe, intriParams); inQueueSrcVecOut.FreeTensor(srcOutLocalDe); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrcVecIn; AscendC::TQue<AscendC::TPosition::VECOUT, 1> inQueueSrcVecOut; AscendC::GlobalTensor<half> srcGlobal; AscendC::GlobalTensor<half> dstGlobal; AscendC::Nz2NdParamsFull intriParams; }; extern "C" __global__ __aicore__ void kernel_data_copy_nz2nd_ub2out(__gm__ uint8_t* src_gm, __gm__ uint8_t* dst_gm) { KernelDataCopyUb2GmNz2Nd op; op.Init(dst_gm, src_gm); op.Process(); } |
Result example:
Input (srcGlobal): [1 2 3 ... 1024] Output (dstGlobal):[1 2 ... 15 16 513 514 ... 527 528 17 18 ... 31 32 529 530 ... 543 544 ...497 498 ... 511 512 1009 1010... 1023 1024]
