ND2NZ Transfer with On-the-Fly Conversion

Applicability

Product	Supported Global Memory -> Local Memory	Supported Local Memory -> Local Memory
Atlas A3 training products/Atlas A3 inference products	√	√
Atlas A2 training products/Atlas A2 inference products	√	√
Atlas 200I/500 A2 inference products	x	x
Atlas inference product's AI Core	√	x
Atlas inference product's Vector Core	x	x
Atlas training products	x	x

Function

Transfers data and converts it from the ND format to the NZ format on the fly.

Prototype

Global Memory -> Local Memory

template <typename T>
__aicore__ inline void DataCopy(const LocalTensor<T>& dst, const GlobalTensor<T>& src, const Nd2NzParams& intriParams)

Local Memory -> Local Memory

template <typename T>
__aicore__ inline void DataCopy(const LocalTensor<T>& dst, const LocalTensor<T>& src, const Nd2NzParams& intriParams)

For details about supported transfer paths and data types of each prototype, see Supported Paths and Data Types.

Parameters

**Table 1** Template parameters
Parameter	Description
T	Data type of the source or destination operand. For details about supported data types, see Supported Paths and Data Types.

**Table 2** Parameters
Parameter	Input/Output	Description
dst	Output	Destination operand, which is of the LocalTensor type.
src	Input	Source operand, which is of the LocalTensor or GlobalTensor type.
intriParams	Input	Transfer parameters, which are of the Nd2NzParams type. For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_data_copy.h. Replace ${INSTALL_DIR} with the CANN installation path.

**Table 3** Parameters in the Nd2NzParams structure
Parameter	Description
ndNum	Number of ND matrices to be transferred. The value range is [0, 4095].
nValue	Number of rows in the ND matrix. The value range is [0, 16384].
dValue	Number of columns in the ND matrix. The value range is [0, 65535].
srcNdMatrixStride	Offset between the start addresses of adjacent ND matrices of the source operand. The value range is [0, 65535] (unit: element).
srcDValue	Offset between the start addresses of adjacent rows in the same ND matrix of the source operand. The value range is [1, 65535] (unit: element).
dstNzC0Stride	After ND-to-NZ conversion, one row in the source operand is split into multiple rows in the destination operand. dstNzC0Stride is the offset between the start addresses of adjacent destination rows that originate from the same source row. The value range is [1, 16384] (unit: C0_SIZE, 32 bytes).
dstNzNStride	Offset between the start addresses of adjacent rows in the Z-matrix of the destination NZ matrix. The value range is [1, 16384] (unit: C0_SIZE, 32 bytes).
dstNzMatrixStride	Offset between the start addresses of adjacent destination NZ matrices. The value range is [1, 65535] (unit: element).
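
The stride parameters in Table 3 can be read as an address mapping from each source element to a destination element. The following host-side sketch models that mapping in plain C++ so it can be checked without a device. It is an illustration only: Nd2NzParamsModel and Nd2NzReference are hypothetical names that mirror the fields in Table 3, not part of the Ascend C API, and the offset formula is an assumption derived from the parameter descriptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side mirror of the fields in Table 3 (not the real AscendC::Nd2NzParams).
struct Nd2NzParamsModel {
    uint16_t ndNum;              // number of ND matrices
    uint16_t nValue;             // rows per ND matrix
    uint16_t dValue;             // columns per ND matrix
    uint16_t srcNdMatrixStride;  // unit: element
    uint16_t srcDValue;          // unit: element
    uint16_t dstNzC0Stride;      // unit: 32-byte block
    uint16_t dstNzNStride;       // unit: 32-byte block
    uint16_t dstNzMatrixStride;  // unit: element
};

// Reference model of the ND-to-NZ scatter. T stands in for the element type
// (for example, uint16_t has the same size as half).
template <typename T>
void Nd2NzReference(std::vector<T>& dst, const std::vector<T>& src, const Nd2NzParamsModel& p)
{
    constexpr std::size_t c0 = 32 / sizeof(T);                  // elements per 32-byte block
    const std::size_t paddedD = (p.dValue + c0 - 1) / c0 * c0;  // row width rounded up to a block
    for (std::size_t m = 0; m < p.ndNum; ++m) {
        for (std::size_t r = 0; r < p.nValue; ++r) {
            for (std::size_t c = 0; c < paddedD; ++c) {
                const std::size_t dstOff = m * p.dstNzMatrixStride
                                         + (c / c0) * p.dstNzC0Stride * c0
                                         + r * p.dstNzNStride * c0
                                         + c % c0;
                assert(dstOff < dst.size());
                if (c < p.dValue) {
                    const std::size_t srcOff = m * p.srcNdMatrixStride + r * p.srcDValue + c;
                    assert(srcOff < src.size());
                    dst[dstOff] = src[srcOff];
                } else {
                    dst[dstOff] = T(0);  // pad the tail of a partially filled block with zeros
                }
            }
        }
    }
}
```

With the element-unit strides applied directly and the block-unit strides multiplied by the number of elements per 32-byte block, this model can be used on the host to predict where a given source element should land before running the kernel.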

The following figure shows the ND2NZ conversion. The parameter settings in the example are described as follows:

ndNum = 2 indicates that the number of ND matrices to transfer is 2 (ND matrix 1 is A1 to A2 + B1 to B2, and ND matrix 2 is C1 to C2 + D1 to D2).
nValue = 2 indicates that the number of rows in the ND matrix is 2, that is, the height of the matrix is 2.
dValue = 24 indicates that the number of columns in the ND matrix is 24, that is, the width of the matrix is 24 elements. If the row width (dValue elements) is not 32-byte aligned, the unfilled remainder of the last data block in the destination operand is padded with 0s. For example, the blank part in the data block where A2 is located is padded with 0s.
srcNdMatrixStride = 144 indicates the offset between the start addresses of adjacent ND matrices, that is, the distance between A1 and C1, which is nine data blocks (9 × 16 = 144 elements).
srcDValue = 48 indicates the offset between the start addresses of adjacent rows in the same ND matrix of the source operand, that is, the distance between A1 and B1, which is three data blocks (3 × 16 = 48 elements).
dstNzC0Stride = 11 indicates that after conversion from ND to NZ format, a single row in the source operand is split into multiple rows in the destination operand. For example, A1 and A2 that belong to one row in src are divided into two separate rows in dst. The offset between the start addresses of these rows corresponds to the offset between A1 and A2 in dst, which is 11 data blocks.
dstNzNStride = 2 indicates the offset between the xth and (x+1)th rows of an ND matrix in src after conversion to the NZ format in dst. That is, the offset between A1 and B1 in dst is two data blocks.
dstNzMatrixStride = 96 indicates the offset between the start points of the xth and (x+1)th ND matrices in dst, that is, the distance between A1 and C1, which is six data blocks (6 × 16 = 96 elements).
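
The distances listed above can be cross-checked with a few lines of arithmetic. The sketch below hard-codes the figure's parameter values for the half type (2 bytes per element, so one 32-byte data block holds 16 elements). SrcOffset, DstOffset, and PadElems are hypothetical helper names introduced for illustration; they are not Ascend C functions.

```cpp
#include <cstddef>

constexpr std::size_t kBlockBytes = 32;                     // one data block (C0_SIZE)
constexpr std::size_t kHalfBytes = 2;                       // sizeof(half)
constexpr std::size_t kC0 = kBlockBytes / kHalfBytes;       // 16 elements per data block

// Element offset in src of column c, row r, ND matrix m
// (srcNdMatrixStride = 144, srcDValue = 48).
constexpr std::size_t SrcOffset(std::size_t m, std::size_t r, std::size_t c)
{
    return m * 144 + r * 48 + c;
}

// Element offset in dst of the same element
// (dstNzMatrixStride = 96, dstNzC0Stride = 11, dstNzNStride = 2).
constexpr std::size_t DstOffset(std::size_t m, std::size_t r, std::size_t c)
{
    return m * 96 + (c / kC0) * 11 * kC0 + r * 2 * kC0 + c % kC0;
}

// Number of zero elements appended to a row so that it ends on a block boundary.
constexpr std::size_t PadElems(std::size_t dValue)
{
    return (kC0 - dValue % kC0) % kC0;
}
```

Under these assumptions, A2 (row 0, column 16) sits 11 × 16 = 176 elements after A1 in dst, B1 (row 1, column 0) sits 2 × 16 = 32 elements after A1, C1 (matrix 1) sits 96 elements after A1, and a row with dValue = 24 is followed by 8 padding elements.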

Figure 1 ND2NZ conversion (half type)

Returns

None

Restrictions

For the Atlas inference product's AI Core, when the ND2NZ transfer API is used on the Global Memory -> Local Memory path, 8 KB of Unified Buffer (UB) space must be reserved as temporary storage for the API.

Supported Paths and Data Types

The following data paths are expressed using logical positions TPosition, with the corresponding physical paths noted. For details about the mapping between TPosition and the physical memory, see Table 1.

**Table 4** Specific paths and supported data types of Global Memory -> Local Memory
Product Model	Data Path	Data Types of the Source and Destination Operands (Same)
Atlas inference product's AI Core	GM -> VECIN (GM -> UB)	int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, float
Atlas inference product's AI Core	GM -> A1, B1 (GM -> L1 Buffer)	int16_t, uint16_t, int32_t, uint32_t, half, float
Atlas A2 training products/Atlas A2 inference products	GM -> VECIN (GM -> UB); GM -> A1, B1 (GM -> L1 Buffer)	int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float
Atlas A3 training products/Atlas A3 inference products	GM -> VECIN (GM -> UB); GM -> A1, B1 (GM -> L1 Buffer)	int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float

**Table 5** Specific paths and supported data types of Local Memory -> Local Memory
Product Model	Data Path	Data Types of the Source and Destination Operands (Same)
Atlas A2 training products/Atlas A2 inference products	VECIN, VECCALC, VECOUT -> TSCM (UB -> L1 Buffer)	int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float
Atlas A3 training products/Atlas A3 inference products	VECIN, VECCALC, VECOUT -> TSCM (UB -> L1 Buffer)	int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float

Example

#include "kernel_operator.h"
class KernelDataCopyGm2UbNd2Nz {
public:
    __aicore__ inline KernelDataCopyGm2UbNd2Nz()
    {}
    __aicore__ inline void Init(__gm__ uint8_t* dstGm, __gm__ uint8_t* srcGm)
    {
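        // Fields in Table 3 order: ndNum, nValue, dValue, srcNdMatrixStride, srcDValue,
        // dstNzC0Stride, dstNzNStride, dstNzMatrixStride.
        AscendC::Nd2NzParams intriParamsIn{1, 32, 32, 0, 32, 32, 1, 0};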
        intriParams = intriParamsIn;
        srcGlobal.SetGlobalBuffer((__gm__ half *)srcGm);
        dstGlobal.SetGlobalBuffer((__gm__ half *)dstGm);
        pipe.InitBuffer(inQueueSrcVecIn, 1, intriParams.nValue * intriParams.dValue * sizeof(half));
        pipe.InitBuffer(inQueueSrcVecOut, 1, intriParams.nValue * intriParams.dValue * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrcVecIn.AllocTensor<half>();
        // GM -> UB transfer; the ND-to-NZ conversion happens during this copy.
        AscendC::DataCopy(srcLocal, srcGlobal, intriParams);
        inQueueSrcVecIn.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrcVecIn.DeQue<half>();
        AscendC::LocalTensor<half> dstLocal = inQueueSrcVecOut.AllocTensor<half>();
        // Plain UB -> UB copy of the already converted data (no format change here).
        AscendC::DataCopy(dstLocal, srcLocal, intriParams.nValue * intriParams.dValue);
        inQueueSrcVecOut.EnQue(dstLocal);
        inQueueSrcVecIn.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> dstLocal = inQueueSrcVecOut.DeQue<half>();
        AscendC::DataCopy(dstGlobal, dstLocal, intriParams.nValue * intriParams.dValue);
        inQueueSrcVecOut.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrcVecIn;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> inQueueSrcVecOut;
    AscendC::GlobalTensor<half> srcGlobal;
    AscendC::GlobalTensor<half> dstGlobal;
    AscendC::Nd2NzParams intriParams;
};
extern "C" __global__ __aicore__ void kernel_data_copy_nd2nz_ub2out(__gm__ uint8_t* src_gm, __gm__ uint8_t* dst_gm)
{
    KernelDataCopyGm2UbNd2Nz op;
    op.Init(dst_gm, src_gm);
    op.Process();
}

Result example:

Input (srcGlobal): [1 2 3 ... 1024]
Output (dstGlobal): [1 2 ... 15 16 33 34 ... 47 48 65 66 ... 79 80 97 98 ... 111 112 ... 1009 1010 ... 1023 1024]
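
The output order above follows from the example's parameters: a single 32 × 32 half matrix, 16 elements per 32-byte block, dstNzNStride = 1, and dstNzC0Stride = 32. As a minimal host-side sketch under those assumptions, the expected output can be generated from a closed-form index mapping. ExpectedExampleOutput is a hypothetical helper for checking results, not part of the Ascend C API.

```cpp
#include <cstddef>
#include <vector>

// Builds the expected dstGlobal contents for the example, assuming the input is 1, 2, ..., 1024.
std::vector<int> ExpectedExampleOutput()
{
    constexpr std::size_t n = 32;   // nValue: rows
    constexpr std::size_t d = 32;   // dValue: columns
    constexpr std::size_t c0 = 16;  // half elements per 32-byte block
    std::vector<int> out(n * d);
    for (std::size_t k = 0; k < out.size(); ++k) {
        const std::size_t colBlock = k / (n * c0);   // which 16-column slice of the ND matrix
        const std::size_t row = (k % (n * c0)) / c0; // source row within that slice
        const std::size_t lane = k % c0;             // position inside the 32-byte block
        out[k] = static_cast<int>(row * d + colBlock * c0 + lane + 1);
    }
    return out;
}
```

The first 512 output elements are columns 0 to 15 of every row in row order, and the last 512 are columns 16 to 31, which is why 16 is followed by 33 and the sequence ends with 1009 to 1024.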
