NZ2ND transfer with channel conversion

Applicability

Product

Supported/Unsupported

Atlas A3 training products / Atlas A3 inference products

Atlas A2 training products / Atlas A2 inference products

Atlas 200I/500 A2 inference products

x

Atlas inference product 's AI Core

Atlas inference product 's Vector Core

x

Atlas training products

x

Functions

Supports conversion from the NZ format to the ND format during data movement.

Prototype

1
2
template <typename T>
__aicore__ inline void DataCopy(const GlobalTensor<T>& dst, const LocalTensor<T>& src, const Nz2NdParamsFull& intriParams)

For details about the supported data paths and data types of each prototype, see Supported Channels and Data Types.

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Data type of the source operand and destination operand. For details about the supported data types, see Supported Channels and Data Types.

Table 2 Interface parameters

Parameter

Input/Output

Meaning

dst

Output

Destination operand of the GlobalTensor type.

src

Input

Source operand of the LocalTensor type.

intriParams

Input

Transfer parameter. The type is Nz2NdParamsFull.

For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_data_copy.h. Replace ${INSTALL_DIR} with the actual CANN component directory.

Table 3 Parameters in the Nz2NdParamsFull structure

Field

Meaning

ndNum

Number of NZ matrices to be transferred. The value range is [0, 4095].

nValue

Number of rows in the NZ matrix. The value range is [1, 8192].

dValue

Number of columns in the NZ matrix. The value range is [1, 8192]. The value of dValue must be a multiple of 16.

srcNdMatrixStride

Head-to-head offset between adjacent source NZ matrices. srcNdMatrixStride ∈ [1, 512]. The unit is 256 (16 x 16) elements.

srcNStride

Offset of the adjacent Zs in the source NZ matrix (head to head). The value range is [0, 4096], in the unit of 16 elements.

dstDStride

Offset of the adjacent rows in the destination ND matrix (head to head). The value range is [1, 65535], in the unit of element.

dstNdMatrixStride

Offset of the source adjacent NZ matrix in the destination ND matrix (head to head). The value range is [1, 65535], in the unit of element.

The following figure shows the NZ2ND conversion of the half type. The parameter values and descriptions are as follows:

  • ndNum = 2 indicates that there are two source NZ matrices (NZ matrix 1 is A1 to A4 + B1 to B4, and NZ matrix 2 is C1 to C4 + D1 to D4).
  • nValue = 4 indicates the number of rows in the NZ matrix, that is, the height of the matrix is 4.
  • dValue = 32 indicates the number of columns in the NZ matrix, that is, the width of the matrix is 32 elements.
  • srcNdMatrixStride = 1 indicates the offset between the start addresses of adjacent NZ matrices, that is, the distance between A1 and C1, which is 256 elements (16 DataBlocks x 16 elements).
  • srcNStride = 4 indicates the offset between adjacent Zs in the same source NZ matrix, that is, the distance between A1 and B1, which is 64 elements (4 DataBlocks x 16 elements).
  • dstDStride = 160, indicating the offset between adjacent rows of a destination ND matrix, that is, the distance between A1 and A2, which is 10 DataBlocks, that is, 10 x 16 = 160 elements.
  • dstNdMatrixStride = 48 indicates that the offset between the start point of the xth destination ND matrix and the start point of the (x + 1)th destination ND matrix in dst, that is, the distance between A1 and C1, is 48 elements (3 data blocks x 16 elements per block).
Figure 1 NZ2ND conversion (half type)

The following figure shows the NZ2ND conversion of the float type. The parameter values and descriptions are as follows:

  • ndNum = 2 indicates that there are two source NZ matrices (NZ matrix 1 is A1 to A8 + B1 to B8, and NZ matrix 2 is C1 to C8 + D1 to D8).
  • nValue = 4 indicates the number of rows in the NZ matrix, that is, the height of the matrix is 4.
  • dValue = 32 indicates the number of columns in the NZ matrix, that is, the width of the matrix is 32 elements.
  • srcNdMatrixStride = 1 indicates the offset between the start addresses of adjacent NZ matrices, that is, the distance between A1 and C1, which is 256 elements (32 DataBlocks x 8 elements).
  • srcNStride = 4 indicates the stride between adjacent Z rows in the same source NZ matrix, that is, the distance from A1 to B1, which is 64 elements (8 DataBlocks x 8 elements).
  • dstDStride = 144 indicates the stride between adjacent rows in the same destination ND matrix, that is, the distance between A1 and A3, which is 18 DataBlocks, that is, 18 x 8 = 144 elements.
  • dstNdMatrixStride = 40 indicates the stride between the start point of the xth destination ND matrix and the start point of the (x+1)th destination ND matrix in the dst, that is, the distance between A1 and C1, which is 5 DataBlocks, that is, 5 x 8 = 40 elements.
Figure 2 NZ2ND conversion (float type)

Returns

None

Restrictions

None

Supported Channels and Data Types

The following data channels are expressed by the logical position TPosition and the corresponding physical channels are also specified. For details about the mapping between TPosition and physical memory, see Table 1.

Table 4 Local Memory -> Global Memory channels and supported data types

Internal Model

Datapath

Data Types of the Source and Destination Operands (Same)

Atlas inference product 's AI Core

VECOUT, CO2 -> GM (UB -> GM)

int16_t, uint16_t, int32_t, uint32_t, half, float

Atlas A2 training products / Atlas A2 inference products

VECOUT -> GM (UB -> GM)

int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float

Atlas A3 training products / Atlas A3 inference products

VECOUT -> GM (UB -> GM)

int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
#include "kernel_operator.h"
class KernelDataCopyUb2GmNz2Nd {
public:
    __aicore__ inline KernelDataCopyUb2GmNz2Nd()
    {}
    __aicore__ inline void Init(__gm__ uint8_t* dstGm, __gm__ uint8_t* srcGm)
    {
        AscendC::Nz2NdParamsFull intriParamsIn{1, 32, 32, 1, 32, 32, 1};
        intriParams = intriParamsIn;
        srcGlobal.SetGlobalBuffer((__gm__ half *)srcGm);
        dstGlobal.SetGlobalBuffer((__gm__ half *)dstGm);
        pipe.InitBuffer(inQueueSrcVecIn, 1, intriParams.nValue * intriParams.dValue * sizeof(half));
        pipe.InitBuffer(inQueueSrcVecOut, 1, intriParams.nValue * intriParams.dValue * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrcVecIn.AllocTensor<half>();
        AscendC::DataCopy(srcLocal, srcGlobal, intriParams.nValue * intriParams.dValue);
        inQueueSrcVecIn.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> dstLocal = inQueueSrcVecIn.DeQue<half>();
        AscendC::LocalTensor<half> srcOutLocal = inQueueSrcVecOut.AllocTensor<half>();
        AscendC::DataCopy(srcOutLocal, dstLocal, intriParams.nValue * intriParams.dValue);
        inQueueSrcVecOut.EnQue(srcOutLocal);
        inQueueSrcVecIn.FreeTensor(dstLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> srcOutLocalDe = inQueueSrcVecOut.DeQue<half>();
        AscendC::DataCopy(dstGlobal, srcOutLocalDe, intriParams);
        inQueueSrcVecOut.FreeTensor(srcOutLocalDe);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrcVecIn;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> inQueueSrcVecOut;
    AscendC::GlobalTensor<half> srcGlobal;
    AscendC::GlobalTensor<half> dstGlobal;
    AscendC::Nz2NdParamsFull intriParams;
};
extern "C" __global__ __aicore__ void kernel_data_copy_nz2nd_ub2out(__gm__ uint8_t* src_gm, __gm__ uint8_t* dst_gm)
{
    KernelDataCopyUb2GmNz2Nd op;
    op.Init(dst_gm, src_gm);
    op.Process();
}

Result example:

Input (srcGlobal): [1 2 3 ... 1024]
Output (dstGlobal):[1 2 ... 15 16 513 514 ... 527 528 17 18 ... 31 32 529 530 ... 543 544 ...497 498 ...  511 512  1009 1010... 1023 1024]