ND2NZ transfer with channel conversion

Applicability

Product

Supported/Unsupported

Global Memory -> Local Memory

Supported/Unsupported

Local Memory -> Local Memory

Atlas A3 training products/Atlas A3 inference products

Atlas A2 training products/Atlas A2 inference products

Atlas 200I/500 A2 inference products

x

x

Atlas inference product's AI Core

x

Atlas inference product's Vector Core

x

x

Atlas training products

x

x

Functions

Supports conversion from the ND format to the NZ format during data movement.

Prototype

  • Global Memory -> Local Memory
    1
    2
    template <typename T>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const GlobalTensor<T>& src, const Nd2NzParams& intriParams)
    
  • Local Memory -> Local Memory
    1
    2
    template <typename T>   
    __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const LocalTensor<T>& src, const Nd2NzParams& intriParams)
    

For details about the supported data paths and data types of each prototype, see Supported Channels and Data Types.

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Data type of the source operand and destination operand. For details about the supported data types, see Supported Channels and Data Types.

Table 2 Parameters

Parameter

Input/Output

Meaning

dst

Output

Destination operand of the LocalTensor type.

src

Input

Source operand of the LocalTensor or GlobalTensor type.

intriParams

Input

Transfer parameter of the Nd2NzParams type.

For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_data_copy.h. Replace ${INSTALL_DIR} with the actual CANN component directory.

Table 3 Parameters in the Nd2NzParams structure

Field

Meaning

ndNum

Number of ND matrices to be transferred. The value range is [0, 4095].

nValue

Number of rows in the ND matrix. The value range is [0, 16384].

dValue

Number of columns in the ND matrix. Value range: dValue ∈ [0, 65535].

srcNdMatrixStride

Offset between the start addresses of adjacent ND matrices of the source operand. Value range: srcNdMatrixStride ∈ [0, 65535]. The unit is element.

srcDValue

Offset between the start addresses of adjacent rows in the same ND matrix of the source operand. Value range: srcDValue ∈ [1, 65535]. The unit is element.

dstNzC0Stride

After the ND format is converted to the NZ format, one row in the source operand is converted to multiple rows in the destination operand. dstNzC0Stride indicates the offset between the start addresses of adjacent rows of multiple lines of data from the same row of the source operand in the destination NZ matrix. dstNzC0Stride ∈ [1, 16384]. Unit: C0_SIZE (32 bytes).

dstNzNStride

Offset between the start addresses of adjacent rows in the destination NZ matrix in Z format. dstNzNStride ∈ [1, 16384]. Unit: C0_SIZE (32 bytes).

dstNzMatrixStride

Offset between the start addresses of adjacent NZ matrices in the destination NZ matrix. Value range: dstNzMatrixStride ∈ [1, 65535]. The unit is element.

The following figure shows the ND2NZ conversion. The parameter settings in the example are described as follows:

  • ndNum = 2 indicates that the number of ND matrices to be transferred is 2 (ND matrix 1 is A1 to A2 + B1 to B2, and ND matrix 2 is C1 to C2 + D1 to D2).
  • nValue = 2 indicates the number of rows in the ND matrix, that is, the height of the matrix is 2.
  • dValue = 24 indicates the number of columns in the ND matrix, that is, the width of the matrix is 24 elements. If dValue is not 32-byte aligned, the insufficient part in the destination operand is padded with 0s. For example, the blank part of the data block where A2 is located in the figure is padded with 0s.
  • srcNdMatrixStride = 144 indicates the offset between the start addresses of adjacent ND matrices, that is, the distance between A1 and C1. The distance is 9 DataBlocks, that is, 9 x 16 = 144 elements.
  • srcDValue = 48 indicates that the number of elements in a row, that is, the distance between A1 and B1, is 3 data blocks or 48 elements (3 × 16).
  • dstNzC0Stride = 11. After ND is converted to the NZ format, a row in the source operand is converted to multiple rows in the destination operand. For example, A1 and A2 in src occupy one row, while A1 and A2 in dst occupy two rows. The offset between the start addresses of multiple lines of data is the offset of A1 and A2 in dst. The offset is 11 data blocks.
  • dstNzNStride = 2 indicates the offset of the xth row and (x+1)th row in the source ND matrix in the destination operand after the matrix is converted to the NZ format. That is, the offset of A1 and B1 in the destination operand is two DataBlocks.
  • dstNzMatrixStride = 96 indicates the offset between the start point of the xth ND matrix and that of the (x+1)th ND matrix in the destination operand. That is, the distance between A1 and C1 is six DataBlocks, and the number of elements is 96 (6 x 16).
Figure 1 ND2NZ conversion (half type)

Returns

None

Restrictions

For Atlas inference product's AI Core, when the ND2NZ data transfer API using the Global Memory -> Local Memory channel is used, 8 KB UB space needs to be reserved as the temporary data storage area of the API.

Supported Channels and Data Types

The following data channels are expressed by the logical position TPosition, and the corresponding physical channels are also specified. For details about the mapping between TPosition and physical memory, see Table 1.

Table 4 Global Memory -> Local Memory channels and supported data types

Internal Model

Datapath

Data Types of the Source and Destination Operands (Same)

Atlas inference product's AI Core

  • GM -> VECIN (GM -> UB)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, float

  • GM -> A1, B1 (GM -> L1 Buffer)

int16_t, uint16_t, int32_t, uint32_t, half, float

Atlas A2 training products/Atlas A2 inference products

  • GM -> VECIN (GM -> UB)
  • GM -> A1, B1 (GM -> L1 Buffer)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float

Atlas A3 training products/Atlas A3 inference products

  • GM -> VECIN (GM -> UB)
  • GM -> A1, B1 (GM -> L1 Buffer)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float

Table 5 Local memory to local memory transfer and supported data types.

Internal Model

Datapath

Data Types of the Source and Destination Operands (Same)

Atlas A2 training products/Atlas A2 inference products

VECIN, VECCALC, VECOUT -> TSCM (UB -> L1 Buffer)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float

Atlas A3 training products/Atlas A3 inference products

VECIN, VECCALC, VECOUT -> TSCM (UB -> L1 Buffer)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
#include "kernel_operator.h"
class KernelDataCopyGm2UbNd2Nz{
public:
    __aicore__ inline KernelDataCopyGm2UbNd2Nz()
    {}
    __aicore__ inline void Init(__gm__ uint8_t* dstGm, __gm__ uint8_t* srcGm)
    {
        AscendC::Nd2NzParams intriParamsIn{1, 32, 32, 0, 32, 32, 1, 0};
        intriParams = intriParamsIn;
        srcGlobal.SetGlobalBuffer((__gm__ half *)srcGm);
        dstGlobal.SetGlobalBuffer((__gm__ half *)dstGm);
        pipe.InitBuffer(inQueueSrcVecIn, 1, intriParams.nValue * intriParams.dValue * sizeof(half));
        pipe.InitBuffer(inQueueSrcVecOut, 1, intriParams.nValue * intriParams.dValue * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrcVecIn.AllocTensor<half>();
        AscendC::DataCopy(srcLocal, srcGlobal, intriParams);
        inQueueSrcVecIn.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrcVecIn.DeQue<half>();
        AscendC::LocalTensor<half> dstLocal = inQueueSrcVecOut.AllocTensor<half>();
        AscendC::DataCopy(dstLocal, srcLocal, intriParams.nValue * intriParams.dValue);
        inQueueSrcVecOut.EnQue(dstLocal);
        inQueueSrcVecIn.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> dstLocal = inQueueSrcVecOut.DeQue<half>();
        AscendC::DataCopy(dstGlobal, dstLocal, intriParams.nValue * intriParams.dValue);
        inQueueSrcVecOut.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrcVecIn;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> inQueueSrcVecOut;
    AscendC::GlobalTensor<half> srcGlobal;
    AscendC::GlobalTensor<half> dstGlobal;
    AscendC::Nd2NzParams intriParams;
};
extern "C" __global__ __aicore__ void kernel_data_copy_nd2nz_ub2out(__gm__ uint8_t* src_gm, __gm__ uint8_t* dst_gm)
{
    KernelDataCopyGm2UbNd2Nz op;
    op.Init(dst_gm, src_gm);
    op.Process();
}

Result example:

Input (srcGlobal): [1 2 3 ... 1024]
Output data (dstGlobal):[1 2 ... 15 16 33 34 ... 47 48 65 66 ... 79 80 97 98 ... 111 112 ... 1009 1010... 1023 1024]