On-the-Fly Quantization and Activation Transfer

Applicability

Product	Supported
Atlas A3 training products/Atlas A3 inference products	√
Atlas A2 training products/Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	√
AI Core of Atlas inference products	x
Vector Core of Atlas inference products	x
Atlas training products	x

Function

Supports quantization and ReLU activation during data transfer, as well as NZ-to-ND format conversion on the path from local memory to global memory.

Prototype

Local Memory -> Global Memory: supports quantization, ReLU activation, and NZ2ND format conversion

template <typename T, typename U>
__aicore__ inline void DataCopy(const GlobalTensor<T>& dst, const LocalTensor<U>& src, const DataCopyCO12DstParams& intriParams)

Local Memory -> Local Memory: supports quantization and ReLU activation

template <typename T, typename U>
__aicore__ inline void DataCopy(const LocalTensor<T>& dst, const LocalTensor<U>& src, const DataCopyCO12DstParams& intriParams)

For details about supported transfer paths and data types of each prototype, see Supported Paths and Data Types.

Parameters

**Table 1** Template parameters
Parameter	Description
T	Data type of the destination operand. For details about supported data types, see Supported Paths and Data Types.
U	Data type of the source operand. For details about supported data types, see Supported Paths and Data Types.

**Table 2** Parameters
Parameter	Input/Output	Description
dst	Output	Destination operand, which is of the LocalTensor or GlobalTensor type.
src	Input	Source operand, which is of the LocalTensor type.
intriParams	Input	Transfer parameters, which are of the DataCopyCO12DstParams type. For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_data_copy.h. Replace ${INSTALL_DIR} with the CANN installation path.

**Table 3** Parameters in the DataCopyCO12DstParams structure (C0 = 16 in general; C0 = 8 when channelSplit is enabled)
Parameter	Description

nSize

Horizontal dimension size of src.

If NZ2ND is disabled, this value must be a multiple of C0, and the number of consecutive data blocks to be transferred equals nSize/C0.
If NZ2ND is enabled, there is no such restriction.

mSize

Vertical dimension size of src.

If NZ2ND is disabled, the size of consecutive data blocks to be transferred is mSize × C0.

If NZ2ND is enabled, the size of the NZ/ND matrix is mSize × nSize.
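As a small host-side sketch of the nSize/mSize arithmetic for the case where NZ2ND is disabled, the following helper derives the number of consecutive data blocks and the number of elements in each. The names BlockLayout and ComputeBlockLayout are hypothetical and are not part of the Ascend C API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for illustration; not part of the Ascend C API.
struct BlockLayout {
    uint32_t blockCount;    // number of consecutive data blocks = nSize / C0
    uint32_t elemsPerBlock; // elements in each consecutive data block = mSize * C0
};

// Derives the transfer layout for the case where NZ2ND is disabled.
// C0 is 16 in general and 8 when channelSplit is enabled.
inline BlockLayout ComputeBlockLayout(uint16_t nSize, uint16_t mSize, bool channelSplit)
{
    const uint32_t c0 = channelSplit ? 8u : 16u;
    assert(nSize % c0 == 0); // this restriction applies only when NZ2ND is disabled
    return BlockLayout{nSize / c0, static_cast<uint32_t>(mSize) * c0};
}
```

For nSize = 128 and mSize = 9 without channel splitting, this gives 8 consecutive data blocks of 144 elements each.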

dstStride

If NZ2ND is disabled:
Stride between adjacent consecutive data segments of dst (head-to-head stride between adjacent data blocks). The value cannot be 0. The unit is data block (32 bytes).

If NZ2ND is enabled:
Head-to-head stride between adjacent rows in the same ND matrix of dst. The value cannot be 0. The unit is element.

srcStride

If NZ2ND is disabled:
Stride between adjacent consecutive data segments of src (head-to-head stride between adjacent data blocks). The value must be a multiple of 16. Value range: [0, 65535]. Unit: C0_size (C0 × sizeof(U), where U is the data type of src).

If NZ2ND is enabled:
Head-to-head stride between adjacent Z-tiles in the same NZ matrix of src. The value must be a multiple of 16. Value range: [0, 65535]. Unit: C0_size.
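The stride units differ between the two modes, so a short host-side sketch may help. It mirrors the stride choices made in the example on this page and assumes that dst is densely packed (no gaps between data segments or between ND rows). FixpipeStrides and ComputeDenseStrides are hypothetical names, not Ascend C APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type for illustration; not part of the Ascend C API.
struct FixpipeStrides {
    uint32_t dstStride; // 32-byte data blocks (NZ2ND disabled) or elements (NZ2ND enabled)
    uint16_t srcStride; // in C0_size units; must be a multiple of 16
};

// Assumes a densely packed dst, as in the example on this page.
inline FixpipeStrides ComputeDenseStrides(uint16_t mSize, uint16_t nSize,
                                          std::size_t dstElemBytes, bool nz2ndEn)
{
    FixpipeStrides s{};
    // Round mSize up to the next multiple of 16.
    s.srcStride = static_cast<uint16_t>((mSize + 15) / 16 * 16);
    if (nz2ndEn) {
        // One ND row holds nSize elements; the unit is element.
        s.dstStride = nSize;
    } else {
        // One data segment holds mSize * 16 elements; the unit is a 32-byte data block.
        const std::size_t segmentBytes = static_cast<std::size_t>(mSize) * 16 * dstElemBytes;
        assert(segmentBytes % 32 == 0);
        s.dstStride = static_cast<uint32_t>(segmentBytes / 32);
    }
    return s;
}
```

With mSize = 9, nSize = 128, and a 2-byte destination type, the sketch yields srcStride = 16 in both modes, dstStride = 9 data blocks when NZ2ND is disabled, and dstStride = 128 elements when it is enabled.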

quantPre

Quantization mode, which is of the QuantMode_t type. The default value is QuantMode_t::NoQuant, that is, the quantization function is disabled.

When scalar quantization is configured, call the SetFixpipePreQuantFlag API to set scalar quantization parameters. When tensor quantization is configured, call the SetFixPipeConfig API to set tensor quantization parameters.

enum QuantMode_t
{
    NoQuant,      // Quantization disabled
    F322F16,      // float-to-half scalar quantization
    F322BF16,     // float-to-bfloat16_t scalar quantization
    DEQF16,       // int32_t-to-half scalar quantization
    VDEQF16,      // int32_t-to-half tensor quantization
    QF322B8_PRE,  // float-to-int8_t/uint8_t scalar quantization
    VQF322B8_PRE, // float-to-int8_t/uint8_t tensor quantization
    REQ8,         // int32_t-to-int8_t/uint8_t scalar quantization
    VREQ8,        // int32_t-to-int8_t/uint8_t tensor quantization
};
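The following host-side sketch illustrates two points from the description above: which modes are scalar and which are tensor quantization (and therefore which setup API applies), and how a float scale can be turned into the uint64_t bit pattern that the example on this page passes to SetFixpipePreQuantFlag. QuantModeSketch is a local stand-in for QuantMode_t, and the helper names are hypothetical. The bit conversion uses std::memcpy as an aliasing-safe alternative to the pointer cast in the example; it is an assumption that only the low 32 bits carry the scale.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Local stand-in mirroring the QuantMode_t enumerators listed above; illustration only.
enum class QuantModeSketch {
    NoQuant, F322F16, F322BF16, DEQF16, VDEQF16, QF322B8_PRE, VQF322B8_PRE, REQ8, VREQ8
};

// True for the modes documented as tensor quantization (parameters set via SetFixPipeConfig).
inline bool IsTensorQuant(QuantModeSketch mode)
{
    return mode == QuantModeSketch::VDEQF16 || mode == QuantModeSketch::VQF322B8_PRE ||
           mode == QuantModeSketch::VREQ8;
}

// True for the modes documented as scalar quantization (parameter set via SetFixpipePreQuantFlag).
inline bool IsScalarQuant(QuantModeSketch mode)
{
    return mode != QuantModeSketch::NoQuant && !IsTensorQuant(mode);
}

// Copies the IEEE 754 bit pattern of a float into the low 32 bits of a uint64_t.
inline uint64_t ScaleToQuantBits(float scale)
{
    static_assert(sizeof(float) == sizeof(uint32_t), "float is expected to be 32 bits wide");
    uint32_t bits = 0;
    std::memcpy(&bits, &scale, sizeof(bits));
    return static_cast<uint64_t>(bits);
}
```

For a scale of 0.5, ScaleToQuantBits returns 0x3F000000. Unlike a cast through int32_t, this version never sign-extends into the upper 32 bits for negative scales.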

reluPre

ReLU operation mode, which is of the uint8_t type. The options are as follows:

0: ReLU disabled
1: ReLU enabled

channelSplit

Whether to enable channel splitting. The type is bool. This parameter takes effect only when dst is of the float type.

false: disabled
true: enabled

nz2ndEn

Whether to enable NZ2ND format conversion. The type is bool. It takes effect only in the CO1 -> GM path.

To enable NZ2ND, SetFixpipeNz2ndFlag must be called to set the format conversion configuration.

false: disabled
true: enabled
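To make the effect of NZ2ND concrete, here is a host-side reference model (not device code) of the layout change. It assumes a single NZ matrix whose Z-tiles are 16 columns wide, so that element (row, col) sits at (col / 16) × srcStride × 16 + row × 16 + col % 16 in src, and it writes a row-major ND matrix with dstStride elements between rows. The function name Nz2NdReference and this exact indexing are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference model of an NZ -> ND conversion; not an Ascend C API.
// srcStride is in C0_size units (16 elements each); dstStride is in elements.
template <typename T>
std::vector<T> Nz2NdReference(const std::vector<T>& src, uint16_t mSize, uint16_t nSize,
                              uint16_t srcStride, uint32_t dstStride)
{
    constexpr std::size_t kC0 = 16;
    assert(dstStride >= nSize); // ND rows must not overlap
    assert(srcStride >= mSize); // a Z-tile must hold all mSize rows
    std::vector<T> dst(static_cast<std::size_t>(mSize) * dstStride, T{});
    for (std::size_t row = 0; row < mSize; ++row) {
        for (std::size_t col = 0; col < nSize; ++col) {
            // Assumed NZ addressing: tile index, then row inside the tile, then column inside the row.
            const std::size_t srcOffset = (col / kC0) * srcStride * kC0 + row * kC0 + col % kC0;
            assert(srcOffset < src.size());
            dst[row * dstStride + col] = src[srcOffset];
        }
    }
    return dst;
}
```

Note that nSize does not have to be a multiple of 16 here, which matches the relaxed restriction on nSize when NZ2ND is enabled.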

clipReluPre

Whether to enable the ClipReLU operation. The type is uint8_t. The value 0 indicates that ClipReLU is disabled, and 1 indicates that ClipReLU is enabled. If this parameter is set to 1, SetFixPipeClipRelu must be called to configure the maximum value for ClipReLU.

This operation is executed after on-the-fly quantization and takes effect only after quantPre is configured. The currently supported quantization modes include F322F16, DEQF16, VDEQF16, QF322B8_PRE, VQF322B8_PRE, REQ8, and VREQ8.
This parameter is supported only by the Atlas 200I/500 A2 inference products.
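As a rough host-side model of what ClipReLU does to each quantized value, the sketch below clamps the value into the range [0, clipMax]. This clamping behavior is an assumption based on the usual meaning of a clipped ReLU and on the fact that SetFixPipeClipRelu configures a maximum value; ClipReluReference is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>

// Assumed semantics: negative values become 0 and values above clipMax become clipMax.
inline float ClipReluReference(float quantized, float clipMax)
{
    assert(clipMax >= 0.0f);
    return std::min(std::max(quantized, 0.0f), clipMax);
}
```

With clipMax = 1.0, the inputs -2.0, 0.25, and 3.0 map to 0.0, 0.25, and 1.0 under this model.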

eltWiseOp

Whether to enable element-wise operation and set the operation mode. The element-wise operation performs element-wise addition or subtraction with a LocalTensor after on-the-fly quantization. The LocalTensor has a size of mSize × nSize. Call SetFixPipeAddr to set parameters related to the LocalTensor address.

The parameter type is uint8_t. The options are as follows:

0: element-wise operation disabled
1: element-wise addition enabled
2: element-wise subtraction enabled

This parameter is supported only by the Atlas 200I/500 A2 inference products.
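The three eltWiseOp values can be modeled on the host as follows. This is a sketch only: it works on float vectors rather than device tensors, and it assumes that subtraction computes the quantized result minus the extra tensor, which this page does not state explicitly. EltWiseReference is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model of the element-wise step applied after quantization.
// eltWiseOp: 0 = disabled, 1 = addition, 2 = subtraction (operand order assumed).
inline std::vector<float> EltWiseReference(const std::vector<float>& quantized,
                                           const std::vector<float>& other, uint8_t eltWiseOp)
{
    assert(eltWiseOp <= 2);
    if (eltWiseOp == 0) {
        return quantized; // operation disabled: values pass through unchanged
    }
    assert(quantized.size() == other.size()); // both hold mSize * nSize elements
    std::vector<float> out(quantized.size());
    for (std::size_t i = 0; i < quantized.size(); ++i) {
        out[i] = (eltWiseOp == 1) ? quantized[i] + other[i] : quantized[i] - other[i];
    }
    return out;
}
```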

sid

Reserved parameter for future use.

Returns

None

Restrictions

None

Supported Paths and Data Types

The following data paths are expressed using logical positions TPosition, with the corresponding physical paths noted. For details about the mapping between TPosition and the physical memory, see Table 1.

**Table 4** Specific paths and supported data types of Local Memory -> Global Memory
Supported Model	Data Path	Data Type of the Source Operand	Data Type of the Destination Operand
Atlas A2 training products/Atlas A2 inference products	CO1 -> GM (L0C Buffer -> GM)	float	uint8_t, int8_t, half, bfloat16_t, float
Atlas A2 training products/Atlas A2 inference products	CO1 -> GM (L0C Buffer -> GM)	int32_t	uint8_t, int8_t, half, int16_t, int32_t
Atlas A3 training products/Atlas A3 inference products	CO1 -> GM (L0C Buffer -> GM)	float	uint8_t, int8_t, half, bfloat16_t, float
Atlas A3 training products/Atlas A3 inference products	CO1 -> GM (L0C Buffer -> GM)	int32_t	uint8_t, int8_t, half, int16_t, int32_t
Atlas 200I/500 A2 inference products	CO1 -> GM (L0C Buffer -> GM)	float	uint8_t, int8_t, half, bfloat16_t, float
Atlas 200I/500 A2 inference products	CO1 -> GM (L0C Buffer -> GM)	int32_t	uint8_t, int8_t, half, int16_t, int32_t

**Table 5** Specific paths and supported data types of Local Memory -> Local Memory
Supported Model	Data Path	Data Type of the Source Operand	Data Type of the Destination Operand
Atlas A2 training products/Atlas A2 inference products	CO1 -> A1 (L0C Buffer -> L1 Buffer)	float	uint8_t, int8_t, half, bfloat16_t
Atlas A2 training products/Atlas A2 inference products	CO1 -> A1 (L0C Buffer -> L1 Buffer)	int32_t	uint8_t, int8_t, half, int16_t
Atlas A3 training products/Atlas A3 inference products	CO1 -> A1 (L0C Buffer -> L1 Buffer)	float	uint8_t, int8_t, half, bfloat16_t
Atlas A3 training products/Atlas A3 inference products	CO1 -> A1 (L0C Buffer -> L1 Buffer)	int32_t	uint8_t, int8_t, half, int16_t

Example

Data transfer with on-the-fly format conversion along the CO1->A1/CO1->GM path

Example: Mmad implements matrix multiplication with bias. The left and right matrices are of type int8_t, and the result matrix is of type int32_t. The quantization mode is DEQF16, and the scalar quantization parameter is 0.5. The computation result of Mmad is quantized from int32_t to half and then output.

#ifdef ASCENDC_CPU_DEBUG
#include "tikicpulib.h"
#endif
#include "kernel_operator.h"
#include "../../instrs/common_utils/register_utils.h"
SET_G_CORE_TYPE_IS_AIC
template <typename dst_T, typename fmap_T, typename weight_T, typename dstCO1_T> class KernelCubeDataCopy{
public:
    __aicore__ inline KernelCubeDataCopy(uint16_t CoutIn, uint8_t dilationHIn, uint8_t dilationWIn, QuantMode_t deqModeIn)
    {
        Cout = CoutIn;
        dilationH = dilationHIn;
        dilationW = dilationWIn;
        C0 = 32 / sizeof(fmap_T);
        C1 = channelSize / C0;
        // Number of 16-channel output blocks (Cout rounded up to a multiple of 16).
        coutBlocks = (Cout + 16 - 1) / 16;
        ho = H - dilationH * (Kh - 1);
        wo = W - dilationW * (Kw - 1);
        howo = ho * wo;
        howoRound = ((howo + 16 - 1) / 16) * 16;
        featureMapA1Size = C1 * H * W * C0;      // shape: [C1, H, W, C0]
        weightA1Size = C1 * Kh * Kw * Cout * C0; // shape: [C1, Kh, Kw, Cout, C0]
        featureMapA2Size = howoRound * (C1 * Kh * Kw * C0);
        weightB2Size = (C1 * Kh * Kw * C0) * coutBlocks * 16;
        m = howo;
        k = C1 * Kh * Kw * C0;
        n = Cout;
        biasSize = Cout;                  // shape: [Cout]
        dstSize = coutBlocks * howo * 16; // shape: [coutBlocks, howo, 16]
        dstCO1Size = coutBlocks * howoRound * 16;
        fmRepeat = featureMapA2Size / (16 * C0);
        weRepeat = weightB2Size / (16 * C0);
        deqMode = deqModeIn;
    }
    __aicore__ inline void Init(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm, __gm__ uint8_t* biasGm, __gm__ uint8_t* deqGm, __gm__ uint8_t* dstGm)
    {
        fmGlobal.SetGlobalBuffer((__gm__ fmap_T*)fmGm);
        weGlobal.SetGlobalBuffer((__gm__ weight_T*)weGm);
        biasGlobal.SetGlobalBuffer((__gm__ dstCO1_T*)biasGm);
        deqGlobal.SetGlobalBuffer((__gm__ uint64_t*)deqGm);
        dstGlobal.SetGlobalBuffer((__gm__ dst_T*)dstGm);
        pipe.InitBuffer(inQueueFmA1, 1, featureMapA1Size * sizeof(fmap_T));
        pipe.InitBuffer(inQueueFmA2, 1, featureMapA2Size * sizeof(fmap_T));
        pipe.InitBuffer(inQueueWeB1, 1, weightA1Size * sizeof(weight_T));
        pipe.InitBuffer(inQueueWeB2, 1, weightB2Size * sizeof(weight_T));
        pipe.InitBuffer(inQueueBiasA1, 1, biasSize * sizeof(dstCO1_T));
        pipe.InitBuffer(inQueueDeqA1, 1, dstCO1Size * sizeof(uint64_t));
        pipe.InitBuffer(inQueueDeqFB, 1, dstCO1Size * sizeof(uint64_t));
        pipe.InitBuffer(outQueueCO1, 1, dstCO1Size * sizeof(dstCO1_T));
        pipe.InitBuffer(outQueueA1, 1, dstCO1Size * sizeof(dst_T));
     }
    __aicore__ inline void Process()
    {
        CopyIn();
        Split();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<fmap_T> featureMapA1 = inQueueFmA1.AllocTensor<fmap_T>();
        AscendC::LocalTensor<weight_T> weightB1 = inQueueWeB1.AllocTensor<weight_T>();
        AscendC::LocalTensor<dstCO1_T> biasA1 = inQueueBiasA1.AllocTensor<dstCO1_T>();
        AscendC::DataCopy(featureMapA1, fmGlobal, { 1, static_cast<uint16_t>(featureMapA1Size * sizeof(fmap_T) / 32), 0, 0 });
        AscendC::DataCopy(weightB1, weGlobal, { 1, static_cast<uint16_t>(weightA1Size * sizeof(weight_T) / 32), 0, 0 });
        AscendC::DataCopy(biasA1, biasGlobal, { 1, static_cast<uint16_t>(biasSize * sizeof(dstCO1_T) / 32), 0, 0 });
        inQueueFmA1.EnQue(featureMapA1);
        inQueueWeB1.EnQue(weightB1);
        inQueueBiasA1.EnQue(biasA1);
    }
    __aicore__ inline void Split()
    {
        AscendC::LocalTensor<fmap_T> featureMapA1 = inQueueFmA1.DeQue<fmap_T>();
        AscendC::LocalTensor<weight_T> weightB1 = inQueueWeB1.DeQue<weight_T>();
        AscendC::LocalTensor<fmap_T> featureMapA2 = inQueueFmA2.AllocTensor<fmap_T>();
        AscendC::LocalTensor<weight_T> weightB2 = inQueueWeB2.AllocTensor<weight_T>();
        uint8_t padList[] = {0, 0, 0, 0};
        // load3dv2
        AscendC::LoadData(featureMapA2, featureMapA1, { padList, H, W, channelSize, k, howoRound, 0, 0, 1, 1, Kw, Kh, dilationW, dilationH, false, false, 0 });
        // load2d
        AscendC::LoadData(weightB2, weightB1, { 0, weRepeat, 1, 0, 0, false, 0 });
        inQueueFmA2.EnQue<fmap_T>(featureMapA2);
        inQueueWeB2.EnQue<weight_T>(weightB2);
        inQueueFmA1.FreeTensor(featureMapA1);
        inQueueWeB1.FreeTensor(weightB1);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<fmap_T> featureMapA2 = inQueueFmA2.DeQue<fmap_T>();
        AscendC::LocalTensor<weight_T> weightB2 = inQueueWeB2.DeQue<weight_T>();
        AscendC::LocalTensor<dstCO1_T> dstCO1 = outQueueCO1.AllocTensor<dstCO1_T>();
        AscendC::LocalTensor<dstCO1_T> biasA1 = inQueueBiasA1.DeQue<dstCO1_T>();
        // C = A * B + bias
        // m: height of the left matrix; k: width of the left matrix; n: width of the right matrix
        AscendC::Mmad(dstCO1, featureMapA2, weightB2, biasA1, { m, n, k, true, 0, false, false, false });
        outQueueCO1.EnQue<dstCO1_T>(dstCO1);
        inQueueFmA2.FreeTensor(featureMapA2);
        inQueueWeB2.FreeTensor(weightB2);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<dstCO1_T> dstCO1 = outQueueCO1.DeQue<dstCO1_T>();
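        // dstA1 is written only by the CO1 -> A1 variants commented out below, so allocate it rather than dequeue it.
        AscendC::LocalTensor<dst_T> dstA1 = outQueueA1.AllocTensor<dst_T>();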
        // Enable DEQF16 quantization and set the quantization parameter to 0.5.
        float tmp = (float)0.5;
        // Reinterpret the bit pattern of the float tmp as an integer and widen it to uint64_t (deqScalar).
        uint64_t deqScalar = static_cast<uint64_t>(*reinterpret_cast<int32_t*>(&tmp));
        bool nz2ndEn = false;
        // If NZ2ND is disabled, the value of nSize must be a multiple of 16.
        uint16_t nSize = coutBlocks * 16;
        uint16_t mSize = m;
        // The value of srcStride must be a multiple of 16.
        uint16_t srcStride = (m + 16 - 1) / 16 * 16;
        // If NZ2ND is disabled, dstStride is the head-to-head distance between data segments, in units of 32-byte data blocks.
        uint32_t dstStride = m * sizeof(dst_T) * 16 / 32;
        if (nz2ndEn) {
            // The number of ND matrices is 1. Set src_nd_stride and dst_nd_stride to 1.
            AscendC::SetFixpipeNz2ndFlag(1, 1, 1);
            // If NZ2ND is enabled, nSize can be a non-multiple of 16 and must be the same as n of Mmad.
            nSize = n;
            // If NZ2ND is enabled, dstStride indicates the stride between adjacent consecutive rows of the same ND matrix and is the same as n.
            dstStride = nSize;
        }
        // Disable ReLU and channelSplit.
        AscendC::DataCopyCO12DstParams intriParams(nSize, mSize, dstStride, srcStride, deqMode, 0, false, nz2ndEn);
      
        // mov l0c to gm, deq scalar quant
        AscendC::SetFixpipePreQuantFlag(deqScalar);  // Set the quantization parameter.
        AscendC::PipeBarrier<PIPE_FIX>();
        AscendC::DataCopy(dstGlobal, dstCO1, intriParams);
        // // mov l0c to gm, deq tensor quant
        // Additional global memory of the deq tensor needs to be allocated to transfer the value to workA1.
        // AscendC::LocalTensor<uint64_t> workA1 = inQueueDeqA1.AllocTensor<uint64_t>();
        // Size of the deq tensor
        // uint16_t deqSize = 128;
        // AscendC::DataCopy(workA1, deqGlobal, deqSize);
        // Address of the deq tensor on the fix
        // AscendC::LocalTensor<uint64_t> deqFB = inQueueDeqFB.AllocTensor<uint64_t>();
        // // l1->fix, burst_len unit is 128Bytes
        // uint16_t fbufBurstLen = deqSize / 128;
        // AscendC::DataCopyParams dataCopyParams(1, fbufBurstLen, 0, 0);
        // AscendC::DataCopy(deqFB, workA1, dataCopyParams);
        // Set the quantization tensor.
        // AscendC::SetFixPipeConfig(deqFB);
        // AscendC::PipeBarrier<PIPE_FIX>();
        // AscendC::DataCopy(dstGlobal, dstCO1, intriParams);
        // inQueueDeqA1.FreeTensor(workA1);
        // inQueueDeqFB.FreeTensor(deqFB);
        // // mov l0c to l1, deq scalar quant, and then mov l1 to gm
        // AscendC::SetFixpipePreQuantFlag(deqScalar);  // Set the quantization parameter.
        // AscendC::PipeBarrier<PIPE_FIX>();
        // AscendC::DataCopy(dstA1, dstCO1, intriParams);
        // AscendC::DataCopy(dstGlobal, dstA1, dstCO1Size);
        // // mov l0c to l1, deq tensor quant, and then mov l1 to gm
        // AscendC::LocalTensor<uint64_t> workA1 = inQueueDeqA1.AllocTensor<uint64_t>();
        // uint16_t deqSize = 128;
        // AscendC::DataCopy(workA1, deqGlobal, deqSize);
        // AscendC::LocalTensor<uint64_t> deqFB = inQueueDeqFB.AllocTensor<uint64_t>();
        // uint16_t fbufBurstLen = deqSize / 128;
        // AscendC::DataCopyParams dataCopyParams(1, fbufBurstLen, 0, 0);
        // AscendC::DataCopy(deqFB, workA1, dataCopyParams);
        // Set the quantization tensor.
        // AscendC::SetFixPipeConfig(deqFB);
        // AscendC::PipeBarrier<PIPE_FIX>();
        // AscendC::DataCopy(dstA1, dstCO1, intriParams);
        // AscendC::DataCopy(dstGlobal, dstA1, dstCO1Size);
        // inQueueDeqA1.FreeTensor(workA1);
        // inQueueDeqFB.FreeTensor(deqFB);
        // Release the output tensors once the transfer has been issued.
        outQueueCO1.FreeTensor(dstCO1);
        outQueueA1.FreeTensor(dstA1);
    }
private:
    AscendC::TPipe pipe;
    // feature map queue
    AscendC::TQue<AscendC::TPosition::A1, 1> inQueueFmA1;
    AscendC::TQue<AscendC::TPosition::A2, 1> inQueueFmA2;
    // weight queue
    AscendC::TQue<AscendC::TPosition::B1, 1> inQueueWeB1;
    AscendC::TQue<AscendC::TPosition::B2, 1> inQueueWeB2;
    // bias queue
    AscendC::TQue<AscendC::TPosition::A1, 1> inQueueBiasA1;
    // deq tensor queue
    AscendC::TQue<AscendC::TPosition::A1, 1> inQueueDeqA1;
    // fb dst of deq tensor
    AscendC::TQue<AscendC::TPosition::C2PIPE2GM, 1> inQueueDeqFB;
    // dst queue
    AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1;
    AscendC::TQue<AscendC::TPosition::A1, 1> outQueueA1;
    AscendC::GlobalTensor<fmap_T> fmGlobal;
    AscendC::GlobalTensor<weight_T> weGlobal;
    AscendC::GlobalTensor<dst_T> dstGlobal;
    AscendC::GlobalTensor<uint64_t> deqGlobal;
    AscendC::GlobalTensor<dstCO1_T> biasGlobal;
    AscendC::GlobalTensor<half> eleWiseGlobal;
    uint16_t channelSize = 32;
    uint16_t H = 4, W = 4;
    uint8_t Kh = 2, Kw = 2;
    uint16_t Cout;
    uint16_t C0, C1;
    uint8_t dilationH, dilationW;
    uint16_t coutBlocks, ho, wo, howo, howoRound;
    uint32_t featureMapA1Size, weightA1Size, featureMapA2Size, weightB2Size, biasSize, dstSize, dstCO1Size;
    uint16_t m, k, n;
    uint8_t fmRepeat, weRepeat;
    QuantMode_t deqMode = QuantMode_t::NoQuant;
};
#define KERNEL_CUBE_DATACOPY(dst_type, fmap_type, weight_type, dstCO1_type, CoutIn, dilationHIn, dilationWIn, deqModeIn)  \
    extern "C" __global__ __aicore__ void cube_datacopy_kernel_##fmap_type(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm,    \
        __gm__ uint8_t* biasGm, __gm__ uint8_t* deqGm, __gm__ uint8_t* dstGm)                                             \
    {                                                                                                                     \
        if (g_coreType == AscendC::AIV) {                                                                                 \
            return;                                                                                                       \
        }                                                                                                                 \
        KernelCubeDataCopy<dst_type, fmap_type, weight_type, dstCO1_type> op(CoutIn, dilationHIn, dilationWIn,            \
            deqModeIn);                                                                                                   \
        op.Init(fmGm, weGm, biasGm, deqGm, dstGm);                                                                        \
        op.Process();                                                                                                     \
    }
KERNEL_CUBE_DATACOPY(half, int8_t, int8_t, int32_t, 128, 1, 1, QuantMode_t::DEQF16);
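The dimension arithmetic in the constructor above can be checked on the host. The sketch below repeats the same formulas with the fixed values from the class (channelSize = 32, H = W = 4, Kh = Kw = 2) so that the values of m, k, n and the buffer sizes for the instantiated kernel are easy to verify. ExampleDims and ComputeExampleDims are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the derived sizes; mirrors the class members of the example.
struct ExampleDims {
    uint32_t c0, c1, coutBlocks, ho, wo, howo, howoRound;
    uint32_t m, k, n, dstSize, dstCO1Size;
};

// Repeats the constructor arithmetic of the example on the host.
inline ExampleDims ComputeExampleDims(uint32_t cout, uint32_t dilationH, uint32_t dilationW,
                                      uint32_t fmapElemBytes)
{
    constexpr uint32_t channelSize = 32, H = 4, W = 4, Kh = 2, Kw = 2;
    ExampleDims d{};
    d.c0 = 32 / fmapElemBytes;                // elements per 32-byte data block
    d.c1 = channelSize / d.c0;
    d.coutBlocks = (cout + 16 - 1) / 16;      // Cout rounded up to 16-channel blocks
    d.ho = H - dilationH * (Kh - 1);
    d.wo = W - dilationW * (Kw - 1);
    d.howo = d.ho * d.wo;
    d.howoRound = (d.howo + 16 - 1) / 16 * 16;
    d.m = d.howo;
    d.k = d.c1 * Kh * Kw * d.c0;
    d.n = cout;
    d.dstSize = d.coutBlocks * d.howo * 16;       // shape: [coutBlocks, howo, 16]
    d.dstCO1Size = d.coutBlocks * d.howoRound * 16;
    return d;
}
```

For the instantiation at the end of the example (int8_t feature map, Cout = 128, dilation 1), this gives m = 9, k = 128, n = 128, howoRound = 16, dstSize = 1152, and dstCO1Size = 2048.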

Data transfer with on-the-fly format conversion along the CO1->GM path for the Atlas 200I/500 A2 inference products

#ifdef ASCENDC_CPU_DEBUG
#include "tikicpulib.h"
#endif
#include "kernel_operator.h"
#include "../../instrs/common_utils/register_utils.h"
template <typename dst_T, typename fmap_T, typename weight_T, typename dstCO1_T> class KernelCubeDataCopy{
public:
    __aicore__ inline KernelCubeDataCopy(uint16_t CoutIn, uint8_t dilationHIn, uint8_t dilationWIn, QuantMode_t deqModeIn)
    {
        Cout = CoutIn;
        dilationH = dilationHIn;
        dilationW = dilationWIn;
        C0 = 32 / sizeof(fmap_T);
        C1 = channelSize / C0;
        // Number of 16-channel output blocks (Cout rounded up to a multiple of 16).
        coutBlocks = (Cout + 16 - 1) / 16;
        ho = H - dilationH * (Kh - 1);
        wo = W - dilationW * (Kw - 1);
        howo = ho * wo;
        howoRound = ((howo + 16 - 1) / 16) * 16;
        featureMapA1Size = C1 * H * W * C0;      // shape: [C1, H, W, C0]
        weightA1Size = C1 * Kh * Kw * Cout * C0; // shape: [C1, Kh, Kw, Cout, C0]
        featureMapA2Size = howoRound * (C1 * Kh * Kw * C0);
        weightB2Size = (C1 * Kh * Kw * C0) * coutBlocks * 16;
        m = howo;
        k = C1 * Kh * Kw * C0;
        n = Cout;
        biasSize = Cout;                  // shape: [Cout]
        dstSize = coutBlocks * howo * 16; // shape: [coutBlocks, howo, 16]
        dstCO1Size = coutBlocks * howoRound * 16;
        fmRepeat = featureMapA2Size / (16 * C0);
        weRepeat = weightB2Size / (16 * C0);
        deqMode = deqModeIn;
    }
    __aicore__ inline void Init(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm, __gm__ uint8_t* biasGm, __gm__ uint8_t* deqGm, __gm__ uint8_t* eleWiseGm, __gm__ uint8_t* dstGm)
    {
        fmGlobal.SetGlobalBuffer((__gm__ fmap_T*)fmGm);
        weGlobal.SetGlobalBuffer((__gm__ weight_T*)weGm);
        biasGlobal.SetGlobalBuffer((__gm__ dstCO1_T*)biasGm);
        deqGlobal.SetGlobalBuffer((__gm__ uint64_t*)deqGm);
        dstGlobal.SetGlobalBuffer((__gm__ dst_T*)dstGm);
        eleWiseGlobal.SetGlobalBuffer((__gm__ half*)eleWiseGm);
        pipe.InitBuffer(inQueueFmA1, 1, featureMapA1Size * sizeof(fmap_T));
        pipe.InitBuffer(inQueueFmA2, 1, featureMapA2Size * sizeof(fmap_T));
        pipe.InitBuffer(inQueueWeB1, 1, weightA1Size * sizeof(weight_T));
        pipe.InitBuffer(inQueueWeB2, 1, weightB2Size * sizeof(weight_T));
        pipe.InitBuffer(inQueueBiasA1, 1, biasSize * sizeof(dstCO1_T));
        pipe.InitBuffer(inQueueDeqA1, 1, dstCO1Size * sizeof(uint64_t));
        pipe.InitBuffer(inQueueDeqFB, 1, dstCO1Size * sizeof(uint64_t));
        pipe.InitBuffer(outQueueCO1, 1, dstCO1Size * sizeof(dstCO1_T));
        pipe.InitBuffer(inQueueC1, 1, dstSize * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Split();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<fmap_T> featureMapA1 = inQueueFmA1.AllocTensor<fmap_T>();
        AscendC::LocalTensor<weight_T> weightB1 = inQueueWeB1.AllocTensor<weight_T>();
        AscendC::LocalTensor<dstCO1_T> biasA1 = inQueueBiasA1.AllocTensor<dstCO1_T>();
        AscendC::DataCopy(featureMapA1, fmGlobal, { 1, static_cast<uint16_t>(featureMapA1Size * sizeof(fmap_T) / 32), 0, 0 });
        AscendC::DataCopy(weightB1, weGlobal, { 1, static_cast<uint16_t>(weightA1Size * sizeof(weight_T) / 32), 0, 0 });
        AscendC::DataCopy(biasA1, biasGlobal, { 1, static_cast<uint16_t>(biasSize * sizeof(dstCO1_T) / 32), 0, 0 });
        inQueueFmA1.EnQue(featureMapA1);
        inQueueWeB1.EnQue(weightB1);
        inQueueBiasA1.EnQue(biasA1);
    }
    __aicore__ inline void Split()
    {
        AscendC::LocalTensor<fmap_T> featureMapA1 = inQueueFmA1.DeQue<fmap_T>();
        AscendC::LocalTensor<weight_T> weightB1 = inQueueWeB1.DeQue<weight_T>();
        AscendC::LocalTensor<fmap_T> featureMapA2 = inQueueFmA2.AllocTensor<fmap_T>();
        AscendC::LocalTensor<weight_T> weightB2 = inQueueWeB2.AllocTensor<weight_T>();
        uint8_t padList[] = {0, 0, 0, 0};
        // load3dv2
        AscendC::LoadData(featureMapA2, featureMapA1, { padList, H, W, channelSize, k, howoRound, 0, 0, 1, 1, Kw, Kh, dilationW, dilationH, false, false, 0 });
        // load2d
        AscendC::LoadData(weightB2, weightB1, { 0, weRepeat, 1, 0, 0, false, 0 });
        inQueueFmA2.EnQue<fmap_T>(featureMapA2);
        inQueueWeB2.EnQue<weight_T>(weightB2);
        inQueueFmA1.FreeTensor(featureMapA1);
        inQueueWeB1.FreeTensor(weightB1);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<fmap_T> featureMapA2 = inQueueFmA2.DeQue<fmap_T>();
        AscendC::LocalTensor<weight_T> weightB2 = inQueueWeB2.DeQue<weight_T>();
        AscendC::LocalTensor<dstCO1_T> dstCO1 = outQueueCO1.AllocTensor<dstCO1_T>();
        AscendC::LocalTensor<dstCO1_T> biasA1 = inQueueBiasA1.DeQue<dstCO1_T>();
        // C = A * B + bias
        // m: height of the left matrix; k: width of the left matrix; n: width of the right matrix
        AscendC::Mmad(dstCO1, featureMapA2, weightB2, biasA1, { m, n, k, true, 0, false, false, false });
        outQueueCO1.EnQue<dstCO1_T>(dstCO1);
        inQueueFmA2.FreeTensor(featureMapA2);
        inQueueWeB2.FreeTensor(weightB2);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<dstCO1_T> dstCO1 = outQueueCO1.DeQue<dstCO1_T>();
        // Enable DEQF16 quantization and set the quantization parameter to 0.5.
        float tmp = (float)0.5;
        // Reinterpret the bit pattern of the float tmp as an integer and widen it to uint64_t (deqScalar).
        uint64_t deqScalar = static_cast<uint64_t>(*reinterpret_cast<int32_t*>(&tmp));
        bool nz2ndEn = false;
        // If NZ2ND is disabled, the value of nSize must be a multiple of 16.
        uint16_t nSize = coutBlocks * 16;
        uint16_t mSize = m;
        // The value of srcStride must be a multiple of 16.
        uint16_t srcStride = (m + 16 - 1) / 16 * 16;
        // If NZ2ND is disabled, dstStride is the head-to-head distance between data segments, in units of 32-byte data blocks.
        uint32_t dstStride = m * sizeof(dst_T) * 16 / 32;
        if (nz2ndEn) {
            // The number of ND matrices is 1. Set src_nd_stride and dst_nd_stride to 1.
            AscendC::SetFixpipeNz2ndFlag(1, 1, 1);
            // If NZ2ND is enabled, nSize can be a non-multiple of 16 and must be the same as n of Mmad.
            nSize = n;
            // If NZ2ND is enabled, dstStride indicates the stride between adjacent consecutive rows of the same ND matrix and is the same as n.
            dstStride = nSize;
        }
        // Disable ReLU and channelSplit.
        AscendC::DataCopyCO12DstParams intriParams(nSize, mSize, dstStride, srcStride, deqMode, 0, false, nz2ndEn);
       
        // mov l0c to gm, deq scalar quant
        AscendC::SetFixpipePreQuantFlag(deqScalar);  // Set the quantization parameter.
        AscendC::PipeBarrier<PIPE_FIX>();
        AscendC::DataCopy(dstGlobal, dstCO1, intriParams);
        // // mov l0c to gm, deq tensor quant
        // Additional global memory of the deq tensor needs to be allocated to transfer the value to workA1.
        // AscendC::LocalTensor<uint64_t> workA1 = inQueueDeqA1.AllocTensor<uint64_t>();
        // Size of the deq tensor
        // uint16_t deqSize = 128;
        // AscendC::DataCopy(workA1, deqGlobal, deqSize);
        // Address of the deq tensor on the fix
        // AscendC::LocalTensor<uint64_t> deqFB = inQueueDeqFB.AllocTensor<uint64_t>();
        // // l1->fix, burst_len unit is 128Bytes
        // uint16_t fbufBurstLen = deqSize / 128;
        // AscendC::DataCopyParams dataCopyParams(1, fbufBurstLen, 0, 0);
        // AscendC::DataCopy(deqFB, workA1, dataCopyParams);
        // Set the quantization tensor.
        // AscendC::SetFixPipeConfig(deqFB);
        // AscendC::PipeBarrier<PIPE_FIX>();
        // mov l0c to gm: Enable the ClipReLU operation after quantization.
        // intriParams.clipReluPre = 1; 
        // Set the value of ClipReLU in the register.
        // uint64_t clipReluVal = 0x3c00; // value 1, half
        // SetFixPipeClipRelu(clipReluVal);
        // mov l0c to gm: Perform element-wise addition after quantization.
        // intriParams.eltWiseOp = 1;
        // Additional global memory of the element-wise tensor needs to be allocated to transfer the value to eleWiseTensor.
        // AscendC::LocalTensor<half> eleWiseTensor = inQueueC1.AllocTensor<half>();
        // AscendC::DataCopy(eleWiseTensor, eleWiseGlobal, { 1, static_cast<uint16_t>(sizeof(half) * dstSize / 32), 0, 0 });
        // AscendC::PipeBarrier<PIPE_ALL>();
        // Set the address for storing the element-wise tensor to the register.
        // SetFixPipeAddr(eleWiseTensor, 1);

        // AscendC::DataCopy(dstGlobal, dstCO1, intriParams);
        // inQueueDeqA1.FreeTensor(workA1);
        // inQueueDeqFB.FreeTensor(deqFB);
        outQueueCO1.FreeTensor(dstCO1);
        // inQueueC1.FreeTensor(eleWiseTensor);
     }
private:
    AscendC::TPipe pipe;
    // feature map queue
    AscendC::TQue<AscendC::TPosition::A1, 1> inQueueFmA1;
    AscendC::TQue<AscendC::TPosition::A2, 1> inQueueFmA2;
    // weight queue
    AscendC::TQue<AscendC::TPosition::B1, 1> inQueueWeB1;
    AscendC::TQue<AscendC::TPosition::B2, 1> inQueueWeB2;
    // bias queue
    AscendC::TQue<AscendC::TPosition::A1, 1> inQueueBiasA1;
    // deq tensor queue
    AscendC::TQue<AscendC::TPosition::A1, 1> inQueueDeqA1;
    // fb dst of deq tensor
    AscendC::TQue<AscendC::TPosition::C2PIPE2GM, 1> inQueueDeqFB;
    // dst queue
    AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1;
    // element-wise tensor
    AscendC::TQue<AscendC::TPosition::C1, 1> inQueueC1;
    AscendC::GlobalTensor<fmap_T> fmGlobal;
    AscendC::GlobalTensor<weight_T> weGlobal;
    AscendC::GlobalTensor<dst_T> dstGlobal;
    AscendC::GlobalTensor<uint64_t> deqGlobal;
    AscendC::GlobalTensor<dstCO1_T> biasGlobal;
    AscendC::GlobalTensor<half> eleWiseGlobal;
    uint16_t channelSize = 32;
    uint16_t H = 4, W = 4;
    uint8_t Kh = 2, Kw = 2;
    uint16_t Cout;
    uint16_t C0, C1;
    uint8_t dilationH, dilationW;
    uint16_t coutBlocks, ho, wo, howo, howoRound;
    uint32_t featureMapA1Size, weightA1Size, featureMapA2Size, weightB2Size, biasSize, dstSize, dstCO1Size;
    uint16_t m, k, n;
    uint8_t fmRepeat, weRepeat;
    QuantMode_t deqMode = QuantMode_t::NoQuant;
};
#define KERNEL_CUBE_DATACOPY(dst_type, fmap_type, weight_type, dstCO1_type, CoutIn, dilationHIn, dilationWIn, deqModeIn)  \
    extern "C" __global__ __aicore__ void cube_datacopy_kernel_##fmap_type(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm,    \
        __gm__ uint8_t* biasGm, __gm__ uint8_t* deqGm, __gm__ uint8_t* eleWiseGm, __gm__ uint8_t* dstGm)                                             \
    {                                                                                                                     \
        if (g_coreType == AscendC::AIV) {                                                                                 \
            return;                                                                                                       \
        }                                                                                                                 \
        KernelCubeDataCopy<dst_type, fmap_type, weight_type, dstCO1_type> op(CoutIn, dilationHIn, dilationWIn,            \
            deqModeIn);                                                                                                   \
        op.Init(fmGm, weGm, biasGm, deqGm, eleWiseGm, dstGm);                                                                        \
        op.Process();                                                                                                     \
    }
KERNEL_CUBE_DATACOPY(half, int8_t, int8_t, int32_t, 128, 1, 1, QuantMode_t::DEQF16);
