Activation data transfer with channel-wise quantization

Applicability

Product

Supported/Unsupported

Atlas A3 training products / Atlas A3 inference products

Atlas A2 training products / Atlas A2 inference products

Atlas 200I/500 A2 inference products

Atlas inference product 's AI Core

x

Atlas inference product 's Vector Core

x

Atlas training products

x

Functions

Supports quantization and ReLU activation during data movement, and supports conversion from the NZ format to the ND format in the channel from the local memory to the global memory.

Prototype

  • Local Memory -> Global Memory, supporting quantization, ReLU activation, and other operations, and supporting the conversion from the NZ format to the ND format.
    1
    2
    template <typename T, typename U>
    __aicore__ inline void DataCopy(const GlobalTensor<T>& dst, const LocalTensor<U>& src, const DataCopyCO12DstParams& intriParams)
    
  • Local Memory -> Local Memory, supporting quantization, ReLU activation, and other operations.
    1
    2
    template <typename T, typename U>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const LocalTensor<U>& src, const DataCopyCO12DstParams& intriParams)
    

For details about the supported data paths and data types of each prototype, see Supported Channels and Data Types.

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Data type of the destination operand. For details about the supported data types, see Supported Channels and Data Types.

U

Data type of the source operand. For details about the supported data types, see Supported Channels and Data Types.

Table 2 Parameters

Parameter

Input/Output

Meaning

dst

Output

Destination operand of the LocalTensor or GlobalTensor type.

src

Input

Source operand of the LocalTensor type.

intriParams

Input

Data transfer parameter. The type is DataCopyCO12DstParams.

For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_data_copy.h. Replace ${INSTALL_DIR} with the actual CANN component directory.

Table 3 Parameters in the DataCopyCO12DstParams structure (C0 value: Generally, C0 = 16. When channel splitting is enabled, C0 = 8.)

Field

Meaning

nSize

Size in the horizontal direction of the source.

  • If the NZ2ND function is disabled, the value must be a multiple of C0. In this case, the number of data chunks is nSize/C0.
  • If the NZ2ND function is enabled, there are no restrictions.

mSize

Size in the vertical direction of the source.

  • If the NZ2ND function is disabled, the size of a data chunk is mSize x C0 elements.
  • If the NZ2ND function is enabled, the size of the NZ/ND matrix is mSize x nSize.

dstStride

  • NZ2ND disabled

    Stride between adjacent data chunks of dstLocal (head-to-head stride between adjacent data chunks). The value cannot be 0. The unit is DataBlock (32 bytes).

  • NZ2ND enabled

    Offset between adjacent rows in the same ND matrix of the dst (the offset between the header and the header). The value is not 0, and the unit is element.

srcStride

  • NZ2ND disabled

    Stride between adjacent data chunks of srcLocal (head-to-head stride between adjacent data chunks). The value must be a multiple of 16. Value range: srcStride∈[0, 65535]. The unit is C0_Size (C0 x sizeof(U), where U is the data type of the src).

  • NZ2ND enabled

    Head-to-head offset between adjacent z arrangements in the same NZ matrix of srcLocal. The value must be a multiple of 16. srcStride ∈ [0, 65535] (unit: C0_size)

quantPre

Used to control the quantization mode. It is of the QuantMode_t type and its definition is as follows. The default value is QuantMode_t::NoQuant, that is, the quantization function is disabled.

When it is configured for scalar quantization, SetFixpipePreQuantFlag needs to be called to set the scalar quantization parameters; when it is configured for tensor quantization, SetFixPipeConfig needs to be called to set the tensor quantization parameters.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
enum QuantMode_t
{
    NoQuant, // Quantization is disabled.
    F322F16, // float quantized to half, scalar quantized
    F322BF16, // float-to-bfloat16_t scalar quantization
    DEQF16, // int32_t quantized to half, scalar quantized.
    VDEQF16, // int32_t quantized to half, tensor quantized
    QF322B8_PRE, // float-to-int8_t/uint8_t scalar quantization
    VQF322B8_PRE, // float-to-int8_t/uint8_t tensor quantization
    REQ8, // int32_t quantized to int8_t/uint8_t, scalar quantized
    VREQ8, // int32_t quantized to int8_t/uint8_t, tensor quantized
};

reluPre

Used to configure the ReLU operation mode. The type is uint8_t, and the value is as follows:

  • 0: Disable ReLU
  • 1: Normal ReLU

channelSplit

The value is of the bool type, indicating whether to enable channel splitting. This parameter is valid for the dst of the float type.

  • false: disabled
  • true: enabled

nz2ndEn

The value is of the bool type, indicating whether to enable the NZ2ND format conversion. This parameter is valid only for the CO1 -> GM channel.

To enable the NZ2ND function, you need to call SetFixpipeNz2ndFlag to set format conversion.

  • false: disabled
  • true: enabled

clipReluPre

Whether to enable the ClipReLU operation. The type is uint8_t. The value 0 indicates that ClipReLU is disabled, and the value 1 indicates that ClipReLU is enabled. If this parameter is set to 1, SetFixPipeClipRelu needs to be called to set the maximum value of ClipReLU.

  • This operation is performed after path-associated quantization and can be used only after quantPre is configured. Currently, the supported quantization modes are F322F16, DEQF16, VDEQF16, QF322B8_PRE, VQF322B8_PRE, REQ8, and VREQ8.
  • This parameter is supported only by the Atlas 200I/500 A2 inference products .

eltWiseOp

Whether to enable the Elementwise operation and the operation mode. The Elementwise operation refers to that after the per-channel quantization, a LocalTensor can be added or subtracted to or from each element. The size of the LocalTensor is mSize * nSize. The parameters related to the LocalTensor address need to be set by calling SetFixPipeAddr.

The parameter type of eltWiseOp is uint8_t. The options are as follows:

  • 0: Elementwise disabled.
  • 1: Elementwise Addition
  • 2: Elementwise Subtraction

This parameter is supported only by the Atlas 200I/500 A2 inference products .

sid

Reserved parameter for future use.

Returns

None

Restrictions

None

Supported Channels and Data Types

The following data channels are expressed by the logical position TPosition and the corresponding physical channels are also specified. For details about the mapping between TPosition and physical memory, see Table 1.

Table 4 Specific channels and supported data types for Local Memory -> Global Memory

Model

Datapath

Source Operand

Destination Operand

Atlas A2 training products / Atlas A2 inference products

CO1 -> GM (L0C Buffer -> GM)

float

uint8_t, int8_t, half, bfloat16_t, float

int32_t

uint8_t, int8_t, half, int16_t, int32_t

Atlas A3 training products / Atlas A3 inference products

CO1 -> GM (L0C Buffer -> GM)

float

uint8_t, int8_t, half, bfloat16_t, float

int32_t

uint8_t, int8_t, half, int16_t, int32_t

Atlas 200I/500 A2 inference products

CO1 -> GM (L0C Buffer -> GM)

float

uint8_t, int8_t, half, bfloat16_t, float

int32_t

uint8_t, int8_t, half, int16_t, int32_t

Table 5 Specific channels and supported data types for Local Memory -> Local Memory

Model

Datapath

Source Operand

Destination Operand

Atlas A2 training products / Atlas A2 inference products

CO1 -> A1 (L0C Buffer -> L1 Buffer)

float

uint8_t, int8_t, half, bfloat16_t

int32_t

uint8_t, int8_t, half, int16_t

Atlas A3 training products / Atlas A3 inference products

CO1 -> A1 (L0C Buffer -> L1 Buffer)

float

uint8_t, int8_t, half, bfloat16_t

int32_t

uint8_t, int8_t, half, int16_t

Examples

  • Format conversion along data movements: CO1 -> A1 and CO1 -> GM
    Example: mmad contains matrix multiplication bias. The data types of the left and right matrices are int8_t, and the data type of the result matrix is int32_t. In DEQF16 quantization mode, the scalar quantization parameter is 0.5. The result computed by the MMAD is quantized from int32_t to half and then moved out.
      1
      2
      3
      4
      5
      6
      7
      8
      9
     10
     11
     12
     13
     14
     15
     16
     17
     18
     19
     20
     21
     22
     23
     24
     25
     26
     27
     28
     29
     30
     31
     32
     33
     34
     35
     36
     37
     38
     39
     40
     41
     42
     43
     44
     45
     46
     47
     48
     49
     50
     51
     52
     53
     54
     55
     56
     57
     58
     59
     60
     61
     62
     63
     64
     65
     66
     67
     68
     69
     70
     71
     72
     73
     74
     75
     76
     77
     78
     79
     80
     81
     82
     83
     84
     85
     86
     87
     88
     89
     90
     91
     92
     93
     94
     95
     96
     97
     98
     99
    100
    101
    102
    103
    104
    105
    106
    107
    108
    109
    110
    111
    112
    113
    114
    115
    116
    117
    118
    119
    120
    121
    122
    123
    124
    125
    126
    127
    128
    129
    130
    131
    132
    133
    134
    135
    136
    137
    138
    139
    140
    141
    142
    143
    144
    145
    146
    147
    148
    149
    150
    151
    152
    153
    154
    155
    156
    157
    158
    159
    160
    161
    162
    163
    164
    165
    166
    167
    168
    169
    170
    171
    172
    173
    174
    175
    176
    177
    178
    179
    180
    181
    182
    183
    184
    185
    186
    187
    188
    189
    190
    191
    192
    193
    194
    195
    196
    197
    198
    199
    200
    201
    202
    203
    204
    205
    206
    207
    208
    209
    210
    211
    212
    213
    214
    215
    216
    217
    218
    219
    220
    221
    #ifdef ASCENDC_CPU_DEBUG
    #include "tikicpulib.h"
    #endif
    #include "kernel_operator.h"
    #include "../../instrs/common_utils/register_utils.h"
    SET_G_CORE_TYPE_IS_AIC
    template <typename dst_T, typename fmap_T, typename weight_T, typename dstCO1_T> class KernelCubeDataCopy{
    public:
        __aicore__ inline KernelCubeDataCopy(uint16_t CoutIn, uint8_t dilationHIn, uint8_t dilationWIn, QuantMode_t deqModeIn)
        {
            // ceiling of 16
            Cout = CoutIn;
            dilationH = dilationHIn;
            dilationW = dilationWIn;
            C0 = 32 / sizeof(fmap_T);
            C1 = channelSize / C0;
            coutBlocks = (Cout + 16 - 1) / 16;
            ho = H - dilationH * (Kh - 1);
            wo = W - dilationW * (Kw - 1);
            howo = ho * wo;
            howoRound = ((howo + 16 - 1) / 16) * 16;
            featureMapA1Size = C1 * H * W * C0;      // shape: [C1, H, W, C0]
            weightA1Size = C1 * Kh * Kw * Cout * C0; // shape: [C1, Kh, Kw, Cout, C0]
            featureMapA2Size = howoRound * (C1 * Kh * Kw * C0);
            weightB2Size = (C1 * Kh * Kw * C0) * coutBlocks * 16;
            m = howo;
            k = C1 * Kh * Kw * C0;
            n = Cout;
            biasSize = Cout;                  // shape: [Cout]
            dstSize = coutBlocks * howo * 16; // shape: [coutBlocks, howo, 16]
            dstCO1Size = coutBlocks * howoRound * 16;
            fmRepeat = featureMapA2Size / (16 * C0);
            weRepeat = weightB2Size / (16 * C0);
            deqMode = deqModeIn;
        }
        __aicore__ inline void Init(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm, __gm__ uint8_t* biasGm, __gm__ uint8_t* deqGm, __gm__ uint8_t* dstGm)
        {
            fmGlobal.SetGlobalBuffer((__gm__ fmap_T*)fmGm);
            weGlobal.SetGlobalBuffer((__gm__ weight_T*)weGm);
            biasGlobal.SetGlobalBuffer((__gm__ dstCO1_T*)biasGm);
            deqGlobal.SetGlobalBuffer((__gm__ uint64_t*)deqGm);
            dstGlobal.SetGlobalBuffer((__gm__ dst_T*)dstGm);
            pipe.InitBuffer(inQueueFmA1, 1, featureMapA1Size * sizeof(fmap_T));
            pipe.InitBuffer(inQueueFmA2, 1, featureMapA2Size * sizeof(fmap_T));
            pipe.InitBuffer(inQueueWeB1, 1, weightA1Size * sizeof(weight_T));
            pipe.InitBuffer(inQueueWeB2, 1, weightB2Size * sizeof(weight_T));
            pipe.InitBuffer(inQueueBiasA1, 1, biasSize * sizeof(dstCO1_T));
            pipe.InitBuffer(inQueueDeqA1, 1, dstCO1Size * sizeof(uint64_t));
            pipe.InitBuffer(inQueueDeqFB, 1, dstCO1Size * sizeof(uint64_t));
            pipe.InitBuffer(outQueueCO1, 1, dstCO1Size * sizeof(dstCO1_T));
            pipe.InitBuffer(outQueueA1, 1, dstCO1Size * sizeof(dst_T));
         }
        __aicore__ inline void Process()
        {
            CopyIn();
            Split();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<fmap_T> featureMapA1 = inQueueFmA1.AllocTensor<fmap_T>();
            AscendC::LocalTensor<weight_T> weightB1 = inQueueWeB1.AllocTensor<weight_T>();
            AscendC::LocalTensor<dstCO1_T> biasA1 = inQueueBiasA1.AllocTensor<dstCO1_T>();
            AscendC::DataCopy(featureMapA1, fmGlobal, { 1, static_cast<uint16_t>(featureMapA1Size * sizeof(fmap_T) / 32), 0, 0 });
            AscendC::DataCopy(weightB1, weGlobal, { 1, static_cast<uint16_t>(weightA1Size * sizeof(weight_T) / 32), 0, 0 });
            AscendC::DataCopy(biasA1, biasGlobal, { 1, static_cast<uint16_t>(biasSize * sizeof(dstCO1_T) / 32), 0, 0 });
            inQueueFmA1.EnQue(featureMapA1);
            inQueueWeB1.EnQue(weightB1);
            inQueueBiasA1.EnQue(biasA1);
        }
        __aicore__ inline void Split()
        {
            AscendC::LocalTensor<fmap_T> featureMapA1 = inQueueFmA1.DeQue<fmap_T>();
            AscendC::LocalTensor<weight_T> weightB1 = inQueueWeB1.DeQue<weight_T>();
            AscendC::LocalTensor<fmap_T> featureMapA2 = inQueueFmA2.AllocTensor<fmap_T>();
            AscendC::LocalTensor<weight_T> weightB2 = inQueueWeB2.AllocTensor<weight_T>();
            uint8_t padList[] = {0, 0, 0, 0};
            // load3dv2
            AscendC::LoadData(featureMapA2, featureMapA1, { padList, H, W, channelSize, k, howoRound, 0, 0, 1, 1, Kw, Kh, dilationW, dilationH, false, false, 0 });
            // load2d
            AscendC::LoadData(weightB2, weightB1, { 0, weRepeat, 1, 0, 0, false, 0 });
            inQueueFmA2.EnQue<fmap_T>(featureMapA2);
            inQueueWeB2.EnQue<weight_T>(weightB2);
            inQueueFmA1.FreeTensor(featureMapA1);
            inQueueWeB1.FreeTensor(weightB1);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<fmap_T> featureMapA2 = inQueueFmA2.DeQue<fmap_T>();
            AscendC::LocalTensor<weight_T> weightB2 = inQueueWeB2.DeQue<weight_T>();
            AscendC::LocalTensor<dstCO1_T> dstCO1 = outQueueCO1.AllocTensor<dstCO1_T>();
            AscendC::LocalTensor<dstCO1_T> biasA1 = inQueueBiasA1.DeQue<dstCO1_T>();
            // C = A * B + bias
            // m: height of the left matrix; k: width of the left matrix; n: width of the right matrix
            AscendC::Mmad(dstCO1, featureMapA2, weightB2, biasA1, { m, n, k, true, 0, false, false, false });
            outQueueCO1.EnQue<dstCO1_T>(dstCO1);
            inQueueFmA2.FreeTensor(featureMapA2);
            inQueueWeB2.FreeTensor(weightB2);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<dstCO1_T> dstCO1 = outQueueCO1.DeQue<dstCO1_T>();
            AscendC::LocalTensor<dst_T> dstA1 = outQueueA1.DeQue<dst_T>();
            // Enable DEQF16 quantization and set the quantization parameter to 0.5.
            float tmp = (float)0.5;
            // Convert tmp of float to deqScalar of uint64_t.
            uint64_t deqScalar = static_cast<uint64_t>(*reinterpret_cast<int32_t*>(&tmp));
            bool nz2ndEn = false;
            // If NZ2ND is disabled, the value of nSize must be a multiple of 16.
            uint16_t nSize = coutBlocks * 16;
            uint16_t mSize = m;
            // The value of srcStride must be a multiple of 16.
            uint16_t srcStride = (m + 16 - 1) / 16 * 16;
            // When NZ2ND is disabled, dstStride is the head-to-head distance between bursts and is 32-byte aligned.
            uint32_t dstStride = m * sizeof(dst_T) * 16 / 32;
            if (nz2ndEn) {
                // The number of ND matrices is 1. Set src_nd_stride and dst_nd_stride to 1.
                AscendC::SetFixpipeNz2ndFlag(1, 1, 1);
                // When NZ2ND is enabled, nSize may not be a multiple of 16 but must be the same as n of Mmad.
                nSize = n;
                // When NZ2ND is enabled, dstStride indicates the stride between adjacent consecutive rows of the same ND matrix and is the same as n.
                dstStride = nSize;
            };
            // Disable ReLU and channelSplit.
            AscendC::DataCopyCO12DstParams intriParams(nSize, mSize, dstStride, srcStride, deqMode, 0, false, nz2ndEn);
          
            // mov l0c to gm, deq scalar quant
            AscendC::SetFixpipePreQuantFlag(deqScalar);  // Set the quantization parameter.
            AscendC::PipeBarrier<PIPE_FIX>();
            AscendC::DataCopy(dstGlobal, dstCO1, intriParams);
            // // mov l0c to gm, deq tensor quant
            // // Additional gm space of the deq tensor needs to be allocated to move the value to workA1.
            // AscendC::LocalTensor<uint64_t> workA1 = inQueueDeqA1.AllocTensor<uint64_t>();
            // // Size of the deq tensor
            // uint16_t deqSize = 128;
            // AscendC::DataCopy(workA1, deqGlobal, deqSize);
            // // Address of the deq tensor on the fix
            // AscendC::LocalTensor<uint64_t> deqFB = inQueueDeqFB.AllocTensor<uint64_t>();
            // // l1->fix, burst_len unit is 128Bytes
            // uint16_t fbufBurstLen = deqSize / 128;
            // AscendC::DataCopyParams dataCopyParams(1, fbufBurstLen, 0, 0);
            // AscendC::DataCopy(deqFB, workA1, dataCopyParams);
            // // Set the quantization tensor.
            // AscendC::SetFixPipeConfig(deqFB);
            // AscendC::PipeBarrier<PIPE_FIX>();
            // AscendC::DataCopy(dstGlobal, dstCO1, intriParams);
            // inQueueDeqA1.FreeTensor(workA1);
            // inQueueDeqFB.FreeTensor(deqFB);
            // // mov l0c to l1, deq scalar quant, and then mov l1 to gm
            // AscendC::SetFixpipePreQuantFlag(deqScalar);  // Set the quantization parameter.
            // AscendC::PipeBarrier<PIPE_FIX>();
            // AscendC::DataCopy(dstA1, dstCO1, intriParams);
            // AscendC::DataCopy(dstGlobal, dstA1, dstCO1Size);
            // // mov l0c to l1, deq tensor quant, and then mov l1 to gm
            // AscendC::LocalTensor<uint64_t> workA1 = inQueueDeqA1.AllocTensor<uint64_t>();
            // uint16_t deqSize = 128;
            // AscendC::DataCopy(workA1, deqGlobal, deqSize);
            // AscendC::LocalTensor<uint64_t> deqFB = inQueueDeqFB.AllocTensor<uint64_t>();
            // uint16_t fbufBurstLen = deqSize / 128;
            // AscendC::DataCopyParams dataCopyParams(1, fbufBurstLen, 0, 0);
            // AscendC::DataCopy(deqFB, workA1, dataCopyParams);
            // // Set the quantization tensor.
            // AscendC::SetFixPipeConfig(deqFB);
            // AscendC::PipeBarrier<PIPE_FIX>();
            // AscendC::DataCopy(dstA1, dstCO1, intriParams);
            // AscendC::DataCopy(dstGlobal, dstA1, dstCO1Size);
            // inQueueDeqA1.FreeTensor(workA1);
            // inQueueDeqFB.FreeTensor(deqFB);
            // outQueueCO1.FreeTensor(dstCO1);
            // outQueueA1.FreeTensor(dstA1);
        }
    private:
        AscendC::TPipe pipe;
        // feature map queue
        AscendC::TQue<AscendC::TPosition::A1, 1> inQueueFmA1;
        AscendC::TQue<AscendC::TPosition::A2, 1> inQueueFmA2;
        // weight queue
        AscendC::TQue<AscendC::TPosition::B1, 1> inQueueWeB1;
        AscendC::TQue<AscendC::TPosition::B2, 1> inQueueWeB2;
        // bias queue
        AscendC::TQue<AscendC::TPosition::A1, 1> inQueueBiasA1;
        // deq tensor queue
        AscendC::TQue<AscendC::TPosition::A1, 1> inQueueDeqA1;
        // fb dst of deq tensor
        AscendC::TQue<AscendC::TPosition::C2PIPE2GM, 1> inQueueDeqFB;
        // dst queue
        AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1;
        AscendC::TQue<AscendC::TPosition::A1, 1> outQueueA1;
        AscendC::GlobalTensor<fmap_T> fmGlobal;
        AscendC::GlobalTensor<weight_T> weGlobal;
        AscendC::GlobalTensor<dst_T> dstGlobal;
        AscendC::GlobalTensor<uint64_t> deqGlobal;
        AscendC::GlobalTensor<dstCO1_T> biasGlobal;
        AscendC::GlobalTensor<half> eleWiseGlobal;
        uint16_t channelSize = 32;
        uint16_t H = 4, W = 4;
        uint8_t Kh = 2, Kw = 2;
        uint16_t Cout;
        uint16_t C0, C1;
        uint8_t dilationH, dilationW;
        uint16_t coutBlocks, ho, wo, howo, howoRound;
        uint32_t featureMapA1Size, weightA1Size, featureMapA2Size, weightB2Size, biasSize, dstSize, dstCO1Size;
        uint16_t m, k, n;
        uint8_t fmRepeat, weRepeat;
        QuantMode_t deqMode = QuantMode_t::NoQuant;
    };
    #define KERNEL_CUBE_DATACOPY(dst_type, fmap_type, weight_type, dstCO1_type, CoutIn, dilationHIn, dilationWIn, deqModeIn)  \
        extern "C" __global__ __aicore__ void cube_datacopy_kernel_##fmap_type(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm,    \
            __gm__ uint8_t* biasGm, __gm__ uint8_t* deqGm, __gm__ uint8_t* dstGm)                                             \
        {                                                                                                                     \
            if (g_coreType == AscendC::AIV) {                                                                                 \
                return;                                                                                                       \
            }                                                                                                                 \
            KernelCubeDataCopy<dst_type, fmap_type, weight_type, dstCO1_type> op(CoutIn, dilationHIn, dilationWIn,            \
                deqModeIn);                                                                                                   \
            op.Init(fmGm, weGm, biasGm, deqGm, dstGm);                                                                        \
            op.Process();                                                                                                     \
        }
    KERNEL_CUBE_DATACOPY(half, int8_t, int8_t, int32_t, 128, 1, 1, QuantMode_t::DEQF16);
    
  • For Atlas 200I/500 A2 inference products , format conversion is performed during data movement and the path is CO1->GM.
    Example: mmad contains matrix multiplication bias. The data types of the left and right matrices are int8_t, and the data type of the result matrix is int32_t. In DEQF16 quantization mode, the scalar quantization parameter is 0.5. The result computed by the MMAD is quantized from int32_t to half and then moved out.
      1
      2
      3
      4
      5
      6
      7
      8
      9
     10
     11
     12
     13
     14
     15
     16
     17
     18
     19
     20
     21
     22
     23
     24
     25
     26
     27
     28
     29
     30
     31
     32
     33
     34
     35
     36
     37
     38
     39
     40
     41
     42
     43
     44
     45
     46
     47
     48
     49
     50
     51
     52
     53
     54
     55
     56
     57
     58
     59
     60
     61
     62
     63
     64
     65
     66
     67
     68
     69
     70
     71
     72
     73
     74
     75
     76
     77
     78
     79
     80
     81
     82
     83
     84
     85
     86
     87
     88
     89
     90
     91
     92
     93
     94
     95
     96
     97
     98
     99
    100
    101
    102
    103
    104
    105
    106
    107
    108
    109
    110
    111
    112
    113
    114
    115
    116
    117
    118
    119
    120
    121
    122
    123
    124
    125
    126
    127
    128
    129
    130
    131
    132
    133
    134
    135
    136
    137
    138
    139
    140
    141
    142
    143
    144
    145
    146
    147
    148
    149
    150
    151
    152
    153
    154
    155
    156
    157
    158
    159
    160
    161
    162
    163
    164
    165
    166
    167
    168
    169
    170
    171
    172
    173
    174
    175
    176
    177
    178
    179
    180
    181
    182
    183
    184
    185
    186
    187
    188
    189
    190
    191
    192
    193
    194
    195
    196
    197
    198
    199
    200
    201
    202
    203
    204
    205
    206
    207
    208
    209
    210
    211
    212
    213
    214
    215
    #ifdef ASCENDC_CPU_DEBUG
    #include "tikicpulib.h"
    #endif
    #include "kernel_operator.h"
    #include "../../instrs/common_utils/register_utils.h"
    template <typename dst_T, typename fmap_T, typename weight_T, typename dstCO1_T> class KernelCubeDataCopy{
    public:
        __aicore__ inline KernelCubeDataCopy(uint16_t CoutIn, uint8_t dilationHIn, uint8_t dilationWIn, QuantMode_t deqModeIn)
        {
            // ceiling of 16
            Cout = CoutIn;
            dilationH = dilationHIn;
            dilationW = dilationWIn;
            C0 = 32 / sizeof(fmap_T);
            C1 = channelSize / C0;
            coutBlocks = (Cout + 16 - 1) / 16;
            ho = H - dilationH * (Kh - 1);
            wo = W - dilationW * (Kw - 1);
            howo = ho * wo;
            howoRound = ((howo + 16 - 1) / 16) * 16;
            featureMapA1Size = C1 * H * W * C0;      // shape: [C1, H, W, C0]
            weightA1Size = C1 * Kh * Kw * Cout * C0; // shape: [C1, Kh, Kw, Cout, C0]
            featureMapA2Size = howoRound * (C1 * Kh * Kw * C0);
            weightB2Size = (C1 * Kh * Kw * C0) * coutBlocks * 16;
            m = howo;
            k = C1 * Kh * Kw * C0;
            n = Cout;
            biasSize = Cout;                  // shape: [Cout]
            dstSize = coutBlocks * howo * 16; // shape: [coutBlocks, howo, 16]
            dstCO1Size = coutBlocks * howoRound * 16;
            fmRepeat = featureMapA2Size / (16 * C0);
            weRepeat = weightB2Size / (16 * C0);
            deqMode = deqModeIn;
        }
        __aicore__ inline void Init(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm, __gm__ uint8_t* biasGm, __gm__ uint8_t* deqGm, __gm__ uint8_t* eleWiseGm, __gm__ uint8_t* dstGm)
        {
            fmGlobal.SetGlobalBuffer((__gm__ fmap_T*)fmGm);
            weGlobal.SetGlobalBuffer((__gm__ weight_T*)weGm);
            biasGlobal.SetGlobalBuffer((__gm__ dstCO1_T*)biasGm);
            deqGlobal.SetGlobalBuffer((__gm__ uint64_t*)deqGm);
            dstGlobal.SetGlobalBuffer((__gm__ dst_T*)dstGm);
            eleWiseGlobal.SetGlobalBuffer((__gm__ half*)eleWiseGm);
            pipe.InitBuffer(inQueueFmA1, 1, featureMapA1Size * sizeof(fmap_T));
            pipe.InitBuffer(inQueueFmA2, 1, featureMapA2Size * sizeof(fmap_T));
            pipe.InitBuffer(inQueueWeB1, 1, weightA1Size * sizeof(weight_T));
            pipe.InitBuffer(inQueueWeB2, 1, weightB2Size * sizeof(weight_T));
            pipe.InitBuffer(inQueueBiasA1, 1, biasSize * sizeof(dstCO1_T));
            pipe.InitBuffer(inQueueDeqA1, 1, dstCO1Size * sizeof(uint64_t));
            pipe.InitBuffer(inQueueDeqFB, 1, dstCO1Size * sizeof(uint64_t));
            pipe.InitBuffer(outQueueCO1, 1, dstCO1Size * sizeof(dstCO1_T));
            pipe.InitBuffer(inQueueC1, 1, dstSize * sizeof(half));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Split();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<fmap_T> featureMapA1 = inQueueFmA1.AllocTensor<fmap_T>();
            AscendC::LocalTensor<weight_T> weightB1 = inQueueWeB1.AllocTensor<weight_T>();
            AscendC::LocalTensor<dstCO1_T> biasA1 = inQueueBiasA1.AllocTensor<dstCO1_T>();
            AscendC::DataCopy(featureMapA1, fmGlobal, { 1, static_cast<uint16_t>(featureMapA1Size * sizeof(fmap_T) / 32), 0, 0 });
            AscendC::DataCopy(weightB1, weGlobal, { 1, static_cast<uint16_t>(weightA1Size * sizeof(weight_T) / 32), 0, 0 });
            AscendC::DataCopy(biasA1, biasGlobal, { 1, static_cast<uint16_t>(biasSize * sizeof(dstCO1_T) / 32), 0, 0 });
            inQueueFmA1.EnQue(featureMapA1);
            inQueueWeB1.EnQue(weightB1);
            inQueueBiasA1.EnQue(biasA1);
        }
        __aicore__ inline void Split()
        {
            AscendC::LocalTensor<fmap_T> featureMapA1 = inQueueFmA1.DeQue<fmap_T>();
            AscendC::LocalTensor<weight_T> weightB1 = inQueueWeB1.DeQue<weight_T>();
            AscendC::LocalTensor<fmap_T> featureMapA2 = inQueueFmA2.AllocTensor<fmap_T>();
            AscendC::LocalTensor<weight_T> weightB2 = inQueueWeB2.AllocTensor<weight_T>();
            uint8_t padList[] = {0, 0, 0, 0};
            // load3dv2
            AscendC::LoadData(featureMapA2, featureMapA1, { padList, H, W, channelSize, k, howoRound, 0, 0, 1, 1, Kw, Kh, dilationW, dilationH, false, false, 0 });
            // load2d
            AscendC::LoadData(weightB2, weightB1, { 0, weRepeat, 1, 0, 0, false, 0 });
            inQueueFmA2.EnQue<fmap_T>(featureMapA2);
            inQueueWeB2.EnQue<weight_T>(weightB2);
            inQueueFmA1.FreeTensor(featureMapA1);
            inQueueWeB1.FreeTensor(weightB1);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<fmap_T> featureMapA2 = inQueueFmA2.DeQue<fmap_T>();
            AscendC::LocalTensor<weight_T> weightB2 = inQueueWeB2.DeQue<weight_T>();
            AscendC::LocalTensor<dstCO1_T> dstCO1 = outQueueCO1.AllocTensor<dstCO1_T>();
            AscendC::LocalTensor<dstCO1_T> biasA1 = inQueueBiasA1.DeQue<dstCO1_T>();
            // C = A * B + bias
            // m: height of the left matrix; k: width of the left matrix; n: width of the right matrix
            AscendC::Mmad(dstCO1, featureMapA2, weightB2, biasA1, { m, n, k, true, 0, false, false, false });
            outQueueCO1.EnQue<dstCO1_T>(dstCO1);
            inQueueFmA2.FreeTensor(featureMapA2);
            inQueueWeB2.FreeTensor(weightB2);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<dstCO1_T> dstCO1 = outQueueCO1.DeQue<dstCO1_T>();
            // Enable DEQF16 quantization and set the quantization parameter to 0.5.
            float tmp = (float)0.5;
            // Convert tmp of float to deqScalar of uint64_t.
            uint64_t deqScalar = static_cast<uint64_t>(*reinterpret_cast<int32_t*>(&tmp));
            bool nz2ndEn = false;
            // If NZ2ND is disabled, the value of nSize must be a multiple of 16.
            uint16_t nSize = coutBlocks * 16;
            uint16_t mSize = m;
            // The value of srcStride must be a multiple of 16.
            uint16_t srcStride = (m + 16 - 1) / 16 * 16;
            // When NZ2ND is disabled, dstStride is the head-to-head distance between bursts and is 32-byte aligned.
            uint32_t dstStride = m * sizeof(dst_T) * 16 / 32;
            if (nz2ndEn) {
                // The number of ND matrices is 1. Set src_nd_stride and dst_nd_stride to 1.
                AscendC::SetFixpipeNz2ndFlag(1, 1, 1);
                // When NZ2ND is enabled, nSize may not be a multiple of 16 but must be the same as n of Mmad.
                nSize = n;
                // When NZ2ND is enabled, dstStride indicates the stride between adjacent consecutive rows of the same ND matrix and is the same as n.
                dstStride = nSize;
            };
            // Disable ReLU and channelSplit.
            AscendC::DataCopyCO12DstParams intriParams(nSize, mSize, dstStride, srcStride, deqMode, 0, false, nz2ndEn);
           
            // mov l0c to gm, deq scalar quant
            AscendC::SetFixpipePreQuantFlag(deqScalar);  // Set the quantization parameter.
            AscendC::PipeBarrier<PIPE_FIX>();
            AscendC::DataCopy(dstGlobal, dstCO1, intriParams);
            // // mov l0c to gm, deq tensor quant
            // // Additional gm space of the deq tensor needs to be allocated to move the value to workA1.
            // AscendC::LocalTensor<uint64_t> workA1 = inQueueDeqA1.AllocTensor<uint64_t>();
            // // Size of the deq tensor
            // uint16_t deqSize = 128;
            // AscendC::DataCopy(workA1, deqGlobal, deqSize);
            // // Address of the deq tensor on the fix
            // AscendC::LocalTensor<uint64_t> deqFB = inQueueDeqFB.AllocTensor<uint64_t>();
            // // l1->fix, burst_len unit is 128Bytes
            // uint16_t fbufBurstLen = deqSize / 128;
            // AscendC::DataCopyParams dataCopyParams(1, fbufBurstLen, 0, 0);
            // AscendC::DataCopy(deqFB, workA1, dataCopyParams);
            // // Set the quantization tensor.
            // AscendC::SetFixPipeConfig(deqFB);
            // AscendC::PipeBarrier<PIPE_FIX>();
            // // mov l0c to gm, enable the ClipReLU operation after the quantization operation
            // intriParams.clipReluPre = 1; 
            // // Set the value of ClipReLU in the register.
            // uint64_t clipReluVal = 0x3c00; // value 1, half
            // SetFixPipeClipRelu(clipReluVal);
            // //mov l0c to gm, set element-wise operation after the quantization operation, and add a LocalTensor element by element
            // intriParams.eltWiseOp = 1;
            // // Additional gm space of the element-wise tensor needs to be allocated to move the value to eleWiseTensor.
            // AscendC::LocalTensor<half> eleWiseTensor = inQueueC1.AllocTensor<half>();
            // DataCopy(eleWiseTensor, eleWiseGlobal, { 1, static_cast<uint16_t>(sizeof(half) * dst_size / 32), 0, 0 });
            // AscendC::PipeBarrier<PIPE_ALL>();
            // // Set the address for storing the element-wise tensor to the register.
            // SetFixPipeAddr(eleWiseTensor, 1);
    
            // AscendC::DataCopy(dstGlobal, dstCO1, intriParams);
            // inQueueDeqA1.FreeTensor(workA1);
            // inQueueDeqFB.FreeTensor(deqFB);
            // outQueueCO1.FreeTensor(dstCO1);
            // inQueueC1.FreeTensor(eleWiseTensor);
         }
    private:
        AscendC::TPipe pipe;
        // feature map queue
        AscendC::TQue<AscendC::TPosition::A1, 1> inQueueFmA1;
        AscendC::TQue<AscendC::TPosition::A2, 1> inQueueFmA2;
        // weight queue
        AscendC::TQue<AscendC::TPosition::B1, 1> inQueueWeB1;
        AscendC::TQue<AscendC::TPosition::B2, 1> inQueueWeB2;
        // bias queue
        AscendC::TQue<AscendC::TPosition::A1, 1> inQueueBiasA1;
        // deq tensor queue
        AscendC::TQue<AscendC::TPosition::A1, 1> inQueueDeqA1;
        // fb dst of deq tensor
        AscendC::TQue<AscendC::TPosition::C2PIPE2GM, 1> inQueueDeqFB;
        // dst queue
        AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1;
        // element-wise tensor
        AscendC::TQue<AscendC::TPosition::C1, 1> inQueueC1;
        AscendC::GlobalTensor<fmap_T> fmGlobal;
        AscendC::GlobalTensor<weight_T> weGlobal;
        AscendC::GlobalTensor<dst_T> dstGlobal;
        AscendC::GlobalTensor<uint64_t> deqGlobal;
        AscendC::GlobalTensor<dstCO1_T> biasGlobal;
        AscendC::GlobalTensor<half> eleWiseGlobal;
        uint16_t channelSize = 32;
        uint16_t H = 4, W = 4;
        uint8_t Kh = 2, Kw = 2;
        uint16_t Cout;
        uint16_t C0, C1;
        uint8_t dilationH, dilationW;
        uint16_t coutBlocks, ho, wo, howo, howoRound;
        uint32_t featureMapA1Size, weightA1Size, featureMapA2Size, weightB2Size, biasSize, dstSize, dstCO1Size;
        uint16_t m, k, n;
        uint8_t fmRepeat, weRepeat;
        QuantMode_t deqMode = QuantMode_t::NoQuant;
    };
    #define KERNEL_CUBE_DATACOPY(dst_type, fmap_type, weight_type, dstCO1_type, CoutIn, dilationHIn, dilationWIn, deqModeIn)  \
        extern "C" __global__ __aicore__ void cube_datacopy_kernel_##fmap_type(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm,    \
            __gm__ uint8_t* biasGm, __gm__ uint8_t* deqGm, __gm__ uint8_t* eleWiseGm, __gm__ uint8_t* dstGm)                                             \
        {                                                                                                                     \
            if (g_coreType == AscendC::AIV) {                                                                                 \
                return;                                                                                                       \
            }                                                                                                                 \
            KernelCubeDataCopy<dst_type, fmap_type, weight_type, dstCO1_type> op(CoutIn, dilationHIn, dilationWIn,            \
                deqModeIn);                                                                                                   \
            op.Init(fmGm, weGm, biasGm, deqGm, eleWiseGm, dstGm);                                                                        \
            op.Process();                                                                                                     \
        }
    KERNEL_CUBE_DATACOPY(half, int8_t, int8_t, int32_t, 128, 1, 1, QuantMode_t::DEQF16);