Activation data transfer with channel-wise quantization
Applicability
|
Product |
Supported/Unsupported |
|---|---|
|
|
√ |
|
|
√ |
|
|
√ |
|
|
x |
|
|
x |
|
|
x |
Functions
Supports quantization and ReLU activation during data movement, and supports conversion from the NZ format to the ND format in the channel from the local memory to the global memory.
Prototype
- Local Memory -> Global Memory, supporting quantization, ReLU activation, and other operations, and supporting the conversion from the NZ format to the ND format.
1 2
template <typename T, typename U> __aicore__ inline void DataCopy(const GlobalTensor<T>& dst, const LocalTensor<U>& src, const DataCopyCO12DstParams& intriParams)
- Local Memory -> Local Memory, supporting quantization, ReLU activation, and other operations.
1 2
template <typename T, typename U> __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const LocalTensor<U>& src, const DataCopyCO12DstParams& intriParams)
For details about the supported data paths and data types of each prototype, see Supported Channels and Data Types.
Parameters
|
Parameter |
Description |
|---|---|
|
T |
Data type of the destination operand. For details about the supported data types, see Supported Channels and Data Types. |
|
U |
Data type of the source operand. For details about the supported data types, see Supported Channels and Data Types. |
|
Parameter |
Input/Output |
Meaning |
|---|---|---|
|
dst |
Output |
Destination operand of the LocalTensor or GlobalTensor type. |
|
src |
Input |
Source operand of the LocalTensor type. |
|
intriParams |
Input |
Data transfer parameter. The type is DataCopyCO12DstParams. For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_data_copy.h. Replace ${INSTALL_DIR} with the actual CANN component directory. |
|
Field |
Meaning |
||
|---|---|---|---|
|
nSize |
Size in the horizontal direction of the source.
|
||
|
mSize |
Size in the vertical direction of the source.
|
||
|
dstStride |
|||
|
srcStride |
|
||
|
quantPre |
Used to control the quantization mode. It is of the QuantMode_t type and its definition is as follows. The default value is QuantMode_t::NoQuant, that is, the quantization function is disabled.
When it is configured for scalar quantization, SetFixpipePreQuantFlag needs to be called to set the scalar quantization parameters; when it is configured for tensor quantization, SetFixPipeConfig needs to be called to set the tensor quantization parameters.
|
||
|
reluPre |
Used to configure the ReLU operation mode. The type is uint8_t, and the value is as follows:
|
||
|
channelSplit |
The value is of the bool type, indicating whether to enable channel splitting. This parameter is valid for the dst of the float type.
|
||
|
nz2ndEn |
The value is of the bool type, indicating whether to enable the NZ2ND format conversion. This parameter is valid only for the CO1 -> GM channel. To enable the NZ2ND function, you need to call SetFixpipeNz2ndFlag to set format conversion.
|
||
|
clipReluPre |
Whether to enable the ClipReLU operation. The type is uint8_t. The value 0 indicates that ClipReLU is disabled, and the value 1 indicates that ClipReLU is enabled. If this parameter is set to 1, SetFixPipeClipRelu needs to be called to set the maximum value of ClipReLU.
|
||
|
eltWiseOp |
Whether to enable the Elementwise operation and the operation mode. The Elementwise operation refers to that after the per-channel quantization, a LocalTensor can be added or subtracted to or from each element. The size of the LocalTensor is mSize * nSize. The parameters related to the LocalTensor address need to be set by calling SetFixPipeAddr. The parameter type of eltWiseOp is uint8_t. The options are as follows:
This parameter is supported only by the |
||
|
sid |
Reserved parameter for future use. |
Returns
None
Restrictions
None
Supported Channels and Data Types
The following data channels are expressed by the logical position TPosition and the corresponding physical channels are also specified. For details about the mapping between TPosition and physical memory, see Table 1.
|
Model |
Datapath |
Source Operand |
Destination Operand |
|---|---|---|---|
|
|
CO1 -> GM (L0C Buffer -> GM) |
float |
uint8_t, int8_t, half, bfloat16_t, float |
|
int32_t |
uint8_t, int8_t, half, int16_t, int32_t |
||
|
|
CO1 -> GM (L0C Buffer -> GM) |
float |
uint8_t, int8_t, half, bfloat16_t, float |
|
int32_t |
uint8_t, int8_t, half, int16_t, int32_t |
||
|
|
CO1 -> GM (L0C Buffer -> GM) |
float |
uint8_t, int8_t, half, bfloat16_t, float |
|
int32_t |
uint8_t, int8_t, half, int16_t, int32_t |
|
Model |
Datapath |
Source Operand |
Destination Operand |
|---|---|---|---|
|
|
CO1 -> A1 (L0C Buffer -> L1 Buffer) |
float |
uint8_t, int8_t, half, bfloat16_t |
|
int32_t |
uint8_t, int8_t, half, int16_t |
||
|
|
CO1 -> A1 (L0C Buffer -> L1 Buffer) |
float |
uint8_t, int8_t, half, bfloat16_t |
|
int32_t |
uint8_t, int8_t, half, int16_t |
Examples
- Format conversion along data movements: CO1 -> A1 and CO1 -> GM
Example: mmad contains matrix multiplication bias. The data types of the left and right matrices are int8_t, and the data type of the result matrix is int32_t. In DEQF16 quantization mode, the scalar quantization parameter is 0.5. The result computed by the MMAD is quantized from int32_t to half and then moved out.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221
#ifdef ASCENDC_CPU_DEBUG #include "tikicpulib.h" #endif #include "kernel_operator.h" #include "../../instrs/common_utils/register_utils.h" SET_G_CORE_TYPE_IS_AIC template <typename dst_T, typename fmap_T, typename weight_T, typename dstCO1_T> class KernelCubeDataCopy{ public: __aicore__ inline KernelCubeDataCopy(uint16_t CoutIn, uint8_t dilationHIn, uint8_t dilationWIn, QuantMode_t deqModeIn) { // ceiling of 16 Cout = CoutIn; dilationH = dilationHIn; dilationW = dilationWIn; C0 = 32 / sizeof(fmap_T); C1 = channelSize / C0; coutBlocks = (Cout + 16 - 1) / 16; ho = H - dilationH * (Kh - 1); wo = W - dilationW * (Kw - 1); howo = ho * wo; howoRound = ((howo + 16 - 1) / 16) * 16; featureMapA1Size = C1 * H * W * C0; // shape: [C1, H, W, C0] weightA1Size = C1 * Kh * Kw * Cout * C0; // shape: [C1, Kh, Kw, Cout, C0] featureMapA2Size = howoRound * (C1 * Kh * Kw * C0); weightB2Size = (C1 * Kh * Kw * C0) * coutBlocks * 16; m = howo; k = C1 * Kh * Kw * C0; n = Cout; biasSize = Cout; // shape: [Cout] dstSize = coutBlocks * howo * 16; // shape: [coutBlocks, howo, 16] dstCO1Size = coutBlocks * howoRound * 16; fmRepeat = featureMapA2Size / (16 * C0); weRepeat = weightB2Size / (16 * C0); deqMode = deqModeIn; } __aicore__ inline void Init(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm, __gm__ uint8_t* biasGm, __gm__ uint8_t* deqGm, __gm__ uint8_t* dstGm) { fmGlobal.SetGlobalBuffer((__gm__ fmap_T*)fmGm); weGlobal.SetGlobalBuffer((__gm__ weight_T*)weGm); biasGlobal.SetGlobalBuffer((__gm__ dstCO1_T*)biasGm); deqGlobal.SetGlobalBuffer((__gm__ uint64_t*)deqGm); dstGlobal.SetGlobalBuffer((__gm__ dst_T*)dstGm); pipe.InitBuffer(inQueueFmA1, 1, featureMapA1Size * sizeof(fmap_T)); pipe.InitBuffer(inQueueFmA2, 1, featureMapA2Size * sizeof(fmap_T)); pipe.InitBuffer(inQueueWeB1, 1, weightA1Size * sizeof(weight_T)); pipe.InitBuffer(inQueueWeB2, 1, weightB2Size * sizeof(weight_T)); pipe.InitBuffer(inQueueBiasA1, 1, biasSize * sizeof(dstCO1_T)); pipe.InitBuffer(inQueueDeqA1, 1, dstCO1Size * sizeof(uint64_t)); pipe.InitBuffer(inQueueDeqFB, 1, dstCO1Size * sizeof(uint64_t)); pipe.InitBuffer(outQueueCO1, 1, dstCO1Size * sizeof(dstCO1_T)); pipe.InitBuffer(outQueueA1, 1, dstCO1Size * sizeof(dst_T)); } __aicore__ inline void Process() { CopyIn(); Split(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<fmap_T> featureMapA1 = inQueueFmA1.AllocTensor<fmap_T>(); AscendC::LocalTensor<weight_T> weightB1 = inQueueWeB1.AllocTensor<weight_T>(); AscendC::LocalTensor<dstCO1_T> biasA1 = inQueueBiasA1.AllocTensor<dstCO1_T>(); AscendC::DataCopy(featureMapA1, fmGlobal, { 1, static_cast<uint16_t>(featureMapA1Size * sizeof(fmap_T) / 32), 0, 0 }); AscendC::DataCopy(weightB1, weGlobal, { 1, static_cast<uint16_t>(weightA1Size * sizeof(weight_T) / 32), 0, 0 }); AscendC::DataCopy(biasA1, biasGlobal, { 1, static_cast<uint16_t>(biasSize * sizeof(dstCO1_T) / 32), 0, 0 }); inQueueFmA1.EnQue(featureMapA1); inQueueWeB1.EnQue(weightB1); inQueueBiasA1.EnQue(biasA1); } __aicore__ inline void Split() { AscendC::LocalTensor<fmap_T> featureMapA1 = inQueueFmA1.DeQue<fmap_T>(); AscendC::LocalTensor<weight_T> weightB1 = inQueueWeB1.DeQue<weight_T>(); AscendC::LocalTensor<fmap_T> featureMapA2 = inQueueFmA2.AllocTensor<fmap_T>(); AscendC::LocalTensor<weight_T> weightB2 = inQueueWeB2.AllocTensor<weight_T>(); uint8_t padList[] = {0, 0, 0, 0}; // load3dv2 AscendC::LoadData(featureMapA2, featureMapA1, { padList, H, W, channelSize, k, howoRound, 0, 0, 1, 1, Kw, Kh, dilationW, dilationH, false, false, 0 }); // load2d AscendC::LoadData(weightB2, weightB1, { 0, weRepeat, 1, 0, 0, false, 0 }); inQueueFmA2.EnQue<fmap_T>(featureMapA2); inQueueWeB2.EnQue<weight_T>(weightB2); inQueueFmA1.FreeTensor(featureMapA1); inQueueWeB1.FreeTensor(weightB1); } __aicore__ inline void Compute() { AscendC::LocalTensor<fmap_T> featureMapA2 = inQueueFmA2.DeQue<fmap_T>(); AscendC::LocalTensor<weight_T> weightB2 = inQueueWeB2.DeQue<weight_T>(); AscendC::LocalTensor<dstCO1_T> dstCO1 = outQueueCO1.AllocTensor<dstCO1_T>(); AscendC::LocalTensor<dstCO1_T> biasA1 = inQueueBiasA1.DeQue<dstCO1_T>(); // C = A * B + bias // m: height of the left matrix; k: width of the left matrix; n: width of the right matrix AscendC::Mmad(dstCO1, featureMapA2, weightB2, biasA1, { m, n, k, true, 0, false, false, false }); outQueueCO1.EnQue<dstCO1_T>(dstCO1); inQueueFmA2.FreeTensor(featureMapA2); inQueueWeB2.FreeTensor(weightB2); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<dstCO1_T> dstCO1 = outQueueCO1.DeQue<dstCO1_T>(); AscendC::LocalTensor<dst_T> dstA1 = outQueueA1.DeQue<dst_T>(); // Enable DEQF16 quantization and set the quantization parameter to 0.5. float tmp = (float)0.5; // Convert tmp of float to deqScalar of uint64_t. uint64_t deqScalar = static_cast<uint64_t>(*reinterpret_cast<int32_t*>(&tmp)); bool nz2ndEn = false; // If NZ2ND is disabled, the value of nSize must be a multiple of 16. uint16_t nSize = coutBlocks * 16; uint16_t mSize = m; // The value of srcStride must be a multiple of 16. uint16_t srcStride = (m + 16 - 1) / 16 * 16; // When NZ2ND is disabled, dstStride is the head-to-head distance between bursts and is 32-byte aligned. uint32_t dstStride = m * sizeof(dst_T) * 16 / 32; if (nz2ndEn) { // The number of ND matrices is 1. Set src_nd_stride and dst_nd_stride to 1. AscendC::SetFixpipeNz2ndFlag(1, 1, 1); // When NZ2ND is enabled, nSize may not be a multiple of 16 but must be the same as n of Mmad. nSize = n; // When NZ2ND is enabled, dstStride indicates the stride between adjacent consecutive rows of the same ND matrix and is the same as n. dstStride = nSize; }; // Disable ReLU and channelSplit. AscendC::DataCopyCO12DstParams intriParams(nSize, mSize, dstStride, srcStride, deqMode, 0, false, nz2ndEn); // mov l0c to gm, deq scalar quant AscendC::SetFixpipePreQuantFlag(deqScalar); // Set the quantization parameter. AscendC::PipeBarrier<PIPE_FIX>(); AscendC::DataCopy(dstGlobal, dstCO1, intriParams); // // mov l0c to gm, deq tensor quant // // Additional gm space of the deq tensor needs to be allocated to move the value to workA1. // AscendC::LocalTensor<uint64_t> workA1 = inQueueDeqA1.AllocTensor<uint64_t>(); // // Size of the deq tensor // uint16_t deqSize = 128; // AscendC::DataCopy(workA1, deqGlobal, deqSize); // // Address of the deq tensor on the fix // AscendC::LocalTensor<uint64_t> deqFB = inQueueDeqFB.AllocTensor<uint64_t>(); // // l1->fix, burst_len unit is 128Bytes // uint16_t fbufBurstLen = deqSize / 128; // AscendC::DataCopyParams dataCopyParams(1, fbufBurstLen, 0, 0); // AscendC::DataCopy(deqFB, workA1, dataCopyParams); // // Set the quantization tensor. // AscendC::SetFixPipeConfig(deqFB); // AscendC::PipeBarrier<PIPE_FIX>(); // AscendC::DataCopy(dstGlobal, dstCO1, intriParams); // inQueueDeqA1.FreeTensor(workA1); // inQueueDeqFB.FreeTensor(deqFB); // // mov l0c to l1, deq scalar quant, and then mov l1 to gm // AscendC::SetFixpipePreQuantFlag(deqScalar); // Set the quantization parameter. // AscendC::PipeBarrier<PIPE_FIX>(); // AscendC::DataCopy(dstA1, dstCO1, intriParams); // AscendC::DataCopy(dstGlobal, dstA1, dstCO1Size); // // mov l0c to l1, deq tensor quant, and then mov l1 to gm // AscendC::LocalTensor<uint64_t> workA1 = inQueueDeqA1.AllocTensor<uint64_t>(); // uint16_t deqSize = 128; // AscendC::DataCopy(workA1, deqGlobal, deqSize); // AscendC::LocalTensor<uint64_t> deqFB = inQueueDeqFB.AllocTensor<uint64_t>(); // uint16_t fbufBurstLen = deqSize / 128; // AscendC::DataCopyParams dataCopyParams(1, fbufBurstLen, 0, 0); // AscendC::DataCopy(deqFB, workA1, dataCopyParams); // // Set the quantization tensor. // AscendC::SetFixPipeConfig(deqFB); // AscendC::PipeBarrier<PIPE_FIX>(); // AscendC::DataCopy(dstA1, dstCO1, intriParams); // AscendC::DataCopy(dstGlobal, dstA1, dstCO1Size); // inQueueDeqA1.FreeTensor(workA1); // inQueueDeqFB.FreeTensor(deqFB); // outQueueCO1.FreeTensor(dstCO1); // outQueueA1.FreeTensor(dstA1); } private: AscendC::TPipe pipe; // feature map queue AscendC::TQue<AscendC::TPosition::A1, 1> inQueueFmA1; AscendC::TQue<AscendC::TPosition::A2, 1> inQueueFmA2; // weight queue AscendC::TQue<AscendC::TPosition::B1, 1> inQueueWeB1; AscendC::TQue<AscendC::TPosition::B2, 1> inQueueWeB2; // bias queue AscendC::TQue<AscendC::TPosition::A1, 1> inQueueBiasA1; // deq tensor queue AscendC::TQue<AscendC::TPosition::A1, 1> inQueueDeqA1; // fb dst of deq tensor AscendC::TQue<AscendC::TPosition::C2PIPE2GM, 1> inQueueDeqFB; // dst queue AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1; AscendC::TQue<AscendC::TPosition::A1, 1> outQueueA1; AscendC::GlobalTensor<fmap_T> fmGlobal; AscendC::GlobalTensor<weight_T> weGlobal; AscendC::GlobalTensor<dst_T> dstGlobal; AscendC::GlobalTensor<uint64_t> deqGlobal; AscendC::GlobalTensor<dstCO1_T> biasGlobal; AscendC::GlobalTensor<half> eleWiseGlobal; uint16_t channelSize = 32; uint16_t H = 4, W = 4; uint8_t Kh = 2, Kw = 2; uint16_t Cout; uint16_t C0, C1; uint8_t dilationH, dilationW; uint16_t coutBlocks, ho, wo, howo, howoRound; uint32_t featureMapA1Size, weightA1Size, featureMapA2Size, weightB2Size, biasSize, dstSize, dstCO1Size; uint16_t m, k, n; uint8_t fmRepeat, weRepeat; QuantMode_t deqMode = QuantMode_t::NoQuant; }; #define KERNEL_CUBE_DATACOPY(dst_type, fmap_type, weight_type, dstCO1_type, CoutIn, dilationHIn, dilationWIn, deqModeIn) \ extern "C" __global__ __aicore__ void cube_datacopy_kernel_##fmap_type(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm, \ __gm__ uint8_t* biasGm, __gm__ uint8_t* deqGm, __gm__ uint8_t* dstGm) \ { \ if (g_coreType == AscendC::AIV) { \ return; \ } \ KernelCubeDataCopy<dst_type, fmap_type, weight_type, dstCO1_type> op(CoutIn, dilationHIn, dilationWIn, \ deqModeIn); \ op.Init(fmGm, weGm, biasGm, deqGm, dstGm); \ op.Process(); \ } KERNEL_CUBE_DATACOPY(half, int8_t, int8_t, int32_t, 128, 1, 1, QuantMode_t::DEQF16);
- For
Atlas 200I/500 A2 inference products , format conversion is performed during data movement and the path is CO1->GM.Example: mmad contains matrix multiplication bias. The data types of the left and right matrices are int8_t, and the data type of the result matrix is int32_t. In DEQF16 quantization mode, the scalar quantization parameter is 0.5. The result computed by the MMAD is quantized from int32_t to half and then moved out.1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
#ifdef ASCENDC_CPU_DEBUG #include "tikicpulib.h" #endif #include "kernel_operator.h" #include "../../instrs/common_utils/register_utils.h" template <typename dst_T, typename fmap_T, typename weight_T, typename dstCO1_T> class KernelCubeDataCopy{ public: __aicore__ inline KernelCubeDataCopy(uint16_t CoutIn, uint8_t dilationHIn, uint8_t dilationWIn, QuantMode_t deqModeIn) { // ceiling of 16 Cout = CoutIn; dilationH = dilationHIn; dilationW = dilationWIn; C0 = 32 / sizeof(fmap_T); C1 = channelSize / C0; coutBlocks = (Cout + 16 - 1) / 16; ho = H - dilationH * (Kh - 1); wo = W - dilationW * (Kw - 1); howo = ho * wo; howoRound = ((howo + 16 - 1) / 16) * 16; featureMapA1Size = C1 * H * W * C0; // shape: [C1, H, W, C0] weightA1Size = C1 * Kh * Kw * Cout * C0; // shape: [C1, Kh, Kw, Cout, C0] featureMapA2Size = howoRound * (C1 * Kh * Kw * C0); weightB2Size = (C1 * Kh * Kw * C0) * coutBlocks * 16; m = howo; k = C1 * Kh * Kw * C0; n = Cout; biasSize = Cout; // shape: [Cout] dstSize = coutBlocks * howo * 16; // shape: [coutBlocks, howo, 16] dstCO1Size = coutBlocks * howoRound * 16; fmRepeat = featureMapA2Size / (16 * C0); weRepeat = weightB2Size / (16 * C0); deqMode = deqModeIn; } __aicore__ inline void Init(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm, __gm__ uint8_t* biasGm, __gm__ uint8_t* deqGm, __gm__ uint8_t* eleWiseGm, __gm__ uint8_t* dstGm) { fmGlobal.SetGlobalBuffer((__gm__ fmap_T*)fmGm); weGlobal.SetGlobalBuffer((__gm__ weight_T*)weGm); biasGlobal.SetGlobalBuffer((__gm__ dstCO1_T*)biasGm); deqGlobal.SetGlobalBuffer((__gm__ uint64_t*)deqGm); dstGlobal.SetGlobalBuffer((__gm__ dst_T*)dstGm); eleWiseGlobal.SetGlobalBuffer((__gm__ half*)eleWiseGm); pipe.InitBuffer(inQueueFmA1, 1, featureMapA1Size * sizeof(fmap_T)); pipe.InitBuffer(inQueueFmA2, 1, featureMapA2Size * sizeof(fmap_T)); pipe.InitBuffer(inQueueWeB1, 1, weightA1Size * sizeof(weight_T)); pipe.InitBuffer(inQueueWeB2, 1, weightB2Size * sizeof(weight_T)); pipe.InitBuffer(inQueueBiasA1, 1, biasSize * sizeof(dstCO1_T)); pipe.InitBuffer(inQueueDeqA1, 1, dstCO1Size * sizeof(uint64_t)); pipe.InitBuffer(inQueueDeqFB, 1, dstCO1Size * sizeof(uint64_t)); pipe.InitBuffer(outQueueCO1, 1, dstCO1Size * sizeof(dstCO1_T)); pipe.InitBuffer(inQueueC1, 1, dstSize * sizeof(half)); } __aicore__ inline void Process() { CopyIn(); Split(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<fmap_T> featureMapA1 = inQueueFmA1.AllocTensor<fmap_T>(); AscendC::LocalTensor<weight_T> weightB1 = inQueueWeB1.AllocTensor<weight_T>(); AscendC::LocalTensor<dstCO1_T> biasA1 = inQueueBiasA1.AllocTensor<dstCO1_T>(); AscendC::DataCopy(featureMapA1, fmGlobal, { 1, static_cast<uint16_t>(featureMapA1Size * sizeof(fmap_T) / 32), 0, 0 }); AscendC::DataCopy(weightB1, weGlobal, { 1, static_cast<uint16_t>(weightA1Size * sizeof(weight_T) / 32), 0, 0 }); AscendC::DataCopy(biasA1, biasGlobal, { 1, static_cast<uint16_t>(biasSize * sizeof(dstCO1_T) / 32), 0, 0 }); inQueueFmA1.EnQue(featureMapA1); inQueueWeB1.EnQue(weightB1); inQueueBiasA1.EnQue(biasA1); } __aicore__ inline void Split() { AscendC::LocalTensor<fmap_T> featureMapA1 = inQueueFmA1.DeQue<fmap_T>(); AscendC::LocalTensor<weight_T> weightB1 = inQueueWeB1.DeQue<weight_T>(); AscendC::LocalTensor<fmap_T> featureMapA2 = inQueueFmA2.AllocTensor<fmap_T>(); AscendC::LocalTensor<weight_T> weightB2 = inQueueWeB2.AllocTensor<weight_T>(); uint8_t padList[] = {0, 0, 0, 0}; // load3dv2 AscendC::LoadData(featureMapA2, featureMapA1, { padList, H, W, channelSize, k, howoRound, 0, 0, 1, 1, Kw, Kh, dilationW, dilationH, false, false, 0 }); // load2d AscendC::LoadData(weightB2, weightB1, { 0, weRepeat, 1, 0, 0, false, 0 }); inQueueFmA2.EnQue<fmap_T>(featureMapA2); inQueueWeB2.EnQue<weight_T>(weightB2); inQueueFmA1.FreeTensor(featureMapA1); inQueueWeB1.FreeTensor(weightB1); } __aicore__ inline void Compute() { AscendC::LocalTensor<fmap_T> featureMapA2 = inQueueFmA2.DeQue<fmap_T>(); AscendC::LocalTensor<weight_T> weightB2 = inQueueWeB2.DeQue<weight_T>(); AscendC::LocalTensor<dstCO1_T> dstCO1 = outQueueCO1.AllocTensor<dstCO1_T>(); AscendC::LocalTensor<dstCO1_T> biasA1 = inQueueBiasA1.DeQue<dstCO1_T>(); // C = A * B + bias // m: height of the left matrix; k: width of the left matrix; n: width of the right matrix AscendC::Mmad(dstCO1, featureMapA2, weightB2, biasA1, { m, n, k, true, 0, false, false, false }); outQueueCO1.EnQue<dstCO1_T>(dstCO1); inQueueFmA2.FreeTensor(featureMapA2); inQueueWeB2.FreeTensor(weightB2); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<dstCO1_T> dstCO1 = outQueueCO1.DeQue<dstCO1_T>(); // Enable DEQF16 quantization and set the quantization parameter to 0.5. float tmp = (float)0.5; // Convert tmp of float to deqScalar of uint64_t. uint64_t deqScalar = static_cast<uint64_t>(*reinterpret_cast<int32_t*>(&tmp)); bool nz2ndEn = false; // If NZ2ND is disabled, the value of nSize must be a multiple of 16. uint16_t nSize = coutBlocks * 16; uint16_t mSize = m; // The value of srcStride must be a multiple of 16. uint16_t srcStride = (m + 16 - 1) / 16 * 16; // When NZ2ND is disabled, dstStride is the head-to-head distance between bursts and is 32-byte aligned. uint32_t dstStride = m * sizeof(dst_T) * 16 / 32; if (nz2ndEn) { // The number of ND matrices is 1. Set src_nd_stride and dst_nd_stride to 1. AscendC::SetFixpipeNz2ndFlag(1, 1, 1); // When NZ2ND is enabled, nSize may not be a multiple of 16 but must be the same as n of Mmad. nSize = n; // When NZ2ND is enabled, dstStride indicates the stride between adjacent consecutive rows of the same ND matrix and is the same as n. dstStride = nSize; }; // Disable ReLU and channelSplit. AscendC::DataCopyCO12DstParams intriParams(nSize, mSize, dstStride, srcStride, deqMode, 0, false, nz2ndEn); // mov l0c to gm, deq scalar quant AscendC::SetFixpipePreQuantFlag(deqScalar); // Set the quantization parameter. AscendC::PipeBarrier<PIPE_FIX>(); AscendC::DataCopy(dstGlobal, dstCO1, intriParams); // // mov l0c to gm, deq tensor quant // // Additional gm space of the deq tensor needs to be allocated to move the value to workA1. // AscendC::LocalTensor<uint64_t> workA1 = inQueueDeqA1.AllocTensor<uint64_t>(); // // Size of the deq tensor // uint16_t deqSize = 128; // AscendC::DataCopy(workA1, deqGlobal, deqSize); // // Address of the deq tensor on the fix // AscendC::LocalTensor<uint64_t> deqFB = inQueueDeqFB.AllocTensor<uint64_t>(); // // l1->fix, burst_len unit is 128Bytes // uint16_t fbufBurstLen = deqSize / 128; // AscendC::DataCopyParams dataCopyParams(1, fbufBurstLen, 0, 0); // AscendC::DataCopy(deqFB, workA1, dataCopyParams); // // Set the quantization tensor. // AscendC::SetFixPipeConfig(deqFB); // AscendC::PipeBarrier<PIPE_FIX>(); // // mov l0c to gm, enable the ClipReLU operation after the quantization operation // intriParams.clipReluPre = 1; // // Set the value of ClipReLU in the register. // uint64_t clipReluVal = 0x3c00; // value 1, half // SetFixPipeClipRelu(clipReluVal); // //mov l0c to gm, set element-wise operation after the quantization operation, and add a LocalTensor element by element // intriParams.eltWiseOp = 1; // // Additional gm space of the element-wise tensor needs to be allocated to move the value to eleWiseTensor. // AscendC::LocalTensor<half> eleWiseTensor = inQueueC1.AllocTensor<half>(); // DataCopy(eleWiseTensor, eleWiseGlobal, { 1, static_cast<uint16_t>(sizeof(half) * dst_size / 32), 0, 0 }); // AscendC::PipeBarrier<PIPE_ALL>(); // // Set the address for storing the element-wise tensor to the register. // SetFixPipeAddr(eleWiseTensor, 1); // AscendC::DataCopy(dstGlobal, dstCO1, intriParams); // inQueueDeqA1.FreeTensor(workA1); // inQueueDeqFB.FreeTensor(deqFB); // outQueueCO1.FreeTensor(dstCO1); // inQueueC1.FreeTensor(eleWiseTensor); } private: AscendC::TPipe pipe; // feature map queue AscendC::TQue<AscendC::TPosition::A1, 1> inQueueFmA1; AscendC::TQue<AscendC::TPosition::A2, 1> inQueueFmA2; // weight queue AscendC::TQue<AscendC::TPosition::B1, 1> inQueueWeB1; AscendC::TQue<AscendC::TPosition::B2, 1> inQueueWeB2; // bias queue AscendC::TQue<AscendC::TPosition::A1, 1> inQueueBiasA1; // deq tensor queue AscendC::TQue<AscendC::TPosition::A1, 1> inQueueDeqA1; // fb dst of deq tensor AscendC::TQue<AscendC::TPosition::C2PIPE2GM, 1> inQueueDeqFB; // dst queue AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1; // element-wise tensor AscendC::TQue<AscendC::TPosition::C1, 1> inQueueC1; AscendC::GlobalTensor<fmap_T> fmGlobal; AscendC::GlobalTensor<weight_T> weGlobal; AscendC::GlobalTensor<dst_T> dstGlobal; AscendC::GlobalTensor<uint64_t> deqGlobal; AscendC::GlobalTensor<dstCO1_T> biasGlobal; AscendC::GlobalTensor<half> eleWiseGlobal; uint16_t channelSize = 32; uint16_t H = 4, W = 4; uint8_t Kh = 2, Kw = 2; uint16_t Cout; uint16_t C0, C1; uint8_t dilationH, dilationW; uint16_t coutBlocks, ho, wo, howo, howoRound; uint32_t featureMapA1Size, weightA1Size, featureMapA2Size, weightB2Size, biasSize, dstSize, dstCO1Size; uint16_t m, k, n; uint8_t fmRepeat, weRepeat; QuantMode_t deqMode = QuantMode_t::NoQuant; }; #define KERNEL_CUBE_DATACOPY(dst_type, fmap_type, weight_type, dstCO1_type, CoutIn, dilationHIn, dilationWIn, deqModeIn) \ extern "C" __global__ __aicore__ void cube_datacopy_kernel_##fmap_type(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm, \ __gm__ uint8_t* biasGm, __gm__ uint8_t* deqGm, __gm__ uint8_t* eleWiseGm, __gm__ uint8_t* dstGm) \ { \ if (g_coreType == AscendC::AIV) { \ return; \ } \ KernelCubeDataCopy<dst_type, fmap_type, weight_type, dstCO1_type> op(CoutIn, dilationHIn, dilationWIn, \ deqModeIn); \ op.Init(fmGm, weGm, biasGm, deqGm, eleWiseGm, dstGm); \ op.Process(); \ } KERNEL_CUBE_DATACOPY(half, int8_t, int8_t, int32_t, 128, 1, 1, QuantMode_t::DEQF16);