Load3D
Product Support
|
Product |
Supported |
|---|---|
|
|
√ |
|
|
√ |
|
|
√ |
|
|
√ |
|
|
x |
|
|
√ |
Function
Performs the image-to-column operation to convert multi-dimensional feature maps into two-dimensional matrices. The following data paths are supported: A1->A2 and B1->B2.
Prototype
- Load3Dv1 API
1 2
template <typename T, const IsResetLoad3dConfig &defaultConfig = IS_RESER_LOAD3D_DEFAULT_CONFIG, typename U = PrimT<T>, typename Std::enable_if<Std::is_same<PrimT<T>, U>::value, bool>::type = true> __aicore__ inline void LoadData(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LoadData3DParamsV1<U>& loadDataParams)
- Load3Dv2 API
1 2
template <typename T, const IsResetLoad3dConfig &defaultConfig = IS_RESER_LOAD3D_DEFAULT_CONFIG, typename U = PrimT<T>, typename Std::enable_if<Std::is_same<PrimT<T>, U>::value, bool>::type = true> __aicore__ inline void LoadData(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LoadData3DParamsV2<U>& loadDataParams)
Parameters
|
Parameter |
Description |
||||
|---|---|---|---|---|---|
|
T |
Data types of the source and destination operands.
|
||||
|
defaultConfig |
Whether to set related attributes in Load3Dv1 or Load3Dv2. Parameter of the IsResetLoad3dConfig type. The IsResetLoad3dConfig structure is defined as follows:
If isSetFMatrix is set to true, the attribute description (including l1H, l1W, and padList. For details, see Table 3 and Table 4) of the FeatureMap is set in the API. If this parameter is set to false, the attribute description of the FeatureMap passed in the API does not take effect and you need to set the attribute description through SetFmatrix. If isSetPadding is set to true, the pad attribute description (that is, the padValue parameter. For details, see Table 3 and Table 4) is set in the API. If this parameter is set to false, the pad attribute passed in the API does not take effect and you need to set the attribute description through SetLoadDataPaddingValue. For details, see Example. The default value of this parameter is defined as follows:
|
||||
|
U |
Data type of padValue in LoadData3DParamsV1 or LoadData3DParamsV2.
The last template parameter is used only for checking the preceding data types. |
|
Parameter |
Input/Output |
Description |
|---|---|---|
|
dst |
Output |
Destination operand, which is of the LocalTensor type. The sequential arrangement of data is determined by TPosition of the destination operand. The constraints are as follows:
|
|
src |
Input |
Source operand, which is of the LocalTensor or GlobalTensor type. Its data type must be the same as that of dst. |
|
loadDataParams |
Input |
LoadData parameter structure. Supported types are as follows: For details about the definition of the preceding structure parameters, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_mm.h. Replace ${INSTALL_DIR} with the CANN installation path. |
|
Parameter |
Description |
|---|---|
|
padList |
Padding list [padding_left, padding_right, padding_top, padding_bottom]. The value range of each element is [0,255]. The default value is {0, 0, 0, 0}. |
|
l1H |
Height of the source operand. Value range: l1H ∈ [1, 32767]. |
|
l1W |
Width of the source operand. Value range: l1W ∈ [1, 32767]. |
|
c1Index |
Start position of the source tensor in the C1 dimension. Value range: c1Index ∈ [0, 4095]. The default value is 0. |
|
fetchFilterW |
Start position of the filter in the W dimension. Value range: fetchFilterW ∈ [0, 254]. The default value is 0. |
|
fetchFilterH |
Start position of the filter in the H dimension. Value range: fetchFilterH ∈ [0, 254]. The default value is 0. |
|
leftTopW |
Start position of the source operand in the W dimension. Value range: leftTopW ∈ [–255, 32767]. The default value is 0. If padding_left is a, set leftTopW to -a. |
|
leftTopH |
Start position of the source operand in the H dimension. Value range: leftTopH ∈ [–255, 32767]. The default value is 0. If padding_top is a, set leftTopH to -a. |
|
strideW |
Stride of the convolution kernel in the W dimension of the source operand. Value range: strideW ∈ [1, 63]. |
|
strideH |
Stride of the convolution kernel in the H dimension of the source operand. Value range: strideH ∈ [1, 63]. |
|
filterW |
Width of the convolution kernel. Value range: filterW ∈ [1, 255]. |
|
filterH |
Height of the convolution kernel. Value range: filterH ∈ [1, 255]. |
|
dilationFilterW |
Width dilation of the convolution kernel. Value range: dilationFilterW ∈ [1, 255]. |
|
dilationFilterH |
Height dilation of the convolution kernel. Value range: dilationFilterH ∈ [1, 255]. |
|
jumpStride |
In adjacent repetitions, the stride between the start addresses of the destination operand. Value range: jumpStride ∈ [1, 127]. |
|
repeatMode |
Repeat mode.
Value range: repeatMode ∈ [0, 1]. The default value is 0. |
|
repeatTime |
Number of iteration repeats. The addresses of the source and destination operands change upon every iteration. Value range: repeatTime ∈ [1, 255]. |
|
cSize |
Whether to enable the optimization of cSize = 4 (b16) or cSize = 8 (b8). Value range: cSize ∈ [0, 1]. The default value is 0. |
|
padValue |
Padding value. Its data type must be the same as that of src. The default value is 0. To disable padding, set all elements of padList to 0. |
|
Parameter |
Description |
|---|---|
|
padList |
Padding list [padding_left, padding_right, padding_top, padding_bottom]. The value range of each element is [0,255]. The default value is {0, 0, 0, 0}. |
|
l1H |
Height of the source operand. Value range: l1H ∈ [1, 32767]. |
|
l1W |
Weight of the source operand. Value range: l1W ∈ [1, 32767]. |
|
channelSize |
Number of channels of the source operand. Value range: channelSize ∈ [1, 63]. For the following models, the valid values of channelSize are as follows: 4, 8, 16, N × 16 + 4, and N × 16 + 8 for half, 4, 8, 16, 32, N × 32 + 4, N × 32 + 8, and N × 32 + 16 for int8_t/uint8_t, and 8, 16, 32, N × 64, N × 64 + 8, N × 64 + 16, and N × 64 + 32 for int4b_t. N is a positive integer. For the following models, the valid values of channelSize are as follows: 4, N × 8, and N × 8 + 4 for uint32_t/int32_t/float, 4, 8, N × 16, N × 16 + 4, and N × 16 + 8 for half/bfloat16, 4, 8, 16, 32 × N, N × 32 + 4, N × 32 + 8, and N × 32 + 16 for int8_t/uint8_t, and 8, 16, 32, N × 64, N × 64 + 8, N × 64 + 16, and N × 64 + 32 for int4b_t. N is a positive integer. |
|
kExtension |
Transfer length along the width dimension of the destination operand. If the rightmost tile is not covered, the value must be a multiple of 16 for half and a multiple of 32 for int8_t and uint8_t. No multiple requirement applies when the rightmost tile is covered. Value range: kExtension ∈ [1, 65535]. |
|
mExtension |
Transfer length along the height dimension of the destination operand. If the bottom tile is not covered, the value must be a multiple of 16 for half, int8_t, and uint8_t. No multiple requirement applies when the bottom tile is covered. Value range: mExtension ∈ [1, 65535]. |
|
kStartPt |
Start position along the width dimension of the destination operand. The value must be a multiple of 16 for half, and a multiple of 32 for int8_t and uint8_t. Value range: [0, 65535] The default value is 0. |
|
mStartPt |
Start position along the height dimension of the destination operand. If the bottom tile is not covered, the value must be a multiple of 16 for half, int8_t and uint8_t. No multiple requirement applies when the bottom tile is covered. Value range: [0, 65535] The default value is 0. |
|
strideW |
Stride of the convolution kernel along the width dimension of the source operand. Value range: strideW ∈ [1, 63]. |
|
strideH |
Stride of the convolution kernel along the height dimension of the source operand. Value range: strideH ∈ [1, 63]. |
|
filterW |
Width of the convolution kernel. Value range: filterW ∈ [1, 255]. |
|
filterH |
Height of the convolution kernel. Value range: filterH ∈ [1, 255]. |
|
dilationFilterW |
Width dilation of the convolution kernel. Value range: dilationFilterW ∈ [1, 255]. |
|
dilationFilterH |
Height dilation of the convolution kernel. Value range: dilationFilterH ∈ [1, 255]. |
|
enTranspose |
Whether to enable matrix transposition for the entire destination matrix. This parameter is of bool type and takes effect only when the destination TPosition is A2 and the source operand uses the half type. The default value is false.
|
|
enSmallK |
Whether to enable the small k feature. Each tile matrix is 16 × 4. This parameter is of the bool type and defaults to false. This feature is no longer supported in the current product form.
|
|
padValue |
Padding value. Its data type must be the same as that of src. The default value is 0. To disable padding, set all elements of padList to 0. |
|
filterSizeW |
Whether to add 256 elements to the width of the convolution kernel based on filterW. true: yes; false: no. |
|
filterSizeH |
Whether to add 256 elements to the height of the convolution kernel based on filterH. true: yes; false: no. |
|
fMatrixCtrl |
Whether the LoadData3DV2 instruction retrieves FeatureMap attribute descriptions from the left or right matrix. It works with SetFmatrix and can only be set to false (default).
|
Restrictions
- For details about the operand address alignment requirements, see General Address Alignment Restrictions.
- To enable the LoadData3DParamsV1 cSize feature, ensure that the feature map in A1/B1 is 4-channel aligned.
Load3d Data Formats
The input feature map and filter must follow the NC1HWC0 format. C0 is the innermost dimension with a fixed value of 16, or 32 for u8/s8 types. C1 = C/C0.
To simplify the scenario, assume the input feature map has 4 channels, namely Ci = 4. The input feature map in A1 has a shape of (Hi, Wi, Ci). After processing by load3dv1, the data in A2 is shaped as (Wo × Ho, Hk × Wk × Ci). Wo and Ho denote the output dimensions after convolution, while Hk and Wk represent the filter dimensions.
Intuitively, the img2col operation slides the filter across the feature map and unfolds the corresponding feature map data into each row of the output. The filter slides Wo steps along the W dimension, then shifts one step along the H dimension, and repeats this process, generating a total of Wo × Ho rows of output data. In the figure below, the red and yellow data represent the first and second rows respectively. The numbers illustrate the mapping between the original input data, filter, and output data. As demonstrated, load3dv1 first fetches four values corresponding to 00 along the Ci dimension of the input data, followed by four values corresponding to 01. The total length of each row is Hk × Wk × Ci, which equals 3 × 3 × 4 = 36 values.
The following figure shows the format of the feature map.

The following figure shows the filter format:
In the format, n indicates the number of filters, and the dimension layout is (Hk, Wk, Ci, n). Note that this format needs to be converted to match the format of matrix B in MMAD.

In practice, due to constraints on storage and compute resources, the entire convolution computation is processed in blocks, with only one block of data loaded and computed per iteration.

Two blocking schemes are available for the feature map in A2: horizontal blocking and vertical blocking, corresponding to repeatMode 0 and 1 respectively.
Note: The tile matrix shown in the figure is 4 × 4. The actual size is 16 × 16 (or 16 × 32 for u8 and s8 data types).
When repeatMode is set to 0, each repeat changes the data read position within the filter window, followed by a jump to the next C0 position.

When repeatMode is set to 1, the data read position within the filter window stays fixed, and each repeat moves forward by C0 elements in the feature map.

Returns
None
Example
The example supports the
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 |
#include "kernel_operator.h" class KernelLoadData { public: __aicore__ inline KernelLoadData() { coutBlocks = (Cout + 16 - 1) / 16; ho = (H + padTop + padBottom - dilationH * (Kh - 1) - 1) / strideH + 1; wo = (W + padLeft + padRight - dilationW * (Kw - 1) - 1) / strideW + 1; howo = ho * wo; howoRound = ((howo + 16 - 1) / 16) * 16; featureMapA1Size = C1 * H * W * C0; // shape: [C1, H, W, C0] weightA1Size = C1 * Kh * Kw * Cout * C0; // shape: [C1, Kh, Kw, Cout, C0] featureMapA2Size = howoRound * (C1 * Kh * Kw * C0); weightB2Size = (C1 * Kh * Kw * C0) * coutBlocks * 16; m = howo; k = C1 * Kh * Kw * C0; n = Cout; dstSize = coutBlocks * howo * 16; // shape: [coutBlocks, howo, 16] dstCO1Size = coutBlocks * howoRound * 16; fmRepeat = featureMapA2Size / (16 * C0); weRepeat = weightB2Size / (16 * C0); } __aicore__ inline void Init(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm, __gm__ uint8_t* dstGm) { fmGlobal.SetGlobalBuffer((__gm__ half*)fmGm); weGlobal.SetGlobalBuffer((__gm__ half*)weGm); dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm); pipe.InitBuffer(inQueueFmA1, 1, featureMapA1Size * sizeof(half)); pipe.InitBuffer(inQueueFmA2, 1, featureMapA2Size * sizeof(half)); pipe.InitBuffer(inQueueWeB1, 1, weightA1Size * sizeof(half)); pipe.InitBuffer(inQueueWeB2, 1, weightB2Size * sizeof(half)); pipe.InitBuffer(outQueue , 1, dstCO1Size * sizeof(float)); pipe.InitBuffer(outQueueUB, 1, dstSize * sizeof(half)); } __aicore__ inline void Process() { CopyIn(); Split(); Compute(); CopyUB(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<half> featureMapA1 = inQueueFmA1.AllocTensor<half>(); AscendC::LocalTensor<half> weightB1 = inQueueWeB1.AllocTensor<half>(); AscendC::DataCopy(featureMapA1, fmGlobal, { 1, static_cast<uint16_t>(featureMapA1Size * sizeof(half) / 32), 0, 0 }); AscendC::DataCopy(weightB1, weGlobal, { 1, static_cast<uint16_t>(weightA1Size * sizeof(half) / 32), 0, 0 }); inQueueFmA1.EnQue(featureMapA1); inQueueWeB1.EnQue(weightB1); } __aicore__ inline void Split() { AscendC::LocalTensor<half> featureMapA1 = inQueueFmA1.DeQue<half>(); AscendC::LocalTensor<half> weightB1 = inQueueWeB1.DeQue<half>(); AscendC::LocalTensor<half> featureMapA2 = inQueueFmA2.AllocTensor<half>(); AscendC::LocalTensor<half> weightB2 = inQueueWeB2.AllocTensor<half>(); uint8_t padList[4] = {padLeft, padRight, padTop, padBottom}; AscendC::LoadData(featureMapA2, featureMapA1, { padList, H, W, 0, 0, 0, -1, -1, strideW, strideH, Kw, Kh, dilationW, dilationH, 1, 0, fmRepeat, 0, (half)(0)}); AscendC::LoadData(weightB2, weightB1, { 0, weRepeat, 1, 0, 0, false, 0 }); inQueueFmA2.EnQue<half>(featureMapA2); inQueueWeB2.EnQue<half>(weightB2); inQueueFmA1.FreeTensor(featureMapA1); inQueueWeB1.FreeTensor(weightB1); } __aicore__ inline void Compute() { AscendC::LocalTensor<half> featureMapA2 = inQueueFmA2.DeQue<half>(); AscendC::LocalTensor<half> weightB2 = inQueueWeB2.DeQue<half>(); AscendC::LocalTensor<float> dstCO1 = outQueueCO1.AllocTensor<float>(); AscendC::Mmad(dstCO1, featureMapA2, weightB2, { m, n, k, 0, false, true }); outQueueCO1.EnQue<float>(dstCO1); inQueueFmA2.FreeTensor(featureMapA2); inQueueWeB2.FreeTensor(weightB2); } __aicore__ inline void CopyUB() { AscendC::LocalTensor<float> dstCO1 = outQueueCO1.DeQue<float>(); AscendC::LocalTensor<half> dstUB = outQueueUB.AllocTensor<half>(); AscendC::DataCopyParams dataCopyParams; dataCopyParams.blockCount = 1; dataCopyParams.blockLen = m * n * sizeof(float) / 1024; AscendC::DataCopyEnhancedParams enhancedParams; enhancedParams.blockMode = AscendC::BlockMode::BLOCK_MODE_MATRIX; AscendC::DataCopy(dstUB, dstCO1, dataCopyParams, enhancedParams); outQueueUB.EnQue<half>(dstUB); outQueueCO1.FreeTensor(dstCO1); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<half> dstUB = outQueueUB.DeQue<half>(); AscendC::DataCopy(dstGlobal, dstUB, m * n); outQueueUB.FreeTensor(dstUB); } private: AscendC::TPipe pipe; // feature map queue AscendC::TQue<AscendC::TPosition::A1, 1> inQueueFmA1; AscendC::TQue<AscendC::TPosition::A2, 1> inQueueFmA2; // weight queue AscendC::TQue<AscendC::TPosition::B1, 1> inQueueWeB1; AscendC::TQue<AscendC::TPosition::B2, 1> inQueueWeB2; // dst queue AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1; AscendC::TQue<AscendC::TPosition::CO2, 1> outQueueUB; AscendC::GlobalTensor<half> fmGlobal, weGlobal, dstGlobal; uint16_t C1 = 2; uint16_t H = 4, W = 4; uint8_t Kh = 2, Kw = 2; uint16_t Cout = 16; uint16_t C0 = 16; uint8_t dilationH = 2, dilationW = 2; uint8_t padTop = 1, padBottom = 1, padLeft = 1, padRight = 1; uint8_t strideH = 1, strideW = 1; uint16_t coutBlocks, ho, wo, howo, howoRound; uint32_t featureMapA1Size, weightA1Size, featureMapA2Size, weightB2Size, dstSize, dstCO1Size; uint16_t m, k, n; uint8_t fmRepeat, weRepeat; }; extern "C" __global__ __aicore__ void load_data_simple_kernel(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm, __gm__ uint8_t* dstGm) { KernelLoadData op; op.Init(fmGm, weGm, dstGm); op.Process(); } |