Load3D

Product Support

Product	Supported
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	√
Atlas inference product 's AI Core	√
Atlas inference product 's Vector Core	x
Atlas training products	√

Function

Performs the image-to-column operation to convert multi-dimensional feature maps into two-dimensional matrices. The following data paths are supported: A1->A2 and B1->B2.

Prototype

Load3Dv1 API

        
             template <typename T, const IsResetLoad3dConfig &defaultConfig = IS_RESER_LOAD3D_DEFAULT_CONFIG, typename U = PrimT<T>, typename Std::enable_if<Std::is_same<PrimT<T>, U>::value, bool>::type = true>
__aicore__ inline void LoadData(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LoadData3DParamsV1<U>& loadDataParams)

Load3Dv2 API

        
             template <typename T, const IsResetLoad3dConfig &defaultConfig = IS_RESER_LOAD3D_DEFAULT_CONFIG, typename U = PrimT<T>, typename Std::enable_if<Std::is_same<PrimT<T>, U>::value, bool>::type = true>
__aicore__ inline void LoadData(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LoadData3DParamsV2<U>& loadDataParams)

Parameters

Table 1 Template parameters

Parameter

Description

Data types of the source and destination operands.

Load3Dv1 API:
For Atlas training products , the supported data types are uint8_t, int8_t, and half.

For the Atlas inference product 's AI Core, the supported data types are uint8_t, int8_t, and half.
Load3Dv2 API:
For the Atlas inference product 's AI Core, the supported data types are uint8_t, int8_t, half, and int4b_t.

For Atlas A2 training products / Atlas A2 inference products :
- When TPosition is A1 or A2, the supported data types are uint8_t, int8_t, half, bfloat16_t, uint32_t, int32_t, float, and int4b_t.
- When TPosition is B1 or B2, the supported data types are half, bfloat16_t, uint32_t, int32_t, and float.
For Atlas A3 training products / Atlas A3 inference products :
- When TPosition is A1 or A2, the supported data types are uint8_t, int8_t, half, bfloat16_t, uint32_t, int32_t, float, and int4b_t.
- When TPosition is B1 or B2, the supported data types are half, bfloat16_t, uint32_t, int32_t, and float.
For the Atlas 200I/500 A2 inference products :
- When TPosition is A1 or A2, the supported data types are uint8_t, int8_t, half, bfloat16, uint32_t, int32_t, float, and int4b_t.
- When TPosition is B1 or B2, the supported data types are half, bfloat16_t, uint32_t, int32_t, and float.

defaultConfig

Whether to set related attributes in Load3Dv1 or Load3Dv2. Parameter of the IsResetLoad3dConfig type. The IsResetLoad3dConfig structure is defined as follows:

           
                struct IsResetLoad3dConfig {
   bool isSetFMatrix = true;
   bool isSetPadding = true;
};

If isSetFMatrix is set to true, the attribute description (including l1H, l1W, and padList. For details, see Table 3 and Table 4) of the FeatureMap is set in the API. If this parameter is set to false, the attribute description of the FeatureMap passed in the API does not take effect and you need to set the attribute description through SetFmatrix.

If isSetPadding is set to true, the pad attribute description (that is, the padValue parameter. For details, see Table 3 and Table 4) is set in the API. If this parameter is set to false, the pad attribute passed in the API does not take effect and you need to set the attribute description through SetLoadDataPaddingValue. For details, see Example.

The default value of this parameter is defined as follows:

           
                constexpr IsResetLoad3dConfig IS_RESER_LOAD3D_DEFAULT_CONFIG = {true, true};

Data type of padValue in LoadData3DParamsV1 or LoadData3DParamsV2.

When basic data types are used for dst and src, U must be consistent with the data type T of dst and src. Otherwise, the compilation fails.
When the TensorTrait type is used for dst and src, the LiteType of U must match that of the data type T of dst and src. Otherwise, the compilation fails.

The last template parameter is used only for checking the preceding data types.

**Table 2** Common parameters
Parameter	Input/Output	Description
dst	Output	Destination operand, which is of the LocalTensor type. The sequential arrangement of data is determined by TPosition of the destination operand. The constraints are as follows: A2: ZZ format B2: ZN format A1/B1: No format restriction. Generally, the format is NZ.
src	Input	Source operand, which is of the LocalTensor or GlobalTensor type. Its data type must be the same as that of dst.
loadDataParams	Input	LoadData parameter structure. Supported types are as follows: LoadData3DParamsV1. For details, see Table 3. LoadData3DParamsV2. For details, see Table 4. For details about the definition of the preceding structure parameters, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_mm.h. Replace *${INSTALL_DIR}* with the CANN installation path.

**Table 3** Parameters in the LoadData3DParamsV1 structure
Parameter	Description
padList	Padding list [padding_left, padding_right, padding_top, padding_bottom]. The value range of each element is [0,255]. The default value is {0, 0, 0, 0}.
l1H	Height of the source operand. Value range: l1H ∈ [1, 32767].
l1W	Width of the source operand. Value range: l1W ∈ [1, 32767].
c1Index	Start position of the source tensor in the C1 dimension. Value range: c1Index ∈ [0, 4095]. The default value is 0.
fetchFilterW	Start position of the filter in the W dimension. Value range: fetchFilterW ∈ [0, 254]. The default value is 0.
fetchFilterH	Start position of the filter in the H dimension. Value range: fetchFilterH ∈ [0, 254]. The default value is 0.
leftTopW	Start position of the source operand in the W dimension. Value range: leftTopW ∈ [–255, 32767]. The default value is 0. If padding_left is a, set leftTopW to -a.
leftTopH	Start position of the source operand in the H dimension. Value range: leftTopH ∈ [–255, 32767]. The default value is 0. If padding_top is a, set leftTopH to -a.
strideW	Stride of the convolution kernel in the W dimension of the source operand. Value range: strideW ∈ [1, 63].
strideH	Stride of the convolution kernel in the H dimension of the source operand. Value range: strideH ∈ [1, 63].
filterW	Width of the convolution kernel. Value range: filterW ∈ [1, 255].
filterH	Height of the convolution kernel. Value range: filterH ∈ [1, 255].
dilationFilterW	Width dilation of the convolution kernel. Value range: dilationFilterW ∈ [1, 255].
dilationFilterH	Height dilation of the convolution kernel. Value range: dilationFilterH ∈ [1, 255].
jumpStride	In adjacent repetitions, the stride between the start addresses of the destination operand. Value range: jumpStride ∈ [1, 127].
repeatMode	Repeat mode. Mode 0: In each repeat, points in the filter window are added, corresponding to the increase in the W dimension of the destination matrix. Mode 1: In each repeat, the upper-left coordinates of the sliding window are added, corresponding to the increase in the H dimension of the destination matrix. Value range: repeatMode ∈ [0, 1]. The default value is 0.
repeatTime	Number of iteration repeats. The addresses of the source and destination operands change upon every iteration. Value range: repeatTime ∈ [1, 255].
cSize	Whether to enable the optimization of cSize = 4 (b16) or cSize = 8 (b8). Value range: cSize ∈ [0, 1]. The default value is 0.
padValue	Padding value. Its data type must be the same as that of src. The default value is 0. To disable padding, set all elements of padList to 0.

**Table 4** Parameters in the LoadData3DParamsV2 structure
Parameter	Description
padList	Padding list [padding_left, padding_right, padding_top, padding_bottom]. The value range of each element is [0,255]. The default value is {0, 0, 0, 0}.
l1H	Height of the source operand. Value range: l1H ∈ [1, 32767].
l1W	Weight of the source operand. Value range: l1W ∈ [1, 32767].
channelSize	Number of channels of the source operand. Value range: channelSize ∈ [1, 63]. For the following models, the valid values of channelSize are as follows: 4, 8, 16, N × 16 + 4, and N × 16 + 8 for half, 4, 8, 16, 32, N × 32 + 4, N × 32 + 8, and N × 32 + 16 for int8_t/uint8_t, and 8, 16, 32, N × 64, N × 64 + 8, N × 64 + 16, and N × 64 + 32 for int4b_t. N is a positive integer. Atlas inference product 's AI Core For the following models, the valid values of channelSize are as follows: 4, N × 8, and N × 8 + 4 for uint32_t/int32_t/float, 4, 8, N × 16, N × 16 + 4, and N × 16 + 8 for half/bfloat16, 4, 8, 16, 32 × N, N × 32 + 4, N × 32 + 8, and N × 32 + 16 for int8_t/uint8_t, and 8, 16, 32, N × 64, N × 64 + 8, N × 64 + 16, and N × 64 + 32 for int4b_t. N is a positive integer. Atlas A2 training products / Atlas A2 inference products Atlas A3 training products / Atlas A3 inference products Atlas 200I/500 A2 inference products
kExtension	Transfer length along the width dimension of the destination operand. If the rightmost tile is not covered, the value must be a multiple of 16 for half and a multiple of 32 for int8_t and uint8_t. No multiple requirement applies when the rightmost tile is covered. Value range: kExtension ∈ [1, 65535].
mExtension	Transfer length along the height dimension of the destination operand. If the bottom tile is not covered, the value must be a multiple of 16 for half, int8_t, and uint8_t. No multiple requirement applies when the bottom tile is covered. Value range: mExtension ∈ [1, 65535].
kStartPt	Start position along the width dimension of the destination operand. The value must be a multiple of 16 for half, and a multiple of 32 for int8_t and uint8_t. Value range: [0, 65535] The default value is 0.
mStartPt	Start position along the height dimension of the destination operand. If the bottom tile is not covered, the value must be a multiple of 16 for half, int8_t and uint8_t. No multiple requirement applies when the bottom tile is covered. Value range: [0, 65535] The default value is 0.
strideW	Stride of the convolution kernel along the width dimension of the source operand. Value range: strideW ∈ [1, 63].
strideH	Stride of the convolution kernel along the height dimension of the source operand. Value range: strideH ∈ [1, 63].
filterW	Width of the convolution kernel. Value range: filterW ∈ [1, 255].
filterH	Height of the convolution kernel. Value range: filterH ∈ [1, 255].
dilationFilterW	Width dilation of the convolution kernel. Value range: dilationFilterW ∈ [1, 255].
dilationFilterH	Height dilation of the convolution kernel. Value range: dilationFilterH ∈ [1, 255].
enTranspose	Whether to enable matrix transposition for the entire destination matrix. This parameter is of bool type and takes effect only when the destination TPosition is A2 and the source operand uses the half type. The default value is false. true: enabled false: disabled
enSmallK	Whether to enable the small k feature. Each tile matrix is 16 × 4. This parameter is of the bool type and defaults to false. This feature is no longer supported in the current product form. true: enabled false: disabled
padValue	Padding value. Its data type must be the same as that of src. The default value is 0. To disable padding, set all elements of padList to 0.
filterSizeW	Whether to add 256 elements to the width of the convolution kernel based on filterW. true: yes; false: no.
filterSizeH	Whether to add 256 elements to the height of the convolution kernel based on filterH. true: yes; false: no.
fMatrixCtrl	Whether the LoadData3DV2 instruction retrieves FeatureMap attribute descriptions from the left or right matrix. It works with SetFmatrix and can only be set to false (default). true: Retrieve FeatureMap attribute descriptions from the right matrix. false: Retrieve FeatureMap attribute descriptions from the left matrix.

Restrictions

For details about the operand address alignment requirements, see General Address Alignment Restrictions.

To enable the LoadData3DParamsV1 cSize feature, ensure that the feature map in A1/B1 is 4-channel aligned.

Load3d Data Formats

The input feature map and filter must follow the NC1HWC0 format. C0 is the innermost dimension with a fixed value of 16, or 32 for u8/s8 types. C1 = C/C0.

To simplify the scenario, assume the input feature map has 4 channels, namely Ci = 4. The input feature map in A1 has a shape of (Hi, Wi, Ci). After processing by load3dv1, the data in A2 is shaped as (Wo × Ho, Hk × Wk × Ci). Wo and Ho denote the output dimensions after convolution, while Hk and Wk represent the filter dimensions.

Intuitively, the img2col operation slides the filter across the feature map and unfolds the corresponding feature map data into each row of the output. The filter slides Wo steps along the W dimension, then shifts one step along the H dimension, and repeats this process, generating a total of Wo × Ho rows of output data. In the figure below, the red and yellow data represent the first and second rows respectively. The numbers illustrate the mapping between the original input data, filter, and output data. As demonstrated, load3dv1 first fetches four values corresponding to 00 along the Ci dimension of the input data, followed by four values corresponding to 01. The total length of each row is Hk × Wk × Ci, which equals 3 × 3 × 4 = 36 values.

The following figure shows the format of the feature map.

The following figure shows the filter format:

In the format, n indicates the number of filters, and the dimension layout is (Hk, Wk, Ci, n). Note that this format needs to be converted to match the format of matrix B in MMAD.

In practice, due to constraints on storage and compute resources, the entire convolution computation is processed in blocks, with only one block of data loaded and computed per iteration.

Two blocking schemes are available for the feature map in A2: horizontal blocking and vertical blocking, corresponding to repeatMode 0 and 1 respectively.

Note: The tile matrix shown in the figure is 4 × 4. The actual size is 16 × 16 (or 16 × 32 for u8 and s8 data types).

When repeatMode is set to 0, each repeat changes the data read position within the filter window, followed by a jump to the next C0 position.

When repeatMode is set to 1, the data read position within the filter window stays fixed, and each repeat moves forward by C0 elements in the feature map.

Returns

None

Example

The example supports the Atlas inference product 's AI Core platform.

      
       
         
         
           #include "kernel_operator.h"

class KernelLoadData {
public:
    __aicore__ inline KernelLoadData()
    {
        coutBlocks = (Cout + 16 - 1) / 16;
        ho = (H + padTop + padBottom - dilationH * (Kh - 1) - 1) / strideH + 1;
        wo = (W + padLeft + padRight - dilationW * (Kw - 1) - 1) / strideW + 1;
        howo = ho * wo;
        howoRound = ((howo + 16 - 1) / 16) * 16;
        featureMapA1Size = C1 * H * W * C0;      // shape: [C1, H, W, C0]
        weightA1Size = C1 * Kh * Kw * Cout * C0; // shape: [C1, Kh, Kw, Cout, C0]
        featureMapA2Size = howoRound * (C1 * Kh * Kw * C0);
        weightB2Size = (C1 * Kh * Kw * C0) * coutBlocks * 16;
        m = howo;
        k = C1 * Kh * Kw * C0;
        n = Cout;
        dstSize = coutBlocks * howo * 16; // shape: [coutBlocks, howo, 16]
        dstCO1Size = coutBlocks * howoRound * 16;
        fmRepeat = featureMapA2Size / (16 * C0);
        weRepeat = weightB2Size / (16 * C0);
    }
    __aicore__ inline void Init(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm, __gm__ uint8_t* dstGm)
    {
        fmGlobal.SetGlobalBuffer((__gm__ half*)fmGm);
        weGlobal.SetGlobalBuffer((__gm__ half*)weGm);
        dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
        pipe.InitBuffer(inQueueFmA1, 1, featureMapA1Size * sizeof(half));
        pipe.InitBuffer(inQueueFmA2, 1, featureMapA2Size * sizeof(half));
        pipe.InitBuffer(inQueueWeB1, 1, weightA1Size * sizeof(half));
        pipe.InitBuffer(inQueueWeB2, 1, weightB2Size * sizeof(half));
        pipe.InitBuffer(outQueue
, 1, dstCO1Size * sizeof(float));
        pipe.InitBuffer(outQueueUB, 1, dstSize * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Split();
        Compute();
        CopyUB();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> featureMapA1 = inQueueFmA1.AllocTensor<half>();
        AscendC::LocalTensor<half> weightB1 = inQueueWeB1.AllocTensor<half>();
        AscendC::DataCopy(featureMapA1, fmGlobal, { 1, static_cast<uint16_t>(featureMapA1Size * sizeof(half) / 32), 0, 0 });
        AscendC::DataCopy(weightB1, weGlobal, { 1, static_cast<uint16_t>(weightA1Size * sizeof(half) / 32), 0, 0 });
        inQueueFmA1.EnQue(featureMapA1);
        inQueueWeB1.EnQue(weightB1);
    }
    __aicore__ inline void Split()
    {
        AscendC::LocalTensor<half> featureMapA1 = inQueueFmA1.DeQue<half>();
        AscendC::LocalTensor<half> weightB1 = inQueueWeB1.DeQue<half>();
        AscendC::LocalTensor<half> featureMapA2 = inQueueFmA2.AllocTensor<half>();
        AscendC::LocalTensor<half> weightB2 = inQueueWeB2.AllocTensor<half>();
        uint8_t padList[4] = {padLeft, padRight, padTop, padBottom};
        AscendC::LoadData(featureMapA2, featureMapA1,
            { padList, H, W, 0, 0, 0, -1, -1, strideW, strideH, Kw, Kh, dilationW, dilationH, 1, 0, fmRepeat, 0, (half)(0)});
        AscendC::LoadData(weightB2, weightB1, { 0, weRepeat, 1, 0, 0, false, 0 });
        inQueueFmA2.EnQue<half>(featureMapA2);
        inQueueWeB2.EnQue<half>(weightB2);
        inQueueFmA1.FreeTensor(featureMapA1);
        inQueueWeB1.FreeTensor(weightB1);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> featureMapA2 = inQueueFmA2.DeQue<half>();
        AscendC::LocalTensor<half> weightB2 = inQueueWeB2.DeQue<half>();
        AscendC::LocalTensor<float> dstCO1 = outQueueCO1.AllocTensor<float>();
        AscendC::Mmad(dstCO1, featureMapA2, weightB2, { m, n, k, 0, false, true });
        outQueueCO1.EnQue<float>(dstCO1);
        inQueueFmA2.FreeTensor(featureMapA2);
        inQueueWeB2.FreeTensor(weightB2);
    }
    __aicore__ inline void CopyUB()
    {
        AscendC::LocalTensor<float> dstCO1 = outQueueCO1.DeQue<float>();
        AscendC::LocalTensor<half> dstUB = outQueueUB.AllocTensor<half>();
        AscendC::DataCopyParams dataCopyParams;
        dataCopyParams.blockCount = 1;
        dataCopyParams.blockLen = m * n * sizeof(float) / 1024;
        AscendC::DataCopyEnhancedParams enhancedParams;
        enhancedParams.blockMode = AscendC::BlockMode::BLOCK_MODE_MATRIX;
        AscendC::DataCopy(dstUB, dstCO1, dataCopyParams, enhancedParams);
        outQueueUB.EnQue<half>(dstUB);
        outQueueCO1.FreeTensor(dstCO1);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> dstUB = outQueueUB.DeQue<half>();
        AscendC::DataCopy(dstGlobal, dstUB, m * n);
        outQueueUB.FreeTensor(dstUB);
    }

private:
    AscendC::TPipe pipe;
    // feature map queue
    AscendC::TQue<AscendC::TPosition::A1, 1> inQueueFmA1;
    AscendC::TQue<AscendC::TPosition::A2, 1> inQueueFmA2;
    // weight queue
    AscendC::TQue<AscendC::TPosition::B1, 1> inQueueWeB1;
    AscendC::TQue<AscendC::TPosition::B2, 1> inQueueWeB2;
    // dst queue
    AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1;
    AscendC::TQue<AscendC::TPosition::CO2, 1> outQueueUB;
    AscendC::GlobalTensor<half> fmGlobal, weGlobal, dstGlobal;
    uint16_t C1 = 2;
    uint16_t H = 4, W = 4;
    uint8_t Kh = 2, Kw = 2;
    uint16_t Cout = 16;
    uint16_t C0 = 16;
    uint8_t dilationH = 2, dilationW = 2;
    uint8_t padTop = 1, padBottom = 1, padLeft = 1, padRight = 1;
    uint8_t strideH = 1, strideW = 1;
    uint16_t coutBlocks, ho, wo, howo, howoRound;
    uint32_t featureMapA1Size, weightA1Size, featureMapA2Size, weightB2Size, dstSize, dstCO1Size;
    uint16_t m, k, n;
    uint8_t fmRepeat, weRepeat;
};

extern "C" __global__ __aicore__ void load_data_simple_kernel(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm,
    __gm__ uint8_t* dstGm)
{
    KernelLoadData op;
    op.Init(fmGm, weGm, dstGm);
    op.Process();
}

          

        

      
     

Parent topic: LoadData