InitConstValue

Product Support

Product	Supported
Atlas A3 training products/Atlas A3 inference products	√
Atlas A2 training products/Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	√
Atlas inference product's AI Core	√
Atlas inference product's Vector Core	x
Atlas training products	√

Function

Initializes LocalTensor of a specific TPosition to a specific value.

Prototype

template <typename T, typename U = PrimT<T>, typename Std::enable_if<Std::is_same<PrimT<T>, U>::value, bool>::type = true>
__aicore__ inline void InitConstValue(const LocalTensor<T>& dst, const InitConstValueParams<U>& initConstValueParams)

Parameters

**Table 1** Template parameters
Parameter	Description
T	Data type of dst. For Atlas training products, the supported data type is half. For the Atlas inference product's AI Core, the supported data types are half, int16_t, and uint16_t. For the Atlas A2 training products/Atlas A2 inference products, the supported data types are half, int16_t, uint16_t, bfloat16_t, float, int32_t, and uint32_t. For the Atlas A3 training products/Atlas A3 inference products, the supported data types are half, int16_t, uint16_t, bfloat16_t, float, int32_t, and uint32_t. For the Atlas 200I/500 A2 inference products, the supported data types are half, int16_t, uint16_t, bfloat16_t, float, int32_t, and uint32_t.
U	Data type of the initial value. When dst uses a basic data type, the data type T of U must be the same as that of dst. Otherwise, the compilation fails. When dst uses the TensorTrait type, the LiteType of the data type T of U must be the same as that of dst. Otherwise, the compilation fails. The last template parameter is used only for checking the preceding data types.

**Table 2** Parameters
Parameter	Input/Output	Meaning
dst	Output	Destination operand for the result matrix, of the LocalTensor type. For Atlas training products, the supported TPosition is A1, A2, B1, or B2. For the Atlas inference product's AI Core, the supported TPosition is A1, A2, B1, or B2. For Atlas A2 training products/Atlas A2 inference products, the supported TPosition is A1, A2, B1, or B2. For Atlas A3 training products/Atlas A3 inference products, the supported TPosition is A1, A2, B1, or B2. For Atlas 200I/500 A2 inference products, the supported TPosition is A1, A2, B1, or B2. If TPosition is A1/B1, the start address must be 32-byte aligned. If TPosition is A2/B2, the start address must be 512-byte aligned.
InitConstValueParams	Input	Initialization parameter, of the InitConstValueParams type. For details about the definition, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_mm.h. Replace *${INSTALL_DIR}* with the file storage path after the CANN software is installed. For details about the parameter description, see Table 3. For Atlas training products, only repeatTimes and initValue can be configured. For the Atlas inference product's AI Core, only repeatTimes and initValue can be configured. For Atlas A2 training products/Atlas A2 inference products, all parameters can be configured. For Atlas A3 training products/Atlas A3 inference products, all parameters can be configured. For Atlas 200I/500 A2 inference products, all parameters can be configured. In the scenario where only repeatTimes and initValue can be configured, configuration of other parameters is invalid. The amount of data (512 bytes) processed in each iteration is fixed, and there is no interval between iterations. In the scenario where all parameters can be configured, the following parameters can be configured: repeatTimes, initValue, blockNum, and dstGap.

**Table 3** Parameters in the **InitConstValueParams** structure
Parameter	Meaning
repeatTimes	Number of iterations. The default value is 0. In the scenario where only repeatTimes and initValue can be configured, repeatTimes ∈ [0, 255]. In the scenario where all parameters can be configured, repeatTimes ∈ [0, 32767].
blockNum	Number of data blocks initialized in each iteration. blockNum ∈ [0, 32767]. The default value is 0. When the location of dst is A1/B1, the size of each data block is 32 bytes. When the location of dst is A2/B2, the size of each data block is 512 bytes.
dstGap	Gap between the end address of the previous iteration and the start address of the next iteration of the destination operand. When the location of dst is A1/B1, the unit is 32 bytes. When the location of dst is A2/B2, the unit is 512 bytes. Value range: dstGap ∈ [0, 32767]. The default value is 0.
initValue	Initialized value. The supported data types are the same as those of dst.

Restrictions

For details about the operand address alignment requirements, see General Address Alignment Restrictions.

Example

#include "kernel_operator.h"

template <typename dst_T, typename fmap_T, typename weight_T, typename dstCO1_T> class KernelCubeMmad {
public:
    __aicore__ inline KernelCubeMmad()
    {
        C0 = 32 / sizeof(fmap_T);
        C1 = channelSize / C0;
        coutBlocks = (Cout + 16 - 1) / 16;
        ho = H - dilationH * (Kh - 1);
        wo = W - dilationW * (Kw - 1);
        howo = ho * wo;
        howoRound = ((howo + 16 - 1) / 16) * 16;
        featureMapA1Size = C1 * H * W * C0;      // shape: [C1, H, W, C0]
        weightA1Size = C1 * Kh * Kw * Cout * C0; // shape: [C1, Kh, Kw, Cout, C0]
        featureMapA2Size = howoRound * (C1 * Kh * Kw * C0);
        weightB2Size = (C1 * Kh * Kw * C0) * coutBlocks * 16;
        m = howo;
        k = C1 * Kh * Kw * C0;
        n = Cout;
        biasSize = Cout;                  // shape: [Cout]
        dstSize = coutBlocks * howo * 16; // shape: [coutBlocks, howo, 16]
        dstCO1Size = coutBlocks * howoRound * 16;
        fmRepeat = featureMapA2Size / (16 * C0);
        weRepeat = weightB2Size / (16 * C0);
    }
    __aicore__ inline void Init(__gm__ uint8_t* fmGm, __gm__ uint8_t* weGm, __gm__ uint8_t* biasGm,
        __gm__ uint8_t* dstGm)
    {
        fmGlobal.SetGlobalBuffer((__gm__ fmap_T*)fmGm);
        weGlobal.SetGlobalBuffer((__gm__ weight_T*)weGm);
        biasGlobal.SetGlobalBuffer((__gm__ dstCO1_T*)biasGm);
        dstGlobal.SetGlobalBuffer((__gm__ dst_T*)dstGm);
        pipe.InitBuffer(inQueueFmA1, 1, featureMapA1Size * sizeof(fmap_T));
        pipe.InitBuffer(inQueueFmA2, 1, featureMapA2Size * sizeof(fmap_T));
        pipe.InitBuffer(inQueueWeB1, 1, weightA1Size * sizeof(weight_T));
        pipe.InitBuffer(inQueueWeB2, 1, weightB2Size * sizeof(weight_T));
        pipe.InitBuffer(inQueueBiasA1, 1, biasSize * sizeof(dstCO1_T));
        pipe.InitBuffer(outQueueCO1, 1, dstCO1Size * sizeof(dstCO1_T));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Split();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<fmap_T> featureMapA1 = inQueueFmA1.AllocTensor<fmap_T>();
        AscendC::LocalTensor<weight_T> weightB1 = inQueueWeB1.AllocTensor<weight_T>();
        AscendC::LocalTensor<dstCO1_T> biasA1 = inQueueBiasA1.AllocTensor<dstCO1_T>();

        AscendC::InitConstValue(featureMapA1, {1, static_cast<uint16_t>(featureMapA1Size * sizeof(fmap_T) / 32), 0, 1});
        AscendC::InitConstValue(weightB1, {1, static_cast<uint16_t>(weightA1Size * sizeof(weight_T) / 32), 0, 2});
        AscendC::DataCopy(biasA1, biasGlobal, { 1, static_cast<uint16_t>(biasSize * sizeof(dstCO1_T) / 32), 0, 0 });

        inQueueFmA1.EnQue(featureMapA1);
        inQueueWeB1.EnQue(weightB1);
        inQueueBiasA1.EnQue(biasA1);
    }
    __aicore__ inline void Split()
    {
        AscendC::LocalTensor<fmap_T> featureMapA1 = inQueueFmA1.DeQue<fmap_T>();
        AscendC::LocalTensor<weight_T> weightB1 = inQueueWeB1.DeQue<weight_T>();
        AscendC::LocalTensor<fmap_T> featureMapA2 = inQueueFmA2.AllocTensor<fmap_T>();
        AscendC::LocalTensor<weight_T> weightB2 = inQueueWeB2.AllocTensor<weight_T>();

        AscendC::InitConstValue(featureMapA2, {1, static_cast<uint16_t>(featureMapA2Size * sizeof(fmap_T) / 512), 0, 1});
        AscendC::InitConstValue(weightB2, { 1, static_cast<uint16_t>(weightB2Size * sizeof(weight_T) / 512), 0, 2});

        inQueueFmA2.EnQue<fmap_T>(featureMapA2);
        inQueueWeB2.EnQue<weight_T>(weightB2);
        inQueueFmA1.FreeTensor(featureMapA1);
        inQueueWeB1.FreeTensor(weightB1);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<fmap_T> featureMapA2 = inQueueFmA2.DeQue<fmap_T>();
        AscendC::LocalTensor<weight_T> weightB2 = inQueueWeB2.DeQue<weight_T>();
        AscendC::LocalTensor<dstCO1_T> dstCO1 = outQueueCO1.AllocTensor<dstCO1_T>();
        AscendC::LocalTensor<dstCO1_T> biasA1 = inQueueBiasA1.DeQue<dstCO1_T>();
        AscendC::Mmad(dstCO1, featureMapA2, weightB2, biasA1, { m, n, k, true, 0, false, false, false });

        outQueueCO1.EnQue<dstCO1_T>(dstCO1);
        inQueueFmA2.FreeTensor(featureMapA2);
        inQueueWeB2.FreeTensor(weightB2);
        inQueueBiasA1.FreeTensor(biasA1);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<dstCO1_T> dstCO1 = outQueueCO1.DeQue<dstCO1_T>();
        AscendC::FixpipeParamsV220 fixpipeParams;
        fixpipeParams.nSize = coutBlocks * 16;
        fixpipeParams.mSize = howo;
        fixpipeParams.srcStride = howo;
        fixpipeParams.dstStride = howo * AscendC::BLOCK_CUBE * sizeof(dst_T) / AscendC::ONE_BLK_SIZE;
        fixpipeParams.quantPre = deqMode;
        AscendC::Fixpipe<dst_T, dstCO1_T, AscendC::CFG_NZ>(dstGlobal, dstCO1, fixpipeParams);
        outQueueCO1.FreeTensor(dstCO1);
    }

private:
    AscendC::TPipe pipe;
    // feature map queue
    AscendC::TQue<AscendC::TPosition::A1, 1> inQueueFmA1;
    AscendC::TQue<AscendC::TPosition::A2, 1> inQueueFmA2;
    // weight queue
    AscendC::TQue<AscendC::TPosition::B1, 1> inQueueWeB1;
    AscendC::TQue<AscendC::TPosition::B2, 1> inQueueWeB2;
    // bias queue
    AscendC::TQue<AscendC::TPosition::A1, 1> inQueueBiasA1;
    // dst queue
    AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1;

    AscendC::GlobalTensor<fmap_T> fmGlobal;
    AscendC::GlobalTensor<weight_T> weGlobal;
    AscendC::GlobalTensor<dst_T> dstGlobal;
    AscendC::GlobalTensor<dstCO1_T> biasGlobal;

    uint16_t channelSize = 32;
    uint16_t H = 4, W = 4;
    uint8_t Kh = 2, Kw = 2;
    uint16_t Cout = 16;
    uint16_t C0, C1;
    uint8_t dilationH = 2, dilationW = 2;
    uint16_t coutBlocks, ho, wo, howo, howoRound;
    uint32_t featureMapA1Size, weightA1Size, featureMapA2Size, weightB2Size, biasSize, dstSize, dstCO1Size;
    uint16_t m, k, n;
    uint8_t fmRepeat, weRepeat;
    AscendC::QuantMode_t deqMode = AscendC::QuantMode_t::F322F16;
};

extern "C" __global__ __aicore__ void cube_mmad_simple_kernel(__gm__ uint8_t *fmGm, __gm__ uint8_t *weGm,
    __gm__ uint8_t *biasGm, __gm__ uint8_t *dstGm)
{
    KernelCubeMmad<half, half, half, half> op;
    op.Init(fmGm, weGm, biasGm, dstGm);
    op.Process();
}

Parent topic: Data Movement