Handling Without DataCopyPad

Atlas inference series products do not provide the DataCopyPad API, so non-aligned copy-in and copy-out scenarios must be handled explicitly. The scenarios and their handling methods are as follows:

  • Copying rows of non-aligned length from Global memory to Local memory, which leaves redundant data at the end of each row in Local memory
    • Let the redundant data participate in the computation

      As shown in the figure below, Abs is computed on the first 11 half elements. The redundant data may participate in the computation without affecting the final result. This approach is mainly used for elementwise computation. The steps are:

      1. Use DataCopy to copy 16 half elements to Local memory.
      2. Apply Abs to the whole block directly; there is no need to compute the tail block size.
      Figure 1 Redundant data participates in the computation
    • Use a mask to exclude the dirty data, typically for axis reduction computations

      As shown in the figure below, ReduceMin is computed on the first 4 half elements. The redundant data after the valid data must not participate in the computation; it can be masked out by setting the mask when calling the ReduceMin API. The steps are:

      1. Use DataCopy to copy 16 half elements to Local memory.
      2. Clear the destination operand DstLocal of the reduction to zero, for example with Duplicate.
      3. Perform the reduction, setting the ReduceMin mask mode so that only the first 4 elements are valid, which excludes the redundant data region from the computation.
      Figure 2 Redundant data does not participate in the computation
    • After copying into Local memory, call the high-level Duplicate API row by row to fill the dirty data positions with 0

      As shown in the figure below, the non-aligned data is cleared row by row with Duplicate after the copy-in. The steps are:

      1. Use DataCopy to copy 16 half elements to Local memory.
      2. Use the high-level Duplicate API and set the mask as follows, so that only the last 5 element positions are valid and the redundant data is filled with 0.
      uint64_t mask0 = ((uint64_t)1 << 16) - ((uint64_t)1 << 11); 
      uint64_t mask[2] = {mask0, 0};
      Figure 3 Fill 0 values row by row
  • Copying non-aligned data from Global memory to Local memory: after copying in row by row, pad the dirty positions with 0

    As shown in the figure below, the dirty data in a 16*16 block in Local memory needs to be cleared. Clearing it row by row performs poorly, so Pad can be used to clear it in one call. The steps are:

    1. Copy the 16*6 data from GM into UB row by row; each row then holds 6 valid elements.
    2. As in scenario 2 of the Pad API, fill the dirty data positions with 0.
    Figure 4 Pad with the Pad API
  • Non-aligned copy from Local memory to Global memory, copy length greater than 32B
    • Local memory contains redundant data; if the valid data size is a multiple of 32B, use the UnPad API to remove the redundant data and copy the whole block out

      As shown in the figure below, Local memory holds a 16*16 block and the 16*6 valid data needs to be copied to Global memory. The steps are:

      1. The 16*6 valid data to be copied out is 32B-aligned, so the high-level UnPad API can be used to remove the redundant values.
      2. Use DataCopy to copy the contiguous 16*6 half elements to Global memory.
      Figure 5 Remove redundant values with UnPad
    • Handle with GatherMask

      As shown in the figure below, 19 half elements need to be copied out to Global memory, handled with GatherMask. The steps are:

      1. Copy the first 16 half elements (32B) in full to Global memory.
      2. Use the GatherMask API to gather the data at SrcLocal[4]~[19] into TmpLocal, which starts at an aligned address.
      3. Copy the gathered data (a multiple of 32B) from TmpLocal to Global memory.
      Figure 6 Borrowed copy with GatherMask
  • Non-aligned copy from Local memory to Global memory, copy length less than 32B

    As shown in the figure below, a block of data is copied out by multiple cores, each core copying out 4 elements (a minimal sketch of this copy-out follows the list).

    1. Clear the target Global memory completely, either on the host or on the kernel side by overwriting from UB.
    2. In this core's Local data, clear everything except the 4 valid elements to be copied out (using Duplicate).
    3. Copy to Global memory with atomic add; because the redundant data has already been cleared to 0, no data stomping occurs.
    Figure 7 Handle copy lengths less than 32B with atomic add
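
    As referenced above, the following is a minimal sketch of the Figure 7 copy-out core. It assumes kernel_operator.h is included, dstGlobal has already been cleared to 0, and the dirty tail of each 16-element half row in outputLocal has been zeroed with Duplicate; AtomicCopyOutSketch, rowNum, and validPerRow are illustrative names, not APIs or values from the examples below.

    // Sketch only: atomically copy out rowNum rows, each holding validPerRow (< 16)
    // valid half results stored in a zero-padded 16-element row of outputLocal.
    __aicore__ inline void AtomicCopyOutSketch(AscendC::GlobalTensor<half> &dstGlobal,
                                               AscendC::LocalTensor<half> &outputLocal,
                                               int32_t rowNum, int32_t validPerRow)
    {
        AscendC::SetAtomicAdd<half>();
        for (int32_t i = 0; i < rowNum; i++)
        {
            // each 32B (16 half) write lands the row's valid elements at their target
            // offset; the zeroed tail adds 0 to the neighbouring data, so nothing is stomped
            AscendC::DataCopy<half>(dstGlobal[i * validPerRow], outputLocal[i * 16], 16);
        }
        AscendC::SetAtomicNone();
    }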

Call Examples

  • The following code shows, for data larger than 32B, the approach of Figure 1 (redundant data participates in the computation) combined with the copy-out approach of Figure 6 (borrowed copy with GatherMask).
    #include "kernel_operator.h"
    
    constexpr int32_t BLOCK_BYTE_SIZE = 36;                            // equivalent to the definition of blockLen of DataCopyPad
    constexpr int32_t BLOCK_GROUP_NUM = 1;                             // equivalent to the definition of blockCount of DataCopyPad
    constexpr int32_t BLOCKLEN_CEIL = (BLOCK_BYTE_SIZE + 32 - 1) / 32 * 32 / sizeof(half); // round up with respect to 32 bytes
    constexpr int32_t BLOCK_ELEMENT_NUM = BLOCK_BYTE_SIZE / sizeof(half);
    constexpr int32_t USE_CORE_NUM = 8; // num of core used
    constexpr int32_t TILE_NUM = 16;    // split data into 16 tiles for each core
    constexpr int32_t BUFFER_NUM = 1;   // tensor num for each queue
    constexpr int32_t TOTAL_LENGTH = USE_CORE_NUM * TILE_NUM * BUFFER_NUM * BLOCK_GROUP_NUM * BLOCK_ELEMENT_NUM;
    constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;
    constexpr int32_t TILE_LENGTH = BLOCK_LENGTH / TILE_NUM / BUFFER_NUM;
    
    class KernelDataCopyPad
    {
    public:
        __aicore__ inline KernelDataCopyPad(){}
        __aicore__ inline void Init(GM_ADDR inputGM, GM_ADDR outputGM)
        {
            srcGlobal.SetGlobalBuffer((__gm__ half *)(inputGM) + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
            dstGlobal.SetGlobalBuffer((__gm__ half *)(outputGM) + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
            pipe.InitBuffer(inQueue, BUFFER_NUM, BLOCKLEN_CEIL * sizeof(half));
            pipe.InitBuffer(outQueue, BUFFER_NUM, BLOCKLEN_CEIL * sizeof(half));
            pipe.InitBuffer(outQueueTail, BUFFER_NUM, 32);
            pipe.InitBuffer(tmpPattern, 32);
        }
        __aicore__ inline void Process()
        {
            const int32_t loopCount = TILE_NUM * BUFFER_NUM;
            for (int32_t i = 0; i < loopCount; i++)
            {
                CopyIn(i);
                Compute(i);
                CopyOut(i);
            }
        }
    
    private:
        __aicore__ inline void CopyIn(int32_t progress) // GM->UB
        {
            AscendC::LocalTensor<half> inputLocal = inQueue.AllocTensor<half>();
            for (int i = 0; i < BLOCK_GROUP_NUM; i++)
            {
                uint32_t srcGM_idx = progress * TILE_LENGTH + BLOCK_ELEMENT_NUM * i;
                // each time copy 32 half elements to UB
                AscendC::DataCopy(inputLocal[BLOCKLEN_CEIL * i], srcGlobal[srcGM_idx], BLOCKLEN_CEIL);
            }
            inQueue.EnQue(inputLocal);
        }
        __aicore__ inline void Compute(int32_t progress)
        {
            AscendC::LocalTensor<half> outputLocal = outQueue.AllocTensor<half>();
            AscendC::LocalTensor<half> inputLocal = inQueue.DeQue<half>();
            uint32_t calLen = (TILE_LENGTH * sizeof(half) + 32 - 1) / 32 * 32 / sizeof(half);
            Abs(outputLocal, inputLocal, calLen); // main calculation
            event_t eventIDMTE3ToV = static_cast<event_t>(GetTPipePtr()->FetchEventID(AscendC::HardEvent::MTE3_V));
            AscendC::LocalTensor<uint16_t> bufPattern = tmpPattern.Get<uint16_t>();
            bufPattern.SetValue(0, 0b1111111111111100);                   // select the last 14 elements of the first 16 positions
            bufPattern.SetValue(1, 0b0000000000000011);                   // select the first 2 elements of the next 16 positions
            AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(eventIDMTE3ToV); // make sure bufPattern is set before GatherMask uses it
            AscendC::WaitFlag<AscendC::HardEvent::MTE3_V>(eventIDMTE3ToV);
            uint32_t mask = 128;
            uint64_t rsvdCnt = 0;
            AscendC::LocalTensor<half> tailLocal = outQueueTail.AllocTensor<half>();
            AscendC::GatherMask(tailLocal, outputLocal, bufPattern, true, mask, {1, 1, 8, 8}, rsvdCnt);
            outQueue.EnQue<half>(outputLocal);
            outQueueTail.EnQue<half>(tailLocal);
            inQueue.FreeTensor(inputLocal);
        }
        __aicore__ inline void CopyOut(int32_t progress)
        {
            AscendC::LocalTensor<half> outputLocal = outQueue.DeQue<half>();
            AscendC::LocalTensor<half> tailLocal = outQueueTail.DeQue<half>();
            uint32_t copyLenMain = TILE_LENGTH * sizeof(half) / 32 * 32 / sizeof(half);
            uint32_t offsetMain = progress * TILE_LENGTH;
            AscendC::DataCopy(dstGlobal[offsetMain], outputLocal, copyLenMain);
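            // tailLocal holds the 16 elements gathered from positions 2..17 of this tile, so
            // writing them at offsetMain + 2 rewrites positions 2..15 with identical values and
            // places the last 2 valid elements (positions 16 and 17) at their correct addresses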
            uint32_t tailLen = 32 / sizeof(half);
            uint32_t offsetTail = offsetMain + (TILE_LENGTH - tailLen);
            AscendC::DataCopy(dstGlobal[offsetTail], tailLocal, tailLen);
            outQueue.FreeTensor(outputLocal);
            outQueueTail.FreeTensor(tailLocal);
        }
    
    private:
        AscendC::GlobalTensor<half> srcGlobal;
        AscendC::GlobalTensor<half> dstGlobal;
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueue;
        AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueue, outQueueTail;
        AscendC::TBuf<AscendC::QuePosition::VECCALC> tmpPattern;
    };
    
    extern "C" __global__ __aicore__ void datacopypad_custom(GM_ADDR inputGM, GM_ADDR outputGM)
    {
        KernelDataCopyPad op;
        op.Init(inputGM, outputGM);
        op.Process();
    }

    The special handling of inputByteSize in main.cpp is shown below.

    uint32_t blockDim = 8;
    // 2304 is TOTAL_LENGTH, TOTAL_LENGTH = USE_CORE_NUM * TILE_NUM * BUFFER_NUM * BLOCK_GROUP_NUM * BLOCK_ELEMENT_NUM;
    // 2318 is TOTAL_LENGTH + (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM)
    // borrow the next (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM) elements of srcGM
    size_t inputByteSize = 2318 * sizeof(int16_t);
    size_t outputByteSize = 2304 * sizeof(int16_t);
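
    For reference, a self-contained sketch (with hypothetical constant names) of how the 2318 above follows from the kernel's constants:

    // Sketch only: derivation of the over-allocated input element count.
    constexpr int32_t kBlockByteSize   = 36;                                  // valid bytes per row
    constexpr int32_t kBlockElementNum = kBlockByteSize / 2;                  // 18 half elements per row
    constexpr int32_t kBlocklenCeil    = (kBlockByteSize + 31) / 32 * 32 / 2; // rounded up to 32 half elements
    constexpr int32_t kTotalLength     = 8 * 16 * 1 * 1 * kBlockElementNum;   // 2304 valid elements in total
    // the copy-in reads kBlocklenCeil elements per row, so the last row borrows
    // (kBlocklenCeil - kBlockElementNum) = 14 elements beyond the valid data in srcGM
    static_assert(kTotalLength + (kBlocklenCeil - kBlockElementNum) == 2318, "inputByteSize element count");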
  • The following code shows, for data smaller than 32B, clearing the dirty data in Local memory row by row as in Figure 3 (fill 0 values row by row), combined with the copy-out approach of Figure 7 (atomic add for copy lengths less than 32B).
    #include "kernel_operator.h"
    
    constexpr int32_t BLOCK_BYTE_SIZE = 22;                               // equivalent to the definition of blockLen of DataCopyPad
    constexpr int32_t BLOCK_GROUP_NUM = 15;                               // equivalent to the definition of blockCount of DataCopyPad
    constexpr int32_t BLOCK_ELEMENT_NUM = BLOCK_BYTE_SIZE / sizeof(half); // 11 valid half elements per row
    constexpr int32_t BLOCKLEN_CEIL = 32 / sizeof(half);                  // since BLOCK_BYTE_SIZE<32
    constexpr int32_t USE_CORE_NUM = 4;                                   // num of core used
    constexpr int32_t TILE_NUM = 1;
    constexpr int32_t BUFFER_NUM = 1;
    constexpr int32_t TOTAL_LENGTH = USE_CORE_NUM * TILE_NUM * BUFFER_NUM * BLOCK_GROUP_NUM * BLOCK_ELEMENT_NUM;
    constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;         // data length handled by each core
    constexpr int32_t TILE_LENGTH = BLOCK_LENGTH / TILE_NUM / BUFFER_NUM; // data length handled per loop iteration
    
    class KernelDataCopyPad
    {
    public:
        __aicore__ inline KernelDataCopyPad() {}
        __aicore__ inline void Init(GM_ADDR inputGM, GM_ADDR outputGM)
        {
            srcGlobal.SetGlobalBuffer((__gm__ half *)(inputGM) + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
            dstGlobal.SetGlobalBuffer((__gm__ half *)(outputGM) + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
            pipe.InitBuffer(inQueue, BUFFER_NUM, BLOCK_GROUP_NUM * BLOCKLEN_CEIL * sizeof(half));
            pipe.InitBuffer(outQueue, BUFFER_NUM, BLOCK_GROUP_NUM * BLOCKLEN_CEIL * sizeof(half));
            // 32-byte buffer holding one zeroed block
            pipe.InitBuffer(zeroQueue, BUFFER_NUM, 32);
        }
        __aicore__ inline void Process()
        {
            constexpr int32_t loopCount = TILE_NUM * BUFFER_NUM;
            for (int32_t i = 0; i < loopCount; i++)
            {
                CopyIn(i);
                Compute(i);
                CopyOut(i);
            }
        }
    
    private:
        __aicore__ inline void CopyIn(int32_t progress) // GM->UB
        {
            AscendC::LocalTensor<half> inputLocal = inQueue.AllocTensor<half>();
            for (int i = 0; i < BLOCK_GROUP_NUM; i++)
            {
                // each time copy 16 half elements to UB
                AscendC::DataCopy(inputLocal[i * BLOCKLEN_CEIL], srcGlobal[i * BLOCK_ELEMENT_NUM], BLOCKLEN_CEIL);
            }
            inQueue.EnQue(inputLocal);
        }
        __aicore__ inline void Compute(int32_t progress)
        {
            AscendC::LocalTensor<half> outputLocal = outQueue.AllocTensor<half>();
            AscendC::LocalTensor<half> inputLocal = inQueue.DeQue<half>();
            AscendC::LocalTensor<half> zeroTensor = zeroQueue.AllocTensor<half>();
            // use local zero tensor to clear dstGM
            constexpr uint32_t zeroLen = 32 / sizeof(half);
            AscendC::Duplicate<half>(zeroTensor, 0, zeroLen);
            constexpr uint32_t aligneElementSize = 32 / sizeof(half);
            uint32_t copyLen = BLOCK_ELEMENT_NUM * BLOCK_GROUP_NUM / aligneElementSize * aligneElementSize; // round down 165 -> 160
            zeroQueue.EnQue<half>(zeroTensor);
            zeroTensor = zeroQueue.DeQue<half>();
            // clear dstGM before doing calculations
            event_t eventIDMTE3ToV = static_cast<event_t>(GetTPipePtr()->FetchEventID(AscendC::HardEvent::MTE3_V));
            for (int i = 0; i < copyLen / zeroLen; i++)
            {
                AscendC::DataCopy<half>(dstGlobal[i * zeroLen], zeroTensor, zeroLen);
            }
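            // the write below covers elements 149..164 of this core's output, re-clearing part
            // of the aligned region and reaching the remaining unaligned elements 160..164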
            AscendC::DataCopy<half>(dstGlobal[BLOCK_ELEMENT_NUM * BLOCK_GROUP_NUM - BLOCKLEN_CEIL], zeroTensor, BLOCKLEN_CEIL);
            AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(eventIDMTE3ToV);
            AscendC::WaitFlag<AscendC::HardEvent::MTE3_V>(eventIDMTE3ToV);
            // mask mode: only the last 5 elements of each row are written by Duplicate
            uint64_t mask0 = (1ul << 16) - (1ul << BLOCK_ELEMENT_NUM);
            uint64_t mask[2] = {mask0, 0};
            for (int32_t i = 0; i < BLOCK_GROUP_NUM; i++)
            {
                AscendC::Duplicate<half>(inputLocal[i * BLOCKLEN_CEIL], 0, mask, 1, 1, 1); // clear dummy data on inputLocal
            }
            AscendC::Abs(outputLocal, inputLocal, BLOCKLEN_CEIL * BLOCK_GROUP_NUM);
            outQueue.EnQue<half>(outputLocal);
            inQueue.FreeTensor(inputLocal);
            zeroQueue.FreeTensor(zeroTensor);
        }
        __aicore__ inline void CopyOut(int32_t progress)
        {
            AscendC::LocalTensor<half> outputLocal = outQueue.DeQue<half>();
            AscendC::SetAtomicAdd<half>();
            for (int32_t i = 0; i < BLOCK_GROUP_NUM; i++)
            {
                AscendC::DataCopy<half>(dstGlobal[i * BLOCK_ELEMENT_NUM], outputLocal[i * BLOCKLEN_CEIL], BLOCKLEN_CEIL);
            }
            AscendC::SetAtomicNone();
            outQueue.FreeTensor(outputLocal);
        }
    
    private:
        AscendC::GlobalTensor<half> srcGlobal;
        AscendC::GlobalTensor<half> dstGlobal;
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueue;
        AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueue;
        AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> zeroQueue;
    };
    extern "C" __global__ __aicore__ void datacopypad_custom(GM_ADDR inputGM, GM_ADDR outputGM)
    {
        KernelDataCopyPad op;
        op.Init(inputGM, outputGM);
        op.Process();
    }

    The special handling of inputByteSize and outputByteSize in main.cpp is shown below.

    uint32_t blockDim = 4;
    //665 is TOTAL_LENGTH + (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM)
    //copy in borrow the next (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM) elements of srcGM
    size_t inputByteSize = 665 * sizeof(int16_t);
    //copy out atomic add extra (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM) zeros to dstGM
    size_t outputByteSize = 665 * sizeof(int16_t);
  • The following code shows, for data smaller than 32B, masking out the dirty data as in Figure 2, combined with the copy-out approach of Figure 7.
    #include "kernel_operator.h"
    
    constexpr int32_t BLOCK_BYTE_SIZE = 8; // equivalent to the definition of blockLen of DataCopyPad
    constexpr int32_t BLOCK_GROUP_NUM = 4; // equivalent to the definition of blockCount of DataCopyPad
    constexpr int32_t BLOCK_ELEMENT_NUM = BLOCK_BYTE_SIZE / sizeof(half);
    constexpr int32_t BLOCKLEN_CEIL = 32 / sizeof(half); // since BLOCK_BYTE_SIZE<32
    constexpr int32_t USE_CORE_NUM = 4;                  // num of core used
    constexpr int32_t TILE_NUM = 1;
    constexpr int32_t BUFFER_NUM = 1;
    constexpr int32_t TOTAL_LENGTH = USE_CORE_NUM * TILE_NUM * BUFFER_NUM * BLOCK_GROUP_NUM * BLOCK_ELEMENT_NUM;
    constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;         // data length handled by each core
    constexpr int32_t TILE_LENGTH = BLOCK_LENGTH / TILE_NUM / BUFFER_NUM; // data length handled per loop iteration
    
    class KernelDataCopyPad
    {
    public:
        __aicore__ inline KernelDataCopyPad() {}
        __aicore__ inline void Init(GM_ADDR inputGM, GM_ADDR outputGM)
        {
            srcGlobal.SetGlobalBuffer((__gm__ half *)(inputGM) + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
            dstGlobal.SetGlobalBuffer((__gm__ half *)(outputGM) + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
            pipe.InitBuffer(inQueue, BUFFER_NUM, BLOCK_GROUP_NUM * BLOCKLEN_CEIL * sizeof(half));
            pipe.InitBuffer(outQueue, BUFFER_NUM, BLOCK_GROUP_NUM * BLOCKLEN_CEIL * sizeof(half));
            pipe.InitBuffer(zeroQueue, BUFFER_NUM, 32);
            pipe.InitBuffer(workQueue, BUFFER_NUM, 32);
        }
        __aicore__ inline void Process()
        {
            // loop count for this core (BUFFER_NUM is 1 here, so no doubling for double buffering)
            const int32_t loopCount = TILE_NUM * BUFFER_NUM;
            // tiling strategy, pipeline parallel
            for (int32_t i = 0; i < loopCount; i++)
            {
                CopyIn(i);
                Compute(i);
                CopyOut(i);
            }
        }
    
    private:
        __aicore__ inline void CopyIn(int32_t progress)
        {
            AscendC::LocalTensor<half> inputLocal = inQueue.AllocTensor<half>();
            AscendC::LocalTensor<half> zeroTensor = zeroQueue.AllocTensor<half>();
            for (int i = 0; i < BLOCK_GROUP_NUM; i++)
            {
                AscendC::DataCopy(inputLocal[i * BLOCKLEN_CEIL], srcGlobal[i * BLOCK_ELEMENT_NUM], BLOCKLEN_CEIL); // each time copy 16 half elements to UB
            }
            inQueue.EnQue(inputLocal);
            zeroQueue.EnQue(zeroTensor);
        }
        __aicore__ inline void Compute(int32_t progress)
        {
            AscendC::LocalTensor<half> outputLocal = outQueue.AllocTensor<half>();
            AscendC::LocalTensor<half> workLocal = workQueue.AllocTensor<half>();
            AscendC::LocalTensor<half> inputLocal = inQueue.DeQue<half>();
            AscendC::LocalTensor<half> zeroTensor = zeroQueue.DeQue<half>();
            AscendC::Duplicate<half>(zeroTensor, 0, 32 / sizeof(half)); // set an all 0 tensor
            zeroQueue.EnQue(zeroTensor);
            zeroTensor = zeroQueue.DeQue<half>();
            // clear dstGM before doing calculations
            event_t eventIDMTE3ToV = static_cast<event_t>(GetTPipePtr()->FetchEventID(AscendC::HardEvent::MTE3_V));
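            // TILE_LENGTH here is 16 half (32B), which spans this core's entire output range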
            AscendC::DataCopy<half>(dstGlobal, zeroTensor, TILE_LENGTH);
            AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(eventIDMTE3ToV);
            AscendC::WaitFlag<AscendC::HardEvent::MTE3_V>(eventIDMTE3ToV);
            outQueue.EnQue<half>(outputLocal);
            outputLocal = outQueue.DeQue<half>();
            AscendC::Duplicate<half>(outputLocal, 0, BLOCK_GROUP_NUM * BLOCKLEN_CEIL);
            outQueue.EnQue<half>(outputLocal);
            outputLocal = outQueue.DeQue<half>();
            // mask mode: only the first 4 elements take part in the ReduceMin calculation
            uint64_t Mask0 = ((uint64_t)1 << BLOCK_ELEMENT_NUM) - 1;
            uint64_t Mask[2] = {Mask0, 0};
            // main calculation
            for (int i = 0; i < BLOCK_GROUP_NUM; i++)
            {
                AscendC::ReduceMin<half>(outputLocal[i * BLOCKLEN_CEIL], inputLocal[i * BLOCKLEN_CEIL], workLocal, Mask, 1, 8, false);
            }
            outQueue.EnQue<half>(outputLocal);
            inQueue.FreeTensor(inputLocal);
            workQueue.FreeTensor(workLocal);
            zeroQueue.FreeTensor(zeroTensor);
        }
        __aicore__ inline void CopyOut(int32_t progress)
        {
            AscendC::LocalTensor<half> outputLocal = outQueue.DeQue<half>();
            AscendC::SetAtomicAdd<half>();
            for (int i = 0; i < BLOCK_GROUP_NUM; i++)
            {
                AscendC::DataCopy<half>(dstGlobal[i * BLOCK_ELEMENT_NUM], outputLocal[i * BLOCKLEN_CEIL], BLOCKLEN_CEIL);
            }
            AscendC::SetAtomicNone();
            outQueue.FreeTensor(outputLocal);
        }
    
    private:
        AscendC::GlobalTensor<half> srcGlobal;
        AscendC::GlobalTensor<half> dstGlobal;
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueue;
        AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueue;
        AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> workQueue;
        AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> zeroQueue;
    };
    extern "C" __global__ __aicore__ void datacopypad_custom(GM_ADDR inputGM, GM_ADDR outputGM)
    {
        KernelDataCopyPad op;
        op.Init(inputGM, outputGM);
        op.Process();
    }

    The special handling of inputByteSize and outputByteSize in main.cpp is shown below.

    uint32_t blockDim = 4;
    //76 is TOTAL_LENGTH + (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM)
    //copy in borrow the next (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM) elements of srcGM
    size_t inputByteSize = 76 * sizeof(int16_t);
    //copy out atomic add extra (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM) zeros to dstGM
    size_t outputByteSize = 76 * sizeof(int16_t);
  • The following code shows, for data smaller than 32B, the approach of Figure 1 combined with the copy-out approach of Figure 5.
    #include "kernel_operator.h"
    #include "datacopypad_tiling.h"
    
    constexpr int32_t BLOCK_BYTE_SIZE = 28; // equivalent to the definition of blockLen of DataCopyPad
    constexpr int32_t BLOCK_GROUP_NUM = 16; // equivalent to the definition of blockCount of DataCopyPad
    constexpr int32_t BLOCK_ELEMENT_NUM = BLOCK_BYTE_SIZE / sizeof(half);
    constexpr int32_t BLOCKLEN_CEIL = 32 / sizeof(half); // since BLOCK_BYTE_SIZE<32
    constexpr int32_t USE_CORE_NUM = 8;                  // num of core used
    constexpr int32_t TILE_NUM = 8;                      // split data into 8 tiles for each core
    constexpr int32_t BUFFER_NUM = 2;                    // tensor num for each queue
    constexpr int32_t TOTAL_LENGTH = USE_CORE_NUM * TILE_NUM * BUFFER_NUM * BLOCK_GROUP_NUM * BLOCK_ELEMENT_NUM;
    constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;         // data length handled by each core
    constexpr int32_t TILE_LENGTH = BLOCK_LENGTH / TILE_NUM / BUFFER_NUM; // data length handled per loop iteration
    class KernelDataCopyPad
    {
    public:
        __aicore__ inline KernelDataCopyPad() {}
        __aicore__ inline void Init(GM_ADDR inputGM, GM_ADDR outputGM)
        {
            srcGlobal.SetGlobalBuffer((__gm__ half *)(inputGM) + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
            dstGlobal.SetGlobalBuffer((__gm__ half *)(outputGM) + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
            pipe.InitBuffer(inQueue, BUFFER_NUM, BLOCK_GROUP_NUM * BLOCKLEN_CEIL * sizeof(half));
            pipe.InitBuffer(outQueue, BUFFER_NUM, BLOCK_GROUP_NUM * BLOCKLEN_CEIL * sizeof(half));
        }
        __aicore__ inline void Process(DataCopyPadCustomTilingData &tiling)
        {
            const int32_t loopCount = TILE_NUM * BUFFER_NUM;
            for (int32_t i = 0; i < loopCount; i++)
            {
                CopyIn(i);
                Compute(i, tiling);
                CopyOut(i);
            }
        }
    
    private:
        __aicore__ inline void CopyIn(const int32_t progress)
        {
            AscendC::LocalTensor<half> inputLocal = inQueue.AllocTensor<half>();
            for (int32_t i = 0; i < BLOCK_GROUP_NUM; i++)
            {
                const uint32_t srcGmIdx = progress * BLOCK_ELEMENT_NUM * BLOCK_GROUP_NUM + BLOCK_ELEMENT_NUM * i;
                AscendC::DataCopy(inputLocal[BLOCKLEN_CEIL * i], srcGlobal[srcGmIdx], BLOCKLEN_CEIL);
            }
            inQueue.EnQue(inputLocal);
        }
        __aicore__ inline void Compute(const int32_t progress, DataCopyPadCustomTilingData &tiling)
        {
            AscendC::LocalTensor<half> inputLocal = inQueue.DeQue<half>();
            AscendC::LocalTensor<half> outputLocal = outQueue.AllocTensor<half>();
            AscendC::Abs(inputLocal, inputLocal, BLOCK_GROUP_NUM * BLOCKLEN_CEIL); // main calculation
            AscendC::UnPadParams unPadParams;
            unPadParams.rightPad = BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM; // delete 2 dummy half each row
            AscendC::UnPad<half>(outputLocal, inputLocal, unPadParams, tiling.unPadTiling);
            outQueue.EnQue<half>(outputLocal);
            inQueue.FreeTensor(inputLocal);
        }
        __aicore__ inline void CopyOut(const int32_t progress)
        {
            AscendC::LocalTensor<half> outputLocal = outQueue.DeQue<half>();
            AscendC::DataCopy(dstGlobal[progress * TILE_LENGTH], outputLocal, TILE_LENGTH);
            outQueue.FreeTensor(outputLocal);
        }
    
    private:
        AscendC::GlobalTensor<half> srcGlobal;
        AscendC::GlobalTensor<half> dstGlobal;
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueue;
        AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueue;
    };
    extern "C" __global__ __aicore__ void datacopypad_custom(GM_ADDR inputGM, GM_ADDR outputGM,
                                                             DataCopyPadCustomTilingData tiling)
    {
        KernelDataCopyPad op;
        op.Init(inputGM, outputGM);
        op.Process(tiling);
    }
    Register the tiling struct in tiling.h, as shown below.
    #include "kernel_tiling/kernel_tiling.h"
    struct DataCopyPadCustomTilingData {
        UnPadTiling unPadTiling;
    };
    int32_t GenerateTiling(const std::vector<int64_t> &shape, uint32_t &coreNum,DataCopyPadCustomTilingData &tiling);
    In tiling.cpp, the tiling parameters required by the kernel-side UnPad API are obtained from the input shape, the remaining space available for computation, and other information.
    #include "graph/tensor.h"
    #include "tiling/tiling_api.h"
    #include "tiling/platform/platform_ascendc.h"
    
    int32_t GenerateTiling(const std::vector<int64_t>& shape, uint32_t& coreNum, DataCopyPadCustomTilingData& tiling)
    {
        platform_ascendc::PlatformAscendC* ascendcPlatform = platform_ascendc::PlatformAscendCManager::GetInstance();
        coreNum = 8;
        ge::Shape srcShape(shape);
        uint32_t tmpMinSize, tmpMaxSize;
        GetUnPadMaxMinTmpSize(*ascendcPlatform, srcShape, sizeof(int16_t), tmpMaxSize, tmpMinSize);
        optiling::UnPadTiling unPadTiling;
        UnPadTilingFunc(srcShape, tmpMaxSize, sizeof(int16_t), unPadTiling);
        unPadTiling.SaveToBuffer(&(tiling.unPadTiling), sizeof(tiling.unPadTiling));
        return 0;
    }
    The special handling of inputByteSize in main.cpp is shown below. GenerateTiling is called to obtain the tiling, which is then passed to the kernel function and on to the UnPad API for the computation.
    const std::vector<int64_t> shape({ 16, 16 });
    DataCopyPadCustomTilingData tiling;
    uint32_t blockDim = 8;
    (void)GenerateTiling(shape, blockDim, tiling);
    //28674 is TOTAL_LENGTH + (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM)
    //28672 is TOTAL_LENGTH
    //copy in borrow the next (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM) elements of srcGM
    size_t inputByteSize = 28674 * sizeof(int16_t);
    size_t outputByteSize = 28672 * sizeof(int16_t); 
    ...
    ICPU_RUN_KF(datacopypad_custom, blockDim, inputGM, outputGM, tiling); 
  • The following code shows, for data smaller than 32B, the approach of Figure 4 combined with the copy-out approach of Figure 7.
    #include "kernel_operator.h"
    #include "datacopypad_tiling.h"
    
    constexpr int32_t BLOCK_BYTE_SIZE = 12; // equivalent to the definition of blockLen of DataCopyPad
    constexpr int32_t BLOCK_GROUP_NUM = 16; // equivalent to the definition of blockCount of DataCopyPad
    constexpr int32_t BLOCK_ELEMENT_NUM = BLOCK_BYTE_SIZE / sizeof(half);
    constexpr int32_t BLOCKLEN_CEIL = 32 / sizeof(half); // since BLOCK_BYTE_SIZE<32
    constexpr int32_t USE_CORE_NUM = 8;                  // num of core used
    constexpr int32_t TILE_NUM = 16;                     // split data into 16 tiles for each core
    constexpr int32_t BUFFER_NUM = 1;                    // tensor num for each queue
    constexpr int32_t TOTAL_LENGTH = USE_CORE_NUM * TILE_NUM * BUFFER_NUM * BLOCK_GROUP_NUM *
                                     BLOCK_ELEMENT_NUM;
    constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM; // data length handled by each core
    constexpr int32_t TILE_LENGTH = BLOCK_LENGTH / TILE_NUM / BUFFER_NUM;
    class KernelDataCopyPad
    {
    public:
        __aicore__ inline KernelDataCopyPad() {}
        __aicore__ inline void Init(GM_ADDR inputGM, GM_ADDR outputGM)
        {
            srcGlobal.SetGlobalBuffer((__gm__ half *)(inputGM) + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
            dstGlobal.SetGlobalBuffer((__gm__ half *)(outputGM) + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
            pipe.InitBuffer(inQueue, BUFFER_NUM, BLOCK_GROUP_NUM * BLOCKLEN_CEIL * sizeof(half));
            pipe.InitBuffer(outQueue, BUFFER_NUM, BLOCK_GROUP_NUM * BLOCKLEN_CEIL * sizeof(half));
            pipe.InitBuffer(zeroQueue, BUFFER_NUM, 32);
        }
        __aicore__ inline void Process(CopyInTilingData &tilingData)
        {
            const int32_t loopCount = TILE_NUM * BUFFER_NUM;
            for (int32_t i = 0; i < loopCount; i++)
            {
                CopyIn(i);
                Compute(i, tilingData);
                CopyOut(i);
            }
        }
    
    private:
        __aicore__ inline void CopyIn(int32_t progress)
        {
            AscendC::LocalTensor<half> inputLocal = inQueue.AllocTensor<half>();
            for (int32_t i = 0; i < BLOCK_GROUP_NUM; i++)
            {
                const uint32_t srcGmIdx = progress * TILE_LENGTH + BLOCK_ELEMENT_NUM * i;
                AscendC::DataCopy(inputLocal[BLOCKLEN_CEIL * i], srcGlobal[srcGmIdx], BLOCKLEN_CEIL);
            }
            inQueue.EnQue(inputLocal);
        }
        __aicore__ inline void Compute(int32_t progress, CopyInTilingData &tilingData)
        {
            AscendC::LocalTensor<half> inputLocal = inQueue.DeQue<half>();
            AscendC::LocalTensor<half> outputLocal = outQueue.AllocTensor<half>();
            AscendC::PadParams padParams;
            padParams.leftPad = 0;
            // fill the last (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM) = 10 elements of each row with 0
            padParams.rightPad = BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM;
            padParams.padValue = 0;
            AscendC::Pad<half>(outputLocal, inputLocal, padParams, tilingData.padtiling);
            AscendC::LocalTensor<half> zeroTensor = zeroQueue.AllocTensor<half>();
            outQueue.EnQue<half>(outputLocal);
            zeroQueue.EnQue<half>(zeroTensor);
            inQueue.FreeTensor(inputLocal);
        }
        __aicore__ inline void CopyOut(int32_t progress)
        {
            AscendC::LocalTensor<half> zeroTensor = zeroQueue.DeQue<half>();
            // set zeroTensor to all zeros before copying it to dstGM
            event_t eventIDVToMTE3 = static_cast<event_t>(GetTPipePtr()->FetchEventID(AscendC::HardEvent::V_MTE3));
            constexpr uint32_t zeroLen = 32 / sizeof(half);
            AscendC::Duplicate<half>(zeroTensor, 0, zeroLen); // set all 0 tensor
            AscendC::SetFlag<AscendC::HardEvent::V_MTE3>(eventIDVToMTE3);
            AscendC::WaitFlag<AscendC::HardEvent::V_MTE3>(eventIDVToMTE3);
            // clear dstGM before doing calculations
            event_t eventIDMTE3ToV = static_cast<event_t>(GetTPipePtr()->FetchEventID(AscendC::HardEvent::MTE3_V));
            constexpr uint32_t rowNum = TILE_LENGTH / zeroLen;
            for (int32_t i = 0; i < rowNum; i++)
            {
                AscendC::DataCopy<half>(dstGlobal[progress * TILE_LENGTH + i * zeroLen], zeroTensor, zeroLen);
            }
            AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(eventIDMTE3ToV);
            AscendC::WaitFlag<AscendC::HardEvent::MTE3_V>(eventIDMTE3ToV);
            AscendC::LocalTensor<half> outputLocal = outQueue.DeQue<half>();
            AscendC::Abs(outputLocal, outputLocal, BLOCK_GROUP_NUM * BLOCKLEN_CEIL); // main calculation
            outQueue.EnQue<half>(outputLocal);
            outputLocal = outQueue.DeQue<half>();
            AscendC::SetAtomicAdd<half>();
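            // each 32B block holds this row's 6 valid results followed by 10 zeros from Pad;
            // under atomic add the zeros contribute nothing to the neighbouring rows' data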
            for (int32_t i = 0; i < BLOCK_GROUP_NUM; i++)
            {
                const uint32_t dstGmIdx = progress * TILE_LENGTH + i * BLOCK_ELEMENT_NUM;
                AscendC::DataCopy<half>(dstGlobal[dstGmIdx], outputLocal[i * BLOCKLEN_CEIL], BLOCKLEN_CEIL);
            }
            AscendC::SetAtomicNone();
            outQueue.FreeTensor(outputLocal);
            zeroQueue.FreeTensor(zeroTensor);
        }
    
    private:
        AscendC::GlobalTensor<half> srcGlobal;
        AscendC::GlobalTensor<half> dstGlobal;
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueue;
        AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueue;
        AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> zeroQueue;
    };
    extern "C" __global__ __aicore__ void datacopypad_custom(GM_ADDR inputGM, GM_ADDR outputGM, CopyInTilingData tilingData)
    {
        KernelDataCopyPad op;
        op.Init(inputGM, outputGM);
        op.Process(tilingData);
    }
    Register the tiling struct in tiling.h, as shown below.
    #include "kernel_tiling/kernel_tiling.h"
    struct CopyInTilingData {
        PadTiling padtiling;
    };
    int32_t GenerateTiling(const std::vector<int64_t>& shapePad, const std::vector<int64_t>& shapeUsed, uint32_t& coreNum,
        CopyInTilingData& structTilingPad);
    The GenerateTiling function in tiling.cpp obtains the tiling parameters required by the kernel-side Pad API from the input shape, the remaining space available for computation, and other information.
    #include "graph/tensor.h"
    #include "tiling/tiling_api.h"
    #include "tiling/platform/platform_ascendc.h"
    
    int32_t GenerateTiling(const std::vector<int64_t> &shapePad, const std::vector<int64_t> &shapeUsed, uint32_t &coreNum,
        CopyInTilingData &structTilingPad)
    {
        coreNum = 8;
        ge::Shape srcShape(shapePad);
        ge::Shape oriSrcShape(shapeUsed);
        uint32_t tmpMinSize, tmpMaxSize;
        GetPadMaxMinTmpSize(srcShape, sizeof(int16_t), tmpMaxSize, tmpMinSize); 
        optiling::PadTiling padtiling;
        PadTilingFunc(srcShape, oriSrcShape, tmpMaxSize, sizeof(int16_t), padtiling);
        padtiling.SaveToBuffer(&(structTilingPad.padtiling), sizeof(structTilingPad.padtiling));
        return 0;
    }

    The special handling of inputByteSize and outputByteSize in main.cpp is shown below. GenerateTiling is called to obtain structTilingPad, which is then passed to the kernel function and on to the Pad API for the computation.

    const std::vector<int64_t> shapeUsed({16, 6}); //shape of valid data
    const std::vector<int64_t> shapePad({16, 16});  //original shape
    CopyInTilingData structTilingPad;
    uint32_t blockDim = 8;
    (void)GenerateTiling(shapePad, shapeUsed, blockDim, structTilingPad);
    //12298 is TOTAL_LENGTH + (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM)
    //copy in borrow the next (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM) elements of srcGM
    size_t inputByteSize = 12298 * sizeof(int16_t);
    //copy out atomic add extra (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM) zeros to dstGM
    size_t outputByteSize = 12298 * sizeof(int16_t);
    ...
    ICPU_RUN_KF(datacopypad_custom, blockDim, inputGM, outputGM, structTilingPad);