Tail Core & Tail Block

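When inputs of different shapes are split, the data may fail to divide evenly across the cores and, at the same time, fail to divide evenly within each core. This scenario combines the tail-block handling from the even inter-core split case with the tail-core handling from the uneven inter-core split case, and covers both the tail block of a full core and the tail block of a tail core.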

Tiling Implementation

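Because the data in this scenario divides evenly neither across cores nor within a core, two member variables are added to the Tiling struct defined in the tail-core handling for the uneven inter-core split case: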

formerLastTileLength: the size of the last tile on a core that carries more data, that is, the tail block size of a full core.

To compute it, first split out the cores that carry more data, following the core-split strategy described in the tail-core Tiling.

// Align the shape up to a whole number of datablocks
uint32_t totalLengthAligned = ((totalLength + ALIGN_NUM - 1) / ALIGN_NUM) * ALIGN_NUM;
// Number of full cores (cores that receive one extra datablock)
uint32_t formerNum = (totalLengthAligned / ALIGN_NUM) % BLOCK_DIM;
// Amount of data on each full core, rounded up to a datablock boundary
uint32_t formerLength = ((totalLengthAligned / BLOCK_DIM + ALIGN_NUM - 1) / ALIGN_NUM) * ALIGN_NUM;

Then compute the tail block length, following the tile-split strategy described in the tail-block Tiling.

uint32_t formerTileNum = formerLength / ALIGN_NUM / UB_BLOCK_NUM;
if ((formerLength / ALIGN_NUM) % UB_BLOCK_NUM == 0 || formerTileNum == 0) {
    if (formerTileNum == 0) {
        formerTileNum = 1;
    }
    if (formerLength < UB_BLOCK_NUM * ALIGN_NUM) {
        formerTileLength = formerLength / ALIGN_NUM * ALIGN_NUM;
        formerLastTileLength = formerTileLength;
    } else {
        formerTileLength = UB_BLOCK_NUM * ALIGN_NUM;
        formerLastTileLength = formerTileLength;
    }
} else {
    formerTileNum = formerTileNum + 1;
    formerTileLength = UB_BLOCK_NUM * ALIGN_NUM;
    formerLastTileLength = (formerLength - (formerTileNum - 1) * formerTileLength);
}
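The tile-split branch above can be condensed into a small host-side helper. This is only a sketch under stated assumptions: the names TileSplit and SplitIntoTiles are hypothetical, and UB_BLOCK_NUM is taken to be the maximum number of datablocks that one tile may hold, as the formula implies. The helper keeps the same arithmetic as the code above.

```cpp
#include <cstdint>

// Hypothetical result type; not part of the original Tiling struct.
struct TileSplit {
    uint32_t tileNum;         // number of tiles on the core
    uint32_t tileLength;      // length of every tile except possibly the last
    uint32_t lastTileLength;  // length of the last tile (the tail block)
};

// length: elements assigned to one core; alignNum: elements per datablock;
// ubBlockNum: assumed maximum number of datablocks per tile.
inline TileSplit SplitIntoTiles(uint32_t length, uint32_t alignNum, uint32_t ubBlockNum)
{
    TileSplit t{};
    uint32_t blocks = length / alignNum;
    t.tileNum = blocks / ubBlockNum;
    if (blocks % ubBlockNum == 0 || t.tileNum == 0) {
        // Either the datablocks divide evenly into tiles, or everything fits in one tile.
        if (t.tileNum == 0) {
            t.tileNum = 1;
        }
        t.tileLength = (length < ubBlockNum * alignNum) ? blocks * alignNum
                                                        : ubBlockNum * alignNum;
        t.lastTileLength = t.tileLength;
    } else {
        // One extra, shorter tile holds the remainder.
        t.tileNum += 1;
        t.tileLength = ubBlockNum * alignNum;
        t.lastTileLength = length - (t.tileNum - 1) * t.tileLength;
    }
    return t;
}
```

For illustration, with a core length of 256 elements, 16 elements per datablock (half precision with 32-byte datablocks) and an arbitrarily chosen limit of 5 datablocks per tile, SplitIntoTiles(256, 16, 5) gives 4 tiles of length 80 with a tail block of 16.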

tailLastTileLength: the size of the last tile on a core that carries less data, that is, the tail block size of a tail core.

To compute it, first split out the cores that carry less data, following the core-split strategy described in the tail-core Tiling.

// Number of tail cores
uint32_t tailNum = BLOCK_DIM - formerNum;
// Amount of data on each tail core, rounded down to a datablock boundary
uint32_t tailLength = (totalLengthAligned / BLOCK_DIM / ALIGN_NUM) * ALIGN_NUM;
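A worked example may make the core split easier to follow. The sketch below wraps the formulas for formerNum, tailNum, formerLength and tailLength in a hypothetical helper (CoreSplit and SplitAcrossCores are illustrative names, not part of the original code), with ALIGN_NUM and BLOCK_DIM passed as parameters.

```cpp
#include <cstdint>

// Hypothetical container for the core-level split.
struct CoreSplit {
    uint32_t totalLengthAligned;
    uint32_t formerNum;     // number of full cores
    uint32_t tailNum;       // number of tail cores
    uint32_t formerLength;  // elements per full core
    uint32_t tailLength;    // elements per tail core
};

// alignNum: elements per datablock; blockDim: number of cores.
inline CoreSplit SplitAcrossCores(uint32_t totalLength, uint32_t alignNum, uint32_t blockDim)
{
    CoreSplit s{};
    s.totalLengthAligned = ((totalLength + alignNum - 1) / alignNum) * alignNum;
    s.formerNum = (s.totalLengthAligned / alignNum) % blockDim;
    s.tailNum = blockDim - s.formerNum;
    s.formerLength = ((s.totalLengthAligned / blockDim + alignNum - 1) / alignNum) * alignNum;
    s.tailLength = (s.totalLengthAligned / blockDim / alignNum) * alignNum;
    return s;
}
```

Assuming totalLength = 1999, 16 elements per datablock and 8 cores, the aligned length is 2000 (125 datablocks). Five full cores receive 256 elements each and three tail cores receive 240 each, and 5 * 256 + 3 * 240 = 2000 for this example.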

Then compute the tail block length, following the tile-split strategy described in the tail-block Tiling.

uint32_t tailTileNum = tailLength / ALIGN_NUM / UB_BLOCK_NUM;
if ((tailLength / ALIGN_NUM) % UB_BLOCK_NUM == 0 || tailTileNum == 0) {
    if (tailTileNum == 0) {
        tailTileNum = 1;
    }
    if (tailLength < UB_BLOCK_NUM * ALIGN_NUM) {
        tailTileLength = tailLength / ALIGN_NUM * ALIGN_NUM;
        tailLastTileLength = tailTileLength;
    } else {
        tailTileLength = UB_BLOCK_NUM * ALIGN_NUM;
        tailLastTileLength = tailTileLength;
    }
} else {
    tailTileNum = tailTileNum + 1;
    tailTileLength = UB_BLOCK_NUM * ALIGN_NUM;
    tailLastTileLength = (tailLength - (tailTileNum - 1) * tailTileLength);
}
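Putting the two new members next to the existing ones, the Tiling data for this scenario might look like the sketch below. The layout is an assumption inferred from the fields that the Init function reads; the real AddCustomTilingData definition may differ, and both TilingSketch and IsConsistent are hypothetical names introduced only for illustration.

```cpp
#include <cstdint>

// Hypothetical layout inferred from the fields used by Init.
struct TilingSketch {
    uint32_t formerNum;             // number of full cores
    uint32_t tailNum;               // number of tail cores
    uint32_t formerLength;          // elements per full core
    uint32_t tailLength;            // elements per tail core
    uint32_t formerTileNum;         // tiles per full core
    uint32_t formerTileLength;      // tile length on a full core
    uint32_t formerLastTileLength;  // tail block length on a full core
    uint32_t tailTileNum;           // tiles per tail core
    uint32_t tailTileLength;        // tile length on a tail core
    uint32_t tailLastTileLength;    // tail block length on a tail core
};

// Returns true when the tiles of each core type add up to that core's length
// and all cores together add up to totalLengthAligned.
inline bool IsConsistent(const TilingSketch &t, uint32_t totalLengthAligned)
{
    bool formerOk =
        (t.formerTileNum - 1) * t.formerTileLength + t.formerLastTileLength == t.formerLength;
    bool tailOk =
        (t.tailTileNum - 1) * t.tailTileLength + t.tailLastTileLength == t.tailLength;
    bool totalOk =
        t.formerNum * t.formerLength + t.tailNum * t.tailLength == totalLengthAligned;
    return formerOk && tailOk && totalOk;
}
```

A check of this kind on the host side can catch a Tiling whose tail block lengths do not match the per-core lengths before the kernel is launched.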

Operator Class Implementation

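The Kernel-side Init and Process functions need to combine the implementations of tail-block handling for the even inter-core split case and tail-core handling for the uneven inter-core split case.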

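In the Init function, full cores and tail cores have different tileLength and lastTileLength values, so they must be handled separately, as described in the tail-core handling for the uneven inter-core split case.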

The Init function is implemented as follows:

__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, AddCustomTilingData tiling)
{
    // Block indices [0, formerNum) are full cores; the remaining blocks are tail cores.
    if (AscendC::GetBlockIdx() < tiling.formerNum) {
        this->tileNum = tiling.formerTileNum;
        this->tileLength = tiling.formerTileLength;
        this->lastTileLength = tiling.formerLastTileLength;

        xGm.SetGlobalBuffer((__gm__ half *)x + tiling.formerLength * AscendC::GetBlockIdx(),
            tiling.formerLength);
        yGm.SetGlobalBuffer((__gm__ half *)y + tiling.formerLength * AscendC::GetBlockIdx(),
            tiling.formerLength);
        zGm.SetGlobalBuffer((__gm__ half *)z + tiling.formerLength * AscendC::GetBlockIdx(),
            tiling.formerLength);
    } else {
        this->tileNum = tiling.tailTileNum;
        this->tileLength = tiling.tailTileLength;
        this->lastTileLength = tiling.tailLastTileLength;

        xGm.SetGlobalBuffer((__gm__ half *)x + tiling.formerLength * tiling.formerNum +
            tiling.tailLength * (AscendC::GetBlockIdx() - tiling.formerNum), tiling.tailLength);
        yGm.SetGlobalBuffer((__gm__ half *)y + tiling.formerLength * tiling.formerNum +
            tiling.tailLength * (AscendC::GetBlockIdx() - tiling.formerNum), tiling.tailLength);
        zGm.SetGlobalBuffer((__gm__ half *)z + tiling.formerLength * tiling.formerNum +
            tiling.tailLength * (AscendC::GetBlockIdx() - tiling.formerNum), tiling.tailLength);
    }

    pipe.InitBuffer(inQueueX, 1, this->tileLength * sizeof(half));
    pipe.InitBuffer(inQueueY, 1, this->tileLength * sizeof(half));
    pipe.InitBuffer(outQueueZ, 1, this->tileLength * sizeof(half));
}
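The global-memory offset that Init passes to SetGlobalBuffer can be checked on the host with a small helper. This is a sketch only: CoreOffset is a hypothetical name, and it mirrors the offset arithmetic of the two branches above without calling any Ascend C API.

```cpp
#include <cstdint>

// Hypothetical host-side helper mirroring the offset arithmetic in Init.
// Returns the element offset at which the data of core blockIdx starts.
inline uint32_t CoreOffset(uint32_t blockIdx, uint32_t formerNum,
                           uint32_t formerLength, uint32_t tailLength)
{
    if (blockIdx < formerNum) {
        // Full cores are laid out first, back to back.
        return formerLength * blockIdx;
    }
    // Tail cores start after all full cores.
    return formerLength * formerNum + tailLength * (blockIdx - formerNum);
}
```

With 5 full cores of 256 elements and 3 tail cores of 240 elements, core 4 starts at offset 1024, core 5 (the first tail core) at 1280, and core 7 at 1760, so the last core ends at 1760 + 240 = 2000.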

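In CopyIn and CopyOut, whole tiles and the tail block are handled as in the tail-block handling for the even inter-core split case: the tail block is processed as a separate case.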

The CopyIn function is implemented as follows:

__aicore__ inline void CopyIn(int32_t progress)
{
    AscendC::LocalTensor<dataType> xLocal = inQueueX.AllocTensor<dataType>();
    AscendC::LocalTensor<dataType> yLocal = inQueueY.AllocTensor<dataType>();
    // The final iteration copies the tail block, whose length may differ from tileLength.
    if (progress == (this->tileNum * BUFFER_NUM - 1)) {
        AscendC::DataCopy(xLocal, xGm[progress * this->tileLength],
            this->lastTileLength);
        AscendC::DataCopy(yLocal, yGm[progress * this->tileLength],
            this->lastTileLength);
    } else {
        AscendC::DataCopy(xLocal, xGm[progress * this->tileLength], this->tileLength);
        AscendC::DataCopy(yLocal, yGm[progress * this->tileLength], this->tileLength);
    }
    inQueueX.EnQue(xLocal);
    inQueueY.EnQue(yLocal);
}

The CopyOut function is implemented as follows:

__aicore__ inline void CopyOut(int32_t progress)
{
    AscendC::LocalTensor<dataType> zLocal = outQueueZ.DeQue<dataType>();
    // The final iteration writes back the tail block, whose length may differ from tileLength.
    if (progress == (this->tileNum * BUFFER_NUM - 1)) {
        AscendC::DataCopy(zGm[progress * this->tileLength], zLocal,
            this->lastTileLength);
    } else {
        AscendC::DataCopy(zGm[progress * this->tileLength], zLocal, this->tileLength);
    }
    outQueueZ.FreeTensor(zLocal);
}
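To see which range of a core's data each iteration touches, the loop that drives CopyIn and CopyOut can be simulated on the host. The sketch below assumes BUFFER_NUM is 1 (consistent with the buffer count passed to InitBuffer in Init) and that progress runs from 0 to tileNum - 1; TileRanges is a hypothetical helper, not part of the operator.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Returns the (offset, length) pair copied in each iteration,
// assuming BUFFER_NUM == 1 so that there are exactly tileNum iterations.
inline std::vector<std::pair<uint32_t, uint32_t>> TileRanges(
    uint32_t tileNum, uint32_t tileLength, uint32_t lastTileLength)
{
    std::vector<std::pair<uint32_t, uint32_t>> ranges;
    for (uint32_t progress = 0; progress < tileNum; ++progress) {
        // Same condition as in CopyIn and CopyOut: the last iteration moves the tail block.
        uint32_t length = (progress == tileNum - 1) ? lastTileLength : tileLength;
        ranges.emplace_back(progress * tileLength, length);
    }
    return ranges;
}
```

For a full core with tileNum = 4, tileLength = 80 and lastTileLength = 16, the ranges are (0, 80), (80, 80), (160, 80) and (240, 16), which together cover the 256 elements assigned to that core.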
