Tail Core & Tail Block

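When input of different shapes is split, the data may not divide evenly across multiple cores, and the data within each core may not divide evenly into blocks either. Following the approaches in "Tail Block Handling with Even Inter-Core Split" and "Tail Core Handling with Uneven Inter-Core Split", combine the two to handle both the tail block of a full core and the tail block of a tail core.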

Tiling Implementation

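In this scenario the data divides evenly neither across cores nor within a core, so two member variables are added to the Tiling struct defined in "Tail Core Handling with Uneven Inter-Core Split".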
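
As a rough illustration of how such a Tiling struct could be filled on the host side, the sketch below splits the data across cores first and then splits each core's share into tiles. The field names from formerNum to tailLastTileLength follow those read by the Init function; judging from that code, the two added members are presumably formerLastTileLength and tailLastTileLength. The tailNum field, the helpers SplitCore and ComputeTiling, the alignment unit, and the maxTile cap are assumptions made for illustration and are not taken from the actual sample.

```cpp
#include <cassert>
#include <cstdint>

// Field names mirror those read by Init; tailNum and the field order are assumptions.
struct AddCustomTilingData {
    uint32_t formerNum;             // number of full cores
    uint32_t tailNum;               // number of tail cores (assumed field)
    uint32_t formerLength;          // elements handled by each full core
    uint32_t tailLength;            // elements handled by each tail core
    uint32_t formerTileNum;         // tiles per full core, including the last one
    uint32_t formerTileLength;      // elements per regular tile on a full core
    uint32_t formerLastTileLength;  // elements in the last tile of a full core
    uint32_t tailTileNum;           // tiles per tail core, including the last one
    uint32_t tailTileLength;        // elements per regular tile on a tail core
    uint32_t tailLastTileLength;    // elements in the last tile of a tail core
};

// Hypothetical helper: split `length` elements into tiles of at most `maxTile` elements.
inline void SplitCore(uint32_t length, uint32_t maxTile,
                      uint32_t &tileNum, uint32_t &tileLength, uint32_t &lastTileLength)
{
    tileLength = (length < maxTile) ? length : maxTile;
    tileNum = (length + tileLength - 1) / tileLength;       // round up
    lastTileLength = length - (tileNum - 1) * tileLength;   // remainder goes to the last tile
}

// Hypothetical helper: `align` is the alignment unit in elements
// (for example 16 half elements for 32 bytes); totalLength is assumed already aligned.
inline AddCustomTilingData ComputeTiling(uint32_t totalLength, uint32_t coreNum,
                                         uint32_t align, uint32_t maxTile)
{
    assert(coreNum > 0 && align > 0 && maxTile > 0);
    assert(totalLength % align == 0 && maxTile % align == 0);
    uint32_t units = totalLength / align;  // number of aligned units to distribute
    assert(units >= coreNum);              // every core receives at least one unit
    uint32_t base = units / coreNum;

    AddCustomTilingData t{};
    t.formerNum = units % coreNum;         // these cores take one extra unit each
    t.tailNum = coreNum - t.formerNum;
    t.formerLength = (base + 1) * align;   // unused when formerNum is 0
    t.tailLength = base * align;
    SplitCore(t.formerLength, maxTile,
              t.formerTileNum, t.formerTileLength, t.formerLastTileLength);
    SplitCore(t.tailLength, maxTile,
              t.tailTileNum, t.tailTileLength, t.tailLastTileLength);
    return t;
}
```

For example, ComputeTiling(1600, 8, 16, 128) in this sketch gives 4 full cores of 208 elements each (tiles of 128 and 80) and 4 tail cores of 192 elements each (tiles of 128 and 64).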

Operator Class Implementation

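On the kernel side, the Init and Process functions combine the implementations from "Tail Block Handling with Even Inter-Core Split" and "Tail Core Handling with Uneven Inter-Core Split".

In the Init function, full cores and tail cores have different tileLength and lastTileLength values, so they must be handled separately, as described in "Tail Core Handling with Uneven Inter-Core Split".

The Init function is implemented as follows: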

__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, AddCustomTilingData tiling)
{
    if (AscendC::GetBlockIdx() < tiling.formerNum) {
        // Full core: use the full-core tiling parameters.
        this->tileNum = tiling.formerTileNum;
        this->tileLength = tiling.formerTileLength;
        this->lastTileLength = tiling.formerLastTileLength;

        // Each full core starts formerLength elements after the previous one.
        xGm.SetGlobalBuffer((__gm__ half *)x + tiling.formerLength * AscendC::GetBlockIdx(),
            tiling.formerLength);
        yGm.SetGlobalBuffer((__gm__ half *)y + tiling.formerLength * AscendC::GetBlockIdx(),
            tiling.formerLength);
        zGm.SetGlobalBuffer((__gm__ half *)z + tiling.formerLength * AscendC::GetBlockIdx(),
            tiling.formerLength);
    } else {
        // Tail core: use the tail-core tiling parameters.
        this->tileNum = tiling.tailTileNum;
        this->tileLength = tiling.tailTileLength;
        this->lastTileLength = tiling.tailLastTileLength;

        // Skip the data of all full cores, then index among the tail cores.
        xGm.SetGlobalBuffer((__gm__ half *)x + tiling.formerLength * tiling.formerNum +
            tiling.tailLength * (AscendC::GetBlockIdx() - tiling.formerNum), tiling.tailLength);
        yGm.SetGlobalBuffer((__gm__ half *)y + tiling.formerLength * tiling.formerNum +
            tiling.tailLength * (AscendC::GetBlockIdx() - tiling.formerNum), tiling.tailLength);
        zGm.SetGlobalBuffer((__gm__ half *)z + tiling.formerLength * tiling.formerNum +
            tiling.tailLength * (AscendC::GetBlockIdx() - tiling.formerNum), tiling.tailLength);
    }

    // Size each queue buffer for one regular tile.
    pipe.InitBuffer(inQueueX, 1, this->tileLength * sizeof(half));
    pipe.InitBuffer(inQueueY, 1, this->tileLength * sizeof(half));
    pipe.InitBuffer(outQueueZ, 1, this->tileLength * sizeof(half));
}
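
The address arithmetic in the two SetGlobalBuffer branches can be checked on the host with a small sketch like the one below. The CoreSplit struct and the helpers CoreOffset and CoreLength are hypothetical names introduced only for this illustration; they reproduce the offset expressions from Init in plain C++ so that the per-core slices can be verified to be contiguous and non-overlapping.

```cpp
#include <cstdint>

// Hypothetical subset of the tiling fields that Init uses for addressing.
struct CoreSplit {
    uint32_t formerNum;     // number of full cores
    uint32_t formerLength;  // elements per full core
    uint32_t tailLength;    // elements per tail core
};

// Start offset (in elements) of the slice owned by blockIdx,
// mirroring the two SetGlobalBuffer branches in Init.
inline uint32_t CoreOffset(const CoreSplit &s, uint32_t blockIdx)
{
    if (blockIdx < s.formerNum) {
        return s.formerLength * blockIdx;
    }
    // Skip all full cores, then step through the tail cores.
    return s.formerLength * s.formerNum + s.tailLength * (blockIdx - s.formerNum);
}

// Number of elements owned by blockIdx.
inline uint32_t CoreLength(const CoreSplit &s, uint32_t blockIdx)
{
    return (blockIdx < s.formerNum) ? s.formerLength : s.tailLength;
}
```

With 4 full cores of 208 elements and 4 tail cores of 192 elements, core 4 (the first tail core) starts at offset 832 and core 7 ends at element 1600, so the eight slices tile the input exactly.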

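The CopyIn and CopyOut functions distinguish regular blocks from the tail block in the same way as in "Tail Block Handling with Even Inter-Core Split": the tail block is handled in a separate branch.

The CopyIn function is implemented as follows: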

__aicore__ inline void CopyIn(int32_t progress)
{
    AscendC::LocalTensor<dataType> xLocal = inQueueX.AllocTensor<dataType>();
    AscendC::LocalTensor<dataType> yLocal = inQueueY.AllocTensor<dataType>();
    // Last iteration: copy only lastTileLength elements (the tail block).
    if (progress == (this->tileNum * BUFFER_NUM - 1)) {
        AscendC::DataCopy(xLocal, xGm[progress * this->tileLength],
            this->lastTileLength);
        AscendC::DataCopy(yLocal, yGm[progress * this->tileLength],
            this->lastTileLength);
    } else {
        AscendC::DataCopy(xLocal, xGm[progress * this->tileLength], this->tileLength);
        AscendC::DataCopy(yLocal, yGm[progress * this->tileLength], this->tileLength);
    }
    inQueueX.EnQue(xLocal);
    inQueueY.EnQue(yLocal);
}

The CopyOut function is implemented as follows:

__aicore__ inline void CopyOut(int32_t progress)
{
    AscendC::LocalTensor<dataType> zLocal = outQueueZ.DeQue<dataType>();
    // Last iteration: write back only lastTileLength elements (the tail block).
    if (progress == (this->tileNum * BUFFER_NUM - 1)) {
        AscendC::DataCopy(zGm[progress * this->tileLength], zLocal,
            this->lastTileLength);
    } else {
        AscendC::DataCopy(zGm[progress * this->tileLength], zLocal, this->tileLength);
    }
    outQueueZ.FreeTensor(zLocal);
}
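
To see that the regular-block and tail-block branches together move exactly one core's share of the data, the per-iteration copy length can be modeled on the host as in the sketch below. TileCopyLength and CoreCopiedLength are hypothetical helpers, and loopCount stands in for the tileNum * BUFFER_NUM expression used in CopyIn and CopyOut; this is a plain C++ model of the branch logic, not kernel code.

```cpp
#include <cstdint>

// Elements moved at loop iteration `progress`: the last iteration moves the
// shorter tail block, every other iteration moves a regular tile.
// loopCount stands in for tileNum * BUFFER_NUM from the kernel code.
inline uint32_t TileCopyLength(uint32_t progress, uint32_t loopCount,
                               uint32_t tileLength, uint32_t lastTileLength)
{
    return (progress == loopCount - 1) ? lastTileLength : tileLength;
}

// Total elements moved by one core over all iterations.
inline uint32_t CoreCopiedLength(uint32_t loopCount, uint32_t tileLength,
                                 uint32_t lastTileLength)
{
    uint32_t total = 0;
    for (uint32_t i = 0; i < loopCount; ++i) {
        total += TileCopyLength(i, loopCount, tileLength, lastTileLength);
    }
    return total;
}
```

For a full core with two iterations, a tile length of 128, and a last tile of 80, the total is 208 elements; for a tail core with a last tile of 64, it is 192 elements.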