Using Shared Temporary Buffer for Operators and High-Level APIs

[Priority] High

[Description] When an operator calls a high-level API that requires a temporary buffer, that buffer must be allocated for the API. If the buffer resides on the UB, it reduces the UB space available for the operator's other computations; the amount of data moved per transfer shrinks and the number of transfers grows. In this scenario, the temporary buffer can be shared between the high-level API and the other computations, increasing the data moved per transfer, reducing the transfer count, and improving memory utilization.

[Negative Example]

The SoftMax high-level API requires a temporary buffer for its computation, and the operator allocates a separate temporary buffer for its other computations. The UB space is fixed: assume 64 KB of temporary space can be allocated to SoftMax and Add in total, and the temporary buffer tmpSoftmaxBuf required by the SoftMax computation occupies 32 KB. At most 32 KB is then left for tmpSumBuf, which backs the LocalTensor storing the Add computation results. If src0Tensor must process 512 KB of data in total, 16 (512/32) transfers are required.
...
TBuf<QuePosition::VECCALC> tmpSoftmaxBuf;
pipe.InitBuffer(tmpSoftmaxBuf, softmaxBufSize * sizeof(uint8_t));  // Allocate a separate 32 KB temporary buffer for Softmax.
TBuf<QuePosition::VECCALC> tmpSumBuf;
pipe.InitBuffer(tmpSumBuf, sumBufSize * sizeof(T)); // Allocate a separate temporary buffer for Add. softmaxBufSize * sizeof(uint8_t) + sumBufSize * sizeof(T) <= 64KB.
...
for (int i = 0; i < 16; i++) {
    ...
    LocalTensor<uint8_t> tmpSoftmaxTensor = tmpSoftmaxBuf.Get<uint8_t>(softmaxBufSize);
    SoftMax<T, true, true>(dstTensor, expSumTensor, dstMaxTensor, srcTensor, tmpSoftmaxTensor, tiling);
    ...
    DataCopy(src0Tensor, src0Gm, Params);
    ...
    LocalTensor<T> tmpSumTensor = tmpSumBuf.Get<T>(sumBufSize);
    Add<T>(tmpSumTensor, src0Tensor, src1Tensor, count);
    ...
}
...

[Positive Example]

The temporary buffer required by the SoftMax computation can be shared with the operator's other computations. Under the preceding assumptions, each transfer can now use the full 64 KB, so only 8 (512/64) transfers are required.

...
TBuf<QuePosition::VECCALC> tmpSharedBuf;
pipe.InitBuffer(tmpSharedBuf, bufferSize); // Share a buffer. bufferSize = MAX(softmaxBufSize * sizeof(uint8_t), sumBufSize * sizeof(T)) <= 64KB
...
for (int i = 0; i < 8; i++) {
    ...
    LocalTensor<uint8_t> tmpSharedTensor = tmpSharedBuf.Get<uint8_t>(softmaxBufSize);
    SoftMax<T, true, true>(dstTensor, expSumTensor, dstMaxTensor, srcTensor, tmpSharedTensor, tiling);
    ...
    DataCopy(src0Tensor, src0Gm, Params);
    ...
    LocalTensor<T> tmpSumTensor = tmpSharedBuf.Get<T>(sumBufSize);
    Add<T>(tmpSumTensor, src0Tensor, src1Tensor, count);
    ...
}
...