Sharing a Temporary Buffer Between Operators and High-Level APIs
[Priority] High
[Description] When an operator calls a high-level API that requires a temporary buffer, that buffer must be allocated for the API. If the buffer resides in the Unified Buffer (UB), it reduces the UB space left for the operator's other computations, so less data can be transferred per computation and more transfers are needed. In this scenario, share the temporary buffer between the high-level API and the operator's other computations. This frees UB space, increases the amount of data transferred at a time, reduces the number of transfers, and improves memory utilization.
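The effect on the transfer count can be sketched with plain host-side C++ arithmetic. This is a minimal illustration, not Ascend C code: the helper names (TransferCount, DataBytesSeparate, DataBytesShared) are hypothetical, and the simple model (UB budget minus temporary-buffer space equals space for data) is an assumption made only for this example.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: number of transfers needed to move totalBytes when each
// transfer carries at most bytesPerTransfer (ceiling division).
// Precondition: bytesPerTransfer > 0.
constexpr std::uint32_t TransferCount(std::uint32_t totalBytes, std::uint32_t bytesPerTransfer)
{
    return (totalBytes + bytesPerTransfer - 1) / bytesPerTransfer;
}

// UB bytes left for data when each consumer owns a separate temporary buffer.
// Returns 0 if the two buffers do not fit into the budget.
constexpr std::uint32_t DataBytesSeparate(std::uint32_t ubBytes, std::uint32_t tmpA, std::uint32_t tmpB)
{
    return (tmpA + tmpB < ubBytes) ? ubBytes - (tmpA + tmpB) : 0;
}

// UB bytes left for data when both consumers share one temporary buffer that is
// sized for the larger of the two requirements.
constexpr std::uint32_t DataBytesShared(std::uint32_t ubBytes, std::uint32_t tmpA, std::uint32_t tmpB)
{
    return (std::max(tmpA, tmpB) < ubBytes) ? ubBytes - std::max(tmpA, tmpB) : 0;
}
```

With made-up figures of a 192 KB budget, two 64 KB temporary buffers, and 1024 KB of data, the separate layout leaves 64 KB per transfer (16 transfers), while the shared layout leaves 128 KB per transfer (8 transfers). The real saving depends on the actual UB size and on the buffer sizes that the high-level API reports.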
[Negative Example]
...
TBuf<QuePosition::VECCALC> tmpSoftmaxBuf;
pipe.InitBuffer(tmpSoftmaxBuf, softmaxBufSize * sizeof(uint8_t)); // Allocate a separate 32 KB temporary buffer for Softmax.
TBuf<QuePosition::VECCALC> tmpSumBuf;
pipe.InitBuffer(tmpSumBuf, sumBufSize * sizeof(T)); // Allocate a separate temporary buffer for Add. softmaxBufSize * sizeof(uint8_t) + sumBufSize * sizeof(T) <= 64KB.
...
for (int i = 0; i < 16; i++) {
...
LocalTensor<uint8_t> tmpSoftmaxTensor = tmpSoftmaxBuf.Get<uint8_t>(softmaxBufSize);
SoftMax<T, true, true>(dstTensor, expSumTensor, dstMaxTensor, srcTensor, tmpSoftmaxTensor, tiling);
...
DataCopy(src0Tensor, src0Gm, Params);
...
LocalTensor<T> tmpSumTensor = tmpSumBuf.Get<T>(sumBufSize);
Add<T>(tmpSumTensor, src0Tensor, src1Tensor, count);
...
}
...
[Positive Example]
The SoftMax high-level API requires a temporary buffer, and the operator can reuse that same buffer for its other computations. Because no second temporary buffer occupies the UB, only 8 (512/64) transfers are required instead of the 16 in the negative example.
...
TBuf<QuePosition::VECCALC> tmpSharedBuf;
pipe.InitBuffer(tmpSharedBuf, bufferSize); // Share a buffer. bufferSize = MAX(softmaxBufSize * sizeof(uint8_t), sumBufSize * sizeof(T)) <= 64KB
...
for (int i = 0; i < 8; i++) {
...
LocalTensor<uint8_t> tmpSharedTensor = tmpSharedBuf.Get<uint8_t>(softmaxBufSize);
SoftMax<T, true, true>(dstTensor, expSumTensor, dstMaxTensor, srcTensor, tmpSharedTensor, tiling);
...
DataCopy(src0Tensor, src0Gm, Params);
...
LocalTensor<T> tmpSumTensor = tmpSharedBuf.Get<T>(sumBufSize);
Add<T>(tmpSumTensor, src0Tensor, src1Tensor, count);
...
}
...
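The bufferSize rule in the InitBuffer comment above, MAX(softmaxBufSize * sizeof(uint8_t), sumBufSize * sizeof(T)), can be sketched as a small host-side C++ helper. The names AlignUp, SharedTmpBytes, and kAssumedAlignBytes are hypothetical, and rounding the result up to a 32-byte boundary is an assumption added for this example. Check the alignment requirement of the target platform before relying on it.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Assumed alignment granularity for the shared buffer (illustrative only).
constexpr std::size_t kAssumedAlignBytes = 32;

// Round bytes up to the next multiple of align.
constexpr std::size_t AlignUp(std::size_t bytes, std::size_t align = kAssumedAlignBytes)
{
    return (bytes + align - 1) / align * align;
}

// Size in bytes of one buffer large enough for either consumer:
// softmaxBufSize elements of uint8_t, or sumBufSize elements of T.
template <typename T>
constexpr std::size_t SharedTmpBytes(std::size_t softmaxBufSize, std::size_t sumBufSize)
{
    return AlignUp(std::max(softmaxBufSize * sizeof(std::uint8_t), sumBufSize * sizeof(T)));
}
```

For example, SharedTmpBytes<float>(32768, 4096) yields 32768 bytes, because the SoftMax requirement (32768 bytes) exceeds the Add requirement (4096 * 4 = 16384 bytes). Sharing is safe in the positive example only because the two uses do not overlap in time: SoftMax has finished with the buffer before Add writes its result into it.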