Using Shared Temporary Buffer for Operators and High-Level APIs
[Priority] High
[Description] If a high-level API used by an operator, such as SoftMax, requires a temporary buffer to be passed in, that buffer reduces the space available for the operator's other computations. As a result, less data can be transferred per computation and more transfers are needed. In this scenario, the temporary buffer space can be shared with the other computations to increase the amount of data transferred at a time and reduce the number of transfers, improving memory utilization.
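The space trade-off described above can be sketched in plain C++. This is only an illustration: the helper names and the budget passed in are hypothetical, not part of the Ascend C API, and the sketch assumes the two temporary buffers are never in use at the same time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an Ascend C API): bytes left for data tiles
// when SoftMax and Add each own a separate temporary buffer.
inline std::size_t FreeBytesSeparate(std::size_t budget,
                                     std::size_t softmaxTmp,
                                     std::size_t addTmp) {
    assert(softmaxTmp + addTmp <= budget);
    return budget - (softmaxTmp + addTmp);  // both buffers are reserved
}

// Hypothetical helper: bytes left when one buffer, sized to the larger
// of the two requirements, is reused by both computations in turn.
inline std::size_t FreeBytesShared(std::size_t budget,
                                   std::size_t softmaxTmp,
                                   std::size_t addTmp) {
    const std::size_t shared = std::max(softmaxTmp, addTmp);
    assert(shared <= budget);
    return budget - shared;  // only the larger requirement is reserved
}
```

For example, with an assumed budget of 192 KB and two 32 KB temporary buffers, separate allocation leaves 128 KB for data and sharing leaves 160 KB.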
[Negative Example]
    ...
    constexpr int32_t blockLen = 32 * 1024;
    TBuf<TPosition::VECCALC> tmpSoftmaxBuf;
    pipe.InitBuffer(tmpSoftmaxBuf, softmaxBufSize * sizeof(uint8_t)); // Allocate a separate 32 KB temporary buffer for Softmax.
    TBuf<TPosition::VECCALC> tmpSumBuf;
    // Allocate a separate temporary buffer for Add.
    // softmaxBufSize * sizeof(uint8_t) + sumBufSize * sizeof(T) <= 64KB.
    pipe.InitBuffer(tmpSumBuf, sumBufSize * sizeof(T));
    ...
    for (int i = 0; i < 16; i++) {
        ...
        LocalTensor<uint8_t> tmpSoftmaxTensor = tmpSoftmaxBuf.Get<uint8_t>(softmaxBufSize);
        SoftMax<T, true, true>(dstTensor, expSumTensor, dstMaxTensor, srcTensor, tmpSoftmaxTensor, tiling);
        ...
        DataCopy(src0Tensor, src0Gm[i * blockLen / sizeof(T)], Params);
        ...
        LocalTensor<T> tmpSumTensor = tmpSumBuf.Get<T>(sumBufSize);
        Add<T>(tmpSumTensor, src0Tensor, src1Tensor, count);
        ...
    }
    ...
[Positive Example]
The high-level SoftMax API requires a temporary buffer, which the operator can share with its other computations. With the buffer shared, blockLen grows from 32 KB to 64 KB, so the same 512 KB of data needs only 8 (512/64) transfers instead of the 16 (512/32) in the negative example.
    ...
    constexpr int32_t blockLen = 64 * 1024;
    TBuf<TPosition::VECCALC> tmpSharedBuf;
    // Share a buffer.
    // bufferSize = MAX(softmaxBufSize * sizeof(uint8_t), sumBufSize * sizeof(T)) <= 64KB
    pipe.InitBuffer(tmpSharedBuf, bufferSize);
    ...
    for (int i = 0; i < 8; i++) {
        ...
        LocalTensor<uint8_t> tmpSharedTensor = tmpSharedBuf.Get<uint8_t>(softmaxBufSize);
        SoftMax<T, true, true>(dstTensor, expSumTensor, dstMaxTensor, srcTensor, tmpSharedTensor, tiling);
        ...
        DataCopy(src0Tensor, src0Gm[i * blockLen / sizeof(T)], Params);
        ...
        LocalTensor<T> tmpSumTensor = tmpSharedBuf.Get<T>(sumBufSize);
        Add<T>(tmpSumTensor, src0Tensor, src1Tensor, count);
        ...
    }
    ...
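The loop bounds in the two examples follow from a simple ceiling division, sketched below in plain C++. The function name is hypothetical and the 512 KB total is inferred from the loop counts (16 x 32 KB and 8 x 64 KB); it is an illustration, not part of the operator code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of transfers needed to move totalBytes
// when each transfer carries at most blockBytes.
inline std::size_t TransferCount(std::size_t totalBytes, std::size_t blockBytes) {
    assert(blockBytes > 0);
    // Ceiling division so that a partial final block still counts as a transfer.
    return (totalBytes + blockBytes - 1) / blockBytes;
}
```

Under these assumptions, TransferCount(512 * 1024, 32 * 1024) gives the 16 iterations of the negative example, and TransferCount(512 * 1024, 64 * 1024) gives the 8 iterations of the positive example.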