Using a Shared Temporary Buffer for Operators and High-Level APIs

[Priority] High

[Description] If a high-level API called by an operator, such as SoftMax, requires a temporary buffer, that buffer takes space away from the operator's other compute operations. As a result, less data can be moved in each transfer, and more transfers are needed. In this scenario, the operator and the high-level API can share one temporary buffer. Sharing increases the amount of data moved per transfer, reduces the number of transfers, and improves memory utilization.

[Negative Example]

The SoftMax high-level API requires a temporary buffer for its computation, and the operator allocates a separate temporary buffer for its other compute operations. The Unified Buffer (UB) space is fixed. Assume that 64 KB of temporary space can be allocated to SoftMax and Add, and that the temporary buffer tmpSoftmaxBuf required by SoftMax occupies 32 KB. In this case, at most 32 KB remains for tmpSumBuf, which backs the LocalTensor tmpSumTensor that stores the Add results. If src0Tensor has 512 KB of data to process, 16 (512/32) movements are required.

...
constexpr int32_t blockLen = 32 * 1024;
TBuf<TPosition::VECCALC> tmpSoftmaxBuf;
pipe.InitBuffer(tmpSoftmaxBuf, softmaxBufSize * sizeof(uint8_t));  // Allocate a separate 32 KB temporary buffer for SoftMax.
TBuf<TPosition::VECCALC> tmpSumBuf;
// Allocate a separate temporary buffer for Add.
// Constraint: softmaxBufSize * sizeof(uint8_t) + sumBufSize * sizeof(T) <= 64 KB.
pipe.InitBuffer(tmpSumBuf, sumBufSize * sizeof(T));
...
for (int i = 0; i < 16; i++) {
    ...
    LocalTensor<uint8_t> tmpSoftmaxTensor = tmpSoftmaxBuf.Get<uint8_t>(softmaxBufSize);
    SoftMax<T, true, true>(dstTensor, expSumTensor, dstMaxTensor, srcTensor, tmpSoftmaxTensor, tiling);
    ...
    DataCopy(src0Tensor, src0Gm[i * blockLen / sizeof(T)], Params);
    ...
    LocalTensor<T> tmpSumTensor = tmpSumBuf.Get<T>(sumBufSize);
    Add<T>(tmpSumTensor, src0Tensor, src1Tensor, count);
    ...
}
...
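
The arithmetic behind the 16 movements can be checked with a small host-side sketch. This is plain C++ rather than Ascend C kernel code, and the helper MovementCount and the constants kTmpBudget, kSoftmaxTmp, kSumTmp, and kTotalBytes are hypothetical names introduced only for this illustration. It assumes the sizes given above: a 64 KB temporary budget, 32 KB reserved for SoftMax, and 512 KB of data.

```cpp
#include <cstdint>

// Hypothetical helper (not part of any Ascend C API): number of movements
// needed to process totalBytes when each movement carries at most
// bytesPerMove bytes. The division rounds up so a partial block still counts.
constexpr uint32_t MovementCount(uint32_t totalBytes, uint32_t bytesPerMove)
{
    return (totalBytes + bytesPerMove - 1) / bytesPerMove;
}

// Sizes assumed from the negative example.
constexpr uint32_t kTmpBudget  = 64 * 1024;   // temporary space for SoftMax and Add
constexpr uint32_t kSoftmaxTmp = 32 * 1024;   // reserved for the SoftMax temporary buffer
constexpr uint32_t kSumTmp     = kTmpBudget - kSoftmaxTmp;  // what remains for Add results
constexpr uint32_t kTotalBytes = 512 * 1024;  // data to process from src0Tensor

// With separate buffers, only 32 KB can be moved at a time: 512 / 32 = 16.
static_assert(kSumTmp == 32 * 1024, "Add gets the remaining 32 KB");
static_assert(MovementCount(kTotalBytes, kSumTmp) == 16, "16 movements with separate buffers");
```

The loop bound of 16 in the snippet above corresponds to MovementCount(kTotalBytes, kSumTmp) under these assumed sizes.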

[Positive Example]

The SoftMax high-level API requires a temporary buffer, which the operator can also use for its other compute operations. Under the same assumptions as in the negative example, the full 64 KB is available for each movement, so only 8 (512/64) movements are required.

...
constexpr int32_t blockLen = 64 * 1024;
TBuf<TPosition::VECCALC> tmpSharedBuf;
pipe.InitBuffer(tmpSharedBuf, bufferSize); // Share a buffer. bufferSize = MAX(softmaxBufSize * sizeof(uint8_t), sumBufSize * sizeof(T)) <= 64KB
...
for (int i = 0; i < 8; i++) {
    ...
    LocalTensor<uint8_t> tmpSharedTensor = tmpSharedBuf.Get<uint8_t>(softmaxBufSize);
    SoftMax<T, true, true>(dstTensor, expSumTensor, dstMaxTensor, srcTensor, tmpSharedTensor, tiling);
    ...
    DataCopy(src0Tensor, src0Gm[i * blockLen / sizeof(T)], Params);
    ...
    LocalTensor<T> tmpSumTensor = tmpSharedBuf.Get<T>(sumBufSize);
    Add<T>(tmpSumTensor, src0Tensor, src1Tensor, count);
    ...
}
...
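
The sizing rule in the InitBuffer comment, bufferSize = MAX(softmaxBufSize * sizeof(uint8_t), sumBufSize * sizeof(T)), can be sketched on the host side as follows. This is plain C++ rather than kernel code. The helpers MaxBytes and SharedTmpBytes and the constants kUbTmpBudget and kSharedBytes are hypothetical names for this illustration, and the element type float and the element counts are assumptions chosen to match the 32 KB and 64 KB figures used in the text.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: larger of two byte counts (written as a single
// expression so it also compiles as constexpr under C++11).
constexpr std::size_t MaxBytes(std::size_t a, std::size_t b)
{
    return a > b ? a : b;
}

// Hypothetical helper: bytes needed by a buffer shared between SoftMax
// (softmaxBufSize elements of uint8_t) and Add (sumBufSize elements of T).
// The shared buffer only has to fit the larger of the two users.
template <typename T>
constexpr std::size_t SharedTmpBytes(std::size_t softmaxBufSize, std::size_t sumBufSize)
{
    return MaxBytes(softmaxBufSize * sizeof(std::uint8_t), sumBufSize * sizeof(T));
}

// Assumed sizes: 64 KB temporary budget, 32 KB SoftMax scratch space,
// and 16 * 1024 float elements (64 KB) for the Add results.
constexpr std::size_t kUbTmpBudget = 64 * 1024;
constexpr std::size_t kSharedBytes = SharedTmpBytes<float>(32 * 1024, 16 * 1024);

static_assert(kSharedBytes == 64 * 1024, "the larger user determines the shared size");
static_assert(kSharedBytes <= kUbTmpBudget, "the shared buffer fits the temporary budget");
static_assert((512 * 1024) / kSharedBytes == 8, "512 KB in 64 KB blocks takes 8 movements");
```

Sharing in this way presumably relies on the two uses not overlapping in time: the Add results held in the shared buffer would need to be consumed before SoftMax next uses the same space as scratch memory.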
