Sharing a Temporary Buffer Between Operators and High-Level APIs

[Priority] High

[Description] If a high-level API called by an operator, such as SoftMax, requires a temporary buffer, that buffer takes up space the operator could otherwise use for its own computations. As a result, less data can be transferred for each computation, and more transfers are needed. In this scenario, the operator can share the temporary buffer with the high-level API. Sharing increases the amount of data moved per transfer, reduces the number of transfers, and improves memory utilization.

[Negative Example]

The high-level SoftMax API requires a temporary buffer, and the operator allocates a separate temporary buffer for its other computations. The Unified Buffer (UB) space is fixed. Assume that 64 KB of temporary space is available to SoftMax and Add in total, and that tmpSoftmaxBuf, the temporary buffer required by SoftMax, occupies 32 KB. In this case, at most 32 KB remains for the LocalTensor tmpSumTensor, which stores the results of Add. If src0Tensor must process 512 KB of data in total, 16 (512/32) transfers are required.
...
constexpr int32_t blockLen = 32 * 1024;
TBuf<TPosition::VECCALC> tmpSoftmaxBuf; 
pipe.InitBuffer(tmpSoftmaxBuf, softmaxBufSize * sizeof(uint8_t));  // Allocate a separate 32 KB temporary buffer for Softmax.
TBuf<TPosition::VECCALC> tmpSumBuf;
pipe.InitBuffer(tmpSumBuf, sumBufSize * sizeof(T)); // Allocate a separate temporary buffer for Add. softmaxBufSize * sizeof(uint8_t) + sumBufSize * sizeof(T) <= 64KB.
...
for (int i = 0; i < 16; i++) {
    ...
    LocalTensor<uint8_t> tmpSoftmaxTensor = tmpSoftmaxBuf.Get<uint8_t>(softmaxBufSize);
    SoftMax<T, true, true>(dstTensor, expSumTensor, dstMaxTensor, srcTensor, tmpSoftmaxTensor, tiling);
    ...
    DataCopy(src0Tensor, src0Gm[i * blockLen / sizeof(T)], Params);
    ...
    LocalTensor<T> tmpSumTensor = tmpSumBuf.Get<T>(sumBufSize);
    Add<T>(tmpSumTensor, src0Tensor, src1Tensor, count);
    ...
}
...

[Positive Example]

The high-level SoftMax API requires a temporary buffer, and the operator can reuse that same buffer for its other computations. Under the same assumptions as in the negative example, the full 64 KB is available to each computation in turn, so only 8 (512/64) transfers are required.

...
constexpr int32_t blockLen = 64 * 1024;
TBuf<TPosition::VECCALC> tmpSharedBuf;
pipe.InitBuffer(tmpSharedBuf, bufferSize); // Share a buffer. bufferSize = MAX(softmaxBufSize * sizeof(uint8_t), sumBufSize * sizeof(T)) <= 64KB
...
for (int i = 0; i < 8; i++) {
    ...
    LocalTensor<uint8_t> tmpSharedTensor = tmpSharedBuf.Get<uint8_t>(softmaxBufSize);
    SoftMax<T, true, true>(dstTensor, expSumTensor, dstMaxTensor, srcTensor, tmpSharedTensor, tiling);
    ...
    DataCopy(src0Tensor, src0Gm[i * blockLen / sizeof(T)], Params);
    ...
    LocalTensor<T> tmpSumTensor = tmpSharedBuf.Get<T>(sumBufSize);
    Add<T>(tmpSumTensor, src0Tensor, src1Tensor, count);
    ...
}
...
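
The transfer counts in the two examples come from a ceiling division of the total data size by the space left for Add. The following is a minimal host-side sketch of that arithmetic in plain C++, not Ascend C kernel code. The helper names TransferCount and SharedBufferBytes and the k-prefixed constants are hypothetical and introduced only for illustration; the sizes are the ones assumed in the examples above. The sketch also assumes that the SoftMax temporary data is no longer needed by the time Add writes to the shared buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of transfers needed to move totalBytes when
// each transfer carries at most bytesPerTransfer (ceiling division).
inline uint32_t TransferCount(uint32_t totalBytes, uint32_t bytesPerTransfer)
{
    assert(bytesPerTransfer > 0);
    return (totalBytes + bytesPerTransfer - 1) / bytesPerTransfer;
}

// Hypothetical helper: size of one buffer shared by two consumers that never
// need it at the same time. The buffer must fit the larger of the two demands.
inline uint32_t SharedBufferBytes(uint32_t softmaxTmpBytes, uint32_t addTmpBytes)
{
    return std::max(softmaxTmpBytes, addTmpBytes);
}

// Sizes assumed in the examples above.
constexpr uint32_t KB = 1024;
constexpr uint32_t kTotalBytes = 512 * KB;      // data processed by src0Tensor
constexpr uint32_t kTmpBudgetBytes = 64 * KB;   // temporary space for SoftMax and Add
constexpr uint32_t kSoftmaxTmpBytes = 32 * KB;  // temporary space required by SoftMax

// Separate buffers: Add only gets what SoftMax leaves over.
inline uint32_t SeparateBufferTransfers()
{
    return TransferCount(kTotalBytes, kTmpBudgetBytes - kSoftmaxTmpBytes);
}

// Shared buffer: Add can use the whole temporary budget.
inline uint32_t SharedBufferTransfers()
{
    return TransferCount(kTotalBytes, kTmpBudgetBytes);
}
```

With these assumed sizes, SeparateBufferTransfers() corresponds to the 16 loop iterations of the negative example and SharedBufferTransfers() to the 8 iterations of the positive example. The ceiling division also covers totals that are not an exact multiple of the block size, in which case the last transfer is partial.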