How to Use the Temporary Space on the Kernel

Kernel APIs often perform complex mathematical computation internally and need additional temporary space to store the intermediate variables generated during that computation. For most high-level APIs (cube computation, HCCL communication, and convolution computation are the exceptions), you can provide this temporary space in one of two ways: pass a buffer allocated in advance through the sharedTmpBuffer input parameter of the kernel API, or let the API framework allocate the space.
  • Pass the temporary space through the input parameter sharedTmpBuffer. The kernel API uses the passed tensor as its temporary space. You manage the sharedTmpBuffer space yourself and can reuse the buffer after the API call, so the buffer does not have to be allocated and released repeatedly. This improves flexibility and buffer utilization.
  • Allocate the temporary space through the API framework. You do not allocate the temporary space in the kernel yourself, but you must reserve room for it: when allocating your own memory, subtract the size of the reserved temporary space from the available space.
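
The reuse pattern behind the first method can be illustrated with a plain C++ analogy. This is only a sketch: MockApiA, MockApiB, and RunBoth are hypothetical stand-ins invented for illustration, not Ascend C kernel APIs, and std::vector stands in for a tensor. The point is that one scratch buffer is allocated once and handed to two successive calls.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for two kernel APIs that need scratch space.
// Each returns false when sharedTmp is too small for src.

// dst[i] = src[i] * src[i] + 1, staging the squares in sharedTmp.
bool MockApiA(std::vector<float>& dst, const std::vector<float>& src,
              std::vector<float>& sharedTmp) {
    if (sharedTmp.size() < src.size()) return false;
    dst.resize(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) sharedTmp[i] = src[i] * src[i];
    for (std::size_t i = 0; i < src.size(); ++i) dst[i] = sharedTmp[i] + 1.0f;
    return true;
}

// dst[i] = 2 * src[i] - 1, staging the doubled values in sharedTmp.
bool MockApiB(std::vector<float>& dst, const std::vector<float>& src,
              std::vector<float>& sharedTmp) {
    if (sharedTmp.size() < src.size()) return false;
    dst.resize(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) sharedTmp[i] = 2.0f * src[i];
    for (std::size_t i = 0; i < src.size(); ++i) dst[i] = sharedTmp[i] - 1.0f;
    return true;
}

// Allocates the scratch buffer once and passes it to both calls.
bool RunBoth(const std::vector<float>& src, std::size_t tmpElems,
             std::vector<float>& out) {
    std::vector<float> sharedTmp(tmpElems);  // single allocation
    std::vector<float> mid;
    if (!MockApiA(mid, src, sharedTmp)) return false;
    return MockApiB(out, mid, sharedTmp);    // same buffer, reused
}
```

With the second method, the caller would not own such a buffer at all and would only leave enough free space for it.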

With either method, you need to know the size (BufferSize) of the temporary space that the kernel API requires before you allocate the tensor space or reserve the temporary space. Each API category therefore provides a GetxxxMaxMinTmpSize API (xxx indicates the corresponding kernel API) that returns the size range of the space to be reserved. Call GetxxxMaxMinTmpSize on the host to obtain the maximum and minimum sizes of the temporary space, select a suitable size within this range, and pass it to the kernel as a tiling parameter.

  • For the API to work correctly, the temporary space that you reserve or allocate must not be smaller than the minimum size.
  • Between the minimum and the maximum, a larger temporary space can improve the compute performance of the API in the kernel to some extent. For better performance, reserve or allocate as much temporary space, up to the maximum, as the actual memory usage allows.
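
The two rules above can be sketched as a small host-side helper. SelectTmpBufferSize is a hypothetical name introduced here for illustration and is not part of any Ascend C header; it assumes minValue and maxValue come from a GetxxxMaxMinTmpSize call and availableBytes is the space you can still spare.

```cpp
#include <cstdint>

// Hypothetical helper: picks a temporary space size in [minValue, maxValue],
// limited by the bytes that are still available.
// Returns false when even the minimum does not fit (or the range is invalid).
bool SelectTmpBufferSize(uint32_t minValue, uint32_t maxValue,
                         uint64_t availableBytes, uint32_t& tmpSize) {
    if (minValue > maxValue || availableBytes < minValue) {
        return false;  // the API cannot work correctly with less than minValue
    }
    // Take the maximum when it fits; otherwise take everything that is left.
    tmpSize = availableBytes >= maxValue ? maxValue
                                         : static_cast<uint32_t>(availableBytes);
    return true;
}
```

Treating "less than the minimum" as a failure keeps an undersized buffer from reaching the kernel as a tiling parameter.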

Take the Asin API as an example:

// The data type T of operator inputs is half. For isReuseSource, pass the default value false.
auto shape_input = context->GetInputTensor(0)->GetOriginShape();    
std::vector<int64_t> srcDims = {shape_input.GetDim(0), shape_input.GetDim(1)};
uint32_t srcSize = 1;
for (auto dim : srcDims) {
    srcSize *= dim;
}
uint32_t typeSize = 2;
ge::Shape shape(srcDims);
uint32_t minValue = 0;
uint32_t maxValue = 0;
AscendC::GetAsinMaxMinTmpSize(shape, typeSize, false, maxValue, minValue);

auto platformInfo = context->GetPlatformInfo();
auto ascendcPlatform = platform_ascendc::PlatformAscendC(platformInfo);
uint64_t tailSize = 0; // Remaining UB (Unified Buffer) space, in bytes.
// This example assumes that the full UB space is available. In a real operator, subtract the UB space already in use from tailSize.
ascendcPlatform.GetCoreMemSize(platform_ascendc::CoreMemType::UB, tailSize);
// Use the maximum size if it fits; otherwise use the remaining space. The result must not be less than minValue.
auto tmpSize = tailSize >= maxValue ? maxValue : tailSize;

AsinCustomTilingData tiling;
tiling.set_tmpBufferSize(tmpSize); // Store the temporary space size in the tiling parameter.

In addition, most high-level APIs provide a GetxxxTmpBufferFactorSize API that returns maxLiveNodeCnt and extraBuf. maxLiveNodeCnt is the ratio of the temporary space to the space occupied by the data computed in a single computation. extraBuf is the size of the additional temporary space required by the kernel API. When the available space is fixed, maxLiveNodeCnt and extraBuf are used to compute the maximum number of elements that an operator can compute at a time.

Example:

  • The operator calls the Mean API. Given a reserved space of size currBuff (the total available space), call GetMeanTmpBufferFactorSize to obtain maxLiveNodeCnt and extraBuf, then compute the maximum number of elements in a single computation as follows:

    currentShapeSize = (currBuff - extraBuf) / maxLiveNodeCnt / typeSize

  • The operator calls two kernel APIs, KernelIntf1 and KernelIntf2. Call the GetXxxTmpBufferFactorSize API of each one (Xxx indicates the high-level API to be called) to obtain its maxLiveNodeCnt and extraBuf. Together with the existing temporary space (currBuff), compute the maximum number of elements in a single computation (currentShapeSize) for each API and take the smaller value:

    currentShapeSize1 = (currBuff - extraBuf1) / maxLiveNodeCnt1 / typeSize

    currentShapeSize2 = (currBuff - extraBuf2) / maxLiveNodeCnt2 / typeSize

    currentShapeSize = min(currentShapeSize1, currentShapeSize2)

Note that currBuff is the space available for API computation only. Space used for other purposes, such as the operator's input and output, must be excluded.
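
To make the arithmetic concrete, here is a minimal sketch of the formulas above as plain C++. MaxElementsPerPass, MaxElementsForBoth, and TmpFactor are hypothetical names, and the factor values in the usage note are made up for illustration rather than returned by a real GetxxxTmpBufferFactorSize call. The sketch also assumes that a maxLiveNodeCnt of 0 means the temporary space does not limit the element count, so the total element count is used instead.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical container for the two outputs of a GetxxxTmpBufferFactorSize call.
struct TmpFactor {
    uint32_t maxLiveNodeCnt;
    uint32_t extraBuf;
};

// (currBuff - extraBuf) / maxLiveNodeCnt / typeSize, with guards:
// returns 0 if currBuff cannot even cover extraBuf or typeSize is 0, and
// totalElements if maxLiveNodeCnt is 0 (assumed: no limit from temp space).
uint32_t MaxElementsPerPass(uint64_t currBuff, TmpFactor factor,
                            uint32_t typeSize, uint32_t totalElements) {
    if (typeSize == 0 || currBuff < factor.extraBuf) return 0;
    if (factor.maxLiveNodeCnt == 0) return totalElements;
    uint64_t n = (currBuff - factor.extraBuf) / factor.maxLiveNodeCnt / typeSize;
    return static_cast<uint32_t>(std::min<uint64_t>(n, UINT32_MAX));
}

// For two APIs sharing the same currBuff, the smaller result wins.
uint32_t MaxElementsForBoth(uint64_t currBuff, TmpFactor a, TmpFactor b,
                            uint32_t typeSize, uint32_t totalElements) {
    return std::min(MaxElementsPerPass(currBuff, a, typeSize, totalElements),
                    MaxElementsPerPass(currBuff, b, typeSize, totalElements));
}
```

For example, with currBuff = 65536 bytes and typeSize = 2, made-up factors of (maxLiveNodeCnt 2, extraBuf 0) and (maxLiveNodeCnt 3, extraBuf 256) give 16384 and 10880 elements respectively, so currentShapeSize is 10880.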

The following example assumes that both the Asin and Acos APIs need to be called in an operator:

// The input data type T of the operator is half.
auto shape_input = context->GetInputTensor(0)->GetOriginShape();
std::vector<int64_t> srcDims = { shape_input.GetDim(0), shape_input.GetDim(1) };
uint32_t srcSize = 1;
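uint32_t srcCurSize = 1;
uint32_t srcAsinCurSize = 0; // Maximum number of elements per Asin computation
uint32_t srcAcosCurSize = 0; // Maximum number of elements per Acos computation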
for (auto dim : srcDims) {
    srcSize *= dim;
}
uint32_t typeSize = 2;

auto platformInfo = context->GetPlatformInfo();
auto ascendcPlatform = platform_ascendc::PlatformAscendC(platformInfo);
uint64_t tailSize = 0; // remaining space of UB
ascendcPlatform.GetCoreMemSize(platform_ascendc::CoreMemType::UB, tailSize);

uint32_t asinMaxLiveNodeCount = 0;
uint32_t asinExtraBuf = 0;

uint32_t acosMaxLiveNodeCount = 0;
uint32_t acosExtraBuf = 0;

AscendC::GetAsinTmpBufferFactorSize(typeSize, asinMaxLiveNodeCount, asinExtraBuf);
AscendC::GetAcosTmpBufferFactorSize(typeSize, acosMaxLiveNodeCount, acosExtraBuf);
// tmp is what remains of UB after subtracting the input and output space of the called APIs.
// This example covers the input and output of the Asin and Acos APIs. The output of Asin is used as the input of Acos, so three src-sized spaces are required.
uint64_t ioSize = static_cast<uint64_t>(srcSize) * typeSize * 3;
assert(tailSize >= ioSize); // Otherwise the subtraction below would wrap around.
auto tmpSize = tailSize - ioSize;
assert(tmpSize >= asinExtraBuf);
assert(tmpSize >= acosExtraBuf);
// Calculate the maximum number of elements that the Asin API can compute at a time.
if (asinMaxLiveNodeCount != 0) {
    srcAsinCurSize = (tmpSize - asinExtraBuf) / asinMaxLiveNodeCount / typeSize;
} else {
    srcAsinCurSize = srcSize;
}
// Calculate the maximum number of elements that the Acos API can compute at a time.
if (acosMaxLiveNodeCount != 0) {
    srcAcosCurSize = (tmpSize - acosExtraBuf) / acosMaxLiveNodeCount / typeSize; 
} else {
    srcAcosCurSize = srcSize;
}
srcCurSize = std::min(srcAsinCurSize, srcAcosCurSize);

AsinCustomTilingData tiling;
tiling.set_srcCurSize(srcCurSize); // Store the maximum number of elements per computation in the tiling parameter.
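
Once srcCurSize is known, the total element count has to be processed in chunks of at most that size. The following is one possible way to split the work, shown as plain C++. PassPlan and SplitIntoPasses are hypothetical names introduced for illustration, and the sketch ignores any alignment requirements that a real kernel may impose on chunk sizes.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical description of how srcSize elements are split into passes.
struct PassPlan {
    uint32_t perPass;       // elements handled in each full pass
    uint32_t fullPasses;    // number of passes that handle perPass elements
    uint32_t tailElements;  // elements left for one final, shorter pass
};

// Splits srcSize elements into passes of at most srcCurSize elements.
// Returns an all-zero plan when there is nothing to do or nothing fits.
PassPlan SplitIntoPasses(uint32_t srcSize, uint32_t srcCurSize) {
    if (srcSize == 0 || srcCurSize == 0) return PassPlan{0, 0, 0};
    uint32_t perPass = std::min(srcSize, srcCurSize);
    return PassPlan{perPass, srcSize / perPass, srcSize % perPass};
}
```

For instance, 10000 elements with srcCurSize = 4096 would give two full passes of 4096 elements and a tail of 1808 elements.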