Asynchronous Model Inference

This section describes how to use the asynchronous inference APIs together with a callback function. Instead of waiting for all inferences to finish, you can deliver a callback task at a specified interval to obtain the asynchronous inference results of the preceding period.

API Call Sequence

If your app needs to synchronize on results in the asynchronous scenario, it must contain the corresponding code logic. The following figure shows the API call sequence.

Figure 1 Synchronization flowchart in the callback scenario

The key APIs are described as follows:

  1. Create a callback function in advance to obtain and process the model inference or operator execution result.
  2. Create a thread and define the thread function in advance. The thread function calls aclrtProcessReport with a timeout interval; the thread then executes the callback tasks that aclrtLaunchCallback delivers to it within that interval.
  3. Call the aclrtSubscribeReport API to specify the thread that handles the callback functions in the stream. This must be the thread created in step 2.
  4. Call aclmdlExecuteAsync to execute model inference asynchronously.

    Also call aclrtSynchronizeStream to wait until the tasks in the stream are complete.

    After the aclrtSynchronizeStream call, you can obtain the asynchronous inference results of all images at once. However, if the volume of image data is large, the wait can be long. In this case, deliver a callback task at a specified interval to obtain the asynchronous inference results of the preceding period.

  5. Call aclrtLaunchCallback to deliver a callback task to the task queue of the stream. When the system executes the callback task, it runs in the thread subscribed to the stream through aclrtSubscribeReport. The callback function must be the one created in step 1.

    Each call to aclrtLaunchCallback triggers the callback function once.

  6. Call aclrtUnSubscribeReport to unsubscribe the thread, so that the callback functions in the stream are no longer handled by it.

Sample Code

You can view the complete code in Image Classification with ResNet-50 (Asynchronous Inference).

This section focuses on the code logic of asynchronous model inference. For details about how to initialize and deinitialize AscendCL, see Initializing AscendCL. For details about how to allocate and deallocate runtime resources, see Runtime Resource Allocation and Deallocation.

After each API call, add exception handling branches and record error and info logs as needed. The following snippet shows the key steps only and is not ready to be built or run as-is.

#include "acl/acl.h"
// ......
// 1. Initialize AscendCL.

// 2. Allocate runtime resources.

// Obtain the run mode of the AI software stack. The API calls for memory allocation and memory copy are different in different run modes.
aclrtRunMode runMode;
extern bool g_isDevice;
aclrtGetRunMode(&runMode);
g_isDevice = (runMode == ACL_DEVICE);

// 3. Allocate model inference resources.

// The two dots (..) indicate a path relative to the directory of the executable file.
// For example, if the executable file is stored in the out directory, the two dots (..) point to the parent directory of the out directory.
const char* omModelPath = "../model/resnet50.om";

// 3.1 Load a model.
// Obtain the weight memory size and workspace size required for model execution, and allocate memory as required.
aclmdlQuerySize(omModelPath, &modelMemSize_, &modelWeightSize_);
aclrtMalloc(&modelMemPtr_, modelMemSize_, ACL_MEM_MALLOC_HUGE_FIRST);
aclrtMalloc(&modelWeightPtr_, modelWeightSize_, ACL_MEM_MALLOC_HUGE_FIRST);

// Load the offline model file. After the model is successfully loaded, the model ID is returned.
aclmdlLoadFromFileWithMem(omModelPath, &modelId_, modelMemPtr_,
        modelMemSize_, modelWeightPtr_, modelWeightSize_);

// 3.2 Obtain the model description based on the model ID.
modelDesc_ = aclmdlCreateDesc();
aclmdlGetDesc(modelDesc_, modelId_);

// 3.3 Customize the InitMemPool function for memory pool initialization, to store the input and output data for model inference.
// -----Key implementation of the InitMemPool function.-----
std::string testFile[] = {
    "../data/dog1_1024_683.bin",
    "../data/dog2_1024_683.bin"
};
size_t fileNum = sizeof(testFile) / sizeof(testFile[0]);
// g_memoryPoolSize indicates the number of memory blocks in the memory pool. The default value is 100.
for (size_t i = 0; i < g_memoryPoolSize; ++i) {
    size_t index = i % fileNum;
    uint32_t devBufferSize;
    // Customize the GetDeviceBufferOfFile function to obtain the device buffer that stores
    // the input image data together with the buffer size, and to copy the image data to the device.
    void *picDevBuffer = Utils::GetDeviceBufferOfFile(testFile[index], devBufferSize);
    aclmdlDataset *input = nullptr;
    // Customize the CreateInput function to create an input of type aclmdlDataset for storing the input data of model inference.
    Result ret = CreateInput(picDevBuffer, devBufferSize, input);
    aclmdlDataset *output = nullptr;
    // Customize the CreateOutput function to create an output of type aclmdlDataset for storing the output data of model inference. modelDesc_ is the model description.
    CreateOutput(output, modelDesc_);
    {
        std::lock_guard<std::recursive_mutex> lk(freePoolMutex_);
        freeMemoryPool_[input] = output;
    }
}
// -----Key implementation of the InitMemPool function.-----

// 4. Model inference
// 4.1 Create thread tid. ProcessCallback is the thread function; it calls aclrtProcessReport in a loop,
// and the callback tasks delivered to this thread are then executed after the specified period of time.
pthread_t tid;
(void)pthread_create(&tid, nullptr, ProcessCallback, &s_isExit);
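// Hypothetical sketch of the ProcessCallback thread function, which is not shown in this snippet
// (the function name matches the pthread_create call above; the timeout value is illustrative):
//     void *ProcessCallback(void *arg)
//     {
//         bool *isExit = (bool *)arg;
//         while (!*isExit) {
//             // Execute the callback tasks delivered to this thread, waiting up to the
//             // specified timeout (in milliseconds) for tasks to arrive.
//             (void)aclrtProcessReport(100);
//         }
//         return nullptr;
//     }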

// 4.2 Subscribe to a thread for handling the callback function in the stream.
aclrtSubscribeReport(static_cast<uint64_t>(tid), stream_);

// Define the callback function for processing the model inference results. The callback function is user-defined.
void ModelProcess::CallBackFunc(void *arg)
{
    // arg is the batch of (input, output) dataset pairs delivered through aclrtLaunchCallback.
    auto *dataMap = static_cast<std::map<aclmdlDataset *, aclmdlDataset *> *>(arg);
    MemoryPool *memPool = MemoryPool::Instance();

    for (auto &data : *dataMap) {
        // Process the inference output of each pair, then return the memory to the pool.
        ModelProcess::OutputModelResult(data.second);
        memPool->FreeMemory(data.first, data.second);
    }

    delete dataMap;
}

// 4.3 Customize the ExecuteAsync function to perform model inference.
//-----Start of the key implementation of the user-defined ExecuteAsync function.-----
    // g_callbackInterval indicates the callback interval. The default value is 1, indicating that a callback task is delivered after every asynchronous inference.
    bool isCallback = (g_callbackInterval != 0);
    size_t callbackCnt = 0;
    std::map<aclmdlDataset *, aclmdlDataset *> *dataMap = nullptr;
    aclmdlDataset *input = nullptr;
    aclmdlDataset *output = nullptr;
    MemoryPool *memPool = MemoryPool::Instance();
    // g_executeTimes indicates the number of asynchronous model inferences to run. The default value is 100.
    for (uint32_t cnt = 0; cnt < g_executeTimes; ++cnt) {
        if (memPool->mallocMemory(input, output) != SUCCESS) {
            ERROR_LOG("get free memory failed");
            return FAILED;
        }
        // Perform asynchronous inference.
        aclError ret = aclmdlExecuteAsync(modelId_, input, output, stream_);

        if (isCallback) {
            if (dataMap == nullptr) {
                dataMap = new(std::nothrow) std::map<aclmdlDataset *, aclmdlDataset *>;
                if (dataMap == nullptr) {
                    ERROR_LOG("malloc list failed, modelId is %u", modelId_);
                    memPool->FreeMemory(input, output);
                    return FAILED;
                }
            }
            (*dataMap)[input] = output;
            callbackCnt++;
            if ((callbackCnt % g_callbackInterval) == 0) {
                // Add a callback function to be executed to the stream.
                ret = aclrtLaunchCallback(CallBackFunc, (void *)dataMap, ACL_CALLBACK_BLOCK, stream_);
                if (ret != ACL_SUCCESS) {
                    ERROR_LOG("launch callback failed, index=%zu", callbackCnt);
                    memPool->FreeMemory(input, output);
                    delete dataMap;
                    return FAILED;
                }
                dataMap = nullptr;
            }

        }
    }
//-----End of the key implementation of the user-defined ExecuteAsync function.-----

// 4.4 For asynchronous inference, block app execution until the stream has completed all preceding requested tasks.
aclrtSynchronizeStream(stream_);

// 4.5 Unsubscribe the thread so that the callback functions in the stream are no longer handled by it,
// then notify the thread function to exit and join the thread.
aclrtUnSubscribeReport(static_cast<uint64_t>(tid), stream_);
s_isExit = true;
(void)pthread_join(tid, nullptr);

// 5. Deallocate runtime resources.

// 6. Deinitialize AscendCL.

// ......