Sample Code Analysis
This sample uses a modular design and is implemented in C++ with the graph engine (GE) APIs and the Ascend Computing Language (acl) APIs. To obtain the complete code, click the link. The core components are as follows:
- ModelInference::Builder: a builder that configures model parameters, encapsulates the construction of the ModelInference object, and provides chainable configuration APIs.
- ModelInference: a core class that provides key capabilities such as model initialization, resource management, and task scheduling.
- ModelInference::GraphWorker: a worker thread that executes asynchronous inference tasks.
- ModelInference::GraphTask: a task unit that encapsulates the complete life cycle of a single inference request (input/output and callback).
Figure 1 shows the Unified Modeling Language (UML) class diagram of each component.
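As a rough illustration of the chainable configuration style that ModelInference::Builder is described as providing, the following sketch uses standard C++ only. InferenceConfig and InferenceConfigBuilder are hypothetical stand-ins, not the sample's classes; only the setter names mirror the ones used later in this topic, and the default values are assumptions.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical configuration record; field names and defaults are illustrative only.
struct InferenceConfig {
    std::string model_path;
    std::string model_type;
    bool input_batch_copy = false;  // batch H2D transmission switch
    int ai_core_num = 0;            // assumed here: 0 means "use the default"
    int multi_instance_num = 1;     // number of parallel worker instances
};

// Hypothetical builder: every setter returns *this so that calls can be chained.
class InferenceConfigBuilder {
public:
    InferenceConfigBuilder(std::string model_path, std::string model_type) {
        cfg_.model_path = std::move(model_path);
        cfg_.model_type = std::move(model_type);
    }
    InferenceConfigBuilder& InputBatchCopy(bool enable) {
        cfg_.input_batch_copy = enable;
        return *this;
    }
    InferenceConfigBuilder& AiCoreNum(int n) {
        cfg_.ai_core_num = n;
        return *this;
    }
    InferenceConfigBuilder& MultiInstanceNum(int n) {
        cfg_.multi_instance_num = n;
        return *this;
    }
    // Hands the finished configuration to the caller.
    std::unique_ptr<InferenceConfig> Build() const {
        return std::make_unique<InferenceConfig>(cfg_);
    }

private:
    InferenceConfig cfg_;
};
```

Returning a reference from each setter keeps optional switches out of the constructor, so a caller only names the options it wants to change.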
API Call Sequence
The following figure shows the execution process and involved APIs of the sample code.

- Call aclInit to initialize acl, and then call aclrtSetDevice to specify the target device.
- Build a ModelInference instance, and set feature switches, including enabling batch H2D transmission, configuring the AI Core control policy, and enabling multi-instance parallelism.
- Perform ModelInference initialization.
- Call the Session constructor to create a Session object, allocate Session resources, and set the ge.aicoreNum parameter in the Session options.
- Call GEInitialize to initialize the system.
- Call aclgrphParseTensorFlow for model parsing to obtain the graph.
- Call AddGraph to add a graph to the Session object.
- Call CompileGraph to complete the graph build.
- Call aclrtGetDevice to obtain the ID of the device in use.
- Create multiple threads, and pass the same session, graph ID, and device ID to each of them.
- Submit the inference task to a worker thread. The following steps use a single worker thread as an example to describe the execution process.
- Call aclrtSetDevice to specify the device in use, and call aclrtCreateStream to create a stream.
- Call LoadGraph (asynchronous graph execution scenario) to load the graph model to the stream created in the previous step. The worker then listens on its task queue and executes the tasks it receives.
- Call aclrtMalloc to allocate device memory, and transfer data from the host to the device: use aclrtMemcpyBatch if batch H2D transmission is enabled, or aclrtMemcpy otherwise.
- Call ExecuteGraphWithStreamAsync to run the graph asynchronously.
- Call aclrtSynchronizeStream to block the program until all tasks in the specified stream are complete.
- Call aclrtMemcpyBatch to transfer data from the device to the host in a batch.
- Call aclrtFree to free the device memory.
- Execute the customized callback function.
- Call GEFinalize to release GE resources, and then call aclFinalize to release acl resources.
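The worker-thread part of the sequence above (a thread that listens on a task queue, runs each task, and then invokes the user callback) can be sketched in standard C++ as follows. This is a minimal sketch, not the sample's GraphWorker: Task, Worker, and RunDemo are hypothetical names, and the acl and GE calls are only indicated in comments.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical task unit: the work itself plus a completion callback.
struct Task {
    std::function<bool()> run;          // stands in for copy-in, execute, synchronize, copy-out
    std::function<void(bool)> on_done;  // user callback, receives the execution status
};

// Hypothetical worker: one thread draining one queue.
class Worker {
public:
    Worker() : thread_([this] { Loop(); }) {}
    ~Worker() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stop_ = true;
        }
        cv_.notify_one();
        thread_.join();  // pending tasks are drained before the thread exits
    }
    void Submit(Task task) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            queue_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void Loop() {
        // Per-thread setup (device binding, stream creation, graph loading)
        // would happen here in the real sample.
        for (;;) {
            Task task;
            {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
                if (queue_.empty()) return;  // stop requested and nothing left to do
                task = std::move(queue_.front());
                queue_.pop();
            }
            const bool ok = task.run();
            task.on_done(ok);
        }
    }

    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<Task> queue_;
    bool stop_ = false;
    std::thread thread_;  // declared last so the other members exist before the thread starts
};

// Submits n trivial tasks and returns how many callbacks reported success.
int RunDemo(int n) {
    std::atomic<int> ok_count{0};
    {
        Worker worker;
        for (int i = 0; i < n; ++i) {
            worker.Submit(Task{[] { return true; },
                               [&ok_count](bool ok) { if (ok) ok_count.fetch_add(1); }});
        }
    }  // the Worker destructor drains the queue and joins the thread
    return ok_count.load();
}
```

Giving each worker its own queue and thread means per-thread state, such as a stream, never has to be shared between workers.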
Example
- Include the header files for acl, the C/C++ standard library, GE, and the sample's ModelInference class.
    #include <acl.h>
    #include <acl_rt.h>
    #include <sstream>
    #include <random>
    #include <unordered_map>
    #include <chrono>
    #include <atomic>
    #include <complex>
    #include <iostream>
    #include <vector>
    #include <map>
    #include "model_inference.h"
    #include <getopt.h>
    #include <string>
- Initialize acl resources and set the device.
    // Initialize acl.
    aclError aerr = aclInit(nullptr);
    if (aerr != ACL_ERROR_NONE) {
        std::cerr << "Failed to init ACL, error=" << aerr << std::endl;
        return -1;
    }
    // Specify the compute device.
    aerr = aclrtSetDevice(0);
    if (aerr != ACL_ERROR_NONE) {
        std::cerr << "aclrtSetDevice failed, ret=" << aerr << std::endl;
        aclFinalize();
        return -1;
    }
- Set inference parameters.
    // Model file path.
    const std::string model_path = "../data/DCN_v2.pb";
    // Model file type.
    const std::string model_type = "TensorFlow";
- Specify parameters for model parsing. The sample model has 27 inputs.
    std::stringstream ss;
    // Define the number of input nodes.
    int input_size = 27;
    for (int i = 1; i < input_size; ++i)
        ss << "Input_" << i << ":" << batchSize << ";";
    ss << "Input:" << batchSize << ",8";
    // Build a map to configure the parsing parameters of the model.
    std::map<ge::AscendString, ge::AscendString> parser = {
        // Set the output node.
        {ge::AscendString(ge::ir_option::OUT_NODES), ge::AscendString("Identity:0")},
        // Set the input shape.
        {ge::AscendString(ge::ir_option::INPUT_SHAPE), ge::AscendString(ss.str().c_str())}
    };
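To make the resulting input-shape string easier to picture, the loop above can be factored into a small helper and run with smaller numbers. BuildInputShape is a hypothetical name introduced only for this illustration; the node naming (Input_1 through Input_N-1, plus a final Input node with a second dimension of 8) follows the snippet above.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: builds "name:shape" pairs separated by ';',
// following the same pattern as the sample's loop.
std::string BuildInputShape(int input_count, int batch_size) {
    std::ostringstream ss;
    // Nodes Input_1 .. Input_(input_count - 1) each take a one-dimensional shape.
    for (int i = 1; i < input_count; ++i) {
        ss << "Input_" << i << ":" << batch_size << ";";
    }
    // The last node is named "Input" and has shape (batch_size, 8).
    ss << "Input:" << batch_size << ",8";
    return ss.str();
}
```

For example, BuildInputShape(3, 4) yields "Input_1:4;Input_2:4;Input:4,8". With 27 inputs, the same pattern produces 26 numbered entries followed by the final Input entry.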
- Build and initialize a ModelInference instance.
    // Create a ModelInference instance.
    auto model_inference = gerec::ModelInference::Builder(cfg.model_path, cfg.model_type)
        .InputBatchCopy(enableBatchH2D)        // Enable batch H2D transmission.
        .AiCoreNum(aiCoreNum)                  // Configure the AI Core control policy.
        .MultiInstanceNum(multiInstanceNum)    // Enable multi-instance parallelism.
        .GraphParserParams(cfg.parser_params)  // Set graph parsing parameters.
        .Build();
    if (model_inference->Init() != ge::SUCCESS) {
        std::cerr << "Init ModelInference failed" << std::endl;
        return ge::FAILED;
    }
- Submit the inference task.
    // Callback that collects statistics and cleans up after asynchronous inference is complete.
    auto callback = [&](std::shared_ptr<std::vector<gert::Tensor>> outputs,
                        std::shared_ptr<std::vector<gert::Tensor>> inputs,
                        bool status, long long exec_us) {
        if (status) {
            // If the inference is successful, update the success count and the total execution time.
            success_count.fetch_add(1, std::memory_order_relaxed);        // Increment the success count.
            total_exec_us.fetch_add(exec_us, std::memory_order_relaxed);  // Accumulate the execution time (in microseconds).
        }
        // Free the memory occupied by the output and input tensors.
        FreeHostTensors(outputs);
        FreeHostTensors(inputs);
    };
    // Run asynchronous inference multiple times.
    for (int i = 0; i < num_runs; ++i) {
        if (model_inference->RunGraphAsync(all_inputs[i], all_outputs[i], callback) != ge::SUCCESS) {
            std::cerr << "RunGraphAsync failed at " << i << std::endl;
            return ge::FAILED;
        }
    }
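The statistics side of the callback above can be tried without any device by dropping the tensor arguments. The sketch below is a simplified illustration: StatsCallback and RunStats are hypothetical names, and the callback type is deliberately reduced to the status flag and the latency.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Simplified callback type: tensors omitted, only status and latency kept.
using StatsCallback = std::function<void(bool status, long long exec_us)>;

// Hypothetical holder for the counters that the callback updates.
struct RunStats {
    std::atomic<int> success_count{0};
    std::atomic<long long> total_exec_us{0};

    // Returns a callback that can safely be invoked from several worker threads.
    // The RunStats object must outlive every callback created from it.
    StatsCallback MakeCallback() {
        return [this](bool status, long long exec_us) {
            if (status) {
                success_count.fetch_add(1, std::memory_order_relaxed);
                total_exec_us.fetch_add(exec_us, std::memory_order_relaxed);
            }
        };
    }

    // Average latency over successful runs, in microseconds (0 if none succeeded).
    double AverageLatencyUs() const {
        const int n = success_count.load();
        return n == 0 ? 0.0 : static_cast<double>(total_exec_us.load()) / n;
    }
};
```

Relaxed ordering is enough for the increments because the counters are independent tallies. Read the totals only after all workers have finished, for example after joining the threads.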
The RunGraphAsync API executes asynchronously and requires a callback function to process the inference result. The callback function must match the following signature:
    using Callback = std::function<void(
        std::shared_ptr<std::vector<gert::Tensor>> outputs,  // Output tensor list.
        std::shared_ptr<std::vector<gert::Tensor>> inputs,   // Input tensor list.
        bool status,                                         // Operation execution status.
        long long exec_us                                    // Execution latency (in microseconds).
    )>;
- Destroy allocations.
    // Deinitialize acl.
    ret = aclFinalize();
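The call sequence earlier in this topic releases GE resources before acl resources, which is the reverse of the order in which they were initialized. One possible way to keep that order on every exit path is a scope guard, sketched below under stated assumptions: ScopeGuard and TeardownOrder are hypothetical names, and the strings merely stand in for the real GEFinalize and aclFinalize calls, so this is not how the sample itself is written.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical scope guard: runs the stored action when it goes out of scope.
class ScopeGuard {
public:
    explicit ScopeGuard(std::function<void()> fn) : fn_(std::move(fn)) {}
    ~ScopeGuard() {
        if (fn_) fn_();
    }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    std::function<void()> fn_;
};

// Records the order of events; the strings stand in for the real calls.
std::vector<std::string> TeardownOrder() {
    std::vector<std::string> log;
    {
        // Guards are destroyed in reverse order of declaration.
        ScopeGuard acl_guard([&log] { log.push_back("aclFinalize"); });  // declared first, runs last
        ScopeGuard ge_guard([&log] { log.push_back("GEFinalize"); });    // declared second, runs first
        log.push_back("inference");
    }
    return log;
}
```

Because the guards unwind in reverse declaration order, an early return between the two initializations still releases whatever was already set up.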
