Sample Code Analysis

This sample uses a modular design and is implemented in C++ with the graph engine (GE) APIs and the Ascend Computing Language (acl) APIs. To obtain the complete code, click the link. The core components are as follows:

ModelInference::Builder: a builder that configures model parameters, encapsulates the construction of the ModelInference object, and provides chainable configuration APIs.
ModelInference: a core class that provides key capabilities such as model initialization, resource management, and task scheduling.
ModelInference::GraphWorker: a worker thread that executes asynchronous inference tasks.
ModelInference::GraphTask: a task unit that encapsulates the complete life cycle of a single inference request (input/output and callback).
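
The chainable configuration style of the builder can be sketched as follows. This is a minimal, self-contained illustration of the pattern and not the sample's actual implementation: the class DemoInference, its fields, and its default values are hypothetical stand-ins, and only the setter names (InputBatchCopy, AiCoreNum, MultiInstanceNum) mirror the ones used later in this section.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for ModelInference; it only stores configuration.
class DemoInference {
 public:
  class Builder {
   public:
    Builder(std::string model_path, std::string model_type)
        : model_path_(std::move(model_path)), model_type_(std::move(model_type)) {}

    // Each setter returns *this so that calls can be chained.
    Builder &InputBatchCopy(bool enable) { batch_copy_ = enable; return *this; }
    Builder &AiCoreNum(int num) { ai_core_num_ = num; return *this; }
    Builder &MultiInstanceNum(int num) { multi_instance_num_ = num; return *this; }

    // Build() hands the collected settings to the object being constructed.
    std::unique_ptr<DemoInference> Build() const {
      assert(multi_instance_num_ >= 1 && "at least one instance is required");
      return std::unique_ptr<DemoInference>(new DemoInference(*this));
    }

   private:
    friend class DemoInference;  // lets the constructor below read the settings
    std::string model_path_;
    std::string model_type_;
    bool batch_copy_ = false;     // assumed default
    int ai_core_num_ = 0;         // assumed default
    int multi_instance_num_ = 1;  // assumed default
  };

  const std::string &model_path() const { return model_path_; }
  const std::string &model_type() const { return model_type_; }
  bool batch_copy() const { return batch_copy_; }
  int ai_core_num() const { return ai_core_num_; }
  int multi_instance_num() const { return multi_instance_num_; }

 private:
  // Private constructor: instances can only be created through the Builder.
  explicit DemoInference(const Builder &b)
      : model_path_(b.model_path_),
        model_type_(b.model_type_),
        batch_copy_(b.batch_copy_),
        ai_core_num_(b.ai_core_num_),
        multi_instance_num_(b.multi_instance_num_) {}

  std::string model_path_;
  std::string model_type_;
  bool batch_copy_;
  int ai_core_num_;
  int multi_instance_num_;
};
```

With this sketch, a call such as DemoInference::Builder("model.pb", "TensorFlow").InputBatchCopy(true).AiCoreNum(8).Build() reads as a single expression, which is the main benefit of returning a reference from every setter.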

Figure 1 shows the Unified Modeling Language (UML) class diagram of each component.

Figure 1 Sample code structure

API Call Sequence

The following figure shows the execution process and involved APIs of the sample code.

Call aclInit to initialize acl, and call aclrtSetDevice to specify the target device.
Build a ModelInference instance and set its feature switches: enable batch host-to-device (H2D) transmission, configure the AI Core control policy, and enable multi-instance parallelism.
Initialize the ModelInference instance.
1. Call the Session constructor to create a Session object, allocate Session resources, and configure the ge.aicoreNum parameter in the options of the Session.
2. Call GEInitialize to initialize the system.
3. Call aclgrphParseTensorFlow for model parsing to obtain the graph.
4. Call AddGraph to add a graph to the Session object.
5. Call CompileGraph to complete the graph build.
6. Call aclrtGetDevice to obtain the ID of the device in use.
7. Create multiple worker threads, each of which is passed the same session, graph ID, and device ID.
Submit inference tasks to the worker threads. The following uses one worker thread as an example to describe the execution process.
1. Call aclrtSetDevice to specify the device in use, and call aclrtCreateStream to create a stream.
2. Call LoadGraph (asynchronous graph execution scenario) to load the graph model to the stream created in the previous step. The worker then listens on the task queue and processes each received task as follows:
  1. Call aclrtMalloc to allocate device memory, and transfer the input data from the host to the device. If batch H2D transmission is enabled, use the aclrtMemcpyBatch API to transfer the data in a batch; otherwise, use the aclrtMemcpy API.
  2. Call ExecuteGraphWithStreamAsync to run the graph asynchronously.
  3. Call aclrtSynchronizeStream to block the calling thread until all tasks in the specified stream are complete.
  4. Call aclrtMemcpyBatch to transfer data from the device to the host in a batch.
  5. Call aclrtFree to free the device memory.
  6. Execute the user-defined callback function.
Call GEFinalize to release system resources, and call aclFinalize to release acl-related resources.
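
The worker-thread part of the sequence above (listen on a task queue, execute each task, then invoke its callback) can be sketched in plain C++ as follows. This is an illustration only, under stated assumptions: DemoTask and DemoWorker are hypothetical names, the device work (memory copies, graph execution, stream synchronization) is replaced by an arbitrary function, and the sample's real GraphWorker may be structured differently.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical task: the work to run plus a callback that receives its status.
struct DemoTask {
  std::function<bool()> run;           // stands in for H2D copy, graph execution, sync, and D2H copy
  std::function<void(bool)> callback;  // stands in for the user-defined callback
};

// Hypothetical worker: one thread that drains a queue of tasks in FIFO order.
class DemoWorker {
 public:
  DemoWorker() : thread_([this] { Loop(); }) {}
  ~DemoWorker() { Stop(); }
  DemoWorker(const DemoWorker &) = delete;
  DemoWorker &operator=(const DemoWorker &) = delete;

  // Enqueue a task and wake the worker thread.
  void Submit(DemoTask task) {
    assert(task.run && task.callback);
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push(std::move(task));
    }
    cv_.notify_one();
  }

  // Let the worker finish the tasks already queued, then join the thread.
  void Stop() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stopping_ = true;
    }
    cv_.notify_one();
    if (thread_.joinable()) thread_.join();
  }

 private:
  void Loop() {
    // A real worker would bind the device and create its stream here.
    for (;;) {
      DemoTask task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
        if (queue_.empty()) return;  // stopping_ is set and no work is left
        task = std::move(queue_.front());
        queue_.pop();
      }
      // Run the task outside the lock, then report its status.
      task.callback(task.run());
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<DemoTask> queue_;
  bool stopping_ = false;
  std::thread thread_;  // declared last so the other members exist before Loop() starts
};
```

Running the task outside the lock keeps Submit from blocking while a task executes. In this sketch, tasks submitted after Stop are never executed.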

Example

Include the header files of acl, the C/C++ standard library, GE, and the sample's ModelInference class.

#include <acl.h>
#include <acl_rt.h>
#include <sstream>
#include <random>
#include <unordered_map>
#include <chrono>
#include <atomic>
#include <complex>
#include <iostream>
#include <vector>
#include <map>
#include "model_inference.h"
#include <getopt.h>
#include <string>

Initialize acl resources and set the device.

// Initialize acl.
aclError aerr = aclInit(nullptr);
if (aerr != ACL_ERROR_NONE) {
  std::cerr << "Failed to init ACL, error=" << aerr << std::endl;
  return -1;
}
// Specify the compute device.
aerr = aclrtSetDevice(0);
if (aerr != ACL_ERROR_NONE) {
  std::cerr << "aclrtSetDevice failed, ret=" << aerr << std::endl;
  aclFinalize();
  return -1;
}

Set inference parameters.

// Model file path.
const std::string model_path = "../data/DCN_v2.pb";
// Model file type.
const std::string model_type = "TensorFlow";

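Specify the parameters for model parsing. The sample model has 27 inputs.

std::stringstream ss;
// Number of input nodes of the model.
int input_size = 27;
// Inputs Input_1 to Input_26 each have the shape "batchSize".
for (int i = 1; i < input_size; ++i) {
  ss << "Input_" << i << ":" << batchSize << ";";
}
// The remaining input, named Input, has the shape "batchSize,8".
ss << "Input:" << batchSize << ",8";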
// Build a map to configure the parsing parameters of the model.
std::map<ge::AscendString, ge::AscendString> parser = {
    // Set the output node.
    {ge::AscendString(ge::ir_option::OUT_NODES), 
     ge::AscendString("Identity:0")},
    // Set the input shape.
    {ge::AscendString(ge::ir_option::INPUT_SHAPE), 
     ge::AscendString(ss.str().c_str())}
};
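
To make the format of the input-shape string concrete, the loop above can be wrapped in a small helper and evaluated for a small case. The helper name BuildInputShape is hypothetical and is not part of the sample; the node names and the trailing dimension 8 are taken from the snippet above.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: builds "Input_1:<b>;...;Input_<n-1>:<b>;Input:<b>,8"
// for n input nodes and batch size b, mirroring the loop in the sample.
std::string BuildInputShape(int input_size, int batch_size) {
  assert(input_size >= 1);
  std::stringstream ss;
  for (int i = 1; i < input_size; ++i) {
    ss << "Input_" << i << ":" << batch_size << ";";
  }
  ss << "Input:" << batch_size << ",8";
  return ss.str();
}
```

For example, BuildInputShape(3, 4) yields Input_1:4;Input_2:4;Input:4,8. With input_size set to 27, the string contains 26 entries of the form Input_i plus the final Input entry.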

Build and initialize a ModelInference instance.

// Create a ModelInference instance.
auto model_inference = gerec::ModelInference::Builder(cfg.model_path, cfg.model_type)
                       .InputBatchCopy(enableBatchH2D)        // Enable batch H2D transmission.
                       .AiCoreNum(aiCoreNum)                  // Configure the AI Core control policy.
                       .MultiInstanceNum(multiInstanceNum)    // Enable multi-instance parallelism.
                       .GraphParserParams(cfg.parser_params)  // Set graph parsing parameters.
                       .Build();
if (model_inference->Init() != ge::SUCCESS) {
  std::cerr << "Init ModelInference failed" << std::endl;
  return ge::FAILED;
}

Submit the inference task.

// Callback invoked after each asynchronous inference completes. It collects statistics and frees the tensors.
auto callback = [&](std::shared_ptr<std::vector<gert::Tensor>> outputs,
                    std::shared_ptr<std::vector<gert::Tensor>> inputs, bool status, long long exec_us) {
  if (status) {
    // On success, update the success count and the total execution time.
    success_count.fetch_add(1, std::memory_order_relaxed);        // Increment the success count.
    total_exec_us.fetch_add(exec_us, std::memory_order_relaxed);  // Accumulate the execution time (in microseconds).
  }
  // Free the memory occupied by the output/input tensors.
  FreeHostTensors(outputs);
  FreeHostTensors(inputs);
};

// Submit num_runs asynchronous inference requests.
for (int i = 0; i < num_runs; ++i) {
  if (model_inference->RunGraphAsync(all_inputs[i], all_outputs[i], callback) != ge::SUCCESS) {
    std::cerr << "RunGraphAsync failed at " << i << std::endl;
    return ge::FAILED;
  }
}

The RunGraphAsync API executes asynchronously, so a callback function must be supplied to process the inference result. The callback function must have the following signature:

using Callback = std::function<void(
    std::shared_ptr<std::vector<gert::Tensor>> outputs,   // Output tensor list.
    std::shared_ptr<std::vector<gert::Tensor>> inputs,    // Input tensor list.
    bool status,                                          // Whether the inference succeeded.
    long long exec_us                                     // Execution latency (in microseconds).
    )>;
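
Because the call returns before inference finishes, the caller typically needs a way to wait until every callback has fired before it reads the statistics or releases resources. One possible approach, shown here as an assumption and not as something the sample is known to do, is a small countdown helper that each callback notifies. The class name PendingCounter is hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical helper: tracks how many asynchronous requests are still outstanding.
class PendingCounter {
 public:
  explicit PendingCounter(int expected) : remaining_(expected) {
    assert(expected >= 0);
  }

  // Call once at the end of each completion callback.
  void Done() {
    std::lock_guard<std::mutex> lock(mu_);
    if (--remaining_ <= 0) cv_.notify_all();
  }

  // Blocks until Done() has been called once per expected request.
  void Wait() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return remaining_ <= 0; });
  }

  // Number of requests that have not completed yet.
  int remaining() {
    std::lock_guard<std::mutex> lock(mu_);
    return remaining_;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int remaining_;
};
```

Under this assumption, the counter would be created with the number of submitted requests, Done would be the last statement of the callback, and Wait would be called after the submission loop.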

Release resources.

// Deinitialize acl.
ret = aclFinalize();
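
The call sequence described earlier releases GE before acl, that is, in the reverse order of initialization. A common C++ way to get this ordering, including on early returns, is a scope guard. The following is a generic sketch and not code from the sample: ScopeGuard and DemoReleaseOrder are hypothetical names, and the strings merely stand in for the real GEFinalize and aclFinalize calls.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical RAII guard: runs the given cleanup action when it goes out of scope.
class ScopeGuard {
 public:
  explicit ScopeGuard(std::function<void()> cleanup) : cleanup_(std::move(cleanup)) {
    assert(cleanup_ && "cleanup action must be callable");
  }
  ~ScopeGuard() { cleanup_(); }
  ScopeGuard(const ScopeGuard &) = delete;
  ScopeGuard &operator=(const ScopeGuard &) = delete;

 private:
  std::function<void()> cleanup_;
};

// Records the order in which the stand-in finalize actions run.
std::string DemoReleaseOrder() {
  std::string log;
  {
    // Registered right after the (stand-in) acl initialization.
    ScopeGuard acl_guard([&log] { log += "aclFinalize;"; });
    // Registered right after the (stand-in) GE initialization.
    ScopeGuard ge_guard([&log] { log += "GEFinalize;"; });
    // Inference work would go here; leaving this scope early still runs both guards.
  }
  return log;
}
```

Local objects are destroyed in the reverse order of their construction, so DemoReleaseOrder returns GEFinalize;aclFinalize;, matching the order given in the call sequence.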
