Running a Graph Asynchronously in the Single-Process and Multi-Device Mode

This section describes how a single process manages multiple devices, so that one process can run graphs on different devices concurrently.

Overview

This feature is not supported by the Atlas 200I/500 A2 inference products.

The following figure illustrates the API call sequence.

GEInitializeV2: initializes the system and allocates resources. (This API can be called before graph construction.)
aclInit: initializes acl APIs.
Session constructor: creates multiple Session objects and allocates session resources. Each session is passed with a different value of ge.session_device_id to run the model on different devices.
Create multiple threads, each of which passes a different session. The following uses one thread as an example to describe the process:
1. aclrtSetDevice: specifies the running device; aclrtCreateStream: creates a stream; aclrtMalloc: allocates the device memory.
2. AddGraph: adds a graph to the Session object.
3. CompileGraph: builds the graph.
4. LoadGraph: (asynchronous graph execution scenario) loads the graph model to the stream created in step 1.
5. aclrtMemcpy: transfers data from the host to the device.
6. RunGraphWithStreamAsync: runs the graph asynchronously.
7. aclrtSynchronizeStream: waits for stream tasks to complete.
8. aclrtMemcpy: transfers data back from the device to the host.
9. aclrtFree: destroys memory allocations.
GEFinalizeV2: destroys system allocations; aclFinalize: destroys acl allocations.
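
The ordering above can be summarized as a small trace. The sketch below does not call GE or acl: `Trace`, `WorkerSequence`, and `RunSequence` are hypothetical names, and each `Step()` only records the name of the call it stands in for. It shows the process-level initialize/finalize bracket around the per-thread steps 1 to 9.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical recorder: each Step() stands in for one real GE/acl call.
class Trace {
 public:
  void Step(const std::string &name) {
    std::lock_guard<std::mutex> lock(mu_);
    steps_.push_back(name);
  }
  std::vector<std::string> Steps() {
    std::lock_guard<std::mutex> lock(mu_);
    return steps_;
  }

 private:
  std::mutex mu_;
  std::vector<std::string> steps_;
};

// Per-thread steps 1 to 9 for one device, in the order listed above.
void WorkerSequence(Trace &trace, int device) {
  const std::string tag = "dev" + std::to_string(device) + ":";
  const char *names[] = {"aclrtSetDevice", "aclrtCreateStream", "aclrtMalloc",
                         "AddGraph", "CompileGraph", "LoadGraph",
                         "aclrtMemcpy(H2D)", "RunGraphWithStreamAsync",
                         "aclrtSynchronizeStream", "aclrtMemcpy(D2H)",
                         "aclrtFree"};
  for (const char *name : names) {
    trace.Step(tag + name);
  }
}

// Whole-process sequence: initialize once, one session and one thread per
// device, finalize once after every thread has been joined.
std::vector<std::string> RunSequence(int device_count) {
  Trace trace;
  trace.Step("GEInitializeV2");
  trace.Step("aclInit");
  for (int d = 0; d < device_count; ++d) {
    trace.Step("Session(dev" + std::to_string(d) + ")");
  }
  std::vector<std::thread> workers;
  for (int d = 0; d < device_count; ++d) {
    workers.emplace_back(WorkerSequence, std::ref(trace), d);
  }
  for (auto &worker : workers) {
    worker.join();
  }
  trace.Step("GEFinalizeV2");
  trace.Step("aclFinalize");
  return trace.Steps();
}
```

Steps from different devices may interleave in the trace, but the initialization entries always come first and the finalization entries always come last because the threads are joined before finalization.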

Example

Include header files, including those of acl and C or C++ standard library.

        
#include <iostream>
#include <map>
#include <string>
#include <thread>
#include <vector>
#include "ge_api_v2.h"
#include "acl.h"
#include "acl_rt.h"
#include "graph/ascend_string.h"

Allocate system resources.

After a graph is defined, call GEInitializeV2 to initialize the system (or call it before defining a graph) and allocate system resources. The sample code is as follows:

        
std::map<AscendString, AscendString> config = {{"ge.exec.deviceId", "0"},
                                               {"ge.graphRunMode", "1"}};
Status ret = ge::GEInitializeV2(config);

Set the GE initialization configuration by using config. Configure ge.exec.deviceId to specify the device where a GE instance runs, and ge.graphRunMode to specify the graph run mode (set to 0 for online inference and 1 for training). For more configurations, see Command-Line Options.

Do not configure dump information both in the GE options and in the configuration passed to the acl initialization API. Otherwise, exceptions may occur. The same rule applies to any other parameters that serve the same function.

Initialize acl resources.

        
std::string aclConfigPath = "xx/xx/xx";
// aclInit takes a C string, so pass c_str() rather than the std::string itself.
aclError retInit = aclInit(aclConfigPath.c_str());
if (retInit != ACL_ERROR_NONE) {
    // ...
    // ...
    return FAILED;
}
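
The acl calls in this section return an error code, and the samples elide the handling with `// ...`. One possible shape for that handling is a small checker that logs the failing call. The sketch is self-contained and does not include any acl header: `AclErrorLike`, `kAclSuccessLike`, and `CheckAcl` are hypothetical names, and the success value is assumed to be 0.

```cpp
#include <cassert>
#include <iostream>
#include <string>

// Stand-ins for aclError and its success code (assumed to be 0).
using AclErrorLike = int;
constexpr AclErrorLike kAclSuccessLike = 0;

// Returns true on success; otherwise logs which call failed and its code.
bool CheckAcl(AclErrorLike ret, const std::string &what) {
  if (ret == kAclSuccessLike) {
    return true;
  }
  std::cerr << what << " failed, error code = " << ret << std::endl;
  return false;
}
```

With such a helper, a call site could read `if (!CheckAcl(aclInit(path), "aclInit")) { return FAILED; }`, which keeps the error path to one line per call.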

Create multiple sessions.

To run a defined graph, create a Session object. Use the options argument of the Session constructor to pass configuration parameters. For details about the supported configuration parameters, see Command-Line Options.

         
int thread_num = 8; // The number 8 is an example.
std::vector<ge::GeSession *> sessions; // Holds one session per device.
for (int i = 0; i < thread_num; ++i) {   // Create multiple Session objects. Each session is passed a different value of ge.session_device_id.
    std::map<ge::AscendString, ge::AscendString> options = { // Construct the session configuration.
        {"ge.session_device_id", std::to_string(i).c_str()},
    };
    ge::GeSession *session = new ge::GeSession(options); // Create a session and pass the configuration map.
    if (session == nullptr) { // Check whether the session is created successfully.
        std::cout << "create session failed!" << std::endl;
        ge::GEFinalizeV2();
        return FAILED;
    }
    sessions.push_back(session);
}
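
The loop above stores raw pointers, so every early return has to delete the sessions by hand. A possible alternative is to let `std::unique_ptr` own each session. The sketch below uses a hypothetical `FakeSession` type and `std::string` options in place of `ge::GeSession` and `ge::AscendString`, so it compiles without GE. Only the option key `ge.session_device_id` comes from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using Options = std::map<std::string, std::string>;

// Hypothetical stand-in for ge::GeSession; it only stores its options.
struct FakeSession {
  explicit FakeSession(Options opts) : options(std::move(opts)) {}
  Options options;
};

// Creates one session per device, each with its own ge.session_device_id.
std::vector<std::unique_ptr<FakeSession>> CreateSessions(int device_count) {
  std::vector<std::unique_ptr<FakeSession>> sessions;
  sessions.reserve(static_cast<std::size_t>(device_count));
  for (int i = 0; i < device_count; ++i) {
    Options options = {{"ge.session_device_id", std::to_string(i)}};
    sessions.push_back(std::make_unique<FakeSession>(std::move(options)));
  }
  return sessions;  // Sessions are destroyed when the vector goes out of scope.
}
```

If this pattern is applied to real sessions, the vector still has to be cleared before GEFinalizeV2 is called, to keep the destruction order used in this section.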

Create multiple threads. Each thread receives a different session, and therefore a different ge.session_device_id, and runs the graph asynchronously.

        
// Define the container for storing all thread objects.
std::vector<std::thread> threads;
// Create multiple threads and save them to the container.
for (int i = 0; i < thread_num; i++) {
    std::thread worker_thread(exec_func, i); // Create a thread and execute exec_func. The steps of the exec_func thread function are described below.
    threads.emplace_back(std::move(worker_thread)); // Move worker_thread to the threads container through std::move.
}
// Wait until all threads are complete.
for (int i = 0; i < thread_num; i++) {
    threads.at(i).join();
}
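
`std::thread` discards the return value of its thread function, so a `return FAILED;` inside exec_func does not reach the main thread. One way to surface per-device failures is to give each thread its own result slot. In this sketch `RunAll` is a hypothetical helper, the `work` callback stands in for exec_func, and the status values are placeholders.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs work(i) on its own thread for each i in [0, thread_num) and returns
// the status reported by every thread. -1 marks a slot that was never written.
std::vector<int> RunAll(int thread_num, const std::function<int(int)> &work) {
  std::vector<int> results(static_cast<std::size_t>(thread_num), -1);
  std::vector<std::thread> threads;
  threads.reserve(static_cast<std::size_t>(thread_num));
  for (int i = 0; i < thread_num; ++i) {
    // Each thread writes only its own element, so no lock is needed.
    threads.emplace_back([&results, &work, i] {
      results[static_cast<std::size_t>(i)] = work(i);
    });
  }
  for (auto &thread : threads) {
    thread.join();
  }
  return results;
}
```

After the join loop, the main thread can inspect the returned vector and decide whether to report an error before releasing resources.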

The steps for executing exec_func asynchronously on a single thread are as follows:

Specify the running device, create a stream, and allocate memory.

          
// Specify the compute device. Each thread uses the device that matches its session's ge.session_device_id (index is the value passed to exec_func).
int32_t deviceId = index;
aclError aclRet = aclrtSetDevice(deviceId);

// Create a stream.
aclrtStream stream = nullptr;
aclRet = aclrtCreateStreamWithConfig(&stream, 0, ACL_STREAM_FAST_LAUNCH);

// Allocate the device memory.
void *devPtrB = nullptr;
aclRet = aclrtMalloc(&devPtrB, data_size, ACL_MEM_MALLOC_HUGE_FIRST);

Add a graph object.

          
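uint32_t graph_id = 0;
ge::Graph graph;
std::map<AscendString, AscendString> graph_options; // Graph-level options; empty in this example.
ge::GeSession *sess_ = sessions[index]; // index is the value passed to exec_func.
ge::Status ret = sess_->AddGraph(graph_id, graph, graph_options);
if (ret != SUCCESS) {
  // ...
  // ...
  // Destroy allocations.
  ge::GEFinalizeV2();
  delete sess_;
  return FAILED;
}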

Set the run configuration by using options. For details, see the Session constructor. The graph execution result will be saved to the output_cov tensor.

(Optional) Build the graph.

If CompileGraph has not been called, LoadGraph automatically calls CompileGraph to build the graph.

           
uint32_t graph_id = 0;
ret = sess_->CompileGraph(graph_id);
if (ret != SUCCESS) {
  // ...
  // ...
  // Destroy allocations and the session.
  ge::GEFinalizeV2();
  delete sess_;
  return FAILED;
}

(Optional) Load the graph to the created stream.

If LoadGraph has not been called, RunGraphWithStreamAsync automatically calls LoadGraph to complete the loading. Use the options argument of LoadGraph to pass configuration parameters. For details about the supported configuration parameters, see Command-Line Options.

           
std::map<AscendString, AscendString> options;
uint32_t graph_id = 0;
ret = sess_->LoadGraph(graph_id, options, stream);
if (ret != SUCCESS) {
  // ...
  // ...
  // Destroy allocations and the session.
  ge::GEFinalizeV2();
  delete sess_;
  return FAILED;
}

Transfer data.

          
// Copy the memory and transfer data from the host to the device.
// hostPtrA indicates the pointer to the source memory address on the host. devPtrB indicates the pointer to the destination memory address on the device. size indicates the memory size.
aclRet = aclrtMemcpy(devPtrB, size, hostPtrA, size, ACL_MEMCPY_HOST_TO_DEVICE);

Run the graph asynchronously and return the execution result.

          
std::vector<gert::Tensor> input;
std::vector<gert::Tensor> output;
ret = sess_->RunGraphWithStreamAsync(graph_id, stream, input, output);

// Call aclrtSynchronizeStream to wait for the stream tasks to complete.
aclRet = aclrtSynchronizeStream(stream);
// Copy the memory and transfer the device data back to the host.
// devPtrA indicates the pointer to the source memory address on the device. hostPtrB indicates the pointer to the destination memory address on the host. size indicates the memory size.
aclRet = aclrtMemcpy(hostPtrB, size, devPtrA, size, ACL_MEMCPY_DEVICE_TO_HOST);
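
The order in this step matters: the asynchronous run only submits work to the stream, and the results are guaranteed to be available only after the stream has been synchronized. The sketch below models that rule with a hypothetical `FakeStream` that queues tasks and runs them in `Synchronize()`. It is a simplified model, not the acl implementation; a real stream may start executing before synchronization.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical model of a stream: tasks are queued and run on Synchronize().
class FakeStream {
 public:
  void Enqueue(std::function<void()> task) { tasks_.push_back(std::move(task)); }
  void Synchronize() {
    for (auto &task : tasks_) {
      task();
    }
    tasks_.clear();
  }

 private:
  std::vector<std::function<void()>> tasks_;
};

// Models "run asynchronously, synchronize, then read the output".
// The queued task doubles every input element.
std::vector<int> RunDoubling(const std::vector<int> &input, bool synchronize) {
  FakeStream stream;
  std::vector<int> output(input.size(), 0);
  stream.Enqueue([&input, &output] {
    for (std::size_t i = 0; i < input.size(); ++i) {
      output[i] = 2 * input[i];
    }
  });
  if (synchronize) {
    stream.Synchronize();  // Without this call the output is still all zeros.
  }
  return output;
}
```

In this model, reading the output without synchronizing returns the untouched buffer, which mirrors why the device-to-host copy is placed after aclrtSynchronizeStream.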

Destroy memory allocations.

          
// Free the device memory. aclrtFree returns an aclError, so store it in aclRet.
aclRet = aclrtFree(devPtrB);
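
The samples in this section free the device buffer only at the end of the success path, while the error branches return early. A scope guard can release the buffer on every path. The sketch below uses `std::malloc` and `std::free` as stand-ins for aclrtMalloc and aclrtFree so that it runs without a device; `DeviceBuffer`, `FreeStandIn`, `AllocStandIn`, and `UseBuffer` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

int g_live_buffers = 0;  // Counts outstanding allocations, for the demo only.

// Deleter that stands in for aclrtFree.
struct FreeStandIn {
  void operator()(void *ptr) const {
    std::free(ptr);
    --g_live_buffers;
  }
};
using DeviceBuffer = std::unique_ptr<void, FreeStandIn>;

// Allocation that stands in for aclrtMalloc.
DeviceBuffer AllocStandIn(std::size_t size) {
  void *ptr = std::malloc(size);
  if (ptr != nullptr) {
    ++g_live_buffers;
  }
  return DeviceBuffer(ptr);
}

// The buffer is released on both the early-return and the normal path.
bool UseBuffer(bool fail_early) {
  DeviceBuffer buffer = AllocStandIn(64);
  if (!buffer || fail_early) {
    return false;  // The deleter runs here if the allocation succeeded.
  }
  return true;  // The deleter also runs here.
}
```

With real device memory, the guard would have to go out of scope before the device is reset or acl is finalized.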

Destroy allocations.

        
// Destroy each session.
for (auto session : sessions) {
    delete session;
}
// Destroy system allocations.
ret = ge::GEFinalizeV2();
// Deinitialize acl. aclFinalize returns an aclError, so reuse the aclError variable.
retInit = aclFinalize();
