Custom Concurrency Using CustomStreamPassFunc

Overview

The concurrency benefit of an algorithm depends on factors such as operator execution time, core usage, and bandwidth. It therefore varies across hardware and cannot be accurately estimated in the build phase. No single algorithm currently achieves optimal concurrency on every network and every hardware platform.

To address this, this section introduces CustomStreamPassFunc, a custom logical stream allocation pass function. It lets you analyze how a network uses the hardware, based on the graph topology and profiling data, and adjust concurrency accordingly, so that each combination of network and card can have its own tuned policy. This feature can be used in two scenarios:

  • Fine-tune the built-in logical stream allocation result.
  • Develop a proprietary logical stream allocation algorithm based on the graph structure.

Related concepts:

  • Logical stream: the counterpart of a physical stream. A logical stream is allocated in advance on the graph based on conditions such as topological order and engine ownership. Tasks on the same logical stream are executed in sequence. The streams described in this document are logical streams.
  • Stream allocation: determines which tasks can be executed concurrently, improving hardware utilization and reducing model execution time.
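
To make the two concepts concrete, the following is a minimal sketch of a toy timing model, not the framework's scheduler. It assumes that each task has a fixed duration, that tasks on the same logical stream run back to back, and that different streams run fully in parallel with no dependencies between them. The Task type and the EstimateMakespan function are hypothetical names introduced only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical task record: a logical stream ID and a duration in microseconds.
struct Task {
    int64_t stream_id;
    int64_t duration_us;
};

// Toy model (assumption): tasks on one stream run sequentially and streams
// run in parallel, so the total time is the largest per-stream sum.
inline int64_t EstimateMakespan(const std::vector<Task> &tasks) {
    std::map<int64_t, int64_t> busy_time_per_stream;
    for (const Task &task : tasks) {
        assert(task.duration_us >= 0);  // durations cannot be negative
        busy_time_per_stream[task.stream_id] += task.duration_us;
    }
    int64_t makespan = 0;
    for (const auto &entry : busy_time_per_stream) {
        makespan = std::max(makespan, entry.second);
    }
    return makespan;
}
```

In this model, two 10-microsecond tasks on stream 0 take 20 microseconds in total, while placing them on streams 0 and 1 takes 10. Real hardware also has dependencies and shared compute resources, so the measured gain may be smaller.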

This section uses scenario 1 above as an example to describe how to customize concurrency with the custom logical stream allocation pass. The overall process is as follows:

Call the REGISTER_CUSTOM_PASS registration macro to register a pass under a specified pass name, and compile the custom function into a dynamic library plugin. The registered pass is then called after the logical stream allocation phase, where it can fine-tune the built-in stream allocation result. The sample code is as follows:
#include "register_custom_pass.h"
// User-defined logical stream allocation function
Status CustomStreamPassFunc(const ConstGraphPtr &graph, StreamPassContext &stream_context) {
    // Define the logical stream allocation behavior.
    return GRAPH_SUCCESS;
}
// Register the pass. You do not need to specify the stage. By default, the pass is executed after the logical stream allocation phase.
REGISTER_CUSTOM_PASS("pass_name").CustomAllocateStreamPassFn(CustomStreamPassFunc);
  • register_custom_pass.h: a header file stored in the /cann/include/register/ directory of the CANN installation directory. Including this header file gives you access to the classes and APIs used for pass registration.
  • Status: operation status. If the operation is successful, ge::GRAPH_SUCCESS is returned. If the operation fails, other values are returned. You are advised to use a value less than 0 as the returned error code. A value greater than 0 may conflict with the error code used by the framework.
  • CustomStreamPassFunc: execution function of the custom pass. For details, see Callback Function CustomAllocateStreamPassFunc.
  • graph: graph to which the logical stream is to be allocated. The type is ConstGraphPtr.
  • stream_context: StreamPassContext object. For details, see the methods provided in StreamPassContext.
  • REGISTER_CUSTOM_PASS: a macro used to register a custom pass. pass_name can be set to any name. For details, see REGISTER_CUSTOM_PASS.
  • CustomAllocateStreamPassFn: the method used to register the function that executes the custom logical stream allocation pass. For details, see CustomAllocateStreamPassFn.

When forced single-stream mode is enabled (ge.enableSingleStream in Command-Line Options is set to true), the custom logical stream allocation pass function is not executed.
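
The registration flow described above can be sketched with stand-in types. This is a simplified illustration of name-based registration, not the CANN implementation: MockStatus, MockRegistry, MockPassRegistrar, and RunMockStreamPasses are hypothetical names, and the real macro and callback signatures are those documented in REGISTER_CUSTOM_PASS. The sketch also models two points from the text: custom error codes are negative, and the passes are skipped when single-stream mode is forced.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-ins; these are NOT the CANN types.
using MockStatus = int32_t;
constexpr MockStatus kMockSuccess = 0;
constexpr MockStatus kMockCustomError = -1;  // negative, as advised for custom codes

using MockStreamPassFn = std::function<MockStatus()>;

// Name-keyed registry held in a function-local static, so it is
// initialized before the first registrar uses it.
inline std::map<std::string, MockStreamPassFn> &MockRegistry() {
    static std::map<std::string, MockStreamPassFn> registry;
    return registry;
}

// Builder-style registrar that mimics the call shape
// REGISTER_CUSTOM_PASS("name").CustomAllocateStreamPassFn(fn).
class MockPassRegistrar {
public:
    explicit MockPassRegistrar(std::string name) : name_(std::move(name)) {}

    MockPassRegistrar &CustomAllocateStreamPassFn(MockStreamPassFn fn) {
        assert(fn != nullptr);  // an empty callback cannot be executed later
        MockRegistry()[name_] = std::move(fn);
        return *this;
    }

private:
    std::string name_;
};

// Runs the registered passes in the map's key order (an artifact of this
// sketch) and stops at the first failure. Forced single-stream mode skips them.
inline MockStatus RunMockStreamPasses(bool enable_single_stream) {
    if (enable_single_stream) {
        return kMockSuccess;
    }
    for (const auto &entry : MockRegistry()) {
        const MockStatus status = entry.second();
        if (status != kMockSuccess) {
            return status;
        }
    }
    return kMockSuccess;
}
```

Keeping the registry behind a function avoids depending on the initialization order of globals across translation units, which matters when registration happens while a plugin library is being loaded.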

Example

  • Prerequisites

    You have analyzed the optimization points based on the structure shown in the following figure and profiling.

    Figure 1 Original graph

    According to the profiling result, operators 1 and 2 with concurrency conditions (no data dependency or control dependency) are executed in serial mode (after logical stream allocation, operators 1 and 2 are allocated with the same stream ID). You can use the function described in this section to change the execution to parallel mode.

    If operators that meet the concurrency conditions use the same compute resources, they may preempt or wait for each other, and concurrency may bring no benefit. Use profiling to confirm whether parallel execution is worthwhile in your case. The following uses this example to describe how to adjust the execution mode.
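
The concurrency condition (no data dependency and no control dependency) can be checked on a graph as mutual unreachability. The following is a minimal sketch under the assumption that the graph is available as a plain adjacency list in which data edges and control edges are treated alike. MockDag, IsReachable, and CanRunConcurrently are hypothetical names, not framework APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical adjacency list: successors[i] lists the nodes that depend on
// node i through either a data edge or a control edge.
using MockDag = std::vector<std::vector<std::size_t>>;

// Iterative depth-first search: returns true if `to` is reachable from `from`.
inline bool IsReachable(const MockDag &successors, std::size_t from, std::size_t to) {
    assert(from < successors.size() && to < successors.size());
    std::vector<bool> visited(successors.size(), false);
    std::vector<std::size_t> stack{from};
    while (!stack.empty()) {
        const std::size_t current = stack.back();
        stack.pop_back();
        if (current == to) {
            return true;
        }
        if (visited[current]) {
            continue;
        }
        visited[current] = true;
        for (std::size_t next : successors[current]) {
            stack.push_back(next);
        }
    }
    return false;
}

// Two distinct nodes meet the concurrency condition when neither one
// depends, directly or transitively, on the other.
inline bool CanRunConcurrently(const MockDag &successors, std::size_t a, std::size_t b) {
    return a != b && !IsReachable(successors, a, b) && !IsReachable(successors, b, a);
}
```

For a diamond-shaped graph in which node 0 feeds nodes 1 and 2, and both feed node 3, nodes 1 and 2 meet the condition, while nodes 0 and 3 do not. Meeting the condition only makes concurrency possible; whether it pays off still depends on the resource contention described above.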

  • Development procedure
    1. Include the header file.
      #include <iostream>
      //Include the header file of the custom pass API.
      #include "register_custom_pass.h"
      
    2. Develop a custom pass to allocate a new stream ID to node 1. (The following code is only an example and cannot be executed.)
      Note: After obtaining the graph to which the logical stream is to be allocated, you can modify only the nodes in the graph, and you can obtain those nodes only through the GetDirectNode API.
      #include <iostream>
      #include <string>
      #include "register_custom_pass.h"
      // Customize stream allocation by implementing it as a pass.
      graphStatus AllocateStreamPass(const ConstGraphPtr &graph, StreamPassContext &context) {
          // Traverse the direct nodes of the graph (nodes inside subgraphs are not visited recursively).
          for (const auto &node : graph->GetDirectNode()) {
              AscendString node_name;
              node.GetName(node_name); // Obtain the node name.
              // Check whether this node is the target operator (Abs_1).
              if (std::string(node_name.GetString()) == "Abs_1/unary_ops_composition_0_SquareAbs_1/unary_ops_composition_1_Abs") {
                  // Allocate a new stream ID to the node.
                  context.SetStreamId(node, context.AllocateNextStreamId());
              }
          }
          return GRAPH_SUCCESS;
      }
      // Register the custom pass.
      REGISTER_CUSTOM_PASS("AllocateStreamPass").CustomAllocateStreamPassFn(AllocateStreamPass);
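
Because the sample above depends on the CANN headers, its core logic can be tried out separately with stand-in types. The following is a minimal sketch: MockNode, MockStreamContext, and MoveNodeToNewStream are hypothetical names, and the assumption that AllocateNextStreamId hands out the next unused ID is made only for this illustration. See StreamPassContext for the actual behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a graph node: a name and its logical stream ID.
struct MockNode {
    std::string name;
    int64_t stream_id;
};

// Hypothetical stand-in for the stream pass context.
class MockStreamContext {
public:
    explicit MockStreamContext(int64_t next_stream_id) : next_stream_id_(next_stream_id) {
        assert(next_stream_id >= 0);  // stream IDs are non-negative in this sketch
    }

    // Assumption: returns the next unused stream ID and advances the counter.
    int64_t AllocateNextStreamId() { return next_stream_id_++; }

    void SetStreamId(MockNode &node, int64_t stream_id) const { node.stream_id = stream_id; }

private:
    int64_t next_stream_id_;
};

// Moves every node whose name matches `target_name` to a newly allocated
// stream and returns how many nodes were moved.
inline int MoveNodeToNewStream(std::vector<MockNode> &nodes, const std::string &target_name,
                               MockStreamContext &context) {
    int moved = 0;
    for (MockNode &node : nodes) {
        if (node.name == target_name) {
            context.SetStreamId(node, context.AllocateNextStreamId());
            ++moved;
        }
    }
    return moved;
}
```

Returning the number of moved nodes makes a name mismatch visible: a result of 0 means that the target name did not match any node, which is easy to miss when matching on long generated operator names.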
      

How to Use a Custom Pass

This part describes how to build the custom logical stream allocation function into a dynamic library plugin so that the registered pass is called after the logical stream allocation phase.

  • Prerequisites

    Install the CANN software package as instructed in CANN Software Installation Guide.

  • Program compilation
    1. Obtain the CMakeLists.txt script by referring to Sample Usage, and place the custom logical stream allocation function file AllocateStreamPass.cpp in the src directory according to the directory structure in the sample.
    2. Modify the following variables in the CMakeLists.txt file as needed:
      • ASCEND_PATH: CANN installation path, for example, /usr/local/Ascend/cann when CANN is installed as the root user.
      • target_include_directories: header file directories to include. No modification is required for this example. If your own code needs additional header files, append a line to the sample code without deleting the original lines. If the network contains a custom operator, include the header file of its prototype definition.
      • target_link_libraries: libraries to link. No modification is required for this example. If your own code needs additional libraries, append a line to the sample code without deleting the original lines.
    3. Run the build commands:
      mkdir build && cd build
      cmake .. && make

      After the build is complete, the dynamic library file libAllocateStreamPass.so is generated in the build directory.

    4. Copy libAllocateStreamPass.so to the ${ASCEND_PATH}/opp/vendors/xxx/custom_fusion_passes/ directory, where xxx is a user-defined directory name. (Soft links are allowed. The .so file must be readable by the user who runs the program.)

      Multiple ${ASCEND_PATH}/opp/vendors/xxx directories are sorted in text order and then traversed to search for the custom_fusion_passes/ subdirectory. The .so files in a single subdirectory are loaded in text order, while the files whose names do not end with .so are skipped during loading.

      • xxx: There must be only one level of custom directory.
      • custom_fusion_passes: The directory cannot contain subdirectories.
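
The loading order described in step 4 can be sketched as a pure function over directory listings. This is an illustration, not the framework's loader: VendorDir and PluginLoadOrder are hypothetical names, and "text order" is assumed here to mean byte-wise lexicographic comparison of names.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical listing of one vendors/xxx directory: its name and the file
// names found in its custom_fusion_passes/ subdirectory.
struct VendorDir {
    std::string name;
    std::vector<std::string> pass_dir_entries;
};

// Returns true if the file name ends with ".so".
inline bool EndsWithSo(const std::string &file_name) {
    const std::string suffix = ".so";
    return file_name.size() >= suffix.size() &&
           file_name.compare(file_name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Sorts vendor directories by name, then the files inside each one by name,
// and skips every file whose name does not end with ".so".
inline std::vector<std::string> PluginLoadOrder(std::vector<VendorDir> vendors) {
    std::sort(vendors.begin(), vendors.end(),
              [](const VendorDir &a, const VendorDir &b) { return a.name < b.name; });
    std::vector<std::string> order;
    for (VendorDir &vendor : vendors) {
        assert(!vendor.name.empty());  // a vendor directory must have a name
        std::sort(vendor.pass_dir_entries.begin(), vendor.pass_dir_entries.end());
        for (const std::string &file_name : vendor.pass_dir_entries) {
            if (EndsWithSo(file_name)) {
                order.push_back(vendor.name + "/custom_fusion_passes/" + file_name);
            }
        }
    }
    return order;
}
```

Under these assumptions, a plugin in an alphabetically earlier vendor directory is always loaded before any plugin in a later one, regardless of the file names, so directory and file naming is the way to influence the load order.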
  • Custom pass usage (model files can be built using any of the following entries)
    To check whether the custom pass takes effect, dump the graph before model build by setting the DUMP_GE_GRAPH environment variable and build the model from the following entries:

Result Verification

If a dynamic-shape model contains a part that can be offloaded, the framework splits the model into dynamic scheduling and offload scheduling (static subgraph). Offload scheduling may involve multiple small models. When you allocate custom streams for a dynamic-static hybrid model, there are multiple dump graphs before and after the custom pass. The first dump graph corresponds to the root graph, and the subsequent dump graphs correspond to the static models in the model. If you want to view the allocation result in the dump graph, you are advised to view the dump graph of the last custom stream allocation before the graph build.

After the dump environment variable is set, the graph files such as ge_onnx*.pbtxt are generated in the current path after the program is executed. You can obtain the following two graphs and use visualization software such as Netron to view the graphs.

  • ge_onnx_xxx_RunCustomPass_BeforeAssignLogicStream*.pbtxt: graph before the pass is executed. * indicates the pass name. For details about the graph structure, see Figure 1.
  • ge_onnx_xxx_RunCustomPass_AfterAssignLogicStream*.pbtxt: graph after the pass is executed. * indicates the pass name. The graph structure is as follows:

    As shown in the figure, a new stream ID is allocated to operator 1.

You can also verify the final effect from the compiled graph and the profiling data. As shown in Figure 2, inter-stream synchronization operators such as Send and Recv are generated before and after operator 1. The profiling result collected after execution (Figure 3) confirms that operators 1 and 2 are executed concurrently on two streams. For details about the profiling operation, see Profile Data Collection.

Figure 2 Compiled graph
Figure 3 Profiling result
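
The reason Send and Recv appear before and after operator 1 can be sketched as follows: once an operator moves to its own stream, every dependency edge that connects it to an operator on another stream needs inter-stream synchronization. The following toy model assumes one synchronization point per cross-stream edge, whereas the framework may merge or optimize them. MockEdge and FindCrossStreamEdges are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical dependency edge between two node indices.
struct MockEdge {
    std::size_t src;
    std::size_t dst;
};

// Returns the edges whose endpoints are on different logical streams.
// In this toy model, each such edge needs a Send on the source stream and
// a Recv on the destination stream.
inline std::vector<MockEdge> FindCrossStreamEdges(const std::vector<int64_t> &stream_ids,
                                                  const std::vector<MockEdge> &edges) {
    std::vector<MockEdge> crossing;
    for (const MockEdge &edge : edges) {
        assert(edge.src < stream_ids.size() && edge.dst < stream_ids.size());
        if (stream_ids[edge.src] != stream_ids[edge.dst]) {
            crossing.push_back(edge);
        }
    }
    return crossing;
}
```

With four nodes on stream 0 connected as 0→1, 0→2, 1→3, and 2→3, no edge crosses streams. After node 1 moves to stream 1, the edges 0→1 and 1→3 cross, which corresponds to the synchronization generated before and after operator 1.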