Working Principles of the GE

Graph Build

The GE provides two graph build methods:

Executing models (such as onnx/pb) in graph mode: Use the ATC command-line tool or the C++ Parser API to parse framework model files (in formats such as *.onnx and *.pb) into computational graphs represented in Ascend IR. During parsing, the operators of the frontend framework are mapped to CANN operators one by one.
Figure 1 Parsing files into computational graphs using the Parser API

For details about the ATC command-line tool, see ATC Instructions.

For details about how to use the C++ Parser API to parse a model into a graph, see Using Parser APIs to Parse the Original Model into a Graph.

Using graph development APIs to build a new graph: Use the graph build APIs to combine computational functions (operators) into a computational graph represented in Ascend IR. The following figure shows the basic process of building a graph. For more information, see Using Graph Development APIs to Construct a New Graph.
Figure 2 Building a computational graph
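The idea of combining operators into a computational graph can be illustrated with a minimal Python sketch. This is not the GE graph development API: the Op and Graph classes, their methods, and the operator names below are hypothetical and exist only to show how operators and the edges between them form a graph with declared inputs and outputs.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """One node of the toy graph (hypothetical, not a GE class)."""
    name: str
    op_type: str
    inputs: list = field(default_factory=list)  # names of producer operators
    attrs: dict = field(default_factory=dict)


class Graph:
    """Toy computational graph; operators must be added after their inputs."""

    def __init__(self, name):
        self.name = name
        self.ops = {}
        self.inputs = []
        self.outputs = []

    def add_op(self, name, op_type, inputs=(), **attrs):
        if name in self.ops:
            raise ValueError(f"duplicate operator name: {name}")
        for src in inputs:
            if src not in self.ops:
                raise KeyError(f"unknown input operator: {src}")
        self.ops[name] = Op(name, op_type, list(inputs), attrs)
        return self.ops[name]

    def set_inputs(self, names):
        self.inputs = list(names)

    def set_outputs(self, names):
        self.outputs = list(names)

    def topological_order(self):
        # Insertion order is already topological because add_op
        # rejects operators whose inputs have not been added yet.
        return list(self.ops)


# Build y = Relu(Add(x, w)).
g = Graph("toy_graph")
g.add_op("x", "Data")
g.add_op("w", "Const", value=[1.0, 2.0])
g.add_op("add", "Add", inputs=["x", "w"])
g.add_op("relu", "Relu", inputs=["add"])
g.set_inputs(["x"])
g.set_outputs(["relu"])
```

Requiring producers to be added before consumers keeps the sketch short, because the insertion order can then double as an execution order.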

Graph Compilation and Optimization

For a computational graph represented in Ascend IR, the GE performs a series of compilation and optimization steps and generates an offline model in OM format that meets the runtime requirements of the underlying hardware. The main process is as follows:

Graph preparation: The output tensor descriptions of each operator (including tensor shape, data type, and format) are inferred in advance from the operator's input tensor descriptions, logic, and attributes. This process is called inferShapeAndType and inferFormat. Because the tensor descriptions are then known ahead of execution, memory can be statically allocated to all tensors in the graph during the preparation phase, avoiding the overhead of dynamic memory allocation. In addition, hardware-independent algorithm-level optimizations are performed, including but not limited to constant folding and redundant branch elimination.
Graph partitioning: Operators are classified by execution engine (such as AI Core or AI CPU) and partitioned into separate subgraphs, so that each subgraph can subsequently be optimized for its hardware.
Graph optimization: Optimization methods such as operator fusion are applied to improve graph execution performance. Some optimizations are hardware-independent. For example, multiple operators can be fused into one or more operators to reduce computing time. Others are hardware-specific. For example, UB fusion can shorten the data transfer time in the hardware memory, thereby improving execution efficiency.
Graph compilation: Runtime resources, including memory and stream resources, are allocated based on the computational graph, and an .om offline model is compiled and generated.
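The preparation and partitioning steps above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the GE implementation: the graph is a plain dict listed in topological order, constants are scalars, every tensor is assumed to be 4-byte float32, and the ENGINE_OF table (including the CustomOp operator type) is a hypothetical engine assignment chosen only for illustration. Operator fusion is omitted.

```python
import math
import operator

FOLDABLE = {"Add": operator.add, "Mul": operator.mul}
# Hypothetical engine assignment, for illustration only.
ENGINE_OF = {"Add": "AI Core", "Mul": "AI Core", "Relu": "AI Core",
             "CustomOp": "AI CPU"}


def fold_constants(graph):
    """Replace operators whose inputs are all constants with one Const node."""
    folded = {}
    for name, node in graph.items():
        ins = [folded[i] for i in node["inputs"]]
        if node["type"] in FOLDABLE and ins and all(n["type"] == "Const" for n in ins):
            value = FOLDABLE[node["type"]](*(n["value"] for n in ins))
            node = {"type": "Const", "inputs": [], "value": value}
        folded[name] = dict(node)
    # Drop constants that no remaining operator consumes.
    used = {i for n in folded.values() for i in n["inputs"]}
    return {k: v for k, v in folded.items() if v["type"] != "Const" or k in used}


def infer_shapes(graph):
    """Annotate every node with its output shape (a toy shape inference)."""
    for node in graph.values():
        if node["type"] == "Data":
            continue  # graph inputs already carry their shape
        if node["type"] == "Const":
            node["shape"] = ()  # this sketch supports scalar constants only
            continue
        shapes = {graph[i]["shape"] for i in node["inputs"]} - {()}
        if len(shapes) > 1:
            raise ValueError("input shapes do not match")
        node["shape"] = shapes.pop() if shapes else ()
    return graph


def plan_memory(graph, itemsize=4):
    """Give each tensor a fixed offset in one buffer; sizes are known up front."""
    offsets, cursor = {}, 0
    for name, node in graph.items():
        offsets[name] = cursor
        cursor += math.prod(node["shape"]) * itemsize
    return offsets, cursor


def partition(graph):
    """Group consecutive compute operators that share an engine into subgraphs."""
    subgraphs = []
    for name, node in graph.items():
        engine = ENGINE_OF.get(node["type"])
        if engine is None:
            continue  # Data and Const nodes are not compute operators
        if subgraphs and subgraphs[-1][0] == engine:
            subgraphs[-1][1].append(name)
        else:
            subgraphs.append((engine, [name]))
    return subgraphs


graph = {
    "x":     {"type": "Data", "inputs": [], "shape": (2, 3)},
    "a":     {"type": "Const", "inputs": [], "value": 2.0},
    "b":     {"type": "Const", "inputs": [], "value": 3.0},
    "scale": {"type": "Mul", "inputs": ["a", "b"]},
    "y":     {"type": "Mul", "inputs": ["x", "scale"]},
    "out":   {"type": "Relu", "inputs": ["y"]},
    "post":  {"type": "CustomOp", "inputs": ["out"]},
}

prepared = infer_shapes(fold_constants(graph))
offsets, total_bytes = plan_memory(prepared)
subgraphs = partition(prepared)
```

In this example, the product of the two constants is folded into a single constant before shapes are inferred, so the memory plan never reserves space for the intermediate constants. The plan itself is a naive bump allocation without buffer reuse, which a real compiler would be expected to improve on.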

Graph Loading and Execution

The offline model file generated after compilation is loaded, running resources are allocated, and streams and tasks are delivered to the device for execution. The process is as follows:

Graph loading: The offline model is parsed, memory resources are allocated, and stream resources are created.
Graph execution: Input data is copied, streams and tasks are delivered to the device, and the corresponding operators are executed by the AI Core or AI CPU. After the computation is complete, the device returns the result to the user program on the host.
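The load-then-execute flow can be mimicked with a small Python sketch. Everything here is an assumption made for illustration: a JSON string stands in for the .om file (whose real format is not described in this document), the LoadedModel class and the KERNELS table are hypothetical, and tasks run in order on the host rather than being delivered to a device.

```python
import json

# Hypothetical kernels operating on flat lists of floats.
KERNELS = {
    "Mul": lambda a, b: [x * y for x, y in zip(a, b)],
    "Relu": lambda a: [max(0.0, x) for x in a],
}


class LoadedModel:
    """Toy stand-in for a loaded offline model."""

    def __init__(self, serialized):
        # Graph loading: parse the model and set up resources once.
        model = json.loads(serialized)
        self.inputs = model["inputs"]
        self.outputs = model["outputs"]
        self.tasks = model["tasks"]  # ordered task list, playing the role of a stream
        self.buffers = {name: None for name in model["tensors"]}
        self.buffers.update(model.get("weights", {}))

    def execute(self, feeds):
        # Graph execution: copy inputs in, run tasks in order, copy results out.
        for name in self.inputs:
            self.buffers[name] = list(feeds[name])
        for task in self.tasks:
            args = [self.buffers[i] for i in task["inputs"]]
            self.buffers[task["output"]] = KERNELS[task["kernel"]](*args)
        return {name: list(self.buffers[name]) for name in self.outputs}


model_json = json.dumps({
    "inputs": ["x"],
    "outputs": ["out"],
    "tensors": ["x", "w", "y", "out"],
    "weights": {"w": [2.0, 2.0, 2.0]},
    "tasks": [
        {"kernel": "Mul", "inputs": ["x", "w"], "output": "y"},
        {"kernel": "Relu", "inputs": ["y"], "output": "out"},
    ],
})

model = LoadedModel(model_json)
result = model.execute({"x": [-1.0, 0.5, 3.0]})
print(result)  # {'out': [0.0, 1.0, 6.0]}
```

Separating loading from execution means the model is parsed and its buffers are set up once, after which execute can be called repeatedly with new input data.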
