Introduction to AOE

This section describes the concepts, architecture, and tuning processes of AOE.

On Atlas 200I/500 A2 inference products, only tuning in offline inference scenarios is supported.

AOE

Ascend Optimization Engine (AOE) is an automatic tuning tool that makes full use of limited hardware resources to meet the performance requirements of operators and the entire network.

AOE iterates over tiling policies through a closed feedback loop of policy generation, compilation, and verification in the operating environment until it finds the optimal policy. This allows hardware resources to be fully utilized and improves network performance.
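
The generate, compile, and verify loop described above can be sketched in a few lines of Python. This is only an illustration, not how AOE is implemented: the search space, the buffer limit, and the cost model are all hypothetical stand-ins, and the search is a plain exhaustive enumeration rather than a feedback-guided one.

```python
import itertools

def ceil_div(a, b):
    return -(-a // b)

def generate_policies():
    """Policy generation: enumerate candidate tile sizes (hypothetical search space)."""
    return list(itertools.product((16, 32, 64, 128), repeat=2))

def compile_policy(policy, buffer_limit=8192):
    """Compilation stand-in: reject tiles that exceed a made-up on-chip buffer budget."""
    tile_m, tile_n = policy
    return tile_m * tile_n <= buffer_limit

def measure(policy, m=512, n=512):
    """Verification stand-in: a toy cost model instead of a run in the operating environment."""
    tile_m, tile_n = policy
    tiles = ceil_div(m, tile_m) * ceil_div(n, tile_n)
    per_tile_overhead = 50.0            # assumed fixed cost of launching one tile
    per_tile_compute = tile_m * tile_n * 0.01
    return tiles * (per_tile_overhead + per_tile_compute)

def tune():
    """Keep the cheapest policy that compiles."""
    best_policy, best_cost = None, float("inf")
    for policy in generate_policies():
        if not compile_policy(policy):
            continue                    # policy cannot be built, skip verification
        cost = measure(policy)
        if cost < best_cost:
            best_policy, best_cost = policy, cost
    return best_policy, best_cost

best_policy, best_cost = tune()
print(best_policy, best_cost)
```

In this toy model the largest tile that still fits the assumed buffer wins, because it minimizes the per-tile overhead. A real tuner measures each candidate in the operating environment instead of using a formula.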

Figure 1 Architecture

Application layer: tuning entry.
- AOE: the AOE process. In the following sections, AOE refers to this process, which is used for:
  - Tuning in offline inference scenarios. For details, see Tuning in Offline Inference Scenarios.
  - Tuning in PyTorch training scenarios. For details, see Offline Tuning in PyTorch-based Training Scenarios.
  - Tuning in the IR graph construction scenario. For details, see Tuning in IR Graph Construction Scenarios.
- TensorFlow Adapter (TFAdapter): performs tuning in the TensorFlow-based training and online inference scenarios. For details, see Online Tuning in TensorFlow-based Training Scenarios and Tuning in TensorFlow-based Online Inference Scenarios.
- PyTorch Adapter (PyTorchAdapter): performs tuning in the PyTorch-based training and online inference scenarios. For details, see Offline Tuning in PyTorch-based Training Scenarios and Tuning in PyTorch-based Online Inference Scenarios.
Tuning layer: tuning mode. The following modes are supported:
- Subgraph tuning: Subgraph Auto Tuning (SGAT) tunes the subgraph tiling policy, verifies the performance in the operating environment, and saves the optimal tiling policy to the model repository to produce the tuned model.
- Operator tuning: Operator Auto Tuning (OPAT) tunes operators, verifies the performance in the operating environment, and saves the optimal operator tiling policy to the operator repository.
- Gradient tuning: Gradient Auto Tuning (GDAT) tunes the AllReduce fusion policy and verifies the performance in the operating environment to obtain the optimal AllReduce fusion policy.
You are advised to perform subgraph tuning before operator tuning. Subgraph tuning determines how the graph is partitioned, and once it is complete, the operators are partitioned into their final shapes, so operator tuning can then work on those final shapes. If operator tuning is performed first, the tuned operators do not yet have the shapes they will have after partitioning, so the tuning result does not reflect the actual application scenario.
Execution layer: compilation (Compiler) and running (Runner) in the operating environment.
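
The recommendation to run subgraph tuning before operator tuning can be illustrated with a toy repository. The sketch below assumes, purely for illustration, that tuned policies are looked up by operator type and shape. The partition function, the repository layout, and the operator name are hypothetical and do not describe the actual repository format.

```python
def partition(shape, parts):
    """Hypothetical subgraph partition: split the first axis evenly into `parts`."""
    rows, cols = shape
    return (rows // parts, cols)

def tune_operator(repository, op_type, shape):
    """Stand-in for operator tuning: record a policy keyed by operator type and shape."""
    repository[(op_type, shape)] = f"tiling tuned for {shape}"

original_shape = (1024, 1024)
final_shape = partition(original_shape, parts=4)   # shape after subgraph tuning

# Operator tuning first: the operator is tuned on its pre-partition shape.
repo_operator_first = {}
tune_operator(repo_operator_first, "MatMul", original_shape)

# Subgraph tuning first: the operator is tuned on its final shape.
repo_subgraph_first = {}
tune_operator(repo_subgraph_first, "MatMul", final_shape)

# At run time the operator has its final shape, so only one repository matches.
hit_operator_first = ("MatMul", final_shape) in repo_operator_first
hit_subgraph_first = ("MatMul", final_shape) in repo_subgraph_first
print(hit_operator_first, hit_subgraph_first)
```

Under this assumption, the policy tuned before partitioning is never matched, which is the mismatch the recommended order avoids.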

SGAT

SGAT is an optimizer that improves the performance of subgraphs. A complete network can be partitioned into multiple subgraphs. SGAT generates different tiling policies for these subgraphs and collects the profile data of each tiling policy iteration to find the policy that gives the best subgraph performance. The tuning result is saved to a subgraph repository.

SGAT supports resumable tuning. If tuning is interrupted abnormally, it can be resumed from the point of interruption.
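
Resuming from the point of interruption can be pictured as a tuning loop that saves its progress after every evaluated policy. The following is a minimal sketch under stated assumptions: the checkpoint file name, the state layout, and the toy cost function are all hypothetical and are not SGAT's actual mechanism.

```python
import json
import os
import tempfile

evaluated = []   # records which policies were actually profiled

def evaluate(policy):
    """Toy stand-in for profiling one tiling policy; lower is better."""
    evaluated.append(policy)
    return abs(policy - 48) + 10

def tune_with_checkpoint(policies, checkpoint_path, fail_at=None):
    """Evaluate policies in order, saving progress after each one."""
    state = {"next_index": 0, "best_policy": None, "best_cost": None}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)          # resume from the saved progress
    for i in range(state["next_index"], len(policies)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated abnormal interruption")
        cost = evaluate(policies[i])
        if state["best_cost"] is None or cost < state["best_cost"]:
            state["best_policy"], state["best_cost"] = policies[i], cost
        state["next_index"] = i + 1
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
    return state

policies = [16, 32, 48, 64, 96]
path = os.path.join(tempfile.mkdtemp(), "sgat_checkpoint.json")

try:
    tune_with_checkpoint(policies, path, fail_at=3)   # first run is interrupted
except RuntimeError:
    pass
resumed = tune_with_checkpoint(policies, path)        # second run continues
print(resumed, evaluated)
```

Because the second run starts at the saved index, each policy is profiled only once across both runs.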

Figure 2 shows the subgraph tuning process.

Figure 2 Subgraph tuning process

OPAT

OPAT is an optimizer that improves operator performance. AOE passes an entire graph to OPAT. OPAT performs operator fusion internally, divides the fused graph into per-operator subgraphs, generates different tiling policies for each fused operator subgraph to find the one with the best operator performance, and stores the optimal policy in the operator repository.
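
The OPAT flow of fusing operators, splitting the fused graph by operator, and storing one policy per fused operator can be sketched as follows. The fusion rule (element-wise operators join the preceding operator), the candidate tile sizes, and the cost function are hypothetical and chosen only to keep the example small.

```python
ELEMENTWISE = {"Add", "Relu", "Mul"}   # hypothetical set of fusible operators

def fuse(ops):
    """Fuse each element-wise operator into the group of the operator before it."""
    groups = []
    for op in ops:
        if groups and op in ELEMENTWISE:
            groups[-1].append(op)
        else:
            groups.append([op])
    return groups

def tune_group(group, candidates=(16, 32, 64)):
    """Pick the candidate tile size with the lowest toy cost for this fused group."""
    def cost(tile):
        return abs(tile - 16 * len(group))   # made-up cost, not a real measurement
    return min(candidates, key=cost)

graph = ["Conv2D", "Add", "Relu", "MatMul", "Mul", "Softmax"]
groups = fuse(graph)

# Operator repository stand-in: one tuned policy per fused operator.
repository = {"+".join(group): tune_group(group) for group in groups}
print(repository)
```

Each key in the repository corresponds to one fused operator subgraph, so a later build can look up the stored policy instead of tuning again.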

The current version of AOE supports only the auto tuning of AI Core operators whose compute logic is implemented using DSL APIs. For details about the supported operators, see Operator List.

Figure 3 shows the operator tuning process.

Figure 3 Operator tuning process

GDAT

GDAT is a tuning tool that shortens the communication hangover, that is, the communication time left over after backward propagation, by maximizing the degree of parallelism between backward propagation and gradient aggregation communication. In a distributed training scenario, after gradients are calculated, a gradient aggregation operation is performed across devices. The fusion policy of the gradient aggregation operator affects the communication hangover after backward propagation, and therefore affects the performance and linearity of cluster training. A good gradient splitting policy keeps the hangover time as short as possible.
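
The effect of the fusion policy on the hangover time can be shown with a small simulation. It is a sketch under simplifying assumptions that are not taken from GDAT: gradients become ready in order during backward propagation, only one AllReduce runs at a time, and each AllReduce costs a fixed latency plus size divided by bandwidth. All numbers are made up.

```python
from itertools import combinations

def hangover(grad_ready, grad_size, boundaries, latency=2.0, bandwidth=10.0):
    """Communication time remaining after backward propagation ends.

    grad_ready[i]: time at which gradient i becomes available.
    grad_size[i]:  size of gradient i.
    boundaries:    indices at which a new fused AllReduce group starts.
    """
    starts = [0, *boundaries]
    ends = [*boundaries, len(grad_ready)]
    link_free = 0.0
    for s, e in zip(starts, ends):
        ready = max(grad_ready[s:e])        # a group waits for all of its gradients
        begin = max(ready, link_free)       # assumed: one AllReduce at a time
        link_free = begin + latency + sum(grad_size[s:e]) / bandwidth
    return max(0.0, link_free - max(grad_ready))

def best_fusion_policy(grad_ready, grad_size):
    """Exhaustively search all ways to split the gradients into consecutive groups."""
    n = len(grad_ready)
    candidates = (c for k in range(n) for c in combinations(range(1, n), k))
    return min(candidates, key=lambda b: hangover(grad_ready, grad_size, b))

grad_ready = [1, 2, 3, 4, 5, 6]
grad_size = [10, 10, 10, 10, 10, 10]

fully_fused = hangover(grad_ready, grad_size, ())
unfused = hangover(grad_ready, grad_size, (1, 2, 3, 4, 5))
best = best_fusion_policy(grad_ready, grad_size)
print(fully_fused, unfused, best, hangover(grad_ready, grad_size, best))
```

Fusing everything delays communication until the last gradient is ready, while fusing nothing pays the per-call latency for every gradient. In this toy setup an intermediate split gives a shorter hangover than either extreme, which is the kind of trade-off the fusion policy has to balance.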

Figure 4 shows the gradient tuning process.

Figure 4 Gradient tuning process
