Overview

Before You Start

Application scenarios

For a purely static-shape network, or a dynamic-shape network whose shapes rarely change, you can use the operator compilation tool op_compiler to compile and generate a static kernel package at network deployment time, quickly improving network model execution performance by speeding up operator calls.
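Whether a network counts as having "few shape changes" can be estimated by counting the distinct input-shape combinations each operator type is called with. A minimal sketch, assuming a simple `(op_type, input_shapes)` record layout that is purely illustrative and not the actual dump format used by the tool:

```python
from collections import defaultdict

def shape_variability(op_records):
    """Count distinct input-shape combinations seen per operator type.

    op_records: iterable of (op_type, input_shapes) pairs. The record
    layout is illustrative, not the actual operator-dump format.
    """
    shapes_per_op = defaultdict(set)
    for op_type, input_shapes in op_records:
        # Convert to a hashable key so identical shape combinations collapse.
        shapes_per_op[op_type].add(tuple(map(tuple, input_shapes)))
    return {op: len(shapes) for op, shapes in shapes_per_op.items()}

# An operator that always sees the same shapes is a good candidate for
# static kernel compilation; one with many distinct shapes is not.
records = [
    ("MatMul", [[32, 64], [64, 128]]),
    ("MatMul", [[32, 64], [64, 128]]),
    ("Add",    [[32, 128]]),
]
print(shape_variability(records))  # {'MatMul': 1, 'Add': 1}
```

A low count per operator suggests the static kernel approach described above will pay off; a count that grows with every run indicates a genuinely dynamic-shape workload.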

Tuning principles

In static kernel compilation, the shape of each operator is fixed at compilation time rather than at run time. The operator compilation tool obtains shape information from the input operator information file and compiles an operator binary file for each shape, improving operator efficiency and performance.

The advantages of static kernel compilation are as follows:

  • The sizes of all tensors are known before compilation, improving memory utilization.
  • Optimizations can be applied at compile time based on the actual shape sizes.
  • AI processors execute parallel instructions far more efficiently than scalar logic, and frequent scalar operations interrupt parallel instruction execution, degrading performance. With static compilation, scalar computation can be completed at compile time, improving runtime performance.
  • Because the data size of each operation is fixed, the compiler does not need to insert extra synchronization instructions, so instructions can execute in parallel, improving execution performance.
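The per-shape compilation model behind these advantages can be illustrated with a toy kernel cache: each distinct shape gets its own specialized "binary", so every size inside that kernel is a compile-time constant. The cache and compile function below are illustrative sketches, not CANN internals:

```python
class StaticKernelCache:
    """Toy illustration of per-shape kernel specialization.

    Real static kernels are compiled ahead of time by op_compiler; this
    sketch only mirrors the idea of one binary per distinct shape.
    """
    def __init__(self):
        self._kernels = {}

    def _compile(self, shape):
        # With the shape fixed, the "kernel" bakes the element count in as
        # a constant: no runtime shape logic, no extra synchronization.
        n = 1
        for dim in shape:
            n *= dim
        return lambda data: [x * 2 for x in data[:n]]

    def get(self, shape):
        key = tuple(shape)
        if key not in self._kernels:   # one specialized kernel per shape
            self._kernels[key] = self._compile(key)
        return self._kernels[key]

cache = StaticKernelCache()
kernel = cache.get((2, 3))
print(kernel(list(range(6))))    # [0, 2, 4, 6, 8, 10]
print(len(cache._kernels))       # 1 -- repeated shapes reuse the binary
```

Repeated calls with the same shape reuse the specialized kernel, which is exactly why this approach suits networks whose shapes are fixed or change rarely.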

Restrictions
  • Currently, only the static compilation and tuning compilation modes are supported.
  • The static compilation mode supports the following product models:
    • Atlas Training Series Product
  • Constraints related to the tuning compilation mode:
    • Different users are not allowed to use the same device for tuning at the same time.
    • Before tuning, disable the profiling function to avoid affecting the tuning result. For details about how to disable it, see the Performance Tuning Tool User Guide.

Basic Workflow

Figure 1 Schematic diagram of using the static kernel to improve performance

Figure 1 shows the basic process of using static kernel compilation to improve operator execution performance in a network model. The tuning procedure is as follows:

  1. Dump operator information.

    Before operator tuning, obtain the operator information of the network model.

    • Method 1: If Python-based APIs are used for programming, dump the operator .json file using the Ascend PyTorch Profiler API.
    • Method 2: If AscendCL C++ APIs are used for programming, dump the operator .json file using the aclopStartDumpArgs and aclopStopDumpArgs APIs.
  2. Compile the static kernel package.

    Use the operator compilation tool to compile the operators described in the dumped operator information file (*.json) and generate a kernel package.

    The operator compilation tool is a command-line tool provided by Ascend CANN for compiling operators and generating operator binary files. When the shape of an operator is fixed or changes only slightly, you can use this tool to compile and install a static kernel package to improve operator call performance. For details about the tool, see the Operator Compilation Tool User Guide.

    1. Select a compilation mode.

      The operator compilation tool uses the static compilation mode by default. If you want to further improve operator performance, you can enable the tuning compilation mode, which tunes the operators during compilation.

    2. Pack the static kernel.

      The operator compilation tool packs the kernel files generated after compilation into a .run package.

  3. Install the static kernel package.

    Upload the static kernel package to the server where the target network model runs, then execute the .run package to complete the installation.

  4. Verify the model execution effect.

    After the static kernel package is installed, run the target model again, and compare the whole-network running performance and the single-operator execution times before and after the installation.