Operator Tuning

This section provides guidance on tuning operators in PyTorch-based training scenarios, covering tuning precautions, dumping the operator graph, configuring environment variables, and running the tuning command.

Tuning Precautions

  1. Ensure that the training script runs successfully on the Ascend AI Processor and that its function and accuracy meet expectations.
  2. You are advised not to bind the training process to specific CPUs; use the default CPU scheduling policy. Otherwise, the tuning result may be affected.
  3. To improve tuning efficiency, keep the number of training steps as small as possible. Generally, one step is enough to execute the complete graph. Ensure that all operators in the graph are traversed during tuning.
  4. Currently, only static operators are supported. Dynamic operators are not supported.
  5. Only single-device scripts can be used to dump graphs.
  6. AOE does not allow different users to use the same device for tuning at the same time.
  7. If there is only one AOE process, ensure that the following conditions are met. If there are multiple AOE processes, scale these requirements up accordingly. (A quick check sketch follows this list.)
    • Available disk space in the home directory of the user who performs tuning: ≥ 20 GB
    • Available memory: ≥ 32 GB
    • Recommended number of host CPUs during operator tuning: ≥ TE_PARALLEL_COMPILER + TUNING_PARALLEL_NUM + 1 + min(Number of CPU cores/2, 8) + 58. For details about TE_PARALLEL_COMPILER, see Table 1.
    • Number of device cores ≥ the maximum number of cores used by the operators in the model
    • Device memory: depends on the model and on device memory overcommitment.
  8. Before tuning, disable the profiling function to avoid affecting the tuning result. For details about how to disable profiling, see the Performance Tuning Tool User Guide.
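
The following is a minimal check of the resource guidance in precaution 7 for a Linux host. It is an illustrative sketch, not part of the AOE tooling; the fallback value of 8 for TUNING_PARALLEL_NUM is an assumption, so substitute your actual settings.

# Check host resources against the operator tuning guidance (illustrative sketch).
CORES=$(nproc)                          # number of host CPU cores
TE=${TE_PARALLEL_COMPILER:-8}           # number of operator build processes
TP=${TUNING_PARALLEL_NUM:-8}            # assumed fallback; set to your actual value
HALF=$(( CORES / 2 )); [ "$HALF" -gt 8 ] && HALF=8
echo "Recommended host CPUs: >= $(( TE + TP + 1 + HALF + 58 )) (available: $CORES)"
df -BG --output=avail "$HOME" | tail -1 # available disk in home directory; need >= 20 GB
free -g | awk '/^Mem:/ {print "Available memory: " $7 " GB (need >= 32 GB)"}'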

Dumping the Operator Graph

Method 1: Dump the operator graph by calling aclGenGraphAndDumpForOp.

Method 2: Add the following code to the model script to dump the operator graph to the local host:
# Confirm the PyTorch framework version at the top of the model script.
import torch
if torch.__version__ >= "1.8":
    import torch_npu
else:
    import torch.npu

def train_model():
    # For version 1.8 or later, use the following call to enable the AOE dump interface.
    # dump_path is the path for storing the operator subgraphs; set it as required.
    torch_npu.npu.set_aoe(dump_path)

    train_model_one_step()  # Model training process. Generally, only one step is required.
# dump_path indicates the path for saving the dumped operator graph. It is mandatory and
# cannot be empty. If the configured path does not exist, the system creates it;
# multi-level directories are supported.
The ResNet-50 model is used as an example. The modification is as follows:
# line 427~437 of the referenced script
model.train()
optimizer.zero_grad()
end = time.time()
torch_npu.npu.set_aoe(dump_path)  # Enable the AOE dump interface before the training loop.
for i, (images, target) in enumerate(train_loader):  # training loop of the script
    # Graph mode
    if args.graph_mode:
        print("args.graph_mode")
        torch.npu.enable_graph_mode()

    if i > 0:  # Only one step is required; exit after the first iteration.
        exit()

    # measure data loading time
    data_time.update(time.time() - end)

    if args.gpu is not None:
        images = images.cuda(args.gpu, non_blocking=True)

Reference link: https://gitee.com/ascend/ModelZoo-PyTorch/blob/master/PyTorch/built-in/cv/classification/ResNet50_for_PyTorch/pytorch_resnet50_apex.py

Configuring Environment Variables

Before tuning, configure the following environment variables:
  • Basic environment variables of the CANN software

    The CANN portfolio provides a process-level script that sets the environment variables automatically. The following example commands assume the default installation paths for the root and non-root users; replace them with the actual installation paths.

    # If the Toolkit was installed as the root user:
    . /usr/local/Ascend/ascend-toolkit/set_env.sh
    # If the Toolkit was installed as a non-root user:
    . ${HOME}/Ascend/ascend-toolkit/set_env.sh
  • AOE depends on Python. Take Python 3.7.5 as an example. Run the following commands as the running user to configure the environment variables related to Python 3.7.5:
    # Set the Python 3.7.5 library path.
    export LD_LIBRARY_PATH=/usr/local/python3.7.5/lib:$LD_LIBRARY_PATH
    # If multiple Python 3 versions exist in the user environment, use Python 3.7.5.
    export PATH=/usr/local/python3.7.5/bin:$PATH

    Replace /usr/local/python3.7.5 with the actual installation path. You can also append the preceding commands to the ~/.bashrc file and run the source ~/.bashrc command to make the modification take effect immediately.
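
    To confirm that the intended Python takes precedence, a quick check can be run (the expected outputs in the comments assume the example installation path above):

    which python3        # expected: /usr/local/python3.7.5/bin/python3
    python3 --version    # expected: Python 3.7.5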

  • Before tuning, you can configure other optional environment variables by referring to the following example. For details, see Table 1.
    export ASCEND_DEVICE_ID=0
    export TUNE_BANK_PATH=/home/HwHiAiUser/custom_tune_bank
    export TE_PARALLEL_COMPILER=8
    export REPEAT_TUNE=False

    You can save the commands for configuring environment variables in a custom script for future use, as sketched below.
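
    For example, the settings from this section could be collected in a file such as ~/aoe_env.sh (the file name is an arbitrary example) and sourced before each tuning session:

    # ~/aoe_env.sh: example collection of the settings above; adjust paths and values.
    . /usr/local/Ascend/ascend-toolkit/set_env.sh
    export LD_LIBRARY_PATH=/usr/local/python3.7.5/lib:$LD_LIBRARY_PATH
    export PATH=/usr/local/python3.7.5/bin:$PATH
    export ASCEND_DEVICE_ID=0
    export TUNE_BANK_PATH=/home/HwHiAiUser/custom_tune_bank
    export TE_PARALLEL_COMPILER=8
    export REPEAT_TUNE=False

    Run source ~/aoe_env.sh before starting a tuning session.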

    Table 1 Environment variables

    ASCEND_DEVICE_ID
      Logical ID of the Ascend AI Processor.
      The value range is [0, N – 1] and the default value is 0. N indicates the number of devices on the physical machine, VM, or container.

    TUNE_BANK_PATH
      Path of the custom repository generated after tuning.
      The path must be an absolute path or a path relative to the AOE execution path. The path must exist, and the user must have the read, write, and execute permissions on it. If the path specified by TUNE_BANK_PATH does not exist or the user lacks the required permissions, the tuning process reports an error and exits.
      The priority of the paths for storing the custom repository is: TUNE_BANK_PATH > ASCEND_CACHE_PATH > default path. For details about TUNE_BANK_PATH and ASCEND_CACHE_PATH, see Environment Variables.
      Custom operator repository:
      • If this environment variable is not configured, run the env command to check whether ASCEND_CACHE_PATH exists. If it exists, the custom repository is stored in ${ASCEND_CACHE_PATH}/aoe_data/${soc_version}. If it does not exist, the custom repository is stored in ${HOME}/Ascend/latest/data/aoe/custom/op/${soc_version} by default.
      • If this environment variable is configured, the optimal policy after tuning is stored in the ${soc_version} directory under the configured path.
      NOTE:
      If multiple users share the repository, they must set TUNE_BANK_PATH to the same path and have the read and write permissions on that path.
      If a custom repository path is configured before tuning, this environment variable must also be set when you want to use the custom repository during model conversion.

    TE_PARALLEL_COMPILER
      Number of operator build processes, which must be an integer; required for operator build. If the value is greater than 1, parallel build is enabled. Parallel build is especially useful for building deep networks.
      In a scenario where AOE tuning is enabled, the maximum value is calculated as follows: Maximum value = Number of CPU cores x 80% / Number of Ascend AI Processors. The value ranges from 1 to 32, and the default value is 8.
      This environment variable accelerates operator build and therefore the tuning phases that involve operator build.

    REPEAT_TUNE
      Whether to initiate tuning again. This environment variable takes effect only when subgraph tuning or operator tuning is enabled.
      If it is set to False and a network tuning case (a tiling policy for a specific shape) already exists in the repository (built-in or custom), tuning of that case is skipped. If the logic of an operator changes, for example, ND input support is added to the GEMM operator, set this environment variable to True and initiate tuning again.
      The value can be True or False. The default value is False.
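
    Because TUNE_BANK_PATH must exist with the proper permissions before tuning starts, a typical preparation looks as follows (a sketch reusing the example path from above):

    # Create the custom repository path with the required permissions (example path).
    mkdir -p /home/HwHiAiUser/custom_tune_bank
    chmod u+rwx /home/HwHiAiUser/custom_tune_bank
    export TUNE_BANK_PATH=/home/HwHiAiUser/custom_tune_bank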

Performing Tuning

Use AOE to tune the prepared operator graph. The following is an example, where dump_path is the operator graph path configured during dumping:

aoe --job_type=2 --model_path=dump_path

In the command, --job_type=2 specifies operator tuning, and --model_path specifies the path of the dumped operator graph.

For more AOE parameters, see AOE Command-Line Options.
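
As a further illustration, REPEAT_TUNE can be combined with the same command to force re-tuning of cases that already exist in the repository (dump_path is the placeholder used above):

# Force re-tuning even if matching cases already exist in the repository.
export REPEAT_TUNE=True
aoe --job_type=2 --model_path=dump_path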