Subgraph/Operator Tuning
This section describes how to tune subgraphs and operators in TensorFlow-based training scenarios, covering tuning precautions, environment variable configuration, the tuning procedure, tuning result viewing, and performance verification.
Tuning Precautions
- Ensure that the training script executes successfully on the Ascend AI Processor and that its function and accuracy meet expectations.
- You are advised not to bind the training process to specific CPUs; use the default CPU scheduling policy. Otherwise, the tuning effect may be affected.
- To improve tuning efficiency, keep the number of training steps as small as possible. Generally, one step completes a full graph execution. Ensure that all operators in the graph are traversed during tuning.
- Currently, only static operators are supported. Dynamic operators are not supported.
- AOE does not allow different users to use the same device for tuning at the same time.
- Before tuning, disable the profiling function and the static memory allocation mode (that is, set static_memory_policy to 0, for example, custom_op.parameter_map["static_memory_policy"].i = 0) to avoid affecting the tuning result. For details about how to disable the profiling function, see the Performance Tuning Tool User Guide.
- Tuning is not supported in multi-device scenarios.
- To perform tuning in the single-device scenario, ensure that the following conditions are met:
- Available disk space in the home directory of the user who performs tuning: ≥ 20 GB
- Available memory ≥ Memory required for model training x TUNING_PARALLEL_NUM. For details about TUNING_PARALLEL_NUM, see Configuration File.
- Recommended number of host CPUs during operator tuning: ≥ Number of processes in the training script x (TE_PARALLEL_COMPILER + TUNING_PARALLEL_NUM + min(Number of CPU cores/2, 8) + 50). For details about TE_PARALLEL_COMPILER and TUNING_PARALLEL_NUM, see Table 1 and Configuration File.
- Recommended number of host CPUs during subgraph tuning: ≥ Number of processes in the training script x (2 x TUNING_PARALLEL_NUM + TE_PARALLEL_COMPILER). For details about TE_PARALLEL_COMPILER and TUNING_PARALLEL_NUM, see Table 1 and Configuration File.
- Number of device cores ≥ Maximum number of cores used by all operators in the model
- Device memory: depends on the model and on the model memory overcommitment configuration.
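As a rough illustration, the host CPU recommendations above can be computed directly from the formulas in this list. The sketch below is only an aid for sizing; the process count, parallelism values, and core count in the example are hypothetical, not measured values.

```python
# Illustrative calculators for the host CPU recommendations above.
# All input values in the example call are hypothetical.

def op_tuning_cpus(procs, te_parallel_compiler, tuning_parallel_num, cpu_cores):
    """Recommended host CPUs for operator tuning, per the formula in this section."""
    return procs * (te_parallel_compiler + tuning_parallel_num
                    + min(cpu_cores // 2, 8) + 50)

def subgraph_tuning_cpus(procs, te_parallel_compiler, tuning_parallel_num):
    """Recommended host CPUs for subgraph tuning, per the formula in this section."""
    return procs * (2 * tuning_parallel_num + te_parallel_compiler)

# Example: 1 training process, TE_PARALLEL_COMPILER=8, TUNING_PARALLEL_NUM=8, 32 CPU cores.
print(op_tuning_cpus(1, 8, 8, 32))    # 1 * (8 + 8 + min(16, 8) + 50) = 74
print(subgraph_tuning_cpus(1, 8, 8))  # 1 * (2 * 8 + 8) = 24
```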
Environment Variable Configuration
- Basic environment variables of the CANN software
The CANN portfolio provides a process-level script that sets the required environment variables automatically. The following commands use the default installation paths for the root and non-root users as examples. Replace them with the actual installation paths.
# Install Toolkit as the root user.
. /usr/local/Ascend/ascend-toolkit/set_env.sh
# Install Toolkit as a non-root user.
. ${HOME}/Ascend/ascend-toolkit/set_env.sh
- AOE depends on Python. Take Python 3.7.5 as an example. Run the following commands as the running user to configure the environment variables related to Python 3.7.5:
# Set the Python 3.7.5 library path.
export LD_LIBRARY_PATH=/usr/local/python3.7.5/lib:$LD_LIBRARY_PATH
# If multiple Python 3 versions exist in the user environment, use Python 3.7.5.
export PATH=/usr/local/python3.7.5/bin:$PATH
Replace the Python 3.7.5 installation path with the actual path as required. You can also append the preceding commands to the ~/.bashrc file and run source ~/.bashrc to make the modification take effect immediately.
- Configure the tuning mode.
# 1: subgraph tuning; 2: operator tuning
export AOE_MODE=2
- Before tuning, you can configure other optional environment variables by referring to the following example. For details, see Table 1.
export ASCEND_DEVICE_ID=0
export TUNE_BANK_PATH=/home/HwHiAiUser/custom_tune_bank
export TE_PARALLEL_COMPILER=8
export REPEAT_TUNE=False
- You can write the environment variable configuration commands to a custom script for future use.
- In addition to using environment variables, you can specify the tuning mode by modifying the corresponding parameters in the training script. For details, see Configuring the Tuning Mode by Modifying the Training Script (TensorFlow 1.15). Given the operation complexity, you are advised to specify the tuning mode by setting the environment variable.
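The variables above are normally exported in the launch shell before training starts. If you start training from a Python wrapper script instead, they can equivalently be set through os.environ before the training process is created. A minimal sketch; the values are the same examples used above, not recommendations:

```python
import os

# Equivalent of the export commands above; set these before training starts.
os.environ["AOE_MODE"] = "2"                  # 1: subgraph tuning; 2: operator tuning
os.environ["ASCEND_DEVICE_ID"] = "0"
os.environ["TUNE_BANK_PATH"] = "/home/HwHiAiUser/custom_tune_bank"
os.environ["TE_PARALLEL_COMPILER"] = "8"
os.environ["REPEAT_TUNE"] = "False"

print(os.environ["AOE_MODE"])  # 2
```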
Tuning Procedure
Run the training script to automatically tune subgraphs or operators based on the specified tuning mode.
- To specify the tuning path, you can modify work_path or aoe_config.work_path in the training script before training.
- To tune a specified operator, you can modify aoe_config_file or aoe_config.aoe_config_file in the training script before training.
Tuning Result Viewing
During training, log information similar to the following is displayed, indicating that tuning is being performed:
# Enable TFAdapter tuning.
in tune mode, training graph handled by tools.
# Start the tool for tuning.
Aoe tuning graph.
After the tuning is complete, if the conditions for generating a custom repository are met (see Figure 2 and Figure 3), a custom repository is generated.
- Custom subgraph repository
If TUNE_BANK_PATH and ASCEND_CACHE_PATH are not configured, the custom repository is stored in ${HOME}/Ascend/latest/data/aoe/custom/graph/${soc_version} by default. You can run the env command to check whether they are configured.
- Custom operator repository
If TUNE_BANK_PATH and ASCEND_CACHE_PATH are not configured, the custom repository is stored in ${HOME}/Ascend/latest/data/aoe/custom/op/${soc_version} by default. You can run the env command to check whether they are configured.
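The default repository locations above can be resolved programmatically. The sketch below models only the simple case described here (TUNE_BANK_PATH takes precedence over the default path); how ASCEND_CACHE_PATH interacts with these paths is not modeled, and the soc_version value is a hypothetical example.

```python
import os

def custom_bank_dir(kind, soc_version):
    """Return the expected custom repository directory.

    kind: "graph" for the subgraph repository, "op" for the operator repository.
    Assumes the default layout described above when TUNE_BANK_PATH is unset;
    does not account for ASCEND_CACHE_PATH.
    """
    bank = os.environ.get("TUNE_BANK_PATH")
    if bank:
        return bank
    return os.path.join(os.path.expanduser("~"),
                        "Ascend", "latest", "data", "aoe", "custom",
                        kind, soc_version)

# "Ascend910" is a placeholder soc_version for illustration.
print(custom_bank_dir("op", "Ascend910"))
```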
Performance Verification
After subgraph tuning is complete, restore the code and use the tuned custom repository to perform training again to check whether the performance is improved. For details about how to use the custom repository, see Usage of Tuned Custom Repositories.
After operator tuning is complete, restore the code and refresh the operator compilation cache. That is, set op_compiler_cache_mode to force (see the TensorFlow 1.15 Model Porting Guide) and perform training again using the tuned custom repository to check whether the performance is improved. For details about how to use the custom repository, see Usage of Tuned Custom Repositories.
Operator Tuning Result File
The operator tuning result file is stored in the following paths, in descending order of priority: ASCEND_WORK_PATH > default path (the tuning working directory). That is, if ASCEND_WORK_PATH is not configured, the file is stored in the default path (the tuning working directory). You can run the env command to check whether ASCEND_WORK_PATH is configured. For details about ASCEND_WORK_PATH, see Environment Variables.
During tuning, the result file generated in real time is named aoe_result_opat_${timestamp}_${pidxxx}.json, which records the information about the tuned operators. ${timestamp} is in the format of YYYYMMDD_HHMMSSMS. The variable ${pidxxx} indicates the process ID.
The content format is as follows. Multiple tuning tasks can be included. For details about the fields, see Table 2. tid indicates the thread ID.
{
"report_${timestamp}_${tid}": [
{
"basic": {
"tuning_name": "Tuning task name",
"tuning_time(s)": 44
}
},
{
"OPAT": {
"opat_tuning_result": "tuning successful",
"repo_modified_operators": [
{
"op_name": "bert/encoder/layer_10/attention/self/Softmax_OPAT_0",
"op_type": "SoftmaxV2",
"tune_performance": {
"Format": {
"performance_after_tune(us)": 26.876,
"performance_before_tune(us)": 58.781,
"performance_improvement": "118.71%",
"update_mode": "add"
}
}
},
{
"op_name": "bert/encoder/layer_8/attention/output/dense/MatMulbert/encoder/layer_8/attention/output/add",
"op_type": "MatMulV2",
"tune_performance": {
"Schedule": {
"performance_after_tune(us)": 15.71,
"performance_before_tune(us)": 16.71,
"performance_improvement": "6.37%",
"update_mode": "add"
}
}
}
],
"repo_summary": {
"repo_add_num": 2,
"repo_hit_num": 10,
"repo_reserved_num": 10,
"repo_unsatisfied_num": 1,
"repo_update_num": 0,
"total_num": 13
}
}
}
],
"report_${timestamp}_${tid}": [
......
If the tuning fails (tuning failed is displayed in opat_tuning_result), the op_name list of the operators that fail to be tuned is also displayed.
"tuning_failed_operators": [
"res4a_branch1"
]
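As a quick illustration (not part of the official tooling), the result file is plain JSON and can be inspected with standard tooling. The sketch below walks a fragment mirroring the sample above; note that the reported performance_improvement values are consistent with (before - after) / after.

```python
import json

# Minimal fragment mirroring one report entry of the result file shown above.
report = json.loads("""
{
  "OPAT": {
    "opat_tuning_result": "tuning successful",
    "repo_modified_operators": [
      {
        "op_name": "bert/encoder/layer_10/attention/self/Softmax_OPAT_0",
        "op_type": "SoftmaxV2",
        "tune_performance": {
          "Format": {
            "performance_after_tune(us)": 26.876,
            "performance_before_tune(us)": 58.781,
            "performance_improvement": "118.71%",
            "update_mode": "add"
          }
        }
      }
    ]
  }
}
""")

for op in report["OPAT"]["repo_modified_operators"]:
    for mode, perf in op["tune_performance"].items():
        before = perf["performance_before_tune(us)"]
        after = perf["performance_after_tune(us)"]
        # The reported percentage matches (before - after) / after.
        gain = (before - after) / after * 100
        print(f'{op["op_name"]} [{mode}]: {gain:.2f}% '
              f'(reported {perf["performance_improvement"]})')
```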
Table 2 Fields in the operator tuning result file

| Field Name | Description |
|---|---|
| basic | Basic information about the tuning task. |
| basic > tuning_name | Tuning task name. |
| basic > tuning_time(s) | Tuning duration, in seconds. This field is not recorded in tuning interruption scenarios (such as coredump and OOM). |
| OPAT | Operator tuning information. NOTE: If no operator is available to be tuned, this segment does not exist. |
| OPAT > opat_tuning_result | Tuning result: "tuning successful" when tuning succeeds, "tuning failed" when tuning fails, or "tuning incomplete" when tuning is incomplete or exits abnormally. |
| OPAT > repo_modified_operators | Details about operators whose tiling policies are added or updated after tuning. NOTE: The fields op_name through update_mode are recorded for each such operator. |
| repo_modified_operators > op_name | Operator name. |
| repo_modified_operators > op_type | Operator type. There can be one or more types. If there are multiple types, they are enclosed in []. |
| repo_modified_operators > tune_performance | Detailed information about the operator performance improvement, keyed by the operator tuning mode: Format, Schedule, or Impl. |
| tune_performance > performance_after_tune(us) | Operator execution time after tuning, in μs. |
| tune_performance > performance_before_tune(us) | Operator execution time before tuning, in μs. |
| tune_performance > performance_improvement | Percentage of reduced operator execution time after tuning. |
| tune_performance > update_mode | Update mode of the operator tiling policies. |
| OPAT > repo_summary | Information about operators in each state during tuning. |
| repo_summary > repo_add_num | Number of tiling policies that are not in the repository before tuning and are added to the repository after tuning. |
| repo_summary > repo_hit_num | Number of tiling policies that are already in the repository during tuning. |
| repo_summary > repo_reserved_num | Number of tiling policies that are in the repository before tuning and remain unchanged after tuning. |
| repo_summary > repo_unsatisfied_num | Number of tiling policies that are not in the repository before tuning and are not written into the repository after tuning. |
| repo_summary > repo_update_num | Number of tiling policies that are in the repository before tuning and are updated after tuning. |
| repo_summary > total_num | Total number of tiling policies tuned in the tuning task. |
| OPAT > tuning_failed_operators | op_name list of the operators that fail to be tuned. NOTE: This field is optional. It is recorded only when opat_tuning_result is "tuning failed". |
Subgraph Tuning Result File
The subgraph tuning result file is stored in the following paths, in descending order of priority: ASCEND_WORK_PATH > default path (the tuning working directory). That is, if ASCEND_WORK_PATH is not configured, the file is stored in the default path (the tuning working directory). You can run the env command to check whether ASCEND_WORK_PATH is configured. For details about ASCEND_WORK_PATH, see Environment Variables.
During tuning, the result file generated in real time is named aoe_result_sgat_${timestamp}_${pidxxx}.json, which records the information about the tuned subgraphs. ${timestamp} is in the format of YYYYMMDD_HHMMSSMS. The variable ${pidxxx} indicates the process ID.
The content format is as follows. Multiple tuning tasks can be included. For details about the fields, see Table 3. tid indicates the thread ID.
{
"report_${timestamp}_${tid}": [
{
"basic": {
"tuning_name": "Tuning task name",
"tuning_time(s)": 19
}
},
{
"SGAT": {
"model_baseline_performance(ms)": 5.600486,
"model_performance_improvement": "55.11%",
"model_result_performance(ms)": 3.610442,
"repo_modified_subgraphs": {
"add_repo_subgraphs": [
{
"performance_after_tune(ms)": 3.573203,
"performance_before_tune(ms)": 5.58434,
"performance_improvement": "56.28%",
"repo_key": "1024942313106047484"
}
],
"update_repo_subgraphs": [
{
"performance_after_tune(ms)": 2.573203,
"performance_before_tune(ms)": 4.58434,
"performance_improvement": "78.15%",
"repo_key": "1024942313106057586"
}
]
},
"repo_summary": {
"repo_add_num": 1,
"repo_hit_num": 1,
"repo_reserved_num": 0,
"repo_unsatisfied_num": 120,
"repo_update_num": 1,
"total_num": 121
}
}
}
],
"report_${timestamp}_${tid}": [
......
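As with the operator result file, the subgraph result is plain JSON and can be checked with standard tooling. The improvement percentages are again consistent with (before - after) / after, and from the sample values they appear to be truncated (not rounded) to two decimal places. A small sketch using the add_repo_subgraphs entry above:

```python
import math

# Sample values from the add_repo_subgraphs entry above.
before = 5.58434   # performance_before_tune(ms)
after = 3.573203   # performance_after_tune(ms)

gain = (before - after) / after * 100
# The file appears to truncate (not round) the percentage to two decimals.
print(f"{math.floor(gain * 100) / 100}%")  # 56.28%
```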
Table 3 Fields in the subgraph tuning result file

| Field Name | Description |
|---|---|
| basic | Basic information about the tuning task. |
| basic > tuning_name | Tuning task name. |
| basic > tuning_time(s) | Tuning duration, in seconds. |
| SGAT | Subgraph tuning information. NOTE: If subgraph tuning fails, this segment does not exist. |
| SGAT > model_baseline_performance(ms) | Model execution time before tuning, in ms. |
| SGAT > model_performance_improvement | Percentage of reduced model execution time after tuning. |
| SGAT > model_result_performance(ms) | Model execution time after tuning, in ms. |
| SGAT > repo_modified_subgraphs | Details about subgraphs whose tiling policies are added or updated after tuning. |
| repo_modified_subgraphs > add_repo_subgraphs | Subgraphs whose tiling policies are added after tuning. There can be zero or more subgraphs. |
| add_repo_subgraphs > performance_before_tune(ms) | Subgraph execution time before tuning, in ms. |
| add_repo_subgraphs > performance_after_tune(ms) | Subgraph execution time after tuning, in ms. |
| add_repo_subgraphs > performance_improvement | Percentage of reduced subgraph execution time after tuning. |
| add_repo_subgraphs > repo_key | Subgraph key value after tuning, which is used to query the tuning repository. |
| repo_modified_subgraphs > update_repo_subgraphs | Subgraphs whose tiling policies are updated after tuning. There can be zero or more subgraphs. |
| update_repo_subgraphs > performance_before_tune(ms) | Subgraph execution time before tuning, in ms. |
| update_repo_subgraphs > performance_after_tune(ms) | Subgraph execution time after tuning, in ms. |
| update_repo_subgraphs > performance_improvement | Percentage of reduced subgraph execution time after tuning. |
| update_repo_subgraphs > repo_key | Subgraph key value after tuning, which is used to query the tuning repository. |
| SGAT > repo_summary | Number of subgraphs in each state during tuning. |
| repo_summary > repo_add_num | Number of subgraphs whose tiling policies are not in the repository before tuning and are added to the repository after tuning. |
| repo_summary > repo_hit_num | Number of subgraphs whose tiling policies are already in the repository during tuning. |
| repo_summary > repo_reserved_num | Number of subgraphs whose tiling policies are in the repository before tuning and remain unchanged after tuning. |
| repo_summary > repo_unsatisfied_num | Number of subgraphs whose tiling policies are not in the repository before tuning and are not written into the repository after tuning. |
| repo_summary > repo_update_num | Number of subgraphs whose tiling policies are in the repository before tuning and are updated after tuning. |
| repo_summary > total_num | Total number of subgraphs tuned in the tuning task. |