Subgraph/Operator Tuning
This section describes how to tune subgraphs and operators in TensorFlow-based training scenarios, covering tuning precautions, environment variable configuration, the tuning procedure, tuning result viewing, and performance verification.
Tuning Precautions
- Ensure that the training script executes successfully on the Ascend AI Processor and that its function and accuracy meet expectations.
- You are advised not to bind the training process to specific CPUs; use the default CPU scheduling policy. Otherwise, the tuning effect may be affected.
- To improve tuning efficiency, keep the number of training steps as small as possible. Generally, one step completes a full graph execution. Ensure that all operators in the graph are traversed during tuning.
- Currently, only static operators are supported. Dynamic operators are not supported.
- AOE does not allow different users to use the same device for tuning at the same time.
- Before tuning, disable the profiling function and the static memory allocation mode (that is, set static_memory_policy to 0, for example, custom_op.parameter_map["static_memory_policy"].i = 0) to avoid affecting the tuning result. For details about how to disable the profiling function, see the Performance Tuning Tool User Guide.
- Tuning is not supported in multi-device scenarios.
- To perform tuning in the single-device scenario, ensure that the following conditions are met:
- Available disk space in the home directory of the user who performs tuning: ≥ 20 GB
- Available memory ≥ Memory required for model training x TUNING_PARALLEL_NUM. For details about TUNING_PARALLEL_NUM, see Configuration File.
- Recommended number of host CPUs during operator tuning: ≥ Number of processes in the training script x (TE_PARALLEL_COMPILER + TUNING_PARALLEL_NUM + min(Number of CPU cores/2, 8) + 50). For details about TE_PARALLEL_COMPILER and TUNING_PARALLEL_NUM, see Table 1 and Configuration File.
- Recommended number of host CPUs during subgraph tuning: ≥ Number of processes in the training script x (2 x TUNING_PARALLEL_NUM + TE_PARALLEL_COMPILER). For details about TE_PARALLEL_COMPILER and TUNING_PARALLEL_NUM, see Table 1 and Configuration File.
- Number of device cores ≥ Maximum number of cores used by all operators in the model
- Device memory: depends on the model and on the model memory overcommitment configuration.
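As a rough illustration, the host CPU recommendations above can be computed directly from the formulas in this list. The sketch below is only an aid for sizing; the process count, parallelism values, and core count in the example are hypothetical, not measured values.

```python
# Illustrative calculators for the host CPU recommendations above.
# All input values in the example call are hypothetical.

def op_tuning_cpus(procs, te_parallel_compiler, tuning_parallel_num, cpu_cores):
    """Recommended host CPUs for operator tuning, per the formula in this section."""
    return procs * (te_parallel_compiler + tuning_parallel_num
                    + min(cpu_cores // 2, 8) + 50)

def subgraph_tuning_cpus(procs, te_parallel_compiler, tuning_parallel_num):
    """Recommended host CPUs for subgraph tuning, per the formula in this section."""
    return procs * (2 * tuning_parallel_num + te_parallel_compiler)

# Example: 1 training process, TE_PARALLEL_COMPILER=8, TUNING_PARALLEL_NUM=8, 32 CPU cores.
print(op_tuning_cpus(1, 8, 8, 32))    # 1 * (8 + 8 + min(16, 8) + 50) = 74
print(subgraph_tuning_cpus(1, 8, 8))  # 1 * (2 * 8 + 8) = 24
```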
Environment Variable Configuration
- Basic environment variables of the CANN software
The CANN portfolio provides a process-level script that sets the required environment variables automatically. The following commands use the default installation paths for the root and non-root users as examples. Replace them with the actual installation paths.
# Install Toolkit as the root user.
. /usr/local/Ascend/ascend-toolkit/set_env.sh
# Install Toolkit as a non-root user.
. ${HOME}/Ascend/ascend-toolkit/set_env.sh
- AOE depends on Python. Take Python 3.7.5 as an example. Run the following commands as the running user to configure the environment variables related to Python 3.7.5:
# Set the Python 3.7.5 library path.
export LD_LIBRARY_PATH=/usr/local/python3.7.5/lib:$LD_LIBRARY_PATH
# If multiple Python 3 versions exist in the user environment, use Python 3.7.5.
export PATH=/usr/local/python3.7.5/bin:$PATH
Replace the Python 3.7.5 installation path with the actual path as required. You can also append the preceding commands to the ~/.bashrc file and run source ~/.bashrc to make the modification take effect immediately.
- Configure the tuning mode.
# 1: subgraph tuning; 2: operator tuning
export AOE_MODE=2
- Before tuning, you can configure other optional environment variables by referring to the following example. For details, see Table 1.
export ASCEND_DEVICE_ID=0
export TUNE_BANK_PATH=/home/HwHiAiUser/custom_tune_bank
export TE_PARALLEL_COMPILER=8
export REPEAT_TUNE=False
- You can write the environment variable configuration commands to a custom script for future use.
- In addition to using environment variables, you can specify the tuning mode by modifying the corresponding parameters in the training script. For details, see Configuring the Tuning Mode by Modifying the Training Script (TensorFlow 1.15). Given the operation complexity, you are advised to specify the tuning mode by setting the environment variable.
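The variables above are normally exported in the launch shell before training starts. If you start training from a Python wrapper script instead, they can equivalently be set through os.environ before the training process is created. A minimal sketch; the values are the same examples used above, not recommendations:

```python
import os

# Equivalent of the export commands above; set these before training starts.
os.environ["AOE_MODE"] = "2"                  # 1: subgraph tuning; 2: operator tuning
os.environ["ASCEND_DEVICE_ID"] = "0"
os.environ["TUNE_BANK_PATH"] = "/home/HwHiAiUser/custom_tune_bank"
os.environ["TE_PARALLEL_COMPILER"] = "8"
os.environ["REPEAT_TUNE"] = "False"

print(os.environ["AOE_MODE"])  # 2
```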
Tuning Procedure
Run the training script to automatically tune subgraphs or operators based on the specified tuning mode.
- To specify the tuning path, you can modify work_path or aoe_config.work_path in the training script before training.
- To tune a specified operator, you can modify aoe_config_file or aoe_config.aoe_config_file in the training script before training.
Tuning Result Viewing
During training, log information similar to the following is displayed, indicating that tuning is being performed:
# Enable TFAdapter tuning.
in tune mode, training graph handled by tools.
# Start the tool for tuning.
Aoe tuning graph.
After the tuning is complete, if the conditions for generating a custom repository are met (see Figure 2 and Figure 3), a custom repository is generated.
- Custom subgraph repository
If TUNE_BANK_PATH and ASCEND_CACHE_PATH are not configured, the custom repository is stored in ${HOME}/Ascend/latest/data/aoe/custom/graph/${soc_version} by default. You can run the env command to check whether they are configured.
- Custom operator repository
If TUNE_BANK_PATH and ASCEND_CACHE_PATH are not configured, the custom repository is stored in ${HOME}/Ascend/latest/data/aoe/custom/op/${soc_version} by default. You can run the env command to check whether they are configured.
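The default repository locations above can be resolved programmatically. The sketch below models only the simple case described here (TUNE_BANK_PATH takes precedence over the default path); how ASCEND_CACHE_PATH interacts with these paths is not modeled, and the soc_version value is a hypothetical example.

```python
import os

def custom_bank_dir(kind, soc_version):
    """Return the expected custom repository directory.

    kind: "graph" for the subgraph repository, "op" for the operator repository.
    Assumes the default layout described above when TUNE_BANK_PATH is unset;
    does not account for ASCEND_CACHE_PATH.
    """
    bank = os.environ.get("TUNE_BANK_PATH")
    if bank:
        return bank
    return os.path.join(os.path.expanduser("~"),
                        "Ascend", "latest", "data", "aoe", "custom",
                        kind, soc_version)

# "Ascend910" is a placeholder soc_version for illustration.
print(custom_bank_dir("op", "Ascend910"))
```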
Performance Verification
After subgraph tuning is complete, restore the code and use the tuned custom repository to perform training again to check whether the performance is improved. For details about how to use the custom repository, see Usage of Tuned Custom Repositories.
After operator tuning is complete, restore the code and refresh the operator compilation cache. That is, set op_compiler_cache_mode to force (see the TensorFlow 1.15 Model Porting Guide) and perform training again using the tuned custom repository to check whether the performance is improved. For details about how to use the custom repository, see Usage of Tuned Custom Repositories.
Operator Tuning Result File
The operator tuning result file is stored in the following paths, in descending order of priority: ASCEND_WORK_PATH > default path (the tuning working directory). That is, if ASCEND_WORK_PATH is not configured, the file is stored in the default path (the tuning working directory). You can run the env command to check whether ASCEND_WORK_PATH is configured. For details about ASCEND_WORK_PATH, see Environment Variables.
During tuning, the result file generated in real time is named aoe_result_opat_${timestamp}_${pidxxx}.json, which records the information about the tuned operators. ${timestamp} is in the format of YYYYMMDD_HHMMSSMS. The variable ${pidxxx} indicates the process ID.
The content format is as follows. Multiple tuning tasks can be included. For details about the fields, see Table 2. tid indicates the thread ID.
{
"report_${timestamp}_${tid}": [
{
"basic": {
"tuning_name": "Tuning task name",
"tuning_time(s)": 44
}
},
{
"OPAT": {
"opat_tuning_result": "tuning successful",
"repo_modified_operators": [
{
"op_name": "bert/encoder/layer_10/attention/self/Softmax_OPAT_0",
"op_type": "SoftmaxV2",
"tune_performance": {
"Format": {
"performance_after_tune(us)": 26.876,
"performance_before_tune(us)": 58.781,
"performance_improvement": "118.71%",
"update_mode": "add"
}
}
},
{
"op_name": "bert/encoder/layer_8/attention/output/dense/MatMulbert/encoder/layer_8/attention/output/add",
"op_type": "MatMulV2",
"tune_performance": {
"Schedule": {
"performance_after_tune(us)": 15.71,
"performance_before_tune(us)": 16.71,
"performance_improvement": "6.37%",
"update_mode": "add"
}
}
}
],
"repo_summary": {
"repo_add_num": 2,
"repo_hit_num": 10,
"repo_reserved_num": 10,
"repo_unsatisfied_num": 1,
"repo_update_num": 0,
"total_num": 13
}
}
}
],
"report_${timestamp}_${tid}": [
......
If the tuning fails (tuning failed is displayed in opat_tuning_result), the op_name list of the operators that fail to be tuned is also displayed.
"tuning_failed_operators": [
"res4a_branch1"
]
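As a quick illustration (not part of the official tooling), the result file is plain JSON and can be inspected with standard tooling. The sketch below walks a fragment mirroring the sample above; note that the reported performance_improvement values are consistent with (before - after) / after.

```python
import json

# Minimal fragment mirroring one report entry of the result file shown above.
report = json.loads("""
{
  "OPAT": {
    "opat_tuning_result": "tuning successful",
    "repo_modified_operators": [
      {
        "op_name": "bert/encoder/layer_10/attention/self/Softmax_OPAT_0",
        "op_type": "SoftmaxV2",
        "tune_performance": {
          "Format": {
            "performance_after_tune(us)": 26.876,
            "performance_before_tune(us)": 58.781,
            "performance_improvement": "118.71%",
            "update_mode": "add"
          }
        }
      }
    ]
  }
}
""")

for op in report["OPAT"]["repo_modified_operators"]:
    for mode, perf in op["tune_performance"].items():
        before = perf["performance_before_tune(us)"]
        after = perf["performance_after_tune(us)"]
        # The reported percentage matches (before - after) / after.
        gain = (before - after) / after * 100
        print(f'{op["op_name"]} [{mode}]: {gain:.2f}% '
              f'(reported {perf["performance_improvement"]})')
```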
Table 2 Fields in the operator tuning result file

| Field Name | Description |
|---|---|
| basic | Basic information about the tuning task. |
| basic > tuning_name | Tuning task name. |
| basic > tuning_time(s) | Tuning duration, in seconds. This field is not recorded in tuning interruption scenarios (such as coredump and OOM). |
| OPAT | Operator tuning information. NOTE: If no operator is available to be tuned, this segment does not exist. |
| OPAT > opat_tuning_result | Tuning result: "tuning successful" when tuning succeeds, "tuning failed" when tuning fails, or "tuning incomplete" when tuning is incomplete or exits abnormally. |
| OPAT > repo_modified_operators | Details about operators whose tiling policies are added or updated after tuning. NOTE: The fields op_name through update_mode are recorded for each such operator. |
| repo_modified_operators > op_name | Operator name. |
| repo_modified_operators > op_type | Operator type. There can be one or more types. If there are multiple types, they are enclosed in []. |
| repo_modified_operators > tune_performance | Detailed information about the operator performance improvement, keyed by the operator tuning mode: Format, Schedule, or Impl. |
| tune_performance > performance_after_tune(us) | Operator execution time after tuning, in μs. |
| tune_performance > performance_before_tune(us) | Operator execution time before tuning, in μs. |
| tune_performance > performance_improvement | Percentage of reduced operator execution time after tuning. |
| tune_performance > update_mode | Update mode of the operator tiling policies. |
| OPAT > repo_summary | Information about operators in each state during tuning. |
| repo_summary > repo_add_num | Number of tiling policies that are not in the repository before tuning and are added to the repository after tuning. |
| repo_summary > repo_hit_num | Number of tiling policies that are already in the repository during tuning. |
| repo_summary > repo_reserved_num | Number of tiling policies that are in the repository before tuning and remain unchanged after tuning. |
| repo_summary > repo_unsatisfied_num | Number of tiling policies that are not in the repository before tuning and are not written into the repository after tuning. |
| repo_summary > repo_update_num | Number of tiling policies that are in the repository before tuning and are updated after tuning. |
| repo_summary > total_num | Total number of tiling policies tuned in the tuning task. |
| OPAT > tuning_failed_operators | op_name list of the operators that fail to be tuned. NOTE: This field is optional. It is recorded only when opat_tuning_result is "tuning failed". |
Subgraph Tuning Result File
The subgraph tuning result file is stored in the following paths, in descending order of priority: ASCEND_WORK_PATH > default path (the tuning working directory). That is, if ASCEND_WORK_PATH is not configured, the file is stored in the default path (the tuning working directory). You can run the env command to check whether ASCEND_WORK_PATH is configured. For details about ASCEND_WORK_PATH, see Environment Variables.
During tuning, the result file generated in real time is named aoe_result_sgat_${timestamp}_${pidxxx}.json, which records the information about the tuned subgraphs. ${timestamp} is in the format of YYYYMMDD_HHMMSSMS. The variable ${pidxxx} indicates the process ID.
The content format is as follows. Multiple tuning tasks can be included. For details about the fields, see Table 3. tid indicates the thread ID.
{
"report_${timestamp}_${tid}": [
{
"basic": {
"tuning_name": "Tuning task name",
"tuning_time(s)": 19
}
},
{
"SGAT": {
"model_baseline_performance(ms)": 5.600486,
"model_performance_improvement": "55.11%",
"model_result_performance(ms)": 3.610442,
"repo_modified_subgraphs": {
"add_repo_subgraphs": [
{
"performance_after_tune(ms)": 3.573203,
"performance_before_tune(ms)": 5.58434,
"performance_improvement": "56.28%",
"repo_key": "1024942313106047484"
}
],
"update_repo_subgraphs": [
{
"performance_after_tune(ms)": 2.573203,
"performance_before_tune(ms)": 4.58434,
"performance_improvement": "78.15%",
"repo_key": "1024942313106057586"
}
]
},
"repo_summary": {
"repo_add_num": 1,
"repo_hit_num": 1,
"repo_reserved_num": 0,
"repo_unsatisfied_num": 120,
"repo_update_num": 1,
"total_num": 121
}
}
}
],
"report_${timestamp}_${tid}": [
......
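As with the operator result file, the subgraph result is plain JSON and can be checked with standard tooling. The improvement percentages are again consistent with (before - after) / after, and from the sample values they appear to be truncated (not rounded) to two decimal places. A small sketch using the add_repo_subgraphs entry above:

```python
import math

# Sample values from the add_repo_subgraphs entry above.
before = 5.58434   # performance_before_tune(ms)
after = 3.573203   # performance_after_tune(ms)

gain = (before - after) / after * 100
# The file appears to truncate (not round) the percentage to two decimals.
print(f"{math.floor(gain * 100) / 100}%")  # 56.28%
```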
Table 3 Fields in the subgraph tuning result file

| Field Name | Description |
|---|---|
| basic | Basic information about the tuning task. |
| basic > tuning_name | Tuning task name. |
| basic > tuning_time(s) | Tuning duration, in seconds. |
| SGAT | Subgraph tuning information. NOTE: If subgraph tuning fails, this segment does not exist. |
| SGAT > model_baseline_performance(ms) | Model execution time before tuning, in ms. |
| SGAT > model_performance_improvement | Percentage of reduced model execution time after tuning. |
| SGAT > model_result_performance(ms) | Model execution time after tuning, in ms. |
| SGAT > repo_modified_subgraphs | Details about subgraphs whose tiling policies are added or updated after tuning. |
| repo_modified_subgraphs > add_repo_subgraphs | Subgraphs whose tiling policies are added after tuning. There can be zero or more subgraphs. |
| add_repo_subgraphs > performance_before_tune(ms) | Subgraph execution time before tuning, in ms. |
| add_repo_subgraphs > performance_after_tune(ms) | Subgraph execution time after tuning, in ms. |
| add_repo_subgraphs > performance_improvement | Percentage of reduced subgraph execution time after tuning. |
| add_repo_subgraphs > repo_key | Subgraph key value after tuning, which is used to query the tuning repository. |
| repo_modified_subgraphs > update_repo_subgraphs | Subgraphs whose tiling policies are updated after tuning. There can be zero or more subgraphs. |
| update_repo_subgraphs > performance_before_tune(ms) | Subgraph execution time before tuning, in ms. |
| update_repo_subgraphs > performance_after_tune(ms) | Subgraph execution time after tuning, in ms. |
| update_repo_subgraphs > performance_improvement | Percentage of reduced subgraph execution time after tuning. |
| update_repo_subgraphs > repo_key | Subgraph key value after tuning, which is used to query the tuning repository. |
| SGAT > repo_summary | Number of subgraphs in each state during tuning. |
| repo_summary > repo_add_num | Number of subgraphs whose tiling policies are not in the repository before tuning and are added to the repository after tuning. |
| repo_summary > repo_hit_num | Number of subgraphs whose tiling policies are already in the repository during tuning. |
| repo_summary > repo_reserved_num | Number of subgraphs whose tiling policies are in the repository before tuning and remain unchanged after tuning. |
| repo_summary > repo_unsatisfied_num | Number of subgraphs whose tiling policies are not in the repository before tuning and are not written into the repository after tuning. |
| repo_summary > repo_update_num | Number of subgraphs whose tiling policies are in the repository before tuning and are updated after tuning. |
| repo_summary > total_num | Total number of subgraphs tuned in the tuning task. |