Configuring the Tuning Mode by Modifying the Training Script (TensorFlow 1.15)
Before You Start
In addition to configuring the tuning mode through the AOE_MODE environment variable, you can configure it by modifying the training script. If both are configured, the configuration in the training script takes precedence.
Procedure
- (Optional) Configure the following environment variable as required.
export ASCEND_DEVICE_ID=0    # Logical ID of the Ascend AI Processor. The value range is [0, N-1], where N is the number of devices on the physical machine, VM, or container. The default value is 0.
- Modify the training script and set the following parameters to enable AOE tuning.
Table 1 Description of related parameters
aoe_mode
(Mandatory) Tuning mode of AOE.
- 1: subgraph tuning.
- 2: operator tuning.
- 4: gradient splitting tuning.
In the data parallel scenario, AllReduce is used to aggregate gradients, so the gradient splitting mode closely affects distributed training performance. If the splitting is improper, the communication tail time after backward propagation completes is long, degrading cluster training performance and linearity. Manual tuning through the gradient splitting APIs of collective communication (set_split_strategy_by_idx or set_split_strategy_by_size) is complicated. The AOE tool collects profiling data in the real-device environment and automatically searches for the optimal splitting policy. You only need to apply the obtained policy to your network by passing it to the set_split_strategy_by_idx call.
NOTE: The tuning mode can be configured either by modifying the training script or by setting the AOE_MODE environment variable. If both are configured, the configuration in the training script takes precedence.
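As an illustrative sketch of the last step, applying the splitting policy obtained from gradient splitting tuning might look as follows. The hccl import is only available in the Ascend training environment, so this sketch falls back to a stub off-device, and the index values are hypothetical, standing in for the policy reported in the AOE tuning result:

```python
# Hedged sketch: apply a gradient splitting policy obtained from AOE
# gradient splitting tuning (aoe_mode = 4). The hccl API exists only in
# the Ascend training environment; off-device a stub keeps this runnable.
try:
    from hccl.split.api import set_split_strategy_by_idx
except ImportError:
    def set_split_strategy_by_idx(idx_list):
        # Stub for illustration: returns the policy instead of applying it.
        return idx_list

# Hypothetical policy: gradient indices at which AllReduce segments end,
# as would be reported in the AOE tuning result.
optimal_policy = [20, 100, 159]
set_split_strategy_by_idx(optimal_policy)
```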
work_path
(Optional) Working directory of AOE, which stores the configuration and result files. By default, the files are generated in the current directory.
The value is a character string. Create the specified path in advance in the environment (either container or host) where training is performed. The running user configured during installation must have the read and write permissions on this path. The path can be an absolute path or a path relative to the path where the training script is executed.
- An absolute path starts with a slash (/), for example, /home/HwHiAiUser/output.
- A relative path starts with a directory name, for example, output.
Example:
custom_op.parameter_map["work_path"].s = tf.compat.as_bytes("/home/HwHiAiUser/output")
aoe_config_file
(Optional) Path of the AOE configuration file. The following functions can be implemented through the configuration file:
- Specify the operator name or operator type so that only that operator is tuned. Only operators listed in Operator List can be specified. This option is typically used after profiling a network, to tune a particular operator with low performance. Set it to the OP Name/OP Type of a node in the network model adapted to the Ascend AI Processor after GE/FE processing; the OP Name/OP Type can be obtained from the tuned profile data. For details, see the Performance Tuning Tool User Guide.
- Set the tuning mode, including the high-performance mode and normal mode.
- Set the tuning feature, including in-depth operator tuning and operator format tuning.
Example:
custom_op.parameter_map["aoe_config_file"].s = tf.compat.as_bytes("/home/HwHiAiUser/cfg/tuning_config.cfg")
The file name extension is not limited to .cfg, but the file content must be valid JSON. Only one configuration file is supported.
The /home/HwHiAiUser/cfg/tuning_config.cfg file contains the operators to be tuned, the tuning mode, and the tuning features. Change the path and file name as needed. The content format of the tuning_config.cfg file is as follows:
{
    "tune_ops_name": ["bert/embeddings/add", "bert/embeddings/add_1", "loss/MatMul"],
    "tune_ops_type": ["Add", "Mul"],
    "tune_optimization_level": "O1",
    "feature": ["deeper_opat", "op_format"]
}
- tune_ops_name: name of the specified operator, with support for whole word match. You can specify one or more operator names. If multiple operator names are specified, separate them with commas (,).
- tune_ops_type: specified operator type, with support for whole word match. You can specify one or more operator types. If multiple operator types are specified, separate them with commas (,). If a fused operator contains the specified operator type, the fused operator will also be tuned.
- tune_optimization_level: tuning mode. The value O1 indicates the high-performance tuning mode, and the value O2 indicates the normal mode. For more information, see --tune_optimization_level.
- feature: tuning features. Multiple features are separated by commas (,). Currently, the following features are supported:
- deeper_opat: in-depth operator tuning. If the value contains deeper_opat, in-depth operator tuning is enabled; in this case, aoe_mode must be set to 2. For more information, see --Fdeeper_opat.
- op_format: operator format tuning. If the value contains op_format, operator format tuning is enabled; in this case, aoe_mode must also be set to 2. For details, see --Fop_format.
NOTE:
- The content of the preceding configuration file must be enclosed in braces ({}), and the tune_ops_type, tune_ops_name, and feature values must be enclosed in square brackets ([]).
- tune_ops_type and tune_ops_name can be set individually or together. If both are set, the union of the two is tuned.
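The configuration file can also be generated programmatically. The following is a minimal sketch using Python's standard json module; the operator names, types, and file name mirror the illustrative values from the example above:

```python
import json

# Illustrative sketch: write an AOE tuning configuration file.
# Operator names/types and the output file name are example values only.
tuning_config = {
    "tune_ops_name": ["bert/embeddings/add", "loss/MatMul"],
    "tune_ops_type": ["Add", "Mul"],
    "tune_optimization_level": "O1",          # O1: high-performance mode; O2: normal mode
    "feature": ["deeper_opat", "op_format"],  # both features require aoe_mode = 2
}

with open("tuning_config.cfg", "w") as f:
    json.dump(tuning_config, f, indent=4)

# Verify the file content parses back as valid JSON.
with open("tuning_config.cfg") as f:
    assert json.load(f) == tuning_config
```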
[For the training script for manual porting] Modify the script as follows:
- If the initialize_system API is used in the ported training script, enable AOE tuning as follows:
npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()
config = tf.ConfigProto()
...
custom_op.parameter_map["aoe_mode"].s = tf.compat.as_bytes("2")
...
with tf.Session(config=config) as sess:
    sess.run(npu_init)
    # Call the HCCL API...
    # Perform training...
    sess.run(npu_shutdown)
- If the initialize_system API is not used in the ported training script, enable AOE tuning as follows.
For the training script in Estimator mode, configure the aoe_mode and work_path parameters in NPURunConfig to enable AOE tuning.
import tensorflow as tf
from npu_bridge.npu_init import *

session_config = tf.ConfigProto()
config = NPURunConfig(session_config=session_config, aoe_mode="2")
For the training script in sess.run mode, configure aoe_mode and work_path in the session configuration to enable AOE tuning.
import tensorflow as tf
from npu_bridge.npu_init import *

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["aoe_mode"].s = tf.compat.as_bytes("2")
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
[For the training script for tool-based porting] Modify the script as follows:
- If the init_resource API is used in the ported training script, enable AOE tuning as follows:
if __name__ == '__main__':
    session_config = tf.ConfigProto()
    custom_op = session_config.graph_options.rewrite_options.custom_optimizers.add()
    custom_op.name = "NpuOptimizer"
    custom_op.parameter_map["aoe_mode"].s = tf.compat.as_bytes("2")
    (npu_sess, npu_shutdown) = init_resource(config=session_config)
    tf.app.run()
    shutdown_resource(npu_sess, npu_shutdown)
    close_session(npu_sess)
- If the init_resource API is not used in the ported training script, enable AOE tuning as follows.
For the training script in Estimator mode, configure tuning parameters in npu_run_config_init in the ported script.
session_config = tf.ConfigProto(allow_soft_placement=True)
custom_op = session_config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = 'NpuOptimizer'
custom_op.parameter_map["aoe_mode"].s = tf.compat.as_bytes("2")
run_config = tf.estimator.RunConfig(
    train_distribute=distribution_strategy,
    session_config=session_config,
    save_checkpoints_secs=60*60*24)
classifier = tf.estimator.Estimator(
    model_fn=model_function, model_dir=flags_obj.model_dir,
    config=npu_run_config_init(run_config=run_config))
If the run configuration (for example, session_config) is not passed to RunConfig in the script, pass it manually:
session_config = tf.ConfigProto()
custom_op = session_config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = 'NpuOptimizer'
custom_op.parameter_map["aoe_mode"].s = tf.compat.as_bytes("2")
run_config = tf.estimator.RunConfig(
    train_distribute=distribution_strategy,
    session_config=session_config,
    save_checkpoints_secs=60*60*24)
classifier = tf.estimator.Estimator(
    model_fn=model_function, model_dir=flags_obj.model_dir,
    config=npu_run_config_init(run_config=run_config))
For the training script in sess.run mode, configure tuning parameters in npu_config_proto.
- Find npu_config_proto in your script.
with tf.Session(config=npu_config_proto()) as sess:
    sess.run(tf.global_variables_initializer())
    interaction_table.init.run()
- Configure tuning parameters.
config_proto = tf.ConfigProto()
custom_op = config_proto.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = 'NpuOptimizer'
custom_op.parameter_map["aoe_mode"].s = tf.compat.as_bytes("2")
config = npu_config_proto(config_proto=config_proto)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    interaction_table.init.run()
- Start training.