Session Configuration Options
Basic Options
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
graph_run_mode |
Graph run mode.
Configuration example: custom_op.parameter_map["graph_run_mode"].i = 1 |
Training/Online inference |
|
session_device_id |
Logical ID of a device. By creating a separate session for each graph and passing a different session_device_id value to each session, you can run different models on multiple devices from a single training script. Example:
import tensorflow as tf
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig
# Session bound to logical device 0
config_0 = tf.ConfigProto()
custom_op = config_0.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["session_device_id"].i = 0
config_0.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config_0.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF
with tf.Session(config=config_0) as sess_0:
    sess_0.run(...)
# Session bound to logical device 1
config_1 = tf.ConfigProto()
custom_op = config_1.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["session_device_id"].i = 1
config_1.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config_1.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF
with tf.Session(config=config_1) as sess_1:
    sess_1.run(...)
# Session bound to logical device 7
config_7 = tf.ConfigProto()
custom_op = config_7.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["session_device_id"].i = 7
config_7.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config_7.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF
with tf.Session(config=config_7) as sess_7:
    sess_7.run(...) |
Training/Online inference |
|
deterministic |
Whether to enable deterministic computing. When it is enabled, an operator produces the same output every time it is executed with the same hardware and input. The values are as follows:
By default, deterministic computing does not need to be enabled, because it slows down operator execution and reduces performance. When it is disabled, results may differ between runs. This is generally caused by asynchronous multi-threaded execution inside operator implementations, which changes the order in which floating-point numbers are accumulated. If a model produces inconsistent results across runs or requires accuracy tuning, you can enable deterministic computing to assist with debugging and tuning. Note that fully reproducible results also require a fixed random seed in the training script, so that the random numbers generated by the program are reproducible as well. Example: custom_op.parameter_map["deterministic"].i = 1 |
Training/Online inference |
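The two effects described for deterministic (accumulation order and random seeds) can be illustrated without any NPU. The following is a minimal sketch in plain Python; it is not Ascend-specific, and the helper name seeded_values is hypothetical.

```python
import random

# Floating-point addition is not associative, so the order in which partial
# sums are accumulated can change the last bits of the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
order_changes_result = left_to_right != right_to_left


def seeded_values(seed, count=3):
    """Return `count` pseudo-random floats from a generator with a fixed seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]


# A fixed seed makes the generated numbers repeatable from run to run.
same_seed_repeats = seeded_values(1234) == seeded_values(1234)
```

This is why enabling deterministic alone may not be enough: the random numbers used by the training script need a fixed seed as well.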
Memory Management
Dynamic Shape
In the scenario of dynamic dimension size profiles, input_shape, dynamic_dims, and dynamic_node_type must be used together.
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
input_shape |
Input shape. Configuration example: custom_op.parameter_map["input_shape"].s = tf.compat.as_bytes("data:1,1,40,-1;label:1,-1;mask:-1,-1")
In the preceding example, the network model has three inputs: data (1, 1, 40, -1), label (1, -1), and mask (-1, -1). Separate the name and shape of each input with a colon (:), and separate the inputs with semicolons (;). -1 indicates a dynamic dimension, whose size profiles are configured by using dynamic_dims. Notes:
|
Online inference |
|
dynamic_dims |
Dynamic dimension size profiles. Separate the profiles with semicolons (;) and the dimension sizes within a profile with commas (,). The sizes in each profile map, in order, to the -1 placeholders in the input_shape argument, so each profile must contain exactly as many sizes as there are -1 placeholders. Set at least two dynamic dimension size profiles. The argument of dynamic_dims must match that of input_shape; otherwise, an error may be reported and the process exits. Example: custom_op.parameter_map["dynamic_dims"].s = tf.compat.as_bytes("20,20,1,1;40,40,2,2;80,60,4,4")
Based on the input_shape information in the preceding example, the supported input shape profiles are as follows:
Notes:
For the following products, the profile range is (1,100]. That is, at least two profiles must be set, and a maximum of 100 profiles are supported.
|
Online inference |
|
dynamic_node_type |
Type of the dynamic input node.
Only one type of dynamic input is allowed: dataset or placeholder.
Example:
custom_op.parameter_map["dynamic_node_type"].i = 0 |
Online inference |
|
compile_hybrid_mode |
Whether to enable the hybrid compilation and execution for dynamic dimension size profiles and dynamic shapes.
Notes:
Configuration example: custom_op.parameter_map["compile_hybrid_mode"].i = 1 |
Online inference |
|
ac_parallel_enable |
Whether to allow AI CPU operators and AI Core operators to run in parallel in a dynamic shape graph.
In a dynamic shape graph, when this option is enabled, the system automatically identifies AI CPU operators that can be concurrently executed with the AI Core operators in the graph. Operators of different engines are distributed to different flows to implement parallel execution among multiple engines, improving resource utilization and dynamic shape execution performance.
Configuration example: custom_op.parameter_map["ac_parallel_enable"].s = tf.compat.as_bytes("1") |
Training/Online inference |
|
compile_dynamic_mode |
Whether to generalize all input shapes in the graph.
Configuration example: custom_op.parameter_map["compile_dynamic_mode"].b = True
Note: This option cannot be used together with the parameters for dynamic dimension size profiles (input_shape, dynamic_dims, and dynamic_node_type). |
Training/Online inference |
|
all_tensor_not_empty |
Whether to remove control nodes for empty tensor checks in the execution graph. In dynamic shape graph scenarios, control nodes are typically inserted to check whether a node is empty to prevent empty tensor nodes from being sent to the device. If you are certain that the graph does not contain empty tensors, you can enable this option to remove these control nodes and improve graph execution performance.
Configuration example: custom_op.parameter_map["all_tensor_not_empty"].b = True |
Training/Online inference |
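The relationship between input_shape and dynamic_dims can be checked before a session is created. The sketch below is plain Python and only illustrates the mapping rule stated above (sizes are assigned to the -1 placeholders in order); the function names are hypothetical and are not part of any Ascend API.

```python
def parse_input_shape(input_shape):
    """Parse 'name:d0,d1;name2:d0' into an ordered list of (name, dims)."""
    inputs = []
    for entry in input_shape.split(";"):
        # Split on the last colon so the shape part is always isolated.
        name, dims = entry.rsplit(":", 1)
        inputs.append((name, [int(d) for d in dims.split(",")]))
    return inputs


def expand_profiles(input_shape, dynamic_dims):
    """Substitute each dynamic_dims profile into the -1 placeholders."""
    inputs = parse_input_shape(input_shape)
    placeholders = sum(dims.count(-1) for _, dims in inputs)
    profiles = [[int(v) for v in p.split(",")] for p in dynamic_dims.split(";")]
    if len(profiles) < 2:
        raise ValueError("at least two profiles are required")
    expanded = []
    for profile in profiles:
        if len(profile) != placeholders:
            raise ValueError(
                f"profile {profile} has {len(profile)} sizes, "
                f"but input_shape has {placeholders} dynamic dimensions"
            )
        values = iter(profile)
        expanded.append({
            name: tuple(next(values) if d == -1 else d for d in dims)
            for name, dims in inputs
        })
    return expanded


profiles = expand_profiles(
    "data:1,1,40,-1;label:1,-1;mask:-1,-1",
    "20,20,1,1;40,40,2,2;80,60,4,4",
)
```

With the example values from the table, the first profile resolves to data (1, 1, 40, 20), label (1, 20), and mask (1, 1). A profile with the wrong number of sizes is rejected, which mirrors the requirement that dynamic_dims must match input_shape.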
Mixed Computing
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
mix_compile_mode |
Whether to enable mixed computing.
In full offload mode, all compute operators are offloaded to the device. As a supplement to the full offload mode, mixed computing allows certain operators to be executed online within the frontend framework, improving the Ascend AI Processor's adaptability to TensorFlow. Example: custom_op.parameter_map["mix_compile_mode"].b = True |
Training/Online inference |
|
in_out_pair_flag |
Whether to offload operators specified by in_out_pair to the Ascend AI Processor in mixed computing scenarios.
Example: custom_op.parameter_map['in_out_pair_flag'].b = False |
Online inference |
|
in_out_pair |
Names of the input-layer and output-layer operators offloaded (or not) in mixed computing scenarios. Note that this option supports only one operator configured within the range of [in_nodes, out_nodes]. Example:
# Enable mixed computing.
custom_op.parameter_map["mix_compile_mode"].b = True
# Offload operators within the [in_nodes, out_nodes] range to the Ascend AI Processor for execution, and execute other operators in the frontend framework.
in_nodes = []
out_nodes = []
all_graph_iop = []
in_nodes.append('import/conv2d_1/convolution')
out_nodes.append('import/conv2d_59/BiasAdd')
out_nodes.append('import/conv2d_67/BiasAdd')
out_nodes.append('import/conv2d_75/BiasAdd')
all_graph_iop.append([in_nodes, out_nodes])
custom_op.parameter_map['in_out_pair'].s = tf.compat.as_bytes(str(all_graph_iop))
# Alternatively, retain operators within the [in_nodes, out_nodes] range for execution in the frontend framework, and offload other operators to the Ascend AI Processor for execution.
in_nodes = []
out_nodes = []
all_graph_iop = []
in_nodes.append('import/conv2d_1/convolution')
out_nodes.append('import/conv2d_59/BiasAdd')
out_nodes.append('import/conv2d_67/BiasAdd')
out_nodes.append('import/conv2d_75/BiasAdd')
all_graph_iop.append([in_nodes, out_nodes])
custom_op.parameter_map['in_out_pair_flag'].b = False
custom_op.parameter_map['in_out_pair'].s = tf.compat.as_bytes(str(all_graph_iop)) |
Online inference |
Debugging
Accuracy Tuning
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
precision_mode_v2 |
Operator precision mode, which must be of the string type.
In training scenarios:
In online inference scenarios, the default value is fp16. Example: custom_op.parameter_map["precision_mode_v2"].s = tf.compat.as_bytes("origin")
NOTE:
|
Training/Online inference |
|
precision_mode |
Operator precision mode, which must be of the string type.
In training scenarios:
In online inference scenarios, the default value is force_fp16. Example: custom_op.parameter_map["precision_mode"].s = tf.compat.as_bytes("allow_mix_precision")
NOTE:
|
Training/Online inference |
|
modify_mixlist |
When mixed precision is enabled, you can use this parameter to specify the path and file name of the blocklist, trustlist, and graylist, and specify the operators that allow precision reduction and those that do not allow precision reduction. You can enable the mixed precision by configuring precision_mode_v2 (recommended) or precision_mode in the script.
The blocklist, trustlist, and graylist are stored in a single JSON file. Configuration example:
custom_op.parameter_map["modify_mixlist"].s = tf.compat.as_bytes("/home/test/ops_info.json")
You can specify the operator types in ops_info.json as follows. Separate operators with commas (,).
{
"black-list": { // Blocklist
"to-remove": [ // Move an operator from the blocklist to the graylist.
"Xlog1py"
],
"to-add": [ // Move an operator from the trustlist or graylist to the blocklist.
"MatMul",
"Cast"
]
},
"white-list": { // Trustlist
"to-remove": [ // Move an operator from the trustlist to the graylist.
"Conv2D"
],
"to-add": [ // Move an operator from the blocklist or graylist to the trustlist.
"Bias"
]
}
}
Note: The operators in the preceding example configuration file are for reference only. The configuration should be based on the actual hardware environment and the built-in tuning policies of the operators. You can query the built-in tuning policy of each operator in mixed precision mode in the following file under the CANN software installation directory: opp/built-in/op_impl/ai_core/tbe/config/<soc_version>/aic-<soc_version>-ops-info-<opType>.json. Example:
"Conv2D":{
"precision_reduce":{
"flag":"true"
},
...
}
|
Training/Online inference |
|
customize_dtypes |
If precision_mode_v2 or precision_mode is used to set the global precision mode of a network, precision problems may occur on particular operators. In this case, you can use customize_dtypes to configure the precision of these operators individually, while the other operators are still compiled using the precision mode specified by precision_mode_v2 or precision_mode. Note that customize_dtypes does not take effect if precision_mode_v2 is set to origin or precision_mode is set to must_keep_origin_dtype. Set this option to the path of the configuration file, including the file name, for example, /home/test/customize_dtypes.cfg. Configuration example: custom_op.parameter_map["customize_dtypes"].s = tf.compat.as_bytes("/home/test/customize_dtypes.cfg")
List the names or types of the operators whose precision needs customization in the configuration file, one operator per line. Operator types must be defined based on Ascend IR. If both the name and the type are configured for an operator, the name-based setting applies during building. The structure of the configuration file is as follows:
# By operator name
Opname1::InputDtype:dtype1,dtype2,...OutputDtype:dtype1,...
Opname2::InputDtype:dtype1,dtype2,...OutputDtype:dtype1,...
# By operator type
OpType::TypeName1:InputDtype:dtype1,dtype2,...OutputDtype:dtype1,...
OpType::TypeName2:InputDtype:dtype1,dtype2,...OutputDtype:dtype1,...
Example:
# By operator name
resnet_v1_50/block1/unit_3/bottleneck_v1/Relu::InputDtype:float16,int8,OutputDtype:float16,int8
# By operator type
OpType::Relu:InputDtype:float16,int8,OutputDtype:float16,int8
NOTE:
|
Online inference/Training |
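The file passed to modify_mixlist can be generated from the training script instead of being edited by hand. The following is a minimal sketch using only the Python standard library. It reuses the operator names from the sample above, which are for reference only, and the helper name write_mixlist is hypothetical. Standard JSON has no comment syntax, so the // annotations shown in the sample are left out here; whether the parser tolerates them is not stated in this document.

```python
import json
import os
import tempfile

# Same structure as the ops_info.json sample, without the // annotations.
mixlist = {
    "black-list": {
        "to-remove": ["Xlog1py"],        # blocklist -> graylist
        "to-add": ["MatMul", "Cast"],    # trustlist/graylist -> blocklist
    },
    "white-list": {
        "to-remove": ["Conv2D"],         # trustlist -> graylist
        "to-add": ["Bias"],              # blocklist/graylist -> trustlist
    },
}


def write_mixlist(path, config):
    """Serialize the list configuration as plain JSON and return the path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return path


with tempfile.TemporaryDirectory() as workdir:
    mixlist_path = write_mixlist(os.path.join(workdir, "ops_info.json"), mixlist)
    # Reading the file back confirms that it parses as valid JSON.
    with open(mixlist_path, encoding="utf-8") as f:
        written_text = f.read()
    reloaded = json.loads(written_text)
```

In a real script, the file would be written to a persistent location and its path passed to modify_mixlist through tf.compat.as_bytes, as in the configuration example above.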
Accuracy Comparison
Performance Tuning
- Basic configuration
Option
Description
Application Scenarios
iterations_per_loop
Number of training iterations executed on the device side per sess.run() call in sess.run mode.
The value must be the same as the iterations_per_loop value passed to set_iteration_per_loop for function verification.
Example:
custom_op.parameter_map["iterations_per_loop"].i = 10
Training
- Advanced setting
Option
Description
Application Scenarios
hcom_parallel
Whether to run AllReduce gradient updates in parallel with forward and backward propagation during distributed training.
- True (default): enabled.
- False: disabled.
For a small network (for example, ResNet-18), you are advised to set this option to False.
Example:
custom_op.parameter_map["hcom_parallel"].b = True
Training
enable_small_channel
Whether to enable small channel optimization. If it is enabled, performance benefits are yielded at the convolutional layers with channel size <= 4.
- 0: disabled. This function is disabled by default in the training scenario (graph_run_mode is 1). You are advised not to enable this function in the training scenario.
- 1: enabled. This is the default in online inference scenarios (graph_run_mode is 0), where this option cannot be modified.
NOTE: After this function is enabled, performance benefits can be obtained on the ResNet50, ResNet101, and ResNet152 networks. For other network models, the performance may deteriorate.
Example:
custom_op.parameter_map["enable_small_channel"].i = 1
Online inference/Training
op_precision_mode
High-precision or high-performance mode of an operator. You can pass a custom mode configuration file op_precision.ini to set different modes for operators.
You can set this option by operator type (low priority) or node name (high priority). Example:
[ByOpType]
optype1=high_precision
optype2=high_performance
optype3=enable_hi_float_32_execution
optype4=support_out_of_bound_index
[ByNodeName]
nodename1=high_precision
nodename2=high_performance
nodename3=enable_hi_float_32_execution
nodename4=support_out_of_bound_index
- high_precision: high precision.
- high_performance: high performance.
- enable_float_32_execution: The FP32 data type is used for internal processing of operators. In this scenario, the FP32 data type is not automatically converted to the HF32 data type. If you are using the HF32 data type for computation and find that the accuracy drop exceeds your expectation, enable this option to specify the use of FP32 for internal computation of certain operators in order to maintain accuracy.
This option is supported only by the following products:
Atlas A3 training products/Atlas A3 inference products
Atlas A2 training products/Atlas A2 inference products
- enable_hi_float_32_execution: The HF32 data type is used for internal processing of operators. After this option is enabled, the FP32 data type is automatically converted to the HF32 data type. This configuration can reduce the space occupied by data and improve performance. This option is not supported in the current version.
- support_out_of_bound_index: The out-of-bounds verification is performed on the indices of the gather, scatter, and segment operators. The verification deteriorates the operator execution performance.
- keep_fp16: The FP16 data type is used for internal operator processing. In this mode, FP16 is not automatically converted to FP32. If FP32 computation fails to meet performance expectations and high accuracy is not required, you can enable the keep_fp16 mode. This low-precision mode trades accuracy for performance and is not recommended.
- super_performance: ultra-high performance. Compared with high performance, the algorithm calculation formula is optimized.
You can view the supported precision and performance mode values for a specific operator in the opp/built-in/op_impl/ai_core/tbe/impl_mode/all_ops_impl_mode.ini file under the CANN software installation directory.
This parameter is mutually exclusive with op_select_implmode and optypelist_for_implmode. If they are all specified, op_precision_mode takes precedence.
Generally, you do not need to set this parameter. It is used if you need to adjust the precision of a specific operator using the configuration .ini file in the case that you fail to obtain optimal network performance or accuracy in the high-performance or high-precision mode.
Example:
custom_op.parameter_map["op_precision_mode"].s = tf.compat.as_bytes("/home/test/op_precision.ini")
Training/Online inference
enable_scope_fusion_passes
Names of the registered scope fusion patterns to take effect at build time. Separate multiple names with commas (,).
Scope fusion patterns (either built-in or custom) are classified into the following two types:
- General: common scope fusion patterns applicable to all networks. They are enabled by default and cannot be disabled manually.
- Non-general: scope fusion patterns applicable to specific networks. They are disabled by default. You can use enable_scope_fusion_passes to enable selected fusion patterns.
Example:
custom_op.parameter_map["enable_scope_fusion_passes"].s = tf.compat.as_bytes("ScopeLayerNormPass,ScopeClipBoxesPass")
Training/Online inference
stream_max_parallel_num
This parameter applies only to neural machine translation (NMT) networks.
It specifies the degree of parallelism of the AI CPU and AI Core engines, so that AI CPU operators and AI Core operators can run in parallel.
DNN_VM_AICPU is the name of the AI CPU engine. In the example below, the number of concurrent tasks on the AI CPU engine is 10.
AIcoreEngine is the name of the AI Core engine. In the example below, the number of concurrent tasks on the AI Core engine is 1.
Defaults to 1. The value cannot exceed the maximum number of AI Cores.
Example:
custom_op.parameter_map["stream_max_parallel_num"].s = tf.compat.as_bytes("DNN_VM_AICPU:10,AIcoreEngine:1")
Training/Online inference
is_tailing_optimization
This option applies only to Bidirectional Encoder Representations from Transformers (BERT) networks.
Whether to enable communication tailing optimization in distributed training scenarios to improve performance. By changing a computation dependency, computation that does not depend on the last AR (gradient aggregation fragment) is scheduled to run in parallel with the last AR, which shortens the communication tail. Value:
- True: enabled.
- False (default): disabled.
This option must work with NPUOptimizer and the value must be the same as that of is_tailing_optimization in NPUOptimizer.
Example:
custom_op.parameter_map["is_tailing_optimization"].b = True
Training
variable_placement
If the network weight is large, network execution may fail due to insufficient device memory. In this case, you can deploy the variable to the host to reduce the memory usage of the device.
- Device: The variable is deployed on the device.
- Host: The variable is deployed on the host.
Default value: Device
Constraints:
- If this configuration option is set to Host, mixed computing must be enabled (mix_compile_mode = True).
- If the training script contains APIs of TensorFlow V1 control flow operators, such as tf.case, tf.cond, and tf.while_loop, setting variable_placement to Host may cause the network execution to fail. To avoid this problem, add the following APIs to the training script to convert the control flow operators of TensorFlow V1 to V2 and enable resource variables:
tf.enable_control_flow_v2()
tf.enable_resource_variables()
Example:
custom_op.parameter_map["variable_placement"].s = tf.compat.as_bytes("Device")
Training/Online inference
frozen_variable
When the weights are saved as a checkpoint, you can use this parameter to convert variables to constants, which reduces data copies between the host and device and improves inference performance.
- True: conversion enabled.
- False: conversion disabled.
Default value: False
Example:
custom_op.parameter_map["frozen_variable"].b = True
Online inference
graph_max_parallel_model_num
In online inference scenarios, you can set this option to specify the maximum number of threads for parallel graph execution. If the value of this option is greater than 1, the corresponding number of threads are started for parallel graph execution, improving the overall graph pipeline efficiency.
The value must be an integer in the range of [1, INT32_MAX]. The default value is 1. INT32_MAX is the maximum value of the INT32 type, which is 2147483647.
Example:
custom_op.parameter_map["graph_max_parallel_model_num"].i = 4
Online inference
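The configuration file passed to op_precision_mode uses the [ByOpType] and [ByNodeName] sections shown above, so it can be produced with the standard configparser module. The sketch below is an illustration under assumptions: the function name is hypothetical, and the operator type and node name in the usage lines are placeholders. Check all_ops_impl_mode.ini for the modes that a given operator actually supports.

```python
import configparser
import io


def build_op_precision_ini(by_op_type, by_node_name):
    """Render the two sections of an op_precision.ini file as text."""
    parser = configparser.ConfigParser(interpolation=None)
    # Keep keys case-sensitive; by default configparser lowercases them,
    # which would corrupt operator types and node names.
    parser.optionxform = str
    parser["ByOpType"] = by_op_type
    parser["ByNodeName"] = by_node_name
    buffer = io.StringIO()
    # Write key=value without spaces, matching the layout shown above.
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


ini_text = build_op_precision_ini(
    {"Pooling": "high_precision"},                   # placeholder operator type
    {"block1/unit_3/Relu": "high_performance"},      # placeholder node name
)
```

The resulting text can be written to a file such as /home/test/op_precision.ini and its path passed to op_precision_mode. Because node-name entries take priority over operator-type entries, a single node can be overridden without changing the mode of every operator of that type.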
Profiling
AOE
The AOE tuning feature supports only the following products:
Atlas A3 training products/Atlas A3 inference products
Atlas A2 training products/Atlas A2 inference products
Atlas training products
Operator and Graph Build
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
op_compiler_cache_mode |
Disk cache mode for operator building. enable is the default value.
Notes:
Example: custom_op.parameter_map["op_compiler_cache_mode"].s = tf.compat.as_bytes("enable") |
Training/Online inference |
|
op_compiler_cache_dir |
Disk cache directory for operator compilation. The value can contain letters, digits, underscores (_), hyphens (-), and periods (.). If the specified directory exists and is valid, the kernel_cache subdirectory is automatically created in it. If the specified directory does not exist but is valid, the system automatically creates the directory and the kernel_cache subdirectory. The storage priority of operator compilation cache files is as follows: op_compiler_cache_dir > ${ASCEND_CACHE_PATH}/kernel_cache > default path ($HOME/atc_data). For details about ASCEND_CACHE_PATH, see "Installation" in Environment Variables. Example: custom_op.parameter_map["op_compiler_cache_dir"].s = tf.compat.as_bytes("/home/test/kernel_cache") |
Training/Online inference |
|
aicore_num |
Maximum number of Cube cores and Vector cores used for operator compilation.
Format: Integer 1|Integer 2, where the two values are separated by a vertical bar (|). Integer 1 specifies the maximum number of Cube cores to use, and Integer 2 specifies the maximum number of Vector cores to use. Each value must be greater than 0 and must not exceed the actual number of Cube cores or Vector cores, respectively, available on the Ascend AI Processor.
NOTE:
Example: custom_op.parameter_map["aicore_num"].s = tf.compat.as_bytes("2|4") |
Training/Online inference |
|
oo_constant_folding |
Enables or disables constant folding.
Constant folding evaluates and replaces constant expressions during graph compilation to reduce memory usage. In most cases, you are advised to retain the default value to enable constant folding. However, some networks require more memory during compilation and running, and the constant memory is occupied throughout the entire lifecycle of a graph. If enabling constant folding increases the overall memory consumption, you can disable it using this parameter.
Example: custom_op.parameter_map["oo_constant_folding"].b = True
NOTE:
If constant folding is disabled and an error occurs during network compilation and running, an error message similar to the following will be displayed:
Solution: Enable constant folding by setting oo_constant_folding to True, and then use the _grappler_do_not_remove attribute via TensorFlow's Grappler to selectively disable constant folding for specific operators. |
Training/Online inference |
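The aicore_num format can be validated before the value is placed in parameter_map. The following is a minimal sketch of the rule stated above (two integers separated by a vertical bar, each greater than 0 and no larger than the available core count). The function name is hypothetical, and the available core counts are arguments that the caller must supply; the numbers in the usage line are made up for illustration.

```python
def parse_aicore_num(value, cube_available, vector_available):
    """Validate 'cube|vector' and return the two limits as integers."""
    parts = value.split("|")
    if len(parts) != 2:
        raise ValueError("expected the form 'Integer 1|Integer 2'")
    cube, vector = (int(p) for p in parts)  # int() raises ValueError on non-integers
    if not 0 < cube <= cube_available:
        raise ValueError(f"Cube core count {cube} is outside (0, {cube_available}]")
    if not 0 < vector <= vector_available:
        raise ValueError(f"Vector core count {vector} is outside (0, {vector_available}]")
    return cube, vector


# Hypothetical hardware with 24 Cube cores and 48 Vector cores.
limits = parse_aicore_num("2|4", cube_available=24, vector_available=48)
```

Any additional restrictions that apply to specific products are not covered by this check.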
Data Augmentation
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
local_rank_id |
Rank ID of the current process, used in data parallel processing. The main process deduplicates the data and distributes the deduplicated data to the devices of other processes for forward and backward propagation.
In this mode, multiple devices on a host share one process for data preprocessing. Although this is still a multi-process scenario, data preprocessing is performed in the main process, and other processes no longer accept datasets on the current process, but only receive preprocessed data from the main process. To identify the main process, call the collective communication API get_local_rank_id() to get the rank ID of the current process on its server. Example: custom_op.parameter_map["local_rank_id"].i = 0 |
Training/Online inference |
|
local_device_list |
Devices that the main process sends data to, used in conjunction with local_rank_id. Example: custom_op.parameter_map["local_device_list"].s = tf.compat.as_bytes("0,1") |
Training/Online inference |
Exception Remedy
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
hccl_timeout |
Synchronization timeout for inter-device task execution, in seconds. You can set the timeout interval if the default value does not meet your requirement (for example, when a communication failure occurs).
NOTE:
Example: custom_op.parameter_map["hccl_timeout"].i = 1800 |
Training/Online inference |
|
op_wait_timeout |
Operator wait timeout interval (s). Defaults to 120. You can set the timeout interval if the default value does not meet your requirement. Configuration example: custom_op.parameter_map["op_wait_timeout"].i = 120 |
Training/Online inference |
|
op_execute_timeout |
Operator execution timeout interval (s). Example: custom_op.parameter_map["op_execute_timeout"].i = 90 |
Training/Online inference |
|
stream_sync_timeout |
Timeout interval for stream synchronization during graph execution, in ms. If synchronization takes longer than the configured value, a synchronization failure is reported. The default value is -1, indicating that there is no waiting time and no error is reported when the synchronization fails. Note: In cluster scenarios, the value of this option must be greater than the collective communication timeout interval, that is, the value of hccl_timeout or the environment variable HCCL_EXEC_TIMEOUT. Example: custom_op.parameter_map["stream_sync_timeout"].i = 60000 |
Training/Online inference |
|
event_sync_timeout |
Timeout interval for event synchronization during graph execution, in ms. If synchronization takes longer than the configured value, a synchronization failure is reported. The default value is -1, indicating that there is no waiting time and no error is reported when the synchronization fails. Configuration example: custom_op.parameter_map["event_sync_timeout"].i = 60000 |
Training/Online inference |
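The timeout options in this table use different units: hccl_timeout is given in seconds, while stream_sync_timeout is given in milliseconds. The sketch below checks the cluster-scenario constraint after converting both to milliseconds. It assumes that the comparison is meant on durations rather than on raw numbers, and that -1 imposes no limit and therefore passes the check; both points are assumptions, and the function name is hypothetical.

```python
def stream_sync_timeout_is_valid(stream_sync_timeout_ms, hccl_timeout_s):
    """Return True if the stream sync timeout exceeds the HCCL timeout."""
    if stream_sync_timeout_ms == -1:
        # Assumption: -1 imposes no limit, so it cannot be too short.
        return True
    # Convert seconds to milliseconds before comparing.
    return stream_sync_timeout_ms > hccl_timeout_s * 1000


# The two example values from this table, if they were combined in a cluster:
combined_examples_ok = stream_sync_timeout_is_valid(60000, 1800)
# A value just over 30 minutes satisfies the constraint for hccl_timeout = 1800.
longer_value_ok = stream_sync_timeout_is_valid(1_801_000, 1800)
```

Under these assumptions, the example values 60000 (60 s) and 1800 (30 min) are each valid on their own but would not satisfy the constraint if used together in a cluster.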
Experiment Parameters
The experiment parameters are extended parameters for debugging and may be changed in later versions. Therefore, they cannot be used in commercial products.
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
graph_compiler_cache_dir |
Disk cache directory for graph compilation. If this parameter is not empty, the disk cache function for graph compilation takes effect. This function persists graph compilation results to disk so that, when the graph is compiled again, the cached results can be loaded directly to reduce the graph compilation duration.
Note:
Example: custom_op.parameter_map["graph_compiler_cache_dir"].s = tf.compat.as_bytes("/root/build_cache_dir") |
Training/Online inference |
|
jit_compile |
Determines whether to compile the operator online or use the compiled operator binary file.
NOTICE:
This option is used only for networks of large recommendation models. Example: custom_op.parameter_map["jit_compile"].s = tf.compat.as_bytes("auto") |
Training/Online inference |
|
shape_generalization_mode |
When jit_compile is set to true (online operator compilation), use this parameter to configure the shape generalization mode.
NOTICE:
When compile_dynamic_mode is set to True, the first iteration generalizes all input shapes to -1, and the shape_generalization_mode setting does not take effect. Example: custom_op.parameter_map["shape_generalization_mode"].s = tf.compat.as_bytes("FULL") |
Training/Online inference |
|
experimental_accelerate_train_mode |
If training takes more than one hour, you can trigger training acceleration to improve training performance by configuring this option. Based on the configured acceleration type, acceleration trigger mode, and the proportion of low-precision training processes, the software compiles and runs the corresponding proportion of training processes with reduced precision, while the remaining processes are compiled and run at their original precision.
The value of this option is a string with three fields separated by vertical bars (|), for example, fast|step|0.9.
Example:
Notes:
|
Training |
|
auto_multistream_parallel_mode |
This option applies only to graphs with a static shape. You can enable parallel execution of Cube and Vector operators to improve graph execution performance.
NOTICE:
Example:
custom_op.parameter_map["auto_multistream_parallel_mode"].s = tf.compat.as_bytes("cv")
|
Training |
Parameters That Will Be Deprecated in Later Versions
The following options will be deprecated in later versions. You are advised not to use them anymore.
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
op_debug_level |
Function debugging. Whether to enable operator debugging. The values are as follows:
This parameter is left empty by default, indicating that the configuration is disabled. Example: custom_op.parameter_map["op_debug_level"].i = 0 |
Training/Online inference |
|
enable_data_pre_proc |
Performance tuning. Whether to enable GetNext operator offload to the Ascend AI Processor. GetNext operator offload is a prerequisite for iteration offload.
Example:
custom_op.parameter_map["enable_data_pre_proc"].b = True |
Training |
|
variable_format_optimize |
Performance tuning. Whether to enable variable format optimization.
If it is enabled, the variables are reformatted during network variable initialization to better suit the Ascend AI Processor (for example, from NCHW to NC1HWC0) for improved training efficiency. Enable or disable this function as needed. This parameter is left empty by default, indicating that the configuration is disabled. Example: custom_op.parameter_map["variable_format_optimize"].b = True |
Training |
|
op_select_implmode |
Performance tuning. Selects the operator implementation mode. Certain operators built in the Ascend AI Processor can be implemented in either high-precision or high-performance mode at model build time. Arguments:
This parameter is left empty by default, indicating that the configuration is disabled. Example: custom_op.parameter_map["op_select_implmode"].s = tf.compat.as_bytes("high_precision") |
Training/Online inference |
|
optypelist_for_implmode |
Performance tuning. List of operator types (separated by commas) that use the mode specified by the op_select_implmode parameter. Currently, Pooling, SoftmaxV2, LRN, and ROIAlign operators are supported. Use this option in conjunction with op_select_implmode, for example: Set op_select_implmode to high_precision. Set optypelist_for_implmode to Pooling. This parameter is left empty by default, indicating that the configuration is disabled. Example: custom_op.parameter_map["optypelist_for_implmode"].s = tf.compat.as_bytes("Pooling,SoftmaxV2") |
Training/Online inference |
|
dynamic_input |
Whether it is a dynamic input.
Example: custom_op.parameter_map["dynamic_input"].b = True |
Training/Online inference |
|
dynamic_graph_execute_mode |
Execution mode of a dynamic input. This option takes effect only when dynamic_input is set to True. Possible values are: dynamic_execute: dynamic graph compilation. In this mode, the shape range configured in dynamic_inputs_shape_range is used for compilation. Example: custom_op.parameter_map["dynamic_graph_execute_mode"].s = tf.compat.as_bytes("dynamic_execute") |
Training/Online inference |
|
dynamic_inputs_shape_range |
Shape range of each dynamic input. If a graph has two dataset inputs and one placeholder input, a configuration example is as follows: custom_op.parameter_map["dynamic_inputs_shape_range"].s = tf.compat.as_bytes("getnext:[128 ,3~5, 2~128, -1],[64 ,3~5, 2~128, -1];data:[128 ,3~5, 2~128, -1]")
Precautions:
|
Training/Online inference |
|
graph_memory_max_size |
Sizes of the network static memory and the maximum dynamic memory (used in earlier versions). In the current version, this parameter does not take effect. The system dynamically allocates memory resources based on the actual memory usage of the network. |
Training/Online inference |
|
variable_memory_max_size |
Size of the variable memory (used in earlier versions). In the current version, this parameter does not take effect. The system dynamically allocates memory resources based on the actual memory usage of the network. |
Training/Online inference |
