NPURunConfig Options
Basic Options
graph_run_mode
Graph run mode. The values are as follows:
- 0: online inference
- 1: training
Example: config = NPURunConfig(graph_run_mode=1)
session_device_id
Logical ID of a device. Setting this option allows you to run different models on multiple devices from a single training script. Generally, you create a session for each graph and pass the corresponding session_device_id to that session. This option takes precedence over the environment variable ASCEND_DEVICE_ID.
Example:
config0 = NPURunConfig(..., session_device_id=0, ...)
estimator0 = NPUEstimator(..., config=config0, ...)
...
config1 = NPURunConfig(..., session_device_id=1, ...)
estimator1 = NPUEstimator(..., config=config1, ...)
...
config7 = NPURunConfig(..., session_device_id=7, ...)
estimator7 = NPUEstimator(..., config=config7, ...)
distribute
ParameterServerStrategy object for distributed training in the PS-Worker architecture.
Example: config = NPURunConfig(distribute=strategy)
deterministic
Whether to enable deterministic computing. If enabled, an operator produces the same output every time it runs with the same hardware and inputs. The values are as follows:
- 0: disabled (default)
- 1: enabled
Deterministic computing is disabled by default because it slows down operator execution. When it is disabled, the results of repeated executions may differ; this is generally caused by asynchronous multi-threaded execution inside operator implementations, which changes the accumulation order of floating-point numbers. If a model produces different results across runs, or you need to tune precision, enable deterministic computing to assist model debugging and tuning. Note that for a fully reproducible result you must also set a fixed random seed in the training script, so that the random numbers generated in the program are reproducible as well.
Example: config = NPURunConfig(deterministic=1)
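The accumulation-order effect described above can be reproduced in plain Python, independent of the NPU runtime: floating-point addition is not associative, so reordering a sum changes the result. This is an illustration only, not NPU code.

```python
# Floating-point addition is not associative: the accumulation order chosen
# by parallel threads can change the result, which is why repeated runs of a
# non-deterministic reduction may differ.
values = [1e16, 1.0, -1e16]

left_to_right = (values[0] + values[1]) + values[2]  # 1.0 is absorbed by 1e16
reordered     = (values[0] + values[2]) + values[1]  # cancellation happens first

print(left_to_right)  # 0.0
print(reordered)      # 1.0
```

With deterministic computing enabled, the runtime fixes one accumulation order, trading some speed for run-to-run reproducibility.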
Memory Management
memory_config
System memory usage mode. Before creating NPURunConfig, you can instantiate a MemoryConfig class to configure the related functions. For details about the constructor of the MemoryConfig class, see MemoryConfig Constructor.
external_weight
When multiple models are loaded in one session and their weights can be shared, you are advised to use this option to externalize the weights of the Const/Constant nodes in the network, so that the weights are reused across models and their memory usage is reduced.
Note: This option is usually not required. Use it when the model loading environment is memory-constrained and the weights should be stored externally.
Example:
config = NPURunConfig(external_weight=True)
input_fusion_size
Threshold for fusing and copying multiple discrete pieces of user input data when transferring data from the host to the device, in bytes. The minimum value is 0 bytes, the maximum value is 33554432 bytes (32 MB), and the default value is 131072 bytes (128 KB).
Assume there are 10 user inputs, including two 100 KB inputs, two 50 KB inputs, and six inputs larger than 100 KB.
Example: config = NPURunConfig(input_fusion_size=25600)
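One plausible reading of the threshold (an assumption for illustration, not the runtime's documented algorithm): inputs no larger than the threshold are batched into a single host-to-device copy, while larger inputs are copied individually. A sketch of that logic:

```python
# Hedged sketch of the fusion threshold: inputs at or below the threshold are
# grouped into one host-to-device copy; larger inputs are copied individually.
# This models the option's intent, not the runtime's exact behavior.
def plan_copies(input_sizes, fusion_size=131072):
    """Split input sizes (bytes) into one fused group and individual copies."""
    fused = [s for s in input_sizes if s <= fusion_size]
    individual = [s for s in input_sizes if s > fusion_size]
    copies = (1 if fused else 0) + len(individual)
    return fused, individual, copies

# Two 100 KB and two 50 KB inputs fall under the default 128 KB threshold;
# six larger inputs are copied on their own.
sizes = [102400, 102400, 51200, 51200] + [204800] * 6
fused, individual, copies = plan_copies(sizes)
print(copies)  # 7 copies instead of 10
```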
Dynamic Shape
ac_parallel_enable
Whether to allow AI CPU operators and AI Core operators to run in parallel in a dynamic shape graph.
When this option is enabled, the system automatically identifies the AI CPU operators in a dynamic shape graph that can run concurrently with the AI Core operators, and dispatches operators of different engines to different streams, so that multiple engines execute in parallel. This improves resource utilization and dynamic shape execution performance.
Example: config = NPURunConfig(ac_parallel_enable="1")
compile_dynamic_mode
Whether to generalize all input shapes in the graph.
Example: config = NPURunConfig(compile_dynamic_mode=True)
Mixed Computing
mix_compile_mode
Whether to enable mixed computing.
In full offload mode, all compute operators are offloaded to the device. As a supplement to full offload, mixed computing allows selected operators to be executed online in the frontend framework, improving the Ascend AI Processor's adaptability to TensorFlow.
Example: config = NPURunConfig(mix_compile_mode=True)
Debugging
Accuracy Tuning
precision_mode
A string specifying the operator precision mode.
Example: config = NPURunConfig(precision_mode="allow_mix_precision")
precision_mode_v2
A string specifying the operator precision mode.
Default value:
Example: config = NPURunConfig(precision_mode_v2="origin")
modify_mixlist
When mixed precision is enabled, you can use this option to specify the path and file name of a file defining the blocklist, trustlist, and graylist, that is, the operators that allow precision reduction and those that do not. Enable mixed precision in the script by configuring precision_mode_v2 or precision_mode.
The blocklist, trustlist, and graylist are stored in a JSON file. A configuration example is as follows:
config = NPURunConfig(modify_mixlist="/home/test/ops_info.json")
Specify the operator types in ops_info.json as shown below. Separate operators with commas (,).
{
  "black-list": {   // Blocklist
    "to-remove": [  // Move an operator from the blocklist to the graylist.
      "Xlog1py"
    ],
    "to-add": [     // Move an operator from the trustlist or graylist to the blocklist.
      "Matmul",
      "Cast"
    ]
  },
  "white-list": {   // Trustlist
    "to-remove": [  // Move an operator from the trustlist to the graylist.
      "Conv2D"
    ],
    "to-add": [     // Move an operator from the blocklist or graylist to the trustlist.
      "Bias"
    ]
  }
}
Note: The operators in the preceding example configuration file are for reference only. Configure the file based on the actual hardware environment and the operators' built-in tuning policies. You can query the built-in tuning policy of each operator in mixed precision mode in <CANN software installation directory>/opp/built-in/op_impl/ai_core/tbe/config/<soc_version>/aic-<soc_version>-ops-info.json, for example:
"Conv2D":{
  "precision_reduce":{
    "flag":"true"
  }
}
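The mixlist file can also be generated programmatically. The sketch below writes the example structure above as strict JSON (without the inline // comments, which the json module would reject); the file path is illustrative.

```python
import json
import os
import tempfile

# Build the mixlist structure from the documented example: "black-list" and
# "white-list" sections, each with "to-add"/"to-remove" operator lists.
mixlist = {
    "black-list": {
        "to-remove": ["Xlog1py"],
        "to-add": ["Matmul", "Cast"],
    },
    "white-list": {
        "to-remove": ["Conv2D"],
        "to-add": ["Bias"],
    },
}

path = os.path.join(tempfile.gettempdir(), "ops_info.json")
with open(path, "w") as f:
    json.dump(mixlist, f, indent=2)

# Reload to confirm the file is valid JSON before pointing modify_mixlist at it.
with open(path) as f:
    loaded = json.load(f)
print(loaded["black-list"]["to-add"])  # ['Matmul', 'Cast']
```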
enable_reduce_precision
Not supported in the current version.
customize_dtypes
If precision_mode sets the global precision mode of a network, precision problems may still occur on particular operators. In this case, use customize_dtypes to configure the precision mode of those operators; all other operators are still compiled with the mode specified by precision_mode. Note that if precision_mode is set to must_keep_origin_dtype, customize_dtypes does not take effect.
Set this option to the path of the configuration file, including the file name, for example, /home/test/customize_dtypes.cfg.
Example: config = NPURunConfig(customize_dtypes="/home/test/customize_dtypes.cfg")
List the names or types of the operators whose precision needs customization in the configuration file, one operator per line. Operator types must be defined based on Ascend IR. If both the name and the type of an operator are configured, the operator name takes precedence during compilation. The structure of the configuration file is as follows:
# By operator name
Opname1::InputDtype:dtype1,dtype2,...OutputDtype:dtype1,...
Opname2::InputDtype:dtype1,dtype2,...OutputDtype:dtype1,...
# By operator type
OpType::TypeName1:InputDtype:dtype1,dtype2,...OutputDtype:dtype1,...
OpType::TypeName2:InputDtype:dtype1,dtype2,...OutputDtype:dtype1,...
Example:
# By operator name
resnet_v1_50/block1/unit_3/bottleneck_v1/Relu::InputDtype:float16,int8,OutputDtype:float16,int8
# By operator type
OpType::Relu:InputDtype:float16,int8,OutputDtype:float16,int8
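The line format above can be parsed with a few lines of Python. The grammar here is inferred from the documented examples only (the real parser in CANN may differ), so treat this as a sketch of the notation:

```python
# Sketch of a parser for customize_dtypes lines. Grammar inferred from the
# documented examples: "Name::InputDtype:...OutputDtype:..." for by-name
# entries, "OpType::Type:InputDtype:...OutputDtype:..." for by-type entries.
def parse_line(line):
    line = line.strip()
    if not line or line.startswith("#"):
        return None                          # blank line or comment
    key, rest = line.split("::", 1)
    if key == "OpType":                      # by operator type
        type_name, rest = rest.split(":", 1)
        kind, ident = "type", type_name
    else:                                    # by operator name
        kind, ident = "name", key
    body, out = rest.split("OutputDtype:", 1)
    inp = body.split("InputDtype:", 1)[1]
    in_dtypes = [d for d in inp.strip(",").split(",") if d]
    out_dtypes = [d for d in out.strip(",").split(",") if d]
    return {"kind": kind, "id": ident, "in": in_dtypes, "out": out_dtypes}

cfg = """# By operator name
resnet_v1_50/block1/unit_3/bottleneck_v1/Relu::InputDtype:float16,int8,OutputDtype:float16,int8
# By operator type
OpType::Relu:InputDtype:float16,int8,OutputDtype:float16,int8"""

entries = [e for e in (parse_line(l) for l in cfg.splitlines()) if e]
print([e["kind"] for e in entries])  # ['name', 'type']
```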
Accuracy Comparison
dump_config
Dump configuration. Before creating NPURunConfig, you can instantiate a DumpConfig class for dump configuration. For details about the constructor of the DumpConfig class, see DumpConfig Constructor.
Example: config = NPURunConfig(dump_config=dump_config)
quant_dumpable
If the TensorFlow network has been quantized by the AMCT tool, this option controls whether the dump data from before quantization can be collected. The default value is 0.
Example: config = NPURunConfig(quant_dumpable="1")
NOTE:
This option applies only to online inference scenarios. When data dump is enabled, set this option to 1 to ensure that the pre-quantization dump data can be collected.
fusion_switch_file
Path of the fusion switch configuration file, including the file name. The value can contain letters, digits, underscores (_), hyphens (-), and periods (.). The built-in graph fusion and UB fusion patterns are enabled by default; you can disable selected fusion patterns in the configuration file.
Example:
config = NPURunConfig(fusion_switch_file="/home/test/fusion_switch.cfg")
The following is a template of the fusion_switch.cfg configuration file. "on" indicates that a fusion pattern is enabled, and "off" indicates that it is disabled.
{
    "Switch":{
        "GraphFusion":{
            "RequantFusionPass":"on",
            "ConvToFullyConnectionFusionPass":"off",
            "SoftmaxFusionPass":"on",
            "NotRequantFusionPass":"on",
            "ConvConcatFusionPass":"on",
            "MatMulBiasAddFusionPass":"on",
            "PoolingFusionPass":"on",
            "ZConcatv2dFusionPass":"on",
            "ZConcatExt2FusionPass":"on",
            "TfMergeSubFusionPass":"on"
        },
        "UBFusion":{
            "TbePool2dQuantFusionPass":"on"
        }
    }
}
To disable all fusion patterns at once, use the following configuration:
{
    "Switch":{
        "GraphFusion":{
            "ALL":"off"
        },
        "UBFusion":{
            "ALL":"off"
        }
    }
}
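Since fusion_switch.cfg is plain JSON, it can be generated or edited programmatically rather than by hand. The sketch below (file path illustrative) switches one documented graph-fusion pass off and uses the documented "ALL" wildcard to disable UB fusion as a group:

```python
import json
import os
import tempfile

# Fusion patterns are enabled by default, so the file only needs to list the
# switches being overridden.
switch = {
    "Switch": {
        "GraphFusion": {"ConvToFullyConnectionFusionPass": "off"},
        "UBFusion": {"ALL": "off"},  # documented wildcard: disable the whole group
    }
}

path = os.path.join(tempfile.gettempdir(), "fusion_switch.cfg")
with open(path, "w") as f:
    json.dump(switch, f, indent=4)

# Reload to confirm the file parses before pointing fusion_switch_file at it.
with open(path) as f:
    reloaded = json.load(f)
print(reloaded["Switch"]["UBFusion"]["ALL"])  # off
```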
buffer_optimize
Whether to enable buffer optimization. This is an advanced switch.
Example: config = NPURunConfig(buffer_optimize="l2_optimize")
Performance Tuning
- Basic configuration

iterations_per_loop
Number of training iterations performed on the Ascend AI Processor per sess.run() call. Defaults to 1. The total number of training iterations must be an integer multiple of iterations_per_loop. Training runs for the specified number of iterations per loop on the Ascend AI Processor before the result is returned to the host, which saves unnecessary interactions between the host and the device and reduces training time.
In mixed computing mode (mix_compile_mode set to True), iterations_per_loop must be set to 1.
Note: When iterations_per_loop is greater than 1, the total number of training iterations you set may differ from the actual total due to issues such as loop offloading and loss scaling overflow.
Example:
config = NPURunConfig(iterations_per_loop=1000)
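The integer-multiple constraint can be validated before training starts. A minimal sketch (function and variable names are illustrative, not part of the NPU API):

```python
# The total iteration count must be an integer multiple of iterations_per_loop;
# otherwise the last partial loop could not be offloaded as a whole.
def loops_for(total_steps, iterations_per_loop):
    """Return how many sess.run() round trips a training run needs."""
    if total_steps % iterations_per_loop != 0:
        raise ValueError(
            f"total_steps={total_steps} is not a multiple of "
            f"iterations_per_loop={iterations_per_loop}")
    return total_steps // iterations_per_loop

print(loops_for(100000, 1000))  # 100 host/device round trips instead of 100000
```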
- Advanced settings

hcom_parallel
Whether to run AllReduce gradient aggregation in parallel with forward and backward computation during distributed training.
- True: enabled
- False: disabled
For a small network (for example, ResNet-18), you are advised to set this option to False.
Example:
config = NPURunConfig(hcom_parallel=True)
op_precision_mode
High-precision or high-performance mode of an operator. You can pass a custom mode configuration file, op_precision.ini, to set different modes for different operators.
You can set the mode by operator type (lower priority) or by node name (higher priority). Example:
[ByOpType]
optype1=high_precision
optype2=high_performance
optype4=support_out_of_bound_index
[ByNodeName]
nodename1=high_precision
nodename2=high_performance
nodename4=support_out_of_bound_index
The supported modes are as follows:
- high_precision
- high_performance
- support_out_of_bound_index: performs out-of-bounds checks on the indices of the gather, scatter, and segment operators. The checks degrade operator execution performance.
- keep_fp16: uses the FP16 data type for the internal processing of operators, without automatic conversion to FP32. If FP32 computation does not meet your performance expectation and high precision is not required, you can select keep_fp16. This low-precision mode trades precision for performance and is not recommended.
- super_performance: ultra-high performance. Compared with high_performance, the calculation formula of the algorithm is further optimized.
You can view the precision and performance modes supported by each operator in the opp/built-in/op_impl/ai_core/tbe/impl_mode/all_ops_impl_mode.ini file in the CANN software installation path.
This option is mutually exclusive with op_select_implmode and optypelist_for_implmode. If all of them are specified, op_precision_mode takes precedence.
Generally, you do not need to set this option. Use it to adjust the precision of a specific operator through the .ini configuration file when the high-performance or high-precision mode alone does not deliver the expected network performance or accuracy.
Example:
config = NPURunConfig(op_precision_mode="/home/test/op_precision.ini")
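The file uses standard INI sections, so Python's configparser can generate it. In the sketch below the operator type and node name entries are placeholders, and the file path is illustrative:

```python
import configparser
import os
import tempfile

# Build op_precision.ini with the documented [ByOpType] and [ByNodeName]
# sections. By-type entries have lower priority than by-node-name entries.
cfg = configparser.ConfigParser()
cfg.optionxform = str  # keep case: keys are operator/node names
cfg["ByOpType"] = {"Pooling": "high_precision"}
cfg["ByNodeName"] = {"fc1/MatMul": "high_performance"}

path = os.path.join(tempfile.gettempdir(), "op_precision.ini")
with open(path, "w") as f:
    cfg.write(f)

# Reload to confirm the file round-trips before passing it to NPURunConfig.
check = configparser.ConfigParser()
check.optionxform = str
check.read(path)
print(check["ByNodeName"]["fc1/MatMul"])  # high_performance
```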
enable_scope_fusion_passes
Scope fusion patterns to take effect during compilation, specified by the names of the registered fusion patterns. You can pass multiple names separated by commas (,).
Scope fusion patterns (either built-in or custom) fall into two types:
- General: common scope fusion patterns applicable to all networks. They are enabled by default and cannot be disabled manually.
- Non-general: scope fusion patterns applicable to specific networks. They are disabled by default; use enable_scope_fusion_passes to enable selected patterns.
Example:
config = NPURunConfig(enable_scope_fusion_passes="ScopeLayerNormPass,ScopeClipBoxesPass")
stream_max_parallel_num
This option applies only to NMT networks.
It specifies the degree of parallelism of the AI CPU and AI Core engines for parallel execution of AI CPU and AI Core operators.
Example:
config = NPURunConfig(stream_max_parallel_num="DNN_VM_AICPU:10,AIcoreEngine:1")
DNN_VM_AICPU is the name of the AI CPU engine; in this example, 10 tasks run concurrently on the AI CPU engine.
AIcoreEngine is the name of the AI Core engine; in this example, 1 task runs on the AI Core engine.
Each engine's parallelism defaults to 1, and the value cannot exceed the maximum number of AI Cores.
is_tailing_optimization
This option applies only to BERT networks.
Whether to enable communication tailing optimization in distributed training scenarios to improve performance. By changing a computation dependency, computation operations that do not depend on the last AR (gradient aggregation segment) are scheduled to run in parallel with that AR, optimizing the communication tail. The values are as follows:
- True: enabled
- False (default): disabled
This option must be used together with the NPUOptimizer Constructor, and its value must be the same as that of is_tailing_optimization in the NPUOptimizer Constructor.
Example:
config = NPURunConfig(is_tailing_optimization=True)
enable_small_channel
Whether to enable small-channel optimization. If enabled, performance benefits are yielded at convolutional layers with a channel size <= 4.
- 0: disabled. This is the default in the training scenario (graph_run_mode is 1); you are advised not to enable the optimization for training.
- 1: enabled. This is the fixed default in the online inference scenario (graph_run_mode is 0) and cannot be modified there.
NOTE:
With this optimization enabled, performance benefits can be obtained on the GoogLeNet, ResNet-50, ResNet-101, and ResNet-152 networks. For other networks, performance may deteriorate.
Example:
config = NPURunConfig(enable_small_channel=0)
variable_placement
If the network weights are large, network execution may fail due to insufficient device memory. In this case, you can place the variables on the host to reduce device memory usage. The values are as follows:
- Device: variables are placed on the device.
- Host: variables are placed on the host.
Default value: Device
Constraints:
- If this option is set to Host, mixed computing must be enabled (mix_compile_mode = True).
- If the training script uses TensorFlow V1 control flow operators, such as tf.case, tf.cond, and tf.while_loop, setting variable_placement to Host may cause network execution to fail. To avoid this problem, add the following APIs to the training script to convert the TensorFlow V1 control flow operators to V2 and enable resource variables:
tf.enable_control_flow_v2()
tf.enable_resource_variables()
Example:
config = NPURunConfig(variable_placement="Device")
graph_max_parallel_model_num
In the online inference scenario, this option specifies the maximum number of threads for parallel graph execution. If the value is greater than 1, the corresponding number of threads are started to execute graphs in parallel, improving overall graph execution efficiency.
The value must be an integer in the range [1, INT32_MAX], where INT32_MAX is the maximum value of the INT32 type (2147483647). The default value is 1.
Example:
config = NPURunConfig(graph_max_parallel_model_num=4)
Profiling
profiling_config
Profiling configuration. Before creating NPURunConfig, you can instantiate a ProfilingConfig class for profiling configuration. For details about the constructor of the ProfilingConfig class, see ProfilingConfig Constructor.
Example: config = NPURunConfig(profiling_config=profiling_config)
AOE
Operator Compilation
op_compiler_cache_mode
Disk cache mode for operator compilation. The default value is enable.
Example:
config = NPURunConfig(op_compiler_cache_mode="enable")
op_compiler_cache_dir
Disk cache directory for operator compilation. The value can contain letters, digits, underscores (_), hyphens (-), and periods (.). If the specified directory exists and is valid, the kernel_cache subdirectory is created in it automatically. If the specified directory does not exist but the path is valid, the system automatically creates the directory along with the kernel_cache subdirectory.
The storage priority of the operator compilation cache files is: op_compiler_cache_dir -> ${ASCEND_CACHE_PATH}/kernel_cache_<host ID> -> the default path ($HOME/atc_data). For details about ASCEND_CACHE_PATH, see Environment Variables.
Example:
config = NPURunConfig(op_compiler_cache_dir="/home/test/kernel_cache")
Data Augmentation
local_rank_id
Rank ID of the current process, used for data-parallel processing in recommendation networks. The main process deduplicates the data and distributes the deduplicated data to the devices of the other processes for forward and backward propagation.
In this mode, multiple devices on a host share one main process for data preprocessing, and the other processes receive the preprocessed data from the main process. To identify the main process, call the collective communication API get_local_rank_id() to obtain the rank ID of the current process on its server.
Example: config = NPURunConfig(local_rank_id=0, local_device_list="0,1")
local_device_list
Devices that the main process sends data to. Use this option together with local_rank_id.
Example: config = NPURunConfig(local_rank_id=0, local_device_list="0,1")
Exception Remedy
hccl_timeout
Timeout interval of collective communication, in seconds. Defaults to 1836. Set this option if the default value does not meet your requirements (for example, when a communication failure occurs).
Example: config = NPURunConfig(hccl_timeout=1800)
op_wait_timeout
Operator wait timeout interval, in seconds. Defaults to 120.
Example: config = NPURunConfig(op_wait_timeout=120)
op_execute_timeout
Operator execution timeout interval, in seconds.
Example: config = NPURunConfig(op_execute_timeout=90)
stream_sync_timeout
Timeout interval for stream synchronization during graph execution, in ms. If synchronization takes longer than the configured value, a synchronization failure is reported. The default value is -1, indicating that no timeout is set and no synchronization timeout error is reported.
Note: In cluster training scenarios, this stream synchronization timeout must be greater than the collective communication timeout, that is, the value of hccl_timeout or of the environment variable HCCL_EXEC_TIMEOUT.
Example: config = NPURunConfig(stream_sync_timeout=60000)
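The cluster-training constraint above mixes units (hccl_timeout is in seconds, stream_sync_timeout in ms), which is easy to get wrong. A hedged sanity-check sketch; the unit conversion reflects our reading of the two option descriptions:

```python
# In cluster training, stream_sync_timeout (ms) must exceed the collective
# communication timeout hccl_timeout (seconds). -1 means no stream timeout.
def check_sync_timeouts(stream_sync_timeout_ms, hccl_timeout_s=1836):
    if stream_sync_timeout_ms == -1:
        return True  # no stream timeout configured: nothing to violate
    return stream_sync_timeout_ms > hccl_timeout_s * 1000

print(check_sync_timeouts(60000, hccl_timeout_s=30))  # True: 60 s > 30 s
print(check_sync_timeouts(60000))                     # False against default 1836 s
```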
event_sync_timeout
Timeout interval for event synchronization during graph execution, in ms. If synchronization takes longer than the configured value, a synchronization failure is reported. The default value is -1, indicating that no timeout is set and no synchronization timeout error is reported.
Example: config = NPURunConfig(event_sync_timeout=60000)
Experiment Options
The experiment options are extended options for debugging and may change in later versions. Therefore, do not use them in commercial products.
experimental_config
Extended option; not recommended. Before creating NPURunConfig, you can instantiate an ExperimentalConfig class to configure the related functions. For details about the constructor of the ExperimentalConfig class, see ExperimentalConfig Constructor.
jit_compile
Whether to compile the operators online or use the precompiled operator binary files.
Default value: auto
NOTICE:
This option is used only for networks of large recommendation models.
Example: config = NPURunConfig(jit_compile="auto")
Options That Will Be Deprecated in Later Versions
The following options will be deprecated in later versions. You are advised not to use them anymore.
enable_data_pre_proc
Performance tuning. Whether to offload the GetNext operator to the Ascend AI Processor. GetNext offload is a prerequisite for iteration offload.
Example: config = NPURunConfig(enable_data_pre_proc=True)
variable_format_optimize
Performance tuning. Whether to enable variable format optimization.
To improve training efficiency, the formats of the variables are converted to formats more compatible with the Ascend AI Processor during variable initialization. Enable or disable this function as needed. This option is left empty by default, indicating that the configuration is disabled.
Example: config = NPURunConfig(variable_format_optimize=True)
op_debug_level
Whether to enable operator debugging. The values are as follows:
This option is left empty by default, indicating that the configuration is disabled.
Example: config = NPURunConfig(op_debug_level=1)
op_select_implmode
Operator implementation mode selection. Certain operators built into the Ascend AI Processor can be implemented in either high-precision or high-performance mode at model compile time. The values are as follows:
- high_precision
- high_performance
This option is left empty by default, indicating that the configuration is disabled.
Example:
config = NPURunConfig(op_select_implmode="high_precision")
optypelist_for_implmode
List of operator types (separated by commas) that use the mode specified by op_select_implmode. Currently, the Pooling, SoftmaxV2, LRN, and ROIAlign operators are supported. Use this option together with op_select_implmode, for example:
config = NPURunConfig(
    op_select_implmode="high_precision",
    optypelist_for_implmode="Pooling,SoftmaxV2")
This option is left empty by default, indicating that the configuration is disabled.
dynamic_input
Whether the input is dynamic.
Example:
config = NPURunConfig(dynamic_input=True)
dynamic_graph_execute_mode
Execution mode for dynamic inputs; this option takes effect when dynamic_input is set to True. The value is as follows:
- dynamic_execute: dynamic graph compilation. In this mode, the shape ranges configured in dynamic_inputs_shape_range are used for compilation.
Example:
config = NPURunConfig(dynamic_graph_execute_mode="dynamic_execute")
dynamic_inputs_shape_range
Shape range of each dynamic input. If a graph has two dataset inputs and one placeholder input, a configuration example is as follows:
config = NPURunConfig(dynamic_inputs_shape_range="getnext:[128,3~5,2~128,-1],[64,3~5,2~128,-1];data:[128,3~5,2~128,-1]")
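The range notation above (`~` for a bounded span, `-1` for an unknown dimension) can be decoded with a few lines of Python. This is an illustrative sketch of the notation only, not the runtime's parser:

```python
# Parse one bracketed shape range: "lo~hi" is a bounded range, "-1" marks an
# unknown dimension, and a bare number is a fixed size.
def parse_shape(text):
    dims = []
    for tok in text.strip("[]").split(","):
        tok = tok.strip()
        if tok == "-1":
            dims.append((None, None))          # unknown dimension
        elif "~" in tok:
            lo, hi = tok.split("~")
            dims.append((int(lo), int(hi)))    # bounded range
        else:
            dims.append((int(tok), int(tok)))  # fixed size
    return dims

shape = parse_shape("[128,3~5,2~128,-1]")
print(shape)  # [(128, 128), (3, 5), (2, 128), (None, None)]
```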
graph_memory_max_size
Sizes of the static network memory and the maximum dynamic memory (used in earlier versions). In the current version, this option does not take effect; the system dynamically allocates memory resources based on the actual memory usage of the network.
variable_memory_max_size
Size of the variable memory (used in earlier versions). In the current version, this option does not take effect; the system dynamically allocates memory resources based on the actual memory usage of the network.