Configuration Parameters
Basic Options
Option |
Description |
|---|---|
graph_run_mode |
Graph run mode.
Example: npu.global_options().graph_run_mode=1 |
deterministic |
Whether to enable deterministic computing. When it is enabled, an operator produces the same output each time it is executed with the same hardware and input. The values are as follows:
Deterministic computing is disabled by default because it slows down operator execution and reduces performance. When it is disabled, the results of repeated executions may differ. This is generally caused by asynchronous multi-threaded execution inside operator implementations, which changes the order in which floating-point numbers are accumulated. If a model's results differ between runs, or if precision needs to be tuned, you can enable deterministic computing to assist model debugging and tuning. Note that for a fully reproducible result, you must also set a fixed random seed in the training script so that the random numbers generated in the program are reproducible as well. Example: npu.global_options().deterministic=1 |
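The effect of accumulation order described for deterministic can be reproduced in plain Python, without any NPU library. The sketch below is only an illustration: the sample values and the helper name draw are arbitrary choices, not part of the option's API. It shows that floating-point addition is not associative, and that a fixed seed makes pseudo-random draws repeatable.

```python
import random

# Floating-point addition is not associative, so the order in which
# partial sums are accumulated can change the last bits of the result.
a, b, c = 0.1, 0.2, 0.3
sum_left = (a + b) + c   # accumulate left to right
sum_right = a + (b + c)  # same numbers, different grouping
order_matters = sum_left != sum_right


def draw(seed, n=3):
    """Return n pseudo-random floats from a generator with a fixed seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


if __name__ == "__main__":
    print("grouping changes the sum:", order_matters)
    print("same seed, same numbers:", draw(42) == draw(42))
```

The same reasoning applies to parallel reductions on a device: if threads finish in a different order, the grouping of the additions changes, and so can the result.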
Memory Management
Dynamic Shape
Option |
Description |
|---|---|
ac_parallel_enable |
Whether to allow AI CPU operators and AI Core operators to run in parallel in a dynamic shape graph. In a dynamic shape graph, when this option is enabled, the system automatically identifies AI CPU operators that can be concurrently executed with the AI Core operators in the graph. Operators of different engines are distributed to different flows to implement parallel execution among multiple engines, improving resource utilization and dynamic shape execution performance.
Example: npu.global_options().ac_parallel_enable="1" |
compile_dynamic_mode |
Whether to generalize all input shapes in the graph.
Example: npu.global_options().compile_dynamic_mode=True |
all_tensor_not_empty |
Whether to remove control nodes for empty tensor checks in the execution graph. In dynamic shape graph scenarios, control nodes are typically inserted to check whether a node is empty to prevent empty tensor nodes from being sent to the device. If you are certain that the graph does not contain empty tensors, you can enable this option to remove these control nodes and improve graph execution performance.
Example: npu.global_options().all_tensor_not_empty=True |
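When several of these options are set together, it can help to keep them in one mapping and assign them in a loop. The sketch below assumes only that each option is a plain attribute, as in the examples above. apply_options is a hypothetical helper, and SimpleNamespace stands in for the object returned by npu.global_options() so that the snippet runs without the NPU package. The values keep the types used in the examples: a string for ac_parallel_enable and booleans for the other two.

```python
from types import SimpleNamespace

# Values copied from the examples in the table above.
DYNAMIC_SHAPE_OPTIONS = {
    "ac_parallel_enable": "1",
    "compile_dynamic_mode": True,
    "all_tensor_not_empty": True,
}


def apply_options(options_obj, settings):
    """Assign each setting as an attribute; same effect as obj.name = value."""
    for name, value in settings.items():
        setattr(options_obj, name, value)
    return options_obj


# Stand-in for npu.global_options() (assumption, for illustration only).
opts = apply_options(SimpleNamespace(), DYNAMIC_SHAPE_OPTIONS)
print(opts.ac_parallel_enable, opts.compile_dynamic_mode, opts.all_tensor_not_empty)
```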
Debugging
Accuracy Tuning
Option |
Description |
|---|---|
precision_mode_v2 |
A string for the operator precision mode.
Default value:
Example: npu.global_options().precision_mode_v2="origin" NOTE:
|
precision_mode |
A string for the operator precision mode.
Example: npu.global_options().precision_mode="allow_mix_precision" NOTE:
|
modify_mixlist |
When mixed precision is enabled, use this parameter to specify the path and file name of the configuration file that holds the blocklist, trustlist, and graylist, that is, which operators allow precision reduction and which do not. You can enable mixed precision by configuring precision_mode_v2 (recommended) or precision_mode in the script. The blocklist, trustlist, and graylist are stored in a JSON file. A configuration example is as follows:
npu.global_options().modify_mixlist="/home/test/ops_info.json"
Specify the operator type (or types separated by commas) in ops_info.json as follows.
{
  "black-list": {                  // Blocklist
    "to-remove": [                 // Move an operator from the blocklist to the graylist.
      "Xlog1py"
    ],
    "to-add": [                    // Move an operator from the trustlist or graylist to the blocklist.
      "MatMul",
      "Cast"
    ]
  },
  "white-list": {                  // Trustlist
    "to-remove": [                 // Move an operator from the trustlist to the graylist.
      "Conv2D"
    ],
    "to-add": [                    // Move an operator from the blocklist or graylist to the trustlist.
      "Bias"
    ]
  }
}
Note: The operators in the preceding example configuration file are for reference only. Base your configuration on the actual hardware environment and the built-in tuning policies of the operators. You can query the built-in tuning policy of each operator in mixed precision mode in the file /opp/built-in/op_impl/ai_core/tbe/config/<soc_version>/aic-<soc_version>-ops-info.json under the CANN software installation directory. For example:
"Conv2D":{
  "precision_reduce":{
    "flag":"true"
  },
  ...
}
|
customize_dtypes |
If precision_mode is used to set the global precision mode of a network, precision problems may occur on particular operators. In this case, you can use customize_dtypes to configure the precision mode of these operators, while the other operators are still compiled with the precision mode specified by precision_mode. Note that if precision_mode is set to must_keep_origin_dtype, customize_dtypes does not take effect.
Set this option to the path of the configuration file, including the file name, for example, /home/test/customize_dtypes.cfg.
Example: npu.global_options().customize_dtypes = "/home/test/customize_dtypes.cfg"
In the configuration file, list the names or types of the operators whose precision needs customization. Each operator occupies one line, and the operator type must be defined based on Ascend IR. If both the operator name and the operator type are configured for an operator, the operator name applies during building. The structure of the configuration file is as follows:
# By operator name
Opname1::InputDtype:dtype1,dtype2,...OutputDtype:dtype1,...
Opname2::InputDtype:dtype1,dtype2,...OutputDtype:dtype1,...
# By operator type
OpType::TypeName1:InputDtype:dtype1,dtype2,...OutputDtype:dtype1,...
OpType::TypeName2:InputDtype:dtype1,dtype2,...OutputDtype:dtype1,...
Example:
# By operator name
resnet_v1_50/block1/unit_3/bottleneck_v1/Relu::InputDtype:float16,int8,OutputDtype:float16,int8
# By operator type
OpType::Relu:InputDtype:float16,int8,OutputDtype:float16,int8
NOTE:
|
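Both configuration files in this table are plain text, so they can be generated from a script. The following is a minimal sketch under stated assumptions: build_mixlist, dtype_line_by_name, and dtype_line_by_type are hypothetical helper names, and the output mirrors only the layouts shown in the examples above. The generated JSON omits the inline // comments because Python's json module emits standard JSON. Writing the text to the paths passed to modify_mixlist and customize_dtypes is left to the caller.

```python
import json


def build_mixlist(black_remove=(), black_add=(), white_remove=(), white_add=()):
    """Build the dictionary layout shown in the ops_info.json example."""
    return {
        "black-list": {"to-remove": list(black_remove), "to-add": list(black_add)},
        "white-list": {"to-remove": list(white_remove), "to-add": list(white_add)},
    }


def dtype_line_by_name(op_name, in_dtypes, out_dtypes):
    """Format one customize_dtypes line keyed by operator name."""
    return "{}::InputDtype:{},OutputDtype:{}".format(
        op_name, ",".join(in_dtypes), ",".join(out_dtypes))


def dtype_line_by_type(op_type, in_dtypes, out_dtypes):
    """Format one customize_dtypes line keyed by operator type."""
    return "OpType::{}:InputDtype:{},OutputDtype:{}".format(
        op_type, ",".join(in_dtypes), ",".join(out_dtypes))


# Operator names below are the reference-only ones from the examples above.
mixlist = build_mixlist(
    black_remove=["Xlog1py"], black_add=["MatMul", "Cast"],
    white_remove=["Conv2D"], white_add=["Bias"],
)
mixlist_text = json.dumps(mixlist, indent=2)

cfg_lines = [
    "# By operator name",
    dtype_line_by_name("resnet_v1_50/block1/unit_3/bottleneck_v1/Relu",
                       ["float16", "int8"], ["float16", "int8"]),
    "# By operator type",
    dtype_line_by_type("Relu", ["float16", "int8"], ["float16", "int8"]),
]
cfg_text = "\n".join(cfg_lines) + "\n"

print(mixlist_text)
print(cfg_text)
```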
Accuracy Comparison
Performance Tuning
Option |
Description |
|---|---|
hcom_parallel |
Whether to run the AllReduce gradient update in parallel with forward and backward propagation.
Example: npu.global_options().hcom_parallel=True For a small network (for example, ResNet18), you are advised to set this option to False. |
enable_small_channel |
Whether to enable small channel optimization. When it is enabled, convolutional layers with a channel size <= 4 gain performance.
Example: npu.global_options().enable_small_channel=1 |
op_precision_mode |
High-precision or high-performance mode of an operator. You can pass a custom mode configuration file op_precision.ini to set different modes for operators. You can set this option by operator type (low priority) or node name (high priority). Example:
[ByOpType]
optype1=high_precision
optype2=high_performance
optype3=enable_hi_float_32_execution
optype4=support_out_of_bound_index
[ByNodeName]
nodename1=high_precision
nodename2=high_performance
nodename3=enable_hi_float_32_execution
nodename4=support_out_of_bound_index
You can view the precision and performance modes supported by an operator in the opp/built-in/op_impl/ai_core/tbe/impl_mode/all_ops_impl_mode.ini file of the CANN component directory. This option is mutually exclusive with op_select_implmode and optypelist_for_implmode. If they are all specified, op_precision_mode takes precedence. Generally, you do not need to set this option. Use it when the high-performance or high-precision mode does not give optimal network performance or accuracy and you need to adjust the mode of specific operators through the .ini configuration file. Example: npu.global_options().op_precision_mode="/home/test/op_precision.ini" |
stream_max_parallel_num |
This option applies only to neural machine translation (NMT) networks. It sets the degree of parallelism of the AI CPU and AI Core engines so that AI CPU and AI Core operators can run in parallel. The value range is [1, 13] and the default is 1. Example: npu.global_options().stream_max_parallel_num="DNN_VM_AICPU:10,AIcoreEngine:1" In this example, DNN_VM_AICPU is the name of the AI CPU engine, which runs 10 concurrent tasks, and AIcoreEngine is the name of the AI Core engine, which runs 1 concurrent task. |
is_tailing_optimization |
This option applies only to Bidirectional Encoder Representations from Transformers (BERT) networks. Whether to enable communication tailing optimization in distributed training scenarios to improve performance. By changing a computation dependency, computation that does not depend on the last AR (gradient aggregation fragment) is scheduled to run in parallel with the last AR, which optimizes communication tailing. Values:
Example: npu.global_options().is_tailing_optimization=True |
enable_scope_fusion_passes |
Fusion pattern (or fusion patterns separated by commas) to take effect at build time. Scope fusion patterns (either built-in or custom) are classified into the following two types:
Example: npu.global_options().enable_scope_fusion_passes="ScopeLayerNormPass,ScopeClipBoxesPass" |
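The op_precision.ini layout and the stream_max_parallel_num string described in this table can both be produced programmatically. The following is a sketch rather than an official tool: build_op_precision_ini and format_stream_parallel are hypothetical helpers, optype1 and nodename1 are the placeholders from the example above, and the range check simply applies the [1, 13] range stated for stream_max_parallel_num.

```python
import configparser
import io


def build_op_precision_ini(by_op_type, by_node_name):
    """Render INI text with the [ByOpType] and [ByNodeName] sections."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep operator and node names case-sensitive
    parser["ByOpType"] = dict(by_op_type)
    parser["ByNodeName"] = dict(by_node_name)
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)  # key=value, no spaces
    return buffer.getvalue()


def format_stream_parallel(engine_tasks, low=1, high=13):
    """Join engine:count pairs, rejecting counts outside [low, high]."""
    parts = []
    for engine, tasks in engine_tasks.items():
        if not low <= tasks <= high:
            raise ValueError(f"{engine}: {tasks} is outside [{low}, {high}]")
        parts.append(f"{engine}:{tasks}")
    return ",".join(parts)


ini_text = build_op_precision_ini(
    {"optype1": "high_precision", "optype2": "high_performance"},
    {"nodename1": "high_precision"},
)
stream_value = format_stream_parallel({"DNN_VM_AICPU": 10, "AIcoreEngine": 1})

print(ini_text)
print(stream_value)
```

Setting optionxform to str matters here because configparser lowercases keys by default, which would alter mixed-case node names.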
Profiling
AOE
Operator and Graph Build
Option |
Description |
|---|---|
op_compiler_cache_mode |
Disk cache mode for operator building. The default value is enable.
Notes:
Example:
npu.global_options().op_compiler_cache_mode="enable" |
op_compiler_cache_dir |
Disk cache directory for operator compilation. The directory can contain letters, digits, underscores (_), hyphens (-), and periods (.). If the specified directory exists and is valid, a kernel_cache subdirectory is automatically created in it. If the specified directory does not exist but is valid, the system automatically creates both the directory and the kernel_cache subdirectory. Operator compilation cache files are stored according to the following priority: op_compiler_cache_dir > ${ASCEND_CACHE_PATH}/kernel_cache > default path ($HOME/atc_data). For details about ASCEND_CACHE_PATH, see Environment Variables. Example:
npu.global_options().op_compiler_cache_dir="/home/test/kernel_cache" |
aicore_num |
Maximum number of Cube cores and Vector cores used for operator compilation. Format: Integer 1|Integer 2, where the two values are separated by a vertical bar (|). Integer 1 specifies the maximum number of Cube cores to use, and Integer 2 specifies the maximum number of Vector cores to use. Each value must be greater than 0 and no greater than the actual number of Cube cores or Vector cores, respectively, available on the Ascend AI Processor.
NOTE:
Example: npu.global_options().aicore_num="2|4" |
oo_constant_folding |
Whether to enable constant folding. Constant folding computes the values of constant expressions during the graph build stage and replaces the expressions with those values, thereby reducing memory usage. In most cases, you are advised to retain the default value, which enables constant folding. However, some networks require more memory during compilation and running, and constant memory is occupied throughout the lifecycle of the graph. If the total memory increases after constant folding, you can use this parameter to disable it.
Example: npu.global_options().oo_constant_folding=True NOTE:
If constant folding is disabled and an error occurs during network compilation and running, information similar to the following will be displayed:
|
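The cache directory priority and the aicore_num format described in this table can be expressed as two small functions. This is a sketch under stated assumptions: pick_cache_dir and format_aicore_num are hypothetical helpers, the environment is passed in as a dictionary so the logic can be checked without touching the real environment, and the available core counts are supplied by the caller rather than queried from hardware. pick_cache_dir only reports which setting wins; it does not create directories.

```python
import os
import posixpath


def pick_cache_dir(option_value=None, env=None):
    """Return (source, path) for the highest-priority cache setting present."""
    env = os.environ if env is None else env
    if option_value:
        # 1. Explicit op_compiler_cache_dir wins.
        return "op_compiler_cache_dir", option_value
    ascend_cache = env.get("ASCEND_CACHE_PATH")
    if ascend_cache:
        # 2. ${ASCEND_CACHE_PATH}/kernel_cache
        return "ASCEND_CACHE_PATH", posixpath.join(ascend_cache, "kernel_cache")
    # 3. Default path: $HOME/atc_data
    return "default", posixpath.join(env.get("HOME", ""), "atc_data")


def format_aicore_num(cube, vector, max_cube, max_vector):
    """Format 'cube|vector' after checking 0 < value <= available cores."""
    if not 0 < cube <= max_cube:
        raise ValueError(f"Cube cores must be in (0, {max_cube}], got {cube}")
    if not 0 < vector <= max_vector:
        raise ValueError(f"Vector cores must be in (0, {max_vector}], got {vector}")
    return f"{cube}|{vector}"


# /data/cache and the core counts of 8 are made-up values for illustration.
print(pick_cache_dir(None, {"ASCEND_CACHE_PATH": "/data/cache", "HOME": "/home/test"}))
print(format_aicore_num(2, 4, max_cube=8, max_vector=8))
```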
Exception Remedy
Experiment Options
The experiment options are extended options for debugging and may be changed in later versions. Therefore, they cannot be used in commercial products.
Option |
Description |
|---|---|
graph_compiler_cache_dir |
Disk cache directory for graph compilation. If this parameter is not empty, the disk cache function for graph compilation takes effect. This function persists graph compilation results to disk. When the graph is compiled again, the cached results can be loaded directly from disk, which reduces the graph compilation duration. Note:
Example: npu.global_options().graph_compiler_cache_dir="/root/build_cache_dir" |
jit_compile |
Whether to enable online compilation for model compilation.
NOTICE:
This option is used only for networks of large recommendation models. Example: npu.global_options().jit_compile = "auto" |
shape_generalization_mode |
When jit_compile is set to true (online operator compilation), use this parameter to configure the shape generalization mode.
NOTICE:
If compile_dynamic_mode is set to True, all input shapes are generalized to -1 in the first iteration. In this case, the configuration of shape_generalization_mode does not take effect. Example: npu.global_options().shape_generalization_mode = "FULL" |
auto_multistream_parallel_mode |
This option applies only to graphs with a static shape. You can enable parallel execution of Cube and Vector operators to improve graph execution performance.
NOTICE:
Example: npu.global_options().auto_multistream_parallel_mode = "cv" |
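The interaction between shape_generalization_mode, jit_compile, and compile_dynamic_mode can be checked before a run. The sketch below encodes only the two conditions stated in the table above. check_shape_generalization is a hypothetical helper, and comparing jit_compile case-insensitively against "true" is an assumption about how that value is spelled.

```python
def check_shape_generalization(settings):
    """Return notes about when shape_generalization_mode would be ignored."""
    notes = []
    if "shape_generalization_mode" not in settings:
        return notes
    if settings.get("compile_dynamic_mode") is True:
        notes.append(
            "compile_dynamic_mode=True: shape_generalization_mode does not take effect"
        )
    jit = settings.get("jit_compile")
    # Assumption: an enabled jit_compile is spelled "true" (any letter case).
    if jit is not None and str(jit).lower() != "true":
        notes.append(
            "jit_compile is not true: shape_generalization_mode applies when jit_compile is true"
        )
    return notes


planned = {
    "shape_generalization_mode": "FULL",
    "compile_dynamic_mode": True,
    "jit_compile": "auto",
}
for note in check_shape_generalization(planned):
    print(note)
```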