Configuration Parameters

Basic Options

Option	Description
graph_run_mode	Graph run mode. 0: online inference. 1 (default): training. Example: npu.global_options().graph_run_mode=1
deterministic	Whether to enable deterministic computing. When enabled, an operator produces the same output every time it is executed on the same hardware with the same input. The values are as follows: 0 (default): disables deterministic computing. 1: enables deterministic computing. Deterministic computing is disabled by default because it slows down operator execution and affects performance. When it is disabled, the results of repeated executions may differ. This is generally caused by asynchronous multi-thread execution in operator implementations, which changes the accumulation order of floating-point numbers. However, if a model produces different results across runs or its precision needs to be tuned, you can enable deterministic computing to assist model debugging and tuning. Note that for fully reproducible results, you must also set a fixed random seed in the training script so that the random numbers generated in the program are reproducible as well. Example: npu.global_options().deterministic=1
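
The accumulation-order effect described for deterministic can be reproduced in plain Python without any NPU API. The sketch below is only an illustration: it shows that floating-point addition is not associative, and that a fixed seed makes random inputs repeatable. The names sample, order_matters, and same_seed_repeats are made up for this example.

```python
import random

# Floating-point addition is not associative, so the order in which
# partial sums are accumulated can change the last bits of the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
order_matters = left_to_right != right_to_left


def sample(seed, n=3):
    """Return n pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# The same seed reproduces the same sequence of random numbers.
same_seed_repeats = sample(1234) == sample(1234)
```

Deterministic computing addresses the first effect inside operators; the fixed seed in the training script addresses the second.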

Memory Management

Option	Description
memory_config.atomic_clean_policy	Whether to collectively clean up the memory occupied by all operators with the memset attribute (memset operators) on the network. The options are as follows: 0 (default): Enables collective cleanup. 1: Disables collective cleanup. The memory used by each memset operator is cleaned up separately. Try this setting when the memset operators on the network occupy too much memory. Note that it may cause performance loss. Example: npu.global_options().memory_config.atomic_clean_policy=1
memory_config.static_memory_policy	Memory allocation mode used during network running. 0 (default): dynamic memory allocation. Memory is dynamically allocated based on the actual size. 2: dynamic memory expansion, supported only for static shapes. During network running, this mode implements memory reuse between multiple graphs in a session. That is, memory is allocated for the graph that requires the most. For example, if the current graph requires more memory than the previous graph, the memory of the previous graph is released and memory is reallocated based on the requirement of the current graph. 3: dynamic memory expansion, supported only for dynamic shapes. This mode solves the fragmentation problem during dynamic memory allocation and reduces the memory usage of dynamic-shape networks. 4: dynamic memory expansion, supported for both static and dynamic shapes. Example: npu.global_options().memory_config.static_memory_policy=0 NOTE: This option cannot be set to 2 or 4 when multiple graphs are executed concurrently. For compatibility with earlier versions, the system uses mode 2 even if this option is set to 1. If this option is set to 3 or 4, memory usage is reduced, but performance may deteriorate.
memory_config.variable_use_1g_huge_page	In recommendation models, the embedding layer in TensorFlow uses variables. When embedding layers serve as input or output addresses for index-based operators (such as Gather and ScatterNd), large memory footprints may lead to extensive scattered access, potentially causing performance degradation. In such cases, you can try configuring this parameter to allocate memory for variables and constants using 1 GB huge pages, thereby improving memory access performance. The options are as follows: 0 (default): Uses the system default page size (4 KB or 2 MB) for memory allocation. 1: Allocates memory using 1 GB huge pages. If the allocation fails, an error log is printed and the service terminates. 2: Allocates memory using 1 GB huge pages. If the allocation fails, an error log is printed, but the service does not terminate; instead, it falls back to 2 MB pages. If the fallback allocation succeeds, the service continues; if it also fails, the service terminates. Using 1 GB huge pages can effectively reduce the number of page table entries and expand the address range covered by the translation lookaside buffer (TLB) cache, thereby improving performance for scattered access patterns. The TLB is a hardware module on the Ascend AI processor that caches recently used virtual-to-physical address mappings. Example: npu.global_options().memory_config.variable_use_1g_huge_page=1 NOTE: This parameter can be used only by the following products: Atlas A3 training products/Atlas A3 inference products, Atlas A2 training products/Atlas A2 inference products.
external_weight	When multiple models are loaded in a session and their weights can be reused, you are advised to use this option to externalize the weights of the Const/Constant nodes on the network. This implements weight reuse among the models and reduces the memory used by the weights. False (default): The weights are not externalized but are saved in graphs. True: The weights of all Const/Constant nodes on the network are flushed to the disk and are converted to the FileConstant type. The weight file is named in the format of weight_<hash value>. If the environment variable ASCEND_WORK_PATH is not configured, the weight files are flushed to the tmp_weight_<pid>_<sessionid> directory under the current execution directory. If ASCEND_WORK_PATH is configured, the weight files are flushed to the ${ASCEND_WORK_PATH}/tmp_weight_<pid>_<sessionid> directory. For details about ASCEND_WORK_PATH, see "Installation" in Environment Variables. When the model is unloaded, the tmp_weight_<pid>_<sessionid> directory is automatically deleted. Note: This option is usually not required. If the model loading environment has limited memory, you can externalize the weights. Example: npu.global_options().external_weight=True
input_fusion_size	Threshold for fusing and copying multiple discrete pieces of user input data during host-to-device (H2D) transmission. The unit is byte. The minimum value is 0, the maximum value is 33554432 (32 MB), and the default value is 131072 (128 KB). If the size of an input ≤ threshold: The data is fused before being transferred from the host to the device. If the size of an input > threshold, or threshold = 0 (function disabled): The data is not fused before being transferred from the host to the device. Assume there are 10 user inputs, including two 100 KB inputs, two 50 KB inputs, and six inputs greater than 100 KB: input_fusion_size set to 100 KB: The first four inputs are fused into 300 KB of data for transfer. The other six inputs are directly transferred from the host to the device. input_fusion_size set to 0: This function is disabled. That is, the data is not fused, and the ten inputs are directly transferred from the host to the device. Note: This parameter takes effect only for static shape graphs. Example: npu.global_options().input_fusion_size=25600
input_batch_cpy	Whether to enable the batch memory copy function when input data is transferred from the host to the device. True: The batch memory copy function is enabled. This value takes effect only when the number of user inputs is greater than 1. False (default): The batch memory copy function is disabled. NOTE: This parameter is supported only on the following products: Atlas A3 training products/Atlas A3 inference products, Atlas A2 training products/Atlas A2 inference products. This parameter improves data transfer performance from the host to the device. It applies to scenarios that require frequent data transfer and have low PCIe bandwidth utilization, where enabling batch copy can improve bandwidth utilization. If the network has only one input, the batch copy function does not take effect even if it is enabled. When both the input_fusion_size parameter (for fusion and copy) and the input_batch_cpy parameter (for batch copy) are configured, the fusion threshold may affect the batch copy function. For example, if there are five inputs and four of them are smaller than the fusion threshold and meet the fusion conditions, these four inputs are processed using fusion and copy. The remaining input does not meet the input quantity requirement for batch copy and therefore is not batch-copied. Example: npu.global_options().input_batch_cpy=True
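
The threshold rule for input_fusion_size and its interaction with input_batch_cpy can be worked through in a few lines of plain Python. This is a simplified model for checking the arithmetic in the descriptions above, not the runtime's implementation. The function names are hypothetical, and the rule that batch copy needs more than one input left after fusion is an assumption inferred from the five-input example.

```python
KB = 1024


def split_for_fusion(input_sizes, fusion_threshold):
    """Split input sizes (in bytes) into (fused, direct) lists.

    A threshold of 0 disables fusion, so every input is transferred directly.
    """
    if fusion_threshold == 0:
        return [], list(input_sizes)
    fused = [s for s in input_sizes if s <= fusion_threshold]
    direct = [s for s in input_sizes if s > fusion_threshold]
    return fused, direct


def batch_copy_takes_effect(direct_inputs, input_batch_cpy):
    """Assumed rule: batch copy needs more than one input left after fusion."""
    return input_batch_cpy and len(direct_inputs) > 1


# Ten inputs as in the input_fusion_size description: two of 100 KB, two of
# 50 KB, and six larger ones (200 KB stands in for "greater than 100 KB").
sizes = [100 * KB, 100 * KB, 50 * KB, 50 * KB] + [200 * KB] * 6
fused, direct = split_for_fusion(sizes, 100 * KB)
print(sum(fused) // KB, len(direct))  # prints "300 6"
```

With a threshold of 0, split_for_fusion returns an empty fused list and all ten inputs as direct transfers, which matches the disabled case.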

Dynamic Shape

Option	Description
ac_parallel_enable	Whether to allow AI CPU operators and AI Core operators to run in parallel in a dynamic shape graph. In a dynamic shape graph, when this option is enabled, the system automatically identifies AI CPU operators that can be concurrently executed with the AI Core operators in the graph. Operators of different engines are distributed to different flows to implement parallel execution among multiple engines, improving resource utilization and dynamic shape execution performance. 1: AI CPU operators and AI Core operators are allowed to run in parallel. 0 (default): AI CPU operators are not separately distributed. Example: npu.global_options().ac_parallel_enable="1"
compile_dynamic_mode	Whether to generalize all input shapes in the graph. True: All input shapes are generalized to -1. Also, static shape graphs are generalized to dynamic ones. False (default): Input shapes are not generalized. Example: npu.global_options().compile_dynamic_mode=True
all_tensor_not_empty	Whether to remove control nodes for empty tensor checks in the execution graph. In dynamic shape graph scenarios, control nodes are typically inserted to check whether a node is empty to prevent empty tensor nodes from being sent to the device. If you are certain that the graph does not contain empty tensors, you can enable this option to remove these control nodes and improve graph execution performance. True: Removes the control nodes used for empty tensor checks in the execution graph. Set it to True only when you are sure that the graph does not contain empty tensor nodes; otherwise, some operators may fail. False (default): Retains the control nodes used for empty tensor checks in the execution graph. Example: npu.global_options().all_tensor_not_empty=True
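
Before setting all_tensor_not_empty to True, it can help to check tensor shapes collected from representative runs for zero-sized dimensions. The helper below is a minimal sketch of such a pre-check in plain Python. The function names and the observed dictionary are hypothetical, and passing the check on sample data does not prove that the graph never produces an empty tensor.

```python
def is_empty_shape(shape):
    """A tensor is empty when any of its dimensions has size 0."""
    return any(dim == 0 for dim in shape)


def find_empty_tensors(named_shapes):
    """Return the names whose shapes describe empty tensors."""
    return [name for name, shape in named_shapes.items() if is_empty_shape(shape)]


# Hypothetical shapes logged from one run; "mask" has a zero-sized dimension.
observed = {"ids": (32, 128), "mask": (32, 0), "scale": ()}
empty = find_empty_tensors(observed)
safe_to_enable = not empty
```

Here safe_to_enable is False because "mask" is empty, so the control nodes should be kept for this graph.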

Debugging

Option	Description
op_debug_config	Enables debugging functions such as the global memory check. The value is the path of the .cfg configuration file. Multiple options in the configuration file are separated by commas (,). oom: checks whether memory overwriting occurs in the global memory during operator execution. During operator compilation, the .o file (operator binary file) and .json file (operator description file) are retained in the kernel_meta folder in the current execution path, and the following detection logic is added: inline __aicore__ void CheckInvalidAccessOfDDR(xxx) { if (access_offset < 0 || access_offset + access_extent > ddr_size) { if (read_or_write == 1) { trap(0X5A5A0001); } else { trap(0X5A5A0002); } } } You can use dump_cce to view the preceding code in the generated .cce file. If memory overwriting occurs during compilation, the error code EZ9999 is reported. dump_cce: Retains the .cce file, .o file, and .json file of the operator in the kernel_meta folder in the current execution path during operator compilation. dump_loc: Retains the .cce file, .o file, and .json file of the operator, as well as the _loc.json file (mapping file of python-cce) in the kernel_meta folder in the current execution path during operator compilation. ccec_O0: Enables the default option -O0 of the CCEC during operator compilation. This option does not perform any optimization based on the debugging information. ccec_g: Enables the -g option of the CCEC during operator compilation. Compared with -O0, this option generates optimization and debugging information. check_flag: Checks whether the pipeline synchronization signals in an operator are valid and consistent during operator execution. Retain the .o file and .json file in the generated kernel_meta folder and add the following detection logic during operator compilation: set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0); set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1); set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID2); set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID3); ....
pipe_barrier(PIPE_MTE3); pipe_barrier(PIPE_MTE2); pipe_barrier(PIPE_M); pipe_barrier(PIPE_V); pipe_barrier(PIPE_MTE1); pipe_barrier(PIPE_ALL); wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0); wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1); wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID2); wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID3); ... You can use dump_cce to view the preceding code in the generated .cce file. During compilation, if a mismatch exists in the pipeline synchronization signals in an operator, a timeout error is reported at the faulty operator. The following is an example of the error message: Aicore kernel execute failed, ..., fault kernel_name=Operator name,... rtStreamSynchronizeWithTimeout execute failed.... Example: npu.global_options().op_debug_config="/root/test0.cfg" The information about the test0.cfg file is as follows: op_debug_config=ccec_g,oom Constraints: During operator compilation, if you want to compile only some instead of all AI Core operators, add the op_debug_list field to the test0.cfg configuration file. By doing so, only the operators specified in the list are built, based on the options configured in op_debug_config. The op_debug_list field has the following requirements: The operator name or operator type can be specified. Operators are separated by commas (,). The operator type is configured in the OpType::typeName format. The operator type and operator name can be configured in a mixed manner. The operator to be compiled must be stored in the configuration file specified by op_debug_config. The following is a configuration example of the test0.cfg file: op_debug_config= ccec_g,oom op_debug_list=GatherV2,opType::ReduceSum During model compilation, the GatherV2 and ReduceSum operators are compiled based on the ccec_g and oom options. NOTE: When ccec_O0 and ccec_g are enabled, the size of the operator kernel file (.o file) increases. 
In dynamic shape scenarios, all possible scenarios are traversed during operator build, which may cause operator build failures due to large operator kernel files. In this case, do not enable the options of the CCE compiler. If the build failure is caused by the large operator kernel file, the following log is displayed: message:link error ld.lld: error: InputSection too large for range extension thunk ./kernel_meta_xxxxx.o:(xxxx) The CCEC options ccec_O0 and oom cannot be enabled at the same time. Otherwise, an AI Core error is reported. The following is an example of the error message: ...there is an aivec error exception, core id is 49, error code = 0x4 ... If this option is set to dump_cce or dump_loc, you can use debug_dir to specify the path for storing debugging-related process files. When the build options oom, dump_cce, and dump_loc are configured, if the model contains the following MC2 operators, the .o, .json, and .cce files of the operators are not generated in the operator build folder kernel_meta: MatMulAllReduce, MatMulAllReduceAddRmsNorm, AllGatherMatMul, MatMulReduceScatter, AlltoAllAllGatherBatchMatMul, BatchMatMulReduceScatterAlltoAll. If NPU_COLLECT_PATH is configured, the function of checking whether memory overwriting occurs in the global memory cannot be enabled. That is, this option cannot be set to oom. Otherwise, an error is reported when the compiled model file or operator kernel package is used.
enable_exception_dump	Whether to dump data of exception operators. 0: Disables the exception operator data dump function. 1: Enables the common ExceptionDump function to dump the input and output data, tensor description information (such as shape, dtype, and format), and workspace information of exception operators. The dump data is stored in the following directories in descending order of priority: NPU_COLLECT_PATH > ASCEND_WORK_PATH > default directory (extra-info in the script execution directory). 2 (default): Enables the LiteExceptionDump function to dump the input and output data, workspace information, and tiling information of exception operators. The exported data is used to analyze AI Core errors. For details about how to collect and locate AI Core errors, see "Typical Faults > AI Core Error Locating" in Troubleshooting. The dump data is stored in the following directories in descending order of priority: ASCEND_WORK_PATH > default directory (extra-info/data-dump/<device_id> in the script execution directory). NOTE: If the environment variable NPU_COLLECT_PATH is configured, exception operator data is dumped in accordance with mode 1 (common ExceptionDump) regardless of the value of enable_exception_dump, and the dump data is stored in the directory specified by NPU_COLLECT_PATH. For details about the environment variable, see Environment Variables. Example: npu.global_options().enable_exception_dump=1
debug_dir	Directory of the debug files generated during operator building, including the .o, .json, and .cce files. The storage priority of the debugging files generated during operator compilation is as follows: debug_dir > ASCEND_WORK_PATH > Default storage path (current script execution path) For details about ASCEND_WORK_PATH, see Environment Variables. Example: npu.global_options().debug_dir="/home/test"
export_compile_stat	Whether to generate the operator fusion result file fusion_result.json during graph compilation. The options are as follows: 0: The operator fusion result file is not generated. 1 (default): The operator fusion result file is generated when the program exits normally. 2: The operator fusion result file is generated after graph compilation is complete. That is, if graph compilation is complete but the program is interrupted, the result file is also generated. The fusion_result.json file records the fusion patterns used during graph compilation. The key fields in the file are described as follows: session_and_graph_id_xx_xx: thread and graph ID of the fusion result. graph_fusion: graph fusion. ub_fusion: UB fusion. match_times: number of times that the fusion pattern is matched during graph build. effect_times: actual number of times that the fusion takes effect. repository_hit_times: number of times that the UB fusion repository is hit. NOTE: If ASCEND_WORK_PATH is not configured in the environment, the operator fusion result is saved to the fusion_result.json file in the current execution directory. If ASCEND_WORK_PATH is configured, the operator fusion result is saved to the $ASCEND_WORK_PATH/FE/${Process ID}/fusion_result.json file. For details about the environment variable, see Environment Variables. The fusion patterns disabled by fusion_switch_file are not displayed in fusion_result.json. Example: npu.global_options().export_compile_stat=1
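
The op_debug_config rules above (comma-separated options, the optional op_debug_list field, and the ban on combining ccec_O0 with oom) can be checked before writing the .cfg file. The following is a minimal sketch in plain Python. The function name build_op_debug_cfg is hypothetical, and placing one key=value entry per line is an assumption about the file layout.

```python
ALLOWED_OPTIONS = {"oom", "dump_cce", "dump_loc", "ccec_O0", "ccec_g", "check_flag"}


def build_op_debug_cfg(options, op_list=()):
    """Return the text of a .cfg file for op_debug_config.

    Assumes one key=value entry per line; values are comma-separated.
    """
    unknown = [opt for opt in options if opt not in ALLOWED_OPTIONS]
    if unknown:
        raise ValueError(f"unknown op_debug_config options: {unknown}")
    if "ccec_O0" in options and "oom" in options:
        raise ValueError("ccec_O0 and oom cannot be enabled at the same time")
    lines = ["op_debug_config=" + ",".join(options)]
    if op_list:
        # Operator names and opType::<type> entries can be mixed.
        lines.append("op_debug_list=" + ",".join(op_list))
    return "\n".join(lines) + "\n"


cfg_text = build_op_debug_cfg(["ccec_g", "oom"], ["GatherV2", "opType::ReduceSum"])
```

The resulting text can be written to a file such as /root/test0.cfg, and that path is then passed to npu.global_options().op_debug_config.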

Accuracy Tuning

Option	Description
precision_mode_v2	A string for the operator precision mode. fp16 Indicates that float16 is forcibly selected if the operator precision in the original graph is float16, bfloat16, or float32. origin Retains the original precision. If the precision of an operator in the original graph is float16, and the implementation of the operator in the AI Core does not support float16 but supports only float32 and bfloat16, the system automatically uses high-precision float32. If the precision of an operator in the original graph is float16, and the implementation of the operator in the AI Core does not support float16 but supports only bfloat16, the AI CPU operator of float16 is used. If the AI CPU operator is not supported, an error is reported. If the precision of an operator in the original graph is float32, and the implementation of the operator in the AI Core does not support float32 but supports only float16, the AI CPU operator of float32 is used. If the AI CPU operator is not supported, an error is reported. cube_fp16in_fp32out The system selects a processing mode based on the operator type for AI Core operators supporting both float32 and float16. For cube operators, the system processes the computation based on the operator implementation. The preferred input data type is float16 and the output data type is float32. If the float16 input data and float32 output data types are not supported, set both the input and output data types to float32. If the float32 input and output data types are not supported, set both the input and output data types to float16. If the float16 input and output data types are not supported, an error is reported. For vector compute operators, the operator precision in the original graph is float16 or bfloat16, and float32 is forcibly selected. This option is invalid if the original graph contains operators not supporting float32 in the AI Core, for example, an operator that supports only float16. In this case, float16 is retained. 
If the operator in the AI Core does not support float32 and it is added to the blocklist of precision reduction (by setting precision_reduce to false), the counterpart AI CPU operator supporting float32 is used. If the AI CPU operator does not support float32, an error is reported. mixed_float16 Mixed precision of float16, bfloat16, and float32 is used for neural network processing. For float32 and bfloat16 operators in the original graph, float16 is automatically used for certain float32 and bfloat16 operators based on the built-in tuning policy. This will improve system performance and reduce memory usage with minimal precision degradation. Use the mixed precision mode in conjunction with loss scaling to compensate for the accuracy degradation caused by precision reduction. mixed_bfloat16 Mixed precision of bfloat16 and float32 is used for neural network processing. In this mode, bfloat16 is automatically used for certain float32 operators in the original graph based on the built-in tuning policy. This will improve system performance and reduce memory usage with minimal precision degradation. If the operators do not support bfloat16 and float32, the AI CPU operators are used for computation. If the AI CPU operators also do not support bfloat16 and float32, an error is reported during execution. Note: This configuration is supported only by Atlas A3 training products/Atlas A3 inference products and Atlas A2 training products/Atlas A2 inference products. Default value: For Atlas A3 training products/Atlas A3 inference products, the default value is origin. For Atlas A2 training products/Atlas A2 inference products, the default value is origin. For Atlas training products, this parameter does not have a default value. The default value of the precision_mode parameter is used, that is, allow_fp32_to_fp16. Example: npu.global_options().precision_mode_v2="origin" NOTE: This option cannot be used together with precision_mode. precision_mode_v2 is recommended.
This option can be used to set the global precision mode of a network, but it may result in performance or precision problems on particular operators. In this case, you are advised to call npu.keep_dtype_scope to keep the precision of some operators unchanged. For details about the built-in tuning policy of each operator in mixed precision mode, see the description of the modify_mixlist option. The Atlas training products do not support the bfloat16 data type.
precision_mode	A string for the operator precision mode. allow_fp32_to_fp16 For matrix operators: If the operator precision in the original graph is float32, the precision is preferably reduced to float16. If the operator in the AI Core does not support float16, float32 is used. If the operator in the AI Core does not support float32, the AI CPU operator is used for computation. If the AI CPU operator also does not support float32, an error is reported during execution. If the operator precision in the original graph is bfloat16, the precision of the original graph is preferably used. If the operator in the AI Core does not support bfloat16, float32 is used. If the operator in the AI Core does not support float32, the precision is directly reduced to float16. If the operator in the AI Core does not support float16, the AI CPU operator is used for computation. If the AI CPU operator also does not support float16, an error is reported during execution. For vector operators, the precision of the original graph is retained preferably. If the operator precision in the original graph is float32, the precision of the original graph is preferably used. If the operator in the AI Core does not support float32, the precision is directly reduced to float16. If the operator in the AI Core does not support float16, the AI CPU operator is used for computation. If the AI CPU operator also does not support float16, an error is reported during execution. If the operator precision in the original graph is bfloat16, the precision of the original graph is preferably used. If the operator in the AI Core does not support bfloat16, float32 is used. If the operator in the AI Core does not support float32, the precision is directly reduced to float16. If the operator in the AI Core does not support float16, the AI CPU operator is used for computation. If the AI CPU operator also does not support float16, an error is reported during execution. 
force_fp16 Forces float16 for operators supporting float16, bfloat16, and float32. This parameter applies only to online inference scenarios. force_fp32/cube_fp16in_fp32out force_fp32 and cube_fp16in_fp32out have the same effect. This option indicates that the system selects different processing modes based on the operator type when the operator in the AI Core supports both the float32 and float16 data types. cube_fp16in_fp32out is newly added to the new version. For cube operators, this option has clearer semantics. For cube operators, the system processes the computation based on the operator implementation. The preferred input data type is float16 and the output data type is float32. If the float16 input data and float32 output data types are not supported, set both the input and output data types to float32. If the float32 input and output data types are not supported, set both the input and output data types to float16. If the float16 input and output data types are not supported, an error is reported. For vector compute operators, the operator precision in the original graph is float16 or bfloat16, and float32 is forcibly selected. This option is invalid if the original graph contains operators not supporting float32 in the AI Core, for example, an operator that supports only float16. In this case, float16 is retained. If the operator in the AI Core does not support float32 and it is configured to the blocklist of precision reduction (by setting precision_reduce to false), the counterpart AI CPU operator supporting float32 is used. If the AI CPU operator does not support float32, an error is reported. must_keep_origin_dtype Retains the original precision. If the precision of an operator in the original graph is float16, and the implementation of the operator in the AI Core does not support float16 but supports only float32 and bfloat16, the system automatically uses high-precision float32. 
If the precision of an operator in the original graph is float16, and the implementation of the operator in the AI Core does not support float16 but supports only bfloat16, the AI CPU operator of float16 is used. If the AI CPU operator is not supported, an error is reported. If the precision of an operator in the original graph is float32, and the implementation of the operator in the AI Core does not support float32 but supports only float16, the AI CPU operator of float32 is used. If the AI CPU operator is not supported, an error is reported. allow_mix_precision_fp16/allow_mix_precision allow_mix_precision has the same effect as that of allow_mix_precision_fp16, indicating that mixed precision of float16, bfloat16, and float32 is used for neural network processing. allow_mix_precision_fp16 is newly added to the new version, which has clearer semantics for easy understanding. For float32 and befloat16 operators in the original model, float16 is automatically used for certain float32 and bfloat16 operators based on the built-in tuning policy. This will improve system performance and reduce memory usage with minimal precision degradation. allow_mix_precision_bf16 Mixed precision of bfloat16 and float32 is used for neural network processing. In this mode, bfloat16 is automatically used for certain float32 operators on the original model based on the built-in tuning policy. This will improve system performance and reduce memory usage with minimal precision degradation. If the operator in the AI Core does not support bfloat16 and float32, the AI CPU operator is used for computation. If AI CPU operator also does not support bfloat16 and float32, an error is reported during execution. Note: This configuration is supported only by Atlas A3 training products/Atlas A3 inference productsAtlas A2 training products/Atlas A2 inference products. allow_fp32_to_bf16 If the operator precision in the original graph is float32, the precision of the original graph is preferably used. 
If the operator in the AI Core does not support float32, the precision is reduced to bfloat16. If the operator in the AI Core does not support bfloat16, the AI CPU operator is used for computation. If the AI CPU operator also does not support bfloat16, an error is reported during execution. If the operator precision in the original graph is bfloat16, the precision of the original graph is preferably used. If the operator in the AI Core does not support bfloat16, float32 is used. If the operator in the AI Core does not support float32, the AI CPU operator is used for computation. If the AI CPU operator also does not support float32, an error is reported during execution. Note: This configuration is supported by Atlas A3 training products/Atlas A3 inference products and Atlas A2 training products/Atlas A2 inference products. Default value: must_keep_origin_dtype for the Atlas A3 training products/Atlas A3 inference products and for the Atlas A2 training products/Atlas A2 inference products; allow_fp32_to_fp16 for the Atlas training products. Example: npu.global_options().precision_mode="allow_mix_precision" NOTE: This option cannot be used together with precision_mode_v2. precision_mode_v2 is recommended. This option can be used to set the global precision mode of a network, but it may result in performance or precision problems on particular operators. In this case, you are advised to call npu.keep_dtype_scope to keep the precision of some operators unchanged. For details about the built-in tuning policy of each operator in mixed precision mode, see the description of the modify_mixlist option. The Atlas training products do not support the bfloat16 data type.
modify_mixlist	When mixed precision is enabled, you can use this parameter to specify the path and file name of the blocklist, trustlist, and graylist, and specify the operators that allow precision reduction and those that do not allow precision reduction. You can enable the mixed precision by configuring precision_mode_v2 (recommended) or precision_mode in the script. The blocklist, trustlist, and graylist storage files are in JSON format. A configuration example is as follows: npu.global_options().modify_mixlist="/home/test/ops_info.json" Specify the operator type (or types separated by commas) in ops_info.json as follows. { "black-list": { // Blocklist "to-remove": [ // Move an operator from the blocklist to the graylist. "Xlog1py" ], "to-add": [ // Move an operator from the trustlist or graylist to the blocklist. "MatMul", "Cast" ] }, "white-list": { // Trustlist "to-remove": [ // Move an operator from the trustlist to the graylist. "Conv2D" ], "to-add": [ // Move an operator from the blocklist or graylist to the trustlist. "Bias" ] } } Note: The operators in the preceding example configuration file are for reference only. The configuration should be based on the actual hardware environment and the built-in tuning policies of the operators. You can query the built-in tuning policy of each operator in mixed precision mode in CANN software installation directory /opp/built-in/op_impl/ai_core/tbe/config/<soc_version>/aic-<soc_version>-ops-info.json. For example: "Conv2D":{ "precision_reduce":{ "flag":"true" }, ... } true (trustlist): The precision of operators on the trustlist can be reduced in mixed precision mode. false (blocklist): The precision of operators on the blocklist cannot be reduced in mixed precision mode. Not specified (graylist): Follows the same mixed precision processing as the upstream operator.
customize_dtypes	If precision_mode is used to set the global precision mode of a network, precision problems may occur on particular operators. In this case, you can use customize_dtypes to configure the precision of these operators, while other operators are still compiled using the precision mode specified by precision_mode. Note that if precision_mode is set to must_keep_origin_dtype, customize_dtypes does not take effect. Set it to the path of the configuration file (including the file name), for example, /home/test/customize_dtypes.cfg. Example: npu.global_options().customize_dtypes = "/home/test/customize_dtypes.cfg" List the names or types of operators whose precision needs customization in the configuration file. Each operator occupies a line, and the operator type must be defined based on Ascend IR. If both operator name and type are configured for an operator, the operator name applies during building. The structure of the configuration file is as follows: # By operator name Opname1::InputDtype:dtype1,dtype2,...OutputDtype:dtype1,... Opname2::InputDtype:dtype1,dtype2,...OutputDtype:dtype1,... # By operator type OpType::TypeName1:InputDtype:dtype1,dtype2,...OutputDtype:dtype1,... OpType::TypeName2:InputDtype:dtype1,dtype2,...OutputDtype:dtype1,... Example: # By operator name resnet_v1_50/block1/unit_3/bottleneck_v1/Relu::InputDtype:float16,int8,OutputDtype:float16,int8 # By operator type OpType::Relu:InputDtype:float16,int8,OutputDtype:float16,int8 NOTE: You can find the operator precisions supported in the operator information library, which is saved in opp/built-in/op_impl/ai_core/tbe/config/${soc_version}/aic-${soc_version}-ops-info.json under the CANN directory by default. The data type specified by this option takes precedence, which may cause accuracy or performance degradation. If the specified data type is not supported, the build fails. 
If the configuration is performed based on the operator name, the operator name may change due to operations such as fusion and splitting during model compilation. As a result, the configuration does not take effect and the precision is not improved. In this case, you need to obtain logs to locate the fault. For details about the logs, see Log Reference.
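
The modify_mixlist JSON file and the customize_dtypes file described above can be generated from a script instead of being edited by hand. The following is a minimal Python sketch under stated assumptions: the helper names render_mixlist and dtype_line are hypothetical and are not part of the npu API, and the key names and line layout are copied from the examples above. The // comments of the JSON example are left out because standard JSON parsers reject comments.

```python
import json


def render_mixlist(blocklist_add=(), blocklist_remove=(),
                   trustlist_add=(), trustlist_remove=()):
    """Return the text of a modify_mixlist file (hypothetical helper).

    Key names ("black-list", "white-list", "to-add", "to-remove") follow
    the ops_info.json example in the modify_mixlist description.
    """
    overlap = set(blocklist_add) & set(trustlist_add)
    if overlap:
        # Assumption: adding one operator to both lists is a mistake.
        raise ValueError(f"operators added to both lists: {sorted(overlap)}")
    config = {
        "black-list": {
            "to-remove": list(blocklist_remove),
            "to-add": list(blocklist_add),
        },
        "white-list": {
            "to-remove": list(trustlist_remove),
            "to-add": list(trustlist_add),
        },
    }
    return json.dumps(config, indent=2) + "\n"


def dtype_line(target, input_dtypes, output_dtypes, by_type=False):
    """Return one customize_dtypes line (hypothetical helper).

    target is an operator name, or an operator type when by_type is True.
    The layout mirrors the Relu examples in the customize_dtypes description.
    """
    prefix = f"OpType::{target}:" if by_type else f"{target}::"
    return (f"{prefix}InputDtype:{','.join(input_dtypes)},"
            f"OutputDtype:{','.join(output_dtypes)}")


mixlist_text = render_mixlist(blocklist_add=["MatMul", "Cast"],
                              trustlist_add=["Bias"])
relu_line = dtype_line("Relu", ["float16", "int8"], ["float16", "int8"],
                       by_type=True)
# Write mixlist_text and relu_line to files, then point the options at them:
# npu.global_options().modify_mixlist = "/home/test/ops_info.json"
# npu.global_options().customize_dtypes = "/home/test/customize_dtypes.cfg"
```

The operators used here are the illustrative ones from the example above; choose the real ones based on the built-in tuning policy of each operator.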

Accuracy Comparison

Option	Description
dump_config.enable_dump	Data dump enable. True: enabled. The dump file path is read from dump_path. False (default): disabled. NOTE: Data dump and overflow/underflow data collection cannot be enabled at the same time. That is, dump_config.enable_dump and dump_config.enable_dump_debug cannot be set to True at the same time. If either dump_config.enable_dump or dump_config.enable_dump_debug is set to True and enable_exception_dump is set to 1 (indicating that common ExceptionDump function is enabled): For dynamic-shape networks, only enable_exception_dump takes effect. For static-shape networks, enable_exception_dump and either of dump_config.enable_dump and dump_config.enable_dump_debug take effect. Example: npu.global_options().dump_config.enable_dump=True
dump_config.dump_mode	Data dump mode. The values are as follows: input: dumps only operator inputs. output (default): dumps only operator outputs. all: dumps both operator inputs and outputs. NOTE: If this option is set to all, the input data of some operators, such as collective communication operators HcomAllGather and HcomAllReduce, will be modified during operator execution. Therefore, the system dumps the operator input before operator execution and dumps the operator output after execution. In this way, the dumped input and output data of the same operator is flushed to disks separately, and multiple dump files are generated. After parsing the dump files, you can determine whether the data is an input or output based on the file content. Example: npu.global_options().dump_config.dump_mode="all"
dump_config.dump_path	Dump path. This option is required when enable_dump or enable_dump_debug is set to True. Create the specified path in advance in the environment (either container or host) where training is performed. The running user configured during installation must have the read and write permissions on this path. The value can be an absolute path or a path relative to the path where the training script is executed. An absolute path starts with a slash (/), for example, /home/HwHiAiUser/output. A relative path starts with a directory name, for example, output. Example: npu.global_options().dump_config.dump_path = "/home/HwHiAiUser/output"
dump_config.dump_step	Iterations to dump. Separate multiple iterations using vertical bars (|), for example, 0|5|10. You can also use hyphens (-) to specify the iteration range, for example, 0|3-5|10. If this option is not set, dump data of all iterations is collected. Example: npu.global_options().dump_config.dump_step="0|5"
dump_config.dump_data	Type of operator content to dump. tensor (default): dumps operator data. stats: dumps operator statistics. The result file is in .csv format. In large-scale training scenarios, dumping a large amount of data takes a long time. You can dump the statistics of all operators, identify the operators that may be abnormal based on the statistics, and then dump the input or output data of these abnormal operators. Example: npu.global_options().dump_config.dump_data = "stats"
dump_config.dump_layer	Name of the operator to dump. Multiple operator names are separated by spaces. If this option is not set, all operators are dumped by default. If the input of the specified operator involves the data operator, the data operator information is also dumped. Example: npu.global_options().dump_config.dump_layer = "nodename1 nodename2 nodename3"
dump_config.enable_dump_debug	Overflow/underflow data collection enable. True: enabled. The dump file path is read from dump_path. An abnormality occurs if dump_path is None. False (default): disabled. NOTE: Data dump and overflow/underflow data collection cannot be enabled at the same time. That is, dump_config.enable_dump and dump_config.enable_dump_debug cannot be set to True at the same time. If either dump_config.enable_dump or dump_config.enable_dump_debug is set to True and enable_exception_dump is set to 1 (indicating that common ExceptionDump function is enabled): For dynamic-shape networks, only enable_exception_dump takes effect. For static-shape networks, enable_exception_dump and either of dump_config.enable_dump and dump_config.enable_dump_debug take effect. Example: npu.global_options().dump_config.enable_dump_debug=True
dump_config.dump_debug_mode	Overflow/Underflow detection mode. The values are as follows: aicore_overflow: detects AI Core operator overflow/underflow, that is, detecting whether abnormal extreme values (such as 65500, 38400, and 51200 in float16) are output with normal inputs. Once such fault is detected, analyze the cause of the overflow/underflow and modify the operator implementation based on the network requirements and operator logic. atomic_overflow: detects Atomic Add overflow/underflow. Atomic Add overflow/underflow is detected when data is transferred from the UB to OUT after AI Core computation. all: detects overflow/underflow of both AI Core operators and Atomic Add. The default value is all. NOTE: For Atlas A2 training products/Atlas A2 inference products, only the default value all can be used. Example: npu.global_options().dump_config.dump_debug_mode="aicore_overflow"
fusion_switch_file	Directory of the fusion switch configuration file, including the file name. The value can contain letters, digits, underscores (_), hyphens (-), and periods (.). The built-in graph fusion and UB fusion patterns are enabled by default. You can disable selected fusion patterns in the configuration file as needed. For details about fusion patterns that can be disabled, see Graph Fusion and UB Fusion Patterns. Example: npu.global_options().fusion_switch_file="/home/test/fusion_switch.cfg" The following is a template of the fusion_switch.cfg configuration file. on indicates that a fusion pattern is enabled, and off indicates that a fusion pattern is disabled. { "Switch":{ "GraphFusion":{ "RequantFusionPass":"on", "ConvToFullyConnectionFusionPass":"off", "SoftmaxFusionPass":"on", "NotRequantFusionPass":"on", "ConvConcatFusionPass":"on", "MatMulBiasAddFusionPass":"on", "PoolingFusionPass":"on", "ZConcatv2dFusionPass":"on", "ZConcatExt2FusionPass":"on", "TfMergeSubFusionPass":"on" }, "UBFusion":{ "TbePool2dQuantFusionPass":"on" } } } To disable all fusion patterns at a time, refer to this configuration file example. { "Switch":{ "GraphFusion":{ "ALL":"off" }, "UBFusion":{ "ALL":"off" } } } Notes: Some built-in fusion patterns are not switchable due to functionality restrictions and these fusion patterns will remain enabled despite user's switch settings. To disable all fusion patterns except selected ones, refer to the following example. { "Switch":{ "GraphFusion":{ "ALL":"off", "SoftmaxFusionPass":"on" }, "UBFusion":{ "ALL":"off", "TbePool2dQuantFusionPass":"on" } } }
quant_dumpable	If the TensorFlow network is quantized by the AMCT tool, this option can be used to specify whether to collect the dump data before quantization. The default value is 0. 0: disabled. The input and output before quantization may be optimized during graph compilation. In this case, the dump data before quantization cannot be obtained. 1: enabled. The dump data before quantization can be collected. Example: npu.global_options().quant_dumpable="1" NOTE: This option applies only to online inference scenarios. When data dump is enabled, you can set this option to 1 to ensure that the dump data before quantization can be collected.
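
The dump options above interact in two ways that are easy to get wrong: the dump_step string combines single iterations and ranges, and enable_dump and enable_dump_debug cannot both be True. The following Python sketch checks both before the options are set. The helper names expand_dump_step and check_dump_flags are hypothetical, and treating a range such as 3-5 as inclusive of both ends is an assumption, not something the description states.

```python
def expand_dump_step(spec):
    """Expand a dump_step string such as "0|3-5|10" into a sorted list.

    Assumption: a range "a-b" covers iterations a through b inclusive.
    """
    steps = set()
    for part in spec.split("|"):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            steps.update(range(start, end + 1))
        else:
            steps.add(int(part))
    return sorted(steps)


def check_dump_flags(enable_dump, enable_dump_debug, dump_path):
    """Raise ValueError for combinations the descriptions above rule out."""
    if enable_dump and enable_dump_debug:
        raise ValueError(
            "enable_dump and enable_dump_debug cannot both be True")
    if (enable_dump or enable_dump_debug) and not dump_path:
        raise ValueError("dump_path is required when dumping is enabled")


iterations = expand_dump_step("0|3-5|10")
check_dump_flags(enable_dump=True, enable_dump_debug=False,
                 dump_path="/home/HwHiAiUser/output")
# After the checks pass, the options can be set as in the examples above:
# npu.global_options().dump_config.enable_dump = True
# npu.global_options().dump_config.dump_path = "/home/HwHiAiUser/output"
# npu.global_options().dump_config.dump_step = "0|3-5|10"
```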

Performance Tuning

Option	Description
hcom_parallel	Whether to run the AllReduce gradient update in parallel with forward and backward propagation. True (default): enabled. False: disabled. Example: npu.global_options().hcom_parallel=True NOTE: For a small network (for example, ResNet18), you are advised to set this option to False.
enable_small_channel	Whether to enable small channel optimization. If it is enabled, performance benefits are yielded at the convolutional layers with channel size <= 4. 0: disabled. This is the default in the training scenario (graph_run_mode is 1), where you are advised not to enable this function. 1: enabled. This is the default in the online inference scenario (graph_run_mode is 0) and cannot be modified there. NOTE: After this function is enabled, performance benefits can be obtained on the ResNet50, ResNet101, and ResNet152 networks. For other networks, the performance may deteriorate. Example: npu.global_options().enable_small_channel=1
op_precision_mode	High-precision or high-performance mode of an operator. You can pass a custom mode configuration file op_precision.ini to set different modes for operators. You can set this option by operator type (low priority) or node name (high priority). Example: [ByOpType] optype1=high_precision optype2=high_performance optype3=enable_hi_float_32_execution optype4=support_out_of_bound_index [ByNodeName] nodename1=high_precision nodename2=high_performance nodename3=enable_hi_float_32_execution nodename4=support_out_of_bound_index high_precision: high precision. high_performance: high performance. enable_float_32_execution: The FP32 data type is used for internal processing of operators. In this scenario, the FP32 data type is not automatically converted to the HF32 data type. If you are using the HF32 data type for computation and find that the accuracy drop exceeds your expectation, enable this option to specify the use of FP32 for internal computation of certain operators in order to maintain accuracy. This option is supported only by the following products: Atlas A3 training products/Atlas A3 inference products Atlas A2 training products/Atlas A2 inference products enable_hi_float_32_execution: The HF32 data type is used for internal processing of operators. After this option is enabled, the FP32 data type is automatically converted to the HF32 data type. This configuration can reduce the space occupied by data and improve performance. This option is not supported in the current version. support_out_of_bound_index: The out-of-bounds verification is performed on the indices of the gather, scatter, and segment operators. The verification deteriorates the operator execution performance. keep_fp16: The FP16 data type is used for internal operator processing. In this mode, FP16 is not automatically converted to FP32. If FP32 computation fails to meet performance expectations and high accuracy is not required, you can enable the keep_fp16 mode. 
This low-precision mode trades accuracy for performance and is not recommended. super_performance: ultra-high performance. Compared with high performance, the algorithm calculation formula is optimized. You can view the precision and performance mode supported by an operator in the opp/built-in/op_impl/ai_core/tbe/impl_mode/all_ops_impl_mode.ini file of the CANN component directory. This option is mutually exclusive with op_select_implmode and optypelist_for_implmode. If they are all specified, op_precision_mode takes precedence. Generally, you do not need to set this option. It is used if you need to adjust the precision of a specific operator using the configuration .ini file in the case that you fail to obtain optimal network performance or accuracy in the high-performance or high-precision mode. Example: npu.global_options().op_precision_mode="/home/test/op_precision.ini"
stream_max_parallel_num	This option applies only to neural machine translation (NMT) networks. Degree of parallelism of the AI CPU and AI Core engines, used to execute AI CPU and AI Core operators in parallel. The value range is [1, 13]. Defaults to 1. Example: npu.global_options().stream_max_parallel_num="DNN_VM_AICPU:10,AIcoreEngine:1" In this example, DNN_VM_AICPU is the name of the AI CPU engine, which runs 10 concurrent tasks, and AIcoreEngine is the name of the AI Core engine, which runs 1 concurrent task.
is_tailing_optimization	This option applies only to Bidirectional Encoder Representations from Transformers (BERT) networks. Whether to enable communication tailing optimization in distributed training scenarios to improve performance. By changing a computation dependency, a computation operation that does not depend on the last AR (gradient aggregation fragment) is scheduled in parallel with the last AR, which shortens the communication tail. Values: True: enabled. False (default): disabled. Example: npu.global_options().is_tailing_optimization=True
enable_scope_fusion_passes	Fusion pattern (or fusion patterns separated by commas) to take effect at build time. Scope fusion patterns (either built-in or custom) are classified into the following two types: General scope fusion patterns: applicable to all networks. They are enabled by default and cannot be manually disabled. Non-general scope fusion patterns: applicable to specific networks. By default, they are disabled. You can use enable_scope_fusion_passes to enable selected fusion patterns. Example: npu.global_options().enable_scope_fusion_passes="ScopeLayerNormPass,ScopeClipBoxesPass"
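
The op_precision.ini file passed to op_precision_mode has two sections, [ByOpType] and [ByNodeName], each holding name=mode pairs. The following Python sketch renders such a file and rejects mode names that are not listed in the description above. The helper name render_op_precision_ini is hypothetical, and the set of accepted modes is simply the set named in this document; which modes a given operator actually supports has to be checked in all_ops_impl_mode.ini.

```python
# Mode names taken from the op_precision_mode description above.
ALLOWED_MODES = {
    "high_precision",
    "high_performance",
    "enable_float_32_execution",
    "enable_hi_float_32_execution",
    "support_out_of_bound_index",
    "keep_fp16",
    "super_performance",
}


def render_op_precision_ini(by_op_type=None, by_node_name=None):
    """Return the text of an op_precision.ini file (hypothetical helper).

    by_op_type and by_node_name map an operator type or node name to a mode.
    Empty sections are omitted from the output.
    """
    sections = (("ByOpType", by_op_type or {}),
                ("ByNodeName", by_node_name or {}))
    lines = []
    for title, entries in sections:
        if not entries:
            continue
        lines.append(f"[{title}]")
        for name, mode in entries.items():
            if mode not in ALLOWED_MODES:
                raise ValueError(f"unknown mode for {name!r}: {mode!r}")
            lines.append(f"{name}={mode}")
    return "\n".join(lines) + "\n"


ini_text = render_op_precision_ini(
    by_op_type={"optype1": "high_precision"},
    by_node_name={"nodename1": "high_performance"},
)
# Write ini_text to a file, then point the option at it:
# npu.global_options().op_precision_mode = "/home/test/op_precision.ini"
```

Node-name entries are written after type entries only for readability; per the description above, a node-name setting has higher priority than a type setting regardless of order.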

Profiling

Option	Description
profiling_config.enable_profiling	Profiling enable. True: enabled. The profiling options are determined by profiling_options. False (default): disabled. Example: npu.global_options().profiling_config.enable_profiling=True Note: The priority of this configuration item is higher than that of the environment variable PROFILING_MODE. For details about the environment variable, see "Profile Data Collection" in Environment Variables.
profiling_config.profiling_options	Sets Profiling options. output: path for storing profiling result files. Both absolute path and relative path (relative to the path where the command is run) are supported. The path cannot contain the following special characters: "\n", "\f", "\r", "\b", "\t", "\v", and "\u007F". An absolute path starts with a slash (/), for example, /home/output. A relative path starts with a directory name, for example, output. It takes precedence over ASCEND_WORK_PATH. This path does not need to be created in advance because it is automatically created during collection. storage_limit: maximum size of files that can be stored in a specified disk directory. If the size of profile data files in the disk is about to use up the maximum storage space specified by this option or the total remaining disk space is about to be used up (remaining space ≤ 20 MB), the earliest files in the disk are aged and deleted. The value range is [200, 4294967295], and the unit is MB. The unit must be included when you set this parameter, for example, 200 MB. If this parameter is not set, the default value is 90% of the available space of the disk where the directory for storing profile data files is located. training_trace: iteration tracing switch. Collects software profile data of a training job and the AI Software Stack to profile the training job. Focuses on forward and backward propagation, and gradient aggregation and update. This option must be set to on when the forward and backward propagation operator data is collected. task_trace and task_time: switches that control collection of the operator delivery and execution durations. Related duration data must be output to the task_time, op_summary, and op_statistic files. Possible configuration values are as follows: on: switch on. This is the default value, delivering the same effect as l1. off: switch off. l0: collects operator delivery and execution duration data. 
Compared with l1, l0 does not collect basic operator information, so the performance overhead during collection is smaller, and this enables more accurate collection of statistics on time duration data. l1: collects operator delivery and execution duration data, as well as basic operator information, to provide more comprehensive performance analysis data. When Profiling is enabled to collect training data, task_trace and training_trace must be set to on. ge_api: switch that controls collection of the time consumption data of dynamic-shape operators in the host scheduling phase. Possible values are: off: switch off. The default value is off. l0: collects the time consumption data of dynamic-shape operators in the main host scheduling phase to facilitate accurate statistics. l1: collects finer-grained time consumption data of dynamic-shape operators in the host scheduling phase to provide more comprehensive performance analysis data. hccl: communication data collection switch, either on or off (default). NOTE: This switch will be deprecated in later versions. To control data collection, use task_time. aicpu: whether to collect details about the AI CPU operator, such as the operator execution time and data copy time. The value can be on or off (default). fp_point: start point of the forward propagated operator in iteration traces, to record the start timestamp of forward propagation. Set the value to the name of the top operator in forward propagation. You can save the graph as a .pbtxt file by using tf.io.write_graph in the training script to obtain the name. Alternatively, you can leave this option empty (for example, "fp_point":""), and the system will automatically identify the start point of the forward propagated operator. bp_point: end point of the backward propagated operator in iteration traces, to record the end timestamp of backward propagation. bp_point and fp_point are used to compute the time used by forward and backward propagation. 
Set the value to the name of the bottom operator in backward propagation. You can save the graph as a .pbtxt file by using tf.io.write_graph in the training script to obtain the name. Alternatively, you can leave this option empty (for example, "bp_point":""), and the system will automatically identify the end point of the backward propagated operator. aic_metrics: AI Core metric to profile. The options are as follows: ArithmeticUtilization: arithmetic utilization ratio. PipeUtilization (default): ratio of time taken by the compute units to that of MTEs. Memory: ratio of external memory read/write instructions. MemoryL0: ratio of internal memory L0 read/write instructions. MemoryUB: ratio of internal memory UB read/write instructions. ResourceConflictRatio: ratio of pipeline queue instructions. L2Cache: read/write L2 cache hits and re-allocations after cache misses. Atlas inference products: This parameter is not supported. Atlas training products: This parameter is not supported. MemoryAccess: bandwidth of the operator's memory access on cores. Atlas inference products: This parameter is not supported. Atlas training products: This parameter is not supported. NOTE: The registers whose data is to be collected can be customized, for example, "aic_metrics":"Custom:0x49,0x8,0x15,0x1b,0x64,0x10". The Custom field indicates the customization type. It is set to specific register values in the range of [0x1, 0x6E]. A maximum of eight registers can be configured, which are separated with commas (,). The register value can be in hexadecimal or decimal format. l2: switch that controls collection of the L2 cache and TLB page table cache hit ratio, either on or off (default). Atlas inference products: supports collection of the L2 cache hit ratio. Atlas training products: supports collection of the L2 cache hit ratio. Atlas A2 training products/Atlas A2 inference products: supports collection of L2 cache and TLB page table cache hit ratio. 
aic-metrics=L2Cache is recommended for analyzing the number of hits on L2 cache from the AI Core. Atlas A3 training products/Atlas A3 inference products: supports collection of L2 cache and TLB page table cache hit ratio. aic-metrics=L2Cache is recommended for analyzing the number of hits on L2 cache from the AI Core. msproftx: switch that controls the msproftx user and upper-layer framework program to output profile data, either on or off (default). Add the following mstx API or msproftx API to the application script. The mstx API is recommended. runtime_api: runtime API data collection switch, either on or off (default). You can collect runtime API profile data, including the synchronous/asynchronous memory replication latencies between the host and device and between devices. sys_hardware_mem_freq: switch that controls the collection of the on-chip memory, QoS transmission bandwidth, LLC L3 cache bandwidth, accelerator bandwidth, SoC transmission bandwidth, and component memory usage. The collected content varies depending on the product. The actual result prevails. The value range is [1, 100], in Hz. Sampling memory data in the environment where glibc (2.34 or earlier) is installed may trigger a known Bug 19329. This problem can be solved by upgrading the glibc version. NOTE: For the following products, you are advised not to increase the profiling frequency after the profiling task is complete. Otherwise, SoC transmission bandwidth data may be lost. Atlas 200I/500 A2 inference products Atlas A2 training products/Atlas A2 inference products Atlas A3 training products/Atlas A3 inference products llc_profiling: LLC events to profile. Possible values are as follows: read (default): read events, that is, the L3 cache read rate. write: write events, that is, the L3 cache write rate. sys_io_sampling_freq: NIC and ROCE data collection frequency. The value range is [1, 100], in Hz. Atlas inference products: This parameter is not supported. 
Atlas A2 training products/Atlas A2 inference products: supports NIC and RoCE collection. Atlas A3 training products/Atlas A3 inference products: supports NIC and RoCE collection. sys_interconnection_freq: frequency of collecting collective communication bandwidth data (HCCS), SIO data, PCIe data and inter-chip transmission bandwidth information. The value range is [1, 50], in Hz. Atlas training products: supports HCCS and PCIe data collection. Atlas A2 training products/Atlas A2 inference products: supports HCCS, PCIe data, and inter-chip transmission bandwidth information collection. Atlas A3 training products/Atlas A3 inference products: supports HCCS, PCIe data, inter-chip transmission bandwidth information, and SIO data collection. dvpp_freq: DVPP collection frequency. The value range is [1, 100], in Hz. instr_profiling: AI Core and AI Vector bandwidth and latency collection switch. The value can be on or off (default). Atlas training products: This function is not supported. Atlas A2 training products/Atlas A2 inference products: This switch is not supported. This function is controlled through instr_profiling_freq. Atlas A3 training products/Atlas A3 inference products: This switch is not supported. This function is controlled through instr_profiling_freq. instr_profiling_freq: AI Core and AI Vector bandwidth and latency collection switch. If the collection frequency is configured, the related collection capability is enabled. The value range is [300, 30000]. The unit is Hz. Atlas training products: This function is not supported. Atlas A2 training products/Atlas A2 inference products: supported. However, instr_profiling_freq is mutually exclusive with training_trace, task_trace, hccl, aicpu, fp_point, bp_point, aic_metrics, l2, task_time, and runtime_api, so they cannot be executed at the same time. Atlas A3 training products/Atlas A3 inference products: supported. 
However, instr_profiling_freq is mutually exclusive with training_trace, task_trace, hccl, aicpu, fp_point, bp_point, aic_metrics, l2, task_time, and runtime_api, so they cannot be executed at the same time. host_sys: switch for collecting host profile data. You can select one or more options and separate them with commas (,), for example, "host_sys": "cpu,mem". cpu: process CPU utilization. mem: process memory utilization. host_sys_usage: Host-side system and process CPU and memory data collection option, selected from cpu and mem. You can select one or more options and separate them with commas (,). host_sys_usage_freq: Host-side system and process CPU and memory data collection frequency. The value range is [1, 50] and the default value is 50. The unit is Hz. NOTE: fp_point and bp_point must be configured manually only in the dynamic shape scenario. Online inference supports task_trace and aicpu but does not support training_trace. Example: npu.global_options().profiling_config.profiling_options = '{"output":"/tmp/profiling","training_trace":"on","fp_point":"resnet_model/conv2d/Conv2Dresnet_model/batch_normalization/FusedBatchNormV3_Reduce","bp_point":"gradients/AddN_70"}'
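
Because profiling_options is a JSON string, it can be built from a dictionary rather than written by hand, which avoids quoting mistakes. The following Python sketch does that and also enforces the instr_profiling_freq restriction stated above. The helper name build_profiling_options is hypothetical, and the check covers only this one restriction, not per-product support or value ranges.

```python
import json

# Options that the description above lists as mutually exclusive
# with instr_profiling_freq.
INSTR_FREQ_EXCLUSIVE = {
    "training_trace", "task_trace", "hccl", "aicpu", "fp_point",
    "bp_point", "aic_metrics", "l2", "task_time", "runtime_api",
}


def build_profiling_options(output, **switches):
    """Return a profiling_options JSON string (hypothetical helper).

    output is the result path; switches are option names and values as
    described above, for example training_trace="on".
    """
    options = {"output": output, **switches}
    if "instr_profiling_freq" in options:
        clash = sorted(INSTR_FREQ_EXCLUSIVE.intersection(options))
        if clash:
            raise ValueError(
                f"instr_profiling_freq cannot be combined with: {clash}")
    # Compact separators reproduce the style of the example string above.
    return json.dumps(options, separators=(",", ":"))


profiling_options = build_profiling_options(
    "/tmp/profiling",
    training_trace="on",
    task_trace="on",
    fp_point="",   # left empty so the start point is identified automatically
    bp_point="",   # left empty so the end point is identified automatically
)
# npu.global_options().profiling_config.enable_profiling = True
# npu.global_options().profiling_config.profiling_options = profiling_options
```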

Option

Description

profiling_config.enable_profiling

Profiling enable.

True: enabled. The profiling options are determined by profiling_options.
False (default): disabled.

Example:

npu.global_options().profiling_config.enable_profiling=True

Note: The priority of this configuration item is higher than that of the environment variable PROFILING_MODE. For details about the environment variable, see "Profile Data Collection" in Environment Variables.

profiling_config.profiling_options

Sets Profiling options.

output: path for storing profiling result files. Both absolute path and relative path (relative to the path where the command is run) are supported. The path cannot contain the following special characters: "\n", "\f", "\r", "\b", "\t", "\v", and "\u007F".
- An absolute path starts with a slash (/), for example, /home/output.
- A relative path starts with a directory name, for example, output.
- It takes precedence over ASCEND_WORK_PATH.
- This path does not need to be created in advance because it is automatically created during collection.
storage_limit: maximum size of files that can be stored in a specified disk directory. If the size of profile data files in the disk is about to use up the maximum storage space specified by this option or the total remaining disk space is about to be used up (remaining space ≤ 20 MB), the earliest files in the disk are aged and deleted.
The value range is [200, 4294967295], and the unit is MB. The unit must be included when you set this parameter, for example, 200 MB.

If this parameter is not set, the default value is 90% of the available space of the disk where the directory for storing profile data files is located.
training_trace: iteration tracing switch. Collects software profile data of a training job and the AI Software Stack to profile the training job. Focuses on forward and backward propagation, and gradient aggregation and update. This option must be set to on when the forward and backward propagation operator data is collected.
task_trace and task_time: switches that control collection of the operator delivery and execution durations. Related duration data must be output to the task_time, op_summary, and op_statistic files. Possible configuration values are as follows:
- on: switch on. This is the default value, delivering the same effect as l1.
- off: switch off.
- l0: collects operator delivery and execution duration data. Compared with l1, l0 does not collect basic operator information, so the performance overhead during collection is smaller, and this enables more accurate collection of statistics on time duration data.
- l1: collects operator delivery and execution duration data, as well as basic operator information, to provide more comprehensive performance analysis data.
When Profiling is enabled to collect training data, task_trace and training_trace must be set to on.
ge_api: switch that controls collection of the time consumption data of dynamic-shape operators in the host scheduling phase. Possible values are:
- off: switch off. The default value is off.
- l0: collects the time consumption data of dynamic-shape operators in the main host scheduling phase to facilitate accurate statistics.
- l1: collects finer-grained time consumption data of dynamic-shape operators in the host scheduling phase to provide more comprehensive performance analysis data.
hccl: communication data collection switch, either on or off (default).
NOTE:
This switch will be deprecated in later versions. To control data collection, use task_time.
aicpu: whether to collect details about the AI CPU operator, such as the operator execution time and data copy time. The value can be on or off (default).
fp_point: start point of the forward propagated operator in iteration traces, to record the start timestamp of forward propagation. Set the value to the name of the top operator in forward propagation. You can save the graph as a .pbtxt file by using tf.io.write_graph in the training script to obtain the name. Alternatively, you can leave this option empty (for example, "fp_point":""), and the system will automatically identify the start point of the forward propagated operator.
bp_point: end point of the backward propagated operator in iteration traces, to record the end timestamp of backward propagation. bp_point and fp_point are used to compute the time used by forward and backward propagation. Set the value to the name of the bottom operator in backward propagation. You can save the graph as a .pbtxt file by using tf.io.write_graph in the training script to obtain the name. Alternatively, you can leave this option empty (for example, "bp_point":""), and the system will automatically identify the end point of the backward propagated operator.
aic_metrics: AI Core metric to profile. The options are as follows:
- ArithmeticUtilization: arithmetic utilization ratio.
- PipeUtilization (default): ratio of time taken by the compute units to that of MTEs.
- Memory: ratio of external memory read/write instructions.
- MemoryL0: ratio of internal memory L0 read/write instructions.
- MemoryUB: ratio of internal memory UB read/write instructions.
- ResourceConflictRatio: ratio of pipeline queue instructions.
- L2Cache: read/write L2 cache hits and re-allocations after cache misses
  Atlas inference products: This parameter is not supported.
  
  Atlas training products: This parameter is not supported.
- MemoryAccess: bandwidth of the operator's memory access on cores.
  Atlas inference products: This parameter is not supported.
  
  Atlas training products: This parameter is not supported.
NOTE:
The registers whose data is to be collected can be customized, for example, "aic_metrics":"Custom:0x49,0x8,0x15,0x1b,0x64,0x10".
- The Custom field indicates the customization type. It is set to specific register values in the range of [0x1, 0x6E].
- A maximum of eight registers can be configured, which are separated with commas (,).
- The register value can be in hexadecimal or decimal format.
l2: switch that controls L2 cache and TLB page table cache hit ratio, either on or off (default).
- Atlas inference products: supports collection of the L2 cache hit ratio.
- Atlas training products: supports collection of the L2 cache hit ratio.
- Atlas A2 training products/Atlas A2 inference products: supports collection of L2 cache and TLB page table cache hit ratio. Setting aic_metrics to L2Cache is recommended for analyzing the number of hits on L2 cache from the AI Core.
- Atlas A3 training products/Atlas A3 inference products: supports collection of L2 cache and TLB page table cache hit ratio. Setting aic_metrics to L2Cache is recommended for analyzing the number of hits on L2 cache from the AI Core.
msproftx: switch that controls the msproftx user and upper-layer framework program to output profile data, either on or off (default).
Add the following mstx API or msproftx API to the application script. The mstx API is recommended.
runtime_api: runtime API data collection switch, either on or off (default). You can collect runtime API profile data, including the synchronous/asynchronous memory replication latencies between the host and device and between devices.
sys_hardware_mem_freq: switch that controls the collection of the on-chip memory, QoS transmission bandwidth, LLC L3 cache bandwidth, accelerator bandwidth, SoC transmission bandwidth, and component memory usage. The collected content varies depending on the product. The actual result prevails. The value range is [1, 100], in Hz.
Sampling memory data in the environment where glibc (2.34 or earlier) is installed may trigger a known Bug 19329. This problem can be solved by upgrading the glibc version.

NOTE:
For the following products, you are advised not to increase the profiling frequency after the profiling task is complete. Otherwise, SoC transmission bandwidth data may be lost.

Atlas 200I/500 A2 inference products

Atlas A2 training products/Atlas A2 inference products

Atlas A3 training products/Atlas A3 inference products
llc_profiling: LLC events to profile. Possible values are as follows:
- read (default): read events, that is, the L3 cache read rate.
- write: write events, that is, the L3 cache write rate.
sys_io_sampling_freq: NIC and RoCE data collection frequency. The value range is [1, 100], in Hz.
Atlas inference products: This parameter is not supported.

Atlas A2 training products/Atlas A2 inference products: supports NIC and RoCE collection.

Atlas A3 training products/Atlas A3 inference products: supports NIC and RoCE collection.
sys_interconnection_freq: frequency of collecting collective communication bandwidth data (HCCS), SIO data, PCIe data and inter-chip transmission bandwidth information. The value range is [1, 50], in Hz.
- Atlas training products: supports HCCS and PCIe data collection.
- Atlas A2 training products/Atlas A2 inference products: supports HCCS, PCIe data, and inter-chip transmission bandwidth information collection.
- Atlas A3 training products/Atlas A3 inference products: supports HCCS, PCIe data, inter-chip transmission bandwidth information, and SIO data collection.
dvpp_freq: DVPP collection frequency. The value range is [1, 100], in Hz.
instr_profiling: AI Core and AI Vector bandwidth and latency collection switch. The value can be on or off (default).
- Atlas training products: This function is not supported.
- Atlas A2 training products/Atlas A2 inference products: This switch is not supported. This function is controlled through instr_profiling_freq.
- Atlas A3 training products/Atlas A3 inference products: This switch is not supported. This function is controlled through instr_profiling_freq.
instr_profiling_freq: AI Core and AI Vector bandwidth and latency collection switch. If the collection frequency is configured, the related collection capability is enabled. The value range is [300, 30000]. The unit is Hz.
- Atlas training products: This function is not supported.
- Atlas A2 training products/Atlas A2 inference products: supported. However, instr_profiling_freq is mutually exclusive with training_trace, task_trace, hccl, aicpu, fp_point, bp_point, aic_metrics, l2, task_time, and runtime_api, so they cannot be executed at the same time.
- Atlas A3 training products/Atlas A3 inference products: supported. However, instr_profiling_freq is mutually exclusive with training_trace, task_trace, hccl, aicpu, fp_point, bp_point, aic_metrics, l2, task_time, and runtime_api, so they cannot be executed at the same time.
host_sys: switch for collecting host profile data. You can select one or more options and separate them with commas (,), for example, "host_sys": "cpu,mem".
- cpu: process CPU utilization
- mem: process memory utilization
host_sys_usage: Host-side system and process CPU and memory data collection option, selected from cpu and mem. You can select one or more options and separate them with commas (,).
host_sys_usage_freq: Host-side system and process CPU and memory data collection frequency. The value range is [1, 50] and the default value is 50. The unit is Hz.

NOTE:

fp_point and bp_point must be configured manually in the dynamic shape scenario; in other scenarios, manual configuration is not required.
Online inference supports task_trace and aicpu but does not support training_trace.

Example:

npu.global_options().profiling_config.profiling_options = '{"output":"/tmp/profiling","training_trace":"on","fp_point":"resnet_model/conv2d/Conv2Dresnet_model/batch_normalization/FusedBatchNormV3_Reduce","bp_point":"gradients/AddN_70"}'
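
Writing the profiling_options string by hand is error-prone because it is JSON embedded in a Python string. The following is a minimal sketch, not part of the product API, that assembles the string with the standard json module and rejects the combinations that the text above describes as mutually exclusive with instr_profiling_freq. The helper name build_profiling_options is hypothetical, and the sketch assumes that instr_profiling_freq can be converted to an integer.

```python
import json

# Options that the text lists as mutually exclusive with instr_profiling_freq.
EXCLUSIVE_WITH_INSTR_FREQ = {
    "training_trace", "task_trace", "hccl", "aicpu", "fp_point",
    "bp_point", "aic_metrics", "l2", "task_time", "runtime_api",
}


def build_profiling_options(**options):
    """Return a JSON string for profiling_options (hypothetical helper)."""
    if "instr_profiling_freq" in options:
        clash = sorted(EXCLUSIVE_WITH_INSTR_FREQ.intersection(options))
        if clash:
            raise ValueError(
                f"instr_profiling_freq cannot be combined with {clash}")
        freq = int(options["instr_profiling_freq"])
        if not 300 <= freq <= 30000:
            raise ValueError("instr_profiling_freq must be in [300, 30000] Hz")
    # Compact separators mirror the style of the hand-written example.
    return json.dumps(options, separators=(",", ":"))


profiling_options = build_profiling_options(
    output="/tmp/profiling",
    training_trace="on",
    task_trace="on",
    fp_point="",   # empty: let the system identify the start point
    bp_point="",   # empty: let the system identify the end point
)
print(profiling_options)
# In a training script, the string would then be assigned as in the example:
# npu.global_options().profiling_config.profiling_options = profiling_options
```

Only the exclusivity rule and the frequency range are checked here; other constraints in the option list (for example, product support) still need to be verified manually.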

AOE


Option

Description

aoe_config.aoe_mode

Tuning mode of AOE.

1: subgraph tuning.
2: operator tuning.
4: gradient splitting tuning.
In the data parallel scenario, AllReduce is used to aggregate gradients. The gradient splitting mode is closely related to the distributed training performance. Improper splitting may result in long communication tailing after backward propagation is complete, affecting the cluster training performance and linearity. Manual tuning through the gradient splitting API (set_split_strategy_by_idx or set_split_strategy_by_size) of collective communication is complicated. AOE collects profile data in the real-device environment and automatically searches for the optimal splitting strategy. You only need to apply the obtained strategy to your network by passing it to the set_split_strategy_by_idx call.

NOTE:

The tuning mode can be configured by modifying the training script or the AOE_MODE environment variable. If both configuration methods are used, the configuration by modifying the training script takes precedence.
For the Atlas A2 training products/Atlas A2 inference products, subgraph tuning is not supported.
For the Atlas A3 training products/Atlas A3 inference products, subgraph tuning is not supported.

Example:

npu.global_options().aoe_config.aoe_mode="1"

aoe_config.work_path

Working directory of AOE, which stores the configuration and result files. By default, the files are generated in the current directory.

The value is a string. Create the specified path in advance in the environment (either container or host) where training is performed. The running user configured during installation must have the read and write permissions on this path. The value can be an absolute path or a path relative to the path where the training script is executed.

An absolute path starts with a slash (/), for example, /home/HwHiAiUser/output.
A relative path starts with a directory name, for example, output.

Example:

npu.global_options().aoe_config.work_path = "/home/HwHiAiUser/output"
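
Because work_path must exist before training starts and must be readable and writable by the running user, it can help to prepare the directory in the script itself. The following is a minimal sketch using only the Python standard library. The helper name prepare_work_path is hypothetical, and a temporary directory stands in for a real location such as /home/HwHiAiUser/output.

```python
import os
import tempfile


def prepare_work_path(path):
    """Create the directory if needed and confirm read/write permission."""
    # A relative path is resolved against the current working directory.
    abs_path = os.path.abspath(path)
    os.makedirs(abs_path, exist_ok=True)
    if not os.access(abs_path, os.R_OK | os.W_OK):
        raise PermissionError(f"no read/write permission on {abs_path}")
    return abs_path


# Placeholder location used for illustration only.
work_path = prepare_work_path(os.path.join(tempfile.gettempdir(), "aoe_output"))
print(work_path)
# In a training script, the path would then be assigned as in the example:
# npu.global_options().aoe_config.work_path = work_path
```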

aoe_config.aoe_config_file

Tunes only operators with low performance on the network with AOE. Set this option to the path and name of the configuration file that contains the operator information, for example, /home/test/cfg/tuning_config.cfg.

Example:

npu.global_options().aoe_config.aoe_config_file="/home/test/cfg/tuning_config.cfg"

The configuration file contains information about the operators to be tuned. The file content format is as follows:

{
       "tune_ops_name":["bert/embeddings/addbert/embeddings/add_1","loss/MatMul"],
       "tune_ops_type":["Add", "Mul"],
       "tune_optimization_level":"O1",
       "feature":["deeper_opat"]
}

tune_ops_name: name of the specified operator (whole word match). You can specify one or more operator names. If multiple operator names are specified, separate them with commas (,). The operator name must be the node name of the network model processed by Graph Compiler. You can obtain the operator name from profiling tuning data. For details, see Profiling Instructions.
tune_ops_type: specified operator type (whole word match). You can specify one or more operator types. If multiple operator types are specified, separate them with commas (,). If a fused operator contains the specified operator type, the fused operator will also be tuned.
tune_optimization_level: tuning mode. The value O1 indicates the high-performance tuning mode, and the value O2 indicates the normal mode. The default value is O2.
feature: tuning feature switch. The value can be deeper_opat or nonhomo_split. The value deeper_opat indicates that in-depth operator tuning is enabled. In this case, aoe_mode must be set to 2. The value nonhomo_split indicates that non-uniform subgraph partition tuning is enabled. In this case, aoe_mode must be set to 1.

NOTE:

In the preceding configuration file, you can specify both tune_ops_type and tune_ops_name, or only one of them. If both are specified, the union of the two sets of operators is tuned.
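
The configuration file is plain JSON, so it can be generated from the training script. The following is a minimal sketch that writes such a file and checks the documented pairing between feature and aoe_mode (deeper_opat with 2, nonhomo_split with 1). The helper name write_tuning_config and the temporary file location are assumptions made for illustration.

```python
import json
import os
import tempfile

# aoe_mode required by each feature value, as stated in the text above.
REQUIRED_AOE_MODE = {"deeper_opat": "2", "nonhomo_split": "1"}


def write_tuning_config(path, aoe_mode, ops_name=(), ops_type=(),
                        level="O2", feature=()):
    """Write an AOE tuning configuration file (hypothetical helper)."""
    for item in feature:
        if item not in REQUIRED_AOE_MODE:
            raise ValueError(f"unknown feature {item!r}")
        if REQUIRED_AOE_MODE[item] != aoe_mode:
            raise ValueError(
                f"feature {item!r} requires aoe_mode "
                f"{REQUIRED_AOE_MODE[item]!r}, got {aoe_mode!r}")
    config = {"tune_optimization_level": level}
    if ops_name:
        config["tune_ops_name"] = list(ops_name)
    if ops_type:
        config["tune_ops_type"] = list(ops_type)
    if feature:
        config["feature"] = list(feature)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=4)
    return config


# Placeholder location used for illustration only.
cfg_path = os.path.join(tempfile.gettempdir(), "tuning_config.cfg")
config = write_tuning_config(
    cfg_path,
    aoe_mode="2",
    ops_name=["loss/MatMul"],
    ops_type=["Add", "Mul"],
    level="O1",
    feature=["deeper_opat"],
)
print(cfg_path)
# In a training script, the two options would then be set together:
# npu.global_options().aoe_config.aoe_mode = "2"
# npu.global_options().aoe_config.aoe_config_file = cfg_path
```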

Operator and Graph Build

Option

Description

op_compiler_cache_mode

Disk cache mode for operator building. enable is the default value.

enable: disk cache mode enabled. The operator build information is cached to the disk, which can be reused by operators with the same build parameters, improving build efficiency.
force: cache mode enabled. This mode deletes the existing cache, then recompiles the operators and adds them to the cache. For example, for Python changes, dependency library changes, or repository changes after operator optimization, you need to set this option to force to clean up the existing cache and then change it to enable to prevent the cache from being forcibly refreshed during each build. Note that you are not advised to set the force option for parallel program compilation. Otherwise, the cache used by other models may be cleaned up, causing compilation failures.
disable: disabled.

NOTE:

When enabling the operator compilation cache function, you can configure the path for storing the operator compilation cache file by using op_compiler_cache_dir.
disable or force is recommended for publishing the final model.
If op_debug_level is set to a non-zero value, the op_compiler_cache_mode configuration is ignored, the operator compilation cache function is disabled, and all operators are recompiled.
If op_debug_config is not empty and the op_debug_list field is not configured, the op_compiler_cache_mode configuration is ignored, the operator compilation cache function is disabled, and all operators are recompiled.
If op_debug_config is not empty, the op_debug_list field is configured, and op_compiler_cache_mode is set to enable or force, the operators in the list are recompiled, and the operator compilation cache function is enabled for operators that are not in the list. However, operators that are not in the list will not be recompiled.
When the operator compilation cache function is enabled, the default disk space allocated for cache files is 500 MB. If disk space becomes insufficient, cache files are deleted and 50% of the cache space is reserved by default. You can also customize the disk space allocated for cache files and the percentage of cache space to retain as follows:
- Using the op_cache.ini configuration file
  After the operator is compiled, the op_cache.ini file is automatically generated in the directory specified by op_compiler_cache_dir. You can use this file to set the disk space allocated for cache files and the percentage of cache space to retain. If the op_cache.ini file does not exist, manually create it.
  Add the following information to the op_cache.ini file:

  # Configure the file format (required). The automatically generated file contains the following information by default. When manually creating a file, enter the following information:
  [op_compiler_cache]
  # Limit the disk space of the cache folder on the Ascend AI Processor (unit: MB).
  max_op_cache_size=500
  # When the disk space is insufficient, set the percentage of cache files to retain. Value range: [1, 100] (%). For example, setting it to 80 means that when disk space becomes insufficient, 80% of the cache files will be retained and the rest will be deleted.
  remain_cache_size_ratio=80

  The op_cache.ini file takes effect only when the values of max_op_cache_size and remain_cache_size_ratio in the preceding file are valid.
  When the size of the compilation cache file exceeds the configured value of max_op_cache_size and the cache file has not been accessed for more than half an hour, the cache file will be aged out. (Operator compilation will not be interrupted if the cache file size exceeds the limit. Therefore, if max_op_cache_size is set too small, the actual compilation cache file size may exceed the configured value.)
  To disable the compilation cache aging function, set max_op_cache_size to -1. In this case, the access time is not updated when the operator cache is accessed, the operator compilation cache is not aged, and the default disk space of 500 MB is used.
  If multiple users use the same cache path, the configuration file affects all users.
- Using environment variable ASCEND_MAX_OP_CACHE_SIZE
  You can use the environment variable ASCEND_MAX_OP_CACHE_SIZE to limit the disk space for cache files under an Ascend AI Processor. When the compilation cache space reaches the value set by ASCEND_MAX_OP_CACHE_SIZE and a cache file has not been accessed for more than half an hour, the cache file will be aged out. ASCEND_REMAIN_CACHE_SIZE_RATIO can be used to set the percentage of cache space to retain. For details about environment variables, see "Operator Building" in Environment Variables.
  To disable the compilation cache aging function, set ASCEND_MAX_OP_CACHE_SIZE to -1.
If both the op_cache.ini file and environment variables are configured, the configuration items in op_cache.ini take precedence. If neither is configured, the system uses the default values: 500 MB of disk space for the cache, with 50% of the cache space retained.

Example:

npu.global_options().op_compiler_cache_mode="enable"

op_compiler_cache_dir

Disk cache directory for operator compilation.

The directory can contain letters, digits, underscores (_), hyphens (-), and periods (.).

If the specified directory exists and is valid, a kernel_cache subdirectory is automatically created. If the specified directory does not exist but is valid, the system automatically creates this directory and the kernel_cache subdirectory.

The storage priority of the operator compilation cache files is as follows:

op_compiler_cache_dir > ${ASCEND_CACHE_PATH}/kernel_cache > Default path ($HOME/atc_data)

For details about ASCEND_CACHE_PATH, see Environment Variables.

Example:

npu.global_options().op_compiler_cache_dir="/home/test/kernel_cache"

aicore_num

Maximum number of Cube cores and Vector cores used for operator compilation.

Format: Integer 1|Integer 2, where the two values are separated by a vertical bar (|). Integer 1 specifies the maximum number of Cube cores to use, and Integer 2 specifies the maximum number of Vector cores to use. Both values must be greater than 0 and less than or equal to the actual number of Cube cores and Vector cores available on the Ascend AI Processor.

NOTE:

This option is supported by the following products:
- Atlas A3 training products/Atlas A3 inference products
- Atlas A2 training products/Atlas A2 inference products
The maximum number of Cube cores and Vector cores for different Ascend AI Processors can be found in the CANN installation directory/<arch>-linux/data/platform_config/<soc_version>.ini file. The following example indicates that there are 24 Cube cores and 48 Vector cores on the Ascend AI Processor.

[SoCInfo]
ai_core_cnt=24
cube_core_cnt=24
vector_core_cnt=48

In static shape scenarios, if an existing operator binary is reused during model compilation (that is, jit_compile set to false), aicore_num does not take effect.

Example:

npu.global_options().aicore_num="2|4"

oo_constant_folding

Enables or disables constant folding.

Constant folding is to directly compute and replace the values of constant expressions during the graph build stage, thereby reducing the memory usage. In most cases, you are advised to retain the default value to enable constant folding. However, some networks require more memory during compilation and running, and the constant memory is occupied throughout the lifecycle of the graph. If the total memory increases after constant folding, you can use this parameter to disable constant folding.

True (default value): enables constant folding.
False: disables constant folding.

Example:

npu.global_options().oo_constant_folding=True

NOTE:

If constant folding is disabled and an error occurs during network compilation and running, information similar to the following will be displayed:

Example 1: Error message from the debug log:

[ERROR] GE(3469659,python3.7):2025-02-25-05: [ge_deleted_op.cc:21]3470503 Run: ErrorNo: 4294967295(failed) [Delete][Node] Node:HcomAllReduce/input type is ExpandDims, should be deleted by ge.

This error indicates that the network contains an ExpandDims operator that requires constant folding during graph compilation, meaning that constant folding cannot be disabled.

Example 2: Screen output with error code EZ3003:

Error Message is :
EZ3003: [PID: 3482331] 2025-02-25-14:07:19.774.362 No supported Ops kernel and engine are found for [import/conv2d_1/convolutionimport/batch_normalization_1/FusedBatchNorm_1_filter_host], optype [ConvBnFilterHost].
Possible Cause: The operator is not supported by the system. Therefore, no hit is found in any operator information library.

This error indicates that the network contains a ConvBnFilterHost operator that requires constant folding during graph compilation, meaning that constant folding cannot be disabled.
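
The op_cache.ini file described under op_compiler_cache_mode can also be created from a script when it does not exist yet. The following is a minimal sketch that writes the [op_compiler_cache] section with the standard configparser module and checks the value ranges stated above. The helper name write_op_cache_ini and the temporary directory are assumptions made for illustration; in practice, the directory would be the one configured through op_compiler_cache_dir.

```python
import configparser
import os
import tempfile


def write_op_cache_ini(cache_dir, max_size_mb=500, remain_ratio=50):
    """Write op_cache.ini into cache_dir (hypothetical helper)."""
    # -1 disables cache aging; any other value is a size limit in MB.
    if max_size_mb != -1 and max_size_mb <= 0:
        raise ValueError("max_op_cache_size must be positive or -1")
    if not 1 <= remain_ratio <= 100:
        raise ValueError("remain_cache_size_ratio must be in [1, 100]")
    parser = configparser.ConfigParser()
    parser["op_compiler_cache"] = {
        "max_op_cache_size": str(max_size_mb),
        "remain_cache_size_ratio": str(remain_ratio),
    }
    os.makedirs(cache_dir, exist_ok=True)
    ini_path = os.path.join(cache_dir, "op_cache.ini")
    with open(ini_path, "w", encoding="utf-8") as f:
        # Write key=value without spaces, matching the documented layout.
        parser.write(f, space_around_delimiters=False)
    return ini_path


# Placeholder directory used for illustration only.
cache_dir = os.path.join(tempfile.gettempdir(), "kernel_cache_demo")
ini_path = write_op_cache_ini(cache_dir, max_size_mb=500, remain_ratio=80)
print(open(ini_path, encoding="utf-8").read())
```

Remember that if several users share the same cache path, the file written here affects all of them.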

Exception Remedy


Option

Description

stream_sync_timeout

Timeout for stream synchronization during graph execution. If synchronization does not complete within the configured time, a synchronization failure is reported. The unit is ms.

The default value is -1, indicating that there is no waiting time and no error is reported when the synchronization fails.

Note: For cluster training, the value of this option (stream synchronization waiting timeout) must be greater than the collective communication timeout, that is, the value of the environment variable HCCL_EXEC_TIMEOUT. For details about HCCL_EXEC_TIMEOUT, see "Collective Communication" in Environment Variables.

Example:

npu.global_options().stream_sync_timeout=600000

event_sync_timeout

Timeout for event synchronization during graph execution. If synchronization does not complete within the configured time, a synchronization failure is reported. The unit is ms.

The default value is -1, indicating that there is no waiting time and no error is reported when the synchronization fails.

Example:

npu.global_options().event_sync_timeout=600000
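
For cluster training, the note under stream_sync_timeout requires the value to be greater than the collective communication timeout. The following is a minimal sketch that derives a suitable value from the HCCL_EXEC_TIMEOUT environment variable. It assumes that HCCL_EXEC_TIMEOUT is expressed in seconds (check Environment Variables for your version), and the helper name, the safety margin, and the fallback value used when the variable is unset are all hypothetical.

```python
import os


def stream_sync_timeout_ms(margin_s=60, default_hccl_timeout_s=1800):
    """Return a stream_sync_timeout in ms that exceeds HCCL_EXEC_TIMEOUT.

    Assumption: HCCL_EXEC_TIMEOUT is given in seconds. The fallback used
    when the variable is unset is a placeholder, not a documented default.
    """
    hccl_timeout_s = int(
        os.environ.get("HCCL_EXEC_TIMEOUT", default_hccl_timeout_s))
    # Add a margin so that the stream timeout is strictly greater.
    return (hccl_timeout_s + margin_s) * 1000


timeout_ms = stream_sync_timeout_ms()
print(timeout_ms)
# In a training script, the value would then be assigned as in the example:
# npu.global_options().stream_sync_timeout = timeout_ms
```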

Experiment Options

The experiment options are extended options for debugging and may be changed in later versions. Therefore, they cannot be used in commercial products.

Option

Description

graph_compiler_cache_dir

Drive cache directory for graph compilation. If this parameter is not empty, the drive cache function for graph compilation takes effect.

The graph compilation cache function supports drive persistence of graph compilation results. When graph compilation is performed again, the compilation results cached on the drive can be directly loaded to reduce the graph compilation duration.

NOTE:

The configured cache directory must exist. Otherwise, the compilation fails.
During graph compilation, the cache file is determined based on the value of this parameter. If the cache file does not exist, the cache is saved. If the cache file exists, the existing cache is directly loaded.
After a graph is changed, the original cache file is unavailable. You need to manually delete the cache file from the cache directory or rebuild and generate a cache file.
The cache does not ensure cross-version compatibility. If the version is upgraded, clear the cache directory and rebuild and generate the cache.
This function does not support models with resource operators.

Example:

npu.global_options().graph_compiler_cache_dir="/root/build_cache_dir"

jit_compile

Online compilation enable for model compilation.

auto (default value): For a static shape network, compile the operator online. For a dynamic shape network, search for the compiled operator binary file in the system first. If the corresponding binary file is not available, compile the operator.
true: Operators are compiled online. The system performs fusion and tuning based on the obtained graph information to get better performing operators.
false: The compiled operator binary file in the system is preferentially searched. If the file can be found, operators are not compiled anymore, which produces better compilation performance. If the file cannot be found, operators will be compiled.

NOTICE:

This option is used only for networks of large recommendation models.

Example:

npu.global_options().jit_compile = "auto"

shape_generalization_mode

When jit_compile is set to true (online operator compilation), use this parameter to configure the shape generalization mode.

STRICT (default): Uses the shape of the current iteration as is, without any generalization.
FULL: Generalizes all axes to -1 if the shape changes between iterations.
ADAPTIVE: Generalizes only the shape of the changed axis to -1 if the shape changes between iterations.

NOTICE:

If compile_dynamic_mode is set to True, all input shapes are generalized to -1 in the first iteration. In this case, the configuration of shape_generalization_mode does not take effect.

Example:

npu.global_options().shape_generalization_mode = "FULL"

auto_multistream_parallel_mode

This option applies only to graphs with a static shape. You can enable parallel execution of Cube and Vector operators to improve graph execution performance.

cv: Parallel execution of Cube and Vector operators is enabled.
None (default): Parallel execution of Cube and Vector operators is disabled.

NOTICE:

This option is used only for recommendation networks.
Parallel execution of Cube and Vector operators cannot be enabled at the same time as the multi-stream concurrency function (configured by the ENABLE_DYNAMIC_SHAPE_MULTI_STREAM environment variable). For details about the environment variable, see Environment Variables.

Example:

npu.global_options().auto_multistream_parallel_mode = "cv"
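
The three shape_generalization_mode values can be easier to compare on a concrete shape. The following is a small illustrative model of the behavior described for STRICT, FULL, and ADAPTIVE. It is not the compiler's implementation, the function name generalize_shape is hypothetical, and it assumes that the shapes of two consecutive iterations have the same rank.

```python
def generalize_shape(previous, current, mode="STRICT"):
    """Model the shape used for compilation after a shape change.

    previous and current are the input shapes of two consecutive
    iterations (assumed to have the same number of axes).
    """
    if mode not in ("STRICT", "FULL", "ADAPTIVE"):
        raise ValueError(f"unknown mode {mode!r}")
    if mode == "STRICT" or list(previous) == list(current):
        # No generalization: use the current shape as is.
        return list(current)
    if mode == "FULL":
        # Any change makes every axis dynamic.
        return [-1] * len(current)
    # ADAPTIVE: only the axes whose size changed become dynamic.
    return [-1 if p != c else c for p, c in zip(previous, current)]


prev_shape, cur_shape = [32, 224, 224, 3], [16, 224, 224, 3]
print(generalize_shape(prev_shape, cur_shape, "STRICT"))    # [16, 224, 224, 3]
print(generalize_shape(prev_shape, cur_shape, "FULL"))      # [-1, -1, -1, -1]
print(generalize_shape(prev_shape, cur_shape, "ADAPTIVE"))  # [-1, 224, 224, 3]
```

In this model, only the batch axis changes between iterations, so ADAPTIVE keeps the remaining three axes static while FULL makes all four dynamic.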

Options That Will Be Deprecated in Later Versions

Option	Description
op_select_implmode	Operator implementation mode. The operators built into the Ascend AI Processor can be implemented in either high-precision or high-performance mode. The value can be either of the following: high_precision: high-precision implementation mode. In this mode, Taylor's theorem or Newton's method is used to improve operator precision with float16 input. high_performance (default): high-performance implementation mode. In this mode, optimal performance is achieved without affecting the network precision (float16). The default value is None, indicating that the configuration is disabled. Example: npu.global_options().op_select_implmode="high_precision"
optypelist_for_implmode	List of operator types (separated by commas) that use the mode specified by the op_select_implmode parameter. Currently, Pooling, SoftmaxV2, LRN, and ROIAlign operators are supported. Use this parameter in conjunction with op_select_implmode, for example: npu.global_options().op_select_implmode="high_precision" npu.global_options().optypelist_for_implmode="Pooling,SoftmaxV2" The default value is None, indicating that the configuration is disabled.
variable_format_optimize	Whether to enable variable format optimization. True: enabled. False: disabled. To improve training efficiency, variables are converted to a format better suited to the Ascend AI Processor when the network initializes them. Enable or disable this function as needed. The default value is None, indicating that the configuration is disabled. Example: npu.global_options().variable_format_optimize=True
op_debug_level	Operator debug level. 0: Disables operator debug. 1: Enables operator debug. TBE instruction mapping files are generated in the kernel_meta directory under the training script execution path, including operator CCE files (.cce), Python-CCE mapping files (_loc.json), .o files, and .json files. These files are used for AI Core error analysis with related tools. 2: Enables operator debug. TBE instruction mapping files are generated in the kernel_meta directory under the training script execution path, including operator CCE files (.cce), Python-CCE mapping files (_loc.json), .o files, and .json files. In addition, compilation optimization of the CCE compiler is disabled and the CCE compiler debugging function is enabled (by setting the compiler options to -O0 -g). These files are used for AI Core error analysis with related tools. 3: Disables operator debug. The operator .o and .json files are retained in the kernel_meta folder in the training script execution directory. 4: Disables operator debug. The operator binary (.o) and operator description file (.json) are retained, and a TBE instruction mapping file (.cce) and a UB fusion description file ({$kernel_name}_compute.json) are generated in the kernel_meta folder under the training script execution directory. NOTICE: If this option is set to 0 and op_debug_config is configured, the operator compilation directory kernel_meta is still generated in the current execution path during training. The content generated in the directory is subject to op_debug_config. You are advised to set this option to 0 or 3 for training. To locate AI Core errors, set this option to 1 or 2, which might compromise network performance. If this option is set to 2 (CCE compiler debugging is enabled), it cannot be used together with the oom option in op_debug_config. Otherwise, an AI Core error is reported. The following is an example of the error message: ...there is an aivec error exception, core id is 49, error code = 0x4 ... If this option is set to 2 (CCE compiler debugging is enabled), the size of the operator kernel file (.o file) increases. In dynamic shape scenarios, all possible scenarios are traversed during operator build, which may cause operator build failures due to large operator kernel files. In this case, 2 is not recommended. If the build failure is caused by the large operator kernel file, the following log is displayed: message:link error ld.lld: error: InputSection too large for range extension thunk ./kernel_meta_xxxxx.o:(xxxx) If the value of this option is not 0, you can use the debug_dir option to specify the path for storing debugging-related process files. If this option is set to 0 and NPU_COLLECT_PATH is set, the operator compilation directory kernel_meta is generated in the current path after the command is executed. If ASCEND_WORK_PATH is set, kernel_meta is generated in the path specified by the environment variable. For details about the environment variable, see Environment Variables. When the debug function is enabled, if the model contains the following merged compute and communication (MC2) operators, the .o, .json, and .cce files of the operators are not generated in the operator build folder kernel_meta. MatMulAllReduce MatMulAllReduceAddRmsNorm AllGatherMatMul MatMulReduceScatter AlltoAllAllGatherBatchMatMul BatchMatMulReduceScatterAlltoAll The default value is None, indicating that the configuration is disabled. Example: npu.global_options().op_debug_level=0
graph_memory_max_size	Sizes of the network static memory and the maximum dynamic memory (used in earlier versions). In the current version, this parameter does not take effect. The system dynamically allocates memory resources based on the actual memory usage of the network.
variable_memory_max_size	Size of the variable memory (used in earlier versions). In the current version, this parameter does not take effect. The system dynamically allocates memory resources based on the actual memory usage of the network.
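
The constraints listed for op_debug_level can be checked before a training run with a short helper. The following is a minimal sketch under stated assumptions: check_op_debug_level, oom_enabled, and dynamic_shape are hypothetical names introduced for illustration (they are not part of the npu API), and the messages only paraphrase the notices in the table above.

```python
from typing import List

# One-line summary of each documented level, paraphrased from the table.
DEBUG_LEVELS = {
    0: "operator debug disabled",
    1: "debug enabled; .cce, _loc.json, .o and .json files generated",
    2: "as level 1, with CCE compiler optimization disabled (-O0 -g)",
    3: "debug disabled; .o and .json files retained",
    4: "debug disabled; .o, .json, .cce and UB fusion description retained",
}


def check_op_debug_level(level: int, oom_enabled: bool = False,
                         dynamic_shape: bool = False) -> List[str]:
    """Hypothetical pre-flight check for an op_debug_level value.

    oom_enabled: assumed to mean that op_debug_config turns on the oom option.
    dynamic_shape: assumed to mean that the network has dynamic shapes.
    Returns a list of messages; an empty list means no documented conflict.
    """
    if level not in DEBUG_LEVELS:
        raise ValueError(f"op_debug_level must be one of {sorted(DEBUG_LEVELS)}")

    issues: List[str] = []
    if level == 2 and oom_enabled:
        issues.append("error: level 2 cannot be combined with the oom option "
                      "in op_debug_config")
    if level == 2 and dynamic_shape:
        issues.append("warning: level 2 enlarges .o files and may cause "
                      "build failures in dynamic shape scenarios")
    if level in (1, 2):
        issues.append("warning: levels 1 and 2 might compromise network "
                      "performance; 0 or 3 is advised for training")
    return issues


if __name__ == "__main__":
    for message in check_op_debug_level(2, oom_enabled=True, dynamic_shape=True):
        print(message)
```

If the check returns no errors, the value can be applied as in the table's example, npu.global_options().op_debug_level=0.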
