Session Configuration Options
Basic Options
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
graph_run_mode |
Graph run mode.
Configuration example: custom_op.parameter_map["graph_run_mode"].i = 1 |
Training/Online inference |
|
session_device_id |
Logical ID of a device. This option lets a single training script run different models on multiple devices: create a separate session for each graph and pass a different session_device_id value to each. Example: config_0 = tf.ConfigProto()
custom_op = config_0.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["session_device_id"].i = 0
config_0.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config_0.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF
with tf.Session(config=config_0) as sess_0:
sess_0.run(...)
config_1 = tf.ConfigProto()
custom_op = config_1.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["session_device_id"].i = 1
config_1.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config_1.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF
with tf.Session(config=config_1) as sess_1:
sess_1.run(...)
config_7 = tf.ConfigProto()
custom_op = config_7.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["session_device_id"].i = 7
config_7.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config_7.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF
with tf.Session(config=config_7) as sess_7:
sess_7.run(...) |
Training/Online inference |
|
deterministic |
Whether to enable deterministic computing. If enabled, an operator produces the same output every time it is executed with the same hardware and input.
Deterministic computing is disabled by default because it slows down operator execution and affects performance. When it is disabled, the results of repeated executions may differ; this is generally caused by asynchronous multi-threaded execution inside operator implementations, which changes the accumulation order of floating-point numbers. However, if a model produces different results across runs, or its precision needs to be tuned, you can enable deterministic computing to assist debugging and tuning. Note that for a fully reproducible result you must also set a fixed random seed in the training script so that the random numbers generated in the program are deterministic. Example: custom_op.parameter_map["deterministic"].i = 1 |
Training/Online inference |
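To illustrate the note on random seeds above, the following sketch (a hypothetical script skeleton, assuming the TF1-style `tf.compat.v1` API used elsewhere in this document) combines deterministic computing with fixed seeds so that repeated runs can produce identical results:

```python
import numpy as np
import tensorflow as tf
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

# Fix all random sources first; deterministic computing alone does not
# make random initializers or dataset shuffling reproducible.
np.random.seed(42)
tf.compat.v1.set_random_seed(42)

config = tf.compat.v1.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["deterministic"].i = 1  # enable deterministic computing
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF

with tf.compat.v1.Session(config=config) as sess:
    sess.run(...)  # placeholder for the actual training step
```
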
Memory Management
Dynamic Shape
When dynamic dimension size profiles are used, input_shape, dynamic_dims, and dynamic_node_type must be configured together.
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
input_shape |
Input shape. Configuration example: custom_op.parameter_map["input_shape"].s = tf.compat.as_bytes("data:1,1,40,-1;label:1,-1;mask:-1,-1")
In the preceding example, the network model has three inputs: data (1, 1, 40, -1), label (1, -1), and mask (-1, -1). Separate each input's name from its shape with a colon (:), and separate inputs with semicolons (;). -1 indicates a dynamic dimension, whose size profiles are configured using dynamic_dims.
|
Online inference |
|
dynamic_dims |
Dynamic dimension size profiles. Separate profiles with semicolons (;) and the dimension sizes within a profile with commas (,). The dimension sizes map, in order, to the -1 placeholders in input_shape, so the number of sizes in each profile must equal the number of -1 placeholders. Set at least two profiles. The configuration must be consistent with input_shape; otherwise, an error is reported and the system exits. Example: custom_op.parameter_map["dynamic_dims"].s = tf.compat.as_bytes("20,20,1,1;40,40,2,2;80,60,4,4")
Based on the input_shape information in the preceding example, the supported input shape profiles are as follows:
|
Online inference |
|
dynamic_node_type |
Type of the dynamic input node.
Only one type of dynamic input is allowed per graph: dataset or placeholder.
Example:
custom_op.parameter_map["dynamic_node_type"].i = 0 |
Online inference |
|
ac_parallel_enable |
Indicates whether to allow AI CPU and AI Core operators to run in parallel in a dynamic shape graph.
When this option is enabled, the system automatically identifies the AI CPU operators in a dynamic shape graph that can run concurrently with AI Core operators. Operators of different engines are dispatched to different streams for parallel execution across engines, improving resource utilization and dynamic shape execution performance.
Configuration example: custom_op.parameter_map["ac_parallel_enable"].s = tf.compat.as_bytes("1") |
Training/Online inference |
|
compile_dynamic_mode |
Indicates whether to generalize all input shapes in the graph.
Configuration example: custom_op.parameter_map["compile_dynamic_mode"].b = True Note: This option cannot be used together with input_shape, dynamic_dims, or dynamic_node_type. |
Training/Online inference |
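Since the table states that input_shape, dynamic_dims, and dynamic_node_type must be used together, the following sketch combines the table's own example values into one configuration (the dynamic_node_type value 0 is taken directly from the example above):

```python
import tensorflow as tf

config = tf.compat.v1.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"

# Three inputs; -1 marks each dynamic dimension (four placeholders in total).
custom_op.parameter_map["input_shape"].s = tf.compat.as_bytes(
    "data:1,1,40,-1;label:1,-1;mask:-1,-1")
# Each profile supplies one value per -1 above, in order (4 values per profile).
custom_op.parameter_map["dynamic_dims"].s = tf.compat.as_bytes(
    "20,20,1,1;40,40,2,2;80,60,4,4")
# Dynamic input node type, as in the table's example.
custom_op.parameter_map["dynamic_node_type"].i = 0
```
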
Mixed Computing
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
mix_compile_mode |
Mixed computing
In full offload mode, all compute operators are offloaded to the device. As a supplement to the full offload mode, mixed computing allows certain operators to be executed online within the frontend framework, improving the Ascend AI Processor's adaptability to TensorFlow. Example: custom_op.parameter_map["mix_compile_mode"].b = True |
Training/Online inference |
|
in_out_pair_flag |
Whether to offload the operators specified by in_out_pair to the Ascend AI Processor in mixed computing scenarios.
Example: custom_op.parameter_map['in_out_pair_flag'].b = False |
Online inference |
|
in_out_pair |
Names of the input-layer and output-layer operators that are (or are not, depending on in_out_pair_flag) offloaded in mixed computing scenarios. Note that only one [in_nodes, out_nodes] range can be configured. Example:
# Enable mixed computing.
custom_op.parameter_map["mix_compile_mode"].b = True
# Offload operators within the [in_nodes, out_nodes] range to the Ascend AI Processor and execute the other operators in the frontend framework.
in_nodes.append('import/conv2d_1/convolution')
out_nodes.append('import/conv2d_59/BiasAdd')
out_nodes.append('import/conv2d_67/BiasAdd')
out_nodes.append('import/conv2d_75/BiasAdd')
all_graph_iop.append([in_nodes, out_nodes])
custom_op.parameter_map['in_out_pair'].s = tf.compat.as_bytes(str(all_graph_iop))
# Alternatively, keep operators within the [in_nodes, out_nodes] range in the frontend framework and offload the other operators to the Ascend AI Processor.
in_nodes.append('import/conv2d_1/convolution')
out_nodes.append('import/conv2d_59/BiasAdd')
out_nodes.append('import/conv2d_67/BiasAdd')
out_nodes.append('import/conv2d_75/BiasAdd')
all_graph_iop.append([in_nodes, out_nodes])
custom_op.parameter_map['in_out_pair_flag'].b = False
custom_op.parameter_map['in_out_pair'].s = tf.compat.as_bytes(str(all_graph_iop)) |
Online inference |
Debugging
Accuracy Tuning
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
precision_mode |
A string for the operator precision mode.
In the online inference scenario, the default value is "force_fp16". Example: custom_op.parameter_map["precision_mode"].s = tf.compat.as_bytes("allow_mix_precision")
|
Training/Online inference |
|
precision_mode_v2 |
A string for the operator precision mode.
Training scenario:
In the online inference scenario, the default value of this option is fp16. Example: custom_op.parameter_map["precision_mode_v2"].s = tf.compat.as_bytes("origin")
|
Training/Online inference |
|
modify_mixlist |
When mixed precision is enabled, you can use this option to specify the path and file name of the blocklist, trustlist, and graylist, and specify the operators that allow precision reduction and those that do not allow precision reduction. You can enable the mixed precision by configuring precision_mode_v2 or precision_mode in the script.
The blocklist, trustlist, and graylist storage files are in JSON format. A configuration example is as follows:
custom_op.parameter_map["modify_mixlist"].s = tf.compat.as_bytes("/home/test/ops_info.json")
You can specify the operator types in ops_info.json as follows. Separate operators with commas (,). {
"black-list": { // Blocklist
"to-remove": [ // Move an operator from the blocklist to the graylist.
"Xlog1py"
],
"to-add": [ // Move an operator from the trustlist or graylist to the blocklist.
"Matmul",
"Cast"
]
},
"white-list": { // Trustlist
"to-remove": [ // Move an operator from the trustlist to the graylist.
"Conv2D"
],
"to-add": [ // Move an operator from the blocklist or graylist to the trustlist.
"Bias"
]
}
}
Note: The operators in the preceding example configuration file are for reference only. Configure them based on the actual hardware environment and the operators' built-in tuning policies. You can query the built-in tuning policy of each operator in mixed precision mode in <CANN software installation directory>/opp/built-in/op_impl/ai_core/tbe/config/<soc_version>/aic-<soc_version>-ops-info.json. For example:
"Conv2D":{
    "precision_reduce":{
        "flag":"true"
    }
},
 |
Training/Online inference |
|
customize_dtypes |
If precision_mode sets the global precision mode of a network, precision problems may occur on particular operators. In this case, you can use customize_dtypes to configure the precision mode of those operators while the remaining operators are still compiled with the mode specified by precision_mode. Note that if precision_mode is set to must_keep_origin_dtype, customize_dtypes does not take effect. Set this option to the path of the configuration file, including the file name, for example, /home/test/customize_dtypes.cfg. Configuration example: custom_op.parameter_map["customize_dtypes"].s = tf.compat.as_bytes("/home/test/customize_dtypes.cfg")
List the names or types of the operators whose precision needs customization in the configuration file, one operator per line. Operator types must be defined based on Ascend IR. If both the name and the type of an operator are configured, the name takes precedence during compilation. The structure of the configuration file is as follows:
# By operator name
Opname1::InputDtype:dtype1,dtype2,...OutputDtype:dtype1,...
Opname2::InputDtype:dtype1,dtype2,...OutputDtype:dtype1,...
# By operator type
OpType::TypeName1:InputDtype:dtype1,dtype2,...OutputDtype:dtype1,...
OpType::TypeName2:InputDtype:dtype1,dtype2,...OutputDtype:dtype1,...
Example:
# By operator name
resnet_v1_50/block1/unit_3/bottleneck_v1/Relu::InputDtype:float16,int8,OutputDtype:float16,int8
# By operator type
OpType::Relu:InputDtype:float16,int8,OutputDtype:float16,int8
|
Training/Online inference |
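Putting the options above together, the following sketch enables mixed precision and attaches a custom operator blocklist/trustlist file (the file path is the hypothetical one from the modify_mixlist example; adjust it to your environment):

```python
import tensorflow as tf

config = tf.compat.v1.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"

# Enable automatic mixed precision (value taken from the precision_mode example above).
custom_op.parameter_map["precision_mode"].s = tf.compat.as_bytes("allow_mix_precision")
# Adjust which operators may (or may not) have their precision reduced.
custom_op.parameter_map["modify_mixlist"].s = tf.compat.as_bytes("/home/test/ops_info.json")
```
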
Accuracy comparison
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
enable_dump |
Data dump enable.
Example:
custom_op.parameter_map["enable_dump"].b = True |
Training/Online inference |
|
dump_mode |
Dump mode. The values are as follows:
NOTE:
If this option is set to all, the input data of some operators, such as the collective communication operators HcomAllGather and HcomAllReduce, is modified during execution. Therefore, the system dumps the operator input before execution and the operator output after execution. The dumped input and output data of the same operator are written to disk separately, producing multiple dump files; after parsing the dump files, you can determine whether the data is an input or an output from the file content. Example: custom_op.parameter_map["dump_mode"].s = tf.compat.as_bytes("all") |
Training/Online inference |
|
enable_dump_debug |
Overflow/underflow data collection enable.
Example: custom_op.parameter_map["enable_dump_debug"].b = True |
Training |
|
dump_debug_mode |
Overflow/Underflow detection mode. The values are as follows:
Example: custom_op.parameter_map["dump_debug_mode"].s = tf.compat.as_bytes("all") |
Training |
|
dump_path |
Dump path. This option is required when enable_dump or enable_dump_debug is set to True. Create the specified path in advance in the environment (either container or host) where training is performed. The running user configured during installation must have the read and write permissions on this path. The path can be an absolute path or a path relative to the path where the training script is executed.
Example: custom_op.parameter_map["dump_path"].s = tf.compat.as_bytes("/home/HwHiAiUser/output") |
Training/Online inference |
|
dump_step |
Iterations to dump. Separate multiple iterations using vertical bars (|), for example, 0|5|10. You can also use hyphens (-) to specify the iteration range, for example, 0|3-5|10. If this option is not set, dump data of all iterations is collected. Example: custom_op.parameter_map["dump_step"].s = tf.compat.as_bytes("0|5|10") |
Training |
|
dump_data |
Type of operator content to dump.
In large-scale training scenarios, dumping a large amount of data takes a long time. You can dump the statistics of all operators, identify the operators that may be abnormal based on the statistics, and then dump the input or output data of these abnormal operators. Example: custom_op.parameter_map["dump_data"].s = tf.compat.as_bytes("stats") |
Training/Online inference |
|
dump_layer |
Name of the operator to dump. Multiple operator names are separated by spaces. If this option is not set, all operators are dumped by default. If the input of the specified operator involves the data operator, the data operator information is also dumped. Example: custom_op.parameter_map["dump_layer"].s = tf.compat.as_bytes("nodename1 nodename2 nodename3") |
Training/Online inference |
|
quant_dumpable |
If the TensorFlow network is quantized by the AMCT tool, this option can be used to control whether to collect the dump data before quantization. The default value is 0.
Example: custom_op.parameter_map["quant_dumpable"].s = tf.compat.as_bytes("1")
NOTE:
This option applies only to online inference scenarios. When data dump is enabled, you can set this option to 1 to ensure that the dump data before quantization can be collected. |
Online inference |
|
fusion_switch_file |
Directory of the fusion switch configuration file, including the file name. The value can contain letters, digits, underscores (_), hyphens (-), and periods (.). The built-in graph fusion and UB fusion patterns are enabled by default. You can disable selected fusion patterns in the configuration file. The following is a template of the fusion_switch.cfg configuration file. on indicates that a fusion pattern is enabled, and off indicates that a fusion pattern is disabled. {
"Switch":{
"GraphFusion":{
"RequantFusionPass":"on",
"ConvToFullyConnectionFusionPass":"off",
"SoftmaxFusionPass":"on",
"NotRequantFusionPass":"on",
"SplitConvConcatFusionPass":"on",
"ConvConcatFusionPass":"on",
"MatMulBiasAddFusionPass":"on",
"PoolingFusionPass":"on",
"ZConcatv2dFusionPass":"on",
"ZConcatExt2FusionPass":"on",
"TfMergeSubFusionPass":"on"
},
"UBFusion":{
"TbePool2dQuantFusionPass":"on"
}
}
}
To disable all fusion patterns at a time, refer to this configuration file example. {
"Switch":{
"GraphFusion":{
"ALL":"off"
},
"UBFusion":{
"ALL":"off"
}
}
}
Example: custom_op.parameter_map["fusion_switch_file"].s = tf.compat.as_bytes("/home/test/fusion_switch.cfg") |
Training/Online inference |
|
buffer_optimize |
Enables buffer optimization. This is an advanced switch.
Example: custom_op.parameter_map["buffer_optimize"].s = tf.compat.as_bytes("l2_optimize") |
Online inference |
|
use_off_line |
Enable training on the Ascend AI Processor.
Example: custom_op.parameter_map["use_off_line"].b = True |
Training/Online inference |
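The dump options above are typically configured together. The following sketch combines the table's own example values into one data-dump configuration (the output path is the hypothetical one from the dump_path example and must already exist with read/write permissions):

```python
import tensorflow as tf

config = tf.compat.v1.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"

custom_op.parameter_map["enable_dump"].b = True
# Pre-created directory; the configured running user needs read/write permissions.
custom_op.parameter_map["dump_path"].s = tf.compat.as_bytes("/home/HwHiAiUser/output")
# Dump iterations 0, 3 through 5, and 10.
custom_op.parameter_map["dump_step"].s = tf.compat.as_bytes("0|3-5|10")
# Dump both operator inputs and outputs.
custom_op.parameter_map["dump_mode"].s = tf.compat.as_bytes("all")
```
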
Performance Tuning
- Basic configuration
Option
Description
Application Scenarios
iterations_per_loop
Number of iterations per training loop performed on the device side per sess.run() call, set by using set_iteration_per_loop in sess.run mode.
The value must be the same as the iterations_per_loop value passed to set_iteration_per_loop; this is checked for function verification.
Example:
custom_op.parameter_map["iterations_per_loop"].i = 10
Training
- Advanced setting
Option
Description
Application Scenarios
hcom_parallel
Enables parallel execution of AllReduce gradient updates with forward and backward computation during distributed training.
- True (default): enabled.
- False: disabled.
For a small network (for example, ResNet-18), you are advised to set this option to False.
Example:
custom_op.parameter_map["hcom_parallel"].b = True
Training
enable_small_channel
Small channel optimization enable. If enabled, performance benefits are yielded at convolutional layers whose channel size is less than or equal to 4.
- 0: disabled. This function is disabled by default in the training scenario (graph_run_mode is 1). You are advised not to enable this function in the training scenario.
- 1: enabled. This is the default option that cannot be modified for the online inference scenario (graph_run_mode is 0).
NOTE:
When this function is enabled, performance benefits can be obtained on the GoogLeNet, ResNet-50, ResNet-101, and ResNet-152 networks. For other networks, performance may deteriorate.
Example:
custom_op.parameter_map["enable_small_channel"].i = 1
Online inference/Training
op_precision_mode
High-precision or high-performance mode of an operator. You can pass a custom mode configuration file op_precision.ini to set different modes for operators.
You can set this option by operator type (lower priority) or by node name (higher priority). Example:
[ByOpType]
optype1=high_precision
optype2=high_performance
optype4=support_out_of_bound_index
[ByNodeName]
nodename1=high_precision
nodename2=high_performance
nodename4=support_out_of_bound_index
- high_precision
- high_performance
- support_out_of_bound_index: performs out-of-bounds checking on the indices of the gather, scatter, and segment operators. The checking degrades operator execution performance.
- keep_fp16: The FP16 data type is used for internal processing of operators. In this scenario, the FP16 data type is not automatically converted to the FP32 data type. If the performance of FP32 computation does not meet the expectation and high precision is not required, you can select the keep_fp16 mode. This low-precision mode sacrifices the precision for improving the performance, which is not recommended.
- super_performance: indicates ultra-high performance. Compared with high performance, the algorithm calculation formula is optimized.
You can view the precision or performance mode supported by an operator in the opp/built-in/op_impl/ai_core/tbe/impl_mode/all_ops_impl_mode.ini file in the file storage path with the CANN software installed.
This option is mutually exclusive with op_select_implmode and optypelist_for_implmode. If they are all specified, op_precision_mode takes precedence.
Generally, you do not need to set this option. It is used if you need to adjust the precision of a specific operator using the configuration .ini file in the case that you fail to obtain optimal network performance or accuracy in the high-performance or high-precision mode.
Example:
custom_op.parameter_map["op_precision_mode"].s = tf.compat.as_bytes("/home/test/op_precision.ini")
Training/Online inference
enable_scope_fusion_passes
Scope fusion patterns to take effect at compilation. Pass the names of registered fusion patterns; separate multiple names with commas (,).
Scope fusion patterns (either built-in or custom) are classified into the following two types:
- General scope fusion patterns: applicable to all networks. They are enabled by default and cannot be disabled manually.
- Non-general scope fusion patterns: applicable to specific networks. They are disabled by default; use enable_scope_fusion_passes to enable selected patterns.
Example:
custom_op.parameter_map["enable_scope_fusion_passes"].s = tf.compat.as_bytes("ScopeLayerNormPass,ScopeClipBoxesPass")
Training/Online inference
stream_max_parallel_num
This option applies only to neural machine translation (NMT) networks.
It specifies the parallelism degree of the AI CPU/AI Core engines so that AI CPU and AI Core operators can execute in parallel. Defaults to 1. The value cannot exceed the maximum number of AI Cores.
In the example below, DNN_VM_AICPU is the name of the AI CPU engine, with 10 concurrent tasks, and AIcoreEngine is the name of the AI Core engine, with 1 concurrent task.
Example:
custom_op.parameter_map["stream_max_parallel_num"].s = tf.compat.as_bytes("DNN_VM_AICPU:10,AIcoreEngine:1")
Training/Online inference
is_tailing_optimization
This option applies only to Bidirectional Encoder Representations from Transformers (BERT) networks.
Communication tailing optimization enable, used in distributed training scenarios to improve performance. By changing a computation dependency, computations that do not depend on the last AR (gradient aggregation fragment) are scheduled to run in parallel with the last AR, optimizing the communication tail. Value:
- True: enabled.
- False (default): disabled.
This option must work with NPUOptimizer and the value must be the same as that of is_tailing_optimization in NPUOptimizer.
Example:
custom_op.parameter_map["is_tailing_optimization"].b = True
Training
variable_placement
If the network weights are large, network execution may fail due to insufficient device memory. In this case, you can deploy the variables on the host to reduce device memory usage.
- Device: variables are deployed on the device.
- Host: variables are deployed on the host.
Default value: Device
Constraints:
- If this option is set to Host, mixed computing must be enabled (mix_compile_mode = True).
- If the training script uses TensorFlow V1 control flow APIs such as tf.case, tf.cond, and tf.while_loop, setting variable_placement to Host may cause network execution to fail. To avoid this problem, add the following APIs to the training script to convert TensorFlow V1 control flow operators to V2 and enable resource variables:
tf.enable_control_flow_v2()
tf.enable_resource_variables()
Example:
custom_op.parameter_map["variable_placement"].s = tf.compat.as_bytes("Device")
Training/Online inference
frozen_variable
To save the weights as a checkpoint, you can use this option to convert variables to constants, reducing data copies between the host and device and improving inference performance.
- True: conversion enabled.
- False: conversion disabled.
Default value: False
Example:
custom_op.parameter_map["frozen_variable"].b = True
Online inference
graph_max_parallel_model_num
In the online inference scenario, you can set this option to specify the maximum number of threads for parallel graph execution. If the value of this option is greater than 1, the corresponding number of threads are started for parallel graph execution, improving the overall graph execution efficiency.
The value must be an integer in the range of [1, INT32_MAX]. The default value is 1. INT32_MAX is the maximum value of the INT32 type, which is 2147483647.
Example:
custom_op.parameter_map["graph_max_parallel_model_num"].i = 4
Online inference
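The iterations_per_loop entry above can be sketched end to end. The import path of set_iteration_per_loop below is an assumption based on the TF1 NPU adapter (npu_bridge); adjust it to your installation. Note that the option value and the set_iteration_per_loop argument must match:

```python
import tensorflow as tf
# Assumed import path for the TF1 NPU adapter utility; verify against your installation.
from npu_bridge.estimator.npu import util

config = tf.compat.v1.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["iterations_per_loop"].i = 10  # must match the value below

with tf.compat.v1.Session(config=config) as sess:
    train_op = ...  # placeholder for your training op
    # Wrap the train op so that each sess.run() executes 10 iterations on the device.
    train_op = util.set_iteration_per_loop(sess, train_op, 10)
    sess.run(train_op)
```
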
Profiling
AOE
Operator Compilation
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
op_compiler_cache_mode |
Disk cache mode for operator compilation. enable is the default value.
Example: custom_op.parameter_map["op_compiler_cache_mode"].s = tf.compat.as_bytes("enable") |
Training/Online inference |
|
op_compiler_cache_dir |
Disk cache directory for operator compilation. The value can contain letters, digits, underscores (_), hyphens (-), and periods (.). If the specified directory exists and is valid, the kernel_cache subdirectory is created in it automatically; if the directory does not exist but the path is valid, the system creates both the directory and the kernel_cache subdirectory. The storage priority of operator compilation cache files is: op_compiler_cache_dir -> ${ASCEND_CACHE_PATH}/kernel_cache_<host ID> -> the default path ($HOME/atc_data). For details about ASCEND_CACHE_PATH, see Installation and Configuration > Flush File Configuration in Environment Variables. Example: custom_op.parameter_map["op_compiler_cache_dir"].s = tf.compat.as_bytes("/home/test/kernel_cache") |
Training/Online inference |
Data Augmentation
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
local_rank_id |
Rank ID of the current process, used in data parallel processing. The main process deduplicates the data and distributes the deduplicated data to the devices of other processes for forward and backward propagation.
In this mode, multiple devices on a host share one main process for data preprocessing, leaving other processes to receive preprocessed data from the main process. To identify the main process, call the collective communication API get_local_rank_id() to get the rank ID of the current process on its server. Example: custom_op.parameter_map["local_rank_id"].i = 0 |
Training/Online inference |
|
local_device_list |
Devices that the main process sends data to, used in conjunction with local_rank_id. custom_op.parameter_map["local_device_list"].s = tf.compat.as_bytes("0,1") |
Training/Online inference |
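Since local_rank_id and local_device_list work together, the following sketch configures both. The import path of get_local_rank_id below is an assumption based on the HCCL Python collective communication API mentioned in the table; verify it against your installation:

```python
import tensorflow as tf
# Assumed HCCL collective communication API import path; verify in your environment.
from hccl.manage.api import get_local_rank_id

config = tf.compat.v1.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"

# Rank ID of the current process on its server; the main process (rank 0)
# preprocesses data and distributes it to the devices in local_device_list.
custom_op.parameter_map["local_rank_id"].i = get_local_rank_id()
custom_op.parameter_map["local_device_list"].s = tf.compat.as_bytes("0,1")
```
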
Exception Remedy
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
hccl_timeout |
Timeout interval (s) of collective communication. Defaults to 1836. You can set the timeout interval if the default value does not meet your requirement (for example, when a communication failure occurs).
Example: custom_op.parameter_map["hccl_timeout"].i = 1800 |
Training |
|
op_wait_timeout |
Operator wait timeout interval (s). Defaults to 120. You can set the timeout interval if the default value does not meet your requirement. Configuration example: custom_op.parameter_map["op_wait_timeout"].i = 120 |
Training |
|
op_execute_timeout |
Operator execution timeout interval (s). Example: custom_op.parameter_map["op_execute_timeout"].i = 90 |
Training |
|
stream_sync_timeout |
Timeout interval for stream synchronization during graph execution, in ms. If synchronization exceeds the configured value, a synchronization failure is reported. The default value is -1, indicating that no timeout is applied and no synchronization failure is reported. Note: In cluster training scenarios, this value (the stream synchronization timeout) must be greater than the collective communication timeout, that is, the value of hccl_timeout or of the environment variable HCCL_EXEC_TIMEOUT. Example: custom_op.parameter_map["stream_sync_timeout"].i = 60000 |
Training |
|
event_sync_timeout |
Timeout interval for event synchronization during graph execution, in ms. If synchronization exceeds the configured value, a synchronization failure is reported. The default value is -1, indicating that no timeout is applied and no synchronization failure is reported. Configuration example: custom_op.parameter_map["event_sync_timeout"].i = 60000 |
Training |
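The constraint between the timeout options above can be made concrete. The following sketch sets the collective communication timeout (in seconds) and a stream synchronization timeout (in milliseconds) that exceeds it, as the stream_sync_timeout note requires; the values are illustrative:

```python
import tensorflow as tf

config = tf.compat.v1.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"

# Collective communication timeout, in seconds.
custom_op.parameter_map["hccl_timeout"].i = 1800
# Stream synchronization timeout, in milliseconds. In cluster training it must
# exceed the collective communication timeout (1800 s = 1,800,000 ms).
custom_op.parameter_map["stream_sync_timeout"].i = 1900000
```
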
Experiment Options
The experiment options are extended options for debugging and may change in later versions. Do not use them in commercial products.
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
jit_compile |
Determines whether to compile the operator online or use the compiled operator binary file.
Default value: auto
NOTICE:
This option is used only for networks of large recommendation models. Example: custom_op.parameter_map["jit_compile"].s = tf.compat.as_bytes("auto") |
Training/Online inference |
|
experimental_accelerate_train_mode |
If training takes more than one hour, you can use this option to trigger training acceleration and improve training performance. Based on the configured acceleration type, trigger mode, and proportion of low-precision training, the software compiles and runs that proportion of the training with reduced precision; the remaining training is compiled and run at the original precision.
The value is a string of three fields separated by vertical bars (|), for example, fast|step|0.9.
|
Training |
Options That Will Be Deprecated in Later Versions
The following options will be deprecated in later versions. You are advised not to use them anymore.
|
Option |
Description |
Application Scenarios |
|---|---|---|
|
op_debug_level |
Function debugging: whether to enable operator debugging.
This option is left empty by default, indicating that the configuration is disabled. Example: custom_op.parameter_map["op_debug_level"].i = 0 |
Training/Online inference |
|
enable_data_pre_proc |
Performance tuning. Enables offloading of the GetNext operator to the Ascend AI Processor. GetNext offload is a prerequisite for iteration offload.
Example:
custom_op.parameter_map["enable_data_pre_proc"].b = True |
Training |
|
variable_format_optimize |
Performance tuning. Variable format optimization enable.
If enabled, variables are reformatted during network variable initialization into formats better suited to the Ascend AI Processor (for example, from NCHW to NC1HWC0) to improve training efficiency. Enable or disable this function as needed. This option is left empty by default, indicating that the configuration is disabled. Example: custom_op.parameter_map["variable_format_optimize"].b = True |
Training |
|
op_select_implmode |
Performance tuning. Operator implementation mode select. Certain operators compiled in the Ascend AI Processor can be implemented in either high-precision or high-performance mode at model compile time. Arguments:
This option is left empty by default, indicating that the configuration is disabled. Example: custom_op.parameter_map["op_select_implmode"].s = tf.compat.as_bytes("high_precision") |
Training/Online inference |
|
optypelist_for_implmode |
Performance tuning. List of operator types (separated by commas) that use the mode specified by the op_select_implmode option. Currently, Pooling, SoftmaxV2, LRN, and ROIAlign operators are supported. Use this option in conjunction with op_select_implmode, for example: Set op_select_implmode to high_precision. Set optypelist_for_implmode to Pooling. This option is left empty by default, indicating that the configuration is disabled. Example: custom_op.parameter_map["optypelist_for_implmode"].s = tf.compat.as_bytes("Pooling,SoftmaxV2") |
Training/Online inference |
|
dynamic_input |
Whether the graph has dynamic inputs.
Example: custom_op.parameter_map["dynamic_input"].b = True |
Training/Online inference |
|
dynamic_graph_execute_mode |
Execution mode of a dynamic input. That is, this option takes effect when dynamic_input is set to True. Possible values are: dynamic_execute: dynamic graph compilation. In this mode, the shape range configured in dynamic_inputs_shape_range is used for compilation. Example: custom_op.parameter_map["dynamic_graph_execute_mode"].s = tf.compat.as_bytes("dynamic_execute") |
Training/Online inference |
|
dynamic_inputs_shape_range |
Shape range of each dynamic input. If a graph has two dataset inputs and one placeholder input, a configuration example is as follows: custom_op.parameter_map["dynamic_inputs_shape_range"].s = tf.compat.as_bytes("getnext:[128,3~5,2~128,-1],[64,3~5,2~128,-1];data:[128,3~5,2~128,-1]")
|
Training/Online inference |
|
graph_memory_max_size |
Sizes of the network static memory and the maximum dynamic memory (used in earlier versions). In the current version, this option does not take effect. The system dynamically allocates memory resources based on the actual memory usage of the network. |
Training/Online inference |
|
variable_memory_max_size |
Size of the variable memory (used in earlier versions). In the current version, this option does not take effect. The system dynamically allocates memory resources based on the actual memory usage of the network. |
Training/Online inference |
