Command-Line Options
This section describes the configuration options passed to GEInitialize, the Session constructor, and AddGraph, which take effect globally, within a session, and within a graph, respectively.
The following table lists only the configuration options supported by the current version. If an option is not listed in the table, it is reserved or applicable to other Ascend AI Processor versions.
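
The three scopes can be pictured as three separate option maps, one per call. The following is a minimal sketch only: `Options` is a stand-in for `std::map<ge::AscendString, ge::AscendString>` so that it compiles without the GE headers, the helper names are hypothetical, and the call shapes in the trailing comment are assumptions about how the maps would be passed.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for std::map<ge::AscendString, ge::AscendString>; an assumption
// made so that this sketch compiles without the GE headers.
using Options = std::map<std::string, std::string>;

// Options intended for GEInitialize (scope: Global).
Options MakeGlobalOptions() {
  return {{"ge.exec.deviceId", "0"}, {"ge.deterministic", "0"}};
}

// Options intended for the Session constructor (scope: Session).
Options MakeSessionOptions(int device_id) {
  assert(device_id >= 0);
  return {{"ge.session_device_id", std::to_string(device_id)}};
}

// Options intended for AddGraph (scope: Graph).
Options MakeGraphOptions() {
  return {{"ge.enableSingleStream", "false"}};
}

// Assumed call shape once the real GE headers are available:
//   ge::GEInitialize(global_options);
//   ge::Session session(session_options);
//   session.AddGraph(graph_id, graph, graph_options);
```

Keeping one map per scope makes it harder to pass a graph-level option to a call where it has no effect.
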
Basic Functions

| Options Key | Options Value | Required/Optional | Global/Session/Graph |
|---|---|---|---|
| ge.graphRunMode | Graph run mode.<br>Configuration example: {"ge.graphRunMode", "0"}; | Optional | Global/Session |
| ge.exec.deviceId | Logical ID of the device that the GE instance operates on at run time.<br>N indicates the number of available Ascend AI Processors on the server. Configuration example: {"ge.exec.deviceId", "-1"}; | Optional | Global |
| ge.session_device_id | Logical ID of a device. Setting this option lets you run different models on multiple devices from a single training script: create multiple threads, each of which runs one session, and pass a different ge.session_device_id value to each session. Configuration example: {"ge.session_device_id", "0"}; | Optional | Session |
| ge.socVersion | Target model of the Ascend AI Processor for model build and optimization. | Optional | All |
| ge.enableSingleStream | Whether to enable single-stream serial execution of a graph in the static shape scenario. A stream preserves the order of the asynchronous operations queued on it for execution on the device. Arguments:<br>Restrictions: If the model contains the Cmo operator and the following control operators, the single-stream feature cannot be used. In this case, use the default value false.<br>Configuration example: {"ge.enableSingleStream", "false"}; | Optional | Graph |
| ge.exec.rankTableFile | Information about the cluster participating in collective communication, including the organization information about the server, device, and container. Set this option to the ranktable file path, including the file name. | Optional | All |
| ge.exec.rankId | Rank ID, that is, the ID of a process in a group. The value ranges from 0 to (rank size – 1). For a custom group, ranks are numbered from 0 within the group. For an HCCL world group, the rank ID is the same as the world rank ID. | Optional | All |
| ge.constLifecycle | Lifecycle of constant nodes in the training and online inference scenarios.<br>The default value is session in the training scenario and graph in the online inference scenario. | Optional | All |
| ge.deterministic | Whether to enable deterministic computing. By default, deterministic computing is disabled, and repeated executions of an operator on the same hardware with the same input may produce different results. The difference is generally caused by asynchronous multi-thread execution in the operator implementation, which changes the order in which floating-point numbers are accumulated. When deterministic computing is enabled, an operator executed multiple times on the same hardware with the same input generates the same output, but operator execution is often slower. If a model produces different results across runs, or if precision needs to be optimized, you can enable deterministic computing to assist model debugging and optimization. Arguments:<br>Configuration example: {"ge.deterministic", "0"}; | Optional | Global |
| ge.exec.frozenInputIndexes | Indexes of the input tensors whose addresses are not refreshed. This option can be used only with LoadGraph. The input tensor index varies according to the model.<br>Configuration example:<br># Pass only the input tensor index.<br>{"ge.exec.frozenInputIndexes", "0;1;2"};<br># Pass the input tensor index, address of the data on the device, and data length.<br>{"ge.exec.frozenInputIndexes", "0,88832131,4;1,888213294,4;2,193492421,2"};<br>For details about the examples and precautions, see Running a Graph Asynchronously in the Single-Process and Single-Device Mode. Restrictions: The input tensor whose address is not refreshed must have a static shape. For a dynamic shape model, the input tensor must also have a static shape. | Optional | Graph |
| ge.exec.hostInputIndexes | Indexes of the input tensors whose placement is host in the in-line copy scenario. Use semicolons (;) to separate multiple input tensor indexes. In-line copy refers to the process of copying the input tensor data from host memory to device memory when the operator address of the model is updated. Configuration example: {"ge.exec.hostInputIndexes", "0;1;2"};<br>Restrictions: | Optional | Graph |

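
The two value formats shown for ge.exec.frozenInputIndexes can be assembled programmatically rather than typed by hand. The following is a minimal sketch: the `FrozenInput` struct and both helper names are hypothetical, and the separators (semicolons between entries, commas inside an entry) are taken from the configuration examples in the table.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper type: one frozen input with its device address and length.
struct FrozenInput {
  int index;
  std::uint64_t device_address;
  std::uint64_t length;
};

// Builds the index-only form, for example "0;1;2".
std::string JoinIndexes(const std::vector<int>& indexes) {
  std::string value;
  for (int index : indexes) {
    assert(index >= 0);
    if (!value.empty()) value += ';';
    value += std::to_string(index);
  }
  return value;
}

// Builds the "index,address,length" form, with entries separated by ';'.
std::string FormatFrozenInputs(const std::vector<FrozenInput>& inputs) {
  std::string value;
  for (const FrozenInput& input : inputs) {
    assert(input.index >= 0);
    if (!value.empty()) value += ';';
    value += std::to_string(input.index) + ',' +
             std::to_string(input.device_address) + ',' +
             std::to_string(input.length);
  }
  return value;
}
```

The same `JoinIndexes` helper also produces the semicolon-separated value used by ge.exec.hostInputIndexes.
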
Memory Management

| Options Key | Options Value | Required/Optional | Global/Session/Graph |
|---|---|---|---|
| ge.exec.disableReuseMemory | Memory reuse switch.<br>Configuration example: {"ge.exec.disableReuseMemory", "0"}; | Optional | All |
| ge.exec.atomicCleanPolicy | Whether to collectively clean up the memory occupied by all operators with the memset attribute (memset operators) on the network. Arguments:<br>Configuration example: {"ge.exec.atomicCleanPolicy", "0"}; | Optional | Session |
| ge.externalWeight | When multiple models are loaded in a session and their weights can be reused, you are advised to use this option to externalize the weights of the Const/Constant nodes on the network. This implements weight reuse among the models and reduces the memory used by the weights. Arguments:<br>Description of the file flush path:<br>Priority of the flush path: ge.externalWeightDir > ${ASCEND_WORK_PATH}/tmp_weight_<pid>_<sessionid> > tmp_weight_<pid>_<sessionid> in the current execution directory. When the model is unloaded, the tmp_weight_<pid>_<sessionid> directory is deleted. Configuration example: {"ge.externalWeight", "1"}; | Optional | Global/Session |
| ge.externalWeightDir | Flush path for the external weight file. Restrictions:<br>Configuration example: {"ge.externalWeight", "1"};<br>{"ge.externalWeightDir", "$HOME/your_tmp_path"}; | Optional | Global/Session |
| ge.exec.staticMemoryPolicy | Memory allocation mode used during network running. Arguments:<br>NOTE:<br>Configuration example: {"ge.exec.staticMemoryPolicy", "0"}; | Optional | Global/Session |
| ge.featureBaseRefreshable | Whether the feature memory address can be refreshed. To manage the feature memory and refresh its address multiple times, set this option to 1. This option applies only to static shape graphs. Arguments: 0 (default): The feature memory address cannot be refreshed. 1: The feature memory address of a model can be refreshed. Configuration example: {"ge.featureBaseRefreshable", "0"}; | Optional | All |
| ge.exec.inputReuseMemIndexes | Whether to enable memory reuse for the input nodes of a graph. When enabled, the memory of an input node can be reused as intermediate memory required during model execution, which reduces the memory peak. The value is the index of the input node. To enable memory reuse for multiple input nodes, separate the indexes with commas (,). Each input node must have the index attribute, which specifies the sequence number of the input. The index starts from 0. Note:<br>Configuration example: {"ge.exec.inputReuseMemIndexes", "0,1,2"}; | Optional | Graph |
| ge.exec.outputReuseMemIndexes | Whether to enable memory reuse for the outputs of the entire graph. When enabled, the memory of a graph output can be reused as intermediate memory required during model execution, which reduces the memory peak. The value is the index of the graph output. To enable memory reuse for multiple outputs, separate the indexes with commas (,). Note:<br>Configuration example: {"ge.exec.outputReuseMemIndexes", "0,1,2"}; | Optional | Graph |
| ge.exec.input_fusion_size | Threshold for fusing and copying multiple discrete pieces of user input data during data transfer from the host to the device. The minimum value is 0, the maximum value is 32 MB (33,554,432 bytes), and the default value is 128 KB (131,072 bytes). If:<br>Assume there are 10 user inputs, including two 100 KB inputs, two 50 KB inputs, and the other inputs greater than 100 KB:<br>This option takes effect only when a static graph is run asynchronously, that is, when the API described in RunGraphAsync is used to run the graph. | Optional | All |
| ge.inputBatchCpy | Whether to enable batch memory copy when input data is transferred from the host to the device. This function improves the performance of host-to-device data transfer. It applies to scenarios where data is transferred frequently and PCIe bandwidth usage is low; enabling it improves bandwidth utilization. Arguments:<br>Restrictions:<br>Configuration example: {"ge.inputBatchCpy", "0"}; | Optional | All |

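
The byte limits quoted for ge.exec.input_fusion_size can be checked before the option string is built. The following is a minimal sketch, assuming the option value is written as a plain byte count (an assumption based on the byte figures given in the table); the constant and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

constexpr std::uint64_t kKiB = 1024;
constexpr std::uint64_t kMiB = 1024 * kKiB;
constexpr std::uint64_t kMaxInputFusionSize = 32 * kMiB;       // documented maximum
constexpr std::uint64_t kDefaultInputFusionSize = 128 * kKiB;  // documented default

// The byte counts quoted in the table follow from the unit arithmetic.
static_assert(kMaxInputFusionSize == 33554432, "32 MB in bytes");
static_assert(kDefaultInputFusionSize == 131072, "128 KB in bytes");

// Writes the option value to *value and returns true when the threshold is
// within the documented range [0, 32 MB]; returns false otherwise.
bool MakeInputFusionSize(std::uint64_t bytes, std::string* value) {
  assert(value != nullptr);
  if (bytes > kMaxInputFusionSize) return false;
  *value = std::to_string(bytes);
  return true;
}
```

On success, the resulting string would be used as the value in {"ge.exec.input_fusion_size", value}.
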
Dynamic Shape
Operator and Graph Build
Debugging
Precision Tuning
Precision Comparison
Performance Tuning

| Key | Value | Required/Optional | Global/Session/Graph |
|---|---|---|---|
| ge.exec.variable_acc | Whether to enable variable format optimization. Arguments:<br>To improve training efficiency, the format of the variables is converted to a format more compatible with the Ascend AI Processor during variable initialization performed by the network. However, this function should be disabled in special scenarios. Restrictions: When this function is enabled, ge.AllowMultiGraphParallelCompile cannot be set to 1. Otherwise, an error is reported during verification. Configuration example: {"ge.exec.variable_acc", "True"}; | Optional | All |
| ge.exec.op_precision_mode | Precision mode of one or more specified operators during internal processing. This option passes the path of the custom precision mode configuration file op_precision.ini, which sets different precision modes for different operators. In each row of the .ini file, set the precision mode by operator type (low priority) or by node name (high priority). The following precision modes can be set in the configuration file:<br>You can view the precision or performance modes supported by an operator in the opp/built-in/op_impl/ai_core/tbe/impl_mode/all_ops_impl_mode.ini file under the CANN software installation path. A configuration example for the op_precision.ini file is as follows:<br>[ByOpType]<br>optype1=high_precision<br>optype2=high_performance<br>optype3=enable_hi_float_32_execution<br>optype4=support_out_of_bound_index<br>[ByNodeName]<br>nodename1=high_precision<br>nodename2=high_performance<br>nodename3=enable_hi_float_32_execution<br>nodename4=support_out_of_bound_index<br>Configuration example: {"ge.exec.op_precision_mode", "$HOME/conf/op_precision.ini"}; | Optional | Global |
| ge.optypelistForImplmode | List of operator types. The operators in the list use the mode specified by the ge.opSelectImplmode option. Restrictions: | Optional | Global |
| ge.tiling_schedule_optimize | Whether to enable the optimization for tiling offload scheduling. Because the internal storage of the AI Core in the NPU cannot hold all the input and output data of an operator, the input data is split into parts: the first part is transferred in, computed, and transferred out, and then the next part is processed in the same way. This process is called tiling. A computation program, called the tiling implementation, determines the tiling parameters (such as the block size transferred each time and the total number of cycles) based on operator information such as shape. Because AI Cores are not good at the scalar computation in the tiling implementation, the tiling implementation is generally executed on the host CPU. However, it is executed on the device when the following conditions are met:<br>Arguments:<br>Configuration example: {"ge.tiling_schedule_optimize", "0"};<br>This option can be used only by the following products: | Optional | Global/Session |
| ge.graphMaxParallelModelNum | In graph execution mode, a graph can be concurrently loaded and executed by multiple models on the same device. This option specifies the maximum number of models that can be concurrently loaded. Arguments: 1 to INT32_MAX. The default value is 8. Configuration example: {"ge.graphMaxParallelModelNum", "8"}; | Optional | All |

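
The op_precision.ini layout described for ge.exec.op_precision_mode (a [ByOpType] section followed by a [ByNodeName] section, each holding name=mode lines) can be generated from in-memory lists. The following is a minimal sketch: the function names and the `Entries` alias are hypothetical, and the sample names in the usage note are the placeholders from the table, not real operator types.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Ordered list of (name, precision mode) pairs; a vector keeps insertion order.
using Entries = std::vector<std::pair<std::string, std::string>>;

// Renders one section: a "[name]" header followed by key=value lines.
std::string RenderSection(const std::string& name, const Entries& entries) {
  assert(!name.empty());
  std::string text = "[" + name + "]\n";
  for (const auto& entry : entries) {
    text += entry.first + "=" + entry.second + "\n";
  }
  return text;
}

// Renders the two sections named in the option description.
std::string RenderOpPrecisionIni(const Entries& by_op_type,
                                 const Entries& by_node_name) {
  return RenderSection("ByOpType", by_op_type) +
         RenderSection("ByNodeName", by_node_name);
}
```

For example, passing {{"optype1", "high_precision"}} and {{"nodename1", "high_performance"}} yields a two-section file whose text can be written to the path given in the option value.
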
Profiling
AOE

| Key | Value | Required/Optional | Global/Session/Graph |
|---|---|---|---|
| ge.mdl_bank_path | Path of the custom repository generated after subgraph tuning. This option must be used together with ge.bufferOptimize and takes effect only when buffer optimization, which improves performance by temporarily storing data in the buffer, is enabled. Argument: directory of the custom repository generated after model tuning. Format: The directory can contain letters, digits, underscores (_), hyphens (-), and periods (.). Default: $HOME/Ascend/latest/data/aoe/custom/graph/<soc_version> Restrictions: Priority ranked from high to low: directory specified by ge.mdl_bank_path > directory specified by TUNE_BANK_PATH > default directory. | Optional | All |
| ge.op_bank_path | Directory of the custom repository generated after operator tuning. Format: The directory can contain letters, digits, underscores (_), hyphens (-), and periods (.). Default: ${HOME}/Ascend/latest/data/aoe/custom/op Restrictions: Priority ranked from high to low: path specified by the TUNE_BANK_PATH environment variable > path specified by ge.op_bank_path > default path. | Optional | All |

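
The two repository options state different precedence orders, which is easy to overlook: for ge.mdl_bank_path the option is ranked above TUNE_BANK_PATH, while for ge.op_bank_path TUNE_BANK_PATH is ranked above the option. The following is a minimal sketch of those two orders as pure functions. It mirrors only the order stated in the table, not the actual lookup code; the function names are hypothetical, and an empty string stands for "not set".

```cpp
#include <cassert>
#include <string>

// Stated order for ge.mdl_bank_path: option, then TUNE_BANK_PATH, then default.
std::string ResolveModelBankPath(const std::string& option,
                                 const std::string& tune_bank_path,
                                 const std::string& default_path) {
  assert(!default_path.empty());
  if (!option.empty()) return option;
  if (!tune_bank_path.empty()) return tune_bank_path;
  return default_path;
}

// Stated order for ge.op_bank_path: TUNE_BANK_PATH, then option, then default.
std::string ResolveOpBankPath(const std::string& option,
                              const std::string& tune_bank_path,
                              const std::string& default_path) {
  assert(!default_path.empty());
  if (!tune_bank_path.empty()) return tune_bank_path;
  if (!option.empty()) return option;
  return default_path;
}
```

With both the option and the environment variable set, the first function returns the option value and the second returns the environment value.
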
Exception Remedy

| Key | Value | Required/Optional | Global/Session/Graph |
|---|---|---|---|
| stream_sync_timeout | Timeout for stream synchronization during graph execution. If synchronization takes longer than the configured value, a synchronization failure is reported. The unit is ms. The default value is –1, indicating that there is no waiting time and no error is reported when the synchronization fails. Configuration example: {"stream_sync_timeout", "-1"}; | Optional | Global/Session |
| event_sync_timeout | Timeout for event synchronization during graph execution. If synchronization takes longer than the configured value, a synchronization failure is reported. The unit is ms. The default value is –1, indicating that there is no waiting time and no error is reported when the synchronization fails. Configuration example: {"event_sync_timeout", "-1"}; | Optional | Global/Session |

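
Because both timeouts are expressed in milliseconds, typed durations can help avoid unit mistakes when the option strings are built. The following is a minimal sketch: `Options` is a stand-in for the GE option map type, the helper names are hypothetical, and rejecting values below -1 is an assumption of this sketch rather than a documented rule.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Stand-in for std::map<ge::AscendString, ge::AscendString> (assumption).
using Options = std::map<std::string, std::string>;

// Converts a duration to the millisecond string used as the option value.
// Assumption: -1 (the documented default) is the only negative value accepted.
std::string ToMillisecondsValue(std::chrono::milliseconds timeout) {
  assert(timeout.count() >= -1);
  return std::to_string(timeout.count());
}

// Builds both synchronization timeout options from typed durations.
Options MakeSyncTimeoutOptions(std::chrono::milliseconds stream_timeout,
                               std::chrono::milliseconds event_timeout) {
  return {{"stream_sync_timeout", ToMillisecondsValue(stream_timeout)},
          {"event_sync_timeout", ToMillisecondsValue(event_timeout)}};
}
```

A caller can pass `std::chrono::seconds(30)` directly; the standard library converts it to 30000 ms without loss.
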
Experiment Parameters

| Key | Value | Required/Optional | Global/Session/Graph |
|---|---|---|---|
| ge.jit_compile | This option is not supported in the current version. | Optional | Global/Session |
| ge.build_inner_model | This option is not supported in the current version. | Optional | N/A |
| ge.disableOptimizations | This option is used for debugging and cannot be used in commercial products. The function specified by this option will be released as a feature in later versions. This option applies only to the following products:<br>This option specifies one or more compilation and optimization passes to be disabled. Currently, only the following passes can be disabled: "RemoveSameConstPass", "ConstantFoldingPass", "TransOpWithoutReshapeFusionPass" Note:<br>Configuration example: | Optional | All |
| ge.oo.level | Extended option for debugging. It cannot be used in commercial products and will be released as a formal function in later versions. Multi-level optimization options for graph build include subgraph optimization, entire graph optimization, and static shape model offloading. Static shape model offloading: The input and output shapes of all operators in a static shape model can be determined at build time, so model-level memory orchestration and operator tiling computation can be completed on the host. The results are batched and sent to the device stream when the model is loaded, but they are not executed immediately; execution of all tasks in the model is triggered when the model execution task is delivered. Arguments:<br>Restrictions: If the value is O1, all graph fusion and UB fusion passes are disabled, and only passes related to static offloading are enabled. However, the following graph fusion passes remain enabled by default because disabling them may cause functional problems: all graph fusion passes under the ExceptionalPassOfO1Level field in the ${INSTALL_DIR}/x86_64-linux/lib64/plugin/opskernel/fusion_pass/config/fusion_config.json file. Replace ${INSTALL_DIR} with the CANN component directory. For example, if the installation is performed by the root user, the default file storage path is /usr/local/Ascend/cann. Configuration example: {"ge.oo.level", "O3"}; | Optional | All |
| ge.oo.constantFolding | Extended option for debugging. It cannot be used in commercial products and will be released as a formal function in later versions. Whether to enable constant folding optimization. Constant folding replaces nodes in a computational graph that can be evaluated to a constant output value with that constant, and simplifies the structure of the computational graph accordingly. Arguments:<br>Configuration example: {"ge.oo.constantFolding", "true"};<br>Restrictions: If other compilation optimization options, such as ge.disableOptimizations, are configured, ge.disableOptimizations has a higher priority. | Optional | All |
| ge.oo.deadCodeElimination | Extended option for debugging. It cannot be used in commercial products and will be released as a formal function in later versions. Whether to enable dead-edge elimination optimization. Dead-edge elimination: When pred (input 1) of a switch statement is a constant node, one of the branches can be eliminated based on the value of the constant. If the constant is true, the false branch is eliminated; if the constant is false, the true branch is eliminated. Arguments:<br>Configuration example: {"ge.oo.deadCodeElimination", "true"}; | Optional | All |
| ge.autoMultistreamParallelMode | Extended option for debugging. It cannot be used in commercial products and will be released as a formal function in later versions. This option applies only to graphs with a static shape. You can enable parallel execution of Cube and Vector operators to improve graph execution performance. Arguments:<br>NOTICE:<br>Configuration example:<br>{"ge.autoMultistreamParallelMode", "cv"}; | Optional | Session/Graph |
| ge.DeterministicLevel | Extended option for debugging. It cannot be used in commercial products and will be released as a formal function in later versions. It applies only to 8.5.1 and later versions. Specifies the deterministic computing level. The default level is 0, which means that deterministic computing is disabled; ge.deterministic must then also be 0. To enable deterministic computing, set the level to 1 and ge.deterministic to 1. To enable strong consistency computing, set the level to 2 and ge.deterministic to 1. When strong consistency computing is enabled (ge.DeterministicLevel is set to 2), the computing result is deterministic, meaning that multiple executions generate the same result, and it is also independent of the data location. For example, in matrix multiplication, the order of accumulation may vary across rows, which can lead to slight differences in the results for the same data in different rows. With strong consistency computing enabled, the results are consistent as long as the inputs are the same, even if they are in different rows. Strong consistency computing is disabled by default, in which case the results may be inconsistent when the same data appears in different rows. Because strong consistency computing slows down operators and affects overall efficiency, you are advised to enable it only when the computing result must be strictly consistent for the same data in different locations, or when the model precision is being adjusted and debugged to optimize overall performance. Arguments:<br>Configuration example: {"ge.deterministic", "0"};<br>{"ge.DeterministicLevel", "0"};<br>Restrictions: This option must be used together with ge.deterministic. | Optional | Global |

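
The pairing rule between ge.DeterministicLevel and ge.deterministic (level 0 goes with 0, levels 1 and 2 go with 1) can be enforced by deriving both values from a single input. The following is a minimal sketch: `Options` is a stand-in for the GE option map type and the function name is hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for std::map<ge::AscendString, ge::AscendString> (assumption).
using Options = std::map<std::string, std::string>;

// level 0: deterministic computing disabled.
// level 1: deterministic computing.
// level 2: strong consistency computing.
Options MakeDeterminismOptions(int level) {
  assert(level >= 0 && level <= 2);
  // Per the table, ge.deterministic is 0 for level 0 and 1 for levels 1 and 2.
  const std::string deterministic = (level == 0) ? "0" : "1";
  return {{"ge.deterministic", deterministic},
          {"ge.DeterministicLevel", std::to_string(level)}};
}
```

Deriving both entries from one argument rules out inconsistent combinations such as level 2 with ge.deterministic set to 0.
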
Parameters That Will Be Deprecated in Later Versions

| Key | Value | Required/Optional | Global/Session/Graph |
|---|---|---|---|
| ge.graphMemoryMaxSize | Do not use this option because it will be deprecated in later versions. Network static memory size and maximum dynamic memory size. The value varies according to the network size. The unit is byte and the value range is [0, 256 × 1024 × 1024 × 1024], that is, [0, 274877906944]. Due to chip hardware performance restrictions, the sum of ge.graphMemoryMaxSize and ge.variableMemoryMaxSize must not exceed 31 GB. If this option is not set, the default value 26 GB is used. | Optional | All |
| ge.variableMemoryMaxSize | Do not use this option because it will be deprecated in later versions. Variable memory size. The value varies according to the network size. The unit is byte and the value range is [0, 256 × 1024 × 1024 × 1024], that is, [0, 274877906944]. Due to chip hardware performance restrictions, the sum of ge.graphMemoryMaxSize and ge.variableMemoryMaxSize must not exceed 31 GB. If this option is not set, the default value 5 GB is used. | Optional | All |
| ge.exec.dynamicGraphExecuteMode | This option is deprecated. Avoid using it. Execution mode, applicable to the dynamic input scenario. The value is dynamic_execute. | Optional | Graph |
| ge.exec.dataInputsShapeRange | This option is deprecated. Avoid using it. Shape range of dynamic input. If a graph has two data inputs, the configuration example is as follows:<br>std::map<ge::AscendString, ge::AscendString> ge_options = {{"ge.exec.deviceId", "0"},<br>{"ge.graphRunMode", "1"},<br>{"ge.exec.dynamicGraphExecuteMode", "dynamic_execute"},<br>{"ge.exec.dataInputsShapeRange", "[128 ,3~5, 2~128, -1],[ 128 ,3~5, 2~128, -1]"}};<br>NOTE: | Optional | Graph |
| ge.opSelectImplmode | This option is no longer being developed and will be deprecated in later versions. You are advised to use ge.exec.op_precision_mode instead. Operator implementation mode selection. Certain operators built in the Ascend AI Processor can be implemented in either high-precision or high-performance mode at model build time. In high-precision mode, Taylor's theorem or Newton's method is used to improve operator precision with float16 input. In high-performance mode, optimal performance is achieved without affecting the network precision (float16). Arguments:<br>The preceding implementation modes are distinguished based on dtype of the operator. Replace ${INSTALL_DIR} with the CANN component directory. For example, if the installation is performed by the root user, the default file storage path is /usr/local/Ascend/cann. Configuration example: {"ge.opSelectImplmode", "high_performance"}; | Optional | Global |
| ge.shape_generalized_build_mode | Do not use this option because it will be deprecated in later versions. | Optional | Graph |



