Command-Line Options

Atlas 200I/500 A2 inference products: This feature is not supported.

This section describes the configuration options passed to GEInitialize, the Session constructor, and AddGraph. These options take effect at the global, session, and graph levels, respectively.

The following table lists only the configuration options supported by the current version. If an option is not listed in the table, it is reserved or applicable to other Ascend AI Processor versions.
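The following is a minimal sketch of how options might be grouped by the level at which they take effect. It assumes plain std::string maps as stand-ins for the option containers; the container type expected by GEInitialize, the Session constructor, and AddGraph may differ, and the helper names (MakeGlobalOptions, MakeSessionOptions, MakeGraphOptions) are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>

// Assumption: a std::string map stands in for the real option container type.
using Options = std::map<std::string, std::string>;

// Options intended for GEInitialize (listed as Global or all in the tables).
Options MakeGlobalOptions() {
    return {
        {"ge.graphRunMode", "0"},   // online inference
        {"ge.exec.deviceId", "0"},
        {"ge.deterministic", "0"},  // deterministic computing disabled
    };
}

// Options intended for the Session constructor.
Options MakeSessionOptions(int sessionDeviceId) {
    assert(sessionDeviceId >= 0);  // a logical device ID is expected
    return {
        {"ge.session_device_id", std::to_string(sessionDeviceId)},
        {"ge.exec.atomicCleanPolicy", "0"},
    };
}

// Options intended for AddGraph.
Options MakeGraphOptions() {
    return {
        {"ge.enableSingleStream", "false"},
    };
}
```

Keeping one map per level makes it easier to check each key against the Global/Session/Graph column before passing it on.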

Basic Functions

Options Key

Options Value

Required/Optional

Global/Session/Graph

ge.graphRunMode

Graph run mode.

  • 0 (default): online inference
  • 1: training

Configuration example:

{"ge.graphRunMode", "0"};

Optional

Global/Session

ge.exec.deviceId

Logical ID of the operated device when the GE instance is running.

  • In the online inference scenario, the value ranges from –1 to N – 1. The default value is -1.
  • In the training scenario, the value ranges from 0 to N – 1. The default value is 0.

N indicates the number of available Ascend AI Processors on the server.
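As an illustration of the ranges above, the following sketch checks a candidate ge.exec.deviceId before it is placed in the options. RunMode, IsValidDeviceId, and DefaultDeviceId are hypothetical names and are not part of the GE API.

```cpp
#include <cassert>

// Mirrors the two ge.graphRunMode values (hypothetical enum).
enum class RunMode { kOnlineInference = 0, kTraining = 1 };

// Returns true if deviceId lies in the documented range for the run mode.
// deviceCount is N, the number of available Ascend AI Processors.
bool IsValidDeviceId(RunMode mode, int deviceId, int deviceCount) {
    assert(deviceCount > 0);
    const int lower = (mode == RunMode::kOnlineInference) ? -1 : 0;
    return deviceId >= lower && deviceId <= deviceCount - 1;
}

// Documented default for each run mode.
int DefaultDeviceId(RunMode mode) {
    return mode == RunMode::kOnlineInference ? -1 : 0;
}
```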

Configuration example:

{"ge.exec.deviceId", "-1"};

Optional

Global

ge.session_device_id

Logical ID of a device. Setting this parameter allows you to run different models on multiple devices by executing a single training script.

You can create multiple threads, each running its own session, and pass a different ge.session_device_id value to each session.

Configuration example:

{"ge.session_device_id", "0"};

Optional

session

ge.socVersion

Target model of the Ascend AI Processor for model build and optimization.

  • For the following products: Run the npu-smi info command on the server where Ascend AI Processor is installed to obtain the Name information. The actual value is AscendName. For example, if Name is xxxyy, the actual value is Ascendxxxyy.

    Atlas A2 training products / Atlas A2 inference products

    Atlas 200I/500 A2 inference products

    Atlas inference products

    Atlas training products

  • For the following products: Run the npu-smi info -t board -i id -c chip_id command on the server where Ascend AI Processor is installed to obtain the Chip Name and NPU Name information. The actual value is Chip Name_NPU Name. For example, if the value of Chip Name is Ascendxxx and the value of NPU Name is 1234, the actual value is Ascendxxx_1234. Note that:
    • id: device ID, which is the NPU ID obtained by running the npu-smi info -l command.
    • chip_id: chip ID, which is obtained by running the npu-smi info -m command.

    Atlas A3 training products / Atlas A3 inference products
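The two naming rules above can be sketched as simple string composition. The function names are hypothetical, and the inputs are assumed to be the values read from the npu-smi output.

```cpp
#include <cassert>
#include <string>

// Products that report a Name: the value is "Ascend" followed by Name.
std::string SocVersionFromName(const std::string& name) {
    assert(!name.empty());
    return "Ascend" + name;
}

// Products that report Chip Name and NPU Name: the value is
// "<Chip Name>_<NPU Name>".
std::string SocVersionFromChip(const std::string& chipName,
                               const std::string& npuName) {
    assert(!chipName.empty() && !npuName.empty());
    return chipName + "_" + npuName;
}
```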

Optional

all

ge.enableSingleStream

Whether to enable single-stream serial execution of a graph in the static shape scenario.

A stream preserves the order of the asynchronous operations queued for execution on the device.

Arguments:

  • true: The graph is executed in single-stream serial mode.
  • false (default): Multiple streams are executed concurrently during graph running.

Restrictions:

If the model contains the Cmo operator and the following control operators, the single-stream feature cannot be used. In this case, use the default value false.

  • Merge
  • Switch
  • Enter
  • RefEnter

Configuration example:

{"ge.enableSingleStream", "false"};

Optional

graph

ge.exec.rankTableFile

Information about the cluster participating in collective communication, including the organization information about the server, device, and container. Set this option to the ranktable file path, including the file name.

Optional

all

ge.exec.rankId

Rank ID, the ID of a process in a group. The value ranges from 0 to (rank size – 1). For a custom group, the rank starts from 0 in the group. For an HCCL world group, the rank ID is the same as the world rank ID.

  • world rank ID: indicates the rank ID of a process in an HCCL world group. The value ranges from 0 to (rank size – 1).
  • Local rank ID: indicates the rank ID of a process in a group on the server where the process is located. The value ranges from 0 to (local rank size – 1).

Optional

all

ge.constLifecycle

Lifecycle of constant nodes in training and online inference scenarios.

  • session: Constant nodes are stored at the session level. If session is used, memory reuse is supported for constant nodes between multiple graphs in a session. However, ensure that constant nodes with the same name in multiple graphs are the same.
  • graph: Constant nodes are stored at the graph level. You can call SetGraphConstMemoryBase to manage the const memory at the graph level.

The default value is session in the training scenario and graph in the online inference scenario.

Optional

all

ge.deterministic

Whether to enable deterministic computing.

By default, deterministic computing is disabled, and repeated executions of an operator on the same hardware with the same input may produce different results. This is generally caused by asynchronous multi-threaded execution in the operator implementation, which changes the accumulation order of floating-point numbers. When deterministic computing is enabled, an operator produces the same output each time it is executed on the same hardware with the same input, but operator execution often becomes slower. If a model produces different results across runs or its precision needs to be tuned, you can enable deterministic computing to assist model debugging and optimization.

Arguments:

  • 0 (default): Disables deterministic computing.
  • 1: Enables deterministic computing.

Configuration example:

{"ge.deterministic", "0"};

Optional

Global

ge.exec.frozenInputIndexes

Index of the input tensor whose address is not refreshed. This parameter can be set only for LoadGraph. The input tensor index varies according to the model.

  • Dynamic shape model: The index of the input tensor, address of the data on the device, and data length must be passed. The address must be in decimal format.
  • Static shape model: Only the index of the input tensor needs to be passed. Other parameters, such as data length, do not take effect.

Configuration example:

// Pass only the input tensor index.
{"ge.exec.frozenInputIndexes", "0;1;2"};
// Pass the input tensor index, address of the data on the device, and data length.
{"ge.exec.frozenInputIndexes", "0,88832131,4;1,888213294,4;2,193492421,2"};

For details about the examples and precautions, see Running a Graph Asynchronously in the Single-Process and Single-Device Mode.

Restrictions:

The input tensor whose address is not refreshed must have a static shape. For a dynamic shape model, the input tensor must also have a static shape.
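The two value formats of ge.exec.frozenInputIndexes can be assembled programmatically. The sketch below is illustrative only: FrozenInput and both helper functions are hypothetical names, and the device address is assumed to be available as an unsigned integer that is printed in decimal.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One frozen input of a dynamic shape model (hypothetical helper type).
struct FrozenInput {
    std::size_t index;            // input tensor index
    std::uint64_t deviceAddress;  // address of the data on the device
    std::uint64_t length;         // data length
};

// Static shape model: "index;index;...".
std::string FrozenIndexesStatic(const std::vector<std::size_t>& indexes) {
    assert(!indexes.empty());  // an empty list would give an empty value
    std::string value;
    for (std::size_t i = 0; i < indexes.size(); ++i) {
        if (i > 0) value += ';';
        value += std::to_string(indexes[i]);
    }
    return value;
}

// Dynamic shape model: "index,address,length;...", address in decimal.
std::string FrozenIndexesDynamic(const std::vector<FrozenInput>& inputs) {
    assert(!inputs.empty());
    std::string value;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        if (i > 0) value += ';';
        value += std::to_string(inputs[i].index) + ',' +
                 std::to_string(inputs[i].deviceAddress) + ',' +
                 std::to_string(inputs[i].length);
    }
    return value;
}
```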

Optional

graph

ge.exec.hostInputIndexes

Indexes of the input tensors whose placement attribute is host in the in-line copy scenario. Use semicolons (;) to separate multiple input tensor indexes.

In-line copy refers to the process of copying the input tensor data from host memory to device memory when the operator address of the model is updated.

Configuration example:

{"ge.exec.hostInputIndexes", "0;1;2"};

Restrictions:

  • This parameter applies only to static shape models.
  • The in-line copy feature is not suitable for large data size input, as the performance may deteriorate.
  • If the model has only one input, this parameter cannot be used together with ge.exec.frozenInputIndexes. If the model has multiple inputs, this parameter cannot be used together with ge.exec.frozenInputIndexes to specify the same input.
  • After this parameter is loaded through options, RunGraphAsync cannot be used as the subsequent execution API.

Optional

graph

Memory Management

Options Key

Options Value

Required/Optional

Global/Session/Graph

ge.exec.disableReuseMemory

Memory reuse switch.

  • 1: memory reuse disabled
  • 0 (default): memory reuse enabled

Configuration example:

{"ge.exec.disableReuseMemory", "0"};

Optional

all

ge.exec.atomicCleanPolicy

Whether to collectively clean up the memory occupied by all operators with the memset attribute (memset operators) on the network.

Arguments:

  • 0 (default): Enables collective cleanup.
  • 1: Disables collective cleanup. The memory used by each memset operator is cleaned up separately. If the memset operators on the network occupy too much memory, you are advised to use this mode to reduce memory usage. However, this may cause performance loss.

Configuration example:

{"ge.exec.atomicCleanPolicy", "0"};

Optional

session

ge.externalWeight

When multiple models are loaded in a session, if the weights of these models can be reused, you are advised to use this configuration item to externalize the weights of the Const/Constant nodes on the network to implement weight reuse among multiple models and reduce the memory usage of the weights.

Arguments:

  • 0 (default): The weights are not externalized and are directly saved in the graph.
  • 1: The weights are externalized, the weight files of all Const/Constant nodes on the network are flushed to the disk, and the node type is converted to FileConstant. Weights of different nodes are stored in different files named as weight_<hash>.

Description of the file flush path:

  • If ge.externalWeightDir is configured, the weight files are flushed to the specified directory.
  • Otherwise, if the ASCEND_WORK_PATH environment variable is configured, the weight files are flushed to the ${ASCEND_WORK_PATH}/tmp_weight_<pid>_<sessionid> directory.
  • If neither is configured, the weight files are flushed to the tmp_weight_<pid>_<sessionid> directory under the current execution directory.

Priority of the flush path: ge.externalWeightDir > ${ASCEND_WORK_PATH}/tmp_weight_<pid>_<sessionid> > current execution directory tmp_weight_<pid>_<sessionid>

When the model is unloaded, the tmp_weight_<pid>_<sessionid> directory is deleted.
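The flush path priority can be sketched as the following lookup. ResolveWeightDir is a hypothetical helper; it assumes that empty strings mean "not configured" and that <pid> and <sessionid> are rendered as decimal numbers.

```cpp
#include <cassert>
#include <string>

// Returns the directory that weight files would be flushed to, following
// the documented priority. An empty string means "not configured".
std::string ResolveWeightDir(const std::string& externalWeightDir,
                             const std::string& ascendWorkPath,
                             unsigned long pid, unsigned long sessionId) {
    // 1. ge.externalWeightDir has the highest priority.
    if (!externalWeightDir.empty()) return externalWeightDir;

    const std::string leaf = "tmp_weight_" + std::to_string(pid) + "_" +
                             std::to_string(sessionId);
    // 2. ${ASCEND_WORK_PATH}/tmp_weight_<pid>_<sessionid>.
    if (!ascendWorkPath.empty()) return ascendWorkPath + "/" + leaf;
    // 3. tmp_weight_<pid>_<sessionid>, relative to the current execution
    //    directory.
    return leaf;
}
```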

Configuration example:

{"ge.externalWeight", "1"};

Optional

Global/Session

ge.externalWeightDir

Flush path for the external weight file.

Restrictions:

  • To specify the flush path for the external weight file, use this parameter together with ge.externalWeight.
  • Priority: ge.externalWeightDir > ${ASCEND_WORK_PATH}/tmp_weight_<pid>_<sessionid> > current execution directory tmp_weight_<pid>_<sessionid>

Configuration example:

{"ge.externalWeight", "1"};
{"ge.externalWeightDir", "$HOME/your_tmp_path"};

Optional

Global/Session

ge.exec.staticMemoryPolicy

Memory allocation mode used during network running.

Arguments:

  • 0 (default): dynamic memory allocation. Memory is dynamically allocated based on the actual size. Dynamic memory expansion is not supported.
  • 2: dynamic memory expansion for static shapes. In training and online inference scenarios, this option can be used to implement memory reuse between multiple graphs in the same session: memory is allocated based on the graph that requires the most memory. For example, if the current graph requires more memory than the previous graph, the memory of the previous graph is released and memory is reallocated based on the requirement of the current graph.
  • 3: dynamic memory expansion for dynamic shapes only. This solves the fragmentation problem during dynamic memory allocation and reduces the memory usage of dynamic-shape networks.
  • 4: dynamic memory expansion for both static and dynamic shapes.
NOTE:
  • This parameter cannot be set to 2 or 4 when multiple graphs are executed concurrently.
  • For compatibility with earlier versions, the value 1 is treated as 2, and the system dynamically expands the memory accordingly.
  • If this parameter is set to 3 or 4, memory usage is reduced, but performance may deteriorate.

Configuration example:

{"ge.exec.staticMemoryPolicy", "0"};

Optional

Global/Session

ge.featureBaseRefreshable

Whether the feature memory address can be refreshed. To manage the feature memory yourself and refresh its address multiple times, set this parameter to 1.

This parameter applies only to static shape graphs.

Arguments:

  • 0 (default): The feature memory address cannot be refreshed.
  • 1: The feature memory address of a model can be refreshed.

Configuration example:

{"ge.featureBaseRefreshable", "0"};

Optional

all

ge.exec.inputReuseMemIndexes

Whether to enable the memory reuse function of the input node of a graph. After the function is enabled, the memory of the input node can be reused as the intermediate memory required during model execution, reducing the memory peak.

The value is the index of the input node. If memory reuse is enabled for multiple input nodes, use commas (,) to separate multiple indexes. The index attribute of the input node is required, specifying the sequence number of the input. The index starts from 0.

Note:

  • If memory reuse of the input node is enabled, ensure that the input memory size is 32-byte aligned.
  • If a configured input index is greater than or equal to the number of inputs, the index is invalid and does not take effect.
  • If memory reuse of the input node is enabled, the memory data of the input node will be overwritten. After the graph is run, the memory data of the input node cannot be used.
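The alignment and index rules in the notes above can be checked before the option is set. The sketch below is an assumption-laden illustration: AlignUp32 and ReuseMemIndexes are hypothetical helpers, and out-of-range indexes are dropped up front because they would not take effect anyway.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t kAlignment = 32;  // required alignment in bytes

// Rounds a buffer size up to the next multiple of 32 bytes.
std::size_t AlignUp32(std::size_t bytes) {
    return (bytes + kAlignment - 1) / kAlignment * kAlignment;
}

// Builds a comma-separated index list, skipping indexes that are greater
// than or equal to the number of inputs.
std::string ReuseMemIndexes(const std::vector<std::size_t>& indexes,
                            std::size_t inputCount) {
    std::string value;
    for (std::size_t index : indexes) {
        if (index >= inputCount) continue;  // invalid index, no effect
        if (!value.empty()) value += ',';
        value += std::to_string(index);
    }
    return value;
}
```

The same helpers apply to ge.exec.outputReuseMemIndexes if the output count is passed instead of the input count.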

Configuration example:

{"ge.exec.inputReuseMemIndexes", "0,1,2"};

Optional

graph

ge.exec.outputReuseMemIndexes

Whether to enable the memory reuse function for the entire graph output. After the function is enabled, the memory of the entire graph output can be reused as the intermediate memory required during model execution, reducing the memory peak.

If this function is enabled, the value is the index of the entire graph output. If memory reuse is enabled for multiple outputs, use commas (,) to separate multiple indexes.

Note:

  • If memory reuse is enabled for the entire graph output, ensure that the output memory size is 32-byte aligned.
  • Output indexes are identified based on the output sequence of the entire graph. The index starts from 0.
  • If a configured output index is greater than or equal to the number of outputs, the index is invalid and does not take effect.

Configuration example:

{"ge.exec.outputReuseMemIndexes", "0,1,2"};

Optional

graph

ge.exec.input_fusion_size

Threshold for fusing and copying multiple discrete pieces of user input data during data transfer from the host to the device. The minimum value is 0, the maximum value is 32 MB (33,554,432 bytes), and the default value is 128 KB (131,072 bytes).

  • If the size of input data is less than or equal to the threshold, the data is fused and then transferred from the host to the device.
  • If the size of input data is greater than the threshold or the threshold is 0 (the function is disabled), the data is not fused but is directly transferred from the host to the device.

Assume there are 10 user inputs, including two 100 KB inputs, two 50 KB inputs, and the other inputs greater than 100 KB:

  • ge.exec.input_fusion_size = 100 KB: The preceding four inputs are fused into 300 KB data for transfer. The other six inputs are directly transferred from the host to the device.
  • ge.exec.input_fusion_size = 0: This function is disabled and no inputs are fused. That is, the 10 inputs are directly transferred from the host to the device.
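The example above can be reproduced with a small model of the threshold rule. This is only a sketch of the documented behavior, not the GE implementation: FusionPlan and PlanInputFusion are hypothetical names, and each input is assumed to be compared with the threshold individually.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kMaxThreshold = 33554432;  // 32 MB, documented maximum

// Which inputs are fused before the copy and which are copied directly.
struct FusionPlan {
    std::vector<std::size_t> fused;   // indexes of fused inputs
    std::vector<std::size_t> direct;  // indexes of directly copied inputs
    std::size_t fusedBytes = 0;       // total size of the fused data
};

// An input is fused when its size is at most the threshold.
// A threshold of 0 disables fusion altogether.
FusionPlan PlanInputFusion(const std::vector<std::size_t>& inputSizes,
                           std::size_t threshold) {
    assert(threshold <= kMaxThreshold);
    FusionPlan plan;
    for (std::size_t i = 0; i < inputSizes.size(); ++i) {
        if (threshold != 0 && inputSizes[i] <= threshold) {
            plan.fused.push_back(i);
            plan.fusedBytes += inputSizes[i];
        } else {
            plan.direct.push_back(i);
        }
    }
    return plan;
}
```

With two 100 KB inputs, two 50 KB inputs, and six larger inputs, a 100 KB threshold yields four fused inputs totaling 300 KB and six direct copies; a threshold of 0 yields ten direct copies.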

This parameter takes effect only when a static graph is run asynchronously, that is, when the graph is run through the RunGraphAsync API.

Optional

all

ge.inputBatchCpy

Whether to enable the batch memory copy function when input data is transferred from the host to the device.

The function controlled by this parameter improves the performance of data transfer from the host to the device. It applies to the scenario where data needs to be frequently transferred and the PCIe bandwidth usage is low. After the function is enabled, bandwidth utilization can be improved.

Arguments:

  • 1: The batch memory copy function is enabled. This value takes effect only when the number of user inputs is greater than 1.
  • 0 (default): The batch memory copy function is disabled.

Restrictions:

  • This parameter can be used only by the following products:

    Atlas A3 training products / Atlas A3 inference products

    Atlas A2 training products / Atlas A2 inference products

  • The function is usually used in multi-session scenarios. Considering that the number of inputs in different sessions may be different, you are advised to set this parameter at the session level and determine whether to enable this function based on the input. The setting at the global or graph level is not recommended.
  • This parameter is passed during session initialization. When the graph is run later, the function takes effect only if the graph is run through the RunGraphAsync API.
  • If the number of initial network inputs is 1, the batch copy function does not take effect even if it is configured.
  • If both ge.exec.input_fusion_size (fusion and copy) and ge.inputBatchCpy (batch copy) are configured, the threshold for the fusion and copy function may affect the batch copy function.

    For example, if a user has five inputs and four of them meet the threshold for the fusion and copy function, the fusion and copy function is performed on the four inputs, and the batch copy function is not performed on the remaining input.

Configuration example:

{"ge.inputBatchCpy", "0"};

Optional

all

Dynamic Shape

Options Key

Options Value

Required/Optional

Global/Session/Graph

ge.inputShape

Shape of the model input. For online inference or training, this parameter is mandatory only in dynamic profile scenarios, and does not take effect in non-dynamic profile scenarios even if it is configured.

In dynamic profile scenarios, if one or more dimension values of the input data in the original model are not fixed, the model can be converted by setting the shape profile.

Arguments:

  • The model has a non-static shape.

    When setting ge.inputShape, set the corresponding dimension value to -1. This parameter must be used together with ge.dynamicDims and ge.dynamicNodeType.

  • The model has a scalar+dynamic-profile shape.

    If the model input has both scalar shape and dynamic-profile shape, the scalar input must be configured. For example, if a model has three inputs: A:[-1,c1,h1,w1], B:[], and C:[n2,c2,h2,w2], the shape information is "A:-1,c1,h1,w1;B:;C:n2,c2,h2,w2". Scalar input B must be configured.

Configuration example:

  • Non-static shape:

    For details about how to set profiles for a specified dimension, see ge.dynamicDims.

  • Scalar+dynamic-profile shape

    The scalar input must be configured. For example, if the model has three inputs with the shapes A:[-1,32,208,208], B:[], and C:[16,64,208,208], the configuration is as follows (A is the dynamic profile input and uses dynamic dimension profiles):

    {"ge.inputShape", "A:-1,32,208,208;B:;C:16,64,208,208"}, 
    {"ge.dynamicDims", "1;2;4"} 

Optional

session/graph

ge.dynamicDims

Dynamic dimension profile in ND format. This parameter applies to scenarios where the processed dimensions can vary between inferences. This parameter must be used together with ge.inputShape. For an example, see Graph Build and Run.

Argument: formatted as "dim1,dim2,dim3;dim4,dim5,dim6;dim7,dim8,dim9"

Format: Enclose the whole argument in double quotation marks (""), separate the profiles by semicolons (;), and separate values within each profile by commas (,). The dimension size values match the -1 placeholders in the ge.inputShape parameter with ordering preserved, and the number of -1 placeholders equals the number of dimension sizes of each profile. More than one dynamic dimension profile must be provided.

Restrictions:

  • For the following products, the number of profiles must be in the range (1, 100]. That is, at least two and at most 100 profiles can be set. Three to four profiles are recommended.

    Atlas A3 training products / Atlas A3 inference products

    Atlas A2 training products / Atlas A2 inference products

    Atlas training products

    Atlas inference products

Examples:

  • If the model has only one input:
    {"ge.inputShape", "data:1,-1,-1"}
    {"ge.dynamicDims", "1,2;3,4;5,6;7,8"}
    // During graph running, the supported shape of the data operator is 1,1,2; 1,3,4; 1,5,6; 1,7,8.
  • If the network model has multiple inputs:

    The dimension size values match the -1 placeholders in the argument with ordering preserved, and the number of -1 placeholders equals the number of dimension sizes of each profile. Assume that a network model has three inputs: data (1,1,40,T), label (1,T), and mask (T,T), where T indicates a dynamic dimension. A configuration example is as follows:

    {"ge.inputShape", "data:1,1,40,-1;label:1,-1;mask:-1,-1" }
    {"ge.dynamicDims", "20,20,1,1;40,40,2,2;80,60,4,4"}
    // During graph running, the following input dims combinations are supported:
    // Profile 0: data(1,1,40,20)+label(1,20)+mask(1,1)
    // Profile 1: data(1,1,40,40)+label(1,40)+mask(2,2)
    // Profile 2: data(1,1,40,80)+label(1,60)+mask(4,4)
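To make the placeholder matching concrete, the following sketch expands a ge.inputShape pattern and a ge.dynamicDims value into the concrete shapes of each profile. It is a simplified model of the rule described above, not GE code: all names are hypothetical, and the last colon in each entry is assumed to separate the input name from its dimensions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using Shape = std::vector<std::int64_t>;
using NamedShapes = std::vector<std::pair<std::string, Shape>>;

// Splits text on a delimiter; an empty text yields no parts.
std::vector<std::string> Split(const std::string& text, char delimiter) {
    std::vector<std::string> parts;
    std::string part;
    std::istringstream stream(text);
    while (std::getline(stream, part, delimiter)) parts.push_back(part);
    return parts;
}

// Parses "name:d0,d1;name2:;..." (a scalar input has no dimensions).
NamedShapes ParseInputShape(const std::string& inputShape) {
    NamedShapes shapes;
    for (const std::string& entry : Split(inputShape, ';')) {
        const std::size_t colon = entry.rfind(':');
        if (colon == std::string::npos) {
            throw std::invalid_argument("missing ':' in " + entry);
        }
        Shape dims;
        for (const std::string& dim : Split(entry.substr(colon + 1), ',')) {
            dims.push_back(std::stoll(dim));
        }
        shapes.emplace_back(entry.substr(0, colon), dims);
    }
    return shapes;
}

// Replaces the -1 placeholders, in order, with the values of each profile.
std::vector<NamedShapes> ExpandProfiles(const std::string& inputShape,
                                        const std::string& dynamicDims) {
    const NamedShapes pattern = ParseInputShape(inputShape);
    std::vector<NamedShapes> profiles;
    for (const std::string& profile : Split(dynamicDims, ';')) {
        const std::vector<std::string> values = Split(profile, ',');
        NamedShapes concrete = pattern;
        std::size_t next = 0;
        for (auto& input : concrete) {
            for (std::int64_t& dim : input.second) {
                if (dim != -1) continue;
                if (next >= values.size()) {
                    throw std::invalid_argument("too few values: " + profile);
                }
                dim = std::stoll(values[next++]);
            }
        }
        if (next != values.size()) {
            throw std::invalid_argument("too many values: " + profile);
        }
        profiles.push_back(concrete);
    }
    return profiles;
}
```

For the multi-input example above, profile 2 expands to data (1,1,40,80), label (1,60), and mask (4,4).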

Optional

session/graph

ge.dynamicNodeType

Type of a dynamic input node.

  • 0: dataset input
  • 1: placeholder input

Only one type of dynamic input is allowed: dataset or placeholder.

Configuration example:

{"ge.dynamicNodeType", "0"};

Optional

session/graph

ac_parallel_enable

Whether to allow AI CPU operators and AI Core operators to run in parallel in a dynamic shape graph.

In a dynamic shape graph, when this function is enabled, the system automatically identifies AI CPU operators that can be run in parallel with the AI Core operators in the graph. Operators of different engines are distributed to different streams to run in parallel, improving resource utilization and dynamic shape execution performance.

Arguments:

  • 1: AI CPU operators and AI Core operators are allowed to run in parallel.
  • 0 (default): AI CPU operators are not separately distributed.

Configuration example:

{"ac_parallel_enable", "1"};

Optional

Global

ge.exec.hostSchedulingMaxThreshold

Maximum threshold to enable dynamic shape scheduling when a static small graph (root graph) is executed. The default value is 0. It is recommended that this parameter be used in foundation model scenarios.

  • If the number of static root graph nodes is less than the maximum threshold, dynamic shape scheduling is used. For foundation models, this mode saves stream resources.
  • If the number of static root graph nodes is greater than the maximum threshold, the original process remains unchanged.

Note: If the static root graph node contains subgraphs, this parameter does not take effect.

Optional

all

ge.exec.static_model_ops_lower_limit

Lower limit of the number of operators in a static subgraph. The value ranges from –1 to positive infinity. If other values are used, an error is reported. The default value is 4.

  • ≥ 0:
    • If the number of operators is less than the lower limit, static subgraphs are not partitioned separately, but are merged into dynamic graphs for dynamic execution.
    • If the number of operators is greater than or equal to the lower limit, the original process remains unchanged (static subgraphs are still partitioned).
  • -1: All operators are executed through dynamic graphs.

For example, if there are four operators in a static subgraph and this parameter is set to 10, static subgraphs are not partitioned separately, but are executed through dynamic graphs.
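The decision rule for ge.exec.static_model_ops_lower_limit can be summarized in a few lines. The sketch below only models the documented rule; SubgraphExecution and DecideSubgraphExecution are hypothetical names.

```cpp
#include <cassert>

// How a static subgraph ends up being executed (hypothetical enum).
enum class SubgraphExecution { kStaticPartition, kMergedIntoDynamic };

// operatorCount: number of operators in the static subgraph.
// lowerLimit: value of ge.exec.static_model_ops_lower_limit (-1 or >= 0).
SubgraphExecution DecideSubgraphExecution(long operatorCount, long lowerLimit) {
    assert(lowerLimit >= -1);  // other values are reported as errors
    if (lowerLimit == -1) {
        // All operators are executed through dynamic graphs.
        return SubgraphExecution::kMergedIntoDynamic;
    }
    return operatorCount < lowerLimit ? SubgraphExecution::kMergedIntoDynamic
                                      : SubgraphExecution::kStaticPartition;
}
```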

Optional

graph

Operator and Graph Build

Table 1

Options Key

Options Value

Required/Optional

Global/Session/Graph

ge.op_compiler_cache_mode

Disk cache mode for operator build.

Arguments:

  • enable (default): Cache is enabled for operator build. If it is enabled, operators with the same build configurations and operator configurations will not be built repeatedly, thus accelerating the build speed.
  • force: Cache is enabled for operator build, and the cache is forcibly refreshed. That is, the existing cache is cleared before the operator is rebuilt and added to the cache. For example, after Python changes, dependency library changes, or repository changes following operator optimization, set this parameter to force to clear the existing cache, and then change it back to enable so that the cache is not forcibly refreshed during each build.
  • disable: Cache is disabled for operator build, so operators are rebuilt during each build.

Restrictions:

  • To specify the disk cache path for operator build, use this parameter together with ge.op_compiler_cache_dir.
  • If it is set to force, the existing cache will be cleared. Therefore, it is not recommended for parallel program compilation, as this may cause the cache used by other models to be cleared, resulting in failures.
  • disable and force are recommended for publishing the final model.
  • If the repository changes after operator tuning, set this parameter to force to refresh the cache and then set it to enable for re-build. Otherwise, the new tuning repository cannot be applied, and the tuning application fails to be executed.
  • When the debugging function is enabled:
    • If ge.opDebugLevel is set to a non-zero value, the ge.op_compiler_cache_mode parameter configuration does not take effect, the operator build cache function is disabled, and all operators are rebuilt.
    • If op_debug_config is not empty and op_debug_list is not configured, the ge.op_compiler_cache_mode parameter configuration does not take effect, the operator build cache function is disabled, and all operators are rebuilt.
    • If op_debug_config is not empty and op_debug_list is configured in the configuration file:
      • For operators in the list, ignore the ge.op_compiler_cache_mode parameter configuration and continue with the rebuild.
      • For operators out of the list, if ge.op_compiler_cache_mode is set to enable or force, the cache function is enabled. If it is set to disable, the cache function is disabled and the operators are rebuilt.
  • When you enable the operator build cache function, set the disk space of the cache folder by using the configuration file (with the op_cache.ini file automatically generated in the path specified by ge.op_compiler_cache_dir after operator build) or using environment variables.
    1. Using the op_cache.ini configuration file:

      If the op_cache.ini file does not exist, manually create it. Open the file and add the following information:

      # Configure the file format (required). The automatically generated file contains the following information by default. When manually creating a file, enter the following information:
      [op_compiler_cache]
      # Limit the disk space of the cache folder on a chip, in MB. The default value is 500. The value must be an integer.
      max_op_cache_size=500
      # Set the ratio of the cache size to be reserved, in percentage. The value range is [1, 100]. The default value is 50. For example, 80 indicates that when the cache space is insufficient, 80% of the cache space is reserved and the rest is cleared up.
      remain_cache_size_ratio=50    
      • The op_cache.ini file takes effect only when the values of max_op_cache_size and remain_cache_size_ratio in the preceding file are valid.
      • If the size of the build cache file exceeds the value of max_op_cache_size and the cache file is not accessed for more than half an hour, the cache file will be aged. (Operator build will not be interrupted due to the size of the build cache file exceeding the set limit. Therefore, if max_op_cache_size is set to a small value, the size of the actual build cache file may exceed the configured value.)
      • To disable the build cache aging function, set max_op_cache_size to -1. In this case, the access time is not updated when the operator cache is accessed, the operator build cache is not aged, and the default disk space of 500 MB is used.
      • If multiple users use the same cache path, you are advised to use the configuration file to set the cache path. In this scenario, the op_cache.ini file affects all users.
    2. Using environment variables

      In this scenario, the environment variable ASCEND_MAX_OP_CACHE_SIZE is used to limit the storage space of the cache folder of a chip. When the build cache space reaches the specified value and the cache file is not accessed for more than half an hour, the cache file is aged. The environment variable ASCEND_REMAIN_CACHE_SIZE_RATIO is used to set the ratio of the cache space to be reserved.

      A configuration example is as follows:

      # The ASCEND_MAX_OP_CACHE_SIZE environment variable defaults to 500, in MB. The value must be an integer.
      export ASCEND_MAX_OP_CACHE_SIZE=500
      # The value range of the ASCEND_REMAIN_CACHE_SIZE_RATIO environment variable is [1, 100]. The default value is 50, in percentage. For example, 80 indicates that when the cache space is insufficient, 80% of the cache space is reserved and the rest is cleared up.
      export ASCEND_REMAIN_CACHE_SIZE_RATIO=50
      • The argument configured through environment variables takes effect only for the current user.
      • To disable the build cache aging function, set the environment variable ASCEND_MAX_OP_CACHE_SIZE to -1. In this case, the access time is not updated when the operator cache is accessed, the operator build cache is not aged, and the default disk space of 500 MB is used.

    If both the op_cache.ini file and the environment variables are configured, the configuration items in the op_cache.ini file take precedence. If neither is configured, the system default values (500 MB disk space and 50% reserved cache space) are used.

Configuration example:

{"ge.op_compiler_cache_mode", "enable"};

Optional

all

ge.op_compiler_cache_dir

Disk cache directory for operator build.

Format: The directory can contain letters (a–z, A–Z), digits (0–9), underscores (_), hyphens (-), and periods (.).

Default value: $HOME/atc_data

  • If the specified directory exists and is valid, a kernel_cache subdirectory is automatically created in this directory. If the specified directory does not exist but is valid, the system automatically creates this directory and the kernel_cache subdirectory.
  • Do not store your own files in the default cache directory. They will be deleted together with the default cache directory during software package installation or upgrade.
  • A non-default cache directory specified by this parameter is not deleted during software package installation or upgrade.
  • In addition to ge.op_compiler_cache_dir, the ASCEND_CACHE_PATH environment variable can be used to set the disk cache directory for operator build. The priorities of the configuration methods are as follows: ge.op_compiler_cache_dir > ASCEND_CACHE_PATH > default storage directory.

Optional

all

ge.graph_compiler_cache_dir

Disk cache directory for graph build. This parameter is used together with ge.graph_key. This function takes effect only when both ge.graph_compiler_cache_dir and ge.graph_key are not empty.

The configured cache directory must exist. Otherwise, the build will fail.

After a graph is changed, the original cache file is unavailable. You need to manually delete the cache file from the cache directory or modify ge.graph_key to rebuild and generate a cache file.

For details about other restrictions and usage methods, see Graph Build Cache.

Optional

session

ge.graph_key

Unique graph ID. The value contains a maximum of 128 characters, including only letters (A–Z, a–z), digits (0–9), underscores (_), and hyphens (-).
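A candidate ge.graph_key can be validated against the constraints above before use. IsValidGraphKey is a hypothetical helper; it additionally treats an empty key as invalid, on the assumption that the graph build cache requires a non-empty key.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if key has 1 to 128 characters, all of which are letters,
// digits, underscores, or hyphens.
bool IsValidGraphKey(const std::string& key) {
    constexpr std::size_t kMaxLength = 128;
    if (key.empty() || key.size() > kMaxLength) return false;
    for (char c : key) {
        const bool allowed = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                             (c >= '0' && c <= '9') || c == '_' || c == '-';
        if (!allowed) return false;
    }
    return true;
}
```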

Optional

graph

ge.optimizationSwitch

Fusion pattern (pass) control switch used during operator build.

The difference between this parameter and ge.fusionSwitchFile is as follows: This parameter applies to all patterns. It can be used to specify a fusion pattern without a JSON file. ge.fusionSwitchFile can only be used to disable the graph fusion and UB fusion patterns, and a JSON file needs to be configured separately. If both parameters are set and the same fusion pattern is configured, the setting of ge.optimizationSwitch takes precedence.

Argument: Passname1:on;Passname2:off. Multiple key-value pairs can be concatenated. key is the pass name, and value can be set to on (enabled) or off (disabled). Case-sensitive matching is not supported. Multiple groups of configurations are separated by semicolons (;). For details about the fusion patterns that can be configured, see Fusion Pattern List.

Configuration example:

{"ge.optimizationSwitch", "Passname1:on;Passname2:off"};
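When many passes need to be toggled, the value string can be assembled programmatically. The following sketch assumes a hypothetical helper, BuildOptimizationSwitch, and uses the placeholder pass names from the example above; it is not part of the GE API.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: joins (pass name, enabled) pairs into the
// "Passname1:on;Passname2:off" form expected by ge.optimizationSwitch.
std::string BuildOptimizationSwitch(
    const std::vector<std::pair<std::string, bool>>& passes) {
  std::string value;
  for (const auto& pass : passes) {
    if (!value.empty()) value += ';';  // groups are separated by semicolons
    value += pass.first;
    value += pass.second ? ":on" : ":off";
  }
  return value;
}
```

The returned string can then be used as the option value, for example {"ge.optimizationSwitch", value}.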

Optional

all

ge.topoSortingMode

Traversal mode for operator build in graph mode. It is mainly used in online inference scenarios.

Arguments:

  • 0: Breadth-first search (BFS)
  • 1 (default): Depth-first search (DFS)
  • 2: Reverse DFS (RDFS)
  • 3: Stable RDFS. For existing operators in the graph, the computation sequence is not changed. For new operators in the graph, RDFS is used.

Configuration example:

{"ge.topoSortingMode", "1"};

Optional

all

ge.aicoreNum

Number of AI Cores used for operator build.

Argument: "integer 1|integer 2", separated by a vertical bar (|).

  • Scenario 1: For the following products, integer 1 indicates the number of Cube Cores in the AI Core used for operator build, and integer 2 indicates the number of Vector Cores in the AI Core used for operator build. Both integer 1 and integer 2 must be greater than 0 and less than or equal to the maximum numbers of Cube Cores and Vector Cores included in the Ascend AI Processor.

    Atlas A3 training products / Atlas A3 inference products

    Atlas A2 training products / Atlas A2 inference products

  • Scenario 2: For the following products, only integer 1 needs to be configured in the format of "integer 1|", indicating the number of AI Cores used for operator build. If integer 2 is configured, it does not take effect.

    Atlas inference products

    Atlas training products

Restrictions:

  • For scenario 1 of the argument:
    You can view the maximum numbers of Cube Cores and Vector Cores of different Ascend AI Processors in the ${INSTALL_DIR}/<arch>-linux/data/platform_config/xxx.ini file. The following information indicates that there are 24 Cube Cores and 48 Vector Cores on the Ascend AI Processor:
    [SoCInfo]
    # Use the default parameter values, which are the maximum values.
    ai_core_cnt=24
    cube_core_cnt=24
    vector_core_cnt=48
  • For scenario 2 of the argument:
    You can view the maximum number of AI Cores contained in different Ascend AI Processors in the ${INSTALL_DIR}/<arch>-linux/data/platform_config/xxx.ini file. The following information indicates that there are 10 AI Cores on the Ascend AI Processor:
    [SoCInfo]
    # Use the default parameter value, which indicates the maximum number of AI Cores.
    ai_core_cnt=10
    vector_core_cnt=8
  • If the operator build cache function is enabled (ge.op_compiler_cache_mode set to enable or force; default value: enable) and this parameter is configured, this parameter takes effect only during the first build. To make it take effect in subsequent builds, clear the operator build disk cache first.

Replace ${INSTALL_DIR} with the CANN component directory. For example, if the installation is performed by the root user, the default file storage path is /usr/local/Ascend/cann. <arch> indicates the OS architecture and xxx varies depending on the product.

Configuration example:

  • Configuration example for scenario 1:
    {"ge.aicoreNum", "24|48"};
  • Configuration example for scenario 2:
    {"ge.aicoreNum", "10|"};
    Or
    {"ge.aicoreNum", "10"};
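For scenario 1, the range restriction can be checked before the value is formatted. The sketch below is an assumption-based illustration: MakeAicoreNum is a hypothetical helper, and the maxima are expected to be read by the caller from cube_core_cnt and vector_core_cnt in the platform configuration file.

```cpp
#include <string>

// Hypothetical helper for scenario 1: formats "cube|vector" for ge.aicoreNum.
// Returns an empty string if either count is not in the range (0, max].
std::string MakeAicoreNum(int cube_cores, int vector_cores,
                          int max_cube_cores, int max_vector_cores) {
  if (cube_cores <= 0 || cube_cores > max_cube_cores) return "";
  if (vector_cores <= 0 || vector_cores > max_vector_cores) return "";
  return std::to_string(cube_cores) + "|" + std::to_string(vector_cores);
}
```

With the maxima from the scenario 1 example (24 Cube Cores and 48 Vector Cores), MakeAicoreNum(24, 48, 24, 48) yields the value used in {"ge.aicoreNum", "24|48"}.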

Relationships between AI Cores, Cube Cores, and Vector Cores:

The definition of a Core helps you better understand the relationships between AI Cores, Cube Cores, and Vector Cores. A Core is a compute core with an independent scalar compute unit. Generally, the scalar compute unit provides multiple functions for the compute core, such as the single instruction multiple data (SIMD) instruction dispatch. Therefore, the scalar compute unit is also called the intra-core scheduling unit. The AI data processing core unit varies with products. Currently, there are the following types:

  • The AI data processing core unit is an AI Core:
    • In an AI Core, a Cube and a Vector share a Scalar scheduling unit, for example, Atlas training products.

    • In an AI Core, a Cube and a Vector have their own Scalar scheduling units, which are also called a Cube Core and a Vector Core. In this case, a Cube Core and a group of Vector Cores are defined as an AI Core. The number of AI Cores is usually calculated based on the number of Cube Cores, for example, Atlas A2 training products / Atlas A2 inference products.

  • The AI data processing core units are AI Cores and independent Vector Cores. The AI Cores and Vector Cores have independent Scalar scheduling units, for example, Atlas inference products.

Optional

Global/Session

ge.AllowMultiGraphParallelCompile

Whether to allow multiple threads to build multiple graphs in the same session. If this parameter is set to 1, the variable format cannot be converted. For details, see the description of ge.exec.variable_acc.

Arguments:

  • 0 (default): One thread can be used to build multiple graphs in the same session.
  • 1: Multiple threads can be used to build multiple graphs concurrently in the same session.

Restrictions:

  • If this parameter is set to 1, ge.exec.variable_acc cannot be set to True. Otherwise, an error is reported during verification.
  • If this parameter is set to 1, an error is reported immediately when resource operators cause rebuild of other graphs.

Configuration example:

{"ge.AllowMultiGraphParallelCompile", "1"};
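The first restriction can be checked on the application side before the options are passed on. This is a sketch only: HasParallelCompileConflict is a hypothetical helper, std::string stands in for ge::AscendString, and the comparison assumes the value of ge.exec.variable_acc is spelled exactly "True".

```cpp
#include <map>
#include <string>

// Hypothetical pre-check: returns true if ge.AllowMultiGraphParallelCompile
// is "1" while ge.exec.variable_acc is "True", a combination that is
// documented to fail verification.
bool HasParallelCompileConflict(
    const std::map<std::string, std::string>& options) {
  const auto parallel = options.find("ge.AllowMultiGraphParallelCompile");
  const auto var_acc = options.find("ge.exec.variable_acc");
  return parallel != options.end() && parallel->second == "1" &&
         var_acc != options.end() && var_acc->second == "True";
}
```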

Optional

Global/Session

Debugging

Options Key

Options Value

Required/Optional

Global/Session/Graph

ge.exec.enable_exception_dump

Whether to dump data of the exception operator.
  • 0 (default): The dump function of the exception operator is disabled.
  • 1: The common exception dump (L1 exception dump) function is enabled, to dump the input and output data, tensor description (such as shape, dtype, and format), and workspace information of the exception operator.

    The dump data is stored in the following directories in descending order of priority: NPU_COLLECT_PATH > ASCEND_WORK_PATH > default directory (extra-info in the script execution directory)

  • 2: The Lite exception dump (L0 exception dump) is enabled to dump the input and output data, workspace information, and tiling information of the exception operator.

    The dump data is stored in the following directories in descending order of priority: ASCEND_WORK_PATH > default directory (/extra-info/data-dump/<device_id> in the script execution directory)

NOTE:
  • If the NPU_COLLECT_PATH environment variable is configured, only common exception dump information, including the input and output data of the exception operator, is collected, regardless of the value of ge.exec.enable_exception_dump. The dump data is stored in the directory specified by NPU_COLLECT_PATH.
  • L1 exception dump is common exception dump information, while L0 exception dump is lite exception dump information. Both of them export information such as the operator input and output and the workspace data. Compared with L0 exception dump, L1 exception dump provides more information. When L1 exception dump is enabled, the dtype information of each tensor is printed in the host application log file (plog), and the operator name and kernel related to the operator are also printed.

For details about the environment variable, see Environment Variables.

Configuration example:
std::map<ge::AscendString, ge::AscendString> ge_options = {{"ge.exec.enable_exception_dump", "0"}};

Optional

Global

ge.opDebugLevel

Whether to enable operator debugging. The values are as follows:

  • 0 (default): Disables operator debug. The operator build folder kernel_meta is not generated in the current execution path.
  • 1: Enables operator debug. The kernel_meta folder is generated in the current execution path, and the .o file (operator binary file), .json file (operator description file), and TBE instruction mapping files (operator file *.cce and python-CCE mapping file *_loc.json) are generated in the folder for later analysis of AI Core errors.
  • 2: Enables operator debug. The kernel_meta folder is generated in the current execution path, and the .o file (operator binary file), .json file (operator description file), and TBE instruction mapping files (operator file *.cce and python-CCE mapping file *_loc.json) are generated in the folder for later analysis of AI Core errors. Setting this option to 2 also disables build optimization and enables the CCE compiler debug function (the CCE compiler options are set to -O0 -g).
  • 3: Disables operator debug. The kernel_meta folder is generated in the current execution path, and the .o file (operator binary file) and .json file (operator description file) are generated in the folder. You can refer to these files when analyzing operator errors.
  • 4: Disables operator debug. The kernel_meta folder is generated in the current execution path, and the .o file (operator binary file), .json file (operator description file), TBE instruction mapping file (operator file *.cce), and UB fusion description file ({$kernel_name}_compute.json) are generated in the folder. These files can be used for problem reproduction and accuracy comparison during operator error analysis.
NOTICE:
  • If ge.opDebugLevel is set to 0 and op_debug_config is also set, the operator build directory kernel_meta is still generated in the current execution directory.
  • You are advised to set this parameter to 0 or 3 for training. To locate errors, set this parameter to 1 or 2, which might compromise the network performance.
  • If this option is set to 2, the CCE compiler is enabled, and the size of the operator kernel file (*.o file) increases. In the dynamic shape scenario, all possible shape scenarios are traversed during operator build, which may cause operator build failures due to large operator kernel files. In this case, you are advised not to enable the CCE compiler options.

    If a build failure is caused by the large operator kernel file, the following log is displayed:

    message:link error ld.lld: error: InputSection too large for range extension thunk ./kernel_meta_xxxxx.o
  • When the debug function is enabled, if the model contains the following merged compute and communication (MC2) operators, the *.o, *.json, and *.cce files of the operators are not generated in the operator build folder kernel_meta.

    MatMulAllReduce

    MatMulAllReduceAddRmsNorm

    AllGatherMatMul

    MatMulReduceScatter

    AlltoAllAllGatherBatchMatMul

    BatchMatMulReduceScatterAlltoAll
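The files generated in kernel_meta for each value of ge.opDebugLevel can be summarized as a lookup. The function below is a hypothetical summary transcribed from the list above, not a GE API. It ignores the NOTICE cases: kernel_meta is still generated at level 0 when op_debug_config is set, and the listed MC2 operators produce no *.o, *.json, or *.cce files.

```cpp
#include <set>
#include <string>

// Hypothetical lookup: file patterns expected in kernel_meta for a given
// ge.opDebugLevel value. Level 2 produces the same files as level 1 but
// additionally builds with -O0 -g.
std::set<std::string> KernelMetaArtifacts(int op_debug_level) {
  switch (op_debug_level) {
    case 1:
    case 2:
      return {"*.o", "*.json", "*.cce", "*_loc.json"};
    case 3:
      return {"*.o", "*.json"};
    case 4:
      return {"*.o", "*.json", "*.cce", "{$kernel_name}_compute.json"};
    default:
      return {};  // level 0: kernel_meta is not generated
  }
}
```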

Optional

all

op_debug_config

Global memory check switch.

The value is the path of the .cfg configuration file. Multiple options in the configuration file are separated by commas (,).

  • oom: Checks whether memory overwriting occurs in the global memory during operator execution.
    • Configuring this option retains the binary operator file (.o) and operator description file (.json) in the kernel_meta folder under the current execution directory during operator build.
    • If this option is used, the following detection logic is added during operator build. You can use the dump_cce option to view the following code in the generated .cce file:
      inline __aicore__ void  CheckInvalidAccessOfDDR(xxx) {
          if (access_offset < 0 || access_offset + access_extent > ddr_size) {
              if (read_or_write == 1) {
                  trap(0X5A5A0001);
              } else {
                  trap(0X5A5A0002);
              }
          }
      }
  • dump_cce: Retains the operator CCE file (.cce), binary operator file (.o), and operator description file (.json) in the kernel_meta folder under the current execution directory during operator build.
  • dump_loc: Retains the python-CCE mapping file *_loc.json, binary operator file (.o), and operator description file (.json) in the kernel_meta folder under the current execution directory during operator build.
  • ccec_O0: Enables the CCEC option -O0 during operator build. This option disables compiler optimization, which facilitates later analysis of AI Core errors.
  • ccec_g: Enables the CCEC option -g during operator build. This option generates debugging information for later analysis of AI Core errors.
  • check_flag: Checks whether pipeline synchronization signals in operators match each other during operator execution.
    • Configuring this option retains the binary operator file (.o) and operator description file (.json) in the kernel_meta folder under the current execution directory during operator build.
    • If this option is used, the following detection logic is added during operator build. You can use the dump_cce option to view the following code in the generated .cce file:
        set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
        set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1);
        set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID2);
        set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID3);
        ....
        pipe_barrier(PIPE_MTE3);
        pipe_barrier(PIPE_MTE2);
        pipe_barrier(PIPE_M);
        pipe_barrier(PIPE_V);
        pipe_barrier(PIPE_MTE1);
        pipe_barrier(PIPE_ALL);
        wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
        wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1);
        wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID2);
        wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID3);
        ...

      During actual inference, if the pipeline synchronization signals in operators do not match each other, a timeout error is reported at the faulty operator, and the program is terminated. The following is an example of the error message:

      Aicore kernel execute failed, ..., fault kernel_name=operator name,...
      rtStreamSynchronizeWithTimeout execute failed....

Configuration example:

{"op_debug_config", "/root/test0.cfg"};

The information about the test0.cfg file is as follows:

op_debug_config = ccec_g,oom

Restrictions:

During operator build, if you want to build only some instead of all AI Core operators, you need to add the op_debug_list field to the test0.cfg configuration file. By doing so, only the operators specified in the list are built, based on the options configured in op_debug_config. The op_debug_list field has the following requirements:

  • The operator name or operator type can be specified.
  • Operators are separated by commas (,). The operator type is configured in the OpType::typeName format. The operator type and operator name can be configured in a mixed manner.
  • The operator to be built must be stored in the configuration file specified by op_debug_config.

Configuration example: Add the following information to the configuration file (for example, test0.cfg) specified by op_debug_config:

op_debug_config= ccec_g,oom
op_debug_list=GatherV2,opType::ReduceSum

During model compilation, the GatherV2 and ReduceSum operators are compiled based on the ccec_g and oom options.
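The content of such a configuration file can also be generated. The sketch below is an illustration under assumptions: BuildOpDebugCfg and JoinWithCommas are hypothetical helpers, the spacing around "=" mirrors the examples in this section, and the only validation performed is the documented rule that ccec_O0 and oom cannot both be enabled.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>
#include <vector>

// Joins items with commas, as required for both fields.
std::string JoinWithCommas(const std::vector<std::string>& items) {
  std::string out;
  for (const auto& item : items) {
    if (!out.empty()) out += ',';
    out += item;
  }
  return out;
}

// Hypothetical helper: builds the text of the .cfg file referenced by
// op_debug_config. Throws std::invalid_argument if ccec_O0 and oom are
// requested together.
std::string BuildOpDebugCfg(const std::vector<std::string>& options,
                            const std::vector<std::string>& op_list) {
  const bool has_o0 =
      std::find(options.begin(), options.end(), "ccec_O0") != options.end();
  const bool has_oom =
      std::find(options.begin(), options.end(), "oom") != options.end();
  if (has_o0 && has_oom) {
    throw std::invalid_argument("ccec_O0 and oom cannot both be enabled");
  }
  std::string cfg = "op_debug_config = " + JoinWithCommas(options) + "\n";
  if (!op_list.empty()) {
    // Only the listed operators are built with the options above.
    cfg += "op_debug_list=" + JoinWithCommas(op_list) + "\n";
  }
  return cfg;
}
```

The returned text would be written to a file such as /root/test0.cfg, whose path is then passed as the value of op_debug_config.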

NOTE:
  • When ccec compilation options (ccec_O0 and ccec_g) are enabled, the size of the operator kernel file (*.o file) increases. In dynamic shape scenarios, all possible scenarios are traversed during operator compilation, which may cause operator compilation failures due to large operator kernel files. In this case, do not enable the CCEC options.

    If the compilation failure is caused by large operator kernel files, the following log is displayed:

    message:link error ld.lld: error: InputSection too large for range extension thunk ./kernel_meta_xxxxx.o:(xxxx)

  • The ccec_O0 and oom options cannot both be enabled. Otherwise, an AI Core error may be reported. The following is an example of the error message:
    ...there is an aivec error exception, core id is 49, error code = 0x4 ...
  • If the NPU_COLLECT_PATH environment variable is configured, the global memory overwriting check cannot be enabled (that is, oom cannot be set in the configuration file specified by op_debug_config). Otherwise, an error is reported when the compiled model file or operator kernel package is used.
  • When the build options oom, dump_cce, and dump_loc are configured, if the model contains the following MC2 operators, the *.o, *.json, and *.cce files of the operators are not generated in the operator build folder kernel_meta.

    MatMulAllReduce

    MatMulAllReduceAddRmsNorm

    AllGatherMatMul

    MatMulReduceScatter

    AlltoAllAllGatherBatchMatMul

    BatchMatMulReduceScatterAlltoAll

Optional

Global

ge.debugDir

Directory of the debug-related process files generated during operator build, including the .o (operator binary file), .json (operator description file), and .cce files.

Files are generated in the current training script execution directory by default.

Restrictions:

  • If you want to specify the path for storing the process file of operator build, use ge.debugDir and ge.opDebugLevel together. If ge.opDebugLevel is set to 0, ge.debugDir cannot be used.
  • In addition to ge.debugDir, the ASCEND_WORK_PATH environment variable can be used to set the path for storing the debugging file generated during operator build. The priorities of the configuration methods are as follows: ge.debugDir > ASCEND_WORK_PATH > default storage path.

Optional

all

ge.exportCompileStat

Whether to generate the fusion_result.json result file of operator fusion information (including graph fusion and UB fusion) during graph build.

This file records the fusion patterns used during graph build. The ge.fusionSwitchFile parameter for precision comparison can be used to disable specified fusion patterns. Disabled fusion patterns are not displayed in the fusion_result.json file. In the file:

  • session_and_graph_id_xx_xx: thread and graph ID of the fusion result.
  • graph_fusion: graph fusion.
  • ub_fusion: UB fusion.
  • match_times: number of times that the fusion pattern is matched during graph build.
  • effect_times: actual number of times that the fusion takes effect.
  • repository_hit_times: number of times that the UB fusion repository is hit.

Arguments:

  • 0: The result file of operator fusion information is not generated.
  • 1 (default): The result file of operator fusion information is generated when the program exits normally.
  • 2: The result file of operator fusion information is generated when graph build is complete. If graph build is complete, the result file of operator fusion information is generated even if the program is interrupted in advance.
NOTE:

If the ASCEND_WORK_PATH environment variable is not set, the result file is generated in the current path where the script is executed by default. If the ASCEND_WORK_PATH environment variable is set, the result file is saved in $ASCEND_WORK_PATH/FE/${Process ID}/fusion_result.json.

Configuration example:

{"ge.exportCompileStat", "1"};
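The location rule in the NOTE above can be sketched as follows. FusionResultPath is a hypothetical helper (not a GE API); it assumes that an empty string means ASCEND_WORK_PATH is not set and that the caller supplies the process ID and the script execution directory.

```cpp
#include <string>

// Hypothetical helper: returns the expected location of fusion_result.json.
//  - ASCEND_WORK_PATH not set: the directory where the script is executed.
//  - ASCEND_WORK_PATH set:     $ASCEND_WORK_PATH/FE/${Process ID}/.
std::string FusionResultPath(const std::string& ascend_work_path,
                             long process_id,
                             const std::string& script_dir) {
  if (ascend_work_path.empty()) {
    return script_dir + "/fusion_result.json";
  }
  return ascend_work_path + "/FE/" + std::to_string(process_id) +
         "/fusion_result.json";
}
```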

Optional

Global

Precision Tuning

Options Key

Options Value

Required/Optional

Global/Session/Graph

ge.exec.precision_mode

Operator precision mode, which must be of the string type. This parameter cannot be used together with ge.exec.precision_mode_v2. You are advised to use ge.exec.precision_mode_v2.

  • force_fp32/cube_fp16in_fp32out:
    force_fp32 and cube_fp16in_fp32out have the same effect. This option indicates that the system selects different processing modes based on the operator type when the operator in the AI Core supports both the float32 and float16 data types. cube_fp16in_fp32out was added in the new version and has clearer semantics for cube operators.
    • For cube operators, the system processes the computation based on the operator implementation.
      1. The preferred input data type is float16 and the output data type is float32.
      2. If the float16 input data and float32 output data types are not supported, set both the input and output data types to float32.
      3. If the float32 input and output data types are not supported, set both the input and output data types to float16.
      4. If the float16 input and output data types are not supported, an error is reported.
    • For vector compute operators, float32 is forcibly selected if the operator precision in the original graph is float16 or bfloat16.

      This option is invalid if the original graph contains operators not supporting float32 in the AI Core, for example, an operator that supports only float16. In this case, float16 is retained. If the operator in the AI Core does not support float32 and it is configured to the blocklist of precision reduction (by setting precision_reduce to false), the counterpart AI CPU operator supporting float32 is used. If the AI CPU operator does not support float32, an error is reported.

  • force_fp16:

    Indicates that float16 is forcibly selected if the operator precision in the original graph is float16, bfloat16, or float32.

  • allow_fp32_to_fp16:
    • For matrix operators:
      • If the operator precision in the original graph is float32, the precision is preferably reduced to float16. If the operator in the AI Core does not support float16, float32 is used. If the operator in the AI Core does not support float32, the AI CPU operator is used for computation. If the AI CPU operator also does not support float32, an error is reported during execution.
      • If the operator precision in the original graph is bfloat16, the precision of the original graph is preferably used. If the operator in the AI Core does not support bfloat16, float32 is used. If the operator in the AI Core does not support float32, the precision is directly reduced to float16. If the operator in the AI Core does not support float16, the AI CPU operator is used for computation. If the AI CPU operator also does not support float16, an error is reported during execution.
    • For vector operators, the precision of the original graph is retained preferably.
      • If the operator precision in the original graph is float32, the precision of the original graph is preferably used. If the operator in the AI Core does not support float32, the precision is directly reduced to float16. If the operator in the AI Core does not support float16, the AI CPU operator is used for computation. If the AI CPU operator also does not support float16, an error is reported during execution.
      • If the operator precision in the original graph is bfloat16, the precision of the original graph is preferably used. If the operator in the AI Core does not support bfloat16, float32 is used. If the operator in the AI Core does not support float32, the precision is directly reduced to float16. If the operator in the AI Core does not support float16, the AI CPU operator is used for computation. If the AI CPU operator also does not support float16, an error is reported during execution.
  • must_keep_origin_dtype:

    Retain the original precision.

    • If the precision of an operator in the original graph is float16, and the implementation of the operator in the AI Core does not support float16 but supports only float32 and bfloat16, the system automatically uses high-precision float32.
    • If the precision of an operator in the original graph is float16, and the implementation of the operator in the AI Core does not support float16 but supports only bfloat16, the AI CPU operator of float16 is used. If the AI CPU operator is not supported, an error is reported.
    • If the precision of an operator in the original graph is float32, and the implementation of the operator in the AI Core does not support float32 but supports only float16, the AI CPU operator of float32 is used. If the AI CPU operator is not supported, an error is reported.
  • allow_mix_precision/allow_mix_precision_fp16:

    allow_mix_precision and allow_mix_precision_fp16 have the same effect, indicating that mixed precision of float16, bfloat16, and float32 is used for neural network processing. allow_mix_precision_fp16 was added in the new version and has clearer semantics.

    For float32 and bfloat16 operators in the original model, float16 is automatically used for certain float32 and bfloat16 operators based on the built-in tuning policy. This will improve system performance and reduce memory usage with minimal precision degradation.

    If this mode is configured, you can view the value of precision_reduce in the built-in tuning policy file of ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/config/xxx/aic-xxx-ops-info-*.json.

    • If it is set to true, the operator is on the mixed precision trustlist and its precision will be reduced from float32 and bfloat16 to float16.
    • If it is set to false, the operator is on the mixed precision blocklist and its precision will not be reduced from float32 and bfloat16 to float16. In this case, the operator still uses the precision of float32 or bfloat16.
    • If an operator in the network model does not have the precision_reduce option configured, the operator is on the graylist and will follow the same precision processing as the upstream operator.
  • allow_mix_precision_bf16:

    Mixed precision of bfloat16 and float32 is used for neural network processing. In this mode, bfloat16 is automatically used for certain float32 operators in the original model based on the built-in tuning policy. This will improve system performance and reduce memory usage with minimal precision degradation. If the operator in the AI Core does not support bfloat16 and float32, the AI CPU operator is used for computation. If the AI CPU operator also does not support bfloat16 and float32, an error is reported during execution.

    If this mode is configured, you can view the value of precision_reduce in the built-in tuning policy file of ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/config/xxx/aic-xxx-ops-info-*.json.

    • If it is set to true, the operator is on the mixed precision trustlist and its precision will be reduced from float32 to bfloat16.
    • If it is set to false, the operator is on the mixed precision blocklist and its precision will not be reduced from float32 to bfloat16.
    • If an operator in the network model does not have the precision_reduce option configured, the operator is on the graylist and will follow the same precision processing as the upstream operator.
  • allow_fp32_to_bf16:
    • If the operator precision in the original graph is float32, the precision of the original graph is preferably used. If the operator in the AI Core does not support float32, the precision is reduced to bfloat16. If the operator in the AI Core does not support bfloat16, the AI CPU operator is used for computation. If the AI CPU operator also does not support bfloat16, an error is reported during execution.
    • If the operator precision in the original graph is bfloat16, the precision of the original graph is preferably used. If the operator in the AI Core does not support bfloat16, float32 is used. If the operator in the AI Core does not support float32, the AI CPU operator is used for computation. If the AI CPU operator also does not support float32, an error is reported during execution.

Replace ${INSTALL_DIR} with the CANN component directory. For example, if the installation is performed by the root user, the default file storage path is /usr/local/Ascend/cann. xxx varies depending on the product.

Default values:

On Atlas training products, the default value in the training scenario is allow_fp32_to_fp16.

On Atlas A2 training products / Atlas A2 inference products, the default value in the training scenario is must_keep_origin_dtype.

On Atlas A3 training products / Atlas A3 inference products, the default value in the training scenario is must_keep_origin_dtype.

In the online inference scenario, the default value is force_fp16.

Restrictions:

The bfloat16 data type is supported only on the following products:

Atlas A2 training products / Atlas A2 inference products

Atlas A3 training products / Atlas A3 inference products

Atlas 200I/500 A2 inference products
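The fallback order described for cube operators under force_fp32/cube_fp16in_fp32out can be sketched as a simple preference list. This is an illustration only, not the actual selection logic of the graph engine: SelectCubeDtypes and DtypePair are hypothetical names, and the set of supported (input, output) data type pairs is assumed to be known to the caller.

```cpp
#include <set>
#include <stdexcept>
#include <string>
#include <utility>

// (input data type, output data type)
using DtypePair = std::pair<std::string, std::string>;

// Hypothetical sketch of the documented preference order:
//   1. float16 input, float32 output
//   2. float32 input and output
//   3. float16 input and output
//   4. otherwise an error is reported
DtypePair SelectCubeDtypes(const std::set<DtypePair>& supported) {
  const DtypePair order[] = {
      {"float16", "float32"},
      {"float32", "float32"},
      {"float16", "float16"},
  };
  for (const auto& candidate : order) {
    if (supported.count(candidate) > 0) return candidate;
  }
  throw std::runtime_error("no supported float16/float32 combination");
}
```

An operator implementation that supports only float32 input and output would therefore be resolved at step 2, and one that supports only float16 at step 3.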

Optional

all

ge.exec.precision_mode_v2

Operator precision mode, which must be of the string type. This parameter cannot be used together with ge.exec.precision_mode. You are advised to use ge.exec.precision_mode_v2.

  • fp16:

    Indicates that float16 is forcibly selected if the operator precision in the original graph is float16, bfloat16, or float32.

  • origin:

    Retain the original precision.

    • If the precision of an operator in the original graph is float16, and the implementation of the operator in the AI Core does not support float16 but supports only float32 and bfloat16, the system automatically uses high-precision float32.
    • If the precision of an operator in the original graph is float16, and the implementation of the operator in the AI Core does not support float16 but supports only bfloat16, the AI CPU operator of float16 is used. If the AI CPU operator is not supported, an error is reported.
    • If the precision of an operator in the original graph is float32, and the implementation of the operator in the AI Core does not support float32 but supports only float16, the AI CPU operator of float32 is used. If the AI CPU operator is not supported, an error is reported.
  • cube_fp16in_fp32out:
    The system selects a processing mode based on the operator type for AI Core operators supporting both float32 and float16.
    • For cube operators, the system processes the computation based on the operator implementation.
      1. The preferred input data type is float16 and the output data type is float32.
      2. If the float16 input data and float32 output data types are not supported, set both the input and output data types to float32.
      3. If the float32 input and output data types are not supported, set both the input and output data types to float16.
      4. If the float16 input and output data types are not supported, an error is reported.
    • For vector compute operators, float32 is forcibly selected if the operator precision in the original graph is float16 or bfloat16.

      This option is invalid if the original graph contains operators not supporting float32 in the AI Core, for example, an operator that supports only float16. In this case, float16 is retained. If the operator in the AI Core does not support float32 and it is configured to the blocklist of precision reduction (by setting precision_reduce to false), the counterpart AI CPU operator supporting float32 is used. If the AI CPU operator does not support float32, an error is reported.

  • mixed_float16:

    Mixed precision of float16, bfloat16, and float32 is used for neural network processing. For float32 and bfloat16 operators in the original graph, float16 is automatically used for certain float32 and bfloat16 operators based on the built-in tuning policy. This will improve system performance and reduce memory usage with minimal precision degradation.

    If this mode is configured, you can view the value of precision_reduce in the built-in tuning policy file of ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/config/xxx/aic-xxx-ops-info-*.json.

    • If it is set to true, the operator is on the mixed precision trustlist and its precision will be reduced from float32 and bfloat16 to float16.
    • If it is set to false, the operator is on the mixed precision blocklist and its precision will not be reduced from float32 and bfloat16 to float16. In this case, the operator still uses the precision of float32 or bfloat16.
    • If an operator in the network model does not have the precision_reduce option configured, the operator is on the graylist and will follow the same precision processing as the upstream operator.
  • mixed_bfloat16:

    Mixed precision of bfloat16 and float32 is used for neural network processing. In this mode, bfloat16 is automatically used for certain float32 operators in the original graph based on the built-in tuning policy. This will improve system performance and reduce memory usage with minimal precision degradation. If an operator supports neither bfloat16 nor float32 in the AI Core, the AI CPU operator is used for computation. If the AI CPU operator does not support these data types either, an error is reported during execution.

    If this mode is configured, you can view the value of precision_reduce in the built-in tuning policy file of ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/config/xxx/aic-xxx-ops-info-*.json.

    • If it is set to true, the operator is on the mixed precision trustlist and its precision will be reduced from float32 to bfloat16.
    • If it is set to false, the operator is on the mixed precision blocklist and its precision will not be reduced from float32 to bfloat16.
    • If an operator in the network model does not have the precision_reduce option configured, the operator is on the graylist and will follow the same precision processing as the upstream operator.
  • mixed_hif8:

    Enables automatic mixed precision, indicating that hifloat8 (for details about this data type, see Link), float16, bfloat16, and float32 are used together for neural network processing. In this mode, hifloat8 is automatically used for certain float16, bfloat16, and float32 operators in the original graph based on the built-in tuning policy. This will improve system performance and reduce memory usage with minimal precision degradation. The current version does not support this argument.

    If this mode is configured, you can view the value of precision_reduce in the built-in tuning policy file of ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/config/xxx/aic-xxx-ops-info-*.json.

    • If it is set to true, the operator is on the mixed precision trustlist and its precision will be reduced from float16, bfloat16, and float32 to hifloat8.
    • If it is set to false, the operator is on the mixed precision blocklist and its precision will not be reduced from float16, bfloat16, and float32 to hifloat8. In this case, the operator still uses the precision of float16, bfloat16, or float32.
    • If an operator in the original graph does not have the precision_reduce option configured, the operator is on the graylist and will follow the same precision processing as the upstream operator.
  • cube_hif8:

    The hifloat8 data type is forcibly used if the Cube operator in the original graph supports both hifloat8 and float16, bfloat16, or float32. The current version does not support this argument.

Replace ${INSTALL_DIR} with the CANN component directory. For example, if the installation is performed by the root user, the default file storage path is /usr/local/Ascend/cann. xxx varies depending on the product.
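
The preference order listed above for cube operators (float16 input with float32 output, then float32 for both, then float16 for both, otherwise an error) can be read as a simple first-match search. The following is a minimal sketch of that reading only. SelectCubeDtypes and DtypePair are hypothetical names introduced for illustration; they are not part of GE, and the real selection is performed inside the graph compiler.

```cpp
#include <set>
#include <string>
#include <utility>

// (input dtype, output dtype) combination an operator implementation supports.
using DtypePair = std::pair<std::string, std::string>;

// Hypothetical helper: walks the documented preference order and stores the
// first supported combination in *chosen. Returns false when none of the
// three combinations is supported, which is the case where an error is reported.
bool SelectCubeDtypes(const std::set<DtypePair>& supported, DtypePair* chosen) {
    const DtypePair preference[] = {
        {"float16", "float32"},  // 1. preferred combination
        {"float32", "float32"},  // 2. used when combination 1 is not supported
        {"float16", "float16"},  // 3. used when combination 2 is not supported
    };
    for (const DtypePair& candidate : preference) {
        if (supported.count(candidate) > 0) {
            *chosen = candidate;
            return true;
        }
    }
    return false;
}
```

For an operator that supports only the float32/float32 and float16/float16 combinations, the sketch selects float32 for both input and output.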

Default values:

In the training scenario on Atlas training products, this parameter has no default value. In this case, the default value of precision_mode (allow_fp32_to_fp16) is used.

In the training scenario on Atlas A2 training products / Atlas A2 inference products, the default value of this parameter is origin.

In the training scenario on Atlas A3 training products / Atlas A3 inference products, the default value of this parameter is origin.

In the online inference scenario, the default value of this parameter is fp16.

Restrictions:

The bfloat16 data type supports only the following products:

Atlas A2 training products / Atlas A2 inference products

Atlas A3 training products / Atlas A3 inference products

Atlas 200I/500 A2 inference products

Optional

all

ge.exec.modify_mixlist

When mixed precision is enabled, use this parameter to specify a JSON file that adjusts the blocklist, trustlist, and graylist, that is, which operators allow precision reduction and which do not. Set this parameter to the file path, including the file name.

You can view the flag value under precision_reduce in the built-in tuning policy file of ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/config/xxx/aic-xxx-ops-info-*.json. Replace ${INSTALL_DIR} with the CANN component directory. For example, if the installation is performed by the root user, the default file storage path is /usr/local/Ascend/cann. xxx varies depending on the product.

  • true (trustlist): Precision reduction is allowed in mixed precision mode.
  • false (blocklist): Precision reduction is not allowed in mixed precision mode.
  • Not specified (graylist): Operators on the graylist follow the same precision processing as their upstream operators.
Configuration example:
{"ge.exec.modify_mixlist", "/home/test/ops_info.json"};

You can specify the operator types in ops_info.json as follows. Separate operators with commas (,).

{
  "black-list": {                  // Blocklist
     "to-remove": [                // Move an operator from the blocklist to the graylist.
     "Xlog1py"
     ],
     "to-add": [                   // Move an operator from the trustlist or graylist to the blocklist.
     "Matmul",
     "Cast"
     ]
  },
  "white-list": {                  // Trustlist
     "to-remove": [                // Move an operator from the trustlist to the graylist.
     "Conv2D"
     ],
     "to-add": [                   // Move an operator from the blocklist or graylist to the trustlist.
     "Bias"
     ]
  }
}

The operators in the preceding example configuration file are for reference only. Base the actual configuration on the hardware environment and the built-in tuning policy of each operator. The following entry from the built-in tuning policy file shows how to check which list an operator is on:

"Conv2D":{
    "precision_reduce":{
        "flag":"true"
     }
},

true: trustlist; false: blocklist; Not configured: graylist.
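
The effect of the to-remove and to-add entries can be sketched as a small state change on top of the built-in classification. The code below is an illustration under stated assumptions, not the GE implementation: MixList, MixlistEdits, BuiltinList, and ApplyEdits are hypothetical names, the file is assumed to be already parsed into string lists, and the order in which the four lists are applied (removals first, then blocklist additions, then trustlist additions) is an assumption that matters only if an operator appears in more than one list.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

enum class MixList { kTrust, kBlock, kGray };

// Hypothetical in-memory form of the four lists in ops_info.json.
struct MixlistEdits {
    std::vector<std::string> black_to_remove;  // blocklist -> graylist
    std::vector<std::string> black_to_add;     // trustlist or graylist -> blocklist
    std::vector<std::string> white_to_remove;  // trustlist -> graylist
    std::vector<std::string> white_to_add;     // blocklist or graylist -> trustlist
};

// Built-in classification from the precision_reduce flag:
// "true" -> trustlist, "false" -> blocklist, no entry -> graylist.
MixList BuiltinList(const std::map<std::string, std::string>& flags,
                    const std::string& op) {
    const auto it = flags.find(op);
    if (it == flags.end()) return MixList::kGray;
    return it->second == "true" ? MixList::kTrust : MixList::kBlock;
}

static bool Contains(const std::vector<std::string>& names, const std::string& op) {
    return std::find(names.begin(), names.end(), op) != names.end();
}

// Applies the edits to one operator. The application order is an assumption.
MixList ApplyEdits(MixList list, const std::string& op, const MixlistEdits& edits) {
    if (list == MixList::kBlock && Contains(edits.black_to_remove, op)) list = MixList::kGray;
    if (list == MixList::kTrust && Contains(edits.white_to_remove, op)) list = MixList::kGray;
    if (list != MixList::kBlock && Contains(edits.black_to_add, op)) list = MixList::kBlock;
    if (list != MixList::kTrust && Contains(edits.white_to_add, op)) list = MixList::kTrust;
    return list;
}
```

With the example file above, Conv2D (flag true) ends up on the graylist, and Bias ends up on the trustlist regardless of its built-in flag.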

Optional

all

ge.customizeDtypes

Customized operator precision during model build. Other operators in the model are built according to ge.exec.precision_mode or ge.exec.precision_mode_v2. Set this parameter to the path of the configuration file, including the file name, for example, /home/test/customize_dtypes.cfg.

Restrictions:

  • List the names or types of operators whose computing precision needs customization in the configuration file. Each operator occupies a line, and the operator type must be defined based on IR.
  • If both the operator name and the operator type are configured for an operator, the configuration by operator name applies during build.
  • The computing precision of an operator specified by this parameter does not take effect if the operator is fused during build.

The structure of the configuration file is as follows:

# Configuration by operator name
Opname1::InputDtype:dtype1,dtype2,...OutputDtype:dtype1,...
Opname2::InputDtype:dtype1,dtype2,...OutputDtype:dtype1,...
# Configuration by operator type
OpType::TypeName1:InputDtype:dtype1,dtype2,…OutputDtype:dtype1,…
OpType::TypeName2:InputDtype:dtype1,dtype2,…OutputDtype:dtype1,…

The following is an example of the configuration file:

# Configuration by operator name
resnet_v1_50/block1/unit_3/bottleneck_v1/Relu::InputDtype:float16,int8,OutputDtype:float16,int8
# Configuration by operator type
OpType::Relu:InputDtype:float16,int8,OutputDtype:float16,int8
NOTE:
  • You can view the computing precision supported by an operator in the operator information library, which is stored in ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/config/xxx/aic-xxx-ops-info-*.json by default.

    Replace ${INSTALL_DIR} with the CANN component directory. For example, if the installation is performed by the root user, the default file storage path is /usr/local/Ascend/cann. xxx varies depending on the product.

  • The data type specified by this option takes high priority, which may cause precision or performance degradation. If the specified data type is not supported, the build will fail.
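
To make the line format of the configuration file concrete, the following sketch splits one line into its parts. It is based only on the structure and example shown above. DtypeRule and ParseDtypeRule are hypothetical names, and the parsing rules (for instance, treating a leading "OpType::" as the marker for configuration by type) are assumptions rather than the parser GE uses.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical parsed form of one line of customize_dtypes.cfg.
struct DtypeRule {
    bool by_type = false;              // true for "OpType::<type>:..." lines
    std::string name;                  // operator name or operator type
    std::vector<std::string> inputs;   // dtypes after "InputDtype:"
    std::vector<std::string> outputs;  // dtypes after "OutputDtype:"
};

static std::vector<std::string> SplitCommas(const std::string& text) {
    std::vector<std::string> parts;
    std::stringstream stream(text);
    std::string item;
    while (std::getline(stream, item, ',')) {
        if (!item.empty()) parts.push_back(item);
    }
    return parts;
}

// Returns false for comment lines, empty lines, and lines that do not match
// the assumed layout.
bool ParseDtypeRule(const std::string& line, DtypeRule* rule) {
    if (line.empty() || line[0] == '#') return false;
    const std::string type_prefix = "OpType::";
    const std::string in_key = "InputDtype:";
    const std::string out_key = "OutputDtype:";
    std::string body = line;
    rule->by_type = body.compare(0, type_prefix.size(), type_prefix) == 0;
    if (rule->by_type) body = body.substr(type_prefix.size());
    // By name: "<name>::InputDtype:", by type: "<type>:InputDtype:".
    const std::string separator = rule->by_type ? ":" : "::";
    const std::string::size_type in_pos = body.find(separator + in_key);
    const std::string::size_type out_pos = body.find(out_key);
    if (in_pos == std::string::npos || out_pos == std::string::npos) return false;
    const std::string::size_type in_begin = in_pos + separator.size() + in_key.size();
    if (out_pos < in_begin) return false;
    rule->name = body.substr(0, in_pos);
    rule->inputs = SplitCommas(body.substr(in_begin, out_pos - in_begin));
    rule->outputs = SplitCommas(body.substr(out_pos + out_key.size()));
    return !rule->name.empty();
}
```

Applied to the two example lines above, the sketch yields the node name resnet_v1_50/block1/unit_3/bottleneck_v1/Relu and the type Relu, each with float16 and int8 as input and output data types.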

Optional

session

Precision Comparison

Options Key

Options Value

Required/Optional

Global/Session/Graph

ge.exec.enableDump

Whether to enable the dump function.

  • 1: The dump function is enabled. The dump file path is read from ge.exec.dumpPath. If ge.exec.dumpPath is set to None, an exception occurs.
  • 0 (default): The dump function is disabled.

Configuration example:

{"ge.exec.enableDump", "0"};
NOTE:
  • This parameter cannot be used together with ge.exec.enableDumpDebug in the global scenario or in the same session.
  • If either ge.exec.enableDump or ge.exec.enableDumpDebug is set to 1 and ge.exec.enable_exception_dump is also set to 1 (indicating that the common ExceptionDump function is enabled):
    • For dynamic-shape networks, only ge.exec.enable_exception_dump takes effect.
    • For static-shape networks, ge.exec.enable_exception_dump takes effect together with whichever of ge.exec.enableDump and ge.exec.enableDumpDebug is enabled.

Optional

Global/Session

ge.exec.dumpPath

Path for storing the dump file. This parameter is required when dump or overflow/underflow detection is enabled.

Create the specified path in advance in the environment (either container or host) where training is performed. The running user configured during installation must have the read and write permissions on this path. The path can be an absolute path or a path relative to the path where the training script is executed.

  • An absolute path starts with a slash (/), for example, /home/HwHiAiUser/output.
  • A relative path starts with a directory name, for example, output.

The dump data file is generated in the specified directory, that is, the {dump_path}/{time}/{deviceid}/{model_name}/{model_id}/{data_index} directory. For example, if dump_path is set to /home/HwHiAiUser/output, the dump data file is stored in the /home/HwHiAiUser/output/20200808163566/0/ge_default_20200808163719_121/11/0 directory.

Optional

Global/Session

ge.exec.dumpStep

Iterations to be dumped. Default value: None, indicating that all iterations are dumped.

Separate multiple iterations using vertical bars (|), for example, 0|5|10. You can also use hyphens (-) to specify the iteration range, for example, 0|3-5|10.
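
The step syntax can be illustrated by expanding a value such as 0|3-5|10 into the individual iteration indices it selects. This is a minimal sketch of the syntax described above, not the parser GE uses. ExpandDumpSteps and ParseIndex are hypothetical helper names, and the value None (all iterations) is deliberately rejected so that a caller handles it separately.

```cpp
#include <cctype>
#include <set>
#include <sstream>
#include <string>

// Parses a non-negative decimal index of at most nine digits.
static bool ParseIndex(const std::string& text, unsigned long* value) {
    if (text.empty() || text.size() > 9) return false;  // keeps std::stoul in range
    for (char c : text) {
        if (!std::isdigit(static_cast<unsigned char>(c))) return false;
    }
    *value = std::stoul(text);
    return true;
}

// Expands "0|3-5|10" into {0, 3, 4, 5, 10}. Returns false on malformed input.
bool ExpandDumpSteps(const std::string& spec, std::set<unsigned long>* steps) {
    std::stringstream stream(spec);
    std::string token;
    while (std::getline(stream, token, '|')) {
        const std::string::size_type dash = token.find('-');
        unsigned long first = 0;
        unsigned long last = 0;
        if (dash == std::string::npos) {
            if (!ParseIndex(token, &first)) return false;
            last = first;
        } else if (!ParseIndex(token.substr(0, dash), &first) ||
                   !ParseIndex(token.substr(dash + 1), &last) || first > last) {
            return false;
        }
        for (unsigned long step = first; step <= last; ++step) {
            steps->insert(step);
        }
    }
    return !steps->empty();
}
```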

Configuration example:

{"ge.exec.dumpStep", "None"};

Optional

Global/Session

ge.exec.dumpMode

Dump mode, specifying whether the operator input or output is dumped. The values are as follows:

  • input: Only operator inputs are dumped.
  • output (default): Only operator outputs are dumped.
  • all: Both operator inputs and outputs are dumped.

Configuration example:

{"ge.exec.dumpMode", "input"};

Restrictions:

If this parameter is set to all, the input data of some operators, such as collective communication operators HcomAllGather and HcomAllReduce, will be modified during execution. Therefore, the system dumps the operator input before operator execution and dumps the operator output after operator execution. In this way, the dumped input and output data of the same operator is flushed to disks separately, and multiple dump files are generated. After parsing the dump files, you can determine whether the data is an input or output based on the file content.

Optional

Global/Session

ge.exec.dumpData

Type of operator content to be dumped.

  • tensor (default): Operator data is dumped.
  • stats: Operator statistics are dumped and the result is saved in CSV format. Because the full operator data is large in most cases, you can dump the operator statistics instead.

Configuration example:

{"ge.exec.dumpData", "tensor"};

Optional

Global/Session

ge.exec.dumpLayer

Operator to be dumped. The value is an operator name. Multiple operator names are separated by spaces.

If the input of the specified operator involves the data operator, the data operator information is also dumped.

Configuration example:
{"ge.exec.dumpLayer", "layer1 layer2 layer3"};

Optional

Global/Session

ge.exec.enableDumpDebug

Whether to enable overflow/underflow detection.

  • 1: Overflow/underflow detection is enabled. The dump file path is read from ge.exec.dumpPath. If ge.exec.dumpPath is set to None, an exception occurs.
  • 0 (default): Overflow/underflow detection is disabled.

Configuration example:

{"ge.exec.enableDumpDebug", "0"};
NOTE:
  • This parameter cannot be used together with ge.exec.enableDump in the global scenario or in the same session.
  • If either ge.exec.enableDump or ge.exec.enableDumpDebug is set to 1 and ge.exec.enable_exception_dump is also set to 1 (indicating that the common ExceptionDump function is enabled):
    • For dynamic-shape networks, only ge.exec.enable_exception_dump takes effect.
    • For static-shape networks, ge.exec.enable_exception_dump takes effect together with whichever of ge.exec.enableDump and ge.exec.enableDumpDebug is enabled.
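
The two constraints stated for the dump switches (ge.exec.enableDump and ge.exec.enableDumpDebug are mutually exclusive, and ge.exec.dumpPath is needed when either is enabled) can be checked before the options are passed on. The following is a sketch of such a pre-check for a single option map. CheckDumpOptions and the Options alias are hypothetical, no GE header is used, and GE performs its own verification independently of this code.

```cpp
#include <map>
#include <string>

using Options = std::map<std::string, std::string>;

static bool IsOn(const Options& options, const std::string& key) {
    const auto it = options.find(key);
    return it != options.end() && it->second == "1";
}

// Returns an empty string when the dump-related options are consistent,
// otherwise a short description of the first problem found.
std::string CheckDumpOptions(const Options& options) {
    const bool dump = IsOn(options, "ge.exec.enableDump");
    const bool debug = IsOn(options, "ge.exec.enableDumpDebug");
    if (dump && debug) {
        return "ge.exec.enableDump and ge.exec.enableDumpDebug cannot both be 1";
    }
    if (dump || debug) {
        const auto path = options.find("ge.exec.dumpPath");
        if (path == options.end() || path->second.empty()) {
            return "ge.exec.dumpPath is required when dump or overflow detection is on";
        }
    }
    return "";
}
```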

Optional

Global/Session

ge.exec.dumpDebugMode

Overflow/underflow detection mode. The values are as follows:

  • aicore_overflow: detects AI Core operator overflow, that is, whether abnormal extreme values (such as 65500, 38400, and 51200 in float16) are output for normal inputs. Once such a fault is detected, analyze the cause of the overflow and modify the operator implementation based on the network requirements and operator logic.
  • atomic_overflow: detects Atomic Add overflow, for checking modules involved in floating-point computing (such as SDMA) in addition to AI Core.
  • all: detects overflow of both AI Core operators and Atomic Add.

For the following products, this parameter can only be set to all:

Atlas A2 training products / Atlas A2 inference products

Atlas A3 training products / Atlas A3 inference products

Optional

Global/Session

ge.bufferOptimize

Data buffer optimization switch.

Arguments:

  • l1_optimize: Enables L1 optimization. This argument is invalid in the current version and equivalent to off_optimize.
  • l2_optimize (default): Enables L2 optimization.
  • off_optimize: Disables buffer optimization.

Suggestions:

You are advised to enable buffer optimization because it improves compute efficiency and performance. However, your model may contain an operator that is not yet covered by the current implementation, which can affect precision. If precision is affected, disable data buffer optimization. If the precision meets requirements after buffer optimization is disabled, locate the problematic operator and submit the issue to technical support for further analysis. After the operator issue is resolved, you are advised to re-enable buffer optimization.

Configuration example:

{"ge.bufferOptimize", "l2_optimize"};

Optional

session/graph

ge.fusionSwitchFile

Path (including the file name) of the configuration file for the fusion pattern (including graph fusion and UB fusion) switch. The path can contain letters (a–z, A–Z), digits (0–9), underscores (_), hyphens (-), and periods (.).

The built-in graph fusion and UB fusion patterns are enabled by default. You can disable specified fusion patterns in the configuration file. Some fusion patterns cannot be disabled due to functionality restrictions. For the full list of fusion patterns that can be disabled, see Graph Fusion and UB Fusion Patterns.

The following is a template of the fusion_switch.cfg configuration file. on indicates that the setting is enabled, and off indicates that the setting is disabled.

  1. Configuration file example:
    {
        "Switch":{
            "GraphFusion":{
                "RequantFusionPass":"on",
                "ConvToFullyConnectionFusionPass":"off",
                "SoftmaxFusionPass":"on",
                "NotRequantFusionPass":"on",
                "SplitConvConcatFusionPass":"on",
                "ConvConcatFusionPass":"on",
                "MatMulBiasAddFusionPass":"on",
                "PoolingFusionPass":"on",
                "ZConcatv2dFusionPass":"on",
                "ZConcatExt2FusionPass":"on",
                "TfMergeSubFusionPass":"on"
            },
            "UBFusion":{
                "TbePool2dQuantFusionPass":"on"
            }
        }
    }

To disable all fusion patterns at a time, refer to this configuration file example.

  1. Configuration file example:
    {
        "Switch":{
            "GraphFusion":{
                "ALL":"off"
            },
            "UBFusion":{
                "ALL":"off"
            }
        }
    }

Notes:

  1. Some built-in fusion patterns cannot be switched off due to functionality restrictions. They remain enabled regardless of the switch settings in the configuration file.
  2. To disable all fusion patterns except selected ones, refer to the following example.
    1. Configuration file example:
      {
          "Switch":{
              "GraphFusion":{
                  "ALL":"off",
                  "SoftmaxFusionPass":"on"
              },
              "UBFusion":{
                  "ALL":"off",
                  "TbePool2dQuantFusionPass":"on"
              }
          }
      }

Configuration example:

{"ge.fusionSwitchFile", "/home/test/fusion_switch.cfg"};
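
The interaction between the ALL entry and individual pass entries in the examples above suggests a simple lookup order: an explicit entry for a pass wins, otherwise ALL applies, otherwise the pass stays enabled by default. The sketch below encodes that reading for one group (GraphFusion or UBFusion). SwitchGroup and IsFusionPassEnabled are hypothetical names, the file is assumed to be already parsed into a map, and fusion patterns that cannot be disabled are not modeled.

```cpp
#include <map>
#include <string>

// Hypothetical parsed form of one group: pass name or "ALL" -> "on" / "off".
using SwitchGroup = std::map<std::string, std::string>;

// An explicit entry for the pass takes precedence over "ALL". Passes that are
// not mentioned at all are treated as enabled, matching the documented default.
bool IsFusionPassEnabled(const SwitchGroup& group, const std::string& pass) {
    auto it = group.find(pass);
    if (it == group.end()) it = group.find("ALL");
    if (it == group.end()) return true;
    return it->second != "off";
}
```

With a group of {"ALL": "off", "SoftmaxFusionPass": "on"}, the sketch reports SoftmaxFusionPass as enabled and every other switchable pass as disabled.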

Optional

all

Performance Tuning

Key

Value

Required/Optional

Global/Session/Graph

ge.exec.variable_acc

Whether to enable variable format optimization.

Arguments:

  • True (default): enabled
  • False: disabled

To improve training efficiency, the format of the variables is converted to a format more compatible with the Ascend AI Processor during variable initialization performed by the network. However, this function should be disabled in special scenarios.

Restrictions:

When this function is enabled, ge.AllowMultiGraphParallelCompile cannot be set to 1. Otherwise, an error is reported during verification.

Configuration example:

{"ge.exec.variable_acc", "True"};

Optional

All

ge.exec.op_precision_mode

Precision mode of one or more specified operators during internal processing. Set this parameter to the path of a customized precision mode configuration file (op_precision.ini), which assigns different precision modes to different operators.

Set the precision mode based on the operator type (low priority) or node name (high priority) in each row in the .ini file.

The following precision modes can be set in the configuration file:

  • high_precision
  • high_performance
  • enable_float_32_execution: The FP32 data type is used for internal processing of operators. In this scenario, the FP32 data type is not automatically converted to the HF32 data type. If you are using the HF32 data type for computation and find that the accuracy drop exceeds your expectation, you can enable this configuration to specify the use of FP32 for internal computation of certain operators in order to maintain accuracy.

    This option supports only the following products:

    Atlas A2 training products / Atlas A2 inference products

    Atlas A3 training products / Atlas A3 inference products

  • enable_hi_float_32_execution: The HF32 data type is used for internal processing of operators. After it is enabled, the FP32 data type is automatically converted to the HF32 data type. This configuration reduces the space occupied by data and improves performance. It is not supported in the current version.
  • support_out_of_bound_index: Out-of-bounds verification is performed on the indices of the gather, scatter, and segment operators. The verification degrades operator execution performance.
  • keep_fp16: The FP16 data type is used for internal processing of operators. In this scenario, the FP16 data type is not automatically converted to the FP32 data type. If the performance of FP32 computation does not meet the expectation and high precision is not required, you can select the keep_fp16 mode. This low-precision mode sacrifices the precision for improving the performance, which is not recommended.
  • super_performance: ultra-high performance mode. Compared with high_performance, the calculation formula of the algorithm is further optimized.

You can view the precision or performance modes supported by an operator in the opp/built-in/op_impl/ai_core/tbe/impl_mode/all_ops_impl_mode.ini file under the CANN software installation path.

A configuration example for the op_precision.ini file is as follows:

[ByOpType]
optype1=high_precision
optype2=high_performance
optype3=enable_hi_float_32_execution
optype4=support_out_of_bound_index

[ByNodeName]
nodename1=high_precision
nodename2=high_performance
nodename3=enable_hi_float_32_execution
nodename4=support_out_of_bound_index
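
Because a node-name entry has higher priority than an operator-type entry, the mode that applies to a given node can be found with a two-step lookup. The following is a minimal sketch of that priority rule. OpPrecisionConfig and ResolvePrecisionMode are hypothetical names, and the two maps stand for the already parsed [ByNodeName] and [ByOpType] sections.

```cpp
#include <map>
#include <string>

// Hypothetical parsed form of op_precision.ini.
struct OpPrecisionConfig {
    std::map<std::string, std::string> by_op_type;    // [ByOpType] section
    std::map<std::string, std::string> by_node_name;  // [ByNodeName] section
};

// Returns the configured mode for a node. A [ByNodeName] entry overrides a
// [ByOpType] entry. An empty string means that neither section mentions it.
std::string ResolvePrecisionMode(const OpPrecisionConfig& config,
                                 const std::string& node_name,
                                 const std::string& op_type) {
    const auto by_name = config.by_node_name.find(node_name);
    if (by_name != config.by_node_name.end()) return by_name->second;
    const auto by_type = config.by_op_type.find(op_type);
    if (by_type != config.by_op_type.end()) return by_type->second;
    return "";
}
```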

Configuration example:

{"ge.exec.op_precision_mode", "$HOME/conf/op_precision.ini"};

Optional

Global

ge.optypelistForImplmode

List of operator types. The operators in the list use the mode specified by the ge.opSelectImplmode option.

Restrictions:

  • The operators in the list use the mode specified by ge.opSelectImplmode, which is either high_precision or high_performance. Use commas (,) to separate operators.
  • This option must be used together with ge.opSelectImplmode and takes effect only for the specified operators. For example, if ge.opSelectImplmode is set to high_precision and ge.optypelistForImplmode is set to Pooling,SoftmaxV2, the high-precision mode is used only for the Pooling and SoftmaxV2 operators. All other operators use the default implementation mode.
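
As a configuration sketch in the style of the other examples, the two options can be placed in the same option map. BuildImplModeOptions is a hypothetical helper, and the use of std::map with std::string keys is an assumption made to keep the example self-contained. Check the signature of GEInitialize in your CANN version for the actual map type.

```cpp
#include <map>
#include <string>

// Hypothetical helper: builds the option pair that selects the
// high-precision implementation for two operator types only.
std::map<std::string, std::string> BuildImplModeOptions() {
    return {
        {"ge.opSelectImplmode", "high_precision"},
        // Operator types are separated by commas.
        {"ge.optypelistForImplmode", "Pooling,SoftmaxV2"},
    };
}
```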

Optional

Global

ge.tiling_schedule_optimize

Whether to enable the optimization for tiling offload scheduling.

Because the internal storage of the AI Core in the NPU cannot hold all the input and output data of an operator, the input data is split into parts. Each part is transferred in, computed, and transferred out in turn. This process is called tiling. A computation program, called the tiling implementation, determines the tiling parameters (such as the block size transferred each time and the total number of loops) based on operator information such as the shape. AI Cores are not good at the scalar computation involved in the tiling implementation, so it is generally executed on the host CPU. However, the tiling implementation is executed on the device when the following conditions are met:

  1. The model is static-shape.
  2. Operators in the model, such as the FusedInferAttentionScore and IncreFlashAttention fused operators, support tiling offload.
  3. The output values of the operators that support tiling offload have dependencies, that is, the output value of the previous operator contains the execution result of the device. If the value to be depended on is a Const value, tiling offload is not required, and tiling is completed during build.
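
To illustrate what tiling parameters are, the sketch below derives a block size and a loop count from a total element count and a buffer capacity. It is a deliberately simplified, one-dimensional illustration. TilingParams and ComputeTiling are hypothetical names, and real tiling implementations are operator-specific and take many more factors into account.

```cpp
#include <cstddef>

// Hypothetical tiling result for a flat, one-dimensional workload.
struct TilingParams {
    std::size_t block_elems;  // elements transferred in each pass
    std::size_t loop_count;   // number of transfer/compute passes
    std::size_t tail_elems;   // elements handled by the last pass
};

// Splits total_elems into blocks that fit a buffer of capacity_elems.
// Returns all zeros when there is nothing to split.
TilingParams ComputeTiling(std::size_t total_elems, std::size_t capacity_elems) {
    TilingParams params = {0, 0, 0};
    if (total_elems == 0 || capacity_elems == 0) return params;
    params.block_elems = total_elems < capacity_elems ? total_elems : capacity_elems;
    // Ceiling division gives the number of passes.
    params.loop_count = (total_elems + params.block_elems - 1) / params.block_elems;
    params.tail_elems = total_elems - (params.loop_count - 1) * params.block_elems;
    return params;
}
```

For 1000 elements and a buffer that holds 256, the sketch produces four passes: three full blocks of 256 elements and a tail of 232. When the shape is known at build time, such values can be computed once, which is why a dependency on a Const value does not require tiling offload.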

Arguments:

  • 0 (default): Tiling offload is disabled.
  • 1: Tiling offload is enabled.

Configuration example:

{"ge.tiling_schedule_optimize", "0"};

This option is supported only by the following products:

Atlas A2 training products / Atlas A2 inference products

Atlas A3 training products / Atlas A3 inference products

Atlas inference products

Optional

Global/Session

ge.graphMaxParallelModelNum

In graph execution mode, a graph can be concurrently loaded and executed by multiple models on the same device. This parameter is used to specify the maximum number of models that can be concurrently loaded.

Arguments:

1 to INT32_MAX. The default value is 8.

Configuration example:

{"ge.graphMaxParallelModelNum", "8"};

Optional

All

Profiling

Key

Value

Required/Optional

Global/Session/Graph

ge.exec.profilingMode

Whether to enable the profiling function.

  • 1: enabled. The Profiling option to be traced is determined by ge.exec.profilingOptions.
  • 0 (default): disabled.

Configuration example:

{"ge.exec.profilingMode", "0"};

Optional

Global

ge.exec.profilingOptions

Profiling options.

  • output: path for storing profiling result files. Both absolute path and relative path (relative to the path where the command is run) are supported. The path cannot contain the following special characters: "\n", "\f", "\r", "\b", "\t", "\v", and "\u007F".
    • An absolute path starts with a slash (/), for example, /home/output.
    • A relative path starts with a directory name, for example, output.
    • It takes precedence over ASCEND_WORK_PATH.
    • This path does not need to be created in advance because it is automatically created during collection.
  • storage_limit: maximum size of files that can be stored in a specified disk directory. If the size of profile data files in the disk is about to use up the maximum storage space specified by this option or the total remaining disk space is about to be used up (remaining space ≤ 20 MB), the earliest files in the disk are aged and deleted.

    The value range is [200, 4294967295], and the unit is MB. The unit must be included when you set this parameter, for example, 200 MB.

    If this parameter is not set, the default value is 90% of the available space of the disk where the directory for storing profile data files is located.

  • training_trace: iteration tracing switch. Collects software profile data of a training job and the AI Software Stack to profile the training job. Focuses on forward and backward propagation, and gradient aggregation and update. This option must be set to on when the forward and backward propagation operator data is collected.
  • task_trace and task_time: switches that control collection of the operator delivery and execution durations. Related duration data must be output to the task_time, op_summary, and op_statistic files. Possible configuration values are as follows:
    • on: switch on. This is the default value, delivering the same effect as l1.
    • off: switch off.
    • l0: collects operator delivery and execution duration data. Compared with l1, l0 does not collect basic operator information, so the performance overhead during collection is smaller, and this enables more accurate collection of statistics on time duration data.
    • l1: collects operator delivery and execution duration data, as well as basic operator information, to provide more comprehensive performance analysis data.

    When Profiling is enabled to collect training data, task_trace and training_trace must be set to on.

  • ge_api: switch that controls collection of the time consumption data of dynamic-shape operators in the host scheduling phase. Possible values are:
    • off: switch off. The default value is off.
    • l0: collects the time consumption data of dynamic-shape operators in the main host scheduling phase to facilitate accurate statistics.
    • l1: collects finer-grained time consumption data of dynamic-shape operators in the host scheduling phase to provide more comprehensive performance analysis data.
  • hccl: communication data collection switch, either on or off (default).
    NOTE:

    This switch will be deprecated in later versions. To control data collection, use task_time.

  • aicpu: whether to collect details about the AI CPU operator, such as the operator execution time and data copy time. The value can be on or off (default).
  • fp_point: start point of the forward propagated operator in iteration traces, to record the start timestamp of forward propagation. Set the value to the name of the top operator in forward propagation. You can save the graph as a .pbtxt file by using tf.io.write_graph in the training script to obtain the name. Alternatively, you can leave this option empty (for example, "fp_point":""), and the system will automatically identify the start point of the forward propagated operator.
  • bp_point: end point of the backward propagated operator in iteration traces, to record the end timestamp of backward propagation. bp_point and fp_point are used to compute the time used by forward and backward propagation. Set the value to the name of the bottom operator in backward propagation. You can save the graph as a .pbtxt file by using tf.io.write_graph in the training script to obtain the name. Alternatively, you can leave this option empty (for example, "bp_point":""), and the system will automatically identify the end point of the backward propagated operator.
  • aic_metrics: AI Core metric to profile. The options are as follows:
    • ArithmeticUtilization: arithmetic utilization ratio.
    • PipeUtilization (default): ratio of time taken by the compute units to that of MTEs.
    • Memory: ratio of external memory read/write instructions.
    • MemoryL0: ratio of internal memory L0 read/write instructions.
    • MemoryUB: ratio of internal memory UB read/write instructions.
    • ResourceConflictRatio: ratio of pipeline queue instructions.
    • L2Cache: read/write L2 cache hits and re-allocations after cache misses

      Atlas inference products : This parameter is not supported.

      Atlas training products : This parameter is not supported.

    • MemoryAccess: bandwidth of the operator's memory access on cores.

      Atlas inference products : This parameter is not supported.

      Atlas training products : This parameter is not supported.

    NOTE:
    The registers whose data is to be collected can be customized, for example, "aic_metrics":"Custom:0x49,0x8,0x15,0x1b,0x64,0x10".
    • The Custom field indicates the customization type. It is set to specific register values in the range of [0x1, 0x6E].
    • A maximum of eight registers can be configured, which are separated with commas (,).
    • The register value can be in hexadecimal or decimal format.
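
    The three constraints on a Custom value (register values in [0x1, 0x6E], at most eight registers, hexadecimal or decimal notation) can be checked with a short routine. The following is a sketch of such a check, not the validation performed by the profiling tool. ParseCustomAicMetrics is a hypothetical name.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Parses a value such as "Custom:0x49,0x8,0x15" into register numbers.
// Returns false if the prefix is missing, a token is not a number, a value
// lies outside [0x1, 0x6E], or more than eight registers are listed.
bool ParseCustomAicMetrics(const std::string& value,
                           std::vector<unsigned long>* registers) {
    const std::string prefix = "Custom:";
    if (value.compare(0, prefix.size(), prefix) != 0) return false;
    std::stringstream stream(value.substr(prefix.size()));
    std::string token;
    while (std::getline(stream, token, ',')) {
        if (token.empty()) return false;
        char* end = nullptr;
        // Base 0 accepts "0x49" (hexadecimal) and "73" (decimal). Note that it
        // also reads a leading 0 as octal, which this sketch does not filter.
        const unsigned long reg = std::strtoul(token.c_str(), &end, 0);
        if (*end != '\0' || reg < 0x1 || reg > 0x6E) return false;
        registers->push_back(reg);
    }
    return !registers->empty() && registers->size() <= 8;
}
```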
  • l2: switch that controls collection of the L2 cache and TLB page table cache hit ratios, either on or off (default).
    • Atlas inference products : supports collection of the L2 cache hit ratio.
    • Atlas training products : supports collection of the L2 cache hit ratio.
    • Atlas A2 training products / Atlas A2 inference products : supports collection of the L2 cache and TLB page table cache hit ratios. Setting aic_metrics to L2Cache is recommended for analyzing the number of hits on L2 cache from the AI Core.
    • Atlas A3 training products / Atlas A3 inference products : supports collection of the L2 cache and TLB page table cache hit ratios. Setting aic_metrics to L2Cache is recommended for analyzing the number of hits on L2 cache from the AI Core.
  • msproftx: switch that controls the msproftx user and upper-layer framework program to output profile data, either on or off (default).

    Add the following mstx API or msproftx API to the application script. The mstx API is recommended.

  • runtime_api: runtime API data collection switch, either on or off (default). You can collect runtime API profile data, including the synchronous/asynchronous memory replication latencies between the host and device and between devices.
  • sys_hardware_mem_freq: switch that controls the collection of the on-chip memory, QoS transmission bandwidth, LLC L3 cache bandwidth, accelerator bandwidth, SoC transmission bandwidth, and component memory usage. The collected content varies depending on the product. The actual result prevails. The value range is [1, 100], in Hz.

    Sampling memory data in an environment with glibc 2.34 or earlier installed may trigger the known Bug 19329. Upgrading glibc resolves this problem.

  • llc_profiling: LLC events to profile. Possible values are as follows:
    • read (default): read events, that is, the L3 cache read rate.
    • write: write events, that is, the L3 cache write rate.
  • sys_io_sampling_freq: NIC and RoCE data collection frequency. The value range is [1, 100], in Hz.

    Atlas inference products : This parameter is not supported.

    Atlas A2 training products / Atlas A2 inference products : supports NIC and RoCE collection.

    Atlas A3 training products / Atlas A3 inference products : supports NIC and RoCE collection.

  • sys_interconnection_freq: frequency of collecting collective communication bandwidth data (HCCS), SIO data, PCIe data and inter-chip transmission bandwidth information. The value range is [1, 50], in Hz.
    • Atlas training products : supports HCCS and PCIe data collection.
    • Atlas A2 training products / Atlas A2 inference products : supports HCCS, PCIe data, and inter-chip transmission bandwidth information collection.
    • Atlas A3 training products / Atlas A3 inference products : supports HCCS, PCIe data, inter-chip transmission bandwidth information, and SIO data collection.
  • dvpp_freq: DVPP collection frequency. The value range is [1, 100], in Hz.
  • instr_profiling: AI Core and AI Vector bandwidth and latency collection switch. The value can be on or off (default).
    • Atlas training products : This function is not supported.
    • Atlas A2 training products / Atlas A2 inference products : This switch is not supported. This function is controlled through instr_profiling_freq.
    • Atlas A3 training products / Atlas A3 inference products : This switch is not supported. This function is controlled through instr_profiling_freq.
  • instr_profiling_freq: AI Core and AI Vector bandwidth and latency collection frequency. Configuring a collection frequency enables the related collection capability. The value range is [300, 30000], in Hz.
    • Atlas training products : This function is not supported.
    • Atlas A2 training products / Atlas A2 inference products : supported. However, instr_profiling_freq is mutually exclusive with training_trace, task_trace, hccl, aicpu, fp_point, bp_point, aic_metrics, l2, task_time, and runtime_api, so they cannot be executed at the same time.
    • Atlas A3 training products / Atlas A3 inference products : supported. However, instr_profiling_freq is mutually exclusive with training_trace, task_trace, hccl, aicpu, fp_point, bp_point, aic_metrics, l2, task_time, and runtime_api, so they cannot be executed at the same time.
  • host_sys: switch for collecting host profile data. You can select one or more options and separate them with commas (,), for example, "host_sys": "cpu,mem".
    • cpu: process CPU utilization
    • mem: process memory utilization
  • host_sys_usage: Host-side system and process CPU and memory data collection option, selected from cpu and mem. You can select one or more options and separate them with commas (,).
  • host_sys_usage_freq: Host-side system and process CPU and memory data collection frequency. The value range is [1, 50] and the default value is 50. The unit is Hz.
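The frequency ranges and the instr_profiling_freq exclusion rule listed above can be checked on the host before the options are serialized. The following is a minimal sketch, not part of GE: CheckProfilingOptions and ParseInt are hypothetical helper names, the options are held in a plain std::map of strings, and an exclusive option is assumed to conflict only when its value is not "off".

```cpp
#include <cstddef>
#include <exception>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: parses a whole string as an int, returns false on failure.
bool ParseInt(const std::string& text, int* out) {
  try {
    std::size_t used = 0;
    *out = std::stoi(text, &used);
    return used == text.size();
  } catch (const std::exception&) {
    return false;
  }
}

// Hypothetical helper (not a GE API): returns a list of problems found in a
// flat key/value view of the profiling options. An empty list means the
// checks below found nothing wrong.
std::vector<std::string> CheckProfilingOptions(
    const std::map<std::string, std::string>& opts) {
  // Inclusive frequency ranges in Hz, copied from the option descriptions.
  static const std::map<std::string, std::pair<int, int>> kRanges = {
      {"sys_hardware_mem_freq", {1, 100}},
      {"sys_io_sampling_freq", {1, 100}},
      {"sys_interconnection_freq", {1, 50}},
      {"dvpp_freq", {1, 100}},
      {"instr_profiling_freq", {300, 30000}},
      {"host_sys_usage_freq", {1, 50}},
  };
  // Options documented as mutually exclusive with instr_profiling_freq.
  static const std::set<std::string> kExclusive = {
      "training_trace", "task_trace", "hccl",      "aicpu",    "fp_point",
      "bp_point",       "aic_metrics", "l2",       "task_time", "runtime_api"};

  std::vector<std::string> problems;
  for (const auto& kv : opts) {
    const auto range = kRanges.find(kv.first);
    if (range == kRanges.end()) continue;
    int freq = 0;
    if (!ParseInt(kv.second, &freq)) {
      problems.push_back(kv.first + ": not an integer");
    } else if (freq < range->second.first || freq > range->second.second) {
      problems.push_back(kv.first + ": out of range");
    }
  }
  if (opts.count("instr_profiling_freq") != 0) {
    for (const auto& kv : opts) {
      // Assumption: a switch explicitly set to "off" does not conflict.
      if (kExclusive.count(kv.first) != 0 && kv.second != "off") {
        problems.push_back(kv.first + ": conflicts with instr_profiling_freq");
      }
    }
  }
  return problems;
}
```

Running such a check before calling GEInitialize lets you reject an invalid combination early instead of diagnosing it from logs.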

Configuration example:

std::map<ge::AscendString, ge::AscendString> ge_options = {{"ge.exec.deviceId", "0"},
                                  {"ge.graphRunMode", "1"},
                                  {"ge.exec.profilingMode", "1"},
                                  {"ge.exec.profilingOptions", R"({"output":"/tmp/profiling","training_trace":"on","fp_point":"resnet_model/conv2d/Conv2Dresnet_model/batch_normalization/FusedBatchNormV3_Reduce","bp_point":"gradients/AddN_70"})"}};
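Writing the JSON value of ge.exec.profilingOptions by hand is error-prone once more than a few keys are set. The sketch below builds the string from a map instead. BuildProfilingOptionsJson is a hypothetical helper, not a GE API; it assumes every value is passed as a JSON string, as in the example above, and it escapes only double quotes and backslashes.

```cpp
#include <map>
#include <string>

// Hypothetical helper (not a GE API): serializes flat string options into a
// JSON object string. Keys are emitted in the map's sorted order.
std::string BuildProfilingOptionsJson(
    const std::map<std::string, std::string>& opts) {
  // Minimal escaping: only '"' and '\\' are handled in this sketch.
  auto escape = [](const std::string& text) {
    std::string out;
    for (char c : text) {
      if (c == '"' || c == '\\') out += '\\';
      out += c;
    }
    return out;
  };

  std::string json = "{";
  bool first = true;
  for (const auto& kv : opts) {
    if (!first) json += ",";
    first = false;
    json += "\"" + escape(kv.first) + "\":\"" + escape(kv.second) + "\"";
  }
  json += "}";
  return json;
}
```

The returned string can then be used as the value of "ge.exec.profilingOptions" when the options map is constructed.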

Optional

Global

AOE

Key

Value

Required/Optional

Global/Session/Graph

ge.mdl_bank_path

Path of the custom repository generated after subgraph tuning.

This option must be used together with ge.bufferOptimize and takes effect only when buffer optimization is enabled, to improve performance by temporarily storing data in the buffer.

Argument: directory of the custom repository generated after model tuning.

Format: The directory can contain letters, digits, underscores (_), hyphens (-), and periods (.).

Default: $HOME/Ascend/latest/data/aoe/custom/graph/<soc_version>

Restrictions:

Priority ranked from high to low: directory specified by ge.mdl_bank_path > directory specified by TUNE_BANK_PATH > default directory.

  1. If TUNE_BANK_PATH is set before model compilation and ge.mdl_bank_path is then set during model build, the directory specified by ge.mdl_bank_path takes effect and the directory specified by TUNE_BANK_PATH does not.
  2. The default directory takes effect if both the directories specified by ge.mdl_bank_path and TUNE_BANK_PATH are invalid or contain no custom repository.
  3. If no custom repository is available in the preceding directories, the built-in repository for subgraph tuning is searched in the ${INSTALL_DIR}/<arch>-linux/data/fusion_strategy/built-in path. <arch> indicates the OS architecture.
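The lookup order described by these restrictions can be sketched as follows. This is only an illustration of the documented priority, not the GE implementation: ResolveModelBankPath is a hypothetical name, and has_repo is a caller-supplied predicate that reports whether a directory is valid and contains a custom repository.

```cpp
#include <functional>
#include <initializer_list>
#include <string>

// Hypothetical helper: returns the directory that would take effect for
// subgraph tuning, following
//   ge.mdl_bank_path > TUNE_BANK_PATH > default directory > built-in repository.
std::string ResolveModelBankPath(
    const std::string& option_path,   // value of ge.mdl_bank_path, may be empty
    const std::string& env_path,      // value of TUNE_BANK_PATH, may be empty
    const std::string& default_path,  // default custom repository directory
    const std::string& builtin_path,  // built-in repository directory
    const std::function<bool(const std::string&)>& has_repo) {
  for (const std::string* candidate : {&option_path, &env_path, &default_path}) {
    if (!candidate->empty() && has_repo(*candidate)) return *candidate;
  }
  // No custom repository anywhere: fall back to the built-in repository.
  return builtin_path;
}
```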

Optional

All

ge.op_bank_path

Directory of the custom repository generated after operator tuning.

Format: The directory can contain letters, digits, underscores (_), hyphens (-), and periods (.).

Default: ${HOME}/Ascend/latest/data/aoe/custom/op

Restrictions:

Priority of the custom repository path generated after operator tuning, ranked from high to low: path specified by the TUNE_BANK_PATH environment variable > path specified by ge.op_bank_path > default path.

  1. If TUNE_BANK_PATH is set before model conversion and ge.op_bank_path is then set during model build, the path specified by TUNE_BANK_PATH takes effect and the path specified by ge.op_bank_path does not.
  2. The default directory takes effect if both the directories specified by ge.op_bank_path and the environment variable are invalid.
  3. If none of the preceding directories contains the custom repository, the system searches the built-in directory of the custom repository generated after operator tuning.
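For operator tuning, the environment variable outranks the option. The sketch below encodes that decision as a small function so the order is explicit. OpBankSource and SelectOpBankSource are hypothetical names, and the three flags stand for checks that the caller performs on each location.

```cpp
// Hypothetical enumeration of the places an operator tuning repository can
// come from, in the documented priority order.
enum class OpBankSource { kEnvironment, kOption, kDefault, kBuiltIn };

// Hypothetical helper: selects the source that takes effect.
//   env_valid        - TUNE_BANK_PATH is set and points to a valid directory
//   option_valid     - ge.op_bank_path is set and points to a valid directory
//   default_has_repo - the default directory contains a custom repository
OpBankSource SelectOpBankSource(bool env_valid, bool option_valid,
                                bool default_has_repo) {
  if (env_valid) return OpBankSource::kEnvironment;
  if (option_valid) return OpBankSource::kOption;
  if (default_has_repo) return OpBankSource::kDefault;
  return OpBankSource::kBuiltIn;
}
```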

Optional

All

Exception Remedy

Key

Value

Required/Optional

Global/Session/Graph

stream_sync_timeout

Timeout for stream synchronization during graph execution. If synchronization takes longer than the configured value, a synchronization failure is reported. The unit is ms.

The default value is –1, indicating that there is no waiting time and no error is reported when the synchronization fails.

Configuration example:

{"stream_sync_timeout", "-1"};

Optional

Global/Session

event_sync_timeout

Timeout for event synchronization during graph execution. If synchronization takes longer than the configured value, a synchronization failure is reported. The unit is ms.

The default value is –1, indicating that there is no waiting time and no error is reported when the synchronization fails.

Configuration example:

{"event_sync_timeout", "-1"};
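Because stream_sync_timeout and event_sync_timeout are both expressed in milliseconds, deriving the option strings from typed durations avoids unit mistakes. The following is a minimal sketch: MakeSyncTimeoutOptions is a hypothetical helper, and std::string is used in place of ge::AscendString so that the snippet builds without the GE headers.

```cpp
#include <chrono>
#include <map>
#include <string>

// Hypothetical helper: builds both timeout options from typed durations.
// Passing std::chrono::milliseconds(-1) reproduces the documented default.
std::map<std::string, std::string> MakeSyncTimeoutOptions(
    std::chrono::milliseconds stream_timeout,
    std::chrono::milliseconds event_timeout) {
  return {
      {"stream_sync_timeout", std::to_string(stream_timeout.count())},
      {"event_sync_timeout", std::to_string(event_timeout.count())},
  };
}
```

For example, MakeSyncTimeoutOptions(std::chrono::seconds(30), std::chrono::milliseconds(-1)) sets a 30000 ms stream timeout and keeps the default event timeout.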

Optional

Global/Session

Experiment Parameters

Key

Value

Required/Optional

Global/Session/Graph

ge.jit_compile

This option is not supported in the current version.

Optional

Global/Session

ge.build_inner_model

This option is not supported in the current version.

Optional

N/A

ge.disableOptimizations

This option is used for debugging and cannot be used in commercial products. The function specified by this option will be released as a feature in later versions.

This option applies only to the following products:

Atlas A2 training products / Atlas A2 inference products

Atlas A3 training products / Atlas A3 inference products

This option is used to specify one or more compilation and optimization passes to be disabled.

Currently, only the following passes can be disabled:

"RemoveSameConstPass","ConstantFoldingPass","TransOpWithoutReshapeFusionPass"

Note:

  1. Separate multiple passes with commas (,).
  2. If passes other than those listed above are specified, only warning logs are printed during graph build.
  3. If ConstantFoldingPass is disabled, graph build or running may fail.
  4. If other compilation optimization options, such as ge.oo.constantFolding, are configured, ge.disableOptimizations has a higher priority.

Configuration example:

  • Disabling a single pass
    std::map <AscendString, AscendString> session_options = {
    {"ge.disableOptimizations", "RemoveSameConstPass"}
    };
  • Disabling multiple passes
    std::map <AscendString, AscendString> session_options = {
    {"ge.disableOptimizations", "RemoveSameConstPass, ConstantFoldingPass"}
    };
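When the pass list is assembled programmatically, it can be filtered against the three supported names before it is joined with commas. The sketch below is illustrative only: JoinDisabledPasses is a hypothetical helper, and names outside the supported list are collected in ignored rather than passed on.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: builds the value for ge.disableOptimizations.
// Unsupported names are appended to *ignored instead of the result.
std::string JoinDisabledPasses(const std::vector<std::string>& passes,
                               std::vector<std::string>* ignored) {
  // The only passes documented as disableable in the current version.
  static const std::set<std::string> kSupported = {
      "RemoveSameConstPass", "ConstantFoldingPass",
      "TransOpWithoutReshapeFusionPass"};

  std::string value;
  for (const auto& pass : passes) {
    if (kSupported.count(pass) == 0) {
      if (ignored != nullptr) ignored->push_back(pass);
      continue;
    }
    if (!value.empty()) value += ",";
    value += pass;
  }
  return value;
}
```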

Optional

All

ge.oo.level

Extended option for debugging. It cannot be used in commercial products and will be released as a formal function in later versions.

Multi-level optimization options for graph build include subgraph optimization, entire graph optimization, and static shape model offloading.

Static shape model offloading: In this approach, the input and output shapes of all operators in a static shape model can be determined at build time, allowing for model-level memory orchestration and operator tiling computation to be completed on the host. These computations are then batched and sent to the device stream when the model is loaded, but they are not executed immediately. Instead, the execution of all tasks within the model is triggered by the delivery of model execution tasks.

Arguments:

  • O1: Disables all graph fusion and UB fusion passes, and performs only optimizations related to static offloading, such as InferShape (output tensor shape inference), constant folding, dead-edge elimination, and other optimizations.
  • O3 (default): Enables all optimizations.

Restrictions:

If the value is O1, all graph fusion and UB fusion passes are disabled, and only passes related to static offloading are enabled. However, the graph fusion passes in the following file remain enabled by default because functional problems may occur if they are disabled:

All graph fusion passes under the ExceptionalPassOfO1Level field in the ${INSTALL_DIR}/x86_64-linux/lib64/plugin/opskernel/fusion_pass/config/fusion_config.json file

Replace ${INSTALL_DIR} with the CANN component directory. For example, if the installation is performed by the root user, the default file storage path is /usr/local/Ascend/cann.

Configuration example:

{"ge.oo.level", "O3"};

Optional

All

ge.oo.constantFolding

Extended option for debugging. It cannot be used in commercial products and will be released as a formal function in later versions.

Sets whether to enable constant folding optimization.

Constant folding is the process of replacing nodes in a computational graph that can be evaluated to a constant output value with that constant, and simplifying the structure of the computational graph accordingly.
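To make the idea concrete, the following toy example folds constants in a small expression tree. It is a conceptual sketch only and does not reflect GE's graph data structures: Node and Fold are hypothetical names, and only addition and multiplication are modeled.

```cpp
#include <vector>

// Toy graph node: a constant, a runtime input, or an arithmetic operator.
struct Node {
  enum Kind { kConst, kInput, kAdd, kMul } kind;
  double value;              // meaningful only when kind == kConst
  std::vector<Node> inputs;  // operands of kAdd / kMul
};

// Returns a copy of the tree in which every operator whose operands are all
// constants has been replaced by a single constant node.
Node Fold(Node node) {
  for (auto& input : node.inputs) input = Fold(input);
  if (node.kind != Node::kAdd && node.kind != Node::kMul) return node;

  for (const auto& input : node.inputs) {
    if (input.kind != Node::kConst) return node;  // not foldable
  }
  double result = (node.kind == Node::kAdd) ? 0.0 : 1.0;
  for (const auto& input : node.inputs) {
    result = (node.kind == Node::kAdd) ? result + input.value
                                       : result * input.value;
  }
  return Node{Node::kConst, result, {}};
}
```

For a graph that computes (2 + 3) * x, folding replaces the addition with the constant 5 and leaves the multiplication in place because x is known only at run time.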

Arguments:

  • true (default): enabled
  • false: disabled

Configuration example:

{"ge.oo.constantFolding", "true"};

Restrictions:

If ge.disableOptimizations is also configured, ge.disableOptimizations has a higher priority than this option.
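One way to read this priority rule is sketched below. It is an interpretation, not GE code: ConstantFoldingEnabled is a hypothetical helper, and it assumes that listing ConstantFoldingPass in ge.disableOptimizations disables folding regardless of the value of ge.oo.constantFolding.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: reports whether constant folding would be in effect
// for the given options under the assumption stated above.
bool ConstantFoldingEnabled(const std::map<std::string, std::string>& opts) {
  const auto disabled = opts.find("ge.disableOptimizations");
  if (disabled != opts.end()) {
    std::stringstream list(disabled->second);
    std::string pass;
    while (std::getline(list, pass, ',')) {
      // Trim surrounding blanks so "A, B" and "A,B" are treated alike.
      const auto begin = pass.find_first_not_of(" \t");
      if (begin == std::string::npos) continue;
      const auto end = pass.find_last_not_of(" \t");
      if (pass.substr(begin, end - begin + 1) == "ConstantFoldingPass") {
        return false;  // the higher-priority option wins
      }
    }
  }
  const auto folding = opts.find("ge.oo.constantFolding");
  // Documented default is "true" when the option is absent.
  return folding == opts.end() || folding->second != "false";
}
```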

Optional

All

ge.oo.deadCodeElimination

Extended option for debugging. It cannot be used in commercial products and will be released as a formal function in later versions.

Sets whether to enable dead-edge elimination optimization.

Dead-edge elimination: When pred (input 1) of a switch statement is a constant node, one of the branches can be eliminated based on the value of const. If const is true, the false branch is eliminated; if const is false, the true branch is eliminated.
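The rule can be illustrated with a toy switch node. This is a conceptual sketch rather than GE's graph representation: SwitchNode and EliminateDeadEdge are hypothetical names, and each branch is modeled simply as a list of operator names.

```cpp
#include <string>
#include <vector>

// Toy model of a switch: pred selects which branch is executed.
struct SwitchNode {
  bool pred_is_const;                     // true if pred comes from a constant node
  bool pred_value;                        // meaningful only when pred_is_const
  std::vector<std::string> true_branch;   // operators run when pred is true
  std::vector<std::string> false_branch;  // operators run when pred is false
};

// Removes the branch that can never be taken when pred is a constant.
void EliminateDeadEdge(SwitchNode& node) {
  if (!node.pred_is_const) return;  // the branch is decided only at run time
  if (node.pred_value) {
    node.false_branch.clear();
  } else {
    node.true_branch.clear();
  }
}
```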

Arguments:

  • true (default): enabled
  • false: disabled

Configuration example:

{"ge.oo.deadCodeElimination", "true"};

Optional

All

ge.autoMultistreamParallelMode

Extended option for debugging. It cannot be used in commercial products and will be released as a formal function in later versions.

This option applies only to graphs with a static shape. You can enable parallel execution of Cube and Vector operators to improve graph execution performance.

Arguments:

  • cv: Parallel execution of Cube and Vector operators is enabled.
  • None (default): Parallel execution of Cube and Vector operators is disabled.
NOTICE:
  • This option is used only for recommendation networks.
  • If you execute Cube and Vector operators concurrently, tasks on multiple streams cannot be run at the same time (this can be controlled by ENABLE_DYNAMIC_SHAPE_MULTI_STREAM).

    For details about environment variables, see Environment Variables.

Configuration example:
{"ge.autoMultistreamParallelMode", "cv"};

Optional

session/graph

ge.DeterministicLevel

Extended option for debugging. It cannot be used in commercial products and will be released as a formal function in later versions. It applies only to 8.5.1 and later versions.

Specifies the deterministic computing level.

By default, the deterministic level is 0, that is, deterministic computing is disabled. The value of ge.deterministic must also be 0. To enable deterministic computing, set the deterministic level to 1 and the deterministic computing option ge.deterministic to 1. To enable strong consistency computing, set the level to 2 and the deterministic computing option ge.deterministic to 1.

If strong consistency computing is enabled (ge.DeterministicLevel is set to 2), the computing result is deterministic, meaning that multiple executions generate the same result. In addition, the computing result is independent of the data location. For example, in matrix multiplication, the order of accumulation may vary across rows, which can lead to slight differences in results for the same data in different rows. When strong consistency computing is enabled, the computing results are consistent as long as the inputs are the same, even if they are in different rows.

By default, the strong consistency computing function is disabled. In this default mode, the computing results may be inconsistent when the same data appears in different rows.

For performance considerations, you are advised not to enable strong consistency computing because it slows down the computing speed of operators and affects the overall efficiency. You are advised to enable this function only when the computing result is required to be strictly consistent for the same data in different locations or the model precision is being adjusted and debugged to optimize the overall performance.

Arguments:

  • 0 (default): Disables deterministic computing.
  • 1: Enables deterministic computing.
  • 2: Enables strong consistency computing.

Configuration example:

{"ge.deterministic", "0"};
{"ge.DeterministicLevel", "0"};

Restrictions:

This configuration item must be used together with ge.deterministic.
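The valid pairings of the two options described above can be checked with a small function. This is a sketch based only on the combinations stated in this section; IsValidDeterministicConfig is a hypothetical helper, not a GE API.

```cpp
#include <string>

// Hypothetical helper: returns true for the documented combinations
//   level 0 -> ge.deterministic = 0 (deterministic computing disabled)
//   level 1 -> ge.deterministic = 1 (deterministic computing)
//   level 2 -> ge.deterministic = 1 (strong consistency computing)
bool IsValidDeterministicConfig(const std::string& deterministic,
                                const std::string& level) {
  if (level == "0") return deterministic == "0";
  if (level == "1" || level == "2") return deterministic == "1";
  return false;  // any other level is outside the documented arguments
}
```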

Optional

Global

Parameters That Will Be Deprecated in Later Versions

Key

Value

Required/Optional

Global/Session/Graph

ge.graphMemoryMaxSize

Do not use this option because it will be deprecated in later versions.

Network static memory size and maximum dynamic memory size. The value varies according to the network size. The unit is bytes and the value range is [0, 256 × 1024 × 1024 × 1024], that is, [0, 274877906944]. Due to chip hardware performance restrictions, the sum of ge.graphMemoryMaxSize and ge.variableMemoryMaxSize must not exceed 31 GB. If this option is not set, the default value 26 GB is used.

Optional

All

ge.variableMemoryMaxSize

Do not use this option because it will be deprecated in later versions.

Variable memory size. The value varies according to the network size. The unit is bytes and the value range is [0, 256 × 1024 × 1024 × 1024], that is, [0, 274877906944]. Due to chip hardware performance restrictions, the sum of ge.graphMemoryMaxSize and ge.variableMemoryMaxSize must not exceed 31 GB. If this option is not set, the default value 5 GB is used.
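The two limits on ge.graphMemoryMaxSize and ge.variableMemoryMaxSize can be verified together before the options are set. The following is a minimal sketch: MemoryLimitsValid is a hypothetical helper, and 1 GB is assumed to mean 1024 × 1024 × 1024 bytes, matching the way the value range is written.

```cpp
#include <cstdint>

// Assumption: "GB" in this section means 1024^3 bytes.
constexpr std::uint64_t kGiB = 1024ULL * 1024ULL * 1024ULL;
constexpr std::uint64_t kMaxPerOption = 256 * kGiB;  // upper bound of each option
constexpr std::uint64_t kMaxSum = 31 * kGiB;         // limit on the sum of both

static_assert(kMaxPerOption == 274877906944ULL, "matches the documented bound");

// Hypothetical helper: checks both values (in bytes) against the documented
// per-option range and the combined 31 GB limit.
bool MemoryLimitsValid(std::uint64_t graph_bytes, std::uint64_t variable_bytes) {
  return graph_bytes <= kMaxPerOption && variable_bytes <= kMaxPerOption &&
         graph_bytes + variable_bytes <= kMaxSum;
}
```

With the documented defaults (26 GB and 5 GB), the sum is exactly 31 GB, so raising either value requires lowering the other.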

Optional

All

ge.exec.dynamicGraphExecuteMode

This option is deprecated. Avoid using it.

Execution mode, applicable to the dynamic input scenario. The value is dynamic_execute.

Optional

Graph

ge.exec.dataInputsShapeRange

This option is deprecated. Avoid using it.

Shape range of dynamic input. If a graph has two data inputs, the configuration example is as follows:

std::map<ge::AscendString, ge::AscendString> ge_options = {{"ge.exec.deviceId", "0"},
      {"ge.graphRunMode", "1"},
      {"ge.exec.dynamicGraphExecuteMode", "dynamic_execute"},
      {"ge.exec.dataInputsShapeRange", "[128 ,3~5, 2~128, -1],[ 128 ,3~5, 2~128, -1]"}};
  • Set it in the format: "[n1, c1, h1, w1],[n2, c2, h2, w2]" (for example, "[8~20, 3, 5, -1],[5, 3~9, 10, -1]"). If node names are not configured, the first pair of brackets ([]) denotes the first input node. Separate the nodes with commas (,). In this case, the index attribute must be set sequentially from 0 for data nodes.
  • The size of a static dimension is specified by a definite value. The size range of a dynamic dimension (with the shape range) is specified by using a tilde (~). A dynamic dimension without size range specified is denoted by -1.
  • For a scalar input, the shape range is also needed. Enclose the range in square brackets ([]).
  • If your graph has three inputs and only the first one has a static shape, the static shape must still be specified in the options field.

    {"ge.exec.dataInputsShapeRange", "[3,3,4,10], [-1,3,2~1000,-1],[-1,-1,-1,-1]"};
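The format rules above can be exercised with a small parser that turns the option string into per-input dimension ranges. This is a sketch for illustration, not the parser used by GE: ParseShapeRanges, DimRange, and ShapeRange are hypothetical names, a static dimension n is represented as the range (n, n), and an unbounded dimension as (-1, -1).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using DimRange = std::pair<std::int64_t, std::int64_t>;  // inclusive (min, max)
using ShapeRange = std::vector<DimRange>;                // one entry per dimension

// Hypothetical helper: parses "[a, b~c, -1],[...]" into one ShapeRange per
// input. Throws std::invalid_argument on a missing ']' or a non-numeric token.
std::vector<ShapeRange> ParseShapeRanges(const std::string& text) {
  std::vector<ShapeRange> inputs;
  std::size_t pos = 0;
  while ((pos = text.find('[', pos)) != std::string::npos) {
    const std::size_t close = text.find(']', pos);
    if (close == std::string::npos) throw std::invalid_argument("missing ']'");
    const std::string body = text.substr(pos + 1, close - pos - 1);

    ShapeRange shape;
    std::size_t start = 0;
    while (start <= body.size()) {
      std::size_t comma = body.find(',', start);
      if (comma == std::string::npos) comma = body.size();
      std::string token = body.substr(start, comma - start);
      start = comma + 1;
      // Blanks around numbers are allowed in the examples, so drop them.
      token.erase(std::remove(token.begin(), token.end(), ' '), token.end());
      if (token.empty()) continue;

      const std::size_t tilde = token.find('~');
      if (tilde == std::string::npos) {
        const std::int64_t dim = std::stoll(token);  // static size or -1
        shape.emplace_back(dim, dim);
      } else {
        shape.emplace_back(std::stoll(token.substr(0, tilde)),
                           std::stoll(token.substr(tilde + 1)));
      }
    }
    inputs.push_back(shape);
    pos = close + 1;
  }
  return inputs;
}
```

Parsing the three-input example above yields three shapes, where the third dimension of the second input is the range (2, 1000).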

NOTE:
  • If no node name is specified, nodes are stored in the index sequence by default. The following is an example:

    xxx_0, xxx_1, xxx_2, ...

    The content following the underscore (_) is the sequence index of a node in the network script. Nodes are arranged in alphabetical order of their names, so indexes are compared as strings rather than as numbers. If the number of nodes is greater than 10, the sequence is xxx_0 > xxx_1 > xxx_10 > xxx_2 > xxx_3: the node with index 10 is placed before the node with index 2. As a result, the defined shape range does not match the input node.

    To avoid this problem, when the number of input nodes is greater than 10, you are advised to specify node names in the network script. The shape ranges are then associated with the nodes by the specified names.

  • If this option and ge.dynamicDims are both configured as follows:
    std::map<ge::AscendString, ge::AscendString> ge_options = 
         {{"ge.inputShape", "data:1,1,40,-1;label:1,-1;mask:-1,-1" },
          {"ge.dynamicDims", "20,20,1,1;40,40,2,2;80,60,4,4"},
            xxx
          {"ge.exec.dataInputsShapeRange", "[128, 3~5, 2~128, -1],[ 128 ,3~5, 2~128, -1]"}};

    The priority of ge.dynamicDims (dynamic dimension size profiles) is higher than that of ge.exec.dataInputsShapeRange (dynamic shape range).
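The node-ordering pitfall described in the note above is easy to reproduce with an ordinary string sort. The sketch below generates names in the xxx_N pattern and sorts them alphabetically; SortedNodeNames is a hypothetical helper, and "xxx" is the same placeholder prefix used in this section.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: returns count default-style node names sorted
// alphabetically, which is how unnamed nodes are described as being ordered.
std::vector<std::string> SortedNodeNames(int count) {
  std::vector<std::string> names;
  for (int i = 0; i < count; ++i) {
    names.push_back("xxx_" + std::to_string(i));
  }
  std::sort(names.begin(), names.end());  // string comparison, not numeric
  return names;
}
```

With 11 nodes, the third element of the result is xxx_10 and the fourth is xxx_2, so a shape range written for the third input would be applied to the node with index 10.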

Optional

Graph

ge.opSelectImplmode

This option is no longer being developed and will be deprecated in later versions. You are advised to use ge.exec.op_precision_mode instead.

Operator implementation mode selection. Certain operators built in the Ascend AI Processor can be implemented in either high-precision or high-performance mode at model build time.

In high-precision mode, Taylor's theorem or Newton's method is used to improve operator precision with float16 input. In high-performance mode, optimal performance is achieved without affecting the network precision (float16).

Arguments:

  • high_precision: high-precision mode.

    This option sets the operator implementation mode by using the built-in configuration file, which is stored in ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/impl_mode/high_precision.ini.

    To ensure compatibility, this argument takes effect only for the operator list in the high_precision.ini file. This list can be used to control the effective scope of operators and ensure that the network models of earlier versions are not affected.

  • high_performance (default): high-performance mode.

    This option sets the operator implementation mode by using the built-in configuration file, which is stored in ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/impl_mode/high_performance.ini.

    To ensure compatibility, this argument takes effect only for the operator list in the high_performance.ini file. This list can be used to control the effective scope of operators and ensure that the network models of earlier versions are not affected.

  • high_precision_for_all: high-precision mode.

    This option sets the operator implementation mode by using the built-in configuration file, which is stored in ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/impl_mode/high_precision_for_all.ini. The list in this file may be updated with the version.

    This implementation mode may cause incompatibility. If an operator in the new software package sets the implementation mode (that is, an implementation mode is added for a certain operator in the configuration file), the performance of the earlier network model that uses the high_precision_for_all mode may deteriorate.

  • high_performance_for_all: high-performance mode.

    This option sets the operator implementation mode by using the built-in configuration file, which is stored in ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/impl_mode/high_performance_for_all.ini. The list in this file may be updated with the version.

    This implementation mode may cause incompatibility. If an operator in the new software package sets the implementation mode (that is, an implementation mode is added for a certain operator in the configuration file), the precision of the earlier network model that uses the high_performance_for_all mode may deteriorate.

The preceding implementation modes are distinguished based on dtype of the operator. Replace ${INSTALL_DIR} with the CANN component directory. For example, if the installation is performed by the root user, the default file storage path is /usr/local/Ascend/cann.

Configuration example:

{"ge.opSelectImplmode", "high_performance"};

Optional

Global

ge.shape_generalized_build_mode

Do not use this option because it will be deprecated in later versions.

Optional

Graph