Command-Line Options

Atlas 200/300/500 Inference Product: This feature is not supported.

This section describes the configuration options passed to GEInitialize, the Session constructor, and AddGraph. Options passed to these interfaces take effect globally, within a session, and within a graph, respectively.

Table 1 lists only the configuration options supported by the current version. If an option is not listed in the table, it is reserved or applicable to other Ascend AI Processor versions.

Table 1 Options key-value configuration

Key

Value

Required

Global/Session/Graph

ge.graphRunMode

Graph run mode.

  • 0: online inference. Defaults to 0.
  • 1: training

Optional

Global/Session

ge.exec.deviceId

Logical ID of the operated device when the GE instance is running.

  • In the online inference scenario, the value ranges from -1 to N - 1. The default value is -1.
  • In the training scenario, the value ranges from 0 to N - 1. The default value is 0.

N indicates the number of available Ascend AI Processors on the server.

Optional

Global
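The value ranges of ge.exec.deviceId can be checked before the options are passed on. The following is a minimal sketch under stated assumptions: RunMode and IsValidDeviceId are hypothetical names introduced for illustration and are not part of the GE API.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the GE API.
// The enumerators mirror the documented values of ge.graphRunMode.
enum class RunMode { kOnlineInference = 0, kTraining = 1 };

// Returns true if device_id lies in the documented range for the run mode.
// device_count is N, the number of available Ascend AI Processors.
bool IsValidDeviceId(std::int32_t device_id, std::int32_t device_count, RunMode mode) {
  if (device_count <= 0) return false;
  // Online inference additionally accepts -1, which is also its default.
  const std::int32_t lower = (mode == RunMode::kOnlineInference) ? -1 : 0;
  return device_id >= lower && device_id <= device_count - 1;
}
```

With N = 8, the value -1 passes for online inference but not for training, and 8 is rejected in both modes.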

ge.socVersion

Target model of the Ascend AI Processor for model build and optimization.

  • Run the npu-smi info command on the server where the Ascend AI Processor is installed to obtain the Chip Name information. The actual value is the string Ascend followed by the Chip Name. For example, if Chip Name is xxxyy, the actual value is Ascendxxxyy.

No

All

ge.inputShape

Shape of model input.

Arguments:

  • Static shape.
    • If the model has a single input, the shape information is "input_name:n,c,h,w".
    • If the model has multiple inputs, the shape information is "input_name1:n1,c1,h1,w1;input_name2:n2,c2,h2,w2". Different inputs are separated by semicolons (;). input_name must be the name of a node in the network model before conversion.
  • If dimension values of the input data in the original model are not fixed, the model can be converted by setting the shape profile or shape range:
    • Setting the shape profile: includes the dynamic dimension profiles (a maximum of four dimensions).

      When setting ge.inputShape, set the corresponding dimension value to -1. This option must be used together with ge.dynamicDims and ge.dynamicNodeType.

    • Setting the shape range: The shape range cannot be set for the Atlas 200/300/500 Inference Product.

      When setting ge.inputShape, you can define the corresponding dimension with a range of valid values, for example, 1~10.

      • To set the shape range based on node names, the format is "input_name1:n1,c1,h1,w1;input_name2:n2,c2,h2,w2", for example, "input_name1:8~20,3,5,-1;input_name2:5,3~9,10,-1". Enclose the specified nodes in double quotation marks (""), and separate them by semicolons (;). input_name must be the node name in the network model before model conversion. As a best practice, you should set the option based on node names.

      If you do not want to specify a range or a fixed value for a dimension, set it to -1, which indicates that the dimension can be any value greater than or equal to 0. In this case, the upper limit is the maximum int64 value. In practice, the value is also limited by the physical memory of the host and device, so you may need to increase the memory size to support larger values.

  • Scalar shape.
    • Non-dynamic profile scenario:

      A scalar input is optional in ge.inputShape. For example, if the model has two inputs, where input_name1 is a scalar with shape in the "[]" format and input_name2 has the shape [n2,c2,h2,w2], the shape information of the model is "input_name1:;input_name2:n2,c2,h2,w2". Different inputs are separated by semicolons (;). input_name must be the node name in the network model before conversion. To configure the scalar input, leave its shape empty.

    • Dynamic profile scenario:

      If the model has both scalar inputs and dynamic-profile inputs, the scalar input must be configured. For example, if a model has three inputs, A:[-1,c1,h1,w1], B:[], and C:[n2,c2,h2,w2], the shape information is "A:-1,c1,h1,w1;B:;C:n2,c2,h2,w2". Scalar input B must be configured.

Configuration example:

  • Static shape. For example, if the input shape information of a network consists of two inputs (input_0_0 [16,32,208,208] and input_1_0 [16,64,208,208]), the configuration of ge.inputShape is as follows:
    {"ge.inputShape", "input_0_0:16,32,208,208;input_1_0:16,64,208,208"}
  • For details about how to set profiles for a specified dimension, see ge.dynamicDims.
  • The following is an example of setting the shape range:
    {"ge.inputShape", "input_0_0:1~10,32,208,208;input_1_0:16,64,100~208,100~208"}
  • Scalar shape.
    • Non-dynamic profile scenario:

      A scalar input is optional in ge.inputShape. For example, if the model has two inputs, where input_name1 is a scalar and input_name2 has the shape [16,32,208,208], the configuration example is as follows:

      {"ge.inputShape", "input_name1:;input_name2:16,32,208,208"}

      In the preceding example, input_name1 is optional.

    • Dynamic profile scenario:

      A scalar input must be configured. For example, if the model has three inputs and the shape information is A:[-1,32,208,208], B:[], and C:[16,64,208,208], the configuration example is as follows (A is the dynamic profile input, and the batch size profile is used as an example):

      {"ge.inputShape", "A:-1,32,208,208;B:;C:16,64,208,208"},
      {"ge.dynamicDims", "1,2,4"}
NOTE:

In the preceding scenarios, ge.inputShape is optional. If this option is not set, the shapes of the corresponding data nodes are used by default. If it is set, the passed shapes are used and written back to the corresponding data nodes.

No

Session/Graph
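The ge.inputShape string formats described above (static shape, shape range, and scalar input) can be assembled programmatically. The following is a minimal sketch under stated assumptions: InputSpec and BuildInputShape are hypothetical names introduced for illustration, and the helper only formats the string without validating node names or ranges.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of the GE API.
// Each input is (node name, dimensions). A dimension is kept as text so that
// fixed values ("16"), profile placeholders ("-1"), and ranges ("1~10") all fit.
// An empty dimension list denotes a scalar input and is rendered as "name:".
using InputSpec = std::pair<std::string, std::vector<std::string>>;

std::string BuildInputShape(const std::vector<InputSpec>& inputs) {
  std::string result;
  for (std::size_t i = 0; i < inputs.size(); ++i) {
    if (i > 0) result += ';';  // inputs are separated by semicolons
    result += inputs[i].first + ':';
    const std::vector<std::string>& dims = inputs[i].second;
    for (std::size_t d = 0; d < dims.size(); ++d) {
      if (d > 0) result += ',';  // dimensions are separated by commas
      result += dims[d];
    }
  }
  return result;
}
```

The returned string is what would be supplied as the value of ge.inputShape. Dimensions are kept as text so that one helper covers all three formats.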

ge.dynamicDims

Dynamic dimension profiles in ND format. Applies to scenarios where inputs of arbitrary dimensions are processed in each inference run. This option must be used together with ge.inputShape.

Argument: formatted as "dim1,dim2,dim3;dim4,dim5,dim6;dim7,dim8,dim9"

Format: Enclose the whole argument in double quotation marks (""). Separate the dimension sizes within a profile with commas (,) and separate profiles with semicolons (;). The dimension size values match the -1 placeholders in ge.inputShape with ordering preserved, and the number of -1 placeholders equals the number of dimension sizes in each profile. Set at least two dynamic dimension size profiles.

Restrictions: The number of profiles must be in the range (1, 100]. You are advised to set 3 or 4 profiles.

Examples:

  • If the model has only one input:
    {"ge.inputShape", "data:1,-1,-1"}
    {"ge.dynamicDims", "1,2;3,4;5,6;7,8"}
    // During graph running, the supported shapes of data operators are 1,1,2; 1,3,4; 1,5,6; 1,7,8.
  • If the network model has multiple inputs:

    The dimension size values match the -1 placeholders in the argument with ordering preserved, and the number of -1 placeholders equals the number of dimension sizes of each profile. Assume that a network model has three inputs: data (1, 1, 40, T), label (1, T) and mask (T, T), where T indicates a dynamic dimension. The configuration example is as follows:

    {"ge.inputShape", "data:1,1,40,-1;label:1,-1;mask:-1,-1"}
    {"ge.dynamicDims", "20,20,1,1;40,40,2,2;80,60,4,4"}
    // During graph running, the following input dims combinations are supported:
    // Profile 0: data(1,1,40,20)+label(1,20)+mask(1,1)
    // Profile 1: data(1,1,40,40)+label(1,40)+mask(2,2)
    // Profile 2: data(1,1,40,80)+label(1,60)+mask(4,4)

No

Session/Graph
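The way each ge.dynamicDims profile fills the -1 placeholders of ge.inputShape in order can be reproduced with a short string substitution. The following is a minimal sketch under stated assumptions: Split, ApplyProfile, and ExpandProfiles are hypothetical names introduced for illustration. It only mirrors the matching rule described above and does not reflect how GE implements it.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helpers, not part of the GE API.

// Splits text on a single-character delimiter.
std::vector<std::string> Split(const std::string& text, char delim) {
  std::vector<std::string> parts;
  std::string item;
  std::istringstream stream(text);
  while (std::getline(stream, item, delim)) parts.push_back(item);
  return parts;
}

// Replaces the -1 placeholders in input_shape, in order, with one profile's values.
std::string ApplyProfile(const std::string& input_shape,
                         const std::vector<std::string>& profile) {
  std::string result;
  std::string token;
  std::size_t next = 0;
  auto flush = [&]() {
    if (token == "-1") {
      if (next >= profile.size()) throw std::invalid_argument("too few values in profile");
      result += profile[next++];
    } else {
      result += token;
    }
    token.clear();
  };
  for (char c : input_shape) {
    if (c == ':' || c == ',' || c == ';') {
      flush();
      result += c;
    } else {
      token += c;
    }
  }
  flush();
  // The number of placeholders must equal the number of values in the profile.
  if (next != profile.size()) throw std::invalid_argument("too many values in profile");
  return result;
}

// Expands every profile of dynamic_dims into a concrete shape string.
std::vector<std::string> ExpandProfiles(const std::string& input_shape,
                                        const std::string& dynamic_dims) {
  std::vector<std::string> shapes;
  for (const std::string& profile : Split(dynamic_dims, ';')) {
    shapes.push_back(ApplyProfile(input_shape, Split(profile, ',')));
  }
  return shapes;
}
```

Applied to the three-input example, the third profile "80,60,4,4" expands to data(1,1,40,80), label(1,60), and mask(4,4), matching Profile 2 in the comments above.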

ge.dynamicNodeType

Sets the type of a dynamic input node.

  • 0: dataset input.
  • 1: placeholder input.

Only one type of dynamic input is allowed: dataset or placeholder.

No

Session/Graph

ge.exec.precision_mode

A string for the operator precision mode. This option cannot be used together with ge.exec.precision_mode_v2. You are advised to use ge.exec.precision_mode_v2.

  • force_fp32/cube_fp16in_fp32out:
    force_fp32 has the same effect as cube_fp16in_fp32out. The system selects a processing mode depending on whether the operator is a cube or vector operator. cube_fp16in_fp32out was added in a later version and has clearer semantics for cube operators.
    • For cube operators, the system processes the computation based on the operator implementation.
      1. The preferred input data type is float16 and the output data type is float32.
      2. If the float16 input data and float32 output data types are not supported, set both the input and output data types to float32.
      3. If the float32 input and output data types are not supported, set both the input and output data types to float16.
      4. If the float16 input and output data types are not supported, an error is reported.
    • For vector operators, float32 is forcibly selected for operators supporting both float16 and float32, even if the original precision is float16.

      This argument is invalid if your model contains operators not supporting float32, for example, an operator that supports only float16. In this case, float16 is retained. If the operator does not support float32 and it is configured to the blocklist of precision reduction (by setting precision_reduce to false), the counterpart AI CPU operator supporting float32 is used. If the AI CPU operator is not supported, an error is reported.

  • force_fp16:

    Forces float16 for operators supporting both float16 and float32.

  • allow_fp32_to_fp16:
    • For cube operators, float16 is used.
    • For vector operators, the original precision is preserved for operators supporting float32; otherwise, float16 is forced.
  • must_keep_origin_dtype:

    Retain the original precision.

    • If the precision of an operator in the original graph is float16, and the implementation of the operator in the AI Core does not support float16 but supports only float32 and bfloat16, the system automatically uses high-precision float32.
    • If the precision of an operator in the original graph is float16, and the implementation of the operator in the AI Core does not support float16 but supports only bfloat16, the AI CPU operator of float16 is used. If the AI CPU operator is not supported, an error is reported.
    • If the precision of an operator in the original graph is float32, and the implementation of the operator in the AI Core does not support float32 but supports only float16, the AI CPU operator of float32 is used. If the AI CPU operator is not supported, an error is reported.
  • allow_mix_precision/allow_mix_precision_fp16:

    allow_mix_precision has the same effect as allow_mix_precision_fp16: mixed precision of float16 and float32 is used for neural network processing. allow_mix_precision_fp16 was added in a later version and has clearer semantics.

    In this mode, float16 is automatically used for certain float32 operators based on the built-in tuning policies. This will improve system performance and reduce memory footprint with minimal accuracy degradation.

    If this mode is used, you can view the value of the precision_reduce option in the built-in tuning policy file ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/config/<soc_version>/aic-<soc_version>-ops-info.json in the OPP installation directory.

    • If it is set to true, the operator is on the mixed precision trustlist and its precision will be reduced from float32 to float16.
    • If it is set to false, the operator is on the mixed precision blocklist and its precision will not be reduced from float32 to float16.
    • If an operator does not have the precision_reduce option configured, the operator is on the graylist and will follow the same precision processing as the upstream operator.

Default:

In training scenarios on the Atlas Training Series Product, the default value of this option is allow_fp32_to_fp16.

In the online inference scenario, the default value is force_fp16.

No

All
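The four-step fallback order that force_fp32/cube_fp16in_fp32out applies to cube operators can be written as a simple preference chain. The following is a minimal sketch under stated assumptions: CubeSupport, CubeDtypes, and SelectCubeDtypes are hypothetical names introduced for illustration. The sketch models only the documented order, not the actual operator selection code.

```cpp
// Hypothetical description of what a cube operator implementation supports.
struct CubeSupport {
  bool fp16_in_fp32_out;
  bool fp32_in_fp32_out;
  bool fp16_in_fp16_out;
};

// Outcome of the selection. kUnsupported stands for the documented error case.
enum class CubeDtypes { kFp16InFp32Out, kFp32InFp32Out, kFp16InFp16Out, kUnsupported };

// Walks the documented preference order from step 1 to step 4.
CubeDtypes SelectCubeDtypes(const CubeSupport& support) {
  if (support.fp16_in_fp32_out) return CubeDtypes::kFp16InFp32Out;  // step 1
  if (support.fp32_in_fp32_out) return CubeDtypes::kFp32InFp32Out;  // step 2
  if (support.fp16_in_fp16_out) return CubeDtypes::kFp16InFp16Out;  // step 3
  return CubeDtypes::kUnsupported;                                  // step 4
}
```

An operator that supports only float32 input and output, or float16 input and output, falls through to the second branch.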

ge.exec.precision_mode_v2

A string for the operator precision mode. This option cannot be used together with ge.exec.precision_mode. You are advised to use ge.exec.precision_mode_v2.

  • fp16:

    Forces float16 for operators supporting both float16 and float32.

  • origin:

    Retain the original precision.

    • If the precision of an operator in the original graph is float16, and the implementation of the operator in the AI Core does not support float16 but supports only float32 and bfloat16, the system automatically uses high-precision float32.
    • If the precision of an operator in the original graph is float16, and the implementation of the operator in the AI Core does not support float16 but supports only bfloat16, the AI CPU operator of float16 is used. If the AI CPU operator is not supported, an error is reported.
    • If the precision of an operator in the original graph is float32, and the implementation of the operator in the AI Core does not support float32 but supports only float16, the AI CPU operator of float32 is used. If the AI CPU operator is not supported, an error is reported.
  • cube_fp16in_fp32out:
    The system selects a processing mode based on the operator type for operators supporting both float16 and float32.
    • For cube operators, the system processes the computation based on the operator implementation.
      1. The preferred input data type is float16 and the output data type is float32.
      2. If the float16 input data and float32 output data types are not supported, set both the input and output data types to float32.
      3. If the float32 input and output data types are not supported, set both the input and output data types to float16.
      4. If the float16 input and output data types are not supported, an error is reported.
    • For vector operators, float32 is forcibly selected for operators supporting both float16 and float32, even if the original precision is float16.

      This argument is invalid if your model contains operators not supporting float32, for example, an operator that supports only float16. In this case, float16 is retained. If the operator does not support float32 and it is configured to the blocklist of precision reduction (by setting precision_reduce to false), the counterpart AI CPU operator supporting float32 is used. If the AI CPU operator is not supported, an error is reported.

  • mixed_float16:

    Mixed precision of float16 and float32 is used for neural network processing. Computations are done in float16 for float32 operators according to the built-in tuning policies. This will improve system performance and reduce memory footprint with minimal accuracy degradation.

    If this mode is used, you can view the value of the precision_reduce option in the built-in tuning policy file ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/config/<soc_version>/aic-<soc_version>-ops-info.json in the OPP installation directory.

    • If it is set to true, the operator is on the mixed precision trustlist and its precision will be reduced from float32 to float16.
    • If it is set to false, the operator is on the mixed precision blocklist and its precision will not be reduced from float32 to float16.
    • If an operator does not have the precision_reduce option configured, the operator is on the graylist and will follow the same precision processing as the upstream operator.
  • mixed_hif8: enables automatic mixed precision, indicating that hifloat8 (for details about this data type, see Link), float16, and float32 are used together to process the neural network. In this mode, hifloat8 is automatically used for certain float16 and float32 operators based on the built-in tuning policies. This will improve system performance and reduce memory footprint with minimal precision degradation. The current version does not support this option.

    If this mode is used, you can view the value of precision_reduce in the built-in tuning policy file ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/config/<soc_version>/aic-<soc_version>-ops-info.json.

    • true: The operator is on the mixed precision trustlist and its precision will be reduced from float16/float32 to hifloat8.
    • false: The operator is on the mixed precision blocklist and its precision will not be reduced from float16/float32 to hifloat8.
    • If an operator does not have the precision_reduce option configured, the operator is on the graylist and will follow the same precision processing as the upstream operator.
  • cube_hif8: The hifloat8 data type is forcibly used if the Cube operator in the network model supports both hifloat8 and float16/float32. The current version does not support this option.

Default:

In training scenarios on the Atlas Training Series Product, this option has no default value. The default value of precision_mode, allow_fp32_to_fp16, is used.

In online inference scenarios, the default value is "fp16".

No

All
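Because ge.exec.precision_mode and ge.exec.precision_mode_v2 cannot be used together, an option map can be checked before it is passed on. The following is a minimal sketch under stated assumptions: PrecisionOptionsAreExclusive is a hypothetical name, and std::string stands in for ge::AscendString so that the sketch compiles without the GE headers.

```cpp
#include <map>
#include <string>

// Hypothetical pre-check, not part of the GE API.
// Returns false when both precision options are present in the same option map.
bool PrecisionOptionsAreExclusive(const std::map<std::string, std::string>& options) {
  const bool has_v1 = options.count("ge.exec.precision_mode") > 0;
  const bool has_v2 = options.count("ge.exec.precision_mode_v2") > 0;
  return !(has_v1 && has_v2);
}
```

The check looks only at one map. Whether the two options conflict when set at different scopes is not stated here, so the sketch does not cover that case.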

ge.exec.modify_mixlist

When mixed precision is enabled, you can use this parameter to specify the path and file name of the blocklist, trustlist, and graylist, and specify the operators that allow precision reduction and those that do not allow precision reduction. Set this parameter to the path and file name. The file is in JSON format.

For the blocklist, trustlist, and graylist, you can view the value of flag in the precision_reduce option in the built-in tuning policy file ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/config/<soc_version>/aic-<soc_version>-ops-info.json.

  • true (trustlist): Precision reduction is allowed in mixed precision mode.
  • false (blocklist): Precision reduction is not allowed in mixed precision mode.
  • Not specified (graylist): Operators on the graylist follow the same precision processing as their upstream operators.
Example:
{"ge.exec.modify_mixlist", "/home/test/ops_info.json"};

You can specify the operator type (or types separated by commas) in ops_info.json as follows.

{
  "black-list": {                  // Blocklist
     "to-remove": [                // Move an operator from the blocklist to the graylist.
     "Xlog1py"
     ],
     "to-add": [                   // Move an operator from the trustlist or graylist to the blocklist.
     "Matmul",
     "Cast"
     ]
  },
  "white-list": {                  // Trustlist
     "to-remove": [                // Move an operator from the trustlist to the graylist.
     "Conv2D"
     ],
     "to-add": [                   // Move an operator from the blocklist or graylist to the trustlist.
     "Bias"
     ]
  }
}

The operators in the preceding example configuration file are for reference only. Base the configuration on the actual hardware environment and the built-in tuning policies of the operators. To check which list an operator is on, query its precision_reduce flag in the built-in tuning policy file, for example:

"Conv2D":{
    "precision_reduce":{
        "flag":"true"
    }
},

true: trustlist; false: blocklist; Not configured: graylist.

No

All
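The effect of the to-add and to-remove entries on the three lists can be modeled as moves between list memberships. The following is a minimal sketch under stated assumptions: MixList, MixListEdits, ApplyMixListEdits, and ExampleEdits are hypothetical names, the order in which the four entries are applied is an assumption, and an operator missing from the map is treated as being on the graylist.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical model of the three lists, not part of the GE API.
enum class MixList { kTrust, kBlock, kGray };

struct MixListEdits {
  std::vector<std::string> black_to_remove;  // blocklist -> graylist
  std::vector<std::string> black_to_add;     // trustlist or graylist -> blocklist
  std::vector<std::string> white_to_remove;  // trustlist -> graylist
  std::vector<std::string> white_to_add;     // blocklist or graylist -> trustlist
};

// Applies the edits to a copy of the built-in assignment and returns the result.
std::map<std::string, MixList> ApplyMixListEdits(std::map<std::string, MixList> lists,
                                                 const MixListEdits& edits) {
  for (const std::string& op : edits.black_to_remove) {
    auto it = lists.find(op);
    if (it != lists.end() && it->second == MixList::kBlock) it->second = MixList::kGray;
  }
  for (const std::string& op : edits.black_to_add) lists[op] = MixList::kBlock;
  for (const std::string& op : edits.white_to_remove) {
    auto it = lists.find(op);
    if (it != lists.end() && it->second == MixList::kTrust) it->second = MixList::kGray;
  }
  for (const std::string& op : edits.white_to_add) lists[op] = MixList::kTrust;
  return lists;
}

// Mirrors the ops_info.json example shown above.
MixListEdits ExampleEdits() {
  MixListEdits edits;
  edits.black_to_remove = {"Xlog1py"};
  edits.black_to_add = {"Matmul", "Cast"};
  edits.white_to_remove = {"Conv2D"};
  edits.white_to_add = {"Bias"};
  return edits;
}
```

Starting from a made-up built-in assignment with Xlog1py on the blocklist and Conv2D on the trustlist, both end up on the graylist, Matmul and Cast move to the blocklist, and Bias moves to the trustlist.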

ge.exec.profilingMode

Profiling enable.

  • 1: enabled. The items to be profiled are determined by ge.exec.profilingOptions.
  • 0 (default): disabled.

No

Global

ge.exec.profilingOptions

Profiling options.

  • output: path for storing profiling result files. Both absolute path and relative path (relative to the path where the command is run) are supported. The path cannot contain the following special characters: "\n", "\f", "\r", "\b", "\t", "\v", and "\u007F".
    • An absolute path starts with a slash (/), for example, /home/HwHiAiUser/output.
    • A relative path starts with a directory name, for example, output.
    • This parameter takes precedence over ASCEND_WORK_PATH.
    • This path does not need to be created in advance because it is automatically created during collection.
  • storage_limit: maximum size of files that can be stored in a specified disk directory. If the size of profile data files in the disk is about to use up the maximum storage space specified by this option or the total remaining disk space is about to be used up (remaining space ≤ 20 MB), the earliest files in the disk are aged and deleted.

    The value range is [200, 4294967295], and the unit is MB. The unit must be included when you set this parameter, for example, 200 MB.

    If this parameter is not set, the default value is 90% of the available space of the disk where the directory for storing profile data files is located.

  • training_trace: iteration tracing switch. Collects software profile data of a training job and the AI Software Stack to profile the training job. Focuses on forward and backward propagation, and gradient aggregation and update. This option must be set to on when the forward and backward propagation operator data is collected.
  • task_trace and task_time: switches that control the collection of operator delivery and execution durations. The duration data is output to the task_time, op_summary, and op_statistic files. Possible values are as follows:
    • on: switch on. This is the default value, delivering the same effect as l1.
    • off: switch off.
    • l0: collects operator delivery and execution duration data. Compared with l1, l0 does not collect basic operator information, so the performance overhead during collection is smaller, and this enables more accurate collection of statistics on time duration data.
    • l1: collects operator delivery and execution duration data, as well as basic operator information, to provide more comprehensive performance analysis data.

    When Profiling is enabled to collect training data, task_trace and training_trace must be set to on.

  • hccl (optional): HCCL tracing switch, either on or off (default).
    NOTE:

    This switch will be deprecated in later versions. To control data collection, use task_trace and task_time.

  • aicpu: whether to collect details about the AI CPU operator, such as the operator execution time and data copy time. The value can be on or off (default). A value other than on or off is equivalent to off.
  • fp_point: start point of the forward propagated operator in iteration traces, to record the start timestamp of forward propagation. Set the value to the name of the top operator in forward propagation. You can save the graph as a .pbtxt file by using tf.io.write_graph in the training script to obtain the name. Alternatively, you can leave this option empty (for example, "fp_point":""), and the system will automatically identify the start point of the forward propagated operator.
  • bp_point: end point of the backward propagated operator in iteration traces, to record the end timestamp of backward propagation. bp_point and fp_point are used to compute the time used by forward and backward propagation. Set the value to the name of the bottom operator in backward propagation. You can save the graph as a .pbtxt file by using tf.io.write_graph in the training script to obtain the name. Alternatively, you can leave this option empty (for example, "bp_point":""), and the system will automatically identify the end point of the backward propagated operator.
  • aic_metrics: AI Core metrics to profile.
    • ArithmeticUtilization: arithmetic utilization ratio.
    • PipeUtilization (default): ratio of time taken by the compute units to that of MTEs.
    • Memory: ratio of external memory read/write instructions.
    • MemoryL0: ratio of internal memory L0 read/write instructions.
    • MemoryUB: ratio of internal memory UB read/write instructions.
    • ResourceConflictRatio: ratio of pipeline queue instructions.

    Atlas Training Series Product: AI Core collection is supported, but AI Vector Core and L2 cache parameters are not supported.

    NOTE:
    The registers whose data is to be collected can be customized, for example, "aic_metrics":"Custom:0x49,0x8,0x15,0x1b,0x64,0x10".
    • The Custom field indicates the custom type and is set to a specific register value. The value range is [0x1, 0x6E].
    • A maximum of eight registers can be configured, which are separated with commas (,).
    • The register value can be in hexadecimal or decimal format.
  • l2: L2 cache profiling switch, either on or off (default).
  • msproftx: switch that controls the msproftx user and upper-layer framework program to output profile data, either on or off (default).
  • runtime_api: Runtime API data collection switch, either on or off (default). You can collect Runtime API profile data, including the synchronous/asynchronous memory replication latencies between the host and device and between devices.
  • sys_hardware_mem_freq: frequency of collecting on-chip memory data, QoS bandwidth and memory information, LLC read/write bandwidth data, Acc PMU data, SoC transmission bandwidth data, and component memory information. The value range is [1, 100]. The unit is Hz.

    The support for different products varies.

    NOTE:

    Sampling memory data in an environment where glibc 2.34 or earlier is installed may trigger the known Bug 19329. Upgrading glibc resolves this problem.

  • llc_profiling: LLC events to profile. Possible values are as follows:
    • Atlas Training Series Product: read (read event, L3 cache read rate) or write (write event, L3 cache write rate). Defaults to read.
  • sys_io_sampling_freq: NIC and RoCE collection frequency. The value range is [1, 100]. The unit is Hz.
    • Atlas Training Series Product: supports NIC and RoCE collection.
  • sys_interconnection_freq: HCCS bandwidth and PCIe data collection frequency and inter-chip transmission bandwidth data collection frequency. The value range is [1, 50]. The unit is Hz.
    • Atlas Training Series Product: supports HCCS and PCIe data collection.
  • dvpp_freq: DVPP collection frequency. The value range is [1, 100]. The unit is Hz.
  • instr_profiling_freq: AI Core and AI Vector bandwidth and latency collection frequency. The value range is [300, 30000]. The unit is cycle.
    • Atlas Training Series Product: Not supported.
  • host_sys: switch for collecting host profile data. You can select one or more options and separate them with commas (,), for example, "host_sys": "cpu,mem".
    • cpu: process CPU utilization
    • mem: process memory utilization
  • host_sys_usage: CPU and memory data of the system and all processes on the host, selected from cpu and mem. You can select one or more options and separate them with commas (,).
  • host_sys_usage_freq: collection frequency of CPU and memory data of the system and all processes on the host. The value range is [1, 50] and the default value is 50. The unit is Hz.

Example:

std::map<ge::AscendString, ge::AscendString> ge_options = {{"ge.exec.deviceId", "0"},
                                  {"ge.graphRunMode", "1"},
                                  {"ge.exec.profilingMode", "1"},
                                  {"ge.exec.profilingOptions", R"({"output":"/tmp/profiling","training_trace":"on","fp_point":"resnet_model/conv2d/Conv2Dresnet_model/batch_normalization/FusedBatchNormV3_Reduce","bp_point":"gradients/AddN_70"})"}};

No

Global
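The value of ge.exec.profilingOptions is a flat JSON object of string pairs, so it can be assembled from key-value pairs instead of being written by hand. The following is a minimal sketch under stated assumptions: EscapeJson and BuildProfilingOptions are hypothetical names. Only quotation marks and backslashes are escaped, and the special characters that the output path must not contain are not checked.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helpers, not part of the GE API.

// Escapes the two characters that would break a JSON string literal here.
std::string EscapeJson(const std::string& text) {
  std::string out;
  for (char c : text) {
    if (c == '"' || c == '\\') out += '\\';
    out += c;
  }
  return out;
}

// Serializes ordered key-value pairs into a flat JSON object of strings.
std::string BuildProfilingOptions(
    const std::vector<std::pair<std::string, std::string>>& items) {
  std::string json = "{";
  for (std::size_t i = 0; i < items.size(); ++i) {
    if (i > 0) json += ',';
    json += '"' + EscapeJson(items[i].first) + "\":\"" + EscapeJson(items[i].second) + '"';
  }
  json += '}';
  return json;
}
```

The resulting string would be passed as the value of ge.exec.profilingOptions, in place of the raw string literal used in the example above.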

ge.exec.enableDump

Dump enable.

  • 1: enabled. The dump file path is read from ge.exec.dumpPath. If ge.exec.dumpPath is set to None, an exception occurs.
  • 0 (default): disabled.
NOTE:
  • This option cannot be set together with ge.exec.enableDumpDebug in the global scenario or in the same session.
  • If either ge.exec.enableDump or ge.exec.enableDumpDebug is set to 1 and ge.exec.enable_exception_dump is set to 1 (indicating that the common ExceptionDump function is enabled):
    • For dynamic-shape networks, only ge.exec.enable_exception_dump takes effect.
    • For static-shape networks, ge.exec.enable_exception_dump and either of ge.exec.enableDump and ge.exec.enableDumpDebug take effect.

No

Global/Session

ge.exec.dumpPath

Dump path. Required when dump or overflow/underflow detection is enabled.

Create the specified path in advance in the environment (either container or host) where training is performed. The running user configured during installation must have the read and write permissions on this path. The path can be an absolute path or a path relative to the path where the training script is executed.

  • An absolute path starts with a slash (/), for example, /home/HwHiAiUser/output.
  • A relative path starts with a directory name, for example, output.

The dump data file is generated in the {dump_path}/{time}/{deviceid}/{model_name}/{model_id}/{data_index} directory, where {dump_path} is the path specified by this option. For example, if the path is set to /home/HwHiAiUser/output, the dump data file is stored in the /home/HwHiAiUser/output/20200808163566/0/ge_default_20200808163719_121/11/0 path.

No

Global/Session

ge.exec.dumpStep

Iterations to dump. Defaults to None, indicating that all iterations are dumped.

Separate multiple iterations using vertical bars (|), for example, 0|5|10. You can also use hyphens (-) to specify the iteration range, for example, 0|3-5|10.

No

Global/Session
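The ge.exec.dumpStep syntax, with vertical bars between entries and hyphens for ranges, can be expanded into the individual iteration numbers it selects. The following is a minimal sketch under stated assumptions: ParseDumpSteps is a hypothetical name, and the sketch illustrates the notation only, not how GE parses the value.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of the GE API.
// Expands a value such as "0|3-5|10" into the iteration numbers it selects.
std::set<unsigned long> ParseDumpSteps(const std::string& spec) {
  std::set<unsigned long> steps;
  std::istringstream stream(spec);
  std::string part;
  while (std::getline(stream, part, '|')) {
    const std::size_t dash = part.find('-');
    if (dash == std::string::npos) {
      steps.insert(std::stoul(part));  // single iteration
      continue;
    }
    // Range entry: both ends are included.
    const unsigned long first = std::stoul(part.substr(0, dash));
    const unsigned long last = std::stoul(part.substr(dash + 1));
    if (first > last) throw std::invalid_argument("range start exceeds range end");
    for (unsigned long step = first; step <= last; ++step) steps.insert(step);
  }
  return steps;
}
```

For the value 0|3-5|10, the sketch yields the five iterations 0, 3, 4, 5, and 10. Malformed entries cause std::stoul to throw.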

ge.exec.dumpMode

Dump mode. The values are as follows:

  • input: dumps only operator inputs.
  • output (default): dumps only operator outputs.
  • all: dumps both operator inputs and outputs.

Configuration example:

{"ge.exec.dumpMode", "input"};

Restrictions:

If this parameter is set to all, the input data of some operators, such as collective communication operators HcomAllGather and HcomAllReduce, will be modified during execution. Therefore, the system dumps the operator input before operator execution and dumps the operator output after operator execution. In this way, the dumped input and output data of the same operator is flushed to drives separately, and multiple dump files are generated. After parsing the dump files, you can determine whether the data is an input or output based on the file content.

No

Global/Session

ge.exec.dumpData

Type of operator content to dump.

  • tensor (default): dumps operator data.
  • stats: dumps operator statistics and saves the result in CSV format. As the operator data amount is large in most cases, you can try to dump the operator statistics.

No

Global/Session

ge.exec.dumpLayer

Name of the operator to be dumped. Multiple operator names are separated by spaces.

If the input of the specified operator involves the data operator, the data operator information is also dumped.

Configuration example:
{"ge.exec.dumpLayer", "layer1 layer2 layer3"};

No

Global/Session

ge.exec.enableDumpDebug

Overflow/Underflow detection enable.

  • 1: enabled. The dump file path is read from ge.exec.dumpPath. If ge.exec.dumpPath is set to None, an exception occurs.
  • 0 (default): disabled.
NOTE:
  • This option cannot be set together with ge.exec.enableDump in the global scenario or in the same session.
  • If either ge.exec.enableDump or ge.exec.enableDumpDebug is set to 1 and ge.exec.enable_exception_dump is set to 1 (indicating that the common ExceptionDump function is enabled):
    • For dynamic-shape networks, only ge.exec.enable_exception_dump takes effect.
    • For static-shape networks, ge.exec.enable_exception_dump and either of ge.exec.enableDump and ge.exec.enableDumpDebug take effect.

No

Global/Session
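The interaction between ge.exec.enableDump, ge.exec.enableDumpDebug, and ge.exec.enable_exception_dump described in the notes above can be summarized as two small predicates. The following is a minimal sketch under stated assumptions: DumpSwitches, IsValidCombination, and DataDumpTakesEffect are hypothetical names, and "cannot be set together" is read here as "cannot both be set to 1".

```cpp
// Hypothetical summary of the documented interaction, not part of the GE API.
struct DumpSwitches {
  bool enable_dump;        // ge.exec.enableDump is "1"
  bool enable_dump_debug;  // ge.exec.enableDumpDebug is "1"
  bool exception_dump;     // ge.exec.enable_exception_dump is "1"
};

// The two dump switches must not both be enabled globally or in one session.
bool IsValidCombination(const DumpSwitches& s) {
  return !(s.enable_dump && s.enable_dump_debug);
}

// Reports whether the enabled data dump or overflow/underflow dump still takes
// effect. For dynamic-shape networks, an enabled common exception dump is the
// only setting that takes effect.
bool DataDumpTakesEffect(const DumpSwitches& s, bool dynamic_shape) {
  if (!(s.enable_dump || s.enable_dump_debug)) return false;
  return !(s.exception_dump && dynamic_shape);
}
```

With data dump and the common exception dump both enabled, the data dump takes effect on a static-shape network but not on a dynamic-shape one.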

ge.exec.dumpDebugMode

Overflow/Underflow detection mode.

  • aicore_overflow: detects AI Core operator overflow/underflow, that is, detecting whether abnormal extreme values (such as 65500, 38400, and 51200 in float16) are output with normal inputs. Once such a fault is detected, analyze the cause of the overflow/underflow and modify the operator implementation based on the network requirements and operator logic.
  • atomic_overflow: detects Atomic Add overflow/underflow, for checking modules involved in floating-point computing (such as SDMA) in addition to AI Core.
  • all: detects overflow/underflow of both AI Core operators and Atomic Add.

No

Global/Session

ge.exec.enable_exception_dump

Whether to dump data of the exception operator.
  • 0: disabled. Defaults to 0.
  • 1: The common ExceptionDump function is enabled, to dump the input and output data, tensor description (such as shape, dtype, and format), and workspace information of the exception operator.

    In this mode, dump data is stored in the current script execution path by default.

  • 2: The LiteExceptionDump function (L0 exception dump) is enabled, to dump the input and output data, workspace information, and tiling information of the exception operator.

    In this mode, dump data is stored in /extra-info/data-dump/<device_id> of the current script execution path by default. If the environment variable ASCEND_WORK_PATH is configured, dump data is stored in ASCEND_WORK_PATH/extra-info/data-dump/<device_id>.

NOTE:

If the environment variable NPU_COLLECT_PATH is configured, only L1 exception dump information, including the input and output data of the exception operator, is collected regardless of the value of option enable_exception_dump, and the dump data is stored in the path specified by NPU_COLLECT_PATH.

For details about the environment variable, see Environment Variables.
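The default storage locations described above can be summarized in a short function. This is a sketch under stated assumptions, not GE code: the function name ExceptionDumpDir and its parameters are hypothetical, the environment variable values are passed in as plain strings (empty meaning "not configured"), and the behavior of mode 1 when ASCEND_WORK_PATH is set is not stated in this section, so the sketch keeps the current execution path in that case.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the directory in which exception dump data is
// expected, or an empty string if nothing is dumped.
//   mode         value of ge.exec.enable_exception_dump ("0", "1", or "2")
//   cwd          current script execution path
//   work_path    value of ASCEND_WORK_PATH, empty if not configured
//   collect_path value of NPU_COLLECT_PATH, empty if not configured
std::string ExceptionDumpDir(const std::string& mode, const std::string& cwd,
                             const std::string& work_path,
                             const std::string& collect_path, int device_id) {
  // NPU_COLLECT_PATH takes over regardless of the option value.
  if (!collect_path.empty()) return collect_path;
  // Common ExceptionDump: current script execution path.
  if (mode == "1") return cwd;
  // L0 exception dump: <base>/extra-info/data-dump/<device_id>.
  if (mode == "2") {
    const std::string base = work_path.empty() ? cwd : work_path;
    return base + "/extra-info/data-dump/" + std::to_string(device_id);
  }
  return "";  // mode "0": disabled
}
```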

Configuration example:
std::map<ge::AscendString, ge::AscendString> ge_options = {{"ge.exec.enable_exception_dump", "0"}};

Optional

Global

ge.exec.disableReuseMemory

Memory reuse enable.

  • 1: disabled
  • 0 (default): enabled

No

All

ge.graphMemoryMaxSize

Do not use this option because it will be deprecated in later versions.

Network static memory size and maximum dynamic memory size. Varies according to the network size. The unit is byte and the value range is [0, 256 x 1024 x 1024 x 1024] or [0, 274877906944]. The SoC hardware requires that the sum of graph_memory_max_size and variable_memory_max_size be within 31 GB. Defaults to 26 (GB).

No

All

ge.variableMemoryMaxSize

Do not use this option because it will be deprecated in later versions.

Variable memory size. Varies according to the network size. The unit is byte and the value range is [0, 256 x 1024 x 1024 x 1024] or [0, 274877906944]. The SoC hardware requires that the sum of graph_memory_max_size and variable_memory_max_size be within 31 GB. Defaults to 5 (GB).

No

All

ge.exec.variable_acc

Variable format optimization enable.

  • True (default): enabled
  • False: disabled

To improve training efficiency, the format of the variables is converted to a format more compatible with the Ascend AI Processor during variable initialization performed by the network. However, this function should be disabled in special scenarios.

No

All

ge.exec.rankTableFile

Information about the cluster participating in collective communication, including the organization information about the server, device, and container. Set this option to the ranktable file path, including the file name.

No

All

ge.exec.rankId

Rank ID, the ID of a process in a group. The value ranges from 0 to (rank size – 1). For a custom group, the rank starts from 0 in the group. For an HCCL world group, the rank ID is the same as the world rank ID.

  • World rank ID: indicates the rank ID of a process in an HCCL world group. The value ranges from 0 to (rank size – 1).
  • Local rank ID: indicates the rank ID of a process in a group on the server where the process is located. The value ranges from 0 to (local rank size – 1).

No

All

ge.opDebugLevel

Operator debug enable.

  • 0 (default): Disables operator debug. The operator build folder kernel_meta is not generated in the current execution path.
  • 1: Enables operator debug. The kernel_meta folder is generated in the current execution path, and the .o file (operator binary file), .json file (operator description file), and TBE instruction mapping files (operator file *.cce and python-CCE mapping file *_loc.json) are generated in the folder for later analysis of AI Core errors.
  • 2: Enables operator debug. The kernel_meta folder is generated in the current execution path, and the .o file (operator binary file), .json file (operator description file), and TBE instruction mapping files (operator file *.cce and python-CCE mapping file *_loc.json) are generated in the folder for later analysis of AI Core errors. Setting this option to 2 also disables build optimization and enables the CCE compiler debug function (the CCE compiler options are set to -O0 -g).
  • 3: Disables operator debug. The kernel_meta folder is generated in the current execution path, and the .o file (operator binary file) and .json file (operator description file) are generated in the folder. You can refer to these files when analyzing operator errors.
  • 4: Disables operator debug. The kernel_meta folder is generated in the current execution path, and the .o file (operator binary file), .json file (operator description file), TBE instruction mapping file (operator file *.cce), and UB fusion description file ({$kernel_name}_compute.json) are generated in the folder. These files can be used for problem reproduction and precision comparison during operator error analysis.
NOTICE:
  • If ge.opDebugLevel is set to 0 and op_debug_config is also set, the operator build directory kernel_meta is still generated in the current execution directory.
  • You are advised to set this option to 0 or 3 for training. To locate errors, set this option to 1 or 2, which might compromise the network performance.
  • If ge.opDebugLevel is set to 2 (that is, CCEC compilation is enabled), the size of the operator kernel file (*.o file) increases. In the dynamic shape scenario, all possible shape scenarios are traversed during operator build, which may cause operator build failures due to large operator kernel files. In this case, do not enable the CCE compiler options.

    If a build failure is caused by the large operator kernel file, the following log is displayed:

    message:link error ld.lld: error: InputSection too large for range extension thunk ./kernel_meta_xxxxx.o
  • When the debug function is enabled, if the model contains the following MC2 operators, the *.o, *.json, and *.cce files of the operators are not generated in the kernel_meta directory.

    MatMulAllReduce

    MatMulAllReduceAddRmsNorm

    AllGatherMatMul

    MatMulReduceScatter

    AlltoAllAllGatherBatchMatMul

    BatchMatMulReduceScatterAlltoAll

No

All

op_debug_config

Enable for global memory check.

The value is the path of the .cfg configuration file. Multiple options in the configuration file are separated by commas (,).

  • oom: Checks whether memory overwriting occurs in the global memory during operator execution.
    • Configuring this option retains the binary operator file (.o) and operator description file (.json) in the kernel_meta folder under the current execution directory during operator build.
    • If this option is used, the following detection logic is added during operator build. You can use the dump_cce option to view the following code in the generated .cce file:
      inline __aicore__ void  CheckInvalidAccessOfDDR(xxx) {
          if (access_offset < 0 || access_offset + access_extent > ddr_size) {
              if (read_or_write == 1) {
                  trap(0X5A5A0001);
              } else {
                  trap(0X5A5A0002);
              }
          }
      }

      During actual execution, if memory overwriting occurs, the error code EZ9999 is reported.

  • dump_bin: Retains the binary operator file (.o) and operator description file (.json) in the kernel_meta folder under the current execution directory during operator build.
  • dump_cce: Retains the operator CCE file (.cce), binary operator file (.o), and operator description file (.json) in the kernel_meta folder under the current execution directory during operator build.
  • dump_loc: Retains the Python-CCE mapping file (*_loc.json) in the kernel_meta folder under the current execution directory during operator build.
  • ccec_O0: Enables the CCEC option -O0 during operator build. With this option, the compiler performs no optimization, which preserves debugging information for later analysis of AI Core errors.
  • ccec_g: Enables the CCEC option -g during operator build. With this option, the compiler generates debugging information for later analysis of AI Core errors.
  • check_flag: Checks whether pipeline synchronization signals in operators match each other during operator execution.
    • Configuring this option retains the binary operator file (.o) and operator description file (.json) in the kernel_meta folder under the current execution directory during operator build.
    • If this option is used, the following detection logic is added during operator build. You can use the dump_cce option to view the following code in the generated .cce file:
        set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
        set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1);
        set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID2);
        set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID3);
        ....
        pipe_barrier(PIPE_MTE3);
        pipe_barrier(PIPE_MTE2);
        pipe_barrier(PIPE_M);
        pipe_barrier(PIPE_V);
        pipe_barrier(PIPE_MTE1);
        pipe_barrier(PIPE_ALL);
        wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
        wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1);
        wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID2);
        wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID3);
        ...

      During actual inference, if the pipeline synchronization signals in operators do not match each other, a timeout error is reported at the faulty operator, and the program is terminated. The following is an example of the error message:

      Aicore kernel execute failed, ..., fault kernel_name=operator name,...
      rtStreamSynchronizeWithTimeout execute failed....

Configuration example:

{"op_debug_config", "/root/test0.cfg"};

The information about the test0.cfg file is as follows:

op_debug_config = ccec_g,oom

Restrictions:

During operator compilation, if you want to compile only some instead of all AI Core operators, you need to add the op_debug_list field to the test0.cfg configuration file. By doing so, only the operators specified in the list are compiled, based on the options configured in op_debug_config. The op_debug_list field has the following requirements:

  • The operator name or operator type can be specified.
  • Operators are separated by commas (,). The operator type is configured in OpType::typeName format. The operator type and operator name can be configured in a mixed manner.
  • The operator to be compiled must be stored in the configuration file specified by op_debug_config.

Configuration example: Add the following information to the configuration file (for example, test0.cfg) specified by op_debug_config:

op_debug_config= ccec_g,oom
op_debug_list=GatherV2,opType::ReduceSum

During model compilation, only the GatherV2 and ReduceSum operators are compiled based on the ccec_g and oom options.

NOTE:
  • When ccec_O0 and ccec_g are enabled, the size of the operator kernel file (*.o file) increases. In dynamic shape scenarios, all possible scenarios are traversed during operator compilation, which may cause operator compilation failures due to large operator kernel files. In this case, do not enable the CCE compiler options.

    If the build failure is caused by the large operator kernel file, the following log is displayed:

    message:link error ld.lld: error: InputSection too large for range extension thunk ./kernel_meta_xxxxx.o:(xxxx)

  • The ccec_O0 and oom options cannot both be enabled. Otherwise, an AI Core error may be reported. The following is an example of the error message:
    ...there is an aivec error exception, core id is 49, error code = 0x4 ...
  • If the NPU_COLLECT_PATH environment variable is configured, the function of checking whether global memory overwriting occurs cannot be enabled (op_debug_config is set to oom). Otherwise, an error is reported when the compiled model file or operator kernel package is used.
  • When the build options oom, dump_bin, dump_cce, and dump_loc are configured, if the model contains the following MC2 operators, the *.o, *.json, and *.cce files of the operators are not generated in the kernel_meta directory.

    MatMulAllReduce

    MatMulAllReduceAddRmsNorm

    AllGatherMatMul

    MatMulReduceScatter

    AlltoAllAllGatherBatchMatMul

    BatchMatMulReduceScatterAlltoAll
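To illustrate the configuration file format and the ccec_O0/oom restriction above, the following sketch parses the text of a .cfg file and checks for the forbidden combination. It is an assumption-laden illustration only: the helpers Trim, SplitCommas, ParseDebugCfg, and HasO0OomConflict are hypothetical and do not reflect how GE itself reads the file.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Removes leading and trailing spaces, tabs, and carriage returns.
std::string Trim(const std::string& s) {
  const char* ws = " \t\r";
  const auto b = s.find_first_not_of(ws);
  if (b == std::string::npos) return "";
  const auto e = s.find_last_not_of(ws);
  return s.substr(b, e - b + 1);
}

// Splits a comma-separated value into trimmed, non-empty items.
std::vector<std::string> SplitCommas(const std::string& s) {
  std::vector<std::string> out;
  std::stringstream ss(s);
  std::string item;
  while (std::getline(ss, item, ',')) {
    item = Trim(item);
    if (!item.empty()) out.push_back(item);
  }
  return out;
}

// Parses "key = v1,v2" lines into a map from key to list of values.
std::map<std::string, std::vector<std::string>> ParseDebugCfg(
    const std::string& text) {
  std::map<std::string, std::vector<std::string>> cfg;
  std::stringstream ss(text);
  std::string line;
  while (std::getline(ss, line)) {
    const auto eq = line.find('=');
    if (eq == std::string::npos) continue;  // skip lines without '='
    cfg[Trim(line.substr(0, eq))] = SplitCommas(line.substr(eq + 1));
  }
  return cfg;
}

// True if ccec_O0 and oom are both requested, which the note above forbids.
bool HasO0OomConflict(const std::vector<std::string>& opts) {
  auto has = [&](const char* name) {
    return std::find(opts.begin(), opts.end(), name) != opts.end();
  };
  return has("ccec_O0") && has("oom");
}
```

Applied to the test0.cfg example, ParseDebugCfg yields the option list {ccec_g, oom} and the operator list {GatherV2, opType::ReduceSum}, and HasO0OomConflict reports no conflict.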

No

Global

ge.op_compiler_cache_mode

Disk cache mode for operator build.

Arguments:

  • enable: enabled. If it is enabled, operators with the same build configurations and operator configurations will not be built repeatedly, thus accelerating the build speed.
  • force: Enabled with cache forcibly refreshed. That is, the existing cache is cleared up before the operator is recompiled and added to the cache. For example, for Python changes, dependency library changes, or repository changes after operator optimization, you need to set this option to force to clear up the existing cache and then change it to enable to prevent the cache from being forcibly refreshed during each build.
  • disable: disabled.

Default: enable

Restrictions:

  • To specify the disk cache path for operator build, use this option together with ge.op_compiler_cache_dir.
  • If it is set to force, the existing cache will be cleared. Therefore, it is not recommended for parallel program compilation, as this may result in the cache used by other models being cleared, causing build failures.
  • disable or force is recommended for publishing the final model.
  • If the repository changes after operator tuning, set this option to force to refresh the cache before setting it to enable to recompile. Otherwise, the new tuning repository cannot be applied, and the tuning application fails to be executed.
  • When the debugging function is enabled:
    • If ge.opDebugLevel is set to a non-zero value, the ge.op_compiler_cache_mode parameter configuration does not take effect, the operator build cache function is disabled, and all operators are recompiled.
    • If op_debug_config is not empty and op_debug_list is not configured, the ge.op_compiler_cache_mode parameter configuration does not take effect, the operator build cache function is disabled, and all operators are recompiled.
    • If op_debug_config is not empty and op_debug_list is configured in the configuration file:
      • For operators in the list, ignore the configuration of ge.op_compiler_cache_mode and continue to recompile them.
      • For operators out of the list, if ge.op_compiler_cache_mode is set to enable or force, the cache function is enabled. If ge.op_compiler_cache_mode is set to disable, the cache function is disabled and the operators are recompiled.
  • When you enable the operator build cache function, you can set the disk space of the cache folder with the configuration file (the op_cache.ini file automatically generated in the path specified by OP_COMPILER_CACHE_DIR after operator build) or environment variables.
    1. Using the op_cache.ini configuration file:

      If the op_cache.ini file does not exist, manually create it. Open the file and add the following information:

      # Configure the file format (required). The automatically generated file contains the following information by default. When manually creating a file, enter the following information:
      [op_compiler_cache]
      # Limit the drive space of the cache folder on a chip. The value must be an integer, in MB.
      max_op_cache_size=500
      # Set the ratio of the cache size to be reserved. The value range is [1,100], in percentage. For example, 80 indicates that when the cache space is insufficient, 80% of the cache space is reserved and the rest is cleared up.
      remain_cache_size_ratio=80    
      • The op_cache.ini file takes effect only when the values of max_op_cache_size and remain_cache_size_ratio in the preceding file are valid.
      • If the size of the build cache file exceeds the value of max_op_cache_size and the cache file is not accessed for more than half an hour, the cache file will be aged. (Operator build will not be interrupted due to the size of the build cache file exceeding the set limit. Therefore, if max_op_cache_size is set to a small value, the size of the actual build cache file may exceed the configured value.)
      • To disable the build cache aging function, set max_op_cache_size to -1. In this case, the access time is not updated when the operator cache is accessed, the operator build cache is not aged, and the default drive space is 500 MB.
      • If multiple users use the same cache path, you are advised to use the configuration file to set the cache path. In this scenario, the op_cache.ini file affects all users.
    2. Using environment variables

      In this scenario, the environment variable ASCEND_MAX_OP_CACHE_SIZE is used to limit the storage space of the cache folder of a chip. When the build cache space reaches the specified value and the cache file is not accessed for more than half an hour, the cache file is aged. The environment variable ASCEND_REMAIN_CACHE_SIZE_RATIO is used to set the ratio of the cache space to be reserved.

      A configuration example is provided as follows:

      # The ASCEND_MAX_OP_CACHE_SIZE environment variable defaults to 500, in MB. The value must be an integer.
      export ASCEND_MAX_OP_CACHE_SIZE=500
      # ASCEND_REMAIN_CACHE_SIZE_RATIO environment variable value range is [1,100]. The default value is 50, in percentage. For example, 80 indicates that 80% of the cache space is reserved when the cache space is insufficient.
      export ASCEND_REMAIN_CACHE_SIZE_RATIO=50
      • The argument configured through environment variables takes effect only for the current user.
      • To disable the build cache aging function, set the environment variable ASCEND_MAX_OP_CACHE_SIZE to -1. In this case, the access time is not updated when the operator cache is accessed, the operator build cache is not aged, and the default drive space is 500 MB.

    Caution: If both the op_cache.ini file and environment variables are configured, the configuration items in the op_cache.ini file are read first. If neither the op_cache.ini file nor the environment variables are configured, the system default values are read: 500 MB disk space and 50% reserved cache space.
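The interaction between the debugging options and ge.op_compiler_cache_mode listed in the restrictions above can be condensed into one decision function. This is only a reading aid under the stated rules: UsesBuildCache and its parameters are hypothetical names, and the function does not model cache aging or the cache refresh that force performs.

```cpp
#include <cassert>
#include <string>

// Hypothetical summary of the debugging-related restrictions: returns true if
// the operator build cache is used for one operator.
//   cache_mode        value of ge.op_compiler_cache_mode
//   op_debug_level    value of ge.opDebugLevel
//   debug_config_set  op_debug_config is not empty
//   debug_list_set    op_debug_list is configured in the .cfg file
//   op_in_debug_list  the operator appears in op_debug_list
bool UsesBuildCache(const std::string& cache_mode, int op_debug_level,
                    bool debug_config_set, bool debug_list_set,
                    bool op_in_debug_list) {
  // A non-zero debug level disables the cache for all operators.
  if (op_debug_level != 0) return false;
  if (debug_config_set) {
    // Without op_debug_list, every operator is recompiled.
    if (!debug_list_set) return false;
    // Operators in the list are always recompiled.
    if (op_in_debug_list) return false;
  }
  // Otherwise the configured cache mode decides.
  return cache_mode == "enable" || cache_mode == "force";
}
```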

No

All

ge.op_compiler_cache_dir

Disk cache directory for operator build.

Format: The directory can contain letters, digits, underscores (_), hyphens (-), and periods (.).

Defaults to $HOME/atc_data.

  • If the specified directory exists and is valid, a kernel_cache subdirectory is automatically created. If the specified directory does not exist but is valid, the system automatically creates this directory and the kernel_cache subdirectory.
  • Do not store other self-owned content in the default cache directory. The self-owned content will be deleted together with the default cache directory during software package installation or upgrade.
  • The non-default cache directory specified by this option cannot be deleted. The directory will not be deleted during software package installation or upgrade.
  • In addition to ge.op_compiler_cache_dir, the environment variable ASCEND_CACHE_PATH can be used to set the disk cache directory for operator build. The priorities of the configuration methods are as follows: ge.op_compiler_cache_dir > ASCEND_CACHE_PATH > default storage path.
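The priority order ge.op_compiler_cache_dir > ASCEND_CACHE_PATH > default storage path can be sketched as follows. The function names are hypothetical, and the sketch returns only the base directory: it does not add the kernel_cache subdirectory or any other subdirectory that the system may create.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: picks the cache base directory by priority.
//   option_value       value of ge.op_compiler_cache_dir, empty if unset
//   ascend_cache_path  value of ASCEND_CACHE_PATH, nullptr or "" if unset
//   home               value of HOME, used for the default $HOME/atc_data
std::string ResolveCacheDir(const std::string& option_value,
                            const char* ascend_cache_path, const char* home) {
  if (!option_value.empty()) return option_value;  // highest priority
  if (ascend_cache_path != nullptr && *ascend_cache_path != '\0') {
    return ascend_cache_path;  // environment variable comes second
  }
  return std::string(home != nullptr ? home : "") + "/atc_data";  // default
}

// Convenience wrapper that reads the real environment of the process.
std::string ResolveCacheDirFromEnv(const std::string& option_value) {
  return ResolveCacheDir(option_value, std::getenv("ASCEND_CACHE_PATH"),
                         std::getenv("HOME"));
}
```

Passing the environment values as parameters keeps the priority logic testable without modifying the process environment.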

No

All

ge.debugDir

Directory of the debug-related process files generated during operator build, including the .o (operator binary file), .json (operator description file), and .cce files.

Defaults to the training script execution directory.

Restrictions:

  • If you want to specify the path for storing the process file of operator build, use ge.debugDir and ge.opDebugLevel together. If ge.opDebugLevel is set to 0, ge.debugDir cannot be used.
  • In addition to ge.debugDir, the environment variable ASCEND_WORK_PATH can be used to set the path for storing the debugging file generated by operator build. The priorities of the configuration methods are as follows: ge.debugDir > ASCEND_WORK_PATH > default storage path.

No

All

ge.bufferOptimize

Buffer optimization enable.

Arguments:

  • l1_optimize: Enables L1 optimization. Invalid in the current version. Equivalent to off_optimize.
  • l2_optimize: Enables L2 optimization. The default value is l2_optimize.
  • off_optimize: Disables buffer optimization.

Suggestions:

You are advised to enable buffer optimization as this function can improve compute efficiency and performance. However, your model may contain an operator that is not yet covered by the current implementation. If inference accuracy degrades and the degradation disappears after buffer optimization is disabled, locate the suspect operator and submit it to Huawei technical support, who will add buffer optimization support for the operator.

Configuration example:

{"ge.bufferOptimize", "l2_optimize"};

Optional

Session/Graph

ge.mdl_bank_path

Sets the directory of the custom repository generated after subgraph tuning.

This option must be used in pair with ge.bufferOptimize and takes effect only when buffer optimization is enabled, to improve performance by temporarily storing data in the buffer.

Argument: path of the custom repository after model tuning.

Format: The value can contain letters, digits, underscores (_), hyphens (-), and periods (.).

Default: $HOME/Ascend/latest/data/aoe/custom/graph/<soc_version>

Restrictions:

Priority ranked from high to low: the directory specified by ge.mdl_bank_path > the directory specified by TUNE_BANK_PATH > the default directory.

  1. If TUNE_BANK_PATH is used to specify a directory before model compilation and ge.mdl_bank_path is then used to specify a directory during model compilation, the custom repository directory specified by ge.mdl_bank_path takes effect and the one specified by TUNE_BANK_PATH does not.
  2. The default directory takes effect if both the directories specified by ge.mdl_bank_path and TUNE_BANK_PATH are invalid or contain no custom repository.
  3. If none of the preceding directories contains the custom repository, the system searches the built-in directory of the custom repository generated after subgraph tuning in ${INSTALL_DIR}/compiler/data/fusion_strategy/built-in.

No

All

ge.op_bank_path

Path of the custom repository generated after operator tuning.

Format: The path can contain letters, digits, underscores (_), hyphens (-), and periods (.).

Default: ${HOME}/Ascend/latest/data/aoe/custom/op

Restrictions:

Priority ranked from high to low: the directory specified by TUNE_BANK_PATH > the directory specified by OP_BANK_PATH > the default directory of the custom repository generated after operator tuning.

  1. If TUNE_BANK_PATH is used to specify a directory before model conversion and OP_BANK_PATH is then used to specify a directory during model compilation, the custom repository directory specified by TUNE_BANK_PATH takes effect and the one specified by OP_BANK_PATH does not.
  2. The default directory takes effect if the directories specified by OP_BANK_PATH and TUNE_BANK_PATH are both invalid.
  3. If none of the preceding directories contains the custom repository, the system searches the built-in directory of the custom repository generated after operator tuning.

No

All

ge.exec.dynamicGraphExecuteMode

This option is deprecated. Avoid using it.

Execution mode, applicable to the dynamic input scenario. The value is dynamic_execute.

No

Graph

ge.exec.dataInputsShapeRange

This option is deprecated. Avoid using it.

Shape range of dynamic input. If a graph has two data inputs, the configuration example is as follows.

std::map<ge::AscendString, ge::AscendString> ge_options = {{"ge.exec.deviceId", "0"},
      {"ge.graphRunMode", "1"},
      {"ge.exec.dynamicGraphExecuteMode", "dynamic_execute"},
      {"ge.exec.dataInputsShapeRange", "[128 ,3~5, 2~128, -1],[ 128 ,3~5, 2~128, -1]"}};
  • Set it in the format: "[n1,c1,h1,w1],[n2,c2,h2,w2]" (for example, "[8~20,3,5,-1],[5,3~9,10,-1]"). If node names are not configured, the first pair of brackets ([]) denotes the first input node. Separate the nodes with commas (,). If INPUT_SHAPE_RANGE is set based on the index, the index attribute must be set sequentially from 0 for data nodes.
  • The size of a static dimension is specified by a definite value. The size range of a dynamic dimension is specified by using a tilde (~). A dynamic dimension without a specified size range is denoted by -1.
  • For a scalar input, enclose its shape range in square brackets ([]).
  • Assume that your graph has three inputs and only the first one has a static shape; the static shape must be specified in the options field.

    {"ge.exec.dataInputsShapeRange", "[3,3,4,10], [-1,3,2~1000,-1],[-1,-1,-1,-1]"};
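The shape range format described above can be made concrete with a small parser. This is a sketch for illustration only, assuming well-formed input: DimRange and ParseShapeRange are hypothetical names, a static dimension is represented as a range whose minimum equals its maximum, and an unbounded dynamic dimension (-1) is kept as -1 in both fields.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// One dimension: min == max for a static size, -1/-1 for an unbounded one.
struct DimRange {
  long min;
  long max;
};

// Parses "[n1,c1,h1,w1],[n2,c2,h2,w2]" into one list of ranges per input.
// A scalar input written as "[]" yields an empty list.
std::vector<std::vector<DimRange>> ParseShapeRange(const std::string& text) {
  std::vector<std::vector<DimRange>> inputs;
  std::size_t pos = 0;
  while ((pos = text.find('[', pos)) != std::string::npos) {
    const std::size_t end = text.find(']', pos);
    if (end == std::string::npos) break;  // unbalanced bracket: stop
    const std::string body = text.substr(pos + 1, end - pos - 1);
    std::vector<DimRange> dims;
    std::size_t start = 0;
    while (start <= body.size()) {
      std::size_t comma = body.find(',', start);
      if (comma == std::string::npos) comma = body.size();
      std::string tok = body.substr(start, comma - start);
      tok.erase(std::remove(tok.begin(), tok.end(), ' '), tok.end());
      if (!tok.empty()) {
        const std::size_t tilde = tok.find('~');
        if (tilde == std::string::npos) {
          const long v = std::stol(tok);  // static size or -1
          dims.push_back({v, v});
        } else {  // "lo~hi" range of a dynamic dimension
          dims.push_back({std::stol(tok.substr(0, tilde)),
                          std::stol(tok.substr(tilde + 1))});
        }
      }
      start = comma + 1;
    }
    inputs.push_back(dims);
    pos = end + 1;
  }
  return inputs;
}
```

For "[8~20,3,5,-1],[5,3~9,10,-1]", the first input has a dynamic first dimension in the range 8 to 20, static sizes 3 and 5, and an unbounded last dimension.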

NOTE:
  • If no node name is specified, nodes are stored in the index sequence by default. The following is an example:

    xxx_0, xxx_1, xxx_2,...

    The content following the underscore (_) is the sequence index of a node in the network script. Nodes are arranged in alphabetical order of the index, not in numerical order. If the number of nodes is greater than 10, the sequence is xxx_0, xxx_1, xxx_10, xxx_2, xxx_3, and so on, so the node with index 10 is placed before the node with index 2. As a result, the defined shape range does not match the input node.

    To avoid this problem, when the number of input nodes is greater than 10, you are advised to specify node names in the network script. Consequently, nodes are named with specified names to associate the shape range.

  • If this option and ge.dynamicDims are both configured as follows:
    std::map<ge::AscendString, ge::AscendString> ge_options = {
          {"ge.inputShape", "data:1,1,40,-1;label:1,-1;mask:-1,-1" },
          {"ge.dynamicDims", "20,20,1,1;40,40,2,2;80,60,4,4"},
            xxx
          {"ge.exec.dataInputsShapeRange", "[128 ,3~5, 2~128, -1],[ 128 ,3~5, 2~128, -1]"}};

    The priority of ge.dynamicDims (dynamic dimension size profiles) is higher than that of ge.exec.dataInputsShapeRange (dynamic shape range).
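The index-ordering pitfall described in the first note can be reproduced with a plain string sort. The sketch below assumes that "alphabetical order" behaves like ordinary lexicographic string comparison; the function name SortedDefaultNames is hypothetical, and xxx is the placeholder prefix used in the note.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Builds the default names xxx_0 .. xxx_<count-1> and sorts them as strings.
std::vector<std::string> SortedDefaultNames(int count) {
  std::vector<std::string> names;
  for (int i = 0; i < count; ++i) {
    names.push_back("xxx_" + std::to_string(i));
  }
  // Lexicographic comparison: "xxx_10" sorts before "xxx_2".
  std::sort(names.begin(), names.end());
  return names;
}
```

With 11 inputs, the name with index 10 lands in third position, ahead of the name with index 2, which is why explicit node names are recommended for networks with more than 10 inputs.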

No

Graph

ge.exec.op_precision_mode

Precision mode of one or more specified operators during internal processing. This option is used to transfer the customized precision mode configuration file op_precision.ini to set different precision modes for different operators.

Set the precision mode based on the operator type (low priority) or node name (high priority) in each row in the .ini file.

The following precision modes can be set in the configuration file:

  • high_precision
  • high_performance
  • support_out_of_bound_index: indicates that the out-of-bounds verification is performed on the indices of the gather, scatter, and segment operators. The verification deteriorates the operator execution performance.
  • keep_fp16: The FP16 data type is used for internal processing of operators. In this scenario, the FP16 data type is not automatically converted to the FP32 data type. If the performance of FP32 computation does not meet the expectation and high precision is not required, you can select the keep_fp16 mode. This low-precision mode sacrifices the precision for improving the performance, which is not recommended.
  • super_performance: indicates ultra-high performance. Compared with high performance, the algorithm calculation formula is optimized.

You can view the precision or performance modes supported by an operator in the opp/built-in/op_impl/ai_core/tbe/impl_mode/all_ops_impl_mode.ini file under the CANN software installation path.

Example:

[ByOpType]
optype1=high_precision
optype2=high_performance
optype4=support_out_of_bound_index

[ByNodeName]
nodename1=high_precision
nodename2=high_performance
nodename4=support_out_of_bound_index
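The priority rule (node name over operator type) can be sketched as a two-step lookup. This is an illustration only: PrecisionConfig and ResolveMode are hypothetical names, the two maps stand for the [ByOpType] and [ByNodeName] sections after the .ini file has been read, and what happens when neither section mentions an operator is not specified here, so the sketch returns an empty string.

```cpp
#include <cassert>
#include <map>
#include <string>

// In-memory view of op_precision.ini (hypothetical representation).
struct PrecisionConfig {
  std::map<std::string, std::string> by_op_type;    // [ByOpType] section
  std::map<std::string, std::string> by_node_name;  // [ByNodeName] section
};

// Returns the precision mode for a node, or "" if no entry applies.
std::string ResolveMode(const PrecisionConfig& cfg,
                        const std::string& node_name,
                        const std::string& op_type) {
  // A node-name entry has high priority.
  const auto by_name = cfg.by_node_name.find(node_name);
  if (by_name != cfg.by_node_name.end()) return by_name->second;
  // An operator-type entry has low priority.
  const auto by_type = cfg.by_op_type.find(op_type);
  if (by_type != cfg.by_op_type.end()) return by_type->second;
  return "";
}
```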

No

Global

ge.opSelectImplmode

This option is no longer being developed and will be deprecated in later versions. You are advised to use ge.exec.op_precision_mode instead.

Operator implementation mode select. Certain operators built in the Ascend AI Processor can be implemented in either high-precision or high-performance mode at model build time.

In high-precision mode, Taylor's theorem or Newton's method is used to improve operator accuracy with float16 input. In high-performance mode, the optimal performance is implemented without affecting the network precision (float16).

Arguments:

  • high_precision: High-precision implementation mode.

    This option sets the operator implementation mode by using the built-in configuration file, which is stored in ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/impl_mode/high_precision.ini.

    To ensure compatibility, this argument takes effect only for the operator list in the high_precision.ini file. This list can be used to control the effective scope of operators and ensure that the network models of earlier versions are not affected.

  • high_performance (default): High-performance implementation mode.

    This option sets the operator implementation mode by using the built-in configuration file, which is stored in ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/impl_mode/high_performance.ini.

    To ensure compatibility, this argument takes effect only for the operator list in the high_performance.ini file. This list can be used to control the effective scope of operators and ensure that the network models of earlier versions are not affected.

  • high_precision_for_all: High-precision mode.

    This option sets the operator implementation mode by using the built-in configuration file, which is stored in ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/impl_mode/high_precision_for_all.ini. The list in this file may be updated with the version.

    This implementation mode may cause incompatibility. If an operator in the new software package sets the implementation mode (that is, an implementation mode is added for a certain operator in the configuration file), the performance of the earlier network model that uses the high_precision_for_all mode may deteriorate.

  • high_performance_for_all: High-performance mode.

    This option sets the operator implementation mode by using the built-in configuration file, which is stored in ${INSTALL_DIR}/opp/built-in/op_impl/ai_core/tbe/impl_mode/high_performance_for_all.ini. The list in this file may be updated with the version.

    This implementation mode may cause incompatibility. If an operator in the new software package sets the implementation mode (that is, an implementation mode is added for a certain operator in the configuration file), the precision of the earlier network model that uses the high_performance_for_all mode may deteriorate.

The preceding implementation modes are distinguished based on the dtype of the operator. Replace ${INSTALL_DIR} with the actual CANN component directory. If the Ascend-CANN-Toolkit package is installed as the root user, the CANN component directory is /usr/local/Ascend/ascend-toolkit/latest.

Default: high_performance

No

Global

ge.optypelistForImplmode

List of operator types. The operators in the list use the mode specified by the ge.opSelectImplmode option.

Restrictions:

  • The operators on the list use the mode specified by ge.opSelectImplmode, which is either high_precision or high_performance. Use commas (,) to separate operators.
  • This option must be used together with ge.opSelectImplmode and takes effect only for the specified operators. For example, if ge.opSelectImplmode is set to high_precision and ge.optypelistForImplmode is set to Pooling,SoftmaxV2, the high-precision mode is used only for the Pooling and SoftmaxV2 operators. All other operators use the default implementation mode.

No

Global

ge.shape_generalized_build_mode

Do not use this option because it will be deprecated in later versions.

No

Graph

ge.customizeDtypes

Customized operator precision during model build. Other operators in the model are built according to ge.exec.precision_mode or ge.exec.precision_mode_v2. Set it to the path (including the name of the configuration file), for example, /home/test/customize_dtypes.cfg.

Restrictions:

  • List the names or types of operators whose precision needs customization in the configuration file. Each operator occupies a line, and the operator type must be defined based on IR.
  • If both operator name and type are configured for an operator, the operator name applies during build.
  • The computing precision of an operator specified by this option does not take effect if the operator is fused during model compilation.

The structure of the configuration file is as follows:

# By operator name
Opname1::InputDtype:dtype1,dtype2,…OutputDtype:dtype1,…
Opname2::InputDtype:dtype1,dtype2,…OutputDtype:dtype1,…
# By operator type
OpType::TypeName1:InputDtype:dtype1,dtype2,…OutputDtype:dtype1,…
OpType::TypeName2:InputDtype:dtype1,dtype2,…OutputDtype:dtype1,…

Example:

# By operator name
resnet_v1_50/block1/unit_3/bottleneck_v1/Relu::InputDtype:float16,int8,OutputDtype:float16,int8
# By operator type
OpType::Relu:InputDtype:float16,int8,OutputDtype:float16,int8
NOTE:
  • You can find the operator precision support in the operator information library, which is saved in opp/op_impl/custom/ai_core/tbe/config/${soc_version}/aic-${soc_version}-ops-info.json under the CANN component directory by default.
  • The data type specified by this option takes precedence, which may cause accuracy or performance degradation. If the specified data type is not supported, the build fails.
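
The lines of the configuration file can also be generated programmatically. The following sketch reproduces the format of the example above (including the comma before OutputDtype); the helper names are hypothetical and are not part of any GE API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Joins dtype names with commas, e.g. {"float16", "int8"} -> "float16,int8".
std::string JoinDtypes(const std::vector<std::string>& dtypes) {
    std::string joined;
    for (const std::string& dtype : dtypes) {
        if (!joined.empty()) joined += ",";
        joined += dtype;
    }
    return joined;
}

// Hypothetical helper: builds one line keyed by operator name.
std::string DtypeLineByName(const std::string& op_name,
                            const std::vector<std::string>& inputs,
                            const std::vector<std::string>& outputs) {
    assert(!op_name.empty());
    return op_name + "::InputDtype:" + JoinDtypes(inputs) +
           ",OutputDtype:" + JoinDtypes(outputs);
}

// Hypothetical helper: builds one line keyed by operator type.
std::string DtypeLineByType(const std::string& op_type,
                            const std::vector<std::string>& inputs,
                            const std::vector<std::string>& outputs) {
    assert(!op_type.empty());
    return "OpType::" + op_type + ":InputDtype:" + JoinDtypes(inputs) +
           ",OutputDtype:" + JoinDtypes(outputs);
}
```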

No

Session

ge.exec.atomicCleanPolicy

Collectively cleans up the memory occupied by all operators with the memset attribute (memset operators) on the network.

Arguments:

  • 0 (default): enabled.
  • 1: disabled. Memory used by each memset operator is cleaned up separately. When the memset operators on the network occupy too much memory, you are advised to use this mode to reduce the memory usage. However, this may cause performance loss.

No

Session

ge.jit_compile

Not supported in the current version.

No

Global/Session

ge.build_inner_model

Not supported in the current version.

No

N/A

ge.externalWeight

When multiple models are loaded in a session, if the weights of these models can be reused, you are advised to use this configuration item to externalize the weights of the Const/Constant nodes on the network to implement weight reuse among multiple models and reduce the memory usage of the weights.

Arguments:

  • 0: Saves the weights in the .om model file. The default value is 0.
  • 1: externalizes the weights and flushes the weight files of all Const/Constant nodes on the network. The node type is converted to FileConstant. The weight files are named as weight_+hash.

Description of the file flush path:

  • If the environment variable ASCEND_WORK_PATH is not configured, the weight files are flushed to the tmp_weight_<pid>_<sessionid> directory under the current execution directory.
  • If the environment variable ASCEND_WORK_PATH is configured, the weight files are flushed to the ${ASCEND_WORK_PATH}/tmp_weight_<pid>_<sessionid> directory.

When the model is unloaded, the tmp_weight_<pid>_<sessionid> directory is deleted.

Configuration example:

{"ge.externalWeight", "1"};

No

Session

stream_sync_timeout

Timeout for stream synchronization during graph execution, in ms. If synchronization takes longer than the configured value, a synchronization failure is reported.

The default value is -1, indicating that there is no waiting time and no error is reported when the synchronization fails.

No

Global/Session

event_sync_timeout

Timeout for event synchronization during graph execution, in ms. If synchronization takes longer than the configured value, a synchronization failure is reported.

The default value is -1, indicating that there is no waiting time and no error is reported when the synchronization fails.

No

Global/Session

ge.exec.staticMemoryPolicy

Memory allocation mode used during network running.

Arguments:

  • 0: dynamic memory allocation. Memory is dynamically allocated based on the actual size. Dynamic memory expansion is not supported. Default value: 0.
  • 2: dynamic memory expansion of static shapes. In training and online inference scenarios, this option can be used to reuse memory between multiple graphs in the same session by allocating the memory required by the largest graph. For example, if the current graph requires more memory than the previous graph, the memory of the previous graph is released and memory is reallocated based on the requirement of the current graph.
  • 3: Only dynamic shape supports dynamic memory expansion, which solves the fragment problem during dynamic memory allocation and reduces the memory usage of the dynamic-shape network.
  • 4: Both static and dynamic shapes support dynamic memory expansion.
NOTE:
  • This parameter cannot be set to 2 or 4 when multiple graphs are executed concurrently.
  • To be compatible with earlier versions, the system dynamically expands the memory based on the value 2 even if this option is set to 1.
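
The two notes above can be expressed as a small check before the option is set. This is a sketch only: the function names are hypothetical, and treating the value 1 as subject to the same concurrency restriction as 2 is an inference from the compatibility note, not a documented rule.

```cpp
#include <cassert>

// Maps a configured value to the policy that takes effect: per the note
// above, 1 is handled as 2 for compatibility with earlier versions.
int EffectiveStaticMemoryPolicy(int configured) {
    assert(configured >= 0 && configured <= 4);
    return configured == 1 ? 2 : configured;
}

// Policies 2 and 4 cannot be used when multiple graphs run concurrently.
// Assumption: a configured 1 is checked as 2 because it behaves as 2.
bool AllowedWithConcurrentGraphs(int configured) {
    const int policy = EffectiveStaticMemoryPolicy(configured);
    return policy != 2 && policy != 4;
}
```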

Configuration example:

{"ge.exec.staticMemoryPolicy", "2"};

No

Global

ge.graph_compiler_cache_dir

Disk cache directory for graph compilation. This option is used together with ge.graph_key. This function takes effect only when both ge.graph_compiler_cache_dir and ge.graph_key are not empty.

The configured cache directory must exist. Otherwise, the compilation fails.

After a graph is changed, the original cache file is unavailable. You need to manually delete the cache file from the cache directory or modify ge.graph_key to rebuild and generate a cache file.

For details about other restrictions and usage methods, see Graph Build Cache.

No

Session

ge.graph_key

Unique graph ID. The value contains a maximum of 128 characters, including only letters, digits (0–9), underscores (_), and hyphens (-).
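
A key can be checked against these constraints before it is passed to AddGraph. The following is a minimal sketch; the function name IsValidGraphKey is hypothetical, "letters" is assumed to mean ASCII letters, and an empty key is rejected because the cache takes effect only when the key is not empty.

```cpp
#include <cassert>
#include <string>

// Hypothetical validator for a ge.graph_key value: 1 to 128 characters,
// each an ASCII letter, a digit, an underscore, or a hyphen.
bool IsValidGraphKey(const std::string& key) {
    if (key.empty() || key.size() > 128) return false;
    for (char c : key) {
        const bool allowed = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                             (c >= '0' && c <= '9') || c == '_' || c == '-';
        if (!allowed) return false;
    }
    return true;
}
```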

No

Graph

ge.featureBaseRefreshable

Whether the feature memory address can be refreshed. To manage the feature memory and refresh its address multiple times, set this option to 1.

This option applies only to static shape graphs.

Arguments:

  • 0 (default): The feature memory address cannot be refreshed.
  • 1: The feature memory address of a model can be refreshed.

No

All

ge.constLifecycle

Lifecycle of constant nodes in the training and online inference scenario.

session (default): Constant nodes are stored at the session level. In this case, memory reuse is supported for constant nodes between multiple graphs in a session. However, ensure that constant nodes with the same name in multiple graphs are the same.

graph: Constant nodes are stored at the graph level. You can call SetGraphConstMemoryBase to manage the const memory at the graph level.

The default value is session in the training scenario and graph in the online inference scenario.

No

All

ge.exec.inputReuseMemIndexes

Memory reuse enable for the input node of a graph. After the function is enabled, the memory of the input node can be reused as the intermediate memory required during model execution, reducing the memory peak.

The value is the index of the input node. If memory reuse is enabled for multiple input nodes, use commas (,) to separate multiple indexes. The index attribute of the input node is required, specifying the sequence number of the input. The index starts from 0.

Note:

  • If memory reuse of the input node is enabled, ensure that the input memory size is 32-byte aligned.
  • If a configured input index is greater than or equal to the number of inputs, that index is invalid and does not take effect.
  • If memory reuse of the input node is enabled, the memory data of the input node will be overwritten. After the graph is run, the memory data of the input node cannot be used.
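
The index and alignment rules above can be checked on the application side before the option is set. The sketch below is illustrative only: the helper names are hypothetical, and the value is assumed to contain non-negative decimal integers separated by commas.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Parses a value such as "0,1,2" and keeps only the indexes below
// input_count; an index >= input_count is invalid and has no effect.
std::vector<std::size_t> EffectiveReuseIndexes(const std::string& value,
                                               std::size_t input_count) {
    std::vector<std::size_t> valid;
    std::istringstream stream(value);
    std::string token;
    while (std::getline(stream, token, ',')) {
        if (token.empty()) continue;  // ignore empty entries such as "0,,1"
        const std::size_t index = std::stoul(token);
        if (index < input_count) valid.push_back(index);
    }
    return valid;
}

// True when a buffer size satisfies the 32-byte alignment requirement.
bool IsSize32ByteAligned(std::size_t size_in_bytes) {
    return size_in_bytes % 32 == 0;
}
```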

Configuration example:

{"ge.exec.inputReuseMemIndexes", "0,1,2"};

No

Graph

ge.exec.outputReuseMemIndexes

Memory reuse enable for the entire graph output. After the function is enabled, the memory of the entire graph output can be reused as the intermediate memory required during model execution, reducing the memory peak.

If enabled, the value is the index of the entire graph output. If memory reuse is enabled for multiple outputs, use commas (,) to separate multiple indexes.

Note:

  • If memory reuse is enabled for the entire graph output, ensure that the output memory size is 32-byte aligned.
  • Output indices are identified based on the output sequence of the entire graph. The index starts from 0.
  • If a configured output index is greater than or equal to the number of outputs, that index is invalid and does not take effect.

Configuration example:

{"ge.exec.outputReuseMemIndexes", "0,1,2"};

No

Graph

ge.disableOptimizations

This parameter is used for debugging and cannot be used in commercial products. The function specified by this parameter will be released as a feature in later versions.

This parameter applies only to the following products:

Specifies one or more compilation and optimization passes to be disabled.

Currently, only the following passes can be disabled:

"RemoveSameConstPass","ConstantFoldingPass","TransOpWithoutReshapeFusionPass"

Note:

  1. Separate multiple passes with commas (,).
  2. If any other pass is specified, only a warning log is printed during graph compilation.
  3. If ConstantFoldingPass is disabled, graph compilation or running may fail.
  4. If other compilation optimization options, such as ge.oo.constantFolding, are configured, ge.disableOptimizations has a higher priority.

Configuration examples:

  • Disabling a single pass
    std::map <AscendString, AscendString> session_options = {
    {"ge.disableOptimizations", "RemoveSameConstPass"}
    };
  • Disabling multiple passes
    std::map <AscendString, AscendString> session_options = {
    {"ge.disableOptimizations", "RemoveSameConstPass, ConstantFoldingPass"}
    };
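
Because unsupported pass names only produce a warning log, it can help to check the value before it is passed in. The following sketch is not part of GE: the names kSupportedPasses and UnsupportedPasses are hypothetical, and trimming spaces around each name is an assumption of the sketch.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Passes listed above as the only ones that can currently be disabled.
const std::set<std::string> kSupportedPasses = {
    "RemoveSameConstPass", "ConstantFoldingPass",
    "TransOpWithoutReshapeFusionPass"};

// Splits a comma-separated value and returns the names that are not in the
// supported list. Surrounding spaces are trimmed (assumption of this sketch).
std::vector<std::string> UnsupportedPasses(const std::string& value) {
    std::vector<std::string> unsupported;
    std::istringstream stream(value);
    std::string name;
    while (std::getline(stream, name, ',')) {
        const std::size_t first = name.find_first_not_of(' ');
        if (first == std::string::npos) continue;  // skip blank entries
        const std::size_t last = name.find_last_not_of(' ');
        name = name.substr(first, last - first + 1);
        if (kSupportedPasses.count(name) == 0) unsupported.push_back(name);
    }
    return unsupported;
}
```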

No

all

ac_parallel_enable

Whether to allow AI CPU operators and AI Core operators to run in parallel in a dynamic-shape graph.

In a dynamic-shape graph, when this function is enabled, the system automatically identifies AI CPU operators that can be run in parallel with the AI Core operators in the graph. Operators of different engines are distributed to different streams to run in parallel, improving resource utilization and dynamic shape execution performance.

Arguments:

  • 1: AI CPU operators and AI Core operators are allowed to run in parallel.
  • 0 (default): AI CPU operators are not separately distributed.

Configuration example:

{"ac_parallel_enable", "1"};

Optional

Global

ge.deterministic

Deterministic computing enable.

By default, deterministic computing is disabled. In this case, repeated executions of an operator on the same hardware with the same input may produce different results. This is generally caused by asynchronous multi-thread execution in the operator implementation, which changes the order in which floating-point numbers are accumulated. When deterministic computing is enabled, an operator produces the same output every time it is executed on the same hardware with the same input. This often slows down operator execution. If a model produces different results across runs or its accuracy needs to be tuned, you can enable deterministic computing to assist model debugging and tuning.
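
The effect of accumulation order can be reproduced on the host with a few lines of C++. This is a generic illustration of floating-point behavior, not GE code; it assumes IEEE 754 single precision without excess-precision evaluation, and the function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Sums the values from first to last in single precision.
float SumForward(const std::vector<float>& values) {
    float sum = 0.0f;
    for (float v : values) sum += v;
    return sum;
}

// Sums the same values from last to first.
float SumBackward(const std::vector<float>& values) {
    float sum = 0.0f;
    for (auto it = values.rbegin(); it != values.rend(); ++it) sum += *it;
    return sum;
}
```

For the input {1e8f, -1e8f, 1.0f}, the forward sum cancels the two large values first and keeps the 1, whereas the backward sum adds 1 to -1e8 first, where it is lost to rounding. The two orders therefore give different results for the same data.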

Arguments:

  • 0 (default): disabled.
  • 1: enabled.

Configuration example:

{"ge.deterministic", "1"};

Optional

Global

ge.enableGraphParallel

Algorithm-based partitioning for the original foundation model. The value 1 indicates that algorithm-based partitioning is enabled. For details about the partitioning strategy, see the configuration file specified by ge.graphParallelOptionPath.

If this option is set to another value or left empty, algorithm-based partitioning is disabled. By default, this option is left empty.

Configuration example:

{"ge.enableGraphParallel", "1"};

No

Graph

ge.exec.enableEngineParallel

Whether to perform tiling on communication operators and related computation operators in the network so that they can run in parallel in the partitioning and deployment scenarios of foundation models. Tiling can be performed only when communication operators exist on the network and this option is set to 1. During tiling, only AllReduce communication operators are partitioned.

If this option is set to another value or left empty, this function is disabled. By default, this option is left empty.

Configuration example:

{"ge.exec.enableEngineParallel", "1"};

Optional

Graph

ge.graphParallelOptionPath

Path and name of the algorithm-based partitioning strategy configuration file when the original foundation model is partitioned. This option takes effect only when ge.enableGraphParallel is set to 1.

Configuration example:

{"ge.graphParallelOptionPath", "./parallel.json"};

The configuration file must be in JSON format. The following is an example:

  • Semi-automatic partitioning
    {
        "graph_parallel_option": {
            "auto": false,
            "tensor_parallel_option": {
                "tensor_parallel_size": 2
            }
        }
    }
  • Automatic partitioning
    {
        "graph_parallel_option": {
            "auto": true
        }
    }

Arguments:

  • auto: true for automatic partitioning; false for semi-automatic partitioning.
  • tensor_parallel_option: enables Tensor Parallel (TP). After TP is enabled and ge.exec.modelDeployMode is set to the default value, ge.exec.variable_acc does not take effect. That is, the variable format optimization function cannot be enabled.

    TP: Tensor Parallel, also called Intra-Op Parallel, partitions the tensor of each operator in a computational graph along one or more axes (batch/non-batch). The divided partitions are distributed to each device for computation.

  • tensor_parallel_size: TP size, that is, the number of device chips to be configured.

No

Graph

ge.exec.hostSchedulingMaxThreshold

Maximum threshold to enable dynamic shape scheduling when a static small graph (root graph) is executed. The default value is 0. It is recommended that this option be used in foundation model scenarios.

  • If the number of static root graph nodes is less than the maximum threshold, dynamic shape scheduling is used. For foundation models, this mode saves stream resources.
  • If the number of static root graph nodes is greater than the maximum threshold, the original process remains unchanged.

Note: If the static root graph node contains subgraphs, this option does not take effect.

No

All

ge.exec.static_model_ops_lower_limit

Lower limit of the number of operators in a static subgraph. The value ranges from –1 to positive infinity. If other values are used, an error is reported. The default value is 4.

  • ≥ 0:
    • If the number of operators is less than the lower limit, static subgraphs are not partitioned separately, but are merged into dynamic graphs for dynamic execution.
    • If the number of operators is greater than or equal to the lower limit, the original process remains unchanged (static subgraphs are still partitioned).
  • -1: All operators are executed through dynamic graphs.

For example, if there are four operators in a static subgraph and this option is set to 10, static subgraphs are not partitioned separately, but are executed through dynamic graphs.
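
The decision rule can be summarized in code. This is a sketch of the documented behavior only, not the GE implementation; the function name KeepsStaticSubgraph is hypothetical.

```cpp
#include <cassert>

// Returns true when a static subgraph with op_count operators is still
// partitioned as a static subgraph, and false when it is merged into the
// dynamic graph for dynamic execution.
bool KeepsStaticSubgraph(int op_count, int lower_limit) {
    assert(lower_limit >= -1);            // other values are reported as errors
    if (lower_limit == -1) return false;  // all operators run dynamically
    return op_count >= lower_limit;
}
```

With four operators and a limit of 10, KeepsStaticSubgraph(4, 10) returns false, which matches the example above.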

Optional

Graph

ge.exec.input_fusion_size

Threshold for fusing and copying multiple discrete pieces of user input data during data transfer from the host to the device. The minimum value is 0, the maximum value is 32 MB (33,554,432 bytes), and the default value is 128 KB (131,072 bytes).

  • If the size of input data is less than or equal to the threshold, the data is fused before being transferred from the host to the device.
  • If the size of input data is greater than the threshold or the threshold is 0 (the function is disabled), the data is not fused before being transferred from the host to the device.

Assume there are 10 user inputs, including two 100 KB inputs, two 50 KB inputs, and the other inputs greater than 100 KB:

  • ge.exec.input_fusion_size = 100 KB: The preceding four inputs are fused into 300 KB data for transfer. The other six inputs are directly transferred from the host to the device.
  • ge.exec.input_fusion_size = 0: This function is disabled and no inputs are fused. That is, the 10 inputs are directly transferred from the host to the device.
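
The example above can be reproduced with a short sketch that classifies each input by size. It models only the threshold rule described here, not the actual copy logic in GE; the names FusionPlan and PlanInputFusion are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct FusionPlan {
    std::size_t fused_count = 0;   // inputs copied together in a fused transfer
    std::size_t fused_bytes = 0;   // total size of the fused inputs
    std::size_t direct_count = 0;  // inputs transferred individually
};

// Classifies inputs by size: inputs no larger than the threshold are fused,
// all others are transferred directly. A threshold of 0 disables fusion.
FusionPlan PlanInputFusion(const std::vector<std::size_t>& input_sizes,
                           std::size_t threshold) {
    assert(threshold <= 33554432);  // documented maximum: 32 MB
    FusionPlan plan;
    for (std::size_t size : input_sizes) {
        if (threshold != 0 && size <= threshold) {
            ++plan.fused_count;
            plan.fused_bytes += size;
        } else {
            ++plan.direct_count;
        }
    }
    return plan;
}
```

For two 100 KB inputs, two 50 KB inputs, and six larger inputs, a 100 KB threshold gives four fused inputs totaling 300 KB and six direct transfers.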

This option takes effect only when the static graph is run asynchronously. That is, RunGraphAsync is used.

Optional

all

ge.topoSortingMode

Traversal mode when you compile operators in graph mode. It is mainly used for online inference scenarios.

Arguments:

  • 0: Breadth First Search (BFS).
  • 1: Depth First Search (DFS). Default: 1.
  • 2: Reverse DFS (RDFS).
  • 3: Stable RDFS. For existing operators in the graph, the computation sequence is not changed. For new operators in the graph, RDFS is used.

Configuration example:

{"ge.topoSortingMode", "1"};

Optional

all

ge.tiling_schedule_optimize

Tiling offload scheduling optimization.

Because the internal storage of the AI Core in the NPU cannot hold all the input and output data of an operator, the input data is split into parts. The first part is transferred in, computed, and transferred out, followed by the next part, and so on. This process is called tiling. A computation program, called the tiling implementation, determines the tiling parameters (such as the block size transferred each time and the total number of cycles) based on operator information such as the shape. The AI Core is not good at the scalar computation that the tiling implementation involves, so the tiling implementation is generally executed on the host CPU. However, the tiling implementation is executed on the device when the following conditions are met:

  1. The model is static-shape.
  2. Operators in the model, such as the FusedInferAttentionScore and IncreFlashAttention fused operators, support tiling offload.
  3. The value of the operator that supports tiling offloading depends on the execution result of the previous operator.

Arguments:

  • 0 (default): Tiling offload is disabled.
  • 1: Tiling offload is enabled.

Configuration example:

{"ge.tiling_schedule_optimize", "1"};

Optional

Global/Session

ge.exportCompileStat

Whether to generate the result file fusion_result.json of operator fusion information (including graph fusion and UB fusion) during graph build.

This file is used to record the fusion patterns used during graph build. In the file:

  • session_and_graph_id_xx_xx: Thread and graph ID to which the fusion result belongs.
  • graph_fusion: Graph fusion.
  • ub_fusion: UB fusion.
  • match_times: Number of times that a fusion pattern is hit during graph build.
  • effect_times: Number of times that a fusion pattern takes effect.
  • repository_hit_times: Number of times that the repository is hit during UB fusion.

Arguments:

  • 0: The result file of operator fusion information is not generated.
  • 1 (default): The result file of operator fusion information is generated when the program exits normally.
  • 2: The result file of operator fusion information is generated after graph build. That is, if the graph build is complete and the subsequent program is interrupted in advance, the result file of operator fusion information is also generated.
NOTE:

If the ASCEND_WORK_PATH environment variable is not set, the result file is generated in the current path where the atc command is executed by default. If the ASCEND_WORK_PATH environment variable is set, the result file fusion_result.json is saved in $ASCEND_WORK_PATH/FE/${Process ID}.

Configuration example:

{"ge.exportCompileStat", "1"};

Optional

all

ge.graphMaxParallelModelNum

In graph execution mode, a graph can be concurrently loaded and executed by multiple models on the same device. This parameter is used to control the maximum number of models that can be concurrently loaded.

Arguments:

1 to INT32_MAX. The default value is 8.

Configuration example:

{"ge.graphMaxParallelModelNum", "10"};

Optional

all

ge.oo.level

Extended parameter for debugging. It cannot be used in commercial products and will be released as a formal function in later versions.

Multi-level optimization options for graph build include subgraph optimization, full graph optimization, and static shape model sinking.

Static shape model sinking: The input and output shapes of all operators in a static shape model can be determined at build time, so model-level memory orchestration and operator tiling computation can be completed on the host. These tasks are then delivered in batches to the device stream when the model is loaded, but they are not executed immediately. Instead, the execution of all tasks within the model is triggered by delivering a model execution task.

Arguments:

  • O1: Performs only optimizations related to static sinking, such as InferShape (output tensor shape inference), constant folding, and dead-edge elimination.
  • O3 (default): Enables all optimizations.

Configuration example:

{"ge.oo.level", "O1"};

Optional

all

ge.oo.constantFolding

Extended parameter for debugging. It cannot be used in commercial products and will be released as a formal function in later versions.

Enables constant folding optimization.

Constant folding is the process of replacing nodes in a computation graph that can be evaluated to a constant output value with that constant, and simplifying the structure of the computation graph accordingly.
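
The idea can be illustrated on a toy expression graph. The sketch below is a generic illustration of constant folding and does not reflect how GE represents graphs or implements the pass; all type and function names are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

enum class Kind { kConst, kInput, kAdd };

// Toy graph node: a constant, a named runtime input, or an addition.
struct Node {
    Kind kind = Kind::kConst;
    double value = 0.0;         // used when kind == Kind::kConst
    std::string name;           // used when kind == Kind::kInput
    std::shared_ptr<Node> lhs;  // used when kind == Kind::kAdd
    std::shared_ptr<Node> rhs;  // used when kind == Kind::kAdd
};
using NodePtr = std::shared_ptr<Node>;

NodePtr MakeConst(double value) {
    auto node = std::make_shared<Node>();
    node->value = value;
    return node;
}

NodePtr MakeInput(const std::string& name) {
    auto node = std::make_shared<Node>();
    node->kind = Kind::kInput;
    node->name = name;
    return node;
}

NodePtr MakeAdd(NodePtr lhs, NodePtr rhs) {
    auto node = std::make_shared<Node>();
    node->kind = Kind::kAdd;
    node->lhs = std::move(lhs);
    node->rhs = std::move(rhs);
    return node;
}

// Folds bottom-up: an addition whose operands are both constants is
// replaced by a single constant node that holds the sum.
NodePtr Fold(const NodePtr& node) {
    if (node->kind != Kind::kAdd) return node;
    NodePtr lhs = Fold(node->lhs);
    NodePtr rhs = Fold(node->rhs);
    if (lhs->kind == Kind::kConst && rhs->kind == Kind::kConst) {
        return MakeConst(lhs->value + rhs->value);
    }
    return MakeAdd(lhs, rhs);
}
```

Folding x + (2 + 3) leaves an addition of the input x and a single constant 5, so the inner addition no longer needs to be executed at run time.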

Arguments:

  • true (default): enabled.
  • false: disabled.

Configuration example:

{"ge.oo.constantFolding", "true"};

Restrictions:

If other compilation optimization options, such as ge.disableOptimizations, are configured, ge.disableOptimizations has a higher priority.

Optional

all

ge.oo.deadCodeElimination

Extended parameter for debugging. It cannot be used in commercial products and will be released as a formal function in later versions.

Enables dead-edge elimination optimization.

Dead-edge elimination: When pred (input 1) of a switch statement is a constant node, one of the branches can be eliminated based on the value of const. If const is true, the false branch is eliminated; if const is false, the true branch is eliminated.

Arguments:

  • true (default): enabled.
  • false: disabled.

Configuration example:

{"ge.oo.deadCodeElimination", "true"};

Optional

all

ge.exec.modelDeployMode

Model deployment mode in the partitioning and deployment scenarios of all foundation models.

  • For the Atlas Training Series Product, this option is left empty by default. Currently, only the SPMD mode is supported. SPMD is short for Single Program Multiple Data, and indicates that the same program is executed on all nodes for data parallelism.

No

Graph

ge.exec.modelDeployDevicelist

Device used by the current execution node for model deployment and execution in the partitioning and deployment scenarios of foundation models. This option is used in conjunction with ge.exec.modelDeployMode in the SPMD scenario.

No

Graph

ge.exec.frozenInputIndexes

Index of the input tensor whose address is not refreshed. This option can be configured only for LoadGraph. The input tensor index varies according to the model.

  • Dynamic shape model: The index of the input tensor, address of the data on the device, and data length need to be passed. The address must be in decimal format.
  • Static shape model: Only the index of the input tensor needs to be passed. Other parameters, such as data length, do not take effect.

Configuration example:

# Pass only the input tensor index.
{"ge.exec.frozenInputIndexes", "0;1;2"};
# Pass the input tensor index, address of the data on the device, and data length.
{"ge.exec.frozenInputIndexes", "0,88832131,4;1,888213294,4;2,193492421,2"};

Restrictions:

The input tensor whose address is not refreshed must have a static shape. For a dynamic shape model, the input tensor must also have a static shape.
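
The two value formats shown in the configuration example can be assembled as follows. This is a sketch under the assumption that entries are separated by semicolons and fields by commas, as in the example; the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct FrozenInput {
    std::uint32_t index;        // input tensor index
    std::uint64_t device_addr;  // address of the data on the device
    std::uint64_t length;       // data length
};

// Static shape model: only the indexes are passed.
std::string FrozenIndexesStatic(const std::vector<FrozenInput>& inputs) {
    std::string value;
    for (const FrozenInput& input : inputs) {
        if (!value.empty()) value += ";";
        value += std::to_string(input.index);
    }
    return value;
}

// Dynamic shape model: index, address, and length are passed per input.
// std::to_string prints in decimal, as required for the address.
std::string FrozenIndexesDynamic(const std::vector<FrozenInput>& inputs) {
    std::string value;
    for (const FrozenInput& input : inputs) {
        if (!value.empty()) value += ";";
        value += std::to_string(input.index) + "," +
                 std::to_string(input.device_addr) + "," +
                 std::to_string(input.length);
    }
    return value;
}
```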

Optional

Graph