Function: set_dump

Applicability

Product	Supported (√/x)
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas training products	√
Atlas inference products	√
Atlas 200I/500 A2 inference products	√

Function Usage

Sets dump parameters.

Prototype

C Prototype

        
aclError aclmdlSetDump(const char *dumpCfgPath)

Python Function

ret = acl.mdl.set_dump(dump_cfg_path)

Parameter Description

Parameter

Description

dump_cfg_path

Str, path of the configuration file, including the file name.

Currently, the following dump information can be configured. Note that if the operator input or output contains sensitive user information, dumping it may cause information leakage.

- Model dump configuration (exports the input and output data of operators at each layer in the model) and single-operator dump configuration (exports the input and output data of an operator). The exported data is compared with that of a specified model or operator to locate accuracy issues. For details about the configuration example, description, and restrictions, see Examples of Model Dump Configuration and Single-Operator Dump Configuration. These dump configurations are disabled by default.
- Dump configuration of the exception operator (exports the input and output data, workspace information, and tiling information of the exception operator). The exported data is used to analyze AI Core errors. For details about the configuration example, see Example of Dump Configuration for Exception Operators. This dump configuration is disabled by default.
- Overflow/Underflow operator dump configuration (exports the input and output data of the overflow/underflow operator in the model). The exported data is used to analyze overflow/underflow causes and locate model accuracy issues. For details about the configuration example, description, and restrictions, see Example of Overflow/Underflow Operator Dump Configuration. This dump configuration is disabled by default.
- Configuration for operator dump watch mode (enables the observation mode for the output data of a specified operator). If you have located accuracy issues in some operators, ruled out calculation errors in those operators, and suspect that their memory is being overwritten by other operators, you can enable the dump watch mode. For details about the configuration example and restrictions, see Dump Watch Configuration for Operators. The dump watch mode is disabled by default.

Return Value Description

Return Value	Description
ret	Int, error code. 0 indicates success; any other value indicates failure.

Restrictions

acl.mdl.init_dump needs to be called in conjunction with acl.mdl.set_dump and acl.mdl.finalize_dump to dump data to files. These APIs can be called multiple times in a single process to obtain dump data for different dump configurations.
Example scenarios:
- To execute two models with different dump configurations, use the following API call sequence: acl.init --> acl.mdl.init_dump --> acl.mdl.set_dump --> model loading --> model execution --> acl.mdl.finalize_dump --> model unloading --> acl.mdl.init_dump --> acl.mdl.set_dump --> model loading --> model execution --> acl.mdl.finalize_dump --> model unloading --> execution of other tasks --> acl.finalize.
- To execute the same model twice and dump data only for the first execution, use the following API call sequence: acl.init --> acl.mdl.init_dump --> acl.mdl.set_dump --> model loading --> model execution --> acl.mdl.finalize_dump --> model unloading --> model loading --> model execution --> execution of other tasks --> acl.finalize.
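The call sequences above can be sketched as a small Python helper. This is a minimal sketch, not part of pyACL: dump_session and DumpError are hypothetical names, the acl module is passed in as an argument so that the snippet does not depend on a device being present, and it assumes that acl.mdl.init_dump and acl.mdl.finalize_dump take no arguments and return an error code in the same way as acl.mdl.set_dump.

```python
from contextlib import contextmanager


class DumpError(RuntimeError):
    """Hypothetical error type raised when a dump call returns a non-zero code."""


def _check(name, ret):
    # Per the return value table, 0 means success; anything else is a failure.
    if ret != 0:
        raise DumpError(f"{name} failed, ret={ret}")


@contextmanager
def dump_session(acl_module, dump_cfg_path):
    """Apply a dump configuration to every model loaded inside the with body.

    acl_module is expected to expose mdl.init_dump(), mdl.set_dump(path) and
    mdl.finalize_dump(); the no-argument forms are an assumption.
    """
    _check("acl.mdl.init_dump", acl_module.mdl.init_dump())
    try:
        _check("acl.mdl.set_dump", acl_module.mdl.set_dump(dump_cfg_path))
        yield  # load and execute the model here
    finally:
        # finalize_dump is called before the caller unloads the model.
        _check("acl.mdl.finalize_dump", acl_module.mdl.finalize_dump())
```

With the real module, model loading and execution would go inside `with dump_session(acl, "./acl.json"):` and model unloading after it. Entering a second session with a different configuration file corresponds to the two-model scenario.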

The configured dump information is valid only when the model is loaded after the dump function is enabled by calling this API. The dump configuration does not take effect on models loaded before this API call unless you reload the models after this API call.
For example, in the following API calling sequence, the dump configuration is valid only for model 2.

acl.mdl.init_dump --> model 1 loading --> acl.mdl.set_dump --> model 2 loading --> acl.mdl.finalize_dump

If this API is called repeatedly to set the dump configuration for the same model, the most recent configuration is applied.
For example, in the following API call sequence, the second acl.mdl.set_dump call overwrites the configuration set by the first call:

acl.mdl.init_dump --> acl.mdl.set_dump --> acl.mdl.set_dump --> model 1 loading --> acl.mdl.finalize_dump

Reference

Dump can also be configured through the acl.init API. During initialization, the dump configuration is passed in a JSON configuration file so that app data is dumped at run time. In this mode, the acl.init API can be called only once in a process. To change the dump configuration, modify the configuration in the JSON file.

Examples of Model Dump Configuration and Single-Operator Dump Configuration

After model dump or single-operator dump is configured, the exported data is compared with that of a specified model or operator to locate accuracy issues. For details about the comparison method, see Accuracy Debugging Tool Guide.

Model dump configuration example:

{
    "dump":{
        "dump_list":[
            {
                "model_name":"ResNet-101"
            },
            {
                "model_name":"ResNet-50",
                "layer":[
                    "conv1conv1_relu",
                    "res2a_branch2ares2a_branch2a_relu",
                    "res2a_branch1",
                    "pool1"
                ]
            }
        ],
        "dump_path":"$HOME/output",
        "dump_mode":"output",
        "dump_op_switch":"off",
        "dump_data":"tensor"
    }
}

Example of single-operator dump configuration:

{
    "dump":{
        "dump_path":"output",
        "dump_list":[],
        "dump_op_switch":"on",
        "dump_data":"tensor"
    }
}
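A configuration file like the two examples above can also be generated from Python before its path is passed to acl.mdl.set_dump. The following is a minimal sketch using only the standard library. The function name write_dump_config is hypothetical, and the accepted values for dump_mode, dump_op_switch, and dump_data are the ones listed in Table 1.

```python
import json
from pathlib import Path

# Allowed values as listed in Table 1.
_DUMP_MODES = {"input", "output", "all"}
_DUMP_DATA = {"tensor", "stats"}
_SWITCH = {"on", "off"}


def write_dump_config(cfg_file, dump_path, dump_list=(), dump_mode="output",
                      dump_op_switch="off", dump_data="tensor"):
    """Write a model or single-operator dump configuration and return its path.

    Hypothetical helper: it validates only the enumerated fields and does not
    check that dump_path exists or is writable.
    """
    if dump_mode not in _DUMP_MODES:
        raise ValueError(f"dump_mode must be one of {sorted(_DUMP_MODES)}")
    if dump_op_switch not in _SWITCH:
        raise ValueError(f"dump_op_switch must be one of {sorted(_SWITCH)}")
    if dump_data not in _DUMP_DATA:
        raise ValueError(f"dump_data must be one of {sorted(_DUMP_DATA)}")

    config = {
        "dump": {
            # Leave dump_list empty in the single-operator scenario.
            "dump_list": list(dump_list),
            "dump_path": dump_path,
            "dump_mode": dump_mode,
            "dump_op_switch": dump_op_switch,
            "dump_data": dump_data,
        }
    }
    cfg_file = Path(cfg_file)
    cfg_file.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return str(cfg_file)
```

For example, `write_dump_config("acl.json", "output", dump_op_switch="on")` would produce the single-operator configuration, and passing `dump_list=[{"model_name": "ResNet-50", "layer": ["pool1"]}]` would produce a model dump configuration limited to one operator.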

Table 1 Format of the acl.json file

Parameter

Description

dump_list

(Required) List of whole-network models whose data is to be dumped.

Each entry holds the dump configuration of one model. To dump multiple models, separate the entries with commas (,).

In the single-operator calling scenario (including single-operator model execution and single-operator API execution), leave dump_list empty.

model_name

Model name. The value of model_name of each model must be unique.

To load a model from a file, enter the model file name without the file name extension. You can also set this parameter to the value of the outermost name field in the .json file after ATC-based model conversion.
To load a model from memory, set this parameter to the value of the name field in the .json file after ATC-based model conversion.

layer

You are advised to dump only specific operators. Otherwise, the large data volume may cause timeouts if the I/O performance is poor. This field specifies the names of the operators to be dumped. A name can be either the operator name after ATC model conversion or the original operator name before conversion.

Configure one operator name per line, and separate the names with commas (,).
model_name is optional. If it is not set, the specified operators of all models are dumped. If it is set, only the specified operators of that model are dumped.
If the input of a specified operator comes from a data operator, the data operator information is also dumped. To dump a data operator, specify the downstream nodes of the data operator.
To dump all operators of a model, omit the layer field.

dump_path

(Required) Directory for storing dump data files in the operating environment. The directory must be created in advance and the running user configured during installation must have the read and write permissions on the directory.

The path can be either absolute or relative.

An absolute path starts with a slash (/), for example, $HOME/output.
A relative path starts with a directory name, for example, output.

dump_mode

Dump mode.

input: dumps operator inputs only.
output (default): dumps operator outputs only.
all: dumps both operator inputs and outputs.
Note: Some operators, such as the collective communication operators HcomAllGather and HcomAllReduce, modify their input data during execution. Therefore, when this parameter is set to all, the system dumps the operator input before operator execution and dumps the operator output after operator execution. As a result, the input and output data of the same operator are written to disk separately, and multiple dump files are generated. After parsing the dump files, you can determine whether the data is an input or an output based on the file content.

dump_level

Dump data level. The options are as follows:

op: dumps data at the operator level.
kernel: dumps data at the kernel level.
all (default): dumps both op and kernel level data.

With the default value, a large number of dump files are generated, for example, dump files whose names start with aclnn. If dump performance matters or memory resources are limited, set this parameter to op to improve dump performance and reduce the number of dump files.

NOTE:

An operator is a representation of operation logic (for example, addition, subtraction, multiplication, and division operations). The kernel is the implementation of the operation logic for computing and needs a specific computing device to complete computing.

dump_op_switch

Dump data switch of the single-operator model execution mode in the single-operator dump scenario.

on: enables dump for the single-operator model.
off (default): disables dump for the single-operator model.

dump_step

Iterations to dump. This parameter is not required in the inference scenario.

If this parameter is not configured, dump data will be generated for all iterations by default, which may result in a large amount of data. You are advised to specify iterations as required.

Separate multiple iterations using vertical bars (|), for example, 0|5|10. You can also use hyphens (-) to specify the iteration range, for example, 0|3-5|10.

Configuration example:

              
{
    "dump":{
        "dump_list":[
            ......
        ],
        "dump_path":"$HOME/output",
        "dump_mode":"output",
        "dump_op_switch":"off",
        "dump_step":"0|3-5|10"
    }
}

NOTE:

In the training scenario, if the dump_step parameter in acl.json is used to specify the iterations whose dump data is to be collected and the ge.exec.dumpStep parameter is configured in the GEInitialize API (this parameter is also used to specify the iterations whose dump data is to be collected), the last configured parameter will be used. For details about the GEInitialize API, see "GEInitialize" in the Graph Mode Development Guide.

dump_data

Type of the operator dump content. The options are as follows:

tensor (default): dumps operator data.
stats: dumps operator statistics. The result file is in .csv format and contains the operator name, input/output data type, maximum value, and minimum value.

Dumping a large amount of data typically requires a significant amount of time. One solution is to first dump operator statistics, use the statistics to identify potentially abnormal operators, and then proceed to dump the data of the identified operators.

In the model dump scenario, the information of operator input or output or both can be collected based on the configuration of dump_mode.
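The dump_step syntax in Table 1 (vertical bars between items, hyphens for ranges) can be illustrated with a short parser. This is a sketch for checking a value before it is written to the configuration file, not the parser used by the runtime. The function name expand_dump_step is hypothetical, and the code assumes that a range such as 3-5 includes both endpoints, which the table does not state explicitly.

```python
def expand_dump_step(spec):
    """Expand a dump_step string such as "0|3-5|10" into a sorted list of iterations.

    Assumption: ranges are inclusive on both ends.
    """
    steps = set()
    for item in spec.split("|"):
        item = item.strip()
        if "-" in item:
            # A hyphen denotes an iteration range, for example 3-5.
            start_text, end_text = item.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range in dump_step: {item!r}")
            steps.update(range(start, end + 1))
        else:
            # int() raises ValueError for empty or non-numeric items.
            steps.add(int(item))
    return sorted(steps)


print(expand_dump_step("0|3-5|10"))  # [0, 3, 4, 5, 10]
```

Under the inclusive-range assumption, the example value 0|3-5|10 selects five iterations.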

Example of Dump Configuration for Exception Operators

You can enable dump for exception operators by setting dump_scene. The following is an example of the configuration file, indicating that lightweight exception dump is enabled:

{
    "dump":{
        "dump_path":"output",
        "dump_scene":"aic_err_brief_dump"
    }
}

The details are as follows:

dump_scene can be set to:
- aic_err_brief_dump: lightweight exception dump, which is used to export the input, output, and workspace data of exception operators of AI Core.
- aic_err_norm_dump: common exception dump, which is used to export the shape, data type, format, and attribute information in addition to the lightweight exception dump.
- aic_err_detail_dump: exports the internal storage, register, and call stack information of AI Core in addition to the lightweight exception dump.
  When configuring this parameter, note that:
  - This parameter is available only for the following products and requires driver version 25.0.RC1 or later:
    Atlas A2 training products / Atlas A2 inference products
    Atlas A3 training products / Atlas A3 inference products
    You can download the driver installation package of Ascend HDK 25.0.RC1 or later from the Firmware and Drivers page and install or upgrade the driver by referring to the document of the corresponding version.
  - During dump file export, the AI Core where an exception operator is located is suspended, which may affect the execution of other processes on the device. After dump files are exported, the AI Core is automatically restored. Therefore, you are not advised to use aic_err_detail_dump when multiple host-side user service processes share the same device.
  - After dump files are exported, host-side user service processes are forcibly exited. Errors reported during the forcible exit are not used as the input for AI Core problem analysis.
  - If aic_err_detail_dump is configured and dump files are generated but no *.core files are among them, aic_err_detail_dump did not take effect. In this case, aic_err_brief_dump is used instead.
- lite_exception: lightweight exception dump. This value is provided for compatibility with earlier versions and is equivalent to aic_err_brief_dump.
dump_path is an optional parameter, indicating the path for storing exported dump files.
The priority of the dump file storage path is as follows: NPU_COLLECT_PATH environment variable > ASCEND_WORK_PATH environment variable > dump_path in the configuration file > current execution directory of the app.

For details about environment variables, see Environment Variables.
To view the content of an exported dump file, convert the dump file to a NumPy file and then view the NumPy file using Python. For details about the conversion procedure, see "Viewing Dump Files" in Accuracy Debugging Tool Guide.
If dump_scene is set to aic_err_detail_dump, you can use msDebug to view the content of an exported dump file. For details, see Operator Development Tool User Guide.
The dump configuration for exception operators cannot be enabled if the model dump configuration or single-operator dump configuration is enabled.
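The storage path priority described above (NPU_COLLECT_PATH, then ASCEND_WORK_PATH, then dump_path, then the current execution directory) can be expressed as a small lookup. This is a sketch for predicting where exception dump files are likely to land. The function name resolve_exception_dump_dir is hypothetical, and any subdirectories that the runtime may create under the selected base directory are not modeled.

```python
import os


def resolve_exception_dump_dir(config, environ=None, cwd=None):
    """Return the base directory chosen by the documented priority order.

    config is the parsed configuration file (a dict with a "dump" key).
    environ and cwd can be injected for testing; they default to the real
    process environment and working directory.
    """
    environ = os.environ if environ is None else environ

    # 1 and 2: environment variables take precedence over the config file.
    for name in ("NPU_COLLECT_PATH", "ASCEND_WORK_PATH"):
        value = environ.get(name)
        if value:
            return value

    # 3: dump_path from the configuration file, which is optional here.
    dump_path = config.get("dump", {}).get("dump_path")
    if dump_path:
        return dump_path

    # 4: fall back to the current execution directory of the app.
    return os.getcwd() if cwd is None else cwd
```

For example, with neither environment variable set and the configuration shown earlier in this section, the function returns output.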

Example of Overflow/Underflow Operator Dump Configuration

If dump_debug is set to on, the overflow/underflow operator configuration is enabled. The following is an example of the configuration file:

{
    "dump":{
        "dump_path":"output",
        "dump_debug":"on"
    }
}

The details are as follows:

If dump_debug is not set or set to off, the overflow/underflow operator configuration is disabled.
If the overflow/underflow operator configuration is enabled, dump_path must be set to specify the path for storing exported dump files.
After obtaining the exported data files, parse the files by referring to "Overflow/Underflow Operator Data Collection and Analysis" in Accuracy Debugging Tool Guide.
dump_path can be either absolute or relative.
- An absolute path starts with a slash (/), for example, /home.
- A relative path starts with a directory name, for example, output.
This function cannot be enabled when model or single-operator dump configuration is enabled. Otherwise, an error is returned.
Only overflow/underflow data of AI Core operators can be collected.

Dump Watch Configuration for Operators

Set dump_scene to watcher to enable dump watch for operators. Below is an example of the content in the configuration file. The configuration effect is as follows: (1) After operators A and B are executed, the output of operators C and D is dumped; (2) After operators C and D are executed, the output of operators C and D is also dumped. The dump files of operators C and D in (1) will be compared with those in (2) to check whether operator A or B overwrites the output memory of operator C or D.

{
    "dump":{
        "dump_list":[
            {
                "layer":["A", "B"],
                "watcher_nodes":["C", "D"]
            }
        ],
        "dump_path":"/home/",
        "dump_mode":"output",
        "dump_scene":"watcher"
    }
}

The details are as follows:

If the operator dump watch mode is enabled, the overflow/underflow operator dump (by configuring the dump_debug parameter) or the single-operator model dump (by configuring the dump_op_switch parameter) cannot be enabled. Otherwise, an error will be reported. Dump watch cannot be applied in the single-operator API dump scenario.
In dump_list, the layer parameter is used to configure the names of the operators that may overwrite the memory of other operators, and the watcher_nodes parameter is used to configure the names of the operators with accuracy issues possibly due to output memory being overwritten by other operators.
- If layer is unspecified, the output of the operators configured for watcher_nodes is dumped after all operators that support dump in the model are executed.
- If an operator configured for layer and watcher_nodes is not in the static graph and static subgraph, the configuration does not take effect.
- If the same operator name is configured for both layer and watcher_nodes, or an operator configured for layer is a collective communication operator (the operator type starts with Hcom, for example, HcomAllReduce), only the dump file of the operator configured for watcher_nodes is exported.
- For a fusion operator, the name configured for watcher_nodes must be the name of the operator after fusion. If the name of an operator before fusion is configured, no dump file will be exported.
- Currently, model_name cannot be configured in dump_list.
If the operator dump watch mode is enabled, dump_path, which is the path for storing the exported dump file, must be configured.
The exported dump files cannot be viewed using a text tool. To view the content of a dump file, convert the dump file to a NumPy file and then view the NumPy file using Python. For details about the conversion procedure, see "Viewing Dump Files" in Accuracy Debugging Tool Guide.
dump_path can be either absolute or relative.
- An absolute path starts with a slash (/), for example, /home.
- A relative path starts with a directory name, for example, output.
dump_mode is used to specify the data of the operators configured for watcher_nodes to be exported. Currently, only output can be configured.
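The dump watch restrictions listed above can be checked before the configuration file is passed to acl.mdl.set_dump. The following is a minimal sketch that covers only the rules stated in this section. The function name check_watcher_config is hypothetical, and the check that treats a missing dump_mode as output relies on the default given in Table 1.

```python
def check_watcher_config(config):
    """Return a list of rule violations; an empty list means none were found.

    config is the parsed configuration file (a dict with a "dump" key).
    """
    dump = config.get("dump", {})
    if dump.get("dump_scene") != "watcher":
        return []  # dump watch mode is not enabled, nothing to check

    problems = []
    if dump.get("dump_debug") == "on":
        problems.append("dump_debug cannot be on in dump watch mode")
    if dump.get("dump_op_switch") == "on":
        problems.append("dump_op_switch cannot be on in dump watch mode")
    if not dump.get("dump_path"):
        problems.append("dump_path must be configured in dump watch mode")
    if dump.get("dump_mode", "output") != "output":
        problems.append("dump_mode supports only output in dump watch mode")
    for index, entry in enumerate(dump.get("dump_list", [])):
        if "model_name" in entry:
            problems.append(f"dump_list[{index}]: model_name cannot be configured")
    return problems
```

Applied to the example configuration in this section, the function returns an empty list. Adding "dump_debug":"on" to that configuration would yield one reported problem.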
