Preparing Dump Data of an Offline Model

Precautions

  • Before dumping data, build and run the application project of the model to confirm that the project works properly.
  • Dump data is generated during inference, so the data volume grows with the number of inference cycles. You are advised to run inference only once during a data dump. In foundation model training scenarios, dumping a large amount of data typically takes a long time. One solution is to set dump_data to stats to enable the operator statistics function, use the statistics to identify potentially abnormal operators, and then dump only those operators.
  • Dump is not supported inside Docker containers.
  • The aclInit() and aclmdlSetDump() APIs are provided to dump data.

Dump Data Generation

Perform the following steps to dump data of the offline model:

  1. In the inference application project, open the code file that calls aclInit() or aclmdlSetDump(), and obtain the path of the acl.json file passed to the call.

    If aclInit() or aclmdlSetDump() is called with an empty argument, pass it the path of the acl.json file created in step 2. The acl.json path is relative to the location of the binary file generated during project build.

  2. Add the dump configuration to the acl.json file in that directory, in the following format. If the file does not exist, create it in the out directory after the project is built.
    The following is an example of model dump configuration:
    {
        "dump":{
            "dump_list":[
                {
                    "model_name":"ResNet-101"
                },
                {
                    "model_name":"ResNet-50",
                    "layer":[
                        "conv1conv1_relu",
                        "res2a_branch2ares2a_branch2a_relu",
                        "res2a_branch1",
                        "pool1"
                    ]
                }
            ],
            "dump_path":"$HOME/output",
            "dump_mode":"output",
            "dump_op_switch":"off",
            "dump_data":"tensor"
        }
    }
    

    The following is an example of dump configuration of the single-operator model execution mode in the single-operator dump scenario:

    {
        "dump":{
            "dump_path":"output",
            "dump_list":[],
            "dump_op_switch":"on",
            "dump_data":"tensor"
        }
    }
    

    The following is an example of dump configuration of the single-operator API execution mode in the single-operator dump scenario:

    {
        "dump":{
            "dump_path":"output",
            "dump_list":[], 
            "dump_data":"tensor"
        }
    }
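
The three configurations above differ in only a few fields, so they can be produced by one small helper. The following Python sketch shows one possible way to do that. The helper name build_dump_config and its parameters are hypothetical and are not part of any Ascend API; only the JSON keys and values come from the examples above.

```python
import json


def build_dump_config(dump_path, dump_list=None, dump_mode=None,
                      dump_op_switch=None, dump_data="tensor"):
    """Return a dict in the acl.json dump format shown above.

    Hypothetical helper for illustration; only the JSON keys are taken
    from the documented examples.
    """
    dump = {
        "dump_path": dump_path,
        # dump_list stays empty in the single-operator scenarios.
        "dump_list": list(dump_list or []),
        "dump_data": dump_data,
    }
    # Optional keys are written only when the caller sets them.
    if dump_mode is not None:
        dump["dump_mode"] = dump_mode
    if dump_op_switch is not None:
        dump["dump_op_switch"] = dump_op_switch
    return {"dump": dump}


# Model dump: every operator of ResNet-101, four layers of ResNet-50.
model_cfg = build_dump_config(
    "$HOME/output",
    dump_list=[
        {"model_name": "ResNet-101"},
        {"model_name": "ResNet-50",
         "layer": ["conv1conv1_relu", "res2a_branch2ares2a_branch2a_relu",
                   "res2a_branch1", "pool1"]},
    ],
    dump_mode="output",
    dump_op_switch="off",
)

# Single-operator model execution: empty dump_list, switch set to on.
single_op_model_cfg = build_dump_config("output", dump_op_switch="on")

# Single-operator API execution: empty dump_list, no dump_op_switch key.
single_op_api_cfg = build_dump_config("output")

if __name__ == "__main__":
    # Print one configuration as it would appear in acl.json.
    print(json.dumps(single_op_model_cfg, indent=4))
```

Writing the chosen dict to the acl.json file with json.dump() is left to the caller, because the file location depends on where the project binary is built.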
    
    Table 1 Format of the acl.json file

    Parameter

    Description

    dump_list

    (Required) List of network-wide models for data dump.

    Add one configuration entry for each model to be dumped. If multiple models need to be dumped, separate the entries with commas (,).

    In the single-operator calling scenario (including single-operator model execution and single-operator API execution), dump_list is empty.

    model_name

    Model name. The value of model_name of each model must be unique.

    • To load a model from a file, enter the model file name without the file name extension. You can also set this parameter to the value of the outermost name field in the .json file after ATC-based model conversion.
    • To load a model from memory, set this parameter to the value of the name field in the .json file after ATC-based model conversion.

    layer

    Names of the operators to be dumped. You are advised to dump only the operators you need. Otherwise, excessive data may cause timeouts if the I/O performance is poor. A name can be either the operator name after ATC model conversion or the name of the original operator before conversion.

    • Enter one operator name per line and separate the names with commas (,).
    • model_name is optional. If it is not set, the specified operators are dumped for all models. If it is set, the specified operators are dumped for that model only.
    • If an input of a specified operator comes from a data operator, the data operator information is dumped as well. To dump a data operator, therefore, specify its downstream nodes.
    • To dump all operators of a model, omit the layer field.

    dump_path

    (Required) Directory for storing dump data files in the operating environment. The directory must be created in advance and the running user configured during installation must have the read and write permissions on the directory.

    The path can be either absolute or relative.
    • An absolute path starts with a slash (/), for example, $HOME/output.
    • A relative path starts with a directory name, for example, output.

    dump_mode

    Dump mode.

    • input: dumps operator inputs only.
    • output (default): dumps operator outputs only.
    • all: dumps both operator inputs and outputs.

      Note: Some operators, such as the collective communication operators HcomAllGather and HcomAllReduce, modify their input data during execution. When this parameter is set to all, the system therefore dumps the operator input before the operator is executed and dumps the operator output afterward. As a result, the input and output data of the same operator are flushed to drives separately and multiple dump files are generated. After parsing the dump files, you can determine whether the data is an input or an output based on the file content.

    dump_level

    Dump data level. The options are as follows:

    • op: dumps data at the operator level.
    • kernel: dumps data at the kernel level.
    • all (default): dumps both op and kernel level data.

    The default value generates a large number of dump files, for example, dump files whose names start with aclnn. If dump performance matters or memory resources are limited, set this parameter to op to improve dump performance and reduce the number of dump files.

    NOTE:

    An operator represents operation logic (for example, addition, subtraction, multiplication, or division). A kernel is the implementation of that operation logic and needs a specific computing device to perform the computation.

    dump_op_switch

    Dump data switch of the single-operator model execution mode in the single-operator dump scenario.

    • on: enables dump for the single-operator model.
    • off (default): disables dump for the single-operator model.

    dump_step

    Iterations to dump. This parameter is not required in the inference scenario.

    If this parameter is not configured, dump data will be generated for all iterations by default, which may result in a large amount of data. You are advised to specify iterations as required.

    Separate multiple iterations using vertical bars (|), for example, 0|5|10. You can also use hyphens (-) to specify the iteration range, for example, 0|3-5|10.

    Configuration example:

    {
        "dump":{
            "dump_list":[
                ......
            ],
            "dump_path":"$HOME/output",
            "dump_mode":"output",
            "dump_op_switch":"off",
            "dump_step":"0|3-5|10"
        }
    }
    
    NOTE:

    In the training scenario, the iterations to dump can be specified both by the dump_step parameter in acl.json and by the ge.exec.dumpStep parameter of the GEInitialize API. If both are configured, the one configured last takes effect. For details about the GEInitialize API, see "GEInitialize" in the Ascend Graph Developer Guide.
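
To check which iterations a dump_step value selects before running a long job, the syntax described above can be expanded offline. The following Python sketch is illustrative only. The function name expand_dump_step is hypothetical, and the sketch assumes that a range such as 3-5 includes both end points, which the description above does not state explicitly.

```python
def expand_dump_step(spec):
    """Expand a dump_step string such as "0|3-5|10" into sorted iteration numbers.

    Assumption: a hyphenated range includes both of its end points.
    """
    steps = set()
    for part in spec.split("|"):
        part = part.strip()
        if "-" in part:
            # A range: take every iteration from start to end.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range in dump_step: {part!r}")
            steps.update(range(start, end + 1))
        else:
            # A single iteration number.
            steps.add(int(part))
    return sorted(steps)


if __name__ == "__main__":
    print(expand_dump_step("0|3-5|10"))  # prints [0, 3, 4, 5, 10]
```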

    dump_data

    Type of the operator dump content. The options are as follows:

    • tensor (default): dumps operator data.
    • stats: dumps operator statistics. The result file is in .csv format and contains the operator name, input/output data type, maximum value, and minimum value.

    Dumping a large amount of data typically requires a significant amount of time. One solution is to first dump operator statistics, use the statistics to identify potentially abnormal operators, and then proceed to dump the data of the identified operators.

    In the model dump scenario, the information of operator input or output or both can be collected based on the configuration of dump_mode.
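
A typo in an enumerated value or a missing required key is easy to overlook in a hand-edited acl.json file. The following Python sketch checks a loaded configuration against the values listed in Table 1. It is a hypothetical offline helper (check_dump_config is not an Ascend API), and it covers only the rules stated in the table, not everything the runtime may enforce.

```python
# Allowed values for the enumerated parameters, as listed in Table 1.
ALLOWED_VALUES = {
    "dump_mode": {"input", "output", "all"},
    "dump_level": {"op", "kernel", "all"},
    "dump_op_switch": {"on", "off"},
    "dump_data": {"tensor", "stats"},
}


def check_dump_config(config):
    """Return a list of problems found in an acl.json-style dict (empty if none)."""
    dump = config.get("dump")
    if not isinstance(dump, dict):
        return ["missing top-level 'dump' object"]

    problems = []
    # dump_path and dump_list are marked (Required) in Table 1.
    for key in ("dump_path", "dump_list"):
        if key not in dump:
            problems.append(f"missing required key: {key}")

    # Enumerated parameters must use one of the documented values.
    for key, allowed in ALLOWED_VALUES.items():
        if key in dump and dump[key] not in allowed:
            problems.append(f"{key}={dump[key]!r} is not one of {sorted(allowed)}")

    # model_name must be unique across the dump_list entries.
    entries = dump.get("dump_list", [])
    if not isinstance(entries, list):
        problems.append("dump_list must be a list")
        entries = []
    names = [entry["model_name"] for entry in entries
             if isinstance(entry, dict) and "model_name" in entry]
    for name in sorted(set(names)):
        if names.count(name) > 1:
            problems.append(f"duplicate model_name: {name}")
    return problems


if __name__ == "__main__":
    sample = {"dump": {"dump_path": "output", "dump_list": [], "dump_data": "tensor"}}
    print(check_dump_config(sample))  # prints []
```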

  3. Run the application to generate dump data files. The path and format of the generated dump data files are described as follows.

    Dump file path: {dump_path}/{time}/{deviceid}/{model_name}/{model_id}/{data_index}/{dump file}

    For a single-operator model, the dump path is {dump_path}/{time}/{deviceid}/{dump file}.

    Table 2 Path format of a dump file

    Path Key   | Description                                                                              | Note
    dump_path  | Dump path configured in the acl.json file.                                               | -
    time       | Dump time.                                                                               | Formatted as YYYYMMDDHHMMSS.
    deviceid   | Device ID.                                                                               | -
    model_name | Model name.                                                                              | Periods (.), forward slashes (/), backslashes (\), and spaces in model_name are replaced with underscores (_).
    model_id   | Model ID.                                                                                | -
    data_index | Execution sequence number of each task, starting at 0 and incremented by 1 on each dump. | -
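
The path format above can be assembled programmatically, for example to locate the output of a particular run from a script. The following Python sketch is a minimal illustration. The function names and the sample values (time stamp, device ID, model ID, and data index) are hypothetical, while the path layout and the model_name character replacement follow the description above.

```python
import re
from pathlib import PurePosixPath


def sanitize_model_name(model_name):
    """Replace periods, slashes, backslashes, and spaces with underscores."""
    return re.sub(r"[./\\ ]", "_", model_name)


def dump_dir(dump_path, time, device_id, model_name=None, model_id=None,
             data_index=None):
    """Build the directory that holds the dump files.

    With model_name omitted, the shorter single-operator layout
    {dump_path}/{time}/{deviceid} is returned.
    """
    base = PurePosixPath(dump_path) / time / str(device_id)
    if model_name is None:
        return base
    return base / sanitize_model_name(model_name) / str(model_id) / str(data_index)


if __name__ == "__main__":
    # Hypothetical values chosen only to show the layout.
    print(dump_dir("/home/user/output", "20240101120000", 0, "ResNet-50", 1, 0))
    print(dump_dir("output", "20240101120000", 0))
```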

    The dump data file is named in the format of {op_type}.{op_name}.{task_id}.{stream_id}.{timestamp}.

    • A dot (.), slash (/), backslash (\), or space in op_type and op_name in the dump file will be converted to an underscore (_).
    • If the length of a file name exceeds the OS file name length limit (generally 255 characters), the dump file is renamed to a string of random digits. For details about the mapping, see the mapping.csv file in the same directory.
    • During graph execution, the following operators do not generate dump data:
      • Operators that are not delivered to the device for execution, such as conditional operators (if/while/for/case), data operators (Data/RefData/Const), and data flow operators (StackPush/StackPop/Concat/Split).
      • Operators that GE marks during graph optimization so that they are not delivered to the device for execution. The _no_task attribute of these operators is true in the dump graph.
      • Operators that are not finally executed in the graph.
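
Because dots inside op_type and op_name are converted to underscores, a dump file name can be split on the dot character to recover its five fields. The following Python sketch illustrates this. The names DumpFileName and parse_dump_file_name and the sample file name are hypothetical, and the sketch assumes that task_id, stream_id, and timestamp contain no dots. Files that were renamed to random digits do not match the format and must be resolved through mapping.csv instead.

```python
from typing import NamedTuple


class DumpFileName(NamedTuple):
    """Fields of a dump file name, kept as strings exactly as they appear."""
    op_type: str
    op_name: str
    task_id: str
    stream_id: str
    timestamp: str


def parse_dump_file_name(file_name):
    """Split {op_type}.{op_name}.{task_id}.{stream_id}.{timestamp} into fields."""
    parts = file_name.split(".")
    if len(parts) != 5:
        # Renamed files (random digits) end up here; look them up in mapping.csv.
        raise ValueError(f"not in the documented dump file name format: {file_name!r}")
    return DumpFileName(*parts)


if __name__ == "__main__":
    # Hypothetical file name used only to show the field order.
    info = parse_dump_file_name("Conv2D.conv1conv1_relu.12.3.1614717261785536")
    print(info.op_type, info.op_name, info.task_id, info.stream_id, info.timestamp)
```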