Preparing Dump Data and Computational Graphs of a Training Network Running on Ascend AI Processors
Prerequisites
Before dumping data of a migrated training network, ensure that the model has been developed, built, and executed, and that the training project runs successfully.
- If the training network contains random factors, remove them before dumping.
- Ensure that your code is the same as the code trained on the GPUs in terms of the network structure, operator, optimizer, and parameter initialization policy. Otherwise, the comparison is meaningless.
- Do not run training and validation in the same training script. Otherwise, two sets of dump data are generated and cannot be distinguished from each other.
- Currently, only the AI CPU, AI Core, and HCCL operators support data dump.
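One common way to remove random factors, as required above, is to fix every random seed before training so that two runs produce comparable dump data. The sketch below covers Python's built-in randomness; the commented lines indicate where framework seeds would be set for a TF 1.x project (function name and seed value are illustrative):

```python
import os
import random

def fix_random_seeds(seed=1234):
    """Fix common sources of randomness so that two training runs
    produce comparable dump data (a minimal sketch)."""
    os.environ["PYTHONHASHSEED"] = str(seed)  # hash randomization
    random.seed(seed)                         # Python stdlib RNG
    # Framework seeds (uncomment for your setup):
    # numpy.random.seed(seed)
    # tf.set_random_seed(seed)  # TensorFlow 1.x graph-level seed
```

Dropout, data shuffling, and asynchronous data loading may introduce additional randomness that seeding alone does not remove.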
Dump Parameter Configuration
- To enable the training script to dump computational graphs, import the os module in the import section of the training script and set the DUMP_GE_GRAPH environment variable before building the model.
```python
import os
...
def main():
    ...
    os.environ['DUMP_GE_GRAPH'] = '2'
```
During training, the computational graph files are saved in the directory where the training script is located.
- Modify the script to enable the dump function by adding the following configuration at the corresponding positions in the script.
- In Estimator mode, collect dump data using dump_config in NPURunConfig. Before NPURunConfig is created, instantiate a DumpConfig class for dump configuration, including the dump path, iterations to dump, and the dump mode (operator inputs or outputs).
```python
from npu_bridge.estimator.npu.npu_config import DumpConfig
# dump_path: dump path. Create the specified path in advance in the training
#   environment (either in a container or on the host). The running user
#   configured during installation must have read and write permissions on it.
# enable_dump: whether to enable the dump function.
# dump_step: iterations to dump.
# dump_mode: dump mode, selected from input, output, and all.
dump_config = DumpConfig(enable_dump=True, dump_path="/home/HwHiAiUser/output",
                         dump_step="0|5|10", dump_mode="all")
config = NPURunConfig(
    dump_config=dump_config,
    session_config=session_config
)
```
For details about each field in the DumpConfig constructor, see the TensorFlow 1.15 Network Model Porting and Training Guide.
- In session.run mode, set the session configuration options enable_dump, dump_path, dump_step, and dump_mode.
```python
config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
custom_op.parameter_map["enable_dump"].b = True
custom_op.parameter_map["dump_path"].s = tf.compat.as_bytes("/home/HwHiAiUser/output")
custom_op.parameter_map["dump_step"].s = tf.compat.as_bytes("0|5|10")
custom_op.parameter_map["dump_mode"].s = tf.compat.as_bytes("all")
custom_op.parameter_map["dump_data"].s = tf.compat.as_bytes("stats")
custom_op.parameter_map["dump_layer"].s = tf.compat.as_bytes("nodename1 nodename2 nodename3")
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
with tf.Session(config=config) as sess:
    print(sess.run(cost))
```
Table 1 Parameter description
- enable_dump: Whether to enable the dump function. Possible values:
  - True: enabled. The dump file path is read from dump_path.
  - False (default): disabled.
- dump_path: Dump file path. Required if enable_dump is set to True.
  The specified path must be created in advance in the environment (either in a container or on the host) where training is performed. The running user configured during installation must have read and write permissions on this path. The path can be absolute or relative to the directory where the command is executed.
  - An absolute path starts with a slash (/), for example, /home/HwHiAiUser/output.
  - A relative path starts with a directory name, for example, output.
- dump_step: Iterations to dump. Defaults to None, indicating that all iterations are dumped.
  Separate multiple iterations with vertical bars (|), for example, 0|5|10. You can also use hyphens (-) to specify an iteration range, for example, 0|3-5|10.
- dump_mode: Dump mode, which specifies whether operator inputs or outputs are dumped. Possible values:
  - input: dumps only operator inputs.
  - output (default): dumps only operator outputs.
  - all: dumps both operator inputs and outputs.
- dump_data: Type of operator content to dump. Possible values:
  - tensor (default): dumps operator data.
  - stats: dumps operator statistics. The result file is in .csv format.
  In large-scale training scenarios, dumping all tensor data takes a long time. You can first dump the statistics of all operators, identify potentially abnormal operators from the statistics, and then dump only the input or output data of those operators.
- dump_layer: Names of the operators to dump. Separate multiple operator names with spaces.
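The dump_step syntax described above can also be expanded programmatically, for example when post-processing dump directories. The following helper is illustrative only (it is not part of the npu_bridge API):

```python
def parse_dump_step(dump_step):
    """Expand a dump_step string such as "0|3-5|10" into a sorted list
    of iteration numbers (illustrative helper, not an npu_bridge API)."""
    iterations = set()
    for part in dump_step.split("|"):
        if "-" in part:                      # a range such as 3-5
            start, end = part.split("-")
            iterations.update(range(int(start), int(end) + 1))
        else:                                # a single iteration
            iterations.add(int(part))
    return sorted(iterations)
```

For example, parse_dump_step("0|3-5|10") returns [0, 3, 4, 5, 10].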
Dump File Generation
- Run the training script to generate the dump data file and computational graph file.
- Computational graph file: A file whose name starts with ge is a computational graph file generated when DUMP_GE_GRAPH is set to 2. It is stored in the directory where the training script is located.
- Dump data file: The dump data file is generated in the directory specified by dump_path, that is, the {dump_path}/{time}/{deviceid}/{model_name}/{model_id}/{data_index} directory. For example, if dump_path is set to /home/HwHiAiUser/output, the dump data file is stored in the /home/HwHiAiUser/output/20200808163566/0/ge_default_20200808163719_121/11/0 directory.
Table 2 Path format of a dump file

- dump_path: Dump path set in Dump Parameter Configuration. (If a relative path is set, the corresponding absolute path applies.)
- time: Dump time, in the format YYYYMMDDHHMMSS.
- deviceid: Device ID.
- model_name: Subgraph name. If the model_name directory contains more than one folder, use the dump data in the folder whose name is the same as that of the computational graph. Periods (.), forward slashes (/), backslashes (\), and spaces in model_name are replaced with underscores (_).
- model_id: Subgraph ID.
- data_index: Iterations to dump. If dump_step is specified, data_index equals dump_step. Otherwise, data_index starts at 0 and increments by 1 with each dump.
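The directory layout above can be reproduced in code, for example when scripting accuracy comparison. The sketch below (function name is illustrative) builds the expected dump data directory, including the documented sanitization of model_name:

```python
import re

def dump_data_dir(dump_path, time, device_id, model_name, model_id, data_index):
    """Build the {dump_path}/{time}/{deviceid}/{model_name}/{model_id}/{data_index}
    path; periods, slashes, backslashes, and spaces in model_name are
    replaced with underscores, mirroring the documented behavior."""
    safe_name = re.sub(r"[./\\ ]", "_", model_name)
    return "/".join([dump_path, time, str(device_id), safe_name,
                     str(model_id), str(data_index)])
```

For example, with dump_path set to /home/HwHiAiUser/output, this yields paths such as /home/HwHiAiUser/output/20200808163566/0/ge_default_20200808163719_121/11/0.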
- The dump file name format must comply with the naming conventions described in Dump File Naming Conventions. If the length of a file name exceeds the OS file name length limit (generally 255 characters), the dump file is renamed as a string of random digits. For details about the mapping, see the mapping.csv file in the same directory.
- Dump data is generated in every iteration. If the training dataset is large, the dump data volume of each iteration increases accordingly. You are advised to limit the number of dumped iterations to one.
- In the multi-device training scenario where more than one Ascend AI Processor is used, the processes defined in the training script do not start at the same time, so multiple timestamp directories are generated when data is dumped.
- When the command is executed in a Docker container, the generated data is stored in the container.
- Select a computational graph file.
There are many dump graph files whose names start with ge, and multiple folders may exist at the model_name layer in the dump data file. You only need to find the computational graph file and the folder whose model_name is the name of the computational graph. You can use either of the following methods to quickly find the required file:
- Method 1: Search for the keyword Iterator in all dump files whose names end with _Build.txt. Record the name of the computational graph file, which will be used for accuracy comparison.
```shell
grep Iterator *_Build.txt
```

In the command output, the matched file, ge_proto_00292_Build.txt in this example, is the required computational graph file.
- Method 2: Save the TensorFlow model as a PB file, view the model, select the name of a computing operator as the keyword, and find the computational graph file that contains the keyword. The value of the name field in the computational graph is used as the name of the computational graph.
- Select the dump data file.
- Open the computational graph file found in the previous step and record the value of the name field in the first graph. In the following example, the value to record is ge_default_20201209083353_71.
```
graph {
  name: "ge_default_20201209083353_71"
  op {
    name: "atomic_addr_clean0_71"
    type: "AtomicAddrClean"
    attr {
      key: "_fe_imply_type"
      value {
        i: 6
      }
    }
```

- Go to the directory that stores the dump files, named after the timestamp. The following folders exist in the directory:
- Find the folder whose name is the recorded value, for example, ge_default_20201209083353_71. The files in the folder are the required dump data files.

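Reading the name field out of a ge computational graph file (text format) can also be automated. The following regex-based sketch (helper name is illustrative) returns the first name value, which is the graph name to match against the model_name folders:

```python
import re

def first_graph_name(graph_text):
    """Return the value of the first name field in the text of a ge
    computational graph file, e.g. "ge_default_20201209083353_71"."""
    match = re.search(r'name:\s*"([^"]+)"', graph_text)
    return match.group(1) if match else None
```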
