Preparing Dump Data and Computational Graphs of a Training Network Running on the Ascend AI Processors

Prerequisites

Before dumping data of a migrated training network, ensure that the model has been developed and built, and that the training project runs successfully.

  • If the training network contains random factors (for example, unfixed random seeds or random data shuffling), remove or fix them before dumping, as sketched after this list.
  • Ensure that your code is the same as the code trained on the GPUs in terms of the network structure, operator, optimizer, and parameter initialization policy. Otherwise, the comparison is meaningless.
  • Do not perform training and validation in the same training script. Otherwise, two groups of dump data are generated and cannot be distinguished from each other.
  • Currently, only AI CPU and AI Core operators can be dumped. Operators such as Huawei Collective Communication Library (HCCL) operators cannot be dumped.
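A common way to remove random factors is to fix the seeds of all random sources before training starts. The following is a minimal sketch for a TensorFlow training script; the helper name and the seed value are illustrative, and network-specific sources of randomness (for example, data shuffling or dropout) still need to be handled in your own code.
    import os
    import random

    import numpy as np
    import tensorflow as tf

    def fix_random_factors(seed=1234):
        # Fix the common Python, NumPy, and TensorFlow random sources so that
        # two runs produce comparable dump data.
        os.environ['PYTHONHASHSEED'] = str(seed)
        random.seed(seed)
        np.random.seed(seed)
        tf.random.set_seed(seed)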

Configuring Dump Parameters

  1. To enable the training script to dump computational graphs, import the os module at the top of the training script and set the DUMP_GE_GRAPH environment variable before building the model. During training, the computational graph files are then saved in the directory where the training script is located.
    import os
    ...
    def main():
        ...
        # Enable dumping of GE computational graphs; this guide uses the value 2.
        os.environ['DUMP_GE_GRAPH'] = '2'
  2. Modify the script to enable the dump function. Add the following information to the corresponding code:
    import npu_device
    
    # Add the following configuration during initialization:
    npu_device.global_options().dump_config.enable_dump = True
    npu_device.global_options().dump_config.dump_path = "/home/HwHiAiUser/output"
    npu_device.global_options().dump_config.dump_step = "0|5|10"
    npu_device.global_options().dump_config.dump_mode = "all"
    npu_device.global_options().dump_config.dump_data = "stats"
    npu_device.global_options().dump_config.dump_layer = "nodename1 nodename2 nodename3"
    Table 1 Parameter description

    dump_config.enable_dump
      Whether to enable data dump.
      • True: enabled. The dump file path is read from dump_path.
      • False (default): disabled.

    dump_config.dump_path
      Dump path. Required if enable_dump is set to True.
      The specified path must be created in advance in the environment (either in a container or on the host) where training is performed, and the running user configured during installation must have read and write permissions on it. The path can be an absolute path or a path relative to the directory where the command is executed.
      • An absolute path starts with a slash (/), for example, /home/HwHiAiUser/output.
      • A relative path starts with a directory name, for example, output.

    dump_config.dump_step
      Iterations to dump. Defaults to None, indicating that all iterations are dumped.
      Separate multiple iterations with vertical bars (|), for example, 0|5|10. You can also use hyphens (-) to specify an iteration range, for example, 0|3-5|10.

    dump_config.dump_mode
      Dump mode. The values are as follows:
      • input: dumps only operator inputs.
      • output (default): dumps only operator outputs.
      • all: dumps both operator inputs and outputs.

    dump_config.dump_data
      Type of operator content to dump.
      • tensor (default): dumps operator data.
      • stats: dumps operator statistics. The result file is in .csv format.
      In large-scale training scenarios, dumping full tensor data takes a long time. You can first dump the statistics of all operators, identify the operators that may be abnormal based on the statistics, and then dump only the input or output data of those operators (see the sketch after this table).

    dump_config.dump_layer
      Names of the operators to dump. Separate multiple operator names with spaces.
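    The stats-first workflow described for dump_config.dump_data can be expressed directly in the initialization code. The following is a minimal sketch using the npu_device interface shown above; the dump path, step list, operator names, and helper name are placeholders to be replaced with values from your environment.
    import npu_device

    def configure_dump(first_pass=True):
        # Minimal sketch of the two-pass workflow: statistics first, then
        # full tensors for the suspect operators only. Call this before the
        # model is built, in the same place as the configuration shown above.
        npu_device.global_options().dump_config.enable_dump = True
        # The path must already exist and be writable by the running user.
        npu_device.global_options().dump_config.dump_path = "/home/HwHiAiUser/output"
        # "|" separates iterations, "-" specifies a range.
        npu_device.global_options().dump_config.dump_step = "0|3-5|10"
        npu_device.global_options().dump_config.dump_mode = "all"
        if first_pass:
            # Pass 1: dump only operator statistics (.csv files).
            npu_device.global_options().dump_config.dump_data = "stats"
        else:
            # Pass 2: dump full tensors, restricted to the operators that the
            # pass-1 statistics flagged as abnormal (names are placeholders).
            npu_device.global_options().dump_config.dump_data = "tensor"
            npu_device.global_options().dump_config.dump_layer = "nodename1 nodename2"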

Generating the Dump File

  1. Run the training script to generate the dump data file and computational graph file.
    • Computational graph file: Files whose names start with ge are the computational graph files generated when DUMP_GE_GRAPH is set to 2. They are stored in the directory where the training script is located.
    • Dump data file: The dump data files are generated in the directory specified by dump_path, that is, in the {dump_path}/{time}/{deviceid}/{model_name}/{model_id}/{data_index} directory. For example, if dump_path is set to /home/HwHiAiUser/output, the dump data files are stored in the /home/HwHiAiUser/output/20200808163566/0/ge_default_20200808163719_121/11/0 directory. A sketch for walking this directory layout follows the notes after Table 2.
    Table 2 Path format of a dump file

    dump_path
      Dump path set in 2 in Configuring Dump Parameters. If a relative path is set, the corresponding absolute path applies.

    time
      Dump time, in the format YYYYMMDDHHMMSS.

    deviceid
      Device ID.

    model_name
      Subgraph name.
      If the model_name directory contains more than one folder, the dump data in the folder with the same name as the computational graph is used.
      Periods (.), forward slashes (/), backslashes (\), and spaces in model_name are replaced with underscores (_).

    model_id
      Subgraph ID.

    data_index
      Iteration of the dump.
      If dump_step is specified, data_index is the same as dump_step. If not, data_index starts at 0 and is incremented by 1 with each dumped iteration.

    dump_file
      Dump file name, in the format {op_type}.{op_name}.{taskid}.{stream_id}.{timestamp}.
      If a file name in this format exceeds the OS file name length limit (generally 255 characters), the dump file is renamed to a string of random digits. For the mapping between original and renamed files, see the mapping.csv file in the same directory.
      Periods (.), forward slashes (/), backslashes (\), and spaces in op_type or op_name are replaced with underscores (_).

    • Dump data is generated in every iteration. If the training dataset is large, the dump data volume of each iteration increases accordingly. You are advised to limit the number of iterations to one.
    • In the multi-device training scenario where more than one Ascend AI Processor is used, the processes defined in the training script do not start at exactly the same time, so multiple timestamp directories are generated when data is dumped.
    • When the command is executed in a Docker container, the generated data is stored inside the container.
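    The directory layout above can be walked with a short script. The following is a minimal sketch, assuming the dump_path configured earlier; the helper name and the printed fields are illustrative, and the file name split relies on the {op_type}.{op_name}.{taskid}.{stream_id}.{timestamp} format described in Table 2.
      import os

      DUMP_PATH = "/home/HwHiAiUser/output"  # dump_path configured earlier

      def list_dump_files(dump_path=DUMP_PATH):
          # Walk {dump_path}/{time}/{deviceid}/{model_name}/{model_id}/{data_index}
          # and print the op_type and op_name of every dump file found.
          for time_dir in sorted(os.listdir(dump_path)):       # e.g. 20200808163566
              root = os.path.join(dump_path, time_dir)
              for dirpath, _, filenames in os.walk(root):
                  for name in filenames:
                      if name == "mapping.csv":
                          continue
                      parts = name.split(".")
                      if len(parts) >= 5:                       # op_type.op_name.taskid.stream_id.timestamp
                          print(dirpath, parts[0], parts[1])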
  2. Select a computational graph file.

    There are a large number of dump graph files whose names start with ge, and multiple folders may exist at the model_name layer of the dump data path. You only need to find the computational graph file and the folder at the model_name layer whose name matches the name of that computational graph. You can use either of the following methods to quickly find the required file:

    • Method 1: Search for the keyword Iterator in all dump files whose names end with _Build.txt. Record the name of the computational graph file, which will be used for accuracy comparison.
      grep Iterator *_Build.txt

      In the command output of this example, the ge_proto_00292_Build.txt file is the desired computational graph file.

    • Method 2: Save the TensorFlow model as a PB file, view the model, select the name of a compute operator as the keyword, and find the computational graph file that contains that keyword. The value of the name field in that computational graph file is the name of the computational graph.
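    For either method, the keyword search over the graph files can also be scripted. The following is a minimal sketch, assuming the graph files are in the current directory; pass Iterator as the keyword for Method 1 or an operator name for Method 2. The helper name is illustrative.
      import glob

      def find_graph_files(keyword="Iterator"):
          # Return the *_Build.txt graph files that contain the keyword.
          matches = []
          for path in sorted(glob.glob("*_Build.txt")):
              with open(path, errors="ignore") as f:
                  if keyword in f.read():
                      matches.append(path)
          return matches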
  3. Select the dump data file.
    1. Open the computational graph file found in 2 and record the value of the name field in the first graph. In the following example, record the value ge_default_20201209083353_71.
      graph {
        name: "ge_default_20201209083353_71"
        op {
          name: "atomic_addr_clean0_71"
          type: "AtomicAddrClean"
          attr {
            key: "_fe_imply_type"
            value {
              i: 6
            }
          }
        }
        ...
      }
    2. Go to the dump data directory named after the timestamp. Under the device ID directory, one or more model_name folders exist.

    3. Find the folder whose name is the recorded value, for example, ge_default_20201209083353_71. The files in the folder are the required dump data files.
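    Substeps 1 and 3 can also be scripted. The following is a minimal sketch, assuming the graph file selected in 2 and a timestamp directory under the dump path; the helper name and the simple line-based extraction of the name field are illustrative.
      import os
      import re

      def find_dump_folder(graph_file, timestamp_dir):
          # Read the graph name from the first name field of the graph file,
          # then locate folders with that name under the timestamp directory.
          graph_name = None
          with open(graph_file, errors="ignore") as f:
              for line in f:
                  m = re.search(r'name:\s*"([^"]+)"', line)
                  if m:
                      graph_name = m.group(1)
                      break
          matches = []
          for dirpath, dirnames, _ in os.walk(timestamp_dir):
              for d in dirnames:
                  if d == graph_name:
                      matches.append(os.path.join(dirpath, d))
          return graph_name, matches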