Preparing .npy Data of a Trained TensorFlow 1.x Network Generated on GPUs

Prerequisites

  • Before generating the dump data or .npy data of a trained TensorFlow 1.x network, a complete, executable, standard TensorFlow model training project is required. For details about how to prepare the GPU training environment, see Quickly Creating a GPU Training Environment on an ECS. The content in the link is for reference only.
  • Regardless of whether the Estimator or session.run mode is used, disable all random functions in the script, including but not limited to shuffle operations on datasets, random initialization of parameters, and the implicit random initialization performed by some operators (such as the dense operator). That is, ensure that no parameter in the script is initialized randomly.
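The requirement above can be sketched as follows. The dataset, layer, and shapes are illustrative; the snippet assumes the tf.compat.v1 layer of TensorFlow 2.x (on native TensorFlow 1.x, import tensorflow directly):

```python
# Sketch of a deterministic TF 1.x-style script (illustrative shapes/layers).
import numpy as np
import tensorflow.compat.v1 as tf  # on native TF 1.x: import tensorflow as tf

tf.disable_eager_execution()
tf.set_random_seed(0)  # fix the graph-level seed

features = np.ones((8, 4), dtype=np.float32)
# Deterministic input pipeline: note the absence of dataset.shuffle(...).
dataset = tf.data.Dataset.from_tensor_slices(features).batch(4)

inputs = tf.placeholder(tf.float32, shape=(None, 4))
# tf.layers.dense initializes its kernel randomly by default; pin it down
# with an explicit, deterministic initializer.
logits = tf.layers.dense(inputs, 2, kernel_initializer=tf.zeros_initializer())
```

With every initializer fixed, two runs of the same script produce identical dump data, which is what makes a later accuracy comparison meaningful.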

Generating the .npy File

You can use the TensorFlow debugger (tfdbg) to generate .npy files. The major steps are as follows:

  1. Add the debugging configuration option to the TensorFlow training project script.
    • If the Estimator mode is used, add the hook of tfdbg as follows:
      1. Add from tensorflow.python import debug as tf_debug to import the debug module.
      2. Add training_hooks=[tf_debug.LocalCLIDebugHook()] to the call that creates the EstimatorSpec instance, that is, the place where the network structure is constructed.
      Figure 1 Estimator mode
    • If the session.run mode is used, set the tfdbg decorator before running as follows:
      1. Add from tensorflow.python import debug as tf_debug to import the debug module.
      2. After the session is initialized, add sess = tf_debug.LocalCLIDebugWrapperSession(sess, ui_type="readline").
      Figure 2 Session.run mode
  2. Run the training script.
  3. After the training script starts, it pauses and enters the tfdbg interactive CLI. Run the run command to proceed to the next training step.
    For more commands, run help.
    tfdbg> run

    After the run command is executed, the result of the first training step is returned. You can run the lt command to list the dumped tensors, run the pt command to view a tensor's content, and save the data as an .npy file.
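The dumped files are ordinary NumPy .npy files written with numpy.save(), so they can be reloaded with numpy.load() for later comparison. A minimal sketch, in which the tensor value and file name are purely illustrative:

```python
import numpy as np

# tfdbg's "pt <tensor> -w <file>.npy" stores the tensor with numpy.save();
# the dump can be reloaded with numpy.load() for comparison.
tensor_value = np.array([[0.1, 0.2], [0.3, 0.4]], dtype=np.float32)

# Illustrative file name following {op_name}.{output_index}.{timestamp}.npy
np.save("dense_MatMul.0.1700000000000000.npy", tensor_value)

restored = np.load("dense_MatMul.0.1700000000000000.npy")
assert np.array_equal(tensor_value, restored)  # exact round trip
```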

Collecting the .npy File

After the run command is executed, you need to collect .npy files. tfdbg can dump only one tensor at a time. To automatically collect all .npy files, perform the following operations:

  1. Run the lt > gpu_dump command to temporarily store all tensor names to the gpu_dump file. The command output is as follows:
    Wrote output to gpu_dump
  2. Open a new CLI, go to the directory where the gpu_dump file is stored (by default, the directory where the training script is located), and run the following command to generate the pt commands to be executed in tfdbg:
    timestamp=$[$(date +%s%N)/1000] ; cat gpu_dump | awk '{print "pt",$4,$4}' | awk '{gsub("/", "_", $3);gsub(":", ".", $3);print($1,$2,"-n 0 -w "$3".""'$timestamp'"".npy")}'
  3. Copy all generated commands starting with pt and paste them to the tfdbg CLI. Run the commands to save all .npy files. The files are saved to the directory where the training script is stored.
    By default, .npy files are stored using numpy.save(). In the file names, slashes (/) in tensor names are replaced by underscores (_), and colons (:) are replaced by periods (.).

    If the command cannot be pasted on the CLI, run the mouse off command in the tfdbg command line to disable the mouse mode before pasting again.

  4. Check whether names of the generated .npy files comply with the naming rules, as shown in Figure 3.
    • An .npy file is named in the format {op_name}.{output_index}.{timestamp}.npy, where op_name must match the regular expression [A-Za-z0-9_-]+, output_index is a number, and timestamp must match the regular expression [0-9]{1,255}.
    • If the name of an .npy file exceeds 255 characters due to a long operator name, comparison of this operator is not supported.
    • The name of some .npy files may not meet the naming requirements due to the tfdbg or operating environment. You can manually rename the files based on the naming rules. If there are a large number of .npy files that do not meet the requirements, generate .npy files again by referring to How Do I Handle Exceptions in the Generated .npy File Names in Batches?
    Figure 3 Viewing the .npy files
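The one-line shell pipeline in step 2 can also be sketched in Python, which makes the field handling and name mangling explicit. This sketch assumes that each line of gpu_dump carries the tensor name in its fourth whitespace-separated column (as the awk program's $4 implies); the sample lt line is illustrative:

```python
import re
import time

def generate_pt_commands(lt_lines, timestamp_us):
    """Turn `lt` output lines into tfdbg `pt ... -w <file>.npy` commands.

    Mirrors the awk pipeline: the tensor name is assumed to be the 4th
    column; in the file name, "/" becomes "_" and ":" becomes ".".
    """
    commands = []
    for line in lt_lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        tensor = fields[3]
        fname = tensor.replace("/", "_").replace(":", ".")
        commands.append(f"pt {tensor} -n 0 -w {fname}.{timestamp_us}.npy")
    return commands

# Naming rule from step 4: {op_name}.{output_index}.{timestamp}.npy
NPY_NAME = re.compile(r"^[A-Za-z0-9_-]+\.\d+\.[0-9]{1,255}\.npy$")

timestamp_us = int(time.time() * 1_000_000)  # like $[$(date +%s%N)/1000]
cmds = generate_pt_commands(
    ["-  120  [2,2]  dense/MatMul:0"], timestamp_us)  # illustrative lt line
```

Checking each generated file name against NPY_NAME before pasting the commands into tfdbg catches most of the naming problems described in step 4 up front.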