Preparing .npy Files on the GPU
Precautions
- Before obtaining the dump data or .npy data from the original TensorFlow 1.x network training or online inference, you must have a complete, executable, standard TensorFlow model training or online inference project.
- Regardless of whether the Estimator or session.run mode is used, disable all random functions in the script, including but not limited to shuffle operations on datasets, random initialization of parameters, and the implicit random initialization performed by some operators (such as the dense operator). Ensure that no parameter in the script is initialized randomly.
.npy Data File Generation
You can use the TensorFlow debugger (tfdbg) to generate .npy files. The major steps are as follows:
- Add the debugging configuration option to the TensorFlow training/online inference script.
- If the Estimator mode is used, add the hook of tfdbg as follows:
- Add from tensorflow.python import debug as tf_debug to import the debug module.
- Add the training_hooks=[tf_debug.LocalCLIDebugHook()] code to the location where the EstimatorSpec object instance is generated, that is, the location for constructing the network structure.
Figure 1 Estimator mode
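The Estimator-mode steps above can be sketched as follows. This is a minimal sketch, not the document's exact code: the one-layer model_fn is a hypothetical stand-in for your own network, and only the training_hooks line is prescribed by the steps above.

```python
# Minimal sketch (TensorFlow 1.x): attach the tfdbg hook where the
# EstimatorSpec is created. The one-layer network is a placeholder
# for your own model code.
import tensorflow as tf
from tensorflow.python import debug as tf_debug  # import the debug module

def model_fn(features, labels, mode):
    logits = tf.layers.dense(features, units=10)   # placeholder network
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(
        mode=mode,
        loss=loss,
        train_op=train_op,
        # Drop into the tfdbg CLI on every training step.
        training_hooks=[tf_debug.LocalCLIDebugHook()])
```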
- If the session.run mode is used, set the tfdbg decorator before running as follows:
- Add from tensorflow.python import debug as tf_debug to import the debug module.
- After the session is initialized, add sess = tf_debug.LocalCLIDebugWrapperSession(sess, ui_type="readline").
Figure 2 Session.run mode
- Run the training/online inference script.
- After the training/online inference job pauses, the view enters the tfdbg debugging CLI interaction mode. Run the run command. For more details, run the help command.
tfdbg> run
After the run command is executed, you can run the lt command to list the dumped tensors, run the pt command to view the content of a tensor, and save the data as .npy files. For details, see .npy Data File Collection.
.npy Data File Collection
After the run command is executed, you need to collect .npy files. tfdbg can dump only one tensor at a time. To automatically collect all .npy files, perform the following operations:
- Run the lt > gpu_dump command to temporarily store all tensor names to the gpu_dump file. The command output is as follows:
Wrote output to tensor_name
- Open a new CLI, go to the directory where the gpu_dump file is stored (by default, the directory containing the training/online inference script), and run the following command to generate the commands to be run in the tfdbg CLI.
timestamp=$[$(date +%s%N)/1000] ; cat gpu_dump | awk '{print "pt",$4,$4}' | awk '{gsub("/", "_", $3);gsub(":", ".", $3);print($1,$2,"-n 0 -w "$3".""'$timestamp'"".npy")}'
- Copy all generated tensor storage commands starting with pt, paste them to the tfdbg CLI, and run them to save all .npy files. The files are saved to the directory where the training/online inference script is stored. By default, the .npy files are written using numpy.save(). In the file names, slashes (/) in tensor names are replaced with underscores (_), and colons (:) with periods (.).
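The awk pipeline above can also be reproduced in Python, which makes the name mangling explicit. This is a sketch under the assumption that the tensor name is the fourth whitespace-separated column of each gpu_dump line (as in the lt output); the function name is hypothetical.

```python
# Sketch: generate the tfdbg `pt` commands from a gpu_dump file,
# mirroring the awk pipeline: the tensor name is taken from the 4th
# column, "/" becomes "_" and ":" becomes "." in the file name, and a
# microsecond timestamp is appended.
import time

def make_pt_commands(lines, timestamp=None):
    if timestamp is None:
        # Microseconds, like the shell's $(date +%s%N)/1000.
        timestamp = int(time.time() * 1_000_000)
    commands = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue                       # skip header/summary lines
        tensor = fields[3]                 # e.g. dense/kernel/read:0
        fname = tensor.replace("/", "_").replace(":", ".")
        commands.append("pt %s -n 0 -w %s.%d.npy" % (tensor, fname, timestamp))
    return commands
```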
If the commands cannot be pasted, run the mouse off command in the tfdbg CLI to disable the mouse mode, and then paste again.
- Check whether the names of the generated .npy files comply with the {op_name}.{output_index}.{timestamp}.npy format, as shown in Figure 3.
- If the name of an .npy file exceeds 255 characters because of a long operator name, this operator cannot be compared.
- Because of tfdbg or the operating environment, some .npy file names may not meet the naming requirements. You can manually rename such files according to the naming rules. If a large number of .npy files do not meet the requirements, generate the .npy files again. For details, see Handling Exceptions in the Generated .npy File Names in Batches.
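The naming checks above can be automated with a short script. The regular expression below is an assumption derived from the {op_name}.{output_index}.{timestamp}.npy rule, not part of tfdbg, and the function name is hypothetical.

```python
# Sketch: flag .npy files whose names do not match the expected
# {op_name}.{output_index}.{timestamp}.npy format, or that exceed the
# 255-character limit (such operators cannot be compared).
import re

# Assumed pattern: op name, then a numeric output index, then a numeric
# timestamp, then the .npy suffix.
NPY_NAME = re.compile(r"^(?P<op_name>.+)\.(?P<output_index>\d+)\.(?P<timestamp>\d+)\.npy$")

def npy_name_is_valid(filename):
    """Return True if the file name can be used for comparison."""
    if len(filename) > 255:
        return False          # over-long names are not supported
    return NPY_NAME.match(filename) is not None
```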
