Preparing .npy Data of a Trained TensorFlow 2.x Network Generated on GPUs
Prerequisites
- Before generating the dump data or .npy data of a trained TensorFlow 2.x network, a complete, executable, standard TensorFlow model training project is required. For details about how to prepare the GPU training environment, see Quickly Creating a GPU Training Environment on an ECS. The content in the link is for reference only.
- Install tfdbg_ascend, the TensorFlow 2.x debugger. For details, see the tfdbg_ascend README.
- Disable all random functions in the script, including but not limited to shuffle operations on datasets, random initialization of parameters, and implicit random initialization inside some operators (such as the dense operator), so that no parameter in the script is initialized randomly. A sketch of one way to do this follows this list.
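As an illustration only, the following sketch shows one common way to pin down randomness sources in a TensorFlow 2.x script. The seed values and the zero-initializer choice are assumptions for the example, not requirements of tfdbg_ascend.
import random
import numpy as np
import tensorflow as tf

# Fix every framework-level seed so repeated runs behave identically
# (the seed values are arbitrary).
random.seed(0)
np.random.seed(0)
tf.random.set_seed(0)

# Avoid implicit random initialization in layers such as Dense by
# passing deterministic initializers explicitly.
dense = tf.keras.layers.Dense(
    units=64,
    kernel_initializer=tf.keras.initializers.Zeros(),
    bias_initializer=tf.keras.initializers.Zeros())

# Load datasets without shuffling (do not call dataset.shuffle()).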
Generating the .npy File
You can use the TensorFlow debugger (tfdbg_ascend) to generate .npy files. The major steps are as follows:
- Modify the configuration in the training script (.py file) that calls the model. The sample code is as follows:
Sample 1:
- Import the debug plugin.
import tfdbg_ascend as dbg
- Add the following code before the training startup code of each step. This example dumps the data of the fifth step; a sketch of the surrounding training loop follows the snippet.
dbg.disable()
if current_step == 5:
    dbg.enable()
    dbg.set_dump_path('/home/test/gpu_dump')
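For context, here is a minimal sketch of where this snippet might sit in a custom training loop. The model, optimizer, and dataset are hypothetical placeholders for illustration; only the dbg calls come from tfdbg_ascend.
import tensorflow as tf
import tfdbg_ascend as dbg

# Hypothetical toy model, optimizer, and data for illustration.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD()
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.ones((48, 4)), tf.ones((48, 1)))).batch(8)

for current_step, (x, y) in enumerate(dataset, start=1):
    # Dump only the fifth step; keep the debugger disabled otherwise.
    dbg.disable()
    if current_step == 5:
        dbg.enable()
        dbg.set_dump_path('/home/test/gpu_dump')
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))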
Sample 2:
- Import the debug plugin.
import tfdbg_ascend as dbg
- Dump the data of the fourth step (as an example). If dbg.enable is not configured, the dump function is enabled by default. If no dump path is specified, dump files are saved to the directory containing the training script.
class DumpConfig(tf.keras.callbacks.Callback):
    def __init__(self):
        super().__init__()

    def on_batch_begin(self, batch, logs={}):
        if batch == 4:
            dbg.enable()
            dbg.set_dump_path("/user/name1/pip_pkg/dump4")
        else:
            dbg.disable()
- Register the callback functions (define callbacks).
# define callbacks
callbacks = [
    ModelCheckpoint(
        f'models/model_epochs-{epochs}_batch-{batch_size}_loss-{loss_function}_{Mask2FaceModel.get_datetime_string()}.h5'),
    LossHistory(batch_size),
    DumpConfig()
]
# fit the model: the following code is the location for calling the model.
history = self.model.fit(train_dataset,
                         validation_data=valid_dataset,
                         epochs=1,
                         callbacks=callbacks,
                         verbose=2)
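Sample 2 above comes from a larger project: ModelCheckpoint, LossHistory, and Mask2FaceModel are project-specific. A minimal, self-contained variant that uses only the DumpConfig callback might look like the following sketch; the toy model and data are assumptions for illustration.
import tensorflow as tf
import tfdbg_ascend as dbg

class DumpConfig(tf.keras.callbacks.Callback):
    def on_batch_begin(self, batch, logs=None):
        # Dump only the fourth batch; keep the debugger disabled otherwise.
        if batch == 4:
            dbg.enable()
            dbg.set_dump_path("/user/name1/pip_pkg/dump4")
        else:
            dbg.disable()

# Toy model and data, for illustration only.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer='sgd', loss='mse')
x = tf.ones((64, 4))
y = tf.ones((64, 1))

history = model.fit(x, y, batch_size=8, epochs=1,
                    callbacks=[DumpConfig()], verbose=2)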
- Execute the training script. After the training job is stopped, the *.npy files are generated in the specified directory.
- Check that the names of the generated .npy files comply with the naming rules, as shown in Figure 1.
- An .npy file is named in the format {op_name}.{output_index}.{timestamp}.npy, where op_name must match the regular expression [A-Za-z0-9_-]+, timestamp must match the regular expression [0-9]{1,255}, and output_index is a number.
- If the name of an .npy file exceeds 255 characters due to a long operator name, comparison of this operator is not supported.
- The names of some .npy files may not meet the naming requirements due to tfdbg or the operating environment. You can manually rename such files based on the naming rules. If a large number of .npy files do not meet the requirements, generate the .npy files again by referring to How Do I Handle Exceptions in the Generated .npy File Names in Batches?. A sketch for checking file names in batches follows this list.
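As an illustration, the following sketch checks dump file names against the rules above. The dump directory path is a placeholder; the regular expression is assembled from the format and character rules stated in this section.
import os
import re

# Pattern assembled from the naming rules above:
# {op_name}.{output_index}.{timestamp}.npy
NPY_NAME = re.compile(r'[A-Za-z0-9_-]+\.\d+\.[0-9]{1,255}\.npy')

dump_dir = '/home/test/gpu_dump'  # placeholder path
for name in os.listdir(dump_dir):
    if not name.endswith('.npy'):
        continue
    # Flag names that break the pattern or exceed the 255-character limit.
    if not NPY_NAME.fullmatch(name) or len(name) > 255:
        print('Does not meet the naming rules:', name)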
