Training Status Monitoring

Prerequisites

You have completed the steps in Environment Setup.
You have completed the steps in Pre-Training Configuration Check.

Procedure

Create a configuration file. This procedure uses weight gradient monitoring as an example.
Create a monitor_config.json file in the same directory as the training script and copy the following content into it:
{
    "targets": {
    },
    "wg_distribution": true,
    "format": "csv",
    "ops": ["norm", "min", "max", "nans"]
}
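
If you prefer to generate the file from a script instead of copying it by hand, a minimal sketch follows. The helper name write_monitor_config is hypothetical and is not part of the monitoring tool; the keys and values are exactly those shown in the configuration above.

```python
import json
from pathlib import Path


def write_monitor_config(directory="."):
    """Write monitor_config.json into `directory` and return its path.

    Hypothetical helper for illustration only; it simply serializes the
    example configuration from this procedure.
    """
    config = {
        "targets": {},
        "wg_distribution": True,
        "format": "csv",
        "ops": ["norm", "min", "max", "nans"],
    }
    path = Path(directory) / "monitor_config.json"
    # json.dumps converts Python True to JSON true.
    path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return path
```

Calling write_monitor_config() with no argument writes the file to the current working directory, which matches the "./monitor_config.json" path used later in this procedure when the training script is launched from its own directory.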

Add the tool to the training script.

You can copy the complete code from Code Sample for PyTorch Training Status Monitoring and execute it directly. The following examples only show where to add the tool API in the script.

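import torch_npu
from torch_npu.contrib import transfer_to_npu
from msprobe.pytorch import TrainerMon  # Assumed import path; adjust it to your installed tool version.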

monitor = TrainerMon(
   config_file_path="./monitor_config.json",
   params_have_main_grad=False,  # Whether to use main_grad for weights. Typically True (default value) for megatron and False for deepspeed.
) 
...
   # switch to train mode
   model.train()

   # Mount monitored objects.
   monitor.set_monitor(
       model,
       grad_acc_steps=1,
       optimizer=optimizer,
       dp_group=None,
       tp_group=None,
       start_iteration=0  # Provide the current iteration for resumable training. The default value is 0.
   ) 
...
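
As a rough illustration of what the four entries in "ops" mean, the sketch below computes them for a flat list of gradient values in plain Python. It is not the tool's implementation. It assumes that "norm" is the L2 norm and that "nans" is the number of NaN entries, and it skips NaN values when computing the other three statistics, which is a simplification of this sketch only.

```python
import math


def summarize_gradient(values):
    """Return norm, min, max and NaN count for a flat list of floats.

    Illustrative only: assumes "norm" means the L2 norm and excludes NaN
    entries from norm/min/max so that the results stay well defined.
    """
    finite = [v for v in values if not math.isnan(v)]
    return {
        "norm": math.sqrt(sum(v * v for v in finite)),
        "min": min(finite) if finite else math.nan,
        "max": max(finite) if finite else math.nan,
        "nans": len(values) - len(finite),
    }
```

A nonzero "nans" value, or a norm that grows sharply from one step to the next, is the kind of signal this monitoring is meant to surface.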

Run the training script.

python pytorch_main.py -a resnet50 -b 32 --gpu 1 --dummy

Check the results.
After training completes, a monitor_output directory is generated in the current path. Each run writes its results to a separate timestamped subdirectory, so view the files in the most recent one.
Figure 1 Result file

For details about the output result, see Output Path.
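
To locate the most recent results programmatically, a minimal sketch follows. The helper name latest_result_files is hypothetical. It assumes that each run creates one subdirectory under monitor_output and that the result files use the .csv extension, as implied by "format": "csv" in the configuration. It selects the subdirectory by modification time, so it does not depend on the timestamp naming format.

```python
from pathlib import Path


def latest_result_files(output_root="monitor_output", pattern="*.csv"):
    """Return (latest_run_dir, sorted result files) under `output_root`.

    Hypothetical helper: picks the most recently modified subdirectory and
    returns (None, []) if the root or its subdirectories do not exist.
    """
    root = Path(output_root)
    if not root.is_dir():
        return None, []
    run_dirs = [p for p in root.iterdir() if p.is_dir()]
    if not run_dirs:
        return None, []
    latest = max(run_dirs, key=lambda p: p.stat().st_mtime)
    # rglob also finds files in nested subdirectories of the run directory.
    return latest, sorted(latest.rglob(pattern))
```

Run it from the directory where the training script was launched, for example: run_dir, files = latest_result_files().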
