Training Status Monitoring
Prerequisites
- You have completed the steps in Environment Setup.
- You have completed the steps in Pre-Training Configuration Check.
Procedure
- Create a configuration file. This procedure uses the weight gradient monitoring function as the example.
Create a monitor_config.json file in the directory that contains the training script, and copy the following content into it:
    {
        "targets": {
        },
        "wg_distribution": true,
        "format": "csv",
        "ops": ["norm", "min", "max", "nans"]
    }
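If you prefer to generate the file from a script rather than paste it by hand, the step above can be sketched in Python as follows. This is a minimal sketch: the helper names write_monitor_config and load_monitor_config are hypothetical and are not part of the monitoring tool. It writes the same content and reloads it to confirm that the file is valid JSON.

```python
import json
from pathlib import Path

# Same content as the monitor_config.json shown above.
MONITOR_CONFIG = {
    "targets": {},
    "wg_distribution": True,
    "format": "csv",
    "ops": ["norm", "min", "max", "nans"],
}


def write_monitor_config(directory="."):
    """Write monitor_config.json into `directory` and return its path.

    Hypothetical helper for illustration; not part of the monitoring tool.
    """
    path = Path(directory) / "monitor_config.json"
    path.write_text(json.dumps(MONITOR_CONFIG, indent=4) + "\n", encoding="utf-8")
    return path


def load_monitor_config(path):
    """Reload the file to confirm that it parses as valid JSON."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    config_path = write_monitor_config()
    print(load_monitor_config(config_path))
```

Run the script from the directory where the training script is located so that the relative path ./monitor_config.json resolves correctly.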
- Add the tool to the training script.
You can copy the complete code from Code Sample for PyTorch Training Status Monitoring and run it directly. The following excerpt shows only where to add the tool API calls in the script.
    import torch_npu
    from torch_npu.contrib import transfer_to_npu

    monitor = TrainerMon(
        config_file_path="./monitor_config.json",
        params_have_main_grad=False,  # Whether to use main_grad for weights. Typically True (default value) for megatron and False for deepspeed.
    )
    ...
    # switch to train mode
    model.train()

    # Mount monitored objects.
    monitor.set_monitor(
        model,
        grad_acc_steps=1,
        optimizer=optimizer,
        dp_group=None,
        tp_group=None,
        start_iteration=0  # Provide the current iteration for resumable training. The default value is 0.
    )
    ...
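To make the ops list in monitor_config.json more concrete, the following pure-Python sketch computes the four named statistics over a flat list of gradient values. It does not reproduce the tool's implementation. It assumes that "norm" means the L2 norm and that "nans" means the number of NaN entries, and it simplifies by computing norm, min, and max over the non-NaN entries only. The function name gradient_stats is hypothetical.

```python
import math


def gradient_stats(values):
    """Return norm, min, max, and NaN count for a flat list of gradient values.

    Assumptions (not confirmed by the tool documentation): "norm" is the
    L2 norm and "nans" is the number of NaN entries. NaN entries are
    excluded before norm, min, and max are computed.
    """
    valid = [v for v in values if not math.isnan(v)]
    return {
        "norm": math.sqrt(sum(v * v for v in valid)),
        "min": min(valid) if valid else float("nan"),
        "max": max(valid) if valid else float("nan"),
        "nans": len(values) - len(valid),
    }


if __name__ == "__main__":
    print(gradient_stats([3.0, -4.0, float("nan")]))
```

A non-zero NaN count, or a norm that grows sharply between iterations, is the kind of signal these statistics are typically used to catch during accuracy debugging.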
- Run the training script.
python pytorch_main.py -a resnet50 -b 32 --gpu 1 --dummy
- Check the results.
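Because the configuration sets "format" to "csv", one way to review the results is to scan the generated CSV files with a short script. The sketch below is generic and rests on assumptions: this section does not specify the output directory or the column names, so both are passed in by the caller, and the function name rows_with_nonzero is hypothetical. Check your actual output files for the real layout before relying on it.

```python
import csv
from pathlib import Path


def rows_with_nonzero(csv_dir, column):
    """Yield (file name, row) for every CSV row whose `column` is non-zero.

    `csv_dir` and `column` are supplied by the caller because the output
    location and column names are not specified here. Rows where the
    column is missing or not numeric are skipped.
    """
    for csv_path in sorted(Path(csv_dir).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                try:
                    value = float(row.get(column, ""))
                except (TypeError, ValueError):
                    continue
                if value != 0:
                    yield csv_path.name, row
```

For example, if your output files turn out to contain a column that holds the NaN count, passing that column name lists every row that recorded at least one NaN.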
