Training Status Monitoring

Prerequisites

Procedure

  1. Create a configuration file. This procedure uses weight gradient monitoring as the example.

    Create a file named monitor_config.json in the same directory as the training script and copy the following content into it:

    { 
        "targets": {
        },
        "wg_distribution": true,
        "format": "csv",
        "ops": ["norm", "min", "max", "nans"]
    } 
    
  2. Add the tool to the training script.

    You can copy the complete code from Code Sample for PyTorch Training Status Monitoring and execute it directly. The following excerpt shows only where to add the tool's API calls in the script.

    import torch_npu
    from torch_npu.contrib import transfer_to_npu
    # Assumption: TrainerMon is provided by the msprobe package. This excerpt does not show its import.
    from msprobe.pytorch import TrainerMon

    # Create the monitor from the configuration file written in step 1.
    monitor = TrainerMon(
        config_file_path="./monitor_config.json",
        params_have_main_grad=False,  # Whether to use main_grad for weights. Typically True (default value) for megatron and False for deepspeed.
    )
    ...
        # switch to train mode
        model.train()

        # Mount monitored objects.
        monitor.set_monitor(
            model,
            grad_acc_steps=1,
            optimizer=optimizer,
            dp_group=None,
            tp_group=None,
            start_iteration=0  # Provide the current iteration for resumable training. The default value is 0.
        )
    ...
    
  3. Run the training script.

    python pytorch_main.py -a resnet50 -b 32 --gpu 1 --dummy

  4. Check the results.

    After training completes, a monitor_output directory is generated in the current path. Results are written to timestamped subdirectories. View the files in the most recent one.

    Figure 1 Result file

    For details about the output result, see Output Path.
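
The file handling in steps 1 and 4 can also be scripted. The following is a minimal sketch using only the Python standard library, not part of the monitoring tool itself. The helper names write_monitor_config, latest_result_dir, and list_result_files are hypothetical. The sketch assumes that the most recently modified subdirectory of monitor_output belongs to the latest run, and that results are CSV files because the configuration sets "format" to "csv".

```python
import json
from pathlib import Path

# Configuration from step 1, reproduced as a Python dict.
MONITOR_CONFIG = {
    "targets": {},
    "wg_distribution": True,
    "format": "csv",
    "ops": ["norm", "min", "max", "nans"],
}


def write_monitor_config(script_dir, config=MONITOR_CONFIG):
    """Write monitor_config.json into script_dir and return its path."""
    path = Path(script_dir) / "monitor_config.json"
    path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    # Re-read the file so that a malformed config fails here, not during training.
    json.loads(path.read_text(encoding="utf-8"))
    return path


def latest_result_dir(output_root="monitor_output"):
    """Return the most recently modified subdirectory of output_root, or None."""
    root = Path(output_root)
    if not root.is_dir():
        return None
    subdirs = [p for p in root.iterdir() if p.is_dir()]
    # Assumption: modification time tracks run order. The naming scheme of
    # the timestamped directories is not relied on.
    return max(subdirs, key=lambda p: p.stat().st_mtime, default=None)


def list_result_files(result_dir):
    """Return the CSV files found anywhere under result_dir, sorted by path."""
    return sorted(Path(result_dir).rglob("*.csv"))
```

For example, call write_monitor_config(".") before starting the training script, and call list_result_files(latest_result_dir()) after training completes to locate the newest result files.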