Accuracy Data Collection

This sample trains the ResNet-50 model on dummy (synthetic) data, which avoids the time needed to download a dataset.

Prerequisites

Performing Collection

  1. Add the tool to the training script pytorch_main.py in the GPU and Ascend NPU environments.

    When training in the GPU environment, lines 24 and 25 of the following script (the two torch_npu imports) are not needed.

    You can copy the complete code from Code Sample for PyTorch Accuracy Data Collection and execute it directly. The following examples only show where to add the tool API in the script.
     23
     24 import torch_npu
     25 from torch_npu.contrib import transfer_to_npu
     26
     27 from msprobe.pytorch import PrecisionDebugger, seed_all
     28 seed_all(seed=1234, mode=True)    # Fix the random seeds and enable deterministic computation so that data is consistent across runs.
    ...
    314 def train(train_loader, model, criterion, optimizer, epoch, device, args):
    ...
    331     end = time.time()
    332
    333     debugger = PrecisionDebugger(dump_path="./dump_data", task="tensor", step=[0, 1])
    334     for i, (images, target) in enumerate(train_loader):
    335         debugger.start()
    ...
    356
    357         # measure elapsed time
    358         batch_time.update(time.time() - end)
    359         end = time.time()
    360
    361         debugger.stop()
    362         debugger.step()
    

    Accuracy data can occupy a large amount of disk space, and the server may become unavailable if the disk fills up. The space required depends on the model parameters, the collection configuration, and the number of iterations collected. Ensure that the directory to which accuracy data is written has sufficient free disk space.

  2. Run the training script. The tool collects accuracy data while the model trains.
    python pytorch_main.py -a resnet50 -b 32 --gpu 1 --dummy

    When the following message appears in the log, you can stop the training manually and view the collected data, which saves time.

    ****************************************************************************
    *                        msprobe ends successfully.                        *
    ****************************************************************************
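
The two operational points in the steps above (checking free disk space before collection and stopping training once the completion message appears) can be sketched as a small wrapper. This is only an illustration under stated assumptions: free_gib and run_until_banner are hypothetical helper names, not part of msprobe, and the banner is assumed to appear on the training process's standard output or standard error.

```python
import shutil
import subprocess
import sys
from typing import Sequence

# Text of the completion message shown in the log above.
BANNER = "msprobe ends successfully."


def free_gib(path: str) -> float:
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def run_until_banner(cmd: Sequence[str], banner: str = BANNER) -> bool:
    """Run `cmd`, echo its output, and stop it once `banner` is printed.

    Returns True if the banner was seen, False if the process exited first.
    """
    proc = subprocess.Popen(
        list(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # assumption: the banner may be on either stream
        text=True,
    )
    seen = False
    try:
        for line in proc.stdout:
            sys.stdout.write(line)  # keep the training log visible
            if banner in line:
                seen = True
                break
    finally:
        if proc.poll() is None:
            proc.terminate()  # the manual stop described above
        proc.wait()
        proc.stdout.close()
    return seen
```

For example, you might first compare free_gib(".") against a threshold that suits your model, and then call run_until_banner with the command from step 2 split into an argument list. If the training script starts data loader worker processes, terminating only the parent process may not be enough, so treat this as a starting point rather than a complete solution.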
    

Viewing Results

The following directory structure is created under the path specified by dump_path. Select data for analysis as required.

dump_data/
├── step0
│   └── rank
│       ├── construct.json        # Hierarchical relationships between modules. Empty in the current scenario.
│       ├── dump.json             # Input and output statistics and overflow information of the forward and backward APIs.
│       ├── dump_tensor_data      # Actual input and output tensor data of the forward and backward APIs.
│       │   ├── Functional.adaptive_avg_pool2d.0.backward.input.0.pt
│       │   ├── Functional.adaptive_avg_pool2d.0.backward.output.0.pt
│       │   ├── Functional.adaptive_avg_pool2d.0.forward.input.0.pt
│       │   ├── Functional.adaptive_avg_pool2d.0.forward.output.0.pt
│       │   ...
│       └── stack.json            # Call stack information of the APIs.
├── step1
...

Use tools such as those in Accuracy Pre-Check and Accuracy Comparison to further analyze the collected data.
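
Before moving on to those tools, it can help to list what was collected. The following is a minimal sketch that walks the directory layout shown above and splits each tensor file name into its parts. The naming pattern (API name, call index, forward or backward, input or output, argument index) is inferred only from the four sample file names listed above and is an assumption; TensorFile and iter_tensor_files are hypothetical names, not part of msprobe.

```python
import re
from pathlib import Path
from typing import Iterator, NamedTuple

# Assumed pattern: <api>.<call index>.<forward|backward>.<input|output>.<arg>.pt
NAME_RE = re.compile(
    r"^(?P<api>.+)\.(?P<call>\d+)\.(?P<direction>forward|backward)"
    r"\.(?P<io>[^.]+)\.(?P<arg>.+)$"
)


class TensorFile(NamedTuple):
    step: str        # directory name such as "step0"
    rank: str        # directory name such as "rank"
    api: str         # for example "Functional.adaptive_avg_pool2d"
    call_index: int  # number that follows the API name
    direction: str   # "forward" or "backward"
    io: str          # "input" or "output"
    arg: str         # trailing index in the file name
    path: Path


def iter_tensor_files(dump_path: str) -> Iterator[TensorFile]:
    """Yield one record per tensor file found under `dump_path`."""
    root = Path(dump_path)
    for pt in sorted(root.glob("step*/rank*/dump_tensor_data/*.pt")):
        match = NAME_RE.match(pt.stem)
        if match is None:
            continue  # skip names that do not fit the assumed pattern
        yield TensorFile(
            step=pt.parents[2].name,
            rank=pt.parents[1].name,
            api=match["api"],
            call_index=int(match["call"]),
            direction=match["direction"],
            io=match["io"],
            arg=match["arg"],
            path=pt,
        )
```

Iterating over iter_tensor_files("./dump_data") gives a quick inventory per step and API. Assuming the .pt files are ordinary PyTorch-serialized tensors, an individual file could then be inspected with torch.load.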