Accuracy Collection

In this sample, the ResNet-50 model is trained on dummy (randomly generated) data.

Prerequisites

Complete Model Development and Migration first, so that you have GPU and NPU environments that can run training jobs correctly.

Environment Setup

Run the following command to install the msProbe tool in both the GPU and NPU environments:

pip3 install mindstudio-probe

Collection

  1. Add the tool to the training script (main.py) in both the GPU and NPU environments.
    When training in the GPU environment, lines 24 and 25 of the following script are not needed (they are NPU-specific).
     23
     24 import torch_npu
     25 from torch_npu.contrib import transfer_to_npu
     26
     27 from msprobe.pytorch import PrecisionDebugger, seed_all
     28 seed_all(seed=1234, mode=True)  # Fix the random seed and enable deterministic computing so that each run of the model produces the same data.
    ...
    310 def train(train_loader, model, criterion, optimizer, epoch, device, args):
    ...
    324     end = time.time()
    325
    326     debugger = PrecisionDebugger(dump_path="./dump_data", task="tensor", step=[0, 1])
    327     for i, (images, target) in enumerate(train_loader):
    328         debugger.start()
    ...
    337         # compute output
    338         output = model(images)
    339         loss = criterion(output, target)
    ...
    347         # compute gradient and do SGD step
    348         optimizer.zero_grad()
    349         loss.backward()
    350         optimizer.step()
    ...
    359         debugger.stop()
    360         debugger.step()
  2. Run the training script. The tool collects accuracy data during model training.
    python main.py -a resnet50 -b 32 --gpu 1 --dummy

    If the following information is displayed in the log, the data for the first two steps has been collected successfully. You can manually stop the training and view the collected data.

    ****************************************************************************
    *                        msprobe ends successfully.                        *
    ****************************************************************************
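The `start()`/`stop()`/`step()` calls in step 1 bracket exactly the iterations listed in the `step` argument (here `[0, 1]`). The following is a plain-Python mock of that collection-window pattern, not the real `PrecisionDebugger` from `msprobe.pytorch`; the class and method bodies here are illustrative only:

```python
# Mock of the start/stop/step collection window; the real debugger is
# msprobe.pytorch.PrecisionDebugger. This class is a hypothetical
# stand-in used only to show when data is collected.
class MockDebugger:
    def __init__(self, steps):
        self.steps = set(steps)   # steps to collect, e.g. {0, 1}
        self.current = 0          # current training step
        self.active = False
        self.collected = []       # stands in for dumped tensor data

    def start(self):
        # Collect only if the current step was requested.
        self.active = self.current in self.steps

    def record(self, name):
        # Called where the loop runs forward/backward computation.
        if self.active:
            self.collected.append((self.current, name))

    def stop(self):
        self.active = False

    def step(self):
        # Advance the step counter, mirroring debugger.step().
        self.current += 1


debugger = MockDebugger(steps=[0, 1])
for i in range(4):                # four training iterations
    debugger.start()
    debugger.record("forward")    # model(images); criterion(output, target)
    debugger.record("backward")   # loss.backward(); optimizer.step()
    debugger.stop()
    debugger.step()

print(debugger.collected)
# → [(0, 'forward'), (0, 'backward'), (1, 'forward'), (1, 'backward')]
```

Only steps 0 and 1 produce data, which is why the log above reports success after the first two steps and training can then be stopped manually.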

Results

The following directory structure is generated in the path specified by dump_path. Select the data to analyze as required.

dump_data/
└── step0
    └── rank
        ├── construct.json             # Stores the module hierarchy information when the level is L0; empty in this scenario.
        ├── dump.json                  # Stores input/output statistics and overflow information for the forward and backward APIs.
        ├── dump_tensor_data           # Stores the actual input and output tensor data of the forward and backward APIs.
        │   ├── Functional.adaptive_avg_pool2d.0.backward.input.0.pt
        │   ├── Functional.adaptive_avg_pool2d.0.backward.output.0.pt
        │   ├── Functional.adaptive_avg_pool2d.0.forward.input.0.pt
        │   ├── Functional.adaptive_avg_pool2d.0.forward.output.0.pt
        ...
        └── stack.json                 # Stores the call stack information of each API.

The collected data can then be analyzed further with tools such as accuracy pre-check and accuracy comparison.
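Before comparison, it can help to index the dump files by API call. The file names in the listing above follow a pattern of API name, call index, direction, input/output role, and argument index. The following stdlib-only sketch groups files under that assumption (loading the tensors themselves would additionally require `torch.load`):

```python
from collections import defaultdict
from pathlib import Path


def group_dump_files(dump_dir):
    """Group dump_tensor_data files by API call and direction.

    Assumes names like those in the listing above:
      Functional.adaptive_avg_pool2d.0.forward.input.0.pt
      <api name>.<call index>.<forward|backward>.<input|output>.<arg index>.pt
    """
    groups = defaultdict(list)
    for f in sorted(Path(dump_dir).glob("*.pt")):
        parts = f.stem.split(".")
        # Last three components: direction, role, argument index;
        # everything before them is the API name plus call index.
        direction, role, arg_idx = parts[-3], parts[-2], parts[-1]
        api_call = ".".join(parts[:-3])
        groups[(api_call, direction)].append((role, int(arg_idx), f.name))
    return dict(groups)


# Hypothetical usage (path taken from the structure above):
# group_dump_files("dump_data/step0/rank/dump_tensor_data")
```

Grouping forward and backward files per API call this way makes it straightforward to line up GPU and NPU dumps pairwise for comparison.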