Accuracy Collection

In this sample, the ResNet-50 model is trained on dummy (randomly generated) data.

Prerequisites

Complete Model Development and Migration first, so that you have GPU and NPU environments that can run training jobs correctly.

Environment Setup

Run the following command to install the msProbe tool in both the GPU and NPU environments:

pip3 install mindstudio-probe

Collection

  1. Add the tool to the training script (main.py) in both the GPU and NPU environments.
    When training in the GPU environment, lines 24 and 25 of the following script are not needed (they are NPU-specific).
     23
     24 import torch_npu
     25 from torch_npu.contrib import transfer_to_npu
     26
     27 from msprobe.pytorch import PrecisionDebugger, seed_all
     28 seed_all(seed=1234, mode=True)  # Fix the random seed and enable deterministic computing so that each run of the model produces the same data.
    ...
    310 def train(train_loader, model, criterion, optimizer, epoch, device, args):
    ...
    324     end = time.time()
    325
    326     debugger = PrecisionDebugger(dump_path="./dump_data", task="tensor", step=[0, 1])
    327     for i, (images, target) in enumerate(train_loader):
    328         debugger.start()
    ...
    337         # compute output
    338         output = model(images)
    339         loss = criterion(output, target)
    ...
    347         # compute gradient and do SGD step
    348         optimizer.zero_grad()
    349         loss.backward()
    350         optimizer.step()
    ...
    359         debugger.stop()
    360         debugger.step()
  2. Run the training script. The tool collects accuracy data during model training.
    python main.py -a resnet50 -b 32 --gpu 1 --dummy

    If the following information is displayed in the log, the data for the first two steps has been collected successfully. You can manually stop the training and view the collected data.

    ****************************************************************************
    *                        msprobe ends successfully.                        *
    ****************************************************************************
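The `start()`/`stop()`/`step()` calls in step 1 bracket exactly the iterations listed in the `step` argument (here `[0, 1]`). The following is a plain-Python mock of that collection-window pattern, not the real `PrecisionDebugger` from `msprobe.pytorch`; the class and method bodies here are illustrative only:

```python
# Mock of the start/stop/step collection window; the real debugger is
# msprobe.pytorch.PrecisionDebugger. This class is a hypothetical
# stand-in used only to show when data is collected.
class MockDebugger:
    def __init__(self, steps):
        self.steps = set(steps)   # steps to collect, e.g. {0, 1}
        self.current = 0          # current training step
        self.active = False
        self.collected = []       # stands in for dumped tensor data

    def start(self):
        # Collect only if the current step was requested.
        self.active = self.current in self.steps

    def record(self, name):
        # Called where the loop runs forward/backward computation.
        if self.active:
            self.collected.append((self.current, name))

    def stop(self):
        self.active = False

    def step(self):
        # Advance the step counter, mirroring debugger.step().
        self.current += 1


debugger = MockDebugger(steps=[0, 1])
for i in range(4):                # four training iterations
    debugger.start()
    debugger.record("forward")    # model(images); criterion(output, target)
    debugger.record("backward")   # loss.backward(); optimizer.step()
    debugger.stop()
    debugger.step()

print(debugger.collected)
# → [(0, 'forward'), (0, 'backward'), (1, 'forward'), (1, 'backward')]
```

Only steps 0 and 1 produce data, which is why the log above reports success after the first two steps and training can then be stopped manually.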

Results

The following directory structure is generated in the path specified by dump_path. Select the data to analyze as required.

dump_data/
└── step0
    └── rank
        ├── construct.json             # Stores the module hierarchy information when the level is L0; empty in this scenario.
        ├── dump.json                  # Stores input/output statistics and overflow information for the forward and backward APIs.
        ├── dump_tensor_data           # Stores the actual input and output tensor data of the forward and backward APIs.
        │   ├── Functional.adaptive_avg_pool2d.0.backward.input.0.pt
        │   ├── Functional.adaptive_avg_pool2d.0.backward.output.0.pt
        │   ├── Functional.adaptive_avg_pool2d.0.forward.input.0.pt
        │   ├── Functional.adaptive_avg_pool2d.0.forward.output.0.pt
        ...
        └── stack.json                 # Stores the call stack information of each API.

The collected data can then be analyzed further with tools such as accuracy pre-check and accuracy comparison.
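Before comparison, it can help to index the dump files by API call. The file names in the listing above follow a pattern of API name, call index, direction, input/output role, and argument index. The following stdlib-only sketch groups files under that assumption (loading the tensors themselves would additionally require `torch.load`):

```python
from collections import defaultdict
from pathlib import Path


def group_dump_files(dump_dir):
    """Group dump_tensor_data files by API call and direction.

    Assumes names like those in the listing above:
      Functional.adaptive_avg_pool2d.0.forward.input.0.pt
      <api name>.<call index>.<forward|backward>.<input|output>.<arg index>.pt
    """
    groups = defaultdict(list)
    for f in sorted(Path(dump_dir).glob("*.pt")):
        parts = f.stem.split(".")
        # Last three components: direction, role, argument index;
        # everything before them is the API name plus call index.
        direction, role, arg_idx = parts[-3], parts[-2], parts[-1]
        api_call = ".".join(parts[:-3])
        groups[(api_call, direction)].append((role, int(arg_idx), f.name))
    return dict(groups)


# Hypothetical usage (path taken from the structure above):
# group_dump_files("dump_data/step0/rank/dump_tensor_data")
```

Grouping forward and backward files per API call this way makes it straightforward to line up GPU and NPU dumps pairwise for comparison.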