Accuracy Collection
This sample trains the ResNet-50 model on dummy data (the --dummy option of the training script).
Prerequisites
Complete the steps in Model Development and Migration so that training jobs run correctly in both the GPU and NPU environments.
Environment Setup
Run the following command to install the msProbe tool in the GPU and NPU environments:
pip3 install mindstudio-probe
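To confirm from Python that the installation succeeded, one option is to look up the import name. The sketch below assumes the package is importable as msprobe (the name used by the training script in this section); is_importable is a hypothetical helper for illustration, not part of msProbe.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "msprobe" is the import name used by the training script in this section.
    print("msprobe importable:", is_importable("msprobe"))
```

Run the check once in each environment, since the tool must be installed on both the GPU and the NPU side.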
Collection
- Add the tool to the training script (main.py) in the GPU and NPU environments. In the GPU environment, lines 24 and 25 of the following script are not needed.
23
24 import torch_npu
25 from torch_npu.contrib import transfer_to_npu
26
27 from msprobe.pytorch import PrecisionDebugger, seed_all
28 seed_all(seed=1234, mode=True)  # Fix random seed to enable deterministic computing to ensure that the data is the same for each model execution.
...
310 def train(train_loader, model, criterion, optimizer, epoch, device, args):
...
324     end = time.time()
325
326     debugger = PrecisionDebugger(dump_path="./dump_data", task="tensor", step=[0, 1])
327     for i, (images, target) in enumerate(train_loader):
328         debugger.start()
...
337         # compute output
338         output = model(images)
339         loss = criterion(output, target)
...
347         # compute gradient and do SGD step
348         optimizer.zero_grad()
349         loss.backward()
350         optimizer.step()
...
359         debugger.stop()
360         debugger.step()
- Run the training script. The tool collects accuracy data while the model trains.
python main.py -a resnet50 -b 32 --gpu 1 --dummy
If the log displays the following message, data for the first two steps (steps 0 and 1) has been collected. You can then stop model training manually and view the collected data.
****************************************************************************
* msprobe ends successfully. *
****************************************************************************
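The call pattern in the script (start at the top of each iteration, stop and step at the end, with step=[0, 1]) can be pictured with a toy stand-in. The StepGate class below is hypothetical and only mimics the observable behaviour described in this section: data is kept for the listed steps and ignored for the rest. It is not msProbe code and makes no claim about how PrecisionDebugger works internally.

```python
class StepGate:
    """Toy stand-in for the start/stop/step call pattern (not msprobe code)."""

    def __init__(self, steps):
        self.steps = set(steps)  # iterations for which data should be kept
        self.current = 0         # index of the iteration in progress
        self.active = False      # True only between start() and stop() of a kept step
        self.collected = []      # (step, event) pairs recorded while active

    def start(self):
        self.active = self.current in self.steps

    def record(self, event):
        if self.active:
            self.collected.append((self.current, event))

    def stop(self):
        self.active = False

    def step(self):
        self.current += 1


gate = StepGate(steps=[0, 1])
for i in range(4):              # four training iterations
    gate.start()
    gate.record("forward")      # stands in for the forward pass
    gate.record("backward")     # stands in for the backward pass
    gate.stop()
    gate.step()

collected_steps = sorted({s for s, _ in gate.collected})
print(collected_steps)  # [0, 1]
```

In this sketch, iterations 2 and 3 still run but record nothing, which matches the note above that training can be stopped manually once the first two steps are collected.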
Results
The following directory structure is created under the path specified by dump_path. You can select data for analysis as required.
dump_data/
└── step0
    └── rank
        ├── construct.json      # Stores the module hierarchy when the level is L0. This file is empty in the current scenario.
        ├── dump.json           # Stores the input and output statistics and overflow information of the forward and backward APIs.
        ├── dump_tensor_data    # Stores the actual input and output tensors of the forward and backward APIs.
        │   ├── Functional.adaptive_avg_pool2d.0.backward.input.0.pt
        │   ├── Functional.adaptive_avg_pool2d.0.backward.output.0.pt
        │   ├── Functional.adaptive_avg_pool2d.0.forward.input.0.pt
        │   ├── Functional.adaptive_avg_pool2d.0.forward.output.0.pt
        │   ...
        └── stack.json          # Stores the call stack information of the APIs.
The collected data needs to be further analyzed using tools such as accuracy pre-check and accuracy comparison.
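Before handing the data to those tools, it can help to see which API calls were captured. The following is a minimal sketch that indexes the tensor files by API call. It assumes file names have exactly the seven dot-separated fields seen in the sample listing (for example Functional.adaptive_avg_pool2d.0.forward.input.0.pt) and skips any name that does not fit. parse_tensor_name and index_dump are hypothetical helpers, not part of msProbe.

```python
from collections import defaultdict
from pathlib import Path


def parse_tensor_name(filename: str) -> dict:
    """Split a name such as 'Functional.adaptive_avg_pool2d.0.forward.input.0.pt'.

    Assumes exactly seven dot-separated fields, as in the sample listing.
    """
    parts = filename.split(".")
    if len(parts) != 7 or parts[-1] != "pt":
        raise ValueError(f"unexpected tensor file name: {filename}")
    api_type, api_name, call_idx, direction, io_kind, arg_idx, _ = parts
    return {
        "api": f"{api_type}.{api_name}.{call_idx}",
        "direction": direction,  # "forward" or "backward"
        "io": io_kind,           # "input" or "output"
        "index": int(arg_idx),
    }


def index_dump(dump_path: str) -> dict:
    """Map each API call to its tensor files across all step and rank directories."""
    index = defaultdict(list)
    pattern = "step*/rank*/dump_tensor_data/*.pt"
    for pt_file in sorted(Path(dump_path).glob(pattern)):
        try:
            info = parse_tensor_name(pt_file.name)
        except ValueError:
            continue  # skip names that do not fit the assumed pattern
        index[info["api"]].append(pt_file)
    return dict(index)


if __name__ == "__main__":
    for api, files in index_dump("./dump_data").items():
        print(api, len(files))
```

Each path can then be opened with torch.load, assuming the files are standard PyTorch tensor files as the .pt extension suggests.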