Accuracy Data Collection
This sample trains the ResNet-50 model on virtual (dummy) data, which avoids the time needed to download a dataset.
Prerequisites
- You have completed the operations in Environment Setup.
- You have completed the operations in Pre-Training Configuration Check.
Performing Collection
- Add the tool to the training script pytorch_main.py in the GPU and Ascend NPU environments.
During training in the GPU environment, lines 24 and 25 in the following script are not needed.
You can copy the complete code from Code Sample for PyTorch Accuracy Data Collection and execute it directly. The following example only shows where to add the tool API in the script.

     23
     24 import torch_npu
     25 from torch_npu.contrib import transfer_to_npu
     26
     27 from msprobe.pytorch import PrecisionDebugger, seed_all
     28 seed_all(seed=1234, mode=True)  # Fix random seeds to enable deterministic computation to ensure data consistency across model executions.
        ...
    314 def train(train_loader, model, criterion, optimizer, epoch, device, args):
        ...
    331     end = time.time()
    332
    333     debugger = PrecisionDebugger(dump_path="./dump_data", task="tensor", step=[0, 1])
    334     for i, (images, target) in enumerate(train_loader):
    335         debugger.start()
        ...
    356
    357         # measure elapsed time
    358         batch_time.update(time.time() - end)
    359         end = time.time()
    360
    361         debugger.stop()
    362         debugger.step()
Accuracy data consumes disk space, and the server may become unavailable if the disk fills up. The space required depends on the model parameters, the collection configuration, and the number of iterations collected. Ensure that the directory where the accuracy data is written has enough free disk space.
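The disk space check described above can be done before training starts. The following is a minimal sketch that uses only the Python standard library. The helper names free_space_gib and has_enough_space are hypothetical and are not part of msprobe, and the threshold you pass in is something you must estimate for your own model and configuration.

```python
import shutil
from pathlib import Path


def free_space_gib(path):
    """Return the free space, in GiB, on the filesystem that will hold `path`."""
    target = Path(path).resolve()
    # The dump directory may not exist yet, so walk up to the nearest existing parent.
    while not target.exists():
        target = target.parent
    return shutil.disk_usage(target).free / 1024**3


def has_enough_space(path, required_gib):
    """Return True if at least `required_gib` GiB are free for `path`."""
    return free_space_gib(path) >= required_gib


# Report the free space for the directory passed as dump_path to PrecisionDebugger.
print(f"{free_space_gib('./dump_data'):.1f} GiB free for ./dump_data")
```

For example, calling has_enough_space("./dump_data", 50.0) at the top of the training script and exiting when it returns False prevents a collection run from starting on a nearly full disk. The value 50.0 is only a placeholder.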
- Run the training script command. The tool collects the accuracy data during model training.
python pytorch_main.py -a resnet50 -b 32 --gpu 1 --dummy
If the following information is displayed in the log, you can manually stop the model training and view the collected data to save time.
****************************************************************************
*                        msprobe ends successfully.                        *
****************************************************************************
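Stopping the training manually can also be scripted. The following sketch launches a command, echoes its output, and terminates the process once the success message appears. It assumes that the message is written to the standard output or standard error of the training process, which this document does not confirm, and the function name run_until_marker is hypothetical. If training spawns child processes, terminating only the parent may not be enough.

```python
import subprocess
import sys

# Text taken from the banner shown above.
SUCCESS_MARKER = "msprobe ends successfully."


def run_until_marker(cmd, marker=SUCCESS_MARKER):
    """Run `cmd`, echo its output, and terminate it once `marker` is printed.

    Returns True if the marker was seen, False if the process ended without it.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so either stream is scanned
        text=True,
    )
    seen = False
    for line in proc.stdout:
        sys.stdout.write(line)
        if marker in line:
            seen = True
            proc.terminate()  # collection has finished, so stop training early
            break
    proc.stdout.close()
    proc.wait()
    return seen


# Example call (not executed here); "-u" disables Python output buffering so
# that the marker is seen as soon as it is printed:
# run_until_marker(["python", "-u", "pytorch_main.py", "-a", "resnet50",
#                   "-b", "32", "--gpu", "1", "--dummy"])
```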
Viewing Results
The following directory structure is displayed in the path specified by dump_path. Select data for analysis as required.
dump_data/
├── step0
│   └── rank
│       ├── construct.json      # Save the hierarchical relationship information of the module. This field is empty in the current scenario.
│       ├── dump.json           # Save the input and output statistics and overflow information of the forward and backward APIs.
│       ├── dump_tensor_data    # Save the actual data of the input and output tensors of the forward and backward APIs.
│       │   ├── Functional.adaptive_avg_pool2d.0.backward.input.0.pt
│       │   ├── Functional.adaptive_avg_pool2d.0.backward.output.0.pt
│       │   ├── Functional.adaptive_avg_pool2d.0.forward.input.0.pt
│       │   ├── Functional.adaptive_avg_pool2d.0.forward.output.0.pt
│       │   ...
│       └── stack.json          # Save the call stack information of the API.
├── step1
...
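To get a quick overview of what was collected, the directory can be scanned with a short script. The following sketch relies only on the layout shown above and on the Python standard library. The function names are hypothetical, and the file name pattern in parse_tensor_name is inferred from the four example names in the listing, so other names may not match it.

```python
import re
from pathlib import Path

# Pattern inferred from names such as
# Functional.adaptive_avg_pool2d.0.forward.input.0.pt (an assumption).
TENSOR_NAME = re.compile(
    r"^(?P<api>.+)\.(?P<call>\d+)\.(?P<direction>forward|backward)"
    r"\.(?P<kind>input|output)\.(?P<index>\d+)\.pt$"
)


def summarize_dump(dump_path):
    """Map each 'stepN/rank' directory to (file count, total size in bytes)."""
    summary = {}
    for step_dir in sorted(Path(dump_path).glob("step*")):
        if not step_dir.is_dir():
            continue
        for rank_dir in sorted(p for p in step_dir.iterdir() if p.is_dir()):
            files = [f for f in rank_dir.rglob("*") if f.is_file()]
            key = f"{step_dir.name}/{rank_dir.name}"
            summary[key] = (len(files), sum(f.stat().st_size for f in files))
    return summary


def parse_tensor_name(filename):
    """Split a tensor file name into its parts, or return None if it does not match."""
    match = TENSOR_NAME.match(filename)
    return match.groupdict() if match else None


# Print one line per step/rank directory found under dump_path.
for key, (count, size) in summarize_dump("./dump_data").items():
    print(f"{key}: {count} files, {size / 1024**2:.1f} MiB")
```

The per-directory sizes also give a rough basis for estimating how much disk space additional steps would need.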
Use tools such as those in Accuracy Pre-Check and Accuracy Comparison to further analyze the collected data.