Complete model development and migration so that both the GPU and NPU environments can run the training task normally.
Before collecting performance data, remove the precision data collection interface from the training script (the main.py file), because precision data collection and performance data collection cannot run at the same time.
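As an illustration only, if the precision data collection was added with msprobe's PrecisionDebugger (an assumption about the earlier precision step; adapt this to whatever interface is actually present in main.py), the lines to delete or comment out would look like this:

# Hypothetical precision-collection calls to remove before profiling:
# from msprobe.pytorch import PrecisionDebugger
# debugger = PrecisionDebugger(config_path="./config.json")
# ...
# debugger.start()   # remove
# ...                # training step
# debugger.stop()    # remove
# debugger.step()    # remove

With those lines removed, the training loop instrumented with the Ascend PyTorch Profiler interface looks as follows: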
import torch_npu
from torch_npu.contrib import transfer_to_npu
...
experimental_config = torch_npu.profiler._ExperimentalConfig(
    export_type=torch_npu.profiler.ExportType.Text,
    profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
    msprof_tx=False,
    aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone,
    l2_cache=False,
    op_attr=False,
    data_simplification=False,
    record_op_args=False,
    gc_detect_threshold=None)
with torch_npu.profiler.profile(
        activities=[
            torch_npu.profiler.ProfilerActivity.CPU,
            torch_npu.profiler.ProfilerActivity.NPU
        ],
        schedule=torch_npu.profiler.schedule(wait=0, warmup=0, active=1, repeat=1, skip_first=1),
        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./profiling_data"),
        record_shapes=False,
        profile_memory=False,
        with_stack=False,
        with_modules=False,
        with_flops=False,
        experimental_config=experimental_config) as prof:
    for i, (images, target) in enumerate(train_loader):
        # measure data loading time
        data_time.update(time.time() - end)

        # move data to the same device as model
        images = images.to(device, non_blocking=True)
        target = target.to(device, non_blocking=True)

        # compute output
        output = model(images)
        loss = criterion(output, target)

        # measure accuracy and record loss
        acc1, acc5 = accuracy(output, target, topk=(1, 5))
        losses.update(loss.item(), images.size(0))
        top1.update(acc1[0], images.size(0))
        top5.update(acc5[0], images.size(0))

        # compute gradient and do SGD step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prof.step()
...
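With this configuration, the schedule mirrors the semantics of torch.profiler.schedule: the first iteration is skipped (skip_first=1), there is no wait or warm-up phase (wait=0, warmup=0), and a single iteration is recorded (active=1, repeat=1). The prof.step() call at the end of each iteration is what advances the profiler through these phases, so it must not be removed. The resulting step-to-phase mapping, inferred from these parameters, is sketched below:

# schedule(wait=0, warmup=0, active=1, repeat=1, skip_first=1)
# iteration 0 -> skipped   (skip_first=1)
# iteration 1 -> recorded  (active=1); the single cycle ends here (repeat=1)
# remaining iterations -> not profiled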
Then launch the training script to start the collection, for example:

python main.py -a resnet50 -b 32 --gpu 1 --dummy
After training finishes, the Ascend PyTorch Profiler result directory is generated under the directory specified by the torch_npu.profiler.tensorboard_trace_handler interface, as in the following example.
└── localhost-247.localdomain_2201189_20241114070751139_ascend_pt
    ├── ASCEND_PROFILER_OUTPUT
    │   ├── api_statistic.csv
    │   ├── kernel_details.csv
    │   ├── operator_details.csv
    │   ├── op_statistic.csv
    │   ├── step_trace_time.csv
    │   └── trace_view.json
    ├── FRAMEWORK
    ...
    ├── PROF_000001_20241114151021952_PGRJNNCFAIJQMERA
    │   ├── device_1
    │   │   ├── data
    ...
    │   ├── host
    │   │   ├── data
    ...
    │   ├── mindstudio_profiler_log
    ...
    │   └── mindstudio_profiler_output
    │       ├── api_statistic_20241114151110.csv
    │       ├── msprof_20241114151108.json
    │       ├── op_statistic_20241114151110.csv
    │       ├── op_summary_20241114151110.csv
    │       ├── prof_rule_1_20241114151110.json
    │       ├── README.txt
    │       └── task_time_20241114151110.csv
    └── profiler_info.json
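For a quick programmatic look at the summary CSVs before opening a visualization tool, something like the following can be used (a minimal sketch, assuming pandas is installed; the *_ascend_pt directory name is run-specific and the CSV columns vary with the CANN/torch_npu version):

import glob
import os

import pandas as pd

# The *_ascend_pt directory name contains the hostname, PID and a timestamp,
# so locate it with a wildcard instead of hard-coding it.
output_dirs = glob.glob("./profiling_data/*_ascend_pt/ASCEND_PROFILER_OUTPUT")
for csv_path in sorted(glob.glob(os.path.join(output_dirs[0], "*.csv"))):
    df = pd.read_csv(csv_path)
    print(os.path.basename(csv_path), df.shape)  # file name and (rows, columns)
    print(df.columns.tolist())                   # columns available in this file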
It is recommended to visualize and analyze the performance data collected through the Ascend PyTorch Profiler interface with the MindStudio Insight tool; the msprof-analyze tool from mstt can also be used for auxiliary analysis. For detailed instructions, see "Visualizing Performance Data with MindStudio Insight" and "Analyzing Performance Data with msprof-analyze".