Profiling Quick Start (PyTorch Training/Online Inference)
In PyTorch training and online inference scenarios, you are advised to use the Ascend PyTorch Profiler API to collect and parse profile data. You can then analyze and identify performance bottlenecks based on the results.
The Ascend PyTorch Profiler API does not support performance analysis in single-process multi-device scenarios. You are advised to use a multi-process approach to execute test cases and set a device for each process.
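The per-process pattern described above can be sketched with Python's standard multiprocessing module. This is an illustrative skeleton only: `run_one_device` and the world size are hypothetical names, and the `torch_npu` device-binding call is shown in comments because it requires Ascend hardware.

```python
import multiprocessing as mp

def run_one_device(device_id: int) -> int:
    """Hypothetical per-process entry point: one process per NPU device."""
    # In a real script, bind this process to its own device before training:
    #   import torch_npu
    #   torch_npu.npu.set_device(device_id)
    # then run the profiled training loop (see Example 1 below), writing
    # results to a per-rank directory such as f"./result/rank_{device_id}".
    return device_id

def launch(world_size: int) -> list:
    """Start one worker process per device and wait for all of them."""
    procs = [mp.Process(target=run_one_device, args=(i,))
             for i in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]
```

Each worker then profiles only its own device, which keeps every collection single-process single-device as the API requires.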
Prerequisites
- Ensure that operations in Before You Start have been completed.
- Prepare a model trained on PyTorch 2.1.0 or later and a matching dataset, and port the model to the Ascend AI Processor. For details, see "Porting Adaptation" in the PyTorch Training Model Porting and Tuning Guide.
Collecting and Parsing Profile Data
- Call the Ascend PyTorch Profiler API to enable profile data sampling during PyTorch training/online inference.
Add the following sample code to the training script (for example, train_*.py) or online inference script to configure profile data sampling parameters, and then start the training/online inference. The following are code examples:
- For details about the APIs in the examples, see Ascend PyTorch Profiler APIs.
- For details about profiling in the PyTorch scenario, see Using PyTorch APIs for Profile Data Sampling.
- Profile data occupies disk space; if the disk fills up, the server may become unavailable. The space required is closely related to the model parameters, the collection configuration, and the number of collected iterations. Ensure that the directory to which profile data is flushed has sufficient free space.
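Before starting a long collection, free space can be checked programmatically with the standard library. A minimal sketch; the 10 GiB threshold is an arbitrary example, not a documented requirement:

```python
import shutil

def has_free_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least
    `required_bytes` of free space for profiling output."""
    usage = shutil.disk_usage(path)
    return usage.free >= required_bytes

# Example: check for 10 GiB of free space in the profiling output
# directory before enabling the profiler.
enough = has_free_space(".", 10 * 1024**3)
```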
- Example 1:
```python
import torch
import torch_npu
...
experimental_config = torch_npu.profiler._ExperimentalConfig(
    export_type=torch_npu.profiler.ExportType.Text,
    profiler_level=torch_npu.profiler.ProfilerLevel.Level0,
    msprof_tx=False,
    aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone,
    l2_cache=False,
    op_attr=False,
    data_simplification=False,
    record_op_args=False,
    gc_detect_threshold=None
)
with torch_npu.profiler.profile(
        activities=[
            torch_npu.profiler.ProfilerActivity.CPU,
            torch_npu.profiler.ProfilerActivity.NPU
        ],
        schedule=torch_npu.profiler.schedule(wait=0, warmup=0, active=1, repeat=1, skip_first=1),
        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./result"),
        record_shapes=False,
        profile_memory=False,
        with_stack=False,
        with_modules=False,
        with_flops=False,
        experimental_config=experimental_config) as prof:
    for step in range(steps):
        train_one_step(step, steps, train_loader, model, optimizer, criterion)
        prof.step()
```
- Example 2:
```python
import torch
import torch_npu
...
experimental_config = torch_npu.profiler._ExperimentalConfig(
    export_type=torch_npu.profiler.ExportType.Text,
    profiler_level=torch_npu.profiler.ProfilerLevel.Level0,
    msprof_tx=False,
    aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone,
    l2_cache=False,
    op_attr=False,
    data_simplification=False,
    record_op_args=False,
    gc_detect_threshold=None
)
prof = torch_npu.profiler.profile(
    activities=[
        torch_npu.profiler.ProfilerActivity.CPU,
        torch_npu.profiler.ProfilerActivity.NPU
    ],
    schedule=torch_npu.profiler.schedule(wait=0, warmup=0, active=1, repeat=1, skip_first=1),
    on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./result"),
    record_shapes=False,
    profile_memory=False,
    with_stack=False,
    with_modules=False,
    with_flops=False,
    experimental_config=experimental_config)
prof.start()
for step in range(steps):
    train_one_step()
    prof.step()
prof.stop()
```
In the preceding examples, tensorboard_trace_handler is used to export profile data. You can also export profile data with prof.export_chrome_trace:
```python
import torch
import torch_npu
...
with torch_npu.profiler.profile() as prof:
    for step in range(steps):
        train_one_step(step, steps, train_loader, model, optimizer, criterion)
prof.export_chrome_trace("./chrome_trace_14.json")
```
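A file exported by export_chrome_trace follows the Chrome trace event format, so it can be inspected with plain JSON tooling. The sketch below builds a tiny synthetic trace in that format (the event names are made up for illustration) and totals duration per event name; the same loop applies to a real chrome_trace_*.json loaded with json.load:

```python
import json
from collections import defaultdict

def total_durations(trace):
    """Sum the 'dur' field (microseconds) of complete ('X') events by name.

    Accepts either the object form {"traceEvents": [...]} or the bare
    array form of the Chrome trace event format.
    """
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for ev in events:
        if ev.get("ph") == "X":  # 'X' marks a complete event with a duration
            totals[ev["name"]] += ev.get("dur", 0)
    return dict(totals)

# Synthetic trace standing in for a real exported file; for a real file:
#   trace = json.load(open("./chrome_trace_14.json"))
sample = json.loads("""
{"traceEvents": [
    {"name": "aclnnAdd", "ph": "X", "ts": 0,  "dur": 12.5},
    {"name": "aclnnAdd", "ph": "X", "ts": 20, "dur": 7.5},
    {"name": "MatMul",   "ph": "X", "ts": 40, "dur": 100.0}
]}
""")
print(total_durations(sample))  # -> {'aclnnAdd': 20.0, 'MatMul': 100.0}
```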
- View the result file of profile data collected during PyTorch training.
After the training is complete, the collection result directory of the Ascend PyTorch Profiler API is generated in the directory specified by the torch_npu.profiler.tensorboard_trace_handler API.
- Open and view the following data files. For details about the fields, see Data Storing Directories.
- If the step ID in the kernel_details.csv file is empty, you can view the step information of the operator in the trace_view.json file or collect profile data again.
```
├── localhost.localdomain_139247_20230628101435_ascend_pt      // Parsing result directory, named {worker_name}_{timestamp}_ascend_pt. {worker_name} defaults to {hostname}_{pid}.
│   ├── profiler_info.json                                     // In multi-device or cluster scenarios, named profiler_info_{Rank_ID}.json; records Profiler-related metadata.
│   ├── profiler_metadata.json
│   ├── ASCEND_PROFILER_OUTPUT                                 // Profile data collected with the Ascend PyTorch Profiler.
│   │   ├── ascend_pytorch_profiler_{rank_id}.db               // Generated when export_type is set to torch_npu.profiler.ExportType.Db; the .json and .csv files are not generated in this case.
│   │   ├── analysis.db                                        // Generated when export_type is set to torch_npu.profiler.ExportType.Db in multi-device or cluster scenarios involving communication; displayed by the MindStudio Insight tool. The .json and .csv files are not generated in this case.
│   │   ├── communication.json                                 // Basis for visualized performance analysis in multi-device or cluster scenarios involving communication. Generated when profiler_level is set to torch_npu.profiler.ProfilerLevel.Level1 or Level2 in experimental_config.
│   │   ├── communication_matrix.json                          // Basic information about small communication operators. Generated when profiler_level is set to Level1 or Level2 in experimental_config.
│   │   ├── data_preprocess.csv                                // Generated when profiler_level is set to torch_npu.profiler.ProfilerLevel.Level2 in experimental_config.
│   │   ├── kernel_details.csv
│   │   ├── l2_cache.csv                                       // Generated when l2_cache is set to True in experimental_config.
│   │   ├── memory_record.csv
│   │   ├── npu_module_mem.csv
│   │   ├── operator_details.csv
│   │   ├── operator_memory.csv
│   │   ├── step_trace_time.csv                                // Computation and communication time statistics per iteration.
│   │   ├── op_statistic.csv                                   // AI Core and AI CPU operator call counts and time consumption.
│   │   ├── api_statistic.csv                                  // Generated when profiler_level is set to Level1 or Level2 in experimental_config.
│   │   └── trace_view.json
│   ├── FRAMEWORK                                              // Raw framework-side profile data, which can be ignored. Deleted when data_simplification is set to True.
│   └── PROF_000001_20230628101435646_FKFLNPEPPRRCFCBA         // CANN-layer profile data, named PROF_{number}_{timestamp}_{string}. When data_simplification is set to True, only the raw profile data in this directory is retained; other data is deleted.
│       ├── analyze                                            // Generated when profiler_level is set to Level1 or Level2 in experimental_config.
│       ├── device_*
│       ├── host
│       ├── mindstudio_profiler_log
│       └── mindstudio_profiler_output
└── localhost.localdomain_139247_20230628101435_ascend_pt_op_args  // Operator statistics directory. Generated when record_op_args is set to True in experimental_config.
```
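Regarding the note above about empty step IDs in kernel_details.csv, the check can be automated with the standard csv module. A minimal sketch; the column name "Step Id" and the sample rows are assumptions for illustration, so verify them against the header of your actual file:

```python
import csv
import io

def rows_missing_step(csv_text, step_col="Step Id"):
    """Return kernel rows whose step-ID column is empty or blank."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if not (row.get(step_col) or "").strip()]

# Synthetic two-row sample standing in for a real kernel_details.csv;
# for a real file, read it with open(...) instead.
sample = "Step Id,Name,Duration(us)\n1,aclnnAdd,12.5\n,MatMul,100.0\n"
missing = rows_missing_step(sample)
print(len(missing))  # -> 1
```

If any rows come back, fall back to trace_view.json for step information or re-collect, as the note suggests.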