Getting Started with Performance Analysis in PyTorch Training Scenarios

In PyTorch training scenarios, use the Ascend PyTorch Profiler API to collect and parse profile data, then analyze the results to identify performance bottlenecks.

The Ascend PyTorch Profiler API supports the following mappings between profiling processes and devices:

  • Multi-process to multi-device: One profiling process for each device.
  • Single-process to multi-device: Supported. PyTorch 2.1.0post14, 2.5.1post2, 2.6.0, or later is required.
  • Multi-process to single-device: Run the profiling processes serially. They must not start at the same time, and each must run from start to stop before the next one begins.
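
One way to keep profiling runs serial in the multi-process to single-device case is to guard each run with an exclusive file lock. The following is a minimal sketch under stated assumptions, not part of the Ascend PyTorch Profiler API: the helper name profiling_slot and the lock path are hypothetical, and fcntl is available only on Unix-like systems.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def profiling_slot(lock_path="/tmp/npu_profiler.lock"):  # hypothetical lock path
    """Hold an exclusive lock for one complete start-to-stop profiling run."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Blocks until every other process holding the lock has released it.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


# Intended usage (profiler calls shown as comments only):
# with profiling_slot():
#     prof.start()
#     ...training steps...
#     prof.stop()
```

Each process that shares the device would wrap its profiling range in the same lock, so a second run cannot start until the first has stopped.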

Prerequisites

  • You have installed the CANN Toolkit package and ops operator package.

    For details, see CANN Software Installation Guide.

  • You have prepared a training model developed with PyTorch 2.1.0 or later and its dataset, and have migrated the original PyTorch model to the Ascend AI Processor by referring to "Model Porting" in PyTorch Training Model Porting and Tuning Guide.
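
Before editing the training script, it can help to confirm that the Python packages used in the examples are importable. The sketch below is an illustration only: missing_packages is a hypothetical helper, and it checks Python packages, not the CANN Toolkit installation itself.

```python
import importlib.util


def missing_packages(names=("torch", "torch_npu")):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("torch and torch_npu are importable.")
```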

Collecting and Parsing Profile Data

  1. Call the Ascend PyTorch Profiler API to enable profiling during PyTorch training.

    Add the following sample code to the training script (for example, train_*.py) to configure profiling parameters, and then start the training.

    • For details about the APIs in the examples, see Ascend PyTorch Profiler APIs.
    • For details about profiling in the PyTorch scenario, see Ascend PyTorch Profiler.
    • Profile data consumes disk space, and the server may become unavailable if the disk fills up. The space required depends on the model parameters, the collection configuration, and the number of iterations collected. Ensure that the directory where profile data is written has enough free space.
    • Example 1: Use the with statement to call the torch_npu.profiler.profile API. The profiler is created automatically and collects profile data within the scope of the with block.
      import torch
      import torch_npu
      
      ...
      
      experimental_config = torch_npu.profiler._ExperimentalConfig(
          export_type=[
              torch_npu.profiler.ExportType.Text
              ],
          profiler_level=torch_npu.profiler.ProfilerLevel.Level0,
          mstx=False,    # Formerly named msprof_tx; the old name is still accepted.
          aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone,
          l2_cache=False,
          op_attr=False,
          data_simplification=False,
          record_op_args=False,
          gc_detect_threshold=None
      )
      
      with torch_npu.profiler.profile(
          activities=[
              torch_npu.profiler.ProfilerActivity.CPU,
              torch_npu.profiler.ProfilerActivity.NPU
              ],
          schedule=torch_npu.profiler.schedule(wait=0, warmup=0, active=1, repeat=1, skip_first=1),    # Used with prof.step().
          on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./result"),
          record_shapes=False,
          profile_memory=False,
          with_stack=False,
          with_modules=False,
          with_flops=False,
          experimental_config=experimental_config) as prof:
      
          for step in range(steps):
              train_one_step(step, steps, train_loader, model, optimizer, criterion)
              prof.step()    # Used with schedule.
      
    • Example 2: Create a torch_npu.profiler.profile object and use its start and stop APIs to control profiling. This lets you choose where in the script profiling begins and ends.
      import torch
      import torch_npu
      ...
      
      experimental_config = torch_npu.profiler._ExperimentalConfig(
          export_type=[
              torch_npu.profiler.ExportType.Text
              ],
          profiler_level=torch_npu.profiler.ProfilerLevel.Level0,
          mstx=False,    # Formerly named msprof_tx; the old name is still accepted.
          aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone,
          l2_cache=False,
          op_attr=False,
          data_simplification=False,
          record_op_args=False,
          gc_detect_threshold=None
      )
      
      prof = torch_npu.profiler.profile(
          activities=[
              torch_npu.profiler.ProfilerActivity.CPU,
              torch_npu.profiler.ProfilerActivity.NPU
              ],
          schedule=torch_npu.profiler.schedule(wait=0, warmup=0, active=1, repeat=1, skip_first=1),    # Used with prof.step().
          on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./result"),
          record_shapes=False,
          profile_memory=False,
          with_stack=False,
          with_modules=False,
          with_flops=False,
          experimental_config=experimental_config)
      
      prof.start()    # Start profiling.
      for step in range(steps):
          train_one_step(step, steps, train_loader, model, optimizer, criterion)
          prof.step()    # Used with schedule.
      prof.stop()    # Stop profiling.
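
Both examples pass schedule(wait=0, warmup=0, active=1, repeat=1, skip_first=1). To see which iterations such a schedule would record, the following pure-Python sketch mimics the step-to-action mapping of torch.profiler.schedule. It assumes that torch_npu.profiler.schedule follows the same semantics, which this document does not confirm; schedule_action is a hypothetical helper, not a torch_npu API.

```python
def schedule_action(step, wait, warmup, active, repeat=0, skip_first=0):
    """Return 'none', 'warmup', or 'record' for a zero-based step index."""
    if step < skip_first:
        return "none"                  # initial steps are skipped entirely
    step -= skip_first
    cycle = wait + warmup + active     # length of one wait/warmup/active cycle
    if repeat > 0 and step // cycle >= repeat:
        return "none"                  # all requested cycles have finished
    pos = step % cycle
    if pos < wait:
        return "none"
    if pos < wait + warmup:
        return "warmup"
    return "record"


if __name__ == "__main__":
    for s in range(4):
        print(s, schedule_action(s, wait=0, warmup=0, active=1, repeat=1, skip_first=1))
```

Under that assumption, the schedule in the examples skips step index 0, records step index 1, and does nothing afterward, so the training loop needs at least two iterations for any data to be collected.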
      

    In Examples 1 and 2, tensorboard_trace_handler exports the profile data. Alternatively, use prof.export_chrome_trace to export the profile data to a single file, chrome_trace_{pid}.json. The data exported by tensorboard_trace_handler already includes the data that prof.export_chrome_trace exports, so choose either method based on your requirements.

    import torch
    import torch_npu
    
    ...
    
    with torch_npu.profiler.profile() as prof:
    
        # Start profiling.
        for step in range(steps):
            train_one_step(step, steps, train_loader, model, optimizer, criterion)
    prof.export_chrome_trace('./chrome_trace_14.json')
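
Because profile data can fill the disk, a pre-flight check before training starts may be useful. The sketch below uses only the Python standard library; has_free_space is a hypothetical helper, and the 10 GiB default is an arbitrary placeholder, since the space actually needed depends on the model, the collection configuration, and the number of iterations collected.

```python
import shutil


def has_free_space(path=".", min_free_gib=10.0):
    """Return True if the filesystem containing path has at least min_free_gib free."""
    free_gib = shutil.disk_usage(path).free / 2**30
    return free_gib >= min_free_gib


if __name__ == "__main__":
    # Check the parent of the output directory, which must already exist.
    if not has_free_space(".", min_free_gib=10.0):
        raise SystemExit("Not enough free disk space for profile data.")
```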
    
  2. View the result file of the profile data.

    After the training is complete, the profiling result directory of the Ascend PyTorch Profiler API is generated in the directory specified by the torch_npu.profiler.tensorboard_trace_handler API. The following is an example.

    You do not need to open these data files directly. Use the tool described in MindStudio Insight User Guide to view and analyze the profile data. For details about the fields, see MindSpore & PyTorch Profile Data File References.

    └── msprof_1784298_20250620085947065_ascend_pt
        ├── ASCEND_PROFILER_OUTPUT
        │   ├── ascend_pytorch_profiler_{Rank_ID}.db    # This file can be exported by default only for the Atlas A3 Training Series Product and Atlas A2 Training Series Product/Atlas 800I A2 Inference Product.
        │   ├── kernel_details.csv
        │   ├── operator_details.csv
        │   ├── step_trace_time.csv
        │   └── trace_view.json
        ├── FRAMEWORK
    ...
        ├── PROF_000001_20250620085947066_FLRBJLNFMBIDRPMB
        │   ├── device_1
        │   │   ├── data
    ...
        │   ├── host
        │   │   ├── data
    ...
        │   ├── mindstudio_profiler_log
        │   └── mindstudio_profiler_output
        │       ├── api_statistic_20250620085954.csv
        │       ├── msprof_20250620085953.json
        │       ├── op_summary_20250620085954.csv
        │       ├── README.txt
        │       └── task_time_20250620085954.csv
        ├── profiler_info.json
        └── profiler_metadata.json
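
As a quick sanity check that a run produced output, the following sketch lists each result directory and reports which of the files from the example tree are absent. It is an illustration under stated assumptions: summarize_results is a hypothetical helper, the *_ascend_pt suffix and file names are taken from the example above, and the set of files actually generated may differ with other profiler settings.

```python
from pathlib import Path

# File names copied from the ASCEND_PROFILER_OUTPUT directory in the example tree.
EXPECTED = (
    "kernel_details.csv",
    "operator_details.csv",
    "step_trace_time.csv",
    "trace_view.json",
)


def summarize_results(root="./result"):
    """Map each *_ascend_pt directory under root to the expected files it lacks."""
    report = {}
    for run_dir in sorted(Path(root).glob("*_ascend_pt")):
        out_dir = run_dir / "ASCEND_PROFILER_OUTPUT"
        report[run_dir.name] = [n for n in EXPECTED if not (out_dir / n).is_file()]
    return report


if __name__ == "__main__":
    for name, missing in summarize_results().items():
        print(name, "missing:", missing or "nothing")
```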