Ascend PyTorch Profiler APIs

Ascend PyTorch Profiler is a performance analysis tool for the PyTorch framework. By adding the Ascend PyTorch Profiler API to a PyTorch training or online inference script, you can sample profile data during the run and obtain visualizable profile data files when it completes, improving performance analysis efficiency. The API collects complete profile data for PyTorch training and online inference scenarios, including operator information at the PyTorch layer and CANN layer, bottom-layer NPU operator information, and operator memory usage, providing a comprehensive view of performance during training or online inference.

The Ascend PyTorch Profiler API tool supports the profile data sampling modes described in the following sections.


Restrictions

The Ascend PyTorch Profiler APIs support multiple profiling methods, but only one method can be enabled at a time.

Prerequisites

Profile Data Sampling and Parsing (torch_npu.profiler.profile)

  1. Add the following sample code to the training script (for example, train_*.py)/online inference script to configure profile data sampling parameters, and then start the training/online inference. The following is a code example:

    For details about the torch_npu.profiler.profile API in the following sample code, see Ascend PyTorch Profiler APIs.

    import torch
    import torch_npu

    ...

    experimental_config = torch_npu.profiler._ExperimentalConfig(
        export_type=torch_npu.profiler.ExportType.Text,
        profiler_level=torch_npu.profiler.ProfilerLevel.Level0,
        msprof_tx=False,
        aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone,
        l2_cache=False,
        op_attr=False,
        data_simplification=False,
        record_op_args=False,
        gc_detect_threshold=None
    )

    with torch_npu.profiler.profile(
        activities=[
            torch_npu.profiler.ProfilerActivity.CPU,
            torch_npu.profiler.ProfilerActivity.NPU
        ],
        schedule=torch_npu.profiler.schedule(wait=0, warmup=0, active=1, repeat=1, skip_first=1),
        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./result"),
        record_shapes=False,
        profile_memory=False,
        with_stack=False,
        with_modules=False,
        with_flops=False,
        experimental_config=experimental_config) as prof:
        for step in range(steps):
            train_one_step(step, steps, train_loader, model, optimizer, criterion)
            prof.step()

    Or

    import torch
    import torch_npu

    ...

    experimental_config = torch_npu.profiler._ExperimentalConfig(
        export_type=torch_npu.profiler.ExportType.Text,
        profiler_level=torch_npu.profiler.ProfilerLevel.Level0,
        msprof_tx=False,
        aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone,
        l2_cache=False,
        op_attr=False,
        data_simplification=False,
        record_op_args=False,
        gc_detect_threshold=None
    )

    prof = torch_npu.profiler.profile(
        activities=[
            torch_npu.profiler.ProfilerActivity.CPU,
            torch_npu.profiler.ProfilerActivity.NPU
        ],
        schedule=torch_npu.profiler.schedule(wait=0, warmup=0, active=1, repeat=1, skip_first=1),
        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./result"),
        record_shapes=False,
        profile_memory=False,
        with_stack=False,
        with_modules=False,
        with_flops=False,
        experimental_config=experimental_config)
    prof.start()
    for step in range(steps):
        train_one_step()
        prof.step()
    prof.stop()
    
    In addition to using tensorboard_trace_handler to export profile data, you can also use the following method:
    import torch
    import torch_npu

    ...

    with torch_npu.profiler.profile() as prof:
        for step in range(steps):
            train_one_step(step, steps, train_loader, model, optimizer, criterion)
    prof.export_chrome_trace('./chrome_trace_14.json')
    
  2. Parse profile data.

    Automatic parsing (see tensorboard_trace_handler and prof.export_chrome_trace in the preceding sample code) and offline parsing are supported.

  3. View the sampled PyTorch training/online inference profile data result files and profile data analysis.

    For details about profile data result files, see Data Storing Directories.
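The schedule(wait=0, warmup=0, active=1, repeat=1, skip_first=1) argument in the examples above determines which steps are profiled. As a rough illustration of the documented semantics (a pure-Python sketch, not the torch_npu implementation), the step-to-action mapping could look like this:

```python
from enum import Enum

class Action(Enum):
    NONE = 0      # step is skipped or falls in a wait phase
    WARMUP = 1    # profiler warms up; data is discarded
    RECORD = 2    # profile data is collected

def make_schedule(wait, warmup, active, repeat, skip_first):
    """Map a 0-based step index to a profiling action.

    Mirrors the documented semantics: skip the first `skip_first` steps,
    then repeat the cycle wait -> warmup -> active up to `repeat` times
    (repeat == 0 means the cycle repeats indefinitely).
    """
    cycle = wait + warmup + active
    def schedule(step):
        if step < skip_first:
            return Action.NONE
        step -= skip_first
        if repeat > 0 and step >= cycle * repeat:
            return Action.NONE  # all requested cycles are finished
        pos = step % cycle
        if pos < wait:
            return Action.NONE
        if pos < wait + warmup:
            return Action.WARMUP
        return Action.RECORD
    return schedule

sched = make_schedule(wait=0, warmup=0, active=1, repeat=1, skip_first=1)
# With the example settings, step 0 is skipped and only step 1 is recorded.
print([sched(s).name for s in range(4)])  # -> ['NONE', 'RECORD', 'NONE', 'NONE']
```

This makes it easy to check, before a long run, which steps a given parameter combination will actually record.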

Profile Data Sampling and Parsing (dynamic_profile)

dynamic_profile is used to start the profiling process at any time during model training/online inference.

dynamic_profile can be enabled in several ways; however, only one can be used at a time.

Using environment variables

  1. Configure the following environment variables:
    export PROF_CONFIG_PATH="profiler_config_path"

    After this environment variable is configured and training is started, dynamic_profile automatically creates the template file profiler_config.json in profiler_config_path. You can modify configuration options based on the template file.

    • This method applies only to training scenarios.
    • In this method, dynamic_profile cannot sample data of the first iteration (step 0).
    • This method relies on the step division performed by the native PyTorch Optimizer.step() during training. Custom optimizers are not supported.
    • profiler_config_path can contain only letters, digits, and underscores (_). Soft links are not supported.
  2. Start a training job.
  3. Open a new CLI and modify the profiler_config.json configuration file to enable the profiling task.
    The configuration file contains the profile data collection parameters of Profiler. You can see profiler_config.json File Description to modify the parameters in the configuration file to execute different profiling tasks.
    • dynamic_profile determines whether the profiler_config.json file is modified by identifying the status of the file.
      • dynamic_profile polls every two seconds. If the profiler_config.json file is modified, the sampling process is started. Then, the running interval between step 10 and step 11 is recorded and used as the new polling interval. The minimum interval is one second.
      • If the profiler_config.json file is modified during the dynamic_profile sampling process, the dynamic_profile sampling is started again after the sampling process ends.
    • You are advised to use shared storage to set profiler_config_path of dynamic_profile.
    • The dynamic_profile maintenance and test logs are automatically recorded in the profiler_config_path directory. For details, see Using dynamic_profile for Dynamic Profiling of Maintenance and Test Logs.
  4. Parse profile data.

    Automatic parsing and manual parsing are supported. For details, see the analyse parameter in Table 5.

  5. View the sampled PyTorch training/online inference profile data result files and profile data analysis.

    For details about profile data result files, see Data Storing Directories.
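dynamic_profile detects changes to profiler_config.json by polling the file status. A minimal pure-Python sketch of that polling idea (the configuration keys below are illustrative placeholders, not the actual template fields):

```python
import json
import os
import tempfile

def poll_config(path, last_mtime):
    """Sketch of the dynamic_profile polling idea (not the actual
    implementation): the file is re-read only when its modification
    time has moved since the last poll."""
    mtime = os.path.getmtime(path)
    if mtime == last_mtime:
        return False, last_mtime, None
    with open(path) as f:
        return True, mtime, json.load(f)

# Demo with illustrative (assumed) configuration keys.
cfg_path = os.path.join(tempfile.mkdtemp(), "profiler_config.json")
with open(cfg_path, "w") as f:
    json.dump({"activities": ["CPU", "NPU"], "analyse": False}, f)

changed, mtime, cfg = poll_config(cfg_path, last_mtime=0)
print(changed)  # -> True (first poll always sees a change)

changed, mtime, cfg = poll_config(cfg_path, mtime)
print(changed)  # -> False (file untouched since the last poll)

with open(cfg_path, "w") as f:
    json.dump({"activities": ["CPU", "NPU"], "analyse": True}, f)
os.utime(cfg_path, (mtime + 1, mtime + 1))  # force a fresh mtime for the demo
changed, mtime, cfg = poll_config(cfg_path, mtime)
print(changed, cfg["analyse"])  # -> True True
```

In the real tool the poll interval starts at two seconds and is then adjusted to the measured step duration, as described above.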

Modifying the user training/online inference script by adding the dynamic_profile API

  1. Add the following sample code to the training script (for example, train_*.py)/online inference script:
    # Load the dynamic_profile module.
    from torch_npu.profiler import dynamic_profile as dp
    # Set the profiling configuration file path.
    dp.init("profiler_config_path")
    ...
    for step in range(steps):
        train_one_step()
        # Divide steps.
        dp.step()

    During init, dynamic_profile automatically creates the template file profiler_config.json in profiler_config_path. You can modify configuration items based on the template file.

    profiler_config_path can contain only letters, digits, and underscores (_). Soft links are not supported.

  2. Start a training/online inference task.
  3. Open a new CLI and modify the profiler_config.json configuration file to enable the profiling task.
    The configuration file contains the profile data collection parameters of Profiler. You can see profiler_config.json File Description to modify the parameters in the configuration file to execute different profiling tasks.
    • dynamic_profile determines whether the profiler_config.json file is modified by identifying the status of the file.
      • dynamic_profile polls every two seconds. If the profiler_config.json file is modified, the sampling process is started. Then, the running interval between step 10 and step 11 is recorded and used as the new polling interval. The minimum interval is one second.
      • If the profiler_config.json file is modified during the dynamic_profile sampling process, the dynamic_profile sampling is started again after the sampling process ends.
    • You are advised to use shared storage to set profiler_config_path of dynamic_profile.
    • The dynamic_profile maintenance and test logs are automatically recorded in the profiler_config_path directory. For details, see Using dynamic_profile for Dynamic Profiling of Maintenance and Test Logs.
  4. Parse profile data.

    Automatic parsing and manual parsing are supported. For details, see the analyse parameter in Table 5.

  5. View the sampled PyTorch training/online inference profile data result files and profile data analysis.

    For details about profile data result files, see Data Storing Directories.

Modifying the user training/online inference script by adding the dp.start() function of dynamic_profile

  1. Add the following sample code to the training script (for example, train_*.py)/online inference script:
    # Load the dynamic_profile module.
    from torch_npu.profiler import dynamic_profile as dp
    # Set the path of the profiling configuration file of the init API.
    dp.init("profiler_config_path")
    ...
    for step in range(steps):
        if step == 5:
            # Set the path of the profiling configuration file of the start API.
            dp.start("start_config_path")
        train_one_step()
        # Divide steps. The code to be profiled must run between the dp.start() API and the dp.step() API.
        dp.step()

    start_config_path also specifies a profiler_config.json path. However, you need to manually create the configuration file by referring to profiler_config.json File Description and set the parameters based on your actual needs. The full file name must be specified, for example, dp.start("/home/xx/start_config_path/profiler_config.json").

    profiler_config_path and start_config_path can contain only letters, digits, and underscores (_). Soft links are not supported.

    • After the dp.start() function is added, when a training or online inference task proceeds to dp.start(), data is automatically sampled based on the profiler_config.json file specified by start_config_path. The dp.start() function does not detect modifications of the profiler_config.json file; it triggers a sampling task only during training or online inference.
    • After the dp.start() function is added and the training/online inference is started:
      • If no profiler_config.json configuration file is specified in dp.start(), or the specified file does not take effect due to an error, data is sampled based on the profiler_config.json file in the profiler_config_path directory when the task proceeds to dp.start().
      • If the script proceeds to dp.start() while the dynamic_profile configured in dp.init() is active, dp.start() does not take effect.
      • If the script proceeds to dp.start() after the dynamic_profile finishes sampling as specified in dp.init(), sampling continues with dp.start() and a new profile data directory is generated in the prof_dir directory.
      • If the profiler_config.json file in the profiler_config_path directory is modified while the dynamic_profile configured in dp.start() is active, dp.init() sampling starts after the dp.start() sampling finishes, and a new profile data file is generated in the prof_dir directory.
    • You are advised to use shared storage to set profiler_config_path of dynamic_profile.
  2. Start a training/online inference task.
  3. Parse profile data.

    Automatic parsing and manual parsing are supported. For details, see the analyse parameter in Table 5.

  4. View the sampled PyTorch training/online inference profile data result files and profile data analysis.

    For details about profile data result files, see Data Storing Directories.

Profile Data Sampling and Parsing (torch_npu.profiler._KinetoProfile)

  1. Add the following sample code to the training script (for example, train_*.py)/online inference script to configure profile data sampling parameters, and then start the training/online inference. The following is a code example:

    For details about the torch_npu.profiler._KinetoProfile API in the following sample code, see Ascend PyTorch Profiler APIs.

    import torch
    import torch_npu

    ...

    prof = torch_npu.profiler._KinetoProfile(activities=None, record_shapes=False, profile_memory=False, with_stack=False, with_flops=False, with_modules=False, experimental_config=None)
    for epoch in range(epochs):
        train_model_step()
        if epoch == 0:
            prof.start()
        if epoch == 1:
            prof.stop()
    prof.export_chrome_trace("result_dir/trace.json")

    In this method, schedule and tensorboard_trace_handler cannot be used to export profile data.

  2. Parse profile data.

    Automatic parsing is supported. For details, see prof.export_chrome_trace in the preceding sample code.

  3. View the sampled PyTorch training/online inference profile data result files and profile data analysis.

    For details about profile data result files, see Data Storing Directories.

(Optional) Sampling and Parsing msprof_tx

In large cluster scenarios, traditional profiling involves a large amount of data and a complex analysis process. You can use the msprof_tx parameter of experimental_config to enable custom marking (dotting), customize the collection period or the start and end times of key functions, and identify key functions or iterations to quickly demarcate performance issues.

The usage and sample code are as follows:

  1. Enable torch_npu.profiler and msprof_tx, and set profiler_level to Level_none (the level can be configured based on the actual collection requirements) to collect dotting data.
  2. In the PyTorch script, call the marking APIs of torch_npu.npu.mstx (torch_npu.npu.mstx.mark, torch_npu.npu.mstx.range_start, torch_npu.npu.mstx.range_end, and torch_npu.npu.mstx.mstx_range) to record the duration of sampled events. For details about the APIs, see "Ascend Extension for PyTorch Custom API" > "torch_npu.npu" > "profiler" in Ascend Extension for PyTorch API Reference.

Only the range duration on the host is recorded.

id = torch_npu.npu.mstx.range_start("dataloader", None)    # If the second input parameter is set to None or not set, only the range duration on the host is recorded.
dataloader()
torch_npu.npu.mstx.range_end(id)

Mark IDs on compute streams to record the range time on the host and device.

stream = torch_npu.npu.current_stream()
id = torch_npu.npu.mstx.range_start("matmul", stream)    # Set the second input parameter to a valid stream to record the range time on the host and device.
torch.matmul()
torch_npu.npu.mstx.range_end(id)

Mark IDs on the collective communication stream:

from torch.distributed.distributed_c10d import _world

if torch.__version__ != '1.11.0':
    stream_id = _world.default_pg._get_backend(torch.device('npu'))._get_stream_id(False)
    collective_stream = torch.npu.Stream(stream_id=stream_id, device_type=20, device_index=device_id)    # Use device_index to specify the device ID of the actual service.
else:
    stream_id = _world.default_pg._get_stream_id(False)
    current_stream = torch.npu.current_stream()
    cdata = current_stream._cdata & 0xffff000000000000
    collective_stream = torch.npu.Stream(_cdata=(stream_id + cdata), device_index=device_id)    # Use device_index to specify the device ID of the actual service.
id = torch_npu.npu.mstx.range_start("allreduce", collective_stream)    # Set the second input parameter to a valid stream to record the range time on the host and device.
torch.allreduce()
torch_npu.npu.mstx.range_end(id)

Mark IDs on the P2P communication stream:

from torch.distributed.distributed_c10d import _world

if torch.__version__ != '1.11.0':
    stream_id = _world.default_pg._get_backend(torch.device('npu'))._get_stream_id(True)
    p2p_stream = torch.npu.Stream(stream_id=stream_id, device_type=20, device_index=device_id)    # Use device_index to specify the device ID of the actual service.
else:
    stream_id = _world.default_pg._get_stream_id(True)
    current_stream = torch.npu.current_stream()
    cdata = current_stream._cdata & 0xffff000000000000
    p2p_stream = torch.npu.Stream(_cdata=(stream_id + cdata), device_index=device_id)    # Use device_index to specify the device ID of the actual service.
id = torch_npu.npu.mstx.range_start("send", p2p_stream)    # Set the second input parameter to a valid stream to record the range time on the host and device.
torch.send()
torch_npu.npu.mstx.range_end(id)

To collect data in these scenarios, you need to configure the torch_npu.profiler.profile API and enable the msprof_tx switch. The following is an example:

import torch_npu

stream = torch_npu.npu.current_stream()
id = torch_npu.npu.mstx.range_start("Func", stream)    # Mark the start of the range of the func function on the host and device.
func()    # Service code.
torch_npu.npu.mstx.range_end(id)    # Mark the end of the range of the func function on the host and device.

experimental_config = torch_npu.profiler._ExperimentalConfig(
    profiler_level=torch_npu.profiler.ProfilerLevel.Level_none,
    msprof_tx=True,
    export_type=torch_npu.profiler.ExportType.Db
)
with torch_npu.profiler.profile(
    schedule=torch_npu.profiler.schedule(wait=1, warmup=1, active=2, repeat=2, skip_first=1),
    on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./result"),
    experimental_config=experimental_config) as prof:

    for epoch in range(epochs):
        torch_npu.npu.mstx.mark("train epoch start")    # Mark the instantaneous moment on the host and device. You can also use torch_npu.npu.mstx().mark("train epoch start").
        for step in range(steps):
            train_one_step(step, steps, train_loader, model, optimizer, criterion)
            prof.step()

By default, the msprof_tx function collects communication operator profile data in the format of comm:{communication name},{communicator name},{input data format},{input data counts}. For example, comm:HcclBroadcast,xxxxxx,int64,5, where xxxxxx indicates the communicator name.
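The communication mark string has a fixed comma-separated layout, so it can be split mechanically. Below is a sketch parser for the documented format; the communicator name "group0" is an illustrative placeholder (the real name is environment-specific):

```python
def parse_comm_mark(text):
    """Parse the documented msprof_tx communication mark format:
    comm:{communication name},{communicator name},{input data format},{input data counts}."""
    if not text.startswith("comm:"):
        raise ValueError("not a communication mark: %r" % text)
    op, group, dtype, count = text[len("comm:"):].split(",")
    return {"op": op, "group": group, "dtype": dtype, "count": int(count)}

# "group0" stands in for the environment-specific communicator name.
mark = parse_comm_mark("comm:HcclBroadcast,group0,int64,5")
print(mark["op"], mark["count"])  # -> HcclBroadcast 5
```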

This function allows you to view the execution and scheduling status of user-defined dotting from the framework to the CANN layer and then to the NPU, helping you identify key functions or events to be observed and demarcate performance issues.

For details about msprof_tx sampling results, see msproftx Data Description.

(Optional) Sampling Environment Variable Information

The Ascend PyTorch Profiler API profiles environment variable information by default. The following environment variables can be sampled:

  • "ASCEND_GLOBAL_LOG_LEVEL"
  • "HCCL_RDMA_TC"
  • "HCCL_RDMA_SL"
  • "ACLNN_CACHE_LIMIT"

Procedure:

  1. Configure environment variables. The following is an example:
    export ASCEND_GLOBAL_LOG_LEVEL=1
    export HCCL_RDMA_TC=0
    export HCCL_RDMA_SL=0
    export ACLNN_CACHE_LIMIT=4096

    Set the environment variables based on the actual requirements.

  2. Call the Ascend PyTorch Profiler API for sampling.
  3. View the result data.
    • When the export_type of experimental_config is set to torch_npu.profiler.ExportType.Text, the environment variables configured in the preceding steps are stored in the profiler_metadata.json file in the {worker_name}_{timestamp}_ascend_pt directory.
    • When the export_type of experimental_config is set to torch_npu.profiler.ExportType.Db, the environment variable information is written into the META_DATA table in the ascend_pytorch_profiler_{rank_id}.db file.
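Conceptually, recording these variables amounts to snapshotting the listed names from the process environment. A minimal sketch of that idea (not the profiler's internal code):

```python
import json
import os

# The variable names documented as sampled by the profiler.
PROFILED_ENV_VARS = (
    "ASCEND_GLOBAL_LOG_LEVEL",
    "HCCL_RDMA_TC",
    "HCCL_RDMA_SL",
    "ACLNN_CACHE_LIMIT",
)

def snapshot_env():
    """Collect only the variables that are actually set: the profiler
    can only record values present in the process environment."""
    return {name: os.environ[name] for name in PROFILED_ENV_VARS
            if name in os.environ}

os.environ["ACLNN_CACHE_LIMIT"] = "4096"
print(json.dumps(snapshot_env()))
```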

(Optional) Marking the Profile Data Collection Process With Custom Character String Keys and Values

  • Example 1
    with torch_npu.profiler.profile(...) as prof:
        prof.add_metadata(key, value)
  • Example 2
    with torch_npu.profiler._KinetoProfile(...) as prof:
        prof.add_metadata_json(key, value)

add_metadata and add_metadata_json can be configured under torch_npu.profiler.profile and torch_npu.profiler._KinetoProfile. They need to be added in the code of the profile data collection process, after profiler initialization and before finalization.

Table 1 add_metadata API description

Class and Function Name

Description

add_metadata

Adds the character string flag. The following parameters are included:

  • key: character string key.
  • value: character string value.

For example:

prof.add_metadata("test_key1", "test_value1")

add_metadata_json

Adds the character string flag in JSON format. The following parameters are included:

  • key: character string key.
  • value: character string value, in JSON format.

For example:

prof.add_metadata_json("test_key2", [1, 2, 3])

The metadata passed by calling this API is written to the profiler_metadata.json file in the root directory of the collection result of the Ascend PyTorch Profiler API.
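As an illustration of how such key/value flags could end up in a single profiler_metadata.json file, here is a minimal sketch (the MetadataStore class is a stand-in for illustration, not the torch_npu implementation):

```python
import json
import os
import tempfile

class MetadataStore:
    """Sketch of the add_metadata bookkeeping: plain values are stored
    as strings, JSON values are kept structured so the output file
    remains valid JSON."""
    def __init__(self):
        self._meta = {}

    def add_metadata(self, key, value):
        self._meta[key] = str(value)

    def add_metadata_json(self, key, value):
        # Accept either a JSON string or an already-serializable object.
        self._meta[key] = json.loads(value) if isinstance(value, str) else value

    def dump(self, result_dir):
        path = os.path.join(result_dir, "profiler_metadata.json")
        with open(path, "w") as f:
            json.dump(self._meta, f)
        return path

store = MetadataStore()
store.add_metadata("test_key1", "test_value1")
store.add_metadata_json("test_key2", [1, 2, 3])
path = store.dump(tempfile.mkdtemp())
with open(path) as f:
    print(json.load(f))  # -> {'test_key1': 'test_value1', 'test_key2': [1, 2, 3]}
```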

(Optional) Video Memory Visualization

The function classifies and displays the data occupying the storage space during model training. Export the visualization file memory_timeline.html through export_memory_timeline. To output an HTML file, install matplotlib in the Python environment and set the record_shapes, profile_memory, and with_stack (or with_modules) parameters of torch_npu.profiler.profile to True. In addition, when this function is used, an ascend_pt data file is generated in the current directory. The following is an example:

import torch
import torch_npu
...

def trace_handler(prof: torch_npu.profiler.profile):
    prof.export_memory_timeline(output_path="./memory_timeline.html", device="npu:0")

with torch_npu.profiler.profile(
    activities=[
        torch_npu.profiler.ProfilerActivity.CPU,
        torch_npu.profiler.ProfilerActivity.NPU
    ],
    schedule=torch_npu.profiler.schedule(wait=0, warmup=0, active=4, repeat=1, skip_first=0),
    on_trace_ready=trace_handler,
    record_shapes=True,           # Set it to True.
    profile_memory=True,          # Set it to True.
    with_stack=True,              # Set either with_stack or with_modules to True.
    with_modules=True
) as prof:
    for _ in range(steps):
        ...
        prof.step()

After data sampling, the memory_timeline.html file is exported, with the following visualization effect:

Figure 1 memory_timeline
  • Time (ms): Horizontal coordinate, indicating the memory occupation time of the tensors (unit: ms).
  • Memory (GB): Vertical coordinate, indicating the memory size occupied by the tensors (unit: GB).
  • Max memory allocated: Allocated maximum memory size (unit: GB).
  • Max memory reserved: Reserved maximum memory size (unit: GB).
  • PARAMETER: Model parameters and model weights.
  • OPTIMIZER_STATE: Optimizer status. For example, the Adam optimizer records some statuses during model training.
  • INPUT: Input data.
  • TEMPORARY: Temporarily occupied. It is defined as tensors that are allocated and then released for a single operator. Generally, these tensors store intermediate values.
  • ACTIVATION: Activation values obtained in forward computation.
  • GRADIENT: Gradient value.
  • AUTOGRAD_DETAIL: Memory usage generated during backward computation.
  • UNKNOWN: Unknown type.
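The per-category curves in the chart can be reconstructed from the raw memory events described later (timestamp, action, numbytes, category). The following sketch replays such events into a usage curve and the overall peak, under the assumption that only PREEXISTING/CREATE and DESTROY change the footprint (INCREMENT_VERSION is treated as version bookkeeping only):

```python
from collections import defaultdict

def replay(events):
    """Replay raw memory events into a per-category usage curve plus the
    overall peak, the quantities the timeline chart plots over time.

    Each event is (timestamp, action, numbytes, category)."""
    usage = defaultdict(int)
    peak = 0
    curve = []
    for ts, action, numbytes, category in sorted(events):
        if action in ("PREEXISTING", "CREATE"):
            usage[category] += numbytes
        elif action == "DESTROY":
            usage[category] -= numbytes
        # Assumption: INCREMENT_VERSION changes a tensor's version,
        # not its memory footprint.
        peak = max(peak, sum(usage.values()))
        curve.append((ts, dict(usage)))
    return curve, peak

events = [
    (0, "PREEXISTING", 100, "PARAMETER"),
    (1, "CREATE", 50, "ACTIVATION"),
    (2, "CREATE", 30, "GRADIENT"),
    (3, "DESTROY", 50, "ACTIVATION"),
]
curve, peak = replay(events)
print(peak)          # -> 180
print(curve[-1][1])  # -> {'PARAMETER': 100, 'ACTIVATION': 0, 'GRADIENT': 30}
```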

Ascend PyTorch Profiler APIs

Table 2 torch_npu.profiler.profile and torch_npu.profiler._KinetoProfile configuration parameters

Parameter

Description

Required (Yes/No)

activities

CPU/NPU event collection list, Enum type. Possible values are:

  • torch_npu.profiler.ProfilerActivity.CPU: framework-side data collection switch.
  • torch_npu.profiler.ProfilerActivity.NPU: CANN software stack and NPU data collection switch.

By default, the two switches are turned on at the same time.

No

schedule

Behavior of each step, Callable type. It is controlled by the schedule class. By default, no operation is performed.

This parameter is not supported by torch_npu.profiler._KinetoProfile.

No

on_trace_ready

Operation automatically performed after the collection ends, Callable type. The tensorboard_trace_handler function is supported. If a large amount of data is sampled and direct parsing of the profile data in the current environment is unsuitable, or the training/online inference process is interrupted during sampling so that only part of the profile data is sampled, offline parsing can be used.

By default, no operation is performed.

This parameter is not supported by torch_npu.profiler._KinetoProfile.

No

record_shapes

InputShapes and InputTypes of an operator, Bool type. Possible values are:

  • True: enabled.
  • False: disabled. This is the default value.

This parameter takes effect when torch_npu.profiler.ProfilerActivity.CPU is enabled.

No

profile_memory

Memory usage of an operator, Bool type. Possible values are:

  • True: enabled.
  • False: disabled. This is the default value.
NOTE:

Sampling memory data in the environment where glibc (2.34 or an earlier version) is installed may trigger a known Bug 19329. This problem can be solved by upgrading the glibc version.

No

with_stack

Operator call stack, Bool type, including the call information at the framework layer and CPU operator layer. Possible values are:

  • True: enabled.
  • False: disabled. This is the default value.

This parameter takes effect when torch_npu.profiler.ProfilerActivity.CPU is enabled.

NOTE:

Enabling this configuration will cause extra performance overhead.

No

with_modules

Python call stack at the modules level, that is, call information at the framework layer, which is of the Boolean type. Possible values are:

  • True: enabled.
  • False: disabled. This is the default value.

This parameter takes effect when torch_npu.profiler.ProfilerActivity.CPU is enabled.

NOTE:

Enabling this configuration will cause extra performance overhead.

No

with_flops

Floating-point operations (FLOPs) of an operator, Bool type. Currently, this parameter cannot be used for profile data parsing. Possible values are:

  • True: enabled.
  • False: disabled. This is the default value.

This parameter takes effect when torch_npu.profiler.ProfilerActivity.CPU is enabled.

No

experimental_config

Extended parameter, used to configure common collection items of the performance analysis tool. For details about the supported collection items, see experimental_config Parameter Description.

No

use_cuda

CUDA profile data collection switch, Bool type. This parameter is not supported in Ascend environments. Possible values are:

  • True: enabled.
  • False: disabled. This is the default value.

This parameter is not supported by torch_npu.profiler._KinetoProfile.

No

Table 3 torch_npu.profiler.profile and torch_npu.profiler._KinetoProfile method description

Method Name

Description

step

Divides different iterations.

This method is not supported by torch_npu.profiler._KinetoProfile.

export_chrome_trace

Exports trace data, and writes it to a specified .json file. The trace data contains the running time and association relationships of operators and APIs displayed after the Ascend PyTorch Profiler APIs integrate the CANN software stack and NPU data on the framework side. The following parameters are included:

  • path: Path of the trace file (.json). The specified file path can contain only letters, digits, and underscores (_). Soft links are not supported. This parameter is mandatory.

If torch_npu.profiler.tensorboard_trace_handler is set, export_chrome_trace does not take effect.

In a multi-device scenario, you need to set different file names for different devices. The sample code is as follows:

import os

pid = os.getpid()
prof.export_chrome_trace(f'./chrome_trace_{pid}.json')

export_stacks

Exports stack information to a file. The following parameters are included:

  • path: Path for storing the stack file. Configure the file name as *.log and specify a path, for example, /home/*.log. If only a file name is configured, the file is stored in the current directory. The path can contain only letters, digits, and underscores (_). Soft links are not supported. This parameter is mandatory.
  • metric: Processor type to save, which can be CPU or NPU, corresponding to the value self_cpu_time_total or self_npu_time_total, respectively. This parameter is mandatory.

The location of this method is the same as that of the export_chrome_trace method in the training/online inference script. The following is an example:

prof.export_stacks('result_dir/stack.log', metric='self_npu_time_total')

You can use the FlameGraph tool to view the exported result file as follows:

git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --title "NPU time" --countname "us." profiler.stacks > perf_viz.svg
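flamegraph.pl consumes collapsed stacks: one "outer;...;inner count" line per unique stack. The sketch below shows how sampled stacks aggregate into that format (frame names are illustrative; this is the FlameGraph input convention, not the exact export_stacks file layout):

```python
from collections import Counter

def fold(samples):
    """Aggregate call-stack samples into FlameGraph's collapsed format:
    one 'outer;...;inner <count>' line per unique stack."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

# Illustrative frame names only.
samples = [
    ["train_one_step", "forward", "matmul"],
    ["train_one_step", "forward", "matmul"],
    ["train_one_step", "backward"],
]
for line in fold(samples):
    print(line)
# -> train_one_step;backward 1
# -> train_one_step;forward;matmul 2
```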

export_memory_timeline

Exports the memory event information of a specified device from the sampled data as a timeline graph. export_memory_timeline can produce three kinds of files, selected by the suffix of output_path.

  • For HTML-compatible charts, the suffix .html is used, and the memory timeline is embedded in the HTML file as a PNG file.
  • For plot points of [timestamp, [sizes by category]], timestamp is the timestamp, and sizes is the memory usage of each category. The memory timeline is saved as a .json file or a compressed .json.gz file, depending on the file name extension.
  • For raw memory information, use the suffix raw.json.gz. Each raw memory event is composed of timestamp, action, numbytes, and category, where action refers to [PREEXISTING, CREATE, INCREMENT_VERSION, DESTROY] and category refers to [PARAMETER, OPTIMIZER_STATE, INPUT, TEMPORARY, ACTIVATION, GRADIENT, AUTOGRAD_DETAIL, UNKNOWN].

Parameters:

  • output_path: Path of the exported result file, string type. The format is output_path = "$PATH/*.html", where $PATH indicates the path of the result file and * indicates the name of the result file. If the path or file does not exist, it is automatically created. This parameter is mandatory.
  • device: Device whose data is to be exported, string type. The format is device = "npu:*", where * indicates a device ID or rank ID that exists in the sampled data. Only one value can be specified. This parameter is mandatory.

Configuration examples:

prof.export_memory_timeline(output_path="./memory_timeline.html", device="npu:0")

For details, see (Optional) Video Memory Visualization.

start

Sets the position where data collection starts. Refer to the following example to add start and stop before and after the training/online inference code for which profile data is to be sampled:

prof = torch_npu.profiler.profile(
    on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./result"))
for step in range(steps):
    if step == 5:
        prof.start()
    train_one_step()
    if step == 5:
        prof.stop()

stop

Sets the position where data collection ends. Before using this method, execute start first.

Table 4 torch_npu.profiler class and function description

Class and Function Name

Description

torch_npu.profiler.schedule

Sets the action for each step. By default, this operation is not performed. To obtain more stable profile data, set specific parameters of this category. For details about the parameter values and usage, see torch_npu.profiler.schedule Parameter Description.

torch_npu.profiler.tensorboard_trace_handler

Exports the collected profile data to a format supported by the TensorBoard tool. Parameters:

  • dir_name: Output directory of the sampled profile data, string type. The path can contain only letters, digits, and underscores (_). Soft links are not supported. If no path is specified after the tensorboard_trace_handler function is configured, profile data is flushed to the current directory by default. This parameter is optional.

    If on_trace_ready=torch_npu.profiler.tensorboard_trace_handler is not used in the code, the flushed profile data is raw data and needs to be parsed offline.

    This function has a higher priority than ASCEND_WORK_PATH. For details, see the Environment Variables.

  • worker_name: Identifier of the unique worker thread, string type. The default value is {hostname}_{pid}. The name can contain only letters, digits, and underscores (_). Soft links are not supported. This parameter is optional.
  • analyse_flag: Profile data analysis flag, Boolean type. The value can be True (automatic analysis is enabled; default) or False (automatic analysis is disabled, and sampled profile data can be analyzed offline). This parameter is optional.

This function is not supported by torch_npu.profiler._KinetoProfile.

torch_npu.profiler.ProfilerAction

Profiler status, Enum type. Possible values are:

  • NONE: no action.
  • WARMUP: warm-up for profile data collection.
  • RECORD: profile data collection.
  • RECORD_AND_SAVE: profile data collection and saving.

torch_npu.profiler._ExperimentalConfig

Profile data collection extension configuration class. It is passed to the experimental_config parameter of torch_npu.profiler.profile. For details, see experimental_config Parameter Description.

torch_npu.profiler.supported_activities

Queries the CPU and NPU events that can be collected by the activities parameter.

torch_npu.profiler.supported_profiler_level

Queries the profiler_level values currently supported by the experimental_config parameter.

torch_npu.profiler.supported_ai_core_metrics

Queries the AI Core performance metrics currently supported by the experimental_config parameter.

torch_npu.profiler.supported_export_type

Queries the supported profile data result file types of torch_npu.profiler.ExportType.

Profile data occupies disk space; if the disk space is exhausted, the server may become unavailable. The space required by profile data is closely related to the model parameters, collection configurations, and number of collection iterations. Ensure that the directory to which profile data is flushed has sufficient free disk space.

profiler_config.json File Description

The content of the profiler_config.json file is as follows (the default settings are used as an example):

{
 "activities": ["CPU", "NPU"],
 "prof_dir": "./",
 "analyse": false,
 "record_shapes": false,
 "profile_memory": false,
 "with_stack": false,
 "with_flops": false,
 "with_modules": false,
 "active": 1,
 "start_step": 0,
 "is_rank": false,
 "rank_list": [],
 "experimental_config": {
  "profiler_level": "Level0",
  "aic_metrics": "AiCoreNone",
  "l2_cache": false,
  "op_attr": false,
  "gc_detect_threshold": null,
  "data_simplification": true,
  "record_op_args": false,
  "export_type": "text",
  "msprof_tx": false
 }
}
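The file above can also be generated or adjusted programmatically with the standard library. The sketch below mirrors the default key set shown above; write_config is a hypothetical helper, not part of torch_npu:

```python
import json

# Default dynamic_profile configuration, mirroring the documented file.
default_config = {
    "activities": ["CPU", "NPU"],
    "prof_dir": "./",
    "analyse": False,
    "record_shapes": False,
    "profile_memory": False,
    "with_stack": False,
    "with_flops": False,
    "with_modules": False,
    "active": 1,
    "start_step": 0,
    "is_rank": False,
    "rank_list": [],
    "experimental_config": {
        "profiler_level": "Level0",
        "aic_metrics": "AiCoreNone",
        "l2_cache": False,
        "op_attr": False,
        "gc_detect_threshold": None,
        "data_simplification": True,
        "record_op_args": False,
        "export_type": "text",
        "msprof_tx": False,
    },
}

def write_config(path, overrides=None):
    """Write profiler_config.json, optionally overriding top-level keys."""
    config = dict(default_config)          # shallow copy: overrides replace top-level keys
    config.update(overrides or {})
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=1)
    return config

# Example: enable call-stack sampling for the next dynamic sampling run.
config = write_config("profiler_config.json", {"with_stack": True})
```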
Table 5 Parameters

Parameter

Description

Required (Yes/No)

activities

CPU/NPU event sampling list. Possible values are:

  • CPU: Framework data sampling switch.
  • NPU: CANN software stack and NPU data sampling switch.

By default, the two switches are turned on at the same time.

No

prof_dir

Output directory of the sampled profile data. The default directory is ./. The path can contain only letters, digits, and underscores (_). Soft links are not supported.

No

analyse

Switch for automatic parsing of profile data. Possible values are:

  • true: enables automatic parsing.
  • false (default): disables automatic parsing. The collected profile data can be analyzed offline.

No

record_shapes

Switch for sampling the InputShapes and InputTypes of operators. Possible values are:

  • true: enabled.
  • false (default): disabled.

This parameter is valid only when activities is set to CPU.

No

profile_memory

Switch for sampling the memory usage of operators. Possible values are:

  • true: enabled.
  • false (default): disabled.
NOTE:

Sampling memory data in the environment where glibc (2.34 or an earlier version) is installed may trigger a known Bug 19329. This problem can be solved by upgrading the glibc version.

No

with_stack

Switch for sampling the operator call stack, including call information at the framework layer and the CPU operator layer. Possible values are:

  • true: enabled.
  • false (default): disabled.

This parameter is valid only when activities is set to CPU.

No

with_flops

Switch for sampling the floating-point operations (FLOPs) of operators, Bool type. Currently, the collected data cannot be parsed. Possible values are:

  • true: enabled.
  • false (default): disabled.

This parameter is valid only when activities is set to CPU.

No

with_modules

Switch for sampling the Python call stack at the module level, that is, call information at the framework layer. Possible values are:

  • true: enabled.
  • false (default): disabled.

This parameter is valid only when activities is set to CPU.

No

is_rank

Enables the function of sampling data of a specified rank. Possible values are:

  • true: enabled.
  • false (default): disabled.

After this function is enabled, dynamic_profile identifies the rank IDs configured in the rank_list parameter and samples data only for those ranks. If rank_list is empty after this function is enabled, no profile data is sampled.

After this function is enabled, automatic analysis does not take effect. You need to use offline analysis.

No

rank_list

Rank IDs to be sampled, a list of integers. The default value is an empty list, indicating that no profile data is sampled. Each value must be a valid rank ID in the environment. You can specify one or more ranks at a time, for example, "rank_list": [1,2,3].

No

experimental_config

Extended parameter, used to configure common collection items of the performance analysis tool. For details, see experimental_config Parameter Description (dynamic_profile Scenario).

In the dynamic sampling scenario, set the sub-parameter options of experimental_config in the configuration file to the actual parameter values, for example, "aic_metrics": "PipeUtilization".

No

metadata

Samples model hyperparameters (keys) and configuration information (values).

  • If "export_type": "text" is configured, data is saved to the profiler_metadata.json file in the {worker_name}_{timestamp}_ascend_pt directory.
  • If "export_type": "db" is configured, data is saved to the META_DATA table in the ascend_pytorch_profiler_{rank_id}.db file and the profiler_metadata.json file in the {worker_name}_{timestamp}_ascend_pt directory.

Configuration examples:

 "metadata": {
  "distributed_args":{
   "tp":2,
   "pp":4,
   "dp":8
  }
 }

No

active

Number of iterations for data collection. The value is a positive integer. The default value is 1.

No

experimental_config Parameter Description (dynamic_profile Scenario)

All experimental_config parameters are optional. The following table lists the sampling items that can be extended.

Table 6 experimental_config

Parameter

Description

export_type

Format of the exported profile data result file. Possible values are:

  • text: parsed into timeline and summary files in .json and .csv formats. For details, see Data Storing Directories.
  • db: parsed into a .db file (ascend_pytorch_profiler_{rank_id}.db or analysis.db) that summarizes all profile data. on_trace_ready API-based export and offline analysis-based export are supported. The Ascend-CANN-Toolkit (CANN 8.0.RC1 or later), which supports the export of .db files, must be installed.

If this parameter is set to an invalid value or is not set, the default value text is used.

profiler_level

Profile level. Possible values are:
  • Level_none: Does not sample data controlled by any level; that is, profiler_level is disabled.
  • Level0: Samples upper-layer application data, bottom-layer NPU data, and information about operators executed on the NPU. This is the default value. When this parameter is set, only partial data is collected, and some operator information is not collected. For details, see op_summary (Operator Details).
  • Level1: Samples AscendCL data at the CANN layer, performance metrics of AI Cores executed on the NPU, data generated when aic_metrics=torch_npu.profiler.AiCMetrics.PipeUtilization is enabled, and communication.json, communication_matrix.json, and api_statistic.csv files of HCCL, in addition to data sampled in Level0.
  • Level2: Samples runtime data and AI CPU data (data_preprocess.csv file) at the CANN layer, in addition to data sampled in Level1.

msprof_tx

Dotting switch, used to enable the user-defined dotting (marker) function. The value can be true (on) or false (off). The default value is false. For details about this parameter, see (Optional) Sampling and Parsing msprof_tx.

data_simplification

Data simplification mode. After it is enabled, data in the FRAMEWORK directory and redundant data will be deleted after profile data is exported. Only the profiler_info.json file and raw profile data in the ASCEND_PROFILER_OUTPUT and PROF_XXX directories are retained. This saves the storage space. The value can be true (on) or false (off). The default value is true.

aic_metrics

AI Core metrics to profile. The results of these profiling items are displayed in the Kernel View. For details about the results, see op_summary (Operator Details); the actual collection result may vary. Possible values are:
  • AiCoreNone: disables AI Core performance metric collection. This is the default value.
  • PipeUtilization: percentages of time taken by compute units and MTEs.
  • ArithmeticUtilization: arithmetic utilization ratio.
  • Memory: ratio of external memory read/write instructions.
  • MemoryL0: ratio of internal memory L0 read/write instructions.
  • ResourceConflictRatio: percentages of pipeline queue instructions.
  • MemoryUB: ratio of internal memory UB read/write instructions.

l2_cache

L2 cache data collection switch. The value can be true (on) or false (off). The default value is false. This profiling item generates the l2_cache.csv file in ASCEND_PROFILER_OUTPUT. For details about the result fields, see l2_cache (L2 Cache Hit Ratio).

op_attr

Operator attribute data sampling switch. Currently, the collection applies only to aclnn operators. The value can be true (on) or false (off). The default value is false. The profile data sampled by this parameter can be parsed only into .db files, that is, export_type must be set to db. This parameter does not take effect when profiler_level is Level_none.

record_op_args

Operator statistics switch. The value can be true (on) or false (off). The default value is false. After it is enabled, a file with collected operator information is generated in the {worker_name}_{timestamp}_ascend_pt_op_args directory.

NOTE:

This parameter is used when the AOE tool performs tuning in the PyTorch training scenario. You are advised not to enable it together with other profile data collection APIs. For details, see the AOE Instructions.

gc_detect_threshold

GC detection threshold. The value is greater than or equal to 0 (unit: ms). If the threshold is a number, GC detection is enabled and only GC events that exceed the threshold are sampled.

If this parameter is set to 0, all GC events are sampled. (Exercise caution when setting this parameter because a large amount of data may be sampled.) The recommended value is 1ms.

The default value is null, indicating that the GC detection function is disabled.

GC is used by the Python process to reclaim the memory of destroyed objects.

If export_type is set to text, a GC layer is generated in the analysis result file trace_view.json.

If export_type is set to db, the GC_RECORD table is generated in the ascend_pytorch_profiler_{rank_id}.db file.

experimental_config Parameter Description

All experimental_config parameters are optional. The following table lists the sampling items that can be extended.

Table 7 experimental_config

Parameter

Description

export_type

Format of the exported profile data result file, Enum type. Possible values are:

  • torch_npu.profiler.ExportType.Text: parsed into timeline and summary files in .json and .csv formats. For details, see Data Storing Directories.
  • torch_npu.profiler.ExportType.Db: parsed into a .db file (ascend_pytorch_profiler_{rank_id}.db or analysis.db) that summarizes all profile data. on_trace_ready API-based export and offline analysis-based export are supported. The Ascend-CANN-Toolkit (CANN 8.0.RC1 or later) that supports the export of .db files needs to be installed.

If this parameter is set to an invalid value or is not set, the default value torch_npu.profiler.ExportType.Text is used.

profiler_level

Collection level, Enum type. Possible values are:
  • torch_npu.profiler.ProfilerLevel.Level_none: Does not sample data controlled by any level; that is, profiler_level is disabled.
  • torch_npu.profiler.ProfilerLevel.Level0: Samples upper-layer application data, bottom-layer NPU data, and information about operators executed on the NPU. This is the default value. When this parameter is set, only partial data is collected, and some operator information is not sampled. For details, see the description when task_time is set to l0 in op_summary (Operator Details).
  • torch_npu.profiler.ProfilerLevel.Level1: collects AscendCL data at the CANN layer, performance metrics of AI Cores executed on the NPU, data generated when aic_metrics=torch_npu.profiler.AiCMetrics.PipeUtilization is enabled, and communication.json, communication_matrix.json, and api_statistic.csv files of HCCL, in addition to data collected in Level0.
  • torch_npu.profiler.ProfilerLevel.Level2: collects runtime data and AI CPU data (data_preprocess.csv file) at the CANN layer, in addition to data collected in Level1.

msprof_tx

Dotting switch, Bool type, used to enable the user-defined dotting (marker) function. The value can be True (on) or False (off). The default value is False. For details about this parameter, see (Optional) Sampling and Parsing msprof_tx.

data_simplification

Data simplification mode (Boolean type). After it is enabled, data in the FRAMEWORK directory and redundant data will be deleted after profile data is exported. Only the profiler_info.json file and raw profile data in the ASCEND_PROFILER_OUTPUT and PROF_XXX directories are retained. This saves the storage space. The value can be True (on) or False (off). The default value is True.

aic_metrics

AI Core metrics to profile. The results of these profiling items are displayed in the Kernel View. For details about the results, see op_summary (Operator Details); the actual collection result may vary. Possible values are:
  • AiCoreNone: disables AI Core performance metric collection. This is the default value.
  • PipeUtilization: percentages of time taken by compute units and MTEs.
  • ArithmeticUtilization: arithmetic utilization ratio.
  • Memory: ratio of external memory read/write instructions.
  • MemoryL0: ratio of internal memory L0 read/write instructions.
  • ResourceConflictRatio: percentages of pipeline queue instructions.
  • MemoryUB: ratio of internal memory UB read/write instructions.

l2_cache

L2 cache data collection switch, Bool type. The value can be True (on) or False (off). The default value is False. This collection item generates the l2_cache.csv file in ASCEND_PROFILER_OUTPUT. For details about the result fields, see l2_cache (L2 Cache Hit Ratio).

op_attr

Operator attribute data collection switch, Bool type. Currently, the collection applies only to aclnn operators. The value can be True (on) or False (off). The default value is False. The profile data collected by this parameter can be parsed only into .db files, that is, export_type=torch_npu.profiler.ExportType.Db must be used. This parameter does not take effect when torch_npu.profiler.ProfilerLevel.Level_none is used.

record_op_args

Operator statistics switch, Bool type. The value can be True (on) or False (off). The default value is False. After it is enabled, a file with collected operator information is generated in the {worker_name}_{timestamp}_ascend_pt_op_args directory.

NOTE:

This parameter is used when the AOE tool performs tuning in the PyTorch training scenario. You are advised not to enable it together with other profile data collection APIs. For details, see the AOE Instructions.

gc_detect_threshold

GC detection threshold, float type. The value is greater than or equal to 0 (unit: ms). If the threshold is a number, GC detection is enabled and only GC events that exceed the threshold are sampled.

If this parameter is set to 0, all GC events are sampled. (Exercise caution when setting this parameter because a large amount of data may be sampled.) The recommended value is 1 ms.

The default value is None, indicating that the GC detection function is disabled.

GC is used by the Python process to reclaim the memory of destroyed objects.

If the format of the analysis result file is set to torch_npu.profiler.ExportType.Text, the GC layer is generated in the analysis result file trace_view.json.

If the format of the analysis result file is set to torch_npu.profiler.ExportType.Db, the GC_RECORD table is generated in the ascend_pytorch_profiler_{rank_id}.db file.

torch_npu.profiler.schedule Parameter Description

The torch_npu.profiler.schedule class parameters set the sampling behavior of each step in the sampling process.

Prototype:

torch_npu.profiler.schedule(wait, active, warmup=0, repeat=0, skip_first=0)
Table 8 Parameters

Parameter

Description

wait

Number of steps to skip at the beginning of each collection cycle, int type. This parameter is mandatory.

active

Number of steps for data collection in each cycle, int type. This parameter is mandatory.

warmup

Number of warm-up steps, int type. The default value is 0. You are advised to configure at least one warm-up step. This parameter is optional.

repeat

Number of times the "wait + warmup + active" cycle is executed, int type. The default value is 0, indicating that the cycle repeats continuously until collection ends. You are advised to set this parameter to an integer greater than 0. This parameter is optional.

skip_first

Number of steps skipped before sampling starts, int type. The default value is 0. In dynamic-shape scenarios, you are advised to skip the first 10 steps to ensure stable profile data. In other scenarios, configure this parameter as required. This parameter is optional.

Note: You are advised to set schedule based on this formula: number of steps ≥ skip_first + (wait + warmup + active) × repeat
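These semantics can be sketched in pure Python. The helper below is illustrative only (schedule_action is a hypothetical function, and the behavior is assumed to mirror the upstream torch.profiler.schedule; the actual torch_npu implementation may differ):

```python
def schedule_action(step, wait, active, warmup=0, repeat=0, skip_first=0):
    """Return the assumed ProfilerAction name for a given step."""
    if step < skip_first:
        return "NONE"
    step -= skip_first
    cycle = wait + warmup + active           # length of one wait+warmup+active cycle
    if repeat > 0 and step >= cycle * repeat:
        return "NONE"                        # all requested repetitions are done
    pos = step % cycle
    if pos < wait:
        return "NONE"
    if pos < wait + warmup:
        return "WARMUP"
    # The last active step of each cycle also saves the trace.
    return "RECORD_AND_SAVE" if pos == cycle - 1 else "RECORD"

# Per-step actions for wait=1, warmup=1, active=2, repeat=2, skip_first=1
# over 9 steps (satisfying: steps >= skip_first + (wait+warmup+active) x repeat).
actions = [schedule_action(s, wait=1, active=2, warmup=1, repeat=2, skip_first=1)
           for s in range(9)]
```

Running the sketch shows two RECORD_AND_SAVE steps (steps 4 and 8), one per repetition, which is when on_trace_ready is invoked.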

The following figure shows the relationships between the torch_npu.profiler.schedule class, the step, and the on_trace_ready function.

Figure 2 Relationships between the torch_npu.profiler.schedule class, the step, and the on_trace_ready function

A code example of the configuration:

with torch_npu.profiler.profile(
    activities=[
        torch_npu.profiler.ProfilerActivity.CPU,
        torch_npu.profiler.ProfilerActivity.NPU,
    ],
    schedule=torch_npu.profiler.schedule(
        wait=1,         # Waiting phase: one step is skipped.
        warmup=1,       # Warm-up phase: one step is profiled but its data is discarded.
        active=2,       # Record the activity data of two steps, then call on_trace_ready.
        repeat=2,       # Repeat the wait + warmup + active cycle twice.
        skip_first=1    # Skip the first step.
    ),
    on_trace_ready=torch_npu.profiler.tensorboard_trace_handler('./result')
) as prof:
    for _ in range(9):
        train_one_step()
        prof.step()     # Notify the profiler that a step is finished.

Maintenance and Test Logs of dynamic_profile Dynamic Profiling

The dynamic_profile maintenance and test logs are automatically recorded in the profiler_config_path directory. The following is an example of the directory structure of the generated logs:
profiler_config_path/
├── log
│   ├── dp_ubuntu_xxxxxx_rank_*.log
│   ├── dp_ubuntu_xxxxxx_rank_*.log.1
│   ├── monitor_dp_ubuntu_xxxxxx_rank_*.log
│   └── monitor_dp_ubuntu_xxxxxx_rank_*.log.1
├── profiler_config.json
└── shm
  • dp_ubuntu_xxxxxx.log: Execution log of dynamic_profile, which records all actions (INFO), warnings (WARNING), and errors (ERROR) during dynamic profiling. File naming format: dp_{Operating system}_{AI task process ID}_{rank_id}.log.

    When an AI task is started, each rank starts an AI task process. dynamic_profile generates a log file for each AI task process based on its process ID.

  • dp_ubuntu_xxxxxx.log.1: This is a log aging backup file. The storage limit for the dp_ubuntu_xxxxxx.log file is 200 KB. Once the limit is reached, the earliest log entries are moved to dp_ubuntu_xxxxxx.log.1. The storage limit for the dp_ubuntu_xxxxxx.log.1 file is also 200 KB, and once the limit is reached, the earliest log entries are deleted through aging.
  • monitor_dp_ubuntu_xxxxxx.log: This is the log for the profiler_config.json file modifications. After dynamic_profile is enabled for dynamic profiling, it records the modification time of the profiler_config.json file, whether the modifications take effect, and the end of the dynamic_profile process in real time. An example is shown below:
    2024-08-21 15:51:46,392 [INFO] [2127856] _dynamic_profiler_monitor.py: Dynamic profiler process load json success
    2024-08-21 15:51:58,406 [INFO] [2127856] _dynamic_profiler_monitor.py: Dynamic profiler process load json success
    2024-08-21 15:58:16,795 [INFO] [2127856] _dynamic_profiler_monitor.py: Dynamic profiler process done
    

    File naming format: monitor_dp_{Operating system}_{monitor process ID}_{rank_id}.log.

  • monitor_dp_ubuntu_xxxxxx.log.1: This is a log aging backup file. The storage limit for the monitor_dp_ubuntu_xxxxxx.log file is 200 KB. Once the limit is reached, the earliest log entries are moved to monitor_dp_ubuntu_xxxxxx.log.1. The storage limit for the monitor_dp_ubuntu_xxxxxx.log.1 file is also 200 KB, and once the limit is reached, the earliest log entries are deleted through aging.
  • shm directory: To support Python 3.7, dynamic_profile will generate the shm directory in the environment. A binary file (DynamicProfileNpuShm+timestamp) is created in this directory to map shared memory. The file will be automatically cleaned up when the program ends normally. However, when the program is terminated using pkill, the program cannot release resources due to the abnormal termination, and you need to manually clean up this file. Otherwise, if dynamic_profile is started again within a short time (< 1h) using the same configuration path, it will cause dynamic_profile to fail. For Python 3.8 and above, the binary file (DynamicProfileNpuShm+timestamp) is stored in the /dev/shm directory. When the program is terminated using pkill, the file still needs to be manually cleaned up.