Ascend PyTorch Profiler APIs
Ascend PyTorch Profiler is a performance analysis tool for the PyTorch framework. By adding the Ascend PyTorch Profiler API to a PyTorch training or online inference script, you can sample profile data during execution and visualize it in profile data files once the run completes, improving performance analysis efficiency. The API collects complete profile data in PyTorch training and online inference scenarios, including operators at the PyTorch and CANN layers, bottom-layer NPU operators, and operator memory usage, providing a comprehensive view of performance during training or online inference.
The Ascend PyTorch Profiler API tool supports the following profile data sampling modes:
- torch_npu.profiler.profile API sampling
- dynamic_profile dynamic sampling
- torch_npu.profiler._KinetoProfile API sampling
Other functions:
- (Optional) Sampling and Parsing msprof_tx
- (Optional) Sampling Environment Variable Information
- (Optional) Marking the Profile Data Sampling Process With Custom Character String Keys and Values
- (Optional) Video Memory Visualization
References:
- Ascend PyTorch Profiler APIs
- profiler_config.json File Description
- experimental_config Parameter Description (dynamic_profile Scenario)
- experimental_config Parameter Description
- torch_npu.profiler.schedule Parameter Description
- Using dynamic_profile for Dynamic Profiling of Maintenance and Test Logs
Restrictions
Ascend PyTorch Profiler APIs support multiple profiling methods, but these methods cannot be enabled at the same time.
Prerequisites
- Ensure that operations in Before You Start have been completed.
- Prepare a model trained on PyTorch 2.1.0 or later and a matched dataset, and port the model to the Ascend AI Processor. For details, see "Porting Adaptation" in the PyTorch Training Model Porting and Tuning Guide.
Profile Data Sampling and Parsing (torch_npu.profiler.profile)
- Add the following sample code to the training script (for example, train_*.py)/online inference script to configure profile data sampling parameters, and then start the training/online inference. The following is a code example:
For details about the torch_npu.profiler.profile API in the following sample code, see Ascend PyTorch Profiler APIs.
```python
import torch
import torch_npu
...
experimental_config = torch_npu.profiler._ExperimentalConfig(
    export_type=torch_npu.profiler.ExportType.Text,
    profiler_level=torch_npu.profiler.ProfilerLevel.Level0,
    msprof_tx=False,
    aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone,
    l2_cache=False,
    op_attr=False,
    data_simplification=False,
    record_op_args=False,
    gc_detect_threshold=None
)
with torch_npu.profiler.profile(
        activities=[
            torch_npu.profiler.ProfilerActivity.CPU,
            torch_npu.profiler.ProfilerActivity.NPU
        ],
        schedule=torch_npu.profiler.schedule(wait=0, warmup=0, active=1, repeat=1, skip_first=1),
        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./result"),
        record_shapes=False,
        profile_memory=False,
        with_stack=False,
        with_modules=False,
        with_flops=False,
        experimental_config=experimental_config) as prof:
    for step in range(steps):
        train_one_step(step, steps, train_loader, model, optimizer, criterion)
        prof.step()
```
Or
```python
import torch
import torch_npu
...
experimental_config = torch_npu.profiler._ExperimentalConfig(
    export_type=torch_npu.profiler.ExportType.Text,
    profiler_level=torch_npu.profiler.ProfilerLevel.Level0,
    msprof_tx=False,
    aic_metrics=torch_npu.profiler.AiCMetrics.AiCoreNone,
    l2_cache=False,
    op_attr=False,
    data_simplification=False,
    record_op_args=False,
    gc_detect_threshold=None
)
prof = torch_npu.profiler.profile(
    activities=[
        torch_npu.profiler.ProfilerActivity.CPU,
        torch_npu.profiler.ProfilerActivity.NPU
    ],
    schedule=torch_npu.profiler.schedule(wait=0, warmup=0, active=1, repeat=1, skip_first=1),
    on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./result"),
    record_shapes=False,
    profile_memory=False,
    with_stack=False,
    with_modules=False,
    with_flops=False,
    experimental_config=experimental_config)
prof.start()
for step in range(steps):
    train_one_step()
    prof.step()
prof.stop()
```
In addition to using tensorboard_trace_handler to export profile data, you can also use the following method:
```python
import torch
import torch_npu
...
with torch_npu.profiler.profile() as prof:
    for step in range(steps):
        train_one_step(step, steps, train_loader, model, optimizer, criterion)
prof.export_chrome_trace('./chrome_trace_14.json')
```
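The schedule used in the examples above determines which steps are actually recorded. As a rough illustration (not the torch_npu implementation), the per-step state implied by skip_first, wait, warmup, active, and repeat can be modeled in plain Python:

```python
# Illustrative sketch (not the torch_npu implementation) of how
# schedule(wait, warmup, active, repeat, skip_first) classifies each step.
def schedule_action(step, wait, warmup, active, repeat, skip_first):
    """Return "NONE", "WARMUP", or "RECORD" for a given 0-based step."""
    if step < skip_first:            # initial steps are skipped entirely
        return "NONE"
    step -= skip_first
    cycle = wait + warmup + active   # one collection cycle
    if repeat > 0 and step >= cycle * repeat:
        return "NONE"                # all requested cycles are done
    pos = step % cycle
    if pos < wait:
        return "NONE"                # idle phase
    if pos < wait + warmup:
        return "WARMUP"              # profiler warms up, data is discarded
    return "RECORD"                  # active phase, data is collected

# With schedule(wait=0, warmup=0, active=1, repeat=1, skip_first=1),
# only step 1 is recorded.
actions = [schedule_action(s, 0, 0, 1, 1, 1) for s in range(4)]
```

This is why prof.step() must be called once per iteration: it is what advances the profiler through these phases.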
- Parse profile data.
Automatic parsing (see tensorboard_trace_handler and prof.export_chrome_trace in the preceding sample code) and offline parsing are supported.
- View the sampled PyTorch training/online inference profile data result files and profile data analysis.
For details about profile data result files, see Data Storing Directories.
Profile Data Sampling and Parsing (dynamic_profile)
dynamic_profile is used to start the profiling process at any time during model training/online inference.
dynamic_profile can be enabled in multiple ways. However, only one method can be enabled at a time.
Using environment variables
- Configure the following environment variables:
export PROF_CONFIG_PATH="profiler_config_path"
After this environment variable is configured and training is started, dynamic_profile automatically creates the template file profiler_config.json in profiler_config_path. You can modify configuration options based on the template file.
- This method applies only to training scenarios.
- In this method, dynamic_profile cannot sample data of the first iteration (step 0).
- This method depends on the step division performed by the native PyTorch Optimizer.step() during training. Custom optimizers are not supported.
- profiler_config_path can contain only letters, digits, and underscores (_). Soft links are not supported.
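The naming restriction above can be checked before launching the job. The following is a hypothetical helper (not part of torch_npu) illustrating the documented rules, assuming each path component may contain only letters, digits, and underscores and that symbolic links are rejected:

```python
import os
import re

def valid_profiler_config_path(path):
    """Check the documented naming rules for profiler_config_path:
    letters, digits, and underscores only, and no symbolic links."""
    parts = path.strip("/").split("/")
    name_ok = all(re.fullmatch(r"[A-Za-z0-9_]+", part) for part in parts)
    return name_ok and not os.path.islink(path)

ok = valid_profiler_config_path("profiler_config_path")
bad = valid_profiler_config_path("bad-name!")
```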
- Start a training job.
- Open a new CLI and modify the profiler_config.json configuration file to enable the profiling task. The configuration file contains the profile data collection parameters of Profiler. See profiler_config.json File Description to modify the parameters in the configuration file and execute different profiling tasks.
- dynamic_profile determines whether the profiler_config.json file is modified by identifying the status of the file.
- dynamic_profile polls every two seconds. If the profiler_config.json file is modified, the sampling process is started. Then, the running interval between step 10 and step 11 is recorded and used as the new polling interval. The minimum interval is one second.
- If the profiler_config.json file is modified during the dynamic_profile sampling process, the dynamic_profile sampling is started again after the sampling process ends.
- You are advised to use shared storage to set profiler_config_path of dynamic_profile.
- The dynamic_profile maintenance and test logs are automatically recorded in the profiler_config_path directory. For details, see Using dynamic_profile for Dynamic Profiling of Maintenance and Test Logs.
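The change detection described above can be pictured as polling the file's status. The following is a simplified, hypothetical sketch of the idea using modification time (the actual dynamic_profile logic is internal to torch_npu):

```python
import os
import time

def config_changed(path, last_mtime):
    """Return (changed, new_mtime) based on the file's modification time."""
    mtime = os.stat(path).st_mtime
    return mtime != last_mtime, mtime

# Example: write a config file, then simulate a user edit and detect it.
path = "profiler_config_demo.json"
with open(path, "w") as f:
    f.write("{}")
_, mtime = config_changed(path, None)

time.sleep(0.05)
with open(path, "w") as f:
    f.write('{"analyse": true}')  # simulate a user edit
changed, mtime = config_changed(path, mtime)
os.remove(path)
```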
- Parse profile data.
Automatic parsing and manual parsing are supported. For details, see the analyse parameter in Table 5.
- View the sampled PyTorch training/online inference profile data result files and profile data analysis.
For details about profile data result files, see Data Storing Directories.
Modifying the user training/online inference script by adding the dynamic_profile API
- Add the following sample code to the training script (for example, train_*.py)/online inference script:
```python
# Load the dynamic_profile module.
from torch_npu.profiler import dynamic_profile as dp
# Set the profiling configuration file path.
dp.init("profiler_config_path")
...
for step in range(steps):
    train_one_step()
    # Divide steps.
    dp.step()
```
During init, dynamic_profile automatically creates the template file profiler_config.json in profiler_config_path. You can modify configuration items based on the template file.
profiler_config_path can contain only letters, digits, and underscores (_). Soft links are not supported.
- Start a training/online inference task.
- Open a new CLI and modify the profiler_config.json configuration file to enable the profiling task. The configuration file contains the profile data collection parameters of Profiler. See profiler_config.json File Description to modify the parameters in the configuration file and execute different profiling tasks.
- dynamic_profile determines whether the profiler_config.json file is modified by identifying the status of the file.
- dynamic_profile polls every two seconds. If the profiler_config.json file is modified, the sampling process is started. Then, the running interval between step 10 and step 11 is recorded and used as the new polling interval. The minimum interval is one second.
- If the profiler_config.json file is modified during the dynamic_profile sampling process, the dynamic_profile sampling is started again after the sampling process ends.
- You are advised to use shared storage to set profiler_config_path of dynamic_profile.
- The dynamic_profile maintenance and test logs are automatically recorded in the profiler_config_path directory. For details, see Using dynamic_profile for Dynamic Profiling of Maintenance and Test Logs.
- Parse profile data.
Automatic parsing and manual parsing are supported. For details, see the analyse parameter in Table 5.
- View the sampled PyTorch training/online inference profile data result files and profile data analysis.
For details about profile data result files, see Data Storing Directories.
Modifying the user training/online inference script by adding the dp.start() function of dynamic_profile
- Add the following sample code to the training script (for example, train_*.py)/online inference script:
```python
# Load the dynamic_profile module.
from torch_npu.profiler import dynamic_profile as dp
# Set the path of the profiling configuration file of the init API.
dp.init("profiler_config_path")
...
for step in range(steps):
    if step == 5:
        # Set the path of the profiling configuration file of the start API.
        dp.start("start_config_path")
    train_one_step()
    # Divide steps. The code to be profiled must run between dp.start() and dp.step().
    dp.step()
```
start_config_path is also specified as the profiler_config.json path. However, you need to manually create a configuration file by referring to profiler_config.json File Description and set parameters based on your actual needs. The file name must be specified, for example, dp.start("/home/xx/start_config_path/profiler_config.json").
profiler_config_path and start_config_path can contain only letters, digits, and underscores (_). Soft links are not supported.
- After the dp.start() function is added, when a training or online inference task reaches dp.start(), data is automatically sampled based on the profiler_config.json file specified by start_config_path. The dp.start() function does not detect modifications of the profiler_config.json file; it only triggers a sampling task during training or online inference.
- After the dp.start() function is added and the training/online inference is started:
- If the profiler_config.json configuration file is not specified in dp.start(), or the configuration file does not take effect due to an error, data is sampled based on the profiler_config.json configuration file in the profiler_config_path directory when the task reaches dp.start().
- If the script reaches dp.start() while the dynamic_profile configured in dp.init() is valid, dp.start() does not take effect.
- If the script reaches dp.start() after the dynamic_profile finishes sampling as specified in dp.init(), the script continues sampling with dp.start() and generates a new profile data file directory in the prof_dir directory.
- If the profiler_config.json file in the profiler_config_path directory is modified while the dynamic_profile configured in dp.start() is valid, dp.init() sampling starts after the dp.start() sampling finishes, and a new profile data file is generated in the prof_dir directory.
- You are advised to use shared storage to set profiler_config_path of dynamic_profile.
- Start a training/online inference task.
- Parse profile data.
Automatic parsing and manual parsing are supported. For details, see the analyse parameter in Table 5.
- View the sampled PyTorch training/online inference profile data result files and profile data analysis.
For details about profile data result files, see Data Storing Directories.
Profile Data Sampling and Parsing (torch_npu.profiler._KinetoProfile)
- Add the following sample code to the training script (for example, train_*.py)/online inference script to configure profile data sampling parameters, and then start the training/online inference. The following is a code example:
For details about the torch_npu.profiler._KinetoProfile API in the following sample code, see Ascend PyTorch Profiler APIs.
```python
import torch
import torch_npu
...
prof = torch_npu.profiler._KinetoProfile(activities=None, record_shapes=False,
                                         profile_memory=False, with_stack=False,
                                         with_flops=False, with_modules=False,
                                         experimental_config=None)
for epoch in range(epochs):
    train_model_step()
    if epoch == 0:
        prof.start()
    if epoch == 1:
        prof.stop()
        prof.export_chrome_trace("result_dir/trace.json")
```
In this method, schedule and tensorboard_trace_handler cannot be used to export profile data.
- Parse profile data.
Automatic parsing is supported. For details, see prof.export_chrome_trace in the preceding sample code.
- View the sampled PyTorch training/online inference profile data result files and profile data analysis.
For details about profile data result files, see Data Storing Directories.
(Optional) Sampling and Parsing msprof_tx
In large cluster scenarios, traditional profiling involves a large amount of data and a complex analysis process. You can use the msprof_tx parameter of experimental_config to enable the custom dotting function, customize the collection period or the start and end time of key functions, and identify key functions or iterations to quickly demarcate performance issues.
The usage and sample code are as follows:
- Enable torch_npu.profiler and msprof_tx, and set profiler_level to Level_none (the level can be configured based on the actual collection requirements) to collect dotting data.
- In the PyTorch script, call the mark APIs of torch_npu.npu.mstx (torch_npu.npu.mstx.mark, torch_npu.npu.mstx.range_start, torch_npu.npu.mstx.range_end, and torch_npu.npu.mstx.mstx_range) to record the time required for sampled events. For details about the APIs, see "Ascend Extension for PyTorch Custom API" > "torch_npu.npu" > "profiler" in Ascend Extension for PyTorch API Reference.
Only the range duration on the host is recorded.
```python
id = torch_npu.npu.mstx.range_start("dataloader", None)  # If the second parameter is None or not set, only the range duration on the host is recorded.
dataloader()
torch_npu.npu.mstx.range_end(id)
```
Mark IDs on compute streams to record the range time on the host and device.
```python
stream = torch_npu.npu.current_stream()
id = torch_npu.npu.mstx.range_start("matmul", stream)  # With a valid stream as the second parameter, the range time is recorded on both the host and device.
torch.matmul()
torch_npu.npu.mstx.range_end(id)
```
Mark IDs on the collective communication stream:
```python
from torch.distributed.distributed_c10d import _world

if torch.__version__ != '1.11.0':
    stream_id = _world.default_pg._get_backend(torch.device('npu'))._get_stream_id(False)
    collective_stream = torch.npu.Stream(stream_id=stream_id, device_type=20,
                                         device_index=device_id)  # Use device_index to specify the device ID of the actual service.
else:
    stream_id = _world.default_pg._get_stream_id(False)
    current_stream = torch.npu.current_stream()
    cdata = current_stream._cdata & 0xffff000000000000
    collective_stream = torch.npu.Stream(_cdata=(stream_id + cdata),
                                         device_index=device_id)  # Use device_index to specify the device ID of the actual service.
id = torch_npu.npu.mstx.range_start("allreduce", collective_stream)  # With a valid stream as the second parameter, the range time is recorded on both the host and device.
torch.allreduce()
torch_npu.npu.mstx.range_end(id)
```
Mark IDs on the P2P communication stream:
```python
from torch.distributed.distributed_c10d import _world

if torch.__version__ != '1.11.0':
    stream_id = _world.default_pg._get_backend(torch.device('npu'))._get_stream_id(True)
    p2p_stream = torch.npu.Stream(stream_id=stream_id, device_type=20,
                                  device_index=device_id)  # Use device_index to specify the device ID of the actual service.
else:
    stream_id = _world.default_pg._get_stream_id(True)
    current_stream = torch.npu.current_stream()
    cdata = current_stream._cdata & 0xffff000000000000
    p2p_stream = torch.npu.Stream(_cdata=(stream_id + cdata),
                                  device_index=device_id)  # Use device_index to specify the device ID of the actual service.
id = torch_npu.npu.mstx.range_start("send", p2p_stream)  # With a valid stream as the second parameter, the range time is recorded on both the host and device.
torch.send()
torch_npu.npu.mstx.range_end(id)
```
To collect data in these scenarios, you need to configure the torch_npu.profiler.profile API and enable the msprof_tx switch. The following is an example:
```python
import torch_npu

stream = torch_npu.npu.current_stream()
id = torch_npu.npu.mstx.range_start("Func", stream)  # Mark the start of the range of the func function on the host and device.
func()  # Service code.
torch_npu.npu.mstx.range_end(id)  # Mark the end of the range of the func function on the host and device.
experimental_config = torch_npu.profiler._ExperimentalConfig(
    profiler_level=torch_npu.profiler.ProfilerLevel.Level_none,
    msprof_tx=True,
    export_type=torch_npu.profiler.ExportType.Db
)
with torch_npu.profiler.profile(
        schedule=torch_npu.profiler.schedule(wait=1, warmup=1, active=2, repeat=2, skip_first=1),
        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./result"),
        experimental_config=experimental_config) as prof:
    for epoch in range(epochs):
        torch_npu.npu.mstx.mark("train epoch start")  # Mark the instantaneous moment on the host and device. You can also use torch_npu.npu.mstx().mark("train epoch start").
        for step in range(steps):
            train_one_step(step, steps, train_loader, model, optimizer, criterion)
            prof.step()
```
By default, the msprof_tx function collects communication operator profile data in the format of comm:{communication name},{communicator name},{input data format},{input data counts}. For example, comm:HcclBroadcast,xxxxxx,int64,5, where xxxxxx indicates the communicator name.
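As a hypothetical illustration, the comm:{...} message format described above can be split into its fields with a few lines of Python (the field names used here are assumptions based on the format description, not an official API):

```python
def parse_mstx_comm(message):
    """Split a msprof_tx communication message of the form
    comm:{op},{communicator},{dtype},{count} into a dict."""
    if not message.startswith("comm:"):
        raise ValueError("not a communication message")
    op, communicator, dtype, count = message[len("comm:"):].split(",")
    return {"op": op, "communicator": communicator,
            "dtype": dtype, "count": int(count)}

# Example from the text: HcclBroadcast over communicator "xxxxxx".
info = parse_mstx_comm("comm:HcclBroadcast,xxxxxx,int64,5")
```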
This function allows you to view the execution and scheduling status of user-defined dotting from the framework to the CANN layer and then to the NPU, helping you identify key functions or events to be observed and demarcate performance issues.
For details about msprof_tx sampling results, see msproftx Data Description.
(Optional) Sampling Environment Variable Information
The Ascend PyTorch Profiler API profiles environment variable information by default. The following environment variables can be sampled:
- "ASCEND_GLOBAL_LOG_LEVEL"
- "HCCL_RDMA_TC"
- "HCCL_RDMA_SL"
- "ACLNN_CACHE_LIMIT"
Procedure:
- Configure environment variables. The following is an example:
export ASCEND_GLOBAL_LOG_LEVEL=1
export HCCL_RDMA_TC=0
export HCCL_RDMA_SL=0
export ACLNN_CACHE_LIMIT=4096
Set the environment variables based on the actual requirements.
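Before launching the job, you can verify from Python that the variables are visible to the process. A minimal sketch using only the standard library:

```python
import os

# The environment variables that the profiler records by default.
PROFILED_VARS = ["ASCEND_GLOBAL_LOG_LEVEL", "HCCL_RDMA_TC",
                 "HCCL_RDMA_SL", "ACLNN_CACHE_LIMIT"]

os.environ.setdefault("ACLNN_CACHE_LIMIT", "4096")  # example setting

# Collect only the variables that are actually set in this process.
visible = {name: os.environ[name] for name in PROFILED_VARS
           if name in os.environ}
```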
- Call the Ascend PyTorch Profiler API for sampling.
- View the result data.
- When the export_type of experimental_config is set to torch_npu.profiler.ExportType.Text, the environment variables configured in the preceding steps are stored in the profiler_metadata.json file in the {worker_name}_{timestamp}_ascend_pt directory.
- When export_type of the experimental_config parameter is set to torch_npu.profiler.ExportType.Db, write environment variable information into the META_DATA table in the ascend_pytorch_profiler_{rank_id}.db file.
(Optional) Marking the Profile Data Collection Process With Custom Character String Keys and Values
- Example 1
```python
with torch_npu.profiler.profile(...) as prof:
    prof.add_metadata(key, value)
```
- Example 2
```python
with torch_npu.profiler._KinetoProfile(...) as prof:
    prof.add_metadata_json(key, value)
```
add_metadata and add_metadata_json can be configured under torch_npu.profiler.profile and torch_npu.profiler._KinetoProfile. They need to be added in the code of the profile data collection process, after profiler initialization and before finalization.
| Class and Function Name | Description |
|---|---|
| add_metadata | Adds a character string flag with a key and a value. |
| add_metadata_json | Adds a character string flag whose value is in JSON format. |
The metadata passed by calling this API is written to the profiler_metadata.json file in the root directory of the collection result of the Ascend PyTorch Profiler API.
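Since the metadata ends up in profiler_metadata.json, it can be inspected afterward with the standard json module. A hypothetical example (the flat key/value layout of the file shown here is an assumption for illustration):

```python
import json
import os

# Simulate a profiler_metadata.json written by add_metadata calls
# (the exact file layout is an assumption for illustration).
with open("profiler_metadata_demo.json", "w") as f:
    json.dump({"model": "resnet50", "batch_size": "256"}, f)

# Read the metadata back.
with open("profiler_metadata_demo.json") as f:
    metadata = json.load(f)
os.remove("profiler_metadata_demo.json")
```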
(Optional) Video Memory Visualization
This function classifies and displays the data occupying storage space during model training. Export the visualization file memory_timeline.html through export_memory_timeline. To output an HTML file, install matplotlib in the Python environment and set record_shapes, profile_memory, and with_stack (or with_modules) of torch_npu.profiler.profile to True. In addition, when you use this function, an ascend_pt data file is generated in the current directory. The following is an example:
```python
import torch
import torch_npu
...
def trace_handler(prof: torch_npu.profiler.profile):
    prof.export_memory_timeline(output_path="./memory_timeline.html", device="npu:0")

with torch_npu.profiler.profile(
        activities=[
            torch_npu.profiler.ProfilerActivity.CPU,
            torch_npu.profiler.ProfilerActivity.NPU
        ],
        schedule=torch_npu.profiler.schedule(wait=0, warmup=0, active=4, repeat=1, skip_first=0),
        on_trace_ready=trace_handler,
        record_shapes=True,   # Set it to True.
        profile_memory=True,  # Set it to True.
        with_stack=True,      # Set either with_stack or with_modules to True.
        with_modules=True
) as prof:
    for _ in range(steps):
        ...
        prof.step()
```
After data sampling, the memory_timeline.html file is exported, with the following visualization effect:
- Time (ms): Horizontal coordinate, indicating the memory occupation time of the tensors (unit: ms).
- Memory (GB): Vertical coordinate, indicating the memory size occupied by the tensors (unit: GB).
- Max memory allocated: Maximum allocated memory (unit: GB).
- Max memory reserved: Maximum reserved memory (unit: GB).
- PARAMETER: Model parameters and model weights.
- OPTIMIZER_STATE: Optimizer status. For example, the Adam optimizer records some statuses during model training.
- INPUT: Input data.
- TEMPORARY: Temporarily occupied. It is defined as tensors that are allocated and then released for a single operator. Generally, these tensors store intermediate values.
- ACTIVATION: Activation values obtained in forward computation.
- GRADIENT: Gradient value.
- AUTOGRAD_DETAIL: Memory usage generated during backward computation.
- UNKNOWN: Unknown type.
Ascend PyTorch Profiler APIs
| Parameter | Description | Required (Yes/No) |
|---|---|---|
| activities | CPU/NPU event collection list, Enum type. By default, both switches are turned on at the same time. | No |
| schedule | Behavior of each step, Callable type, controlled by the schedule class. By default, no operation is performed. This parameter is not supported by torch_npu.profiler._KinetoProfile. | No |
| on_trace_ready | Operation automatically performed after collection ends, Callable type. The tensorboard_trace_handler function is supported. If a large amount of data is sampled and parsing it directly in the current environment is unsuitable, or the training/online inference process is interrupted during sampling so that only part of the profile data is collected, offline parsing can be used. By default, no operation is performed. This parameter is not supported by torch_npu.profiler._KinetoProfile. | No |
| record_shapes | Records the InputShapes and InputTypes of operators, Bool type. This parameter takes effect when torch_npu.profiler.ProfilerActivity.CPU is enabled. | No |
| profile_memory | Records the memory usage of operators, Bool type. | No |
| with_stack | Records the operator call stack, Bool type, including call information at the framework layer and CPU operator layer. This parameter takes effect when torch_npu.profiler.ProfilerActivity.CPU is enabled. NOTE: Enabling this configuration causes extra performance overhead. | No |
| with_modules | Records the Python call stack at the module level, that is, call information at the framework layer, Bool type. This parameter takes effect when torch_npu.profiler.ProfilerActivity.CPU is enabled. NOTE: Enabling this configuration causes extra performance overhead. | No |
| with_flops | Records the floating-point operations of operators, Bool type. Currently, this parameter cannot be used for profile data parsing. This parameter takes effect when torch_npu.profiler.ProfilerActivity.CPU is enabled. | No |
| experimental_config | Extended parameter, used to configure common collection items of the performance analysis tool. For details about the supported collection items, see experimental_config Parameter Description. | No |
| use_cuda | CUDA profile data collection switch, Bool type. This parameter is not supported in Ascend environments and is not supported by torch_npu.profiler._KinetoProfile. | No |
| Method Name | Description |
|---|---|
| step | Divides different iterations. This method is not supported by torch_npu.profiler._KinetoProfile. |
| export_chrome_trace | Exports trace data and writes it to a specified .json file. The trace data contains the running time and association relationships of operators and APIs after the Ascend PyTorch Profiler APIs integrate the CANN software stack and NPU data on the framework side. If torch_npu.profiler.tensorboard_trace_handler is set, export_chrome_trace does not take effect. In a multi-device scenario, set a different file name for each device. |
| export_stacks | Exports stack information to a file. This method is placed at the same position as export_chrome_trace in the training/online inference script. You can view the exported result file with the FlameGraph tool: git clone https://github.com/brendangregg/FlameGraph && cd FlameGraph && ./flamegraph.pl --title "NPU time" --countname "us." profiler.stacks > perf_viz.svg |
| export_memory_timeline | Exports the memory event information of a specified device from the sampled data as a timeline graph. export_memory_timeline can export three kinds of files, each determined by the suffix of output_path. For details, see (Optional) Video Memory Visualization. |
| start | Sets the position where data collection starts. Add start and stop before and after the training/online inference code whose profile data is to be sampled. |
| stop | Sets the position where data collection ends. Execute start before using this method. |
| Class and Function Name | Description |
|---|---|
| torch_npu.profiler.schedule | Sets the action for each step. By default, no operation is performed. To obtain more stable profile data, set the parameters of this class. For details about the parameter values and usage, see torch_npu.profiler.schedule Parameter Description. |
| torch_npu.profiler.tensorboard_trace_handler | Exports the collected profile data to a format supported by the TensorBoard tool. This function is not supported by torch_npu.profiler._KinetoProfile. |
| torch_npu.profiler.ProfilerAction | Profiler status, Enum type. |
| torch_npu.profiler._ExperimentalConfig | Profile data collection extension. It is called by the experimental_config parameter of torch_npu.profiler.profile. For details, see experimental_config Parameter Description. |
| torch_npu.profiler.supported_activities | Queries the CPU and NPU events of the activities parameter that can be collected. |
| torch_npu.profiler.supported_profiler_level | Queries the currently supported profiler_level values of the experimental_config parameter. |
| torch_npu.profiler.supported_ai_core_metrics | Queries the currently supported AI Core performance metrics of the experimental_config parameter. |
| torch_npu.profiler.supported_export_type | Queries the profile data result file types supported by torch_npu.profiler.ExportType. |
Profile data occupies disk space; if the disk fills up, the server may become unavailable. The space required is closely related to the model parameters, collection configuration, and number of collection iterations. Ensure that the directory to which profile data is flushed has sufficient free space.
profiler_config.json File Description
The content of the profiler_config.json file is as follows (the default settings are used as an example):
```json
{
    "activities": ["CPU", "NPU"],
    "prof_dir": "./",
    "analyse": false,
    "record_shapes": false,
    "profile_memory": false,
    "with_stack": false,
    "with_flops": false,
    "with_modules": false,
    "active": 1,
    "start_step": 0,
    "is_rank": false,
    "rank_list": [],
    "experimental_config": {
        "profiler_level": "Level0",
        "aic_metrics": "AiCoreNone",
        "l2_cache": false,
        "op_attr": false,
        "gc_detect_threshold": null,
        "data_simplification": true,
        "record_op_args": false,
        "export_type": "text",
        "msprof_tx": false
    }
}
```
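Because the file is plain JSON, a profiling task can be triggered programmatically, for example by flipping "analyse" and rewriting the file so that the next dynamic_profile poll picks it up. A minimal sketch with the standard library (the file name follows the description above; the template here is abbreviated):

```python
import json

config_path = "profiler_config.json"  # located inside profiler_config_path

# Write a minimal template (normally dynamic_profile creates this file).
template = {"activities": ["CPU", "NPU"], "prof_dir": "./",
            "analyse": False, "active": 1}
with open(config_path, "w") as f:
    json.dump(template, f, indent=4)

# Enable automatic parsing and collect 2 iterations, then save;
# dynamic_profile detects the modification on its next poll.
with open(config_path) as f:
    config = json.load(f)
config["analyse"] = True
config["active"] = 2
with open(config_path, "w") as f:
    json.dump(config, f, indent=4)
```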
|
| Parameter | Description | Required (Yes/No) |
|---|---|---|
| activities | CPU/NPU event sampling list. Possible values are "CPU" and "NPU". By default, both switches are turned on at the same time. | No |
| prof_dir | Output directory of the sampled profile data. The default directory is ./. The path can contain only letters, digits, and underscores (_). Soft links are not supported. | No |
| analyse | Switch for automatic parsing of profile data. The value can be true (on) or false (off). The default value is false. | No |
| record_shapes | Collects the InputShapes and InputTypes of operators. The value can be true (on) or false (off). The default value is false. This parameter is valid only when activities is set to CPU. | No |
| profile_memory | Collects the memory usage of operators. The value can be true (on) or false (off). The default value is false. | No |
| with_stack | Collects the operator call stack, including the call information at the framework layer and CPU operator layer. The value can be true (on) or false (off). The default value is false. This parameter is valid only when activities is set to CPU. | No |
| with_flops | Collects the floating-point operations of operators. Currently, this parameter cannot be used for profile data parsing. The value can be true (on) or false (off). The default value is false. This parameter is valid only when activities is set to CPU. | No |
| with_modules | Collects the Python call stack at the module level, that is, call information at the framework layer. The value can be true (on) or false (off). The default value is false. This parameter is valid only when activities is set to CPU. | No |
| is_rank | Enables sampling of data for specified ranks only. The value can be true (on) or false (off). The default value is false. After this function is enabled, dynamic_profile identifies the rank IDs configured in the rank_list parameter and samples data only for those ranks. If rank_list is empty after this function is enabled, no profile data is sampled. After this function is enabled, automatic analysis does not take effect; use offline analysis instead. | No |
| rank_list | Rank IDs to be sampled. The value is a list of integers. The default value is empty, indicating that no profile data is sampled. Each value must be a valid rank ID in the environment. You can specify one or more ranks at a time, for example, "rank_list": [1,2,3]. | No |
| experimental_config | Extended parameter, used to configure common collection items of the performance analysis tool. For details, see experimental_config Parameter Description (dynamic_profile Scenario). In the dynamic sampling scenario, set the sub-parameters of experimental_config in the configuration file to the actual parameter values, for example, "aic_metrics": "PipeUtilization". | No |
| metadata | Samples model hyperparameters (keys) and configuration information (values). | No |
| active | Number of iterations for data collection. The value is a positive integer. The default value is 1. | No |
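To illustrate how the rank-filtering parameters in the table combine, the following sketch writes a minimal profiler_config.json that samples only ranks 0 and 1 for two iterations. write_profiler_config is a hypothetical helper for illustration, not part of torch_npu; all keys come from the table above.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical helper: write a dynamic_profile configuration that samples
# only ranks 0 and 1 for two iterations.
def write_profiler_config(config_dir: str) -> Path:
    config = {
        "activities": ["CPU", "NPU"],
        "prof_dir": "./profiling_output",
        "analyse": False,       # is_rank=True disables automatic analysis anyway
        "is_rank": True,        # sample only the ranks listed in rank_list
        "rank_list": [0, 1],
        "active": 2,            # collect two iterations per trigger
    }
    path = Path(config_dir) / "profiler_config.json"
    path.write_text(json.dumps(config, indent=4))
    return path

config_path = write_profiler_config(tempfile.mkdtemp())
print(json.loads(config_path.read_text())["rank_list"])  # -> [0, 1]
```

Note that json.dumps emits the lowercase true/false literals that the configuration file expects.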
experimental_config Parameter Description (dynamic_profile Scenario)
All experimental_config parameters are optional. The following table lists the sampling items that can be extended.
| Parameter | Description |
|---|---|
| export_type | Format of the exported profile data result file. Possible values are text and db. If this parameter is set to an invalid value or is not set, the default value text is used. |
| profiler_level | Profile level. Possible values are Level_none, Level0, Level1, and Level2. The default value is Level0. |
| msprof_tx | Dotting switch, used to enable the customized dotting function. The value can be true (on) or false (off). The default value is false. For details about this parameter, see (Optional) Sampling and Parsing msprof_tx. |
| data_simplification | Data simplification mode. After it is enabled, data in the FRAMEWORK directory and redundant data are deleted after profile data is exported; only the profiler_info.json file and raw profile data in the ASCEND_PROFILER_OUTPUT and PROF_XXX directories are retained, saving storage space. The value can be true (on) or false (off). The default value is true. |
| aic_metrics | AI Core metrics to profile. Possible values include AiCoreNone (default) and PipeUtilization. The results of these profiling items are displayed in the Kernel View. For details about the results, see op_summary (Operator Details). The actual collection result may vary. |
| l2_cache | L2 cache data collection switch. The value can be true (on) or false (off). The default value is false. This profiling item generates the l2_cache.csv file in ASCEND_PROFILER_OUTPUT. For details about the result fields, see l2_cache (L2 Cache Hit Ratio). |
| op_attr | Operator attribute data sampling switch. Currently, collection applies only to aclnn operators. The value can be true (on) or false (off). The default value is false. The profile data sampled by this parameter can be parsed only to .db files, so export_type must be set to db. This parameter does not take effect when profiler_level is set to Level_none. |
| record_op_args | Operator statistics switch. The value can be true (on) or false (off). The default value is false. After it is enabled, a file with collected operator information is generated in the {worker_name}_{timestamp}_ascend_pt_op_args directory. |
| gc_detect_threshold | GC detection threshold (unit: ms). The value must be greater than or equal to 0. If the threshold is a number, GC detection is enabled and only GC events whose duration exceeds the threshold are sampled; if it is set to 0, all GC events are sampled (exercise caution, because a large amount of data may be collected). The recommended value is 1 ms. The default value is null, indicating that GC detection is disabled. GC is used by the Python process to reclaim the memory of destroyed objects. If export_type is set to text, a GC layer is generated in the analysis result file trace_view.json; if export_type is set to db, a GC_RECORD table is generated in the ascend_pytorch_profiler_{rank_id}.db file. |
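As an illustrative fragment only (the specific values shown here, such as Level1 and the 1.0 ms GC threshold, are example choices drawn from the tables above, not recommended settings), a tuned profiler_config.json in the dynamic sampling scenario might look like:

```json
{
    "activities": ["CPU", "NPU"],
    "prof_dir": "./profiling_output",
    "experimental_config": {
        "profiler_level": "Level1",
        "aic_metrics": "PipeUtilization",
        "l2_cache": true,
        "gc_detect_threshold": 1.0,
        "export_type": "db"
    }
}
```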
experimental_config Parameter Description
All experimental_config parameters are optional. The following table lists the sampling items that can be extended.
| Parameter | Description |
|---|---|
| export_type | Format of the exported profile data result file, Enum type. Possible values are torch_npu.profiler.ExportType.Text and torch_npu.profiler.ExportType.Db. If this parameter is set to an invalid value or is not set, the default value torch_npu.profiler.ExportType.Text is used. |
| profiler_level | Collection level, Enum type. Possible values are torch_npu.profiler.ProfilerLevel.Level_none, torch_npu.profiler.ProfilerLevel.Level0, torch_npu.profiler.ProfilerLevel.Level1, and torch_npu.profiler.ProfilerLevel.Level2. The default value is torch_npu.profiler.ProfilerLevel.Level0. |
| msprof_tx | Dotting switch, Bool type, used to enable the customized dotting function. The value can be True (on) or False (off). The default value is False. For details about this parameter, see (Optional) Sampling and Parsing msprof_tx. |
| data_simplification | Data simplification mode, Bool type. After it is enabled, data in the FRAMEWORK directory and redundant data are deleted after profile data is exported; only the profiler_info.json file and raw profile data in the ASCEND_PROFILER_OUTPUT and PROF_XXX directories are retained, saving storage space. The value can be True (on) or False (off). The default value is True. |
| aic_metrics | AI Core metrics to profile, Enum type. Possible values include AiCoreNone (default) and PipeUtilization. The results of these profiling items are displayed in the Kernel View. For details about the results, see op_summary (Operator Details). The actual collection result may vary. |
| l2_cache | L2 cache data collection switch, Bool type. The value can be True (on) or False (off). The default value is False. This collection item generates the l2_cache.csv file in ASCEND_PROFILER_OUTPUT. For details about the result fields, see l2_cache (L2 Cache Hit Ratio). |
| op_attr | Operator attribute data collection switch, Bool type. Currently, collection applies only to aclnn operators. The value can be True (on) or False (off). The default value is False. The profile data collected by this parameter can be parsed only to .db files, so export_type=torch_npu.profiler.ExportType.Db must be used. This parameter does not take effect when profiler_level is set to torch_npu.profiler.ProfilerLevel.Level_none. |
| record_op_args | Operator statistics switch, Bool type. The value can be True (on) or False (off). The default value is False. After it is enabled, a file with collected operator information is generated in the {worker_name}_{timestamp}_ascend_pt_op_args directory. |
| gc_detect_threshold | GC detection threshold, float type (unit: ms). The value must be greater than or equal to 0. If the threshold is a number, GC detection is enabled and only GC events whose duration exceeds the threshold are sampled; if it is set to 0, all GC events are sampled (exercise caution, because a large amount of data may be collected). The recommended value is 1 ms. The default value is None, indicating that GC detection is disabled. GC is used by the Python process to reclaim the memory of destroyed objects. If export_type is set to torch_npu.profiler.ExportType.Text, a GC layer is generated in the analysis result file trace_view.json; if export_type is set to torch_npu.profiler.ExportType.Db, a GC_RECORD table is generated in the ascend_pytorch_profiler_{rank_id}.db file. |
torch_npu.profiler.schedule Parameter Description
The torch_npu.profiler.schedule class parameters set the sampling behavior at different steps of the sampling process.

Prototype:

```python
torch_npu.profiler.schedule(wait, active, warmup=0, repeat=0, skip_first=0)
```
| Parameter | Description |
|---|---|
| wait | Number of steps skipped during each repeated collection cycle, int type. This parameter is required. |
| active | Number of steps for collection, int type. This parameter is required. |
| warmup | Number of warm-up steps, int type. The default value is 0. You are advised to set at least one warm-up step. This parameter is optional. |
| repeat | Number of times that the "wait + warmup + active" cycle is executed, int type. The default value is 0, indicating that the cycle repeats continuously until collection ends. You are advised to set this parameter to an integer greater than 0. This parameter is optional. |
| skip_first | Number of steps skipped before sampling starts, int type. The default value is 0. In dynamic-shape scenarios, you are advised to skip the first 10 steps to ensure stable profile data. In other scenarios, configure this parameter based on actual requirements. This parameter is optional. |

Note: You are advised to set schedule based on this formula: number of steps ≥ skip_first + (wait + warmup + active) × repeat.
The following figure shows the relationships between the torch_npu.profiler.schedule class, the step, and the on_trace_ready function.
A code example of the configuration:
```python
with torch_npu.profiler.profile(
    activities=[
        torch_npu.profiler.ProfilerActivity.CPU,
        torch_npu.profiler.ProfilerActivity.NPU,
    ],
    schedule=torch_npu.profiler.schedule(
        wait=1,       # Waiting phase: one step is skipped.
        warmup=1,     # Warm-up phase: one step is skipped.
        active=2,     # Record the activity data of two steps, then call on_trace_ready.
        repeat=2,     # Repeat the wait + warmup + active cycle twice.
        skip_first=1  # Skip the first step.
    ),
    on_trace_ready=torch_npu.profiler.tensorboard_trace_handler('./result')
) as prof:
    for _ in range(9):
        train_one_step()
        prof.step()  # Notify the profiler that a step has finished.
```
Using dynamic_profile for Dynamic Profiling of Maintenance and Test Logs
```
profiler_config_path/
├── log
│   ├── dp_ubuntu_xxxxxx_rank_*.log
│   ├── dp_ubuntu_xxxxxx_rank_*.log.1
│   ├── monitor_dp_ubuntu_xxxxxx_rank_*.log
│   └── monitor_dp_ubuntu_xxxxxx_rank_*.log.1
├── profiler_config.json
└── shm
```
- dp_ubuntu_xxxxxx.log: Execution log of dynamic_profile, which records all actions (INFO), warnings (WARNING), and errors (ERROR) during dynamic profiling. File naming format: dp_{Operating system}_{AI task process ID}_{rank_id}.log.
When an AI task is started, each rank initiates an AI task process, and dynamic_profile generates a separate log file for each process based on its process ID.
- dp_ubuntu_xxxxxx.log.1: This is a log aging backup file. The storage limit for the dp_ubuntu_xxxxxx.log file is 200 KB. Once the limit is reached, the earliest log entries are moved to dp_ubuntu_xxxxxx.log.1. The storage limit for the dp_ubuntu_xxxxxx.log.1 file is also 200 KB, and once the limit is reached, the earliest log entries are deleted through aging.
- monitor_dp_ubuntu_xxxxxx.log: This is the log for the profiler_config.json file modifications. After dynamic_profile is enabled for dynamic profiling, it records the modification time of the profiler_config.json file, whether the modifications take effect, and the end of the dynamic_profile process in real time. An example is shown below:
```
2024-08-21 15:51:46,392 [INFO] [2127856] _dynamic_profiler_monitor.py: Dynamic profiler process load json success
2024-08-21 15:51:58,406 [INFO] [2127856] _dynamic_profiler_monitor.py: Dynamic profiler process load json success
2024-08-21 15:58:16,795 [INFO] [2127856] _dynamic_profiler_monitor.py: Dynamic profiler process done
```
File naming format: monitor_dp_{Operating system}_{monitor process ID}_{rank_id}.log.
- monitor_dp_ubuntu_xxxxxx.log.1: This is a log aging backup file. The storage limit for the monitor_dp_ubuntu_xxxxxx.log file is 200 KB. Once the limit is reached, the earliest log entries are moved to monitor_dp_ubuntu_xxxxxx.log.1. The storage limit for the monitor_dp_ubuntu_xxxxxx.log.1 file is also 200 KB, and once the limit is reached, the earliest log entries are deleted through aging.
- shm directory: To support Python 3.7, dynamic_profile generates the shm directory in the environment. A binary file (DynamicProfileNpuShm+timestamp) is created in this directory to map shared memory, and the file is automatically cleaned up when the program ends normally. However, when the program is terminated using pkill, resources cannot be released because of the abnormal termination, and you need to delete this file manually. Otherwise, if dynamic_profile is restarted within a short time (< 1h) using the same configuration path, dynamic_profile fails to start. For Python 3.8 and later, the binary file (DynamicProfileNpuShm+timestamp) is stored in the /dev/shm directory; when the program is terminated using pkill, the file still needs to be cleaned up manually.
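The manual cleanup described above can be sketched as follows. clean_stale_shm is an illustrative helper, not part of torch_npu; point shm_dir at your actual shm directory (or /dev/shm for Python 3.8+). The demonstration below uses a throwaway directory standing in for shm.

```python
import tempfile
from pathlib import Path

def clean_stale_shm(shm_dir) -> list:
    """Remove leftover DynamicProfileNpuShm* files and return their names."""
    removed = []
    for path in Path(shm_dir).glob("DynamicProfileNpuShm*"):
        path.unlink()                     # delete the stale mapping file
        removed.append(path.name)
    return sorted(removed)

# Demonstration against a temporary directory standing in for shm/ or /dev/shm.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "DynamicProfileNpuShm1724221906").touch()  # stale mapping file
(demo_dir / "profiler_config.json").touch()            # unrelated file, kept
print(clean_stale_shm(demo_dir))  # -> ['DynamicProfileNpuShm1724221906']
print((demo_dir / "profiler_config.json").exists())  # -> True
```

Only files matching the DynamicProfileNpuShm prefix are removed, so the configuration file and logs in the same tree are untouched.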