Timeline and Summary Data
trace_view.json
As shown in Figure 1, trace data is displayed in the following areas:
- Area 1: upper-layer application data, including the time consumption information of upper-layer application operators.
- Area 2: data at the CANN layer, including the time consumption data of the AscendCL, GE, and Runtime components.
- Area 3: bottom-layer NPU data, including the time consumption data of Task Scheduler, iteration trace data, and other Ascend AI Processor system data.
- Area 4: details about each operator and API in a trace event, which are displayed when you click the trace event.
The trace_view.json file can be opened in TensorBoard, chrome://tracing/, and https://ui.perfetto.dev/.
When record_shapes is enabled, Input Dims and Input type are displayed for upper-layer application operators in trace_view.
When with_stack is enabled, Call stack is displayed for upper-layer application operators in trace_view.
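
Besides viewing the file in a trace viewer, you can post-process it with a script. The following is a minimal sketch, assuming that trace_view.json follows the Chrome Trace Event Format accepted by chrome://tracing (a JSON array of events, or an object with a traceEvents key, where complete events use "ph": "X" and carry a "dur" value in μs). The helper name and the sample events are hypothetical and are not taken from a real profiling run.

```python
import json
from collections import defaultdict


def total_duration_by_name(trace):
    """Sum the 'dur' of complete ('X') events per event name."""
    # The Trace Event Format allows a bare list or an object with "traceEvents".
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        if event.get("ph") == "X":  # skip metadata and other event phases
            totals[event["name"]] += float(event.get("dur", 0))
    return dict(totals)


# Hypothetical sample; for a real file use: trace = json.load(open(path))
sample = json.loads("""[
  {"ph": "X", "name": "aten::add", "ts": 0, "dur": 12.5, "pid": 1, "tid": 1},
  {"ph": "X", "name": "aten::add", "ts": 20, "dur": 7.5, "pid": 1, "tid": 1},
  {"ph": "M", "name": "process_name", "pid": 1, "args": {"name": "demo"}},
  {"ph": "X", "name": "aten::mul", "ts": 40, "dur": 3.0, "pid": 1, "tid": 1}
]""")

totals = total_duration_by_name(sample)
for name, dur in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {dur} us")
```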
In the sampling result shown in Figure 4, each time segment at the Python GC layer represents the GC execution time.
During GC, the current process is blocked. You need to wait until the GC finishes. If the GC takes a long time, you can adjust the GC parameters (see gc.set_threshold in Garbage Collector) to relieve the process blocking caused by GC.
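
The adjustment mentioned above can be sketched with the standard gc module. The threshold values and the relax_gc helper are illustrative assumptions, not recommended settings; suitable values depend on the workload.

```python
import gc


def relax_gc(threshold0=10000, threshold1=50, threshold2=50):
    """Raise the GC thresholds so collections run less often.

    Returns the previous thresholds so that they can be restored later.
    The default values here are hypothetical examples.
    """
    previous = gc.get_threshold()
    gc.set_threshold(threshold0, threshold1, threshold2)
    return previous


previous = relax_gc()
current = gc.get_threshold()
print("previous thresholds:", previous)
print("current thresholds:", current)
```

A higher threshold0 means that more allocations must accumulate before a generation-0 collection is triggered, which trades memory for fewer pauses. To restore the original behavior, call gc.set_threshold(*previous).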
kernel_details.csv
The kernel_details.csv file is controlled by the torch_npu.profiler.ProfilerActivity.NPU switch. It contains information about all operators executed on the NPU. If the frontend calls schedule to mark steps, the Step Id field is added. Table 1 describes the fields.
When the aic_metrics parameter of experimental_config is configured, the corresponding fields are added to the kernel_details.csv file. For details about the added content, see experimental_config Parameter Description. For details about the fields in the file, see op_summary (Operator Details).

| Field | Description |
|---|---|
| Step Id | Iteration ID. |
| Model ID | Model ID. |
| Task ID | Task ID. |
| Stream ID | ID of the stream where a task is located. |
| Name | Operator name. |
| Type | Operator type. |
| OP State | Dynamic and static information about an operator. The value dynamic indicates a dynamic operator, and the value static indicates a static operator. The HCCL operator does not have this state, so N/A is displayed. This field is reported only when --task-time is l1. If --task-time is l0, N/A is displayed. |
| Accelerator Core | AI acceleration core type, including AI Core and AI CPU. |
| Start Time(us) | Operator execution start time (μs). |
| Duration(us) | Execution duration of the current operator (μs). |
| Wait Time(us) | Operator execution wait time (μs). |
| Block Dim | Number of running blocks, which corresponds to the number of cores during task running. |
| Mix Block Dim | Some operators are executed simultaneously on both the AI Core and Vector Core. The Block Dim of the primary accelerator is described in the Block Dim field, and the Block Dim of the secondary accelerator is described in this field. If task_time is set to l0, this field is not collected and is displayed as N/A. |
| HF32 Eligible | Whether to use the HF32 precision flag. YES indicates that the HF32 precision flag is used, while NO indicates that the HF32 precision flag is not used. |
| Input Shapes | Operator input shapes. |
| Input Data Types | Input data types. |
| Input Formats | Input formats. |
| Output Shapes | Operator output shapes. |
| Output Data Types | Output data types. |
| Output Formats | Output data formats. |

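
A common use of these fields is to rank operator types by total execution time. The following is a minimal sketch that uses only the Name, Type, and Duration(us) columns from Table 1. The sample rows and the duration_by_type helper are hypothetical; a real kernel_details.csv contains more columns, and this sketch assumes that every Duration(us) value is numeric.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; real files contain all the fields in Table 1.
SAMPLE = """Name,Type,Duration(us)
MatMul_1,MatMul,120.0
Add_1,Add,15.5
MatMul_2,MatMul,80.0
"""


def duration_by_type(csv_file):
    """Return (Type, total Duration(us)) pairs, longest total first."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[row["Type"]] += float(row["Duration(us)"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# For a real file, pass open("kernel_details.csv", newline="") instead.
ranking = duration_by_type(io.StringIO(SAMPLE))
for op_type, total_us in ranking:
    print(f"{op_type}: {total_us} us")
```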
memory_record.csv
The memory_record.csv file is controlled by the profile_memory switch. It records the memory usage of the PTA and GE components, including their memory allocations and occupation time at the operator level (PTA, GE, PTA + GE) and process level. Table 2 describes the fields.

| Field | Description |
|---|---|
| Component | Component name. PTA, GE, and PTA+GE are operator-level components, and APP is a process-level component. |
| Timestamp(us) | Timestamp, which records the start time of memory usage (μs). |
| Total Allocated(MB) | Total allocated memory (MB). |
| Total Reserved(MB) | Total reserved memory (MB). |
| Total Active(MB) | Total memory requested by streams in the PTA (MB), including the unreleased memory that is reused by other streams. |
| Stream Ptr | Memory address of an AscendCL stream, which is used to mark different AscendCL streams. |
| Device Type | Device type and device ID. Only NPUs are involved. |

operator_memory.csv
The operator_memory.csv file is controlled by the profile_memory switch. It contains the memory usage details of operators, including the memory required for executing specific operators on the NPU and the occupation time. The memory is allocated to the PTA and GE. Table 3 describes the fields.
If the operator_memory.csv file contains negative or empty values, see Negative and Empty Value Description for details.

| Field | Description |
|---|---|
| Name | Operator name. |
| Size(KB) | Size of the memory occupied by the operator (KB). |
| Allocation Time(us) | Tensor memory allocation time (μs). |
| Release Time(us) | Tensor memory release time (μs). |
| Active Release Time(us) | Actual time when the memory is returned to the memory pool (μs). |
| Duration(us) | Memory occupation time (Release Time – Allocation Time) (μs). |
| Active Duration(us) | Actual memory occupation time (Active Release Time – Allocation Time) (μs). |
| Allocation Total Allocated(MB) | Total allocated memory during operator memory allocation (MB). (If the operator name begins with aten, the memory is the PTA memory. If the operator name begins with cann, the memory is the GE memory.) |
| Allocation Total Reserved(MB) | Total occupied memory during operator memory allocation (MB). (If the operator name begins with aten, the memory is the PTA memory. If the operator name begins with cann, the memory is the GE memory.) |
| Allocation Total Active(MB) | Total memory requested by the current stream during operator memory allocation (MB), including the unreleased memory reused by other streams. |
| Release Total Allocated(MB) | Total allocated memory during operator memory release (MB). (If the operator name begins with aten, the memory is the PTA memory. If the operator name begins with cann, the memory is the GE memory.) |
| Release Total Reserved(MB) | Total occupied memory during operator memory release (MB). (If the operator name begins with aten, the memory is the PTA memory. If the operator name begins with cann, the memory is the GE memory.) |
| Release Total Active(MB) | Total memory that is reused by other streams during operator memory release (MB). (If the operator name begins with aten, the memory is the PTA memory. If the operator name begins with cann, the memory is the GE memory.) |
| Stream Ptr | Memory address of an AscendCL stream, which is used to mark different AscendCL streams. |
| Device Type | Device type and device ID. Only NPUs are involved. |

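
The two duration fields in Table 3 are simple differences of the time fields. The sketch below recomputes them from the Allocation Time(us), Release Time(us), and Active Release Time(us) columns and leaves the result empty when an input value is empty. The sample rows, the operator name cann::demo, and the helper names are hypothetical and serve only to illustrate the arithmetic.

```python
import csv
import io

# Hypothetical sample rows; the second row has empty release times.
SAMPLE = """Name,Size(KB),Allocation Time(us),Release Time(us),Active Release Time(us)
aten::empty,512,1000.0,1500.0,1650.0
cann::demo,1024,2000.0,,
"""


def to_float(text):
    """Convert a CSV cell to float, or None if the cell is empty."""
    text = text.strip()
    return float(text) if text else None


def occupation_times(row):
    """Return (Duration, Active Duration) in us as defined in Table 3."""
    alloc = to_float(row["Allocation Time(us)"])
    release = to_float(row["Release Time(us)"])
    active_release = to_float(row["Active Release Time(us)"])
    duration = release - alloc if None not in (alloc, release) else None
    active = active_release - alloc if None not in (alloc, active_release) else None
    return duration, active


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
results = {row["Name"]: occupation_times(row) for row in rows}
for name, (duration, active) in results.items():
    print(name, duration, active)
```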
npu_module_mem.csv
The npu_module_mem.csv file is generated automatically during PyTorch profile data collection. It records the current component-level memory usage of components executed on the NPU. Table 4 describes the fields.
operator_details.csv
The operator_details.csv file is controlled by the torch_npu.profiler.ProfilerActivity.CPU switch. Table 5 describes the information contained in the operator_details.csv file.

| Field | Description |
|---|---|
| Name | Operator name. |
| Input Shapes | Shape information. |
| Call Stack | Function call stack information. It is controlled by the with_stack field. |
| Host Self Duration(us) | Time consumed by operators on the host (excluding other internally called operators) (μs). |
| Host Total Duration(us) | Time consumed by operators on the host (μs). |
| Device Self Duration(us) | Time consumed by operators on the device (excluding other internally called operators) (μs). |
| Device Total Duration(us) | Time consumed by operators on the device (μs). |
| Device Self Duration With AICore(us) | Time consumed by operators executed on the AI Core on the device (excluding internally called operators) (μs). |
| Device Total Duration With AICore(us) | Time consumed by operators executed on the AI Core on the device (μs). |

step_trace_time.csv
The step_trace_time.csv file is extracted from data in the trace_view.json file. Table 6 describes the information contained in this file.

| Field | Description |
|---|---|
| Step | Number of iterations. |
| Computing | Total computation time of operators on the NPU (unit: μs). |
| Communication(Not Overlapped) | Communication time (unit: μs), which is the total communication time minus the overlapping time of computation and communication. |
| Overlapped | Overlapping time of computation and communication (unit: μs). Longer overlapping time indicates better parallelism between computation and communication. Ideally, communication and computation are completely overlapped. |
| Communication | Total communication time of operators on the NPU (unit: μs). |
| Free | Total iteration time minus the computation and communication time (unit: μs). It may include initialization, data loading, and CPU computation time. |
| Stage | Stage time, indicating the time except the receive operator time (unit: μs). |
| Bubble | Total receive time (unit: μs). |
| Communication(Not Overlapped and Exclude Receive) | Total communication time minus the overlapping time of computation and communication and the receive operator time (unit: μs). |
| Preparing | Duration from the time when the iteration starts to the time when the first computing or communication operator runs (unit: μs). |

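
The relationship between the Communication, Overlapped, and Communication(Not Overlapped) fields can be checked with a short calculation. In the sketch below, the row values are hypothetical, and the overlap ratio (Overlapped divided by Communication) is an illustrative metric derived here rather than a field in the file.

```python
def communication_breakdown(row):
    """Return (not-overlapped communication time in us, overlap ratio)."""
    communication_us = float(row["Communication"])
    overlapped_us = float(row["Overlapped"])
    # Communication(Not Overlapped) = Communication - Overlapped
    not_overlapped_us = communication_us - overlapped_us
    # Guard against a step with no communication at all.
    ratio = overlapped_us / communication_us if communication_us else 0.0
    return not_overlapped_us, ratio


# Hypothetical values for one step, keyed by the field names in Table 6.
row = {"Step": "1", "Computing": "900.0", "Communication": "400.0", "Overlapped": "300.0"}

not_overlapped_us, ratio = communication_breakdown(row)
print(f"Communication(Not Overlapped): {not_overlapped_us} us")
print(f"Overlap ratio: {ratio:.0%}")
```

A ratio close to 1 means that most communication is hidden behind computation, which matches the ideal case described for the Overlapped field.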
data_preprocess.csv
The data_preprocess.csv file records the AI CPU data. The example and field description are based on aicpu (AI CPU Operator Time Consumption Details). Actual results may differ slightly.
l2_cache.csv
The example and field description are based on l2_cache (L2 Cache Hit Ratio). Actual results may differ slightly.
op_statistic.csv
The example and field description are based on op_statistic (Operator Calling Times and Time Consumption). Actual results may differ slightly.
api_statistic.csv
The example and field description are based on api_statistic_*.csv File. Actual results may differ slightly.

