Timeline and Summary Data
trace_view.json
As shown in Figure 1, trace data is displayed in the following areas:
- Area 1: upper-layer application data, including the time consumption information of upper-layer application operators.
- Area 2: data at the CANN layer, including the time consumption data of the AscendCL, GE, and Runtime components.
- Area 3: bottom-layer NPU data, including the time consumption data of Task Scheduler, iteration trace data, and other Ascend AI Processor system data.
- Area 4: details about each operator and API in a trace event, which are displayed when you click the trace event.
The trace_view.json file can be opened in MindStudio Insight, chrome://tracing/, and https://ui.perfetto.dev/.
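
For a quick look without a viewer, the file can also be inspected with a short script. The following is a minimal sketch, assuming that trace_view.json follows the Chrome Trace Event format accepted by the viewers above: either a JSON array of events or an object with a "traceEvents" array, in which complete events carry "ph": "X" together with "name" and "dur". The sample events and the helper names load_events and total_duration_by_name are hypothetical.

```python
import json
from collections import defaultdict


def load_events(text):
    """Accept either a bare event array or an object with a "traceEvents" key."""
    data = json.loads(text)
    return data["traceEvents"] if isinstance(data, dict) else data


def total_duration_by_name(events):
    """Sum the duration of complete ("X") events for each event name."""
    totals = defaultdict(float)
    for event in events:
        if event.get("ph") == "X" and "dur" in event:
            totals[event["name"]] += float(event["dur"])
    return dict(totals)


# Hypothetical sample; in practice, read the text from trace_view.json instead.
sample = json.dumps([
    {"ph": "X", "name": "aten::add", "pid": 1, "tid": 1, "ts": 10.0, "dur": 5.0},
    {"ph": "X", "name": "aten::add", "pid": 1, "tid": 1, "ts": 20.0, "dur": 7.0},
    {"ph": "X", "name": "aclnnAdd", "pid": 2, "tid": 3, "ts": 11.0, "dur": 2.5},
    {"ph": "M", "name": "process_name", "pid": 1, "args": {"name": "Python"}},
])

totals = total_duration_by_name(load_events(sample))
for name, dur in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {dur:.1f} us")
```

Metadata events (such as "ph": "M") carry no duration and are skipped, so only timed trace events contribute to the totals.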

When record_shapes is enabled, Input Dims and Input type are displayed for upper-layer application operators in trace_view.
This function is supported only in PyTorch scenarios, not in MindSpore scenarios.

When with_stack is enabled, Call stack is displayed for upper-layer application operators in trace_view.
In the profiling result shown in Figure 4, the time segments at the Python GC layer indicate the GC execution time.
The current process is blocked during GC and resumes only after the GC finishes. If GC takes a long time, you can adjust the GC parameters (see gc.set_threshold in Garbage Collector) to reduce the blocking.
This function is supported only in PyTorch scenarios.
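
As a minimal sketch of the gc.set_threshold adjustment, the following snippet raises the generation-0 threshold of the standard Python gc module so that collections are triggered less often. The value 10000 is an illustrative assumption, not a recommended setting; a suitable value depends on the workload and should be verified against a new profiling run.

```python
import gc

# Record the current thresholds so that they can be restored later if needed.
original = gc.get_threshold()

# Raise the generation-0 threshold so that automatic collections run less often.
# 10000 is a hypothetical value chosen only for illustration.
gc.set_threshold(10000, original[1], original[2])

adjusted = gc.get_threshold()
print("before:", original, "after:", adjusted)
```

A higher threshold trades fewer GC pauses for more memory held by uncollected objects, so it is worth comparing the Python GC layer in trace_view before and after the change.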
kernel_details.csv


The file contains information about all operators executed on the NPU. If the frontend code calls schedule to mark steps, the Step Id field is included. However, if warmup is set to a non-zero value for schedule and asynchronous operator execution continues after each step of training or online inference, those operators may be profiled in the warmup phase. In that case, the Step Id field does not exist in kernel_details.csv.
Table 1 describes the fields.
When the aic_metrics parameter of experimental_config is configured, the corresponding fields are added to the kernel_details.csv file. For details about the added content, see experimental_config Parameter Description. For details about the fields in the file, see op_summary (Operator Details).
| Field | Description |
|---|---|
| Step Id&Step ID | Iteration ID. |
| Device_id | Device ID. |
| Model ID | Model ID. |
| Task ID | Task ID. |
| Stream ID | ID of the stream where a task is located. |
| Name | Operator name. |
| Type | Operator type. |
| OP State | Dynamic and static information about an operator. The value dynamic indicates a dynamic operator, and the value static indicates a static operator. Communication operators do not have this state, so N/A is displayed. This field is reported only when --task-time is l1. If --task-time is l0, N/A is displayed. |
| Accelerator Core | AI acceleration core type, including AI Core and AI CPU. |
| Start Time(us) | Operator execution start time, in μs. |
| Duration(us) | Execution duration of the current operator, in μs. |
| Wait Time(us) | Operator execution wait time, in μs. |
| Block Dim | Number of running blocks, which corresponds to the number of cores during task running. |
| Mix Block Dim | Some operators are executed simultaneously on both the AI Core and Vector Core. The Block Dim of the primary accelerator is described in the Block Dim field, and the Block Dim of the secondary accelerator is described in this field. If task_time is set to l0, this field is not collected and is displayed as N/A. (Atlas A2 Training Series Product/Atlas 800I A2 Inference Product) (Atlas A3 Training Series Product) |
| HF32 Eligible | Whether the HF32 precision flag is used. YES indicates that the flag is used, and NO indicates that it is not used. |
| Input Shapes | Operator input shapes. |
| Input Data Types | Input data types. |
| Input Formats | Input formats. |
| Output Shapes | Operator output shapes. |
| Output Data Types | Output data types. |
| Output Formats | Output data formats. |
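
To find which operator types dominate the NPU time, the Type and Duration(us) columns can be aggregated. The following is a minimal sketch using only the Python standard library. The sample rows, the operator names, and the helper name duration_by_type are hypothetical, and only a subset of the columns in Table 1 is shown.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt; in practice, open kernel_details.csv instead.
SAMPLE = """Name,Type,Duration(us)
matmul_op_1,MatMulV2,120.5
matmul_op_2,MatMulV2,98.0
add_op_1,Add,12.25
"""


def duration_by_type(csv_file):
    """Return (type, call count, total duration in us), longest total first."""
    totals = defaultdict(lambda: [0, 0.0])
    for row in csv.DictReader(csv_file):
        entry = totals[row["Type"]]
        entry[0] += 1
        entry[1] += float(row["Duration(us)"])
    ranking = [(op_type, count, total) for op_type, (count, total) in totals.items()]
    return sorted(ranking, key=lambda item: item[2], reverse=True)


ranking = duration_by_type(io.StringIO(SAMPLE))
for op_type, count, total in ranking:
    print(f"{op_type}: {count} calls, {total:.2f} us")
```

For a real file, pass an open file object, for example `open("kernel_details.csv", newline="")`, in place of the in-memory sample.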
memory_record.csv

The file contains the device memory usage records of PTA and GE, including the allocated memory and the memory occupation time. Table 2 describes the fields.
| Field | Description |
|---|---|
| Component | Component to which the memory record belongs, such as PTA or GE. |
| Timestamp(us) | Timestamp, which records the start time of device memory usage, in μs. |
| Total Allocated(MB) | Total allocated memory, in MB. |
| Total Reserved(MB) | Total reserved memory, in MB. |
| Total Active(MB) | Total memory requested by streams, including the unreleased memory that is reused by other streams, in MB. |
| Stream Ptr | Memory address of an AscendCL stream, which is used to mark different AscendCL streams. |
| Device Type | Device type and device ID. Only NPUs are involved. |
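
A common use of these records is to locate the memory peak of each component. The following is a minimal sketch using only the Python standard library. The sample rows and the helper name peak_allocated are hypothetical, and only a subset of the columns in Table 2 is shown.

```python
import csv
import io

# Hypothetical excerpt; in practice, open memory_record.csv instead.
SAMPLE = """Component,Timestamp(us),Total Allocated(MB),Total Reserved(MB)
PTA,1000.0,10.0,20.0
PTA,1500.0,35.5,40.0
GE,1200.0,8.0,8.0
PTA,2000.0,12.0,40.0
"""


def peak_allocated(csv_file):
    """Map each component to (timestamp in us, peak Total Allocated in MB)."""
    peaks = {}
    for row in csv.DictReader(csv_file):
        component = row["Component"]
        allocated = float(row["Total Allocated(MB)"])
        # Keep the first record that reaches the highest allocated value.
        if component not in peaks or allocated > peaks[component][1]:
            peaks[component] = (float(row["Timestamp(us)"]), allocated)
    return peaks


peaks = peak_allocated(io.StringIO(SAMPLE))
for component, (timestamp, allocated) in peaks.items():
    print(f"{component}: peak {allocated} MB at {timestamp} us")
```

The timestamp of a peak can then be matched against trace_view.json to see which operators were running at that moment.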
operator_memory.csv

The file contains the memory usage details of operators, including the memory required for executing specific operators on the NPU and the occupation time. The memory is allocated to PTA and GE. Table 3 describes the fields.
If the operator_memory.csv file contains negative or empty values, see Negative and Empty Value Description for details.
| Field | Description |
|---|---|
| Name | Operator name. |
| Size(KB) | Size of the memory occupied by the operator, in KB. |
| Allocation Time(us) | Tensor memory allocation time, in μs. |
| Release Time(us) | Tensor memory release time, in μs. |
| Active Release Time(us) | Actual time when the memory is returned to the memory pool, in μs. |
| Duration(us) | Memory occupation time (Release Time – Allocation Time), in μs. |
| Active Duration(us) | Actual memory occupation time (Active Release Time – Allocation Time), in μs. |
| Allocation Total Allocated(MB) | Total allocated memory during operator memory allocation, in MB. (If the operator name begins with aten, the memory is the PTA memory. If the operator name begins with cann, the memory is the GE memory.) |
| Allocation Total Reserved(MB) | Total occupied memory during operator memory allocation, in MB. (If the operator name begins with aten, the memory is the PTA memory. If the operator name begins with cann, the memory is the GE memory.) |
| Allocation Total Active(MB) | Total memory requested by the current stream during operator memory allocation, including the unreleased memory reused by other streams, in MB. |
| Release Total Allocated(MB) | Total allocated memory during operator memory release, in MB. (If the operator name begins with aten, the memory is the PTA memory. If the operator name begins with cann, the memory is the GE memory.) |
| Release Total Reserved(MB) | Total occupied memory during operator memory release, in MB. (If the operator name begins with aten, the memory is the PTA memory. If the operator name begins with cann, the memory is the GE memory.) |
| Release Total Active(MB) | Total memory that is reused by other streams during operator memory release, in MB. (If the operator name begins with aten, the memory is the PTA memory. If the operator name begins with cann, the memory is the GE memory.) |
| Stream Ptr | Memory address of an AscendCL stream, which is used to mark different AscendCL streams. |
| Device Type | Device type and device ID. Only NPUs are involved. |
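
The two duration fields in Table 3 follow directly from the time fields: Duration(us) is Release Time minus Allocation Time, and Active Duration(us) is Active Release Time minus Allocation Time. The following minimal sketch recomputes both from the time columns. The sample rows and the helper names to_float and occupation_times are hypothetical. Empty values are simply treated as unknown here; their meaning is covered in Negative and Empty Value Description.

```python
import csv
import io

# Hypothetical excerpt; in practice, open operator_memory.csv instead.
SAMPLE = """Name,Size(KB),Allocation Time(us),Release Time(us),Active Release Time(us)
aten::add,512.0,100.0,160.0,175.0
aten::mul,256.0,120.0,,
"""


def to_float(value):
    """Convert a CSV cell to float, returning None for an empty cell."""
    value = value.strip()
    return float(value) if value else None


def occupation_times(row):
    """Return (Duration, Active Duration) in us, or None where data is missing."""
    alloc = to_float(row["Allocation Time(us)"])
    release = to_float(row["Release Time(us)"])
    active_release = to_float(row["Active Release Time(us)"])
    duration = None
    active_duration = None
    if alloc is not None and release is not None:
        duration = release - alloc
    if alloc is not None and active_release is not None:
        active_duration = active_release - alloc
    return duration, active_duration


results = {
    row["Name"]: occupation_times(row)
    for row in csv.DictReader(io.StringIO(SAMPLE))
}
print(results)
```
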
npu_module_mem.csv


The npu_module_mem.csv data is automatically collected during the profiling process. It includes the component-level memory usage, primarily recording the memory occupied by components at the time of execution on the NPU. Table 4 describes the fields.
| Field | Description |
|---|---|
| Device_id | Device ID. |
| Component | Component name. |
| Timestamp(us) | Timestamp at which the memory occupied by the component is recorded, in μs. |
| Total Reserved(MB)&Total Reserved(KB) | Memory usage, in MB for PyTorch and KB for MindSpore. |
| Device | Device type and device ID. Only NPUs are involved. |
operator_details.csv

Table 5 describes the information contained in the operator_details.csv file.
| Field | Description |
|---|---|
| Name | Operator name. |
| Input Shapes | Shape information. |
| Call Stack | Function call stack information. It is controlled by the with_stack field. |
| Host Self Duration(us) | Time consumed by operators on the host (excluding other internally called operators), in μs. |
| Host Total Duration(us) | Time consumed by operators on the host, in μs. |
| Device Self Duration(us) | Time consumed by operators on the device (excluding other internally called operators), in μs. |
| Device Total Duration(us) | Time consumed by operators on the device, in μs. |
| Device Self Duration With AICore(us) | Time consumed by operators executed on the AI Core on the device (excluding internally called operators), in μs. |
| Device Total Duration With AICore(us) | Time consumed by operators executed on the AI Core on the device, in μs. |
step_trace_time.csv


Table 6 describes the time statistics of computation and communication in iterations.
| Field | Description |
|---|---|
| Device_id | Device ID. |
| Step | Number of iterations. |
| Computing | Total computation time of operators on the NPU (unit: μs). |
| Communication(Not Overlapped) | Communication time (unit: μs), which is the total communication time minus the overlapping time of computation and communication. |
| Overlapped | Overlapping time of computation and communication (unit: μs). Longer overlapping time indicates better parallelism between computation and communication. Ideally, communication and computation are completely overlapped. |
| Communication | Total communication time of operators on the NPU (unit: μs). |
| Free | Total iteration time minus the computation and communication time (unit: μs). It may include initialization, data loading, and CPU computation time. |
| Stage | Stage time, that is, the time excluding the receive operator time (unit: μs). |
| Bubble | Total receive time (unit: μs). |
| Communication(Not Overlapped and Exclude Receive) | Total communication time minus the overlapping time of computation and communication and the receive operator time (unit: μs). |
| Preparing | Duration from the time when the iteration starts to the time when the first computing or communication operator runs (unit: μs). |
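
The relationship among the Communication, Overlapped, and Communication(Not Overlapped) fields in Table 6 can be checked with simple arithmetic. The following is a minimal sketch with hypothetical values. The overlap ratio is an illustrative derived metric and is not a column of step_trace_time.csv.

```python
def communication_not_overlapped(communication_us, overlapped_us):
    """Total communication time minus the computation/communication overlap."""
    return communication_us - overlapped_us


def overlap_ratio(communication_us, overlapped_us):
    """Share of communication time hidden behind computation (0.0 to 1.0)."""
    if communication_us == 0:
        return 0.0
    return overlapped_us / communication_us


# Hypothetical values for one step, in us.
step = {"Computing": 800.0, "Communication": 400.0, "Overlapped": 300.0}

not_overlapped = communication_not_overlapped(step["Communication"], step["Overlapped"])
ratio = overlap_ratio(step["Communication"], step["Overlapped"])
print(f"Communication(Not Overlapped): {not_overlapped} us, overlap ratio: {ratio:.2f}")
```

A ratio close to 1.0 corresponds to the ideal case described for the Overlapped field, in which communication is almost completely hidden behind computation.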
data_preprocess.csv
The data_preprocess.csv file records the AI CPU data. See aicpu (AI CPU Operator Time Consumption Details) for the example and field description. Actual results may differ slightly.
l2_cache.csv
See l2_cache (L2 Cache Hit Ratio) for the example and field description. Actual results may differ slightly.
op_statistic.csv
See op_statistic (Operator Calling Times and Time Consumption) for the example and field description. Actual results may differ slightly.
api_statistic.csv
See api_statistic_*.csv File for the example and field description. Actual results may differ slightly.
pcie.csv
See pcie_*.csv File for the example and field description. Actual results may differ slightly.
hccs.csv
See hccs_*.csv File for the example and field description. Actual results may differ slightly.
nic.csv
See nic_*.csv File for the example and field description. Actual results may differ slightly.
roce.csv
See nic_*.csv File for the example and field description. Actual results may differ slightly.

