Timeline and Summary Data

trace_view.json

Figure 1 trace_view

As shown in Figure 1, trace data is displayed in the following areas:

  • Area 1: upper-layer application data, including the time consumption information of upper-layer application operators.
  • Area 2: data at the CANN layer, including the time consumption data of the AscendCL, GE, and Runtime components.
  • Area 3: bottom-layer NPU data, including the time consumption data of Task Scheduler, iteration trace data, and other Ascend AI Processor system data.
  • Area 4: details about each operator and API in a trace event, which are displayed when you click the trace event.

The trace_view.json file can be opened in TensorBoard, chrome://tracing/, and https://ui.perfetto.dev/.
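Because the file opens in chrome://tracing/, it can presumably also be inspected with a short script. The sketch below assumes trace_view.json follows the Chrome Trace Event Format (events with name, ph, ts, and dur keys); the sample events and the summarize_trace helper are hypothetical and are not part of the profiler.

```python
import json
from collections import defaultdict


def summarize_trace(trace):
    """Return the total duration (us) per event name for complete ("X") events."""
    # The Trace Event Format allows either a bare list of events or an
    # object that holds the list under "traceEvents".
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        if event.get("ph") == "X":  # "X" marks an event with a duration
            totals[event["name"]] += float(event.get("dur", 0))
    return dict(totals)


# Made-up sample events; a real file would be loaded with
# json.load(open("trace_view.json")) instead.
sample = json.loads("""
[
  {"name": "aten::add", "ph": "X", "ts": 10, "dur": 5, "pid": 1, "tid": 1},
  {"name": "aten::add", "ph": "X", "ts": 30, "dur": 7, "pid": 1, "tid": 1},
  {"name": "process_name", "ph": "M", "pid": 1, "args": {"name": "Python"}}
]
""")
trace_totals = summarize_trace(sample)
print(trace_totals)
```

Metadata events (ph set to "M") carry no duration, so the sketch ignores them.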

Figure 2 trace_view (record_shapes)

When record_shapes is enabled, Input Dims and Input type are displayed for upper-layer application operators in trace_view.

Figure 3 trace_view (with_stack)

When with_stack is enabled, Call stack is displayed for upper-layer application operators in trace_view.

Figure 4 trace_view (GC)

In Figure 4, the time segments at the Python GC layer show the GC execution time.

The current process is blocked while GC runs and resumes only after GC finishes. If GC takes a long time, adjust the GC parameters (see gc.set_threshold in Garbage Collector) to reduce the blocking caused by GC.
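As a minimal sketch of such an adjustment, the standard gc module allows the generation-0 threshold to be raised so that automatic collections trigger less often. The relax_gc helper and the factor of 10 are illustrative assumptions, not recommended values.

```python
import gc


def relax_gc(factor=10):
    """Raise the generation-0 threshold so automatic collections run less often."""
    threshold0, threshold1, threshold2 = gc.get_threshold()
    # Only the first threshold is scaled; the others are passed back unchanged.
    gc.set_threshold(threshold0 * factor, threshold1, threshold2)
    return gc.get_threshold()


before = gc.get_threshold()
after = relax_gc(factor=10)
print("before:", before, "after:", after)
```

A higher threshold trades fewer pauses for cyclic garbage that stays in memory longer, so it is worth re-profiling after any change.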

kernel_details.csv

Figure 5 kernel_details

The kernel_details.csv file is controlled by the torch_npu.profiler.ProfilerActivity.NPU switch. It contains information about all operators executed on the NPU. If the user code calls schedule to mark steps, the Step Id field is added. Table 1 describes the fields.

When the aic_metrics parameter of experimental_config is configured, the corresponding fields are added to the kernel_details.csv file. For details about the added content, see experimental_config Parameter Description. For details about the fields in the file, see op_summary (Operator Details).

Table 1 kernel_details

| Field | Description |
| --- | --- |
| Step Id | Iteration ID. |
| Model ID | Model ID. |
| Task ID | Task ID. |
| Stream ID | ID of the stream where the task is located. |
| Name | Operator name. |
| Type | Operator type. |
| OP State | Whether the operator is dynamic or static. The value dynamic indicates a dynamic operator, and the value static indicates a static operator. HCCL operators do not have this state, so N/A is displayed. This field is reported only when --task-time is l1. If --task-time is l0, N/A is displayed. |
| Accelerator Core | AI acceleration core type, including AI Core and AI CPU. |
| Start Time(us) | Operator execution start time (μs). |
| Duration(us) | Execution duration of the current operator (μs). |
| Wait Time(us) | Operator execution wait time (μs). |
| Block Dim | Number of running blocks, which corresponds to the number of cores used while the task runs. |
| Mix Block Dim | Some operators are executed simultaneously on both the AI Core and Vector Core. The Block Dim of the primary accelerator is given in the Block Dim field, and the Block Dim of the secondary accelerator is given in this field. If task_time is set to l0, this field is not collected and is displayed as N/A. |
| HF32 Eligible | Whether the HF32 precision flag is used. YES indicates that it is used, and NO indicates that it is not used. |
| Input Shapes | Operator input shapes. |
| Input Data Types | Input data types. |
| Input Formats | Input formats. |
| Output Shapes | Operator output shapes. |
| Output Data Types | Output data types. |
| Output Formats | Output formats. |
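The fields in Table 1 lend themselves to quick aggregation. The following sketch sums Duration(us) per operator Type using only the standard csv module. It assumes the CSV header matches the field names in Table 1; the sample rows and the duration_by_type helper are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def duration_by_type(rows):
    """Sum Duration(us) per operator Type, skipping non-numeric values such as N/A."""
    totals = defaultdict(float)
    for row in rows:
        try:
            duration = float(row["Duration(us)"])
        except (KeyError, TypeError, ValueError):
            continue  # missing column, empty cell, or N/A
        totals[row["Type"]] += duration
    return dict(totals)


# Made-up sample rows; a real run would pass csv.DictReader(open("kernel_details.csv")).
sample_csv = """Name,Type,Accelerator Core,Duration(us)
MatMul_1,MatMul,AI Core,120.5
Add_1,Add,AI Core,10.0
MatMul_2,MatMul,AI Core,79.5
Comm_1,Comm,N/A,N/A
"""
kernel_totals = duration_by_type(csv.DictReader(io.StringIO(sample_csv)))
print(kernel_totals)
```

Rows whose duration cannot be parsed are skipped rather than counted as zero, so they do not create empty entries in the result.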

memory_record.csv

Figure 6 memory_record

The memory_record.csv file is controlled by the profile_memory switch. It records the memory usage of the PTA and GE components, including memory allocation and occupation time at the operator level (PTA, GE, and PTA+GE) and at the process level (APP). Table 2 describes the fields.

Table 2 memory_record

| Field | Description |
| --- | --- |
| Component | Component name. PTA, GE, and PTA+GE are operator-level components, and APP is a process-level component. |
| Timestamp(us) | Timestamp, which records the start time of memory usage (μs). |
| Total Allocated(MB) | Total allocated memory (MB). |
| Total Reserved(MB) | Total reserved memory (MB). |
| Total Active(MB) | Total memory requested by streams in the PTA (MB), including the unreleased memory that is reused by other streams. |
| Stream Ptr | Memory address of an AscendCL stream, which is used to mark different AscendCL streams. |
| Device Type | Device type and device ID. Only NPUs are involved. |
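One way to read this file is to look for the peak Total Allocated(MB) value of each component. The sketch below assumes the CSV header matches the field names in Table 2; the sample rows and the peak_allocated helper are hypothetical.

```python
import csv
import io


def peak_allocated(rows):
    """Return the largest Total Allocated(MB) value seen for each Component."""
    peaks = {}
    for row in rows:
        try:
            value = float(row["Total Allocated(MB)"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a numeric value
        component = row["Component"]
        peaks[component] = max(peaks.get(component, value), value)
    return peaks


# Made-up sample rows for illustration only.
sample_csv = """Component,Timestamp(us),Total Allocated(MB)
PTA,100,12.5
GE,110,3.0
PTA,120,20.0
PTA,130,8.0
APP,140,64.0
"""
memory_peaks = peak_allocated(csv.DictReader(io.StringIO(sample_csv)))
print(memory_peaks)
```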

operator_memory.csv

Figure 7 operator_memory

The operator_memory.csv file is controlled by the profile_memory switch. It contains the memory usage details of operators, including the memory required for executing specific operators on the NPU and the occupation time. The memory is allocated to the PTA and GE. Table 3 describes the fields.

If the operator_memory.csv file contains negative or empty values, see Negative and Empty Value Description for details.

Table 3 operator_memory

| Field | Description |
| --- | --- |
| Name | Operator name. |
| Size(KB) | Size of the memory occupied by the operator (KB). |
| Allocation Time(us) | Tensor memory allocation time (μs). |
| Release Time(us) | Tensor memory release time (μs). |
| Active Release Time(us) | Actual time when the memory is returned to the memory pool (μs). |
| Duration(us) | Memory occupation time (Release Time – Allocation Time) (μs). |
| Active Duration(us) | Actual memory occupation time (Active Release Time – Allocation Time) (μs). |
| Allocation Total Allocated(MB) | Total allocated memory during operator memory allocation (MB). (If the operator name begins with aten, the memory is the PTA memory. If the operator name begins with cann, the memory is the GE memory.) |
| Allocation Total Reserved(MB) | Total occupied memory during operator memory allocation (MB). (If the operator name begins with aten, the memory is the PTA memory. If the operator name begins with cann, the memory is the GE memory.) |
| Allocation Total Active(MB) | Total memory requested by the current stream during operator memory allocation (MB), including the unreleased memory reused by other streams. |
| Release Total Allocated(MB) | Total allocated memory during operator memory release (MB). (If the operator name begins with aten, the memory is the PTA memory. If the operator name begins with cann, the memory is the GE memory.) |
| Release Total Reserved(MB) | Total occupied memory during operator memory release (MB). (If the operator name begins with aten, the memory is the PTA memory. If the operator name begins with cann, the memory is the GE memory.) |
| Release Total Active(MB) | Total memory that is reused by other streams during operator memory release (MB). (If the operator name begins with aten, the memory is the PTA memory. If the operator name begins with cann, the memory is the GE memory.) |
| Stream Ptr | Memory address of an AscendCL stream, which is used to mark different AscendCL streams. |
| Device Type | Device type and device ID. Only NPUs are involved. |
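The two duration formulas in Table 3 can be recomputed from the timestamp fields. The sketch below does so and returns None where a timestamp is empty. It does not interpret why a value is empty or negative; the sample rows and the occupation_times helper are hypothetical.

```python
import csv
import io


def to_float(text):
    """Convert a CSV cell to float, returning None for empty or non-numeric cells."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def occupation_times(row):
    """Return (Duration, Active Duration) in us, with None where data is missing."""
    alloc = to_float(row.get("Allocation Time(us)"))
    release = to_float(row.get("Release Time(us)"))
    active_release = to_float(row.get("Active Release Time(us)"))
    # Duration = Release Time - Allocation Time
    duration = release - alloc if alloc is not None and release is not None else None
    # Active Duration = Active Release Time - Allocation Time
    active = (
        active_release - alloc
        if alloc is not None and active_release is not None
        else None
    )
    return duration, active


# Made-up sample rows; the second row has empty release timestamps.
sample_csv = """Name,Allocation Time(us),Release Time(us),Active Release Time(us)
aten::add,100,130,145
aten::mul,200,,
"""
results = [occupation_times(r) for r in csv.DictReader(io.StringIO(sample_csv))]
print(results)
```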

npu_module_mem.csv

Figure 8 npu_module_mem

The npu_module_mem.csv data is collected automatically during PyTorch profile data collection. It records the current memory usage of each component running on the NPU. Table 4 describes the fields.

Table 4 npu_module_mem

| Field | Description |
| --- | --- |
| Component | Component name. |
| Timestamp(us) | Timestamp at which the component's memory usage was recorded (μs). |
| Total Reserved(MB) | Memory usage (MB). |
| Device | Device type and device ID. Only NPUs are involved. |

operator_details.csv

Figure 9 operator_details

The operator_details.csv file is controlled by the torch_npu.profiler.ProfilerActivity.CPU switch. Table 5 describes the fields.

Table 5 operator_details

| Field | Description |
| --- | --- |
| Name | Operator name. |
| Input Shapes | Operator input shapes. |
| Call Stack | Function call stack information. It is controlled by the with_stack field. |
| Host Self Duration(us) | Time consumed by the operator on the host, excluding internally called operators (μs). |
| Host Total Duration(us) | Time consumed by the operator on the host (μs). |
| Device Self Duration(us) | Time consumed by the operator on the device, excluding internally called operators (μs). |
| Device Total Duration(us) | Time consumed by the operator on the device (μs). |
| Device Self Duration With AICore(us) | Time consumed by the operator on the AI Core of the device, excluding internally called operators (μs). |
| Device Total Duration With AICore(us) | Time consumed by the operator on the AI Core of the device (μs). |

step_trace_time.csv

Figure 10 step_trace_time

The step_trace_time.csv file is extracted from data in the trace_view.json file. Table 6 describes the information contained in this file.

Table 6 step_trace_time

| Field | Description |
| --- | --- |
| Step | Iteration (step) number. |
| Computing | Total computation time of operators on the NPU (unit: μs). |
| Communication(Not Overlapped) | Communication time that is not overlapped with computation (unit: μs), which is the total communication time minus the overlapping time of computation and communication. |
| Overlapped | Overlapping time of computation and communication (unit: μs). A longer overlapping time indicates better parallelism between computation and communication. Ideally, communication and computation are completely overlapped. |
| Communication | Total communication time of operators on the NPU (unit: μs). |
| Free | Total iteration time minus the computation and communication time (unit: μs). It may include initialization, data loading, and CPU computation time. |
| Stage | Stage time, that is, the time excluding the receive operator time (unit: μs). |
| Bubble | Total receive time (unit: μs). |
| Communication(Not Overlapped and Exclude Receive) | Total communication time minus the overlapping time of computation and communication and the receive operator time (unit: μs). |
| Preparing | Duration from the time when the iteration starts to the time when the first computing or communication operator runs (unit: μs). |
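The relationship between Communication, Overlapped, and Communication(Not Overlapped) can be checked per step with a few lines of code. In the sketch below, the overlap ratio (Overlapped divided by Communication) is an illustrative metric and not a profiler field; the sample rows and the overlap_summary helper are hypothetical.

```python
import csv
import io


def overlap_summary(row):
    """Return (communication not overlapped, overlap ratio) for one step."""
    communication = float(row["Communication"])
    overlapped = float(row["Overlapped"])
    # Communication(Not Overlapped) = Communication - Overlapped
    not_overlapped = communication - overlapped
    # Share of communication hidden behind computation; 0.0 if there is none.
    ratio = overlapped / communication if communication else 0.0
    return not_overlapped, ratio


# Made-up sample rows for illustration only.
sample_csv = """Step,Computing,Communication(Not Overlapped),Overlapped,Communication
1,1000,150,50,200
2,900,0,100,100
"""
step_summaries = [
    overlap_summary(r) for r in csv.DictReader(io.StringIO(sample_csv))
]
print(step_summaries)
```

A ratio of 1.0 corresponds to the ideal case described in the table, where communication is completely overlapped with computation.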

data_preprocess.csv

The data_preprocess.csv file records AI CPU data. For an example and field descriptions, see aicpu (AI CPU Operator Time Consumption Details). Actual results may differ slightly.

l2_cache.csv

For an example and field descriptions, see l2_cache (L2 Cache Hit Ratio). Actual results may differ slightly.

op_statistic.csv

For an example and field descriptions, see op_statistic (Operator Calling Times and Time Consumption). Actual results may differ slightly.

api_statistic.csv

For an example and field descriptions, see api_statistic_*.csv File. Actual results may differ slightly.