op_summary (Operator Details)

The op_summary_*.csv file summarizes AI Core and AI CPU operator data, listing the details and time consumption of each operator. This summary data does not contain timeline information.

Availability

Atlas 200/300/500 Inference Product

Atlas Training Series Product

op_summary_*.csv File

The file content is formatted as follows.

Figure 1 op_summary (example only)

The Task Duration field specifies the operator time consumption. You can sort operators by Task Duration to find time-consuming operators, or sort them by Task Type to view the time-consuming operators running on the AI Core or AI CPU.
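As a rough illustration of the sorting described above, the following Python sketch ranks operators by Task Duration(us) and totals the duration per Task Type. The sample rows, operator names, and helper function names are made up for illustration; only the column names come from this reference, and a real op_summary_*.csv file contains many more columns.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; a real op_summary_*.csv has many more columns.
SAMPLE = """Op Name,Task Type,Task Duration(us)
conv_1,AI Core,80.25
matmul_1,AI Core,120.5
topk_1,AI CPU,5.0
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def slowest_ops(rows, top_n=3):
    """Return the top_n rows with the largest Task Duration(us)."""
    ranked = sorted(rows, key=lambda r: float(r["Task Duration(us)"]), reverse=True)
    return ranked[:top_n]


def duration_by_task_type(rows):
    """Sum Task Duration(us) for each Task Type value."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["Task Type"]] += float(row["Task Duration(us)"])
    return dict(totals)


rows = load_rows(SAMPLE)
for row in slowest_ops(rows, top_n=2):
    print(row["Op Name"], row["Task Type"], row["Task Duration(us)"])
print(duration_by_task_type(rows))
```

To process a real file, the same functions could be fed the text read from the file instead of SAMPLE.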

When task_time is set to l0 or off, op_summary_*.csv does not display the PMU data of the AI Core.
Operators whose Task Type is HCCL usually contain a series of communication tasks. Each communication task has its own task ID and stream ID, which are not displayed here. Therefore, the task ID and stream ID of this type of operator are displayed as N/A.
If an input is scalar, the corresponding Input Shapes field is empty and formatted as ; ; ; ;. Each dimension is separated by a semicolon (;). This also applies to output shapes.
The tool checks operators for overflow/underflow. If an overflow/underflow is detected, the following alarm is displayed, and the computation result of that operator is unreliable.
Figure 2 Operator overflow/underflow alarm
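The shape fields can be split programmatically. The sketch below assumes that the shapes of different tensors are separated by semicolons and that the dimensions within one shape are separated by commas; this layout is an assumption made for illustration and should be checked against your own file. parse_shapes is a hypothetical helper name. A scalar appears as an empty entry and is returned as an empty list.

```python
from typing import List, Optional


def parse_shapes(field: str) -> Optional[List[List[int]]]:
    """Split a shapes field into one list of dimensions per tensor.

    Assumed layout (not confirmed by this reference): tensors are separated
    by ';' and dimensions by ','. A scalar yields an empty list. The value
    'N/A' (field not collected) yields None.
    """
    text = field.strip().strip('"')
    if text == "N/A":
        return None
    shapes = []
    for part in text.split(";"):
        part = part.strip()
        shapes.append([int(dim) for dim in part.split(",")] if part else [])
    return shapes


print(parse_shapes("1,3,224,224;64,3,7,7"))
print(parse_shapes("8,16;;4"))
print(parse_shapes("N/A"))
```

Note that a trailing semicolon produces one extra empty entry with this approach, so the number of returned shapes may need to be checked against the number of operator inputs.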

The op_summary_*.csv file content varies depending on the values of --aic-mode, --aic-metrics, and --task-time. The complete fields are as follows.

**Table 1** Description of common fields
Field	Description
Device_id	Device ID.
Model Name	Model name. It may be left empty if no related data is collected. (This field is not displayed by default or in the single-operator scenario.)
Model ID	Model ID.
Task ID	Task ID.
Stream ID	ID of the stream where a task is located.
Infer ID	Inference iteration ID. (This field is not displayed by default or in the single-operator scenario.)
Op Name	Operator name.
OP Type	Operator type. If task_time is set to l0, this field is not collected and is displayed as N/A.
OP State	Whether an operator is dynamic or static. The value dynamic indicates a dynamic operator, and the value static indicates a static operator. HCCL operators do not have this state, so N/A is displayed. This field is reported only when --task-time is l1. If --task-time is l0, N/A is displayed.
Task Type	Type of the accelerator that executes a task, including AI Core, AI Vector Core, and AI CPU. If task_time is set to l0, this field is not collected and is displayed as N/A.
Task Start Time(us)	Task start time (μs).
Task Duration(us)	Task duration (μs), including scheduling time to the accelerator, execution time on the accelerator, and response end time.
Task Wait Time(us)	Interval between tasks (μs).
Block Dim	Number of running task blocks, which corresponds to the number of cores during task running. If task_time is set to l0, this field is not collected and is displayed as 0.
HF32 Eligible	Whether the HF32 precision flag is used. YES indicates that the flag is used, and NO indicates that it is not.
Input Shapes	Input shapes. If task_time is set to l0, this field is not collected and is displayed as N/A.
Input Data Types	Input data types. If task_time is set to l0, this field is not collected and is displayed as N/A.
Input Formats	Input formats. If task_time is set to l0, this field is not collected and is displayed as N/A.
Output Shapes	Output shapes. If task_time is set to l0, this field is not collected and is displayed as N/A.
Output Data Types	Output data types. If task_time is set to l0, this field is not collected and is displayed as N/A.
Output Formats	Output data formats. If task_time is set to l0, this field is not collected and is displayed as N/A.
aicore_time(us)	Theoretical execution time (μs) of a task on the AI Core, assuming that all blocks are scheduled simultaneously and each block has an equal execution duration. Because the scheduling start time of each block typically differs slightly, the value of this field is slightly less than the actual execution time of the task on the AI Core. If the AI Core frequency changes (for example, through manual frequency adjustment, or through dynamic frequency adjustment when the power consumption exceeds the threshold), the data is inaccurate and should not be used for reference.
total_cycles	Total number of execution cycles of a task on the AI Core, which is the sum of the execution cycles of all blocks.
Register value	Value of the custom register whose data is to be collected. It is configured by --aic-metrics.
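Because several fields are displayed as N/A (or, for Block Dim, as 0) when task_time is set to l0, scripts that read the file may want to treat those values as missing rather than as real measurements. A minimal sketch, in which the helper names to_number and block_dim and the task_time_level parameter are hypothetical:

```python
from typing import Dict, Optional


def to_number(value: str) -> Optional[float]:
    """Convert a CSV cell to float; return None for empty or N/A cells."""
    text = value.strip()
    if text in ("", "N/A"):
        return None
    return float(text)


def block_dim(row: Dict[str, str], task_time_level: str) -> Optional[int]:
    """Return Block Dim, or None when it was not collected (task_time = l0)."""
    if task_time_level == "l0":
        # The 0 shown in the file is a placeholder, not a real core count.
        return None
    value = to_number(row["Block Dim"])
    return None if value is None else int(value)


# Hypothetical row for illustration only.
row = {"Task ID": "N/A", "Task Duration(us)": "12.5", "Block Dim": "0"}
print(to_number(row["Task ID"]))
print(to_number(row["Task Duration(us)"]))
print(block_dim(row, "l0"))
print(block_dim({"Block Dim": "8"}, "l1"))
```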

The following fields are supported by the Atlas 200/300/500 Inference Product and the Atlas Training Series Product.

**Table 2** Field description (**PipeUtilization**)
Field	Description
vec_time(us)	Time (μs) taken to execute Vector instructions.
vec_ratio	Ratio of cycles taken to execute Vector instructions to the total cycles.
mac_time(us)	Time (μs) taken to execute Cube instructions.
mac_ratio	Ratio of cycles taken to execute Cube instructions to the total cycles.
scalar_time(us)	Time (μs) taken to execute Scalar instructions.
scalar_ratio	Ratio of cycles taken to execute Scalar instructions to the total cycles.
mte1_time(us)	Time (μs) taken to execute MTE1 instructions (L1-to-L0A/L0B transfer).
mte1_ratio	Ratio of cycles taken to execute MTE1 instructions (L1-to-L0A/L0B transfer) to the total cycles.
mte2_time(us)	Time (μs) taken to execute MTE2 instructions (DDR-to-AI Core transfer).
mte2_ratio	Ratio of cycles taken to execute MTE2 instructions (DDR-to-AI Core transfer) to the total cycles.
mte3_time(us)	Time (μs) taken to execute MTE3 instructions (AI Core-to-DDR transfer).
mte3_ratio	Ratio of cycles taken to execute MTE3 instructions (AI Core-to-DDR transfer) to the total cycles.
icache_miss_rate	iCache miss rate. iCache is the L2 cache reserved for instructions. A high icache_miss_rate value indicates that the AI Core reads instructions inefficiently.
memory_bound	AI Core memory bound, calculated as: mte2_ratio/max(mac_ratio, vec_ratio). If the value is less than 1, no memory bound exists. If the value is greater than 1, the AI Core spends most of the time on memory transfers rather than computation. A larger value indicates a more severe memory bottleneck.
cube_utilization(%)	Cube operator utilization, which indicates whether the number of operations of the Cube operator per unit time reaches the theoretical upper limit. The closer the value is to 100%, the closer the operator is to that limit. Formula: cube_utilization = total_cycles/(freq * core_num * task_duration)
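The two formulas in this table can be written out as follows. This is a sketch under stated assumptions: the function names are hypothetical, freq is taken to be the AI Core frequency in MHz and task_duration to be in μs (so that their product is a cycle count), and the result of the utilization formula is multiplied by 100 to express it as a percentage. The handling of the case where both mac_ratio and vec_ratio are zero is also a choice made here, not something the formula specifies.

```python
def memory_bound(mte2_ratio: float, mac_ratio: float, vec_ratio: float) -> float:
    """memory_bound = mte2_ratio / max(mac_ratio, vec_ratio)."""
    compute = max(mac_ratio, vec_ratio)
    if compute == 0:
        # Assumption: no compute cycles at all; report inf if any MTE2 activity exists.
        return float("inf") if mte2_ratio > 0 else 0.0
    return mte2_ratio / compute


def cube_utilization(total_cycles: float, freq_mhz: float,
                     core_num: int, task_duration_us: float) -> float:
    """cube_utilization = total_cycles / (freq * core_num * task_duration), in percent.

    Assumption: freq is in MHz and task_duration is in microseconds,
    so freq * task_duration is the number of cycles available per core.
    """
    return 100.0 * total_cycles / (freq_mhz * core_num * task_duration_us)


# Hypothetical values for illustration only.
bound = memory_bound(mte2_ratio=0.5, mac_ratio=0.25, vec_ratio=0.1)
print(bound, "memory bound" if bound > 1 else "not memory bound")
print(cube_utilization(total_cycles=1_000_000, freq_mhz=1000,
                       core_num=10, task_duration_us=200))
```

With these sample values, mte2_ratio is twice the larger compute ratio, so the operator would be considered memory bound according to the rule in the table.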

**Table 3** Field description (**ArithmeticUtilization**)
Field	Description
mac_fp16_ratio	Ratio of cycles taken to execute Cube fp16 instructions to the total cycles.
mac_int8_ratio	Ratio of cycles taken to execute Cube int8 instructions to the total cycles.
vec_fp32_ratio	Ratio of cycles taken to execute Vector fp32 instructions to the total cycles.
vec_fp16_ratio	Ratio of cycles taken to execute Vector fp16 instructions to the total cycles.
vec_int32_ratio	Ratio of cycles taken to execute Vector int32 instructions to the total cycles.
vec_misc_ratio	Ratio of cycles taken to execute Vector misc instructions to the total cycles.
cube_fops	Floating-point operations (FLOPs, that is, fops in this field) of the Cube type, indicating the computation amount. This field can be used to measure the complexity of an algorithm or model.
vector_fops	Floating-point operations (FLOPs, that is, fops in this field) of the Vector type, indicating the computation amount. This field can be used to measure the complexity of an algorithm or model.

**Table 4** Field description (**Memory**)
Field	Description
ub_read_bw(GB/s)	UB read bandwidth (GB/s)
ub_write_bw(GB/s)	UB write bandwidth (GB/s)
l1_read_bw(GB/s)	L1 read bandwidth (GB/s)
l1_write_bw(GB/s)	L1 write bandwidth (GB/s)
l2_read_bw	L2 read bandwidth (GB/s). It is supported only by the Atlas 200/300/500 Inference Product.
l2_write_bw	L2 write bandwidth (GB/s). It is supported only by the Atlas 200/300/500 Inference Product.
main_mem_read_bw(GB/s)	Main memory read bandwidth (GB/s)
main_mem_write_bw(GB/s)	Main memory write bandwidth (GB/s)

**Table 5** Field description (**MemoryL0**)
Field	Description
l0a_read_bw(GB/s)	L0A read bandwidth (GB/s)
l0a_write_bw(GB/s)	L0A write bandwidth (GB/s)
l0b_read_bw(GB/s)	L0B read bandwidth (GB/s)
l0b_write_bw(GB/s)	L0B write bandwidth (GB/s)
l0c_read_bw(GB/s)	Bandwidth rate for Vector to read data from L0C, in GB/s.
l0c_write_bw(GB/s)	Bandwidth rate for Vector to write data to L0C, in GB/s.
l0c_read_bw_cube(GB/s)	Bandwidth rate for Cube to read data from L0C, in GB/s.
l0c_write_bw_cube(GB/s)	Bandwidth rate for Cube to write data to L0C, in GB/s.

**Table 6** Field description (**MemoryUB**)
Field	Description
ub_read_bw_mte(GB/s)	Bandwidth rate for MTE to read data from UB, in GB/s. It is supported only by the Atlas 200/300/500 Inference Product.
ub_write_bw_mte(GB/s)	Bandwidth rate for MTE to write data to UB, in GB/s. It is supported only by the Atlas 200/300/500 Inference Product.
ub_read_bw_vector(GB/s)	Bandwidth rate for Vector to read data from UB, in GB/s.
ub_write_bw_vector(GB/s)	Bandwidth rate for Vector to write data to UB, in GB/s.
ub_read_bw_scalar(GB/s)	Bandwidth rate for Scalar to read data from UB, in GB/s.
ub_write_bw_scalar(GB/s)	Bandwidth rate for Scalar to write data to UB, in GB/s.

**Table 7** Field description (**ResourceConflictRatio**)
Field	Description
vec_bankgroup_cflt_ratio	Ratio of vec_bankgroup_stall_cycles to the total cycles. Bankgroup conflicts occur when the block stride of Vector instructions is improperly set.
vec_bank_cflt_ratio	Ratio of vec_bank_stall_cycles to the total cycles. Bank conflicts occur when the read/write pointer addresses of Vector instruction operands are improperly set.
vec_resc_cflt_ratio	Ratio of cycles stalled by Vector resource conflicts to the total cycles. If an operator involves multiple compute units, ensure that they are scheduled concurrently. If the operator logic keeps delivering instructions to a compute unit that is already working, the overall computing power is not fully utilized.
