op_summary (Operator Details)

The summary data of AI Core, AI Vector Core, and AI CPU operators does not contain timeline information. It is written to the op_summary_*.csv file, which records the details and time consumption of each operator.

Availability

Atlas 200/500 A2 Inference Product

Atlas Inference Series Product

Atlas Training Series Product

Atlas A2 Training Series Product/Atlas 800I A2 Inference Product

Atlas A3 Training Series Product

op_summary_*.csv File

The file content is formatted as follows.

Figure 1 op_summary (example only)

The Task Duration field specifies the operator time consumption. You can sort operators by Task Duration to find time-consuming operators, or sort them by Task Type to view the time-consuming operators running on the AI Core or AI CPU.
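
The sorting and filtering described above can be sketched with the Python standard library. This is a minimal illustration, not part of the profiling tool: the column names come from Table 1, while the function name top_operators and the SAMPLE rows are hypothetical. Rows whose duration is N/A or missing are treated as 0 so that they sort last.

```python
import csv
import io


def top_operators(csv_text, n=5, task_type=None):
    """Return the n rows with the largest Task Duration(us).

    csv_text is the text of an op_summary_*.csv file. If task_type is
    given (for example "AI_CORE" or "AI_CPU"), only rows with that
    Task Type are considered.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if task_type is not None:
        rows = [r for r in rows if r.get("Task Type") == task_type]

    def duration(row):
        # N/A, empty, or missing cells cannot be converted; sort them last.
        try:
            return float(row["Task Duration(us)"])
        except (KeyError, TypeError, ValueError):
            return 0.0

    return sorted(rows, key=duration, reverse=True)[:n]


# Hypothetical sample data, reduced to three columns for readability.
SAMPLE = """Op Name,Task Type,Task Duration(us)
MatMul_1,AI_CORE,120.5
Cast_2,AI_VECTOR_CORE,8.2
TopK_3,AI_CPU,310.0
Add_4,AI_CORE,N/A
"""

if __name__ == "__main__":
    for row in top_operators(SAMPLE, n=2):
        print(row["Op Name"], row["Task Type"], row["Task Duration(us)"])
```

For a real file, read its text with open(path, newline="", encoding="utf-8").read() and pass it to top_operators.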

Supported fields may vary by product. Please refer to the actual result files for the final list of fields.
When task_time is set to l0 or off, op_summary_*.csv does not display the PMU data of the AI Core and AI Vector Core.
Atlas A2 Training Series Product/Atlas 800I A2 Inference Product and Atlas A3 Training Series Product: MatMul operators are converted into MIX operators when the input matrices a and b meet all of the following conditions: the inner axis is greater than 1000, the theoretical MAC calculation duration is greater than 50 μs, and the inner axis size is not 516-byte aligned. In this case, the number of MatMul operators in op_summary_*.csv decreases, and Task Type changes from the original AI_Core to MIX_AIC.
If an operator runs for too long, its metrics become inaccurate. Such metrics are marked as N/A and are not displayed.
Operators whose Task Type is communication usually contain a series of communication tasks. Each communication task has an independent task ID and stream ID, which are not displayed here. Therefore, the task IDs and stream IDs of this type of operators are N/A.
If an input is a scalar, the corresponding Input Shapes entry is empty and the field is formatted as ; ; ; ;, where each dimension is separated by a semicolon (;). This also applies to Output Shapes.
The tool checks for operator overflow/underflow. If an overflow/underflow is detected, the following alarm is displayed, and the operator computation result is unreliable.
Figure 2 Operator overflow/underflow alarm
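
Several of the notes above state that a field can be reported as N/A. When post-processing the file, such cells need to be handled before any arithmetic. The following is a minimal sketch under the assumption that numeric cells are plain decimal strings. The helper names parse_metric and mean_ignoring_na are hypothetical.

```python
def parse_metric(value):
    """Convert a CSV cell to float; return None for N/A, empty, or non-numeric cells."""
    text = (value or "").strip()
    if text == "" or text.upper() == "N/A":
        return None
    try:
        return float(text)
    except ValueError:
        return None


def mean_ignoring_na(cells):
    """Average the numeric cells and skip those that parse to None."""
    values = [v for v in map(parse_metric, cells) if v is not None]
    return sum(values) / len(values) if values else None


if __name__ == "__main__":
    print(mean_ignoring_na(["10", "N/A", "20"]))
```

Returning None instead of 0 keeps unavailable metrics from skewing averages.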

The content of the op_summary_*.csv file varies depending on the values of --aic-mode, --aic-metrics, and --task-time. The complete fields are as follows.

**Table 1** Description of common fields
Field	Description
Device_id	Device ID.
Model Name	Model name. It may be left empty if no related data is collected. (This field is not displayed by default or in the single-operator scenario.)
Model ID	Model ID.
Task ID	Task ID.
Stream ID	ID of the stream where a task is located.
Infer ID	Inference iteration ID. (This field is not displayed by default or in the single-operator scenario.)
Op Name	Operator name.
OP Type	Operator type. If task_time is set to l0, this field is not collected and is displayed as N/A.
OP State	Dynamic and static information about an operator. The value dynamic indicates a dynamic operator, and the value static indicates a static operator. The communication operator does not have this state, so N/A is displayed. This field is reported only when --task-time is l1. If --task-time is l0, N/A is displayed.
Task Type	Type of the accelerator that executes the task, including AI_CORE, AI_VECTOR_CORE, and AI_CPU. If task_time is set to l0, this field is not collected and is displayed as N/A.
Task Start Time(us)	Task start time, in μs.
Task Duration(us)	Task duration (μs), including scheduling time to the accelerator, execution time on the accelerator, and response end time.
Task Wait Time(us)	Interval between tasks, in μs.
Block Dim	Number of running task blocks, which corresponds to the number of cores during task running. If task_time is set to l0, this field is not collected and is displayed as 0.
HF32 Eligible	Whether to use the HF32 precision flag. YES indicates that the HF32 precision flag is used, while NO indicates that the HF32 precision flag is not used. This field is reported only when --task-time is set to l1. If --task-time is set to l0, this field is displayed as N/A.
Mix Block Dim	Some operators are executed simultaneously on both the AI Core and Vector Core. The Block Dim of the primary accelerator is described in the Block Dim field, and the Block Dim of the secondary accelerator is described in this field. If task_time is set to l0, this field is not collected and is displayed as N/A. (Atlas 200/500 A2 Inference Product) (Atlas A2 Training Series Product/Atlas 800I A2 Inference Product) (Atlas A3 Training Series Product)
Input Shapes	Input shapes. If task_time is set to l0, this field is not collected and is displayed as N/A.
Input Data Types	Input data types. If task_time is set to l0, this field is not collected and is displayed as N/A.
Input Formats	Input formats. If task_time is set to l0, this field is not collected and is displayed as N/A.
Output Shapes	Output shapes. If task_time is set to l0, this field is not collected and is displayed as N/A.
Output Data Types	Output data types. If task_time is set to l0, this field is not collected and is displayed as N/A.
Output Formats	Output data formats. If task_time is set to l0, this field is not collected and is displayed as N/A.
Context ID	Context ID, which identifies a small operator of a subtask. If no small operator exists, N/A is displayed. (Atlas 200/500 A2 Inference Product) (Atlas A2 Training Series Product/Atlas 800I A2 Inference Product) (Atlas A3 Training Series Product)
aiv_time(us)	Theoretical execution time of a task on the AI Vector Core when all blocks are scheduled simultaneously and each block has an equal execution duration. The unit is μs. Typically, the scheduling start time of each block is slightly different, so the value of this field is slightly less than the actual execution time of the task on the AI Vector Core. (Atlas A2 Training Series Product/Atlas 800I A2 Inference Product) (Atlas A3 Training Series Product)
aicore_time(us)	Theoretical execution time of a task on the AI Core when all blocks are scheduled simultaneously and each block has an equal execution duration. The unit is μs. Typically, the scheduling start time of each block is slightly different, so the value of this field is slightly less than the actual execution time of the task on the AI Core. If the AI Core frequency changes (for example, through manual frequency adjustment, through dynamic frequency adjustment when the power consumption exceeds the threshold, or when the Atlas 300V/Atlas 300I Pro is involved), the data is inaccurate and is not recommended for reference. For details about the frequency change on the Atlas 200/500 A2 Inference Product, Atlas A2 Training Series Product/Atlas 800I A2 Inference Product, and Atlas A3 Training Series Product, see AI Core Frequency Viewing.
total_cycles	Total number of execution cycles of a task on the AI Core, which is the sum of the execution cycles of all blocks. On the Atlas 200/500 A2 Inference Product, Atlas A2 Training Series Product/Atlas 800I A2 Inference Product, and Atlas A3 Training Series Product, this field is split into aic_total_cycles (total number of cycles executed by the task on the AI Cube Core) and aiv_total_cycles (total number of cycles executed by the task on the AI Vector Core).
Register value	Value of the custom register whose data is to be collected. It is configured by --aic-metrics.

The following fields are generated when --task-time is set to l1 and --aic-mode is set to task-based. If --task-time is set to l0, these fields are not profiled and N/A is displayed. The generated data is controlled by the aic_metrics parameter.

**Table 2** Field description (**PipeUtilization**)
Field	Description
*_vec_time(us)	Time taken to execute Vector instructions, in μs. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.
*_vec_ratio	Ratio of cycles taken to execute Vector instructions to the total cycles. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.
*_mac_time(us)	Time taken to execute Cube instructions, in μs.
*_mac_ratio	Ratio of cycles taken to execute Cube instructions to the total cycles.
*_scalar_time(us)	Time taken to execute Scalar instructions, in μs.
*_scalar_ratio	Ratio of cycles taken to execute Scalar instructions to the total cycles.
aic_fixpipe_time(us)	Time taken to execute fixpipe instructions (L0C-to-OUT/L1 transfer), in μs.
aic_fixpipe_ratio	Ratio of cycles taken to execute fixpipe instructions (L0C-to-OUT/L1 transfer) to the total cycles.
*_mte1_time(us)	Time taken to execute MTE1 instructions (L1-to-L0A/L0B transfer), in μs.
*_mte1_ratio	Ratio of cycles taken to execute MTE1 instructions (L1-to-L0A/L0B transfer) to the total cycles.
*_mte2_time(us)	Time taken to execute MTE2 instructions (DDR-to-AI Core transfer), in μs.
*_mte2_ratio	Ratio of cycles taken to execute MTE2 instructions (DDR-to-AI Core transfer) to the total cycles.
*_mte3_time(us)	Time taken to execute MTE3 instructions (AI Core-to-DDR transfer), in μs.
*_mte3_ratio	Ratio of cycles taken to execute MTE3 instructions (AI Core-to-DDR transfer) to the total cycles.
*_icache_miss_rate	iCache is the L2 cache reserved for instructions. If the value of icache_miss_rate is high, the AI Core reads instructions at a low efficiency.
memory_bound	AI Core memory bound, calculated as: mte2_ratio/max(mac_ratio, vec_ratio). If the value is less than 1, no memory bound exists. If the value is greater than 1, the AI Core is mostly engaged in memory transfer instead of computation when executing tasks. A greater value indicates a more severe bound.
cube_utilization(%)	Cube operator utilization. Check whether the number of operations of the Cube operator in a unit time reaches the theoretical upper limit. A value closer to 100% indicates a value closer to the theoretical upper limit. Formula: cube_utilization = total_cycles/(freq * core_num * task_duration)
Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is the execution result on the Cube Core or Vector Core.
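
The memory_bound formula in the preceding table can be reproduced from the ratio fields. The sketch below is illustrative only: the function name and the classify helper are hypothetical, and the zero-denominator case is not covered by the table, so the sketch returns None for it.

```python
def memory_bound(mte2_ratio, mac_ratio, vec_ratio):
    """Compute mte2_ratio / max(mac_ratio, vec_ratio) as given in the table.

    Returns None when neither Cube nor Vector instructions were executed,
    because the formula is undefined in that case (assumption of this sketch).
    """
    denominator = max(mac_ratio, vec_ratio)
    if denominator <= 0:
        return None
    return mte2_ratio / denominator


def classify(value):
    """Interpret the value as described in the table."""
    if value is None:
        return "undefined"
    return "memory bound" if value > 1 else "no memory bound"


if __name__ == "__main__":
    value = memory_bound(mte2_ratio=0.5, mac_ratio=0.25, vec_ratio=0.125)
    print(value, classify(value))
```

The table does not say how a value of exactly 1 is classified. This sketch treats it as not memory bound.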

**Table 3** Field description (**ArithmeticUtilization**)
Field	Description
*_mac_fp16_ratio	Ratio of cycles taken to execute Cube fp16 instructions to the total cycles.
*_mac_int8_ratio	Ratio of cycles taken to execute Cube int8 instructions to the total cycles.
*_vec_fp32_ratio	Ratio of cycles taken to execute Vector fp32 instructions to the total cycles. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.
*_vec_fp16_ratio	Ratio of cycles taken to execute Vector fp16 instructions to the total cycles.
*_vec_int32_ratio	Ratio of cycles taken to execute Vector int32 instructions to the total cycles. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.
*_vec_misc_ratio	Ratio of cycles taken to execute Vector misc instructions to the total cycles. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.
*_cube_fops	Floating-point operations (FLOPs, that is, fops in this field) of the Cube type, indicating the computation amount. This field can be used to measure the complexity of an algorithm or model.
*_vector_fops	Floating-point operations (FLOPs, that is, fops in this field) of the Vector type, indicating the computation amount. This field can be used to measure the complexity of an algorithm or model.
Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is the execution result on the Cube Core or Vector Core.
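
Because the asterisk prefix stands for aic or aiv on some products, a script that reads these fields has to map a documented name such as *_mac_fp16_ratio to the columns actually present in the file. The following is a minimal sketch. The function name resolve_field is hypothetical, and it assumes that products without the prefix use the bare name (for example mac_fp16_ratio), which should be checked against the actual result file.

```python
def resolve_field(doc_name, header):
    """Return the columns in header that correspond to a documented field name.

    doc_name is a name as written in the tables, for example "*_mac_fp16_ratio".
    header is the list of column names read from op_summary_*.csv.
    """
    suffix = doc_name[2:] if doc_name.startswith("*_") else doc_name
    # Assumption: the column is the bare name or carries an aic_/aiv_ prefix.
    candidates = (suffix, "aic_" + suffix, "aiv_" + suffix)
    return [name for name in header if name in candidates]


if __name__ == "__main__":
    header = ["Op Name", "aic_mac_fp16_ratio", "aiv_vec_fp16_ratio"]
    print(resolve_field("*_mac_fp16_ratio", header))
```

Matching against an explicit candidate list, rather than a general wildcard, avoids picking up unrelated columns that merely end with the same suffix.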

**Table 4** Field description (**Memory**)
Field	Description
*_ub_read_bw(GB/s)	UB read bandwidth, in GB/s. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.
*_ub_write_bw(GB/s)	UB write bandwidth, in GB/s. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.
*_l1_read_bw(GB/s)	L1 read bandwidth, in GB/s.
*_l1_write_bw(GB/s)	L1 write bandwidth, in GB/s.
*_l2_read_bw	L2 read bandwidth, in GB/s.
*_l2_write_bw	L2 write bandwidth, in GB/s. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.
*_main_mem_read_bw(GB/s)	Main memory read bandwidth, in GB/s.
*_main_mem_write_bw(GB/s)	Main memory write bandwidth, in GB/s. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.
Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is the execution result on the Cube Core or Vector Core.

**Table 5** Field description (**MemoryL0**)
Field	Description
*_l0a_read_bw(GB/s)	L0A read bandwidth, in GB/s.
*_l0a_write_bw(GB/s)	L0A write bandwidth, in GB/s.
*_l0b_read_bw(GB/s)	L0B read bandwidth, in GB/s.
*_l0b_write_bw(GB/s)	L0B write bandwidth, in GB/s.
*_l0c_read_bw(GB/s)	Bandwidth for Vector to read data from L0C, in GB/s.
*_l0c_write_bw(GB/s)	Bandwidth for Vector to write data to L0C, in GB/s.
*_l0c_read_bw_cube(GB/s)	Bandwidth for Cube to read data from L0C, in GB/s.
*_l0c_write_bw_cube(GB/s)	Bandwidth for Cube to write data to L0C, in GB/s.
Note: Data about the MemoryL0 metric of the AI Vector Core is 0.
Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is the execution result on the Cube Core or Vector Core.

**Table 6** Field description (**MemoryUB**)
Field	Description
*_ub_read_bw_vector(GB/s)	Bandwidth for Vector to read data from UB, in GB/s.
*_ub_write_bw_vector(GB/s)	Bandwidth for Vector to write data to UB, in GB/s.
*_ub_read_bw_scalar(GB/s)	Bandwidth for Scalar to read data from UB, in GB/s.
*_ub_write_bw_scalar(GB/s)	Bandwidth for Scalar to write data to UB, in GB/s.
Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is the execution result on the Cube Core or Vector Core.

**Table 7** Field description (**ResourceConflictRatio**)
Field	Description
*_vec_bankgroup_cflt_ratio	Ratio of cycles taken to execute vec_bankgroup_stall_cycles instructions to the total cycles. Improper block stride settings in Vector instructions can lead to bank group conflicts. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.
*_vec_bank_cflt_ratio	Ratio of cycles taken to execute vec_bank_stall_cycles instructions to the total cycles. Improper read/write pointer addresses for Vector instruction operands can lead to bank conflicts. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.
*_vec_resc_cflt_ratio	Ratio of cycles taken to execute vec_resc_cflt_ratio instructions to the total cycles. If an operator involves multiple compute units, ensure that they are concurrently scheduled. When a compute unit is working but the operator logic still delivers instructions to the unit, the overall computing power is not fully utilized. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.
Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is the execution result on the Cube Core or Vector Core.

**Table 8** Field description (**MemoryAccess**)
Field	Description
*_read_main_memory_datas(KB)	Amount of data read from the on-chip memory, in KB.
*_write_main_memory_datas(KB)	Amount of data written to the on-chip memory, in KB.
*_GM_to_L1_datas(KB)	Amount of data transferred from GM to L1, in KB.
*_L0C_to_L1_datas(KB)	Amount of data transferred from L0C to L1, in KB.
*_L0C_to_GM_datas(KB)	Amount of data transferred from L0C to GM, in KB.
*_GM_to_UB_datas(KB)	Amount of data transferred from GM to UB, in KB.
*_UB_to_GM_datas(KB)	Amount of data transferred from UB to GM, in KB.
Note: The asterisk (*) prefix of the fields in the preceding table represents aic or aiv, indicating that the data is the execution result on the Cube Core or Vector Core.
Availability: Atlas A2 Training Series Product/Atlas 800I A2 Inference Product, Atlas A3 Training Series Product

**Table 9** Field description (**L2Cache**)
Field	Description
*_write_cache_hit	Write cache hits.
*_write_cache_miss_allocate	Cache re-allocations upon write misses.
*_r*_read_cache_hit	Read cache hits in the r* channel.
*_r*_read_cache_miss_allocate	Cache re-allocations upon read misses in the r* channel.
Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is the execution result on the Cube Core or Vector Core.
Availability: Atlas A2 Training Series Product/Atlas 800I A2 Inference Product, Atlas A3 Training Series Product, Atlas 200/500 A2 Inference Product

**Table 10** Field description (**PipelineExecuteUtilization**)
Field	Description
vec_exe_time(us)	Time taken to execute Vector instructions, in μs.
vec_exe_ratio	Ratio of cycles taken to execute Vector instructions to the total cycles. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.
mac_exe_time(us)	Time taken to execute Cube instructions (fp16 and s16), in μs.
mac_exe_ratio	Ratio of cycles taken to execute Cube instructions (fp16 and s16) to the total cycles.
scalar_exe_time(us)	Time taken to execute Scalar instructions, in μs.
scalar_exe_ratio	Ratio of cycles taken to execute Scalar instructions to the total cycles.
mte1_exe_time(us)	Time taken to execute MTE1 instructions (L1-to-L0A/L0B transfer), in μs.
mte1_exe_ratio	Ratio of cycles taken to execute MTE1 instructions (L1-to-L0A/L0B transfer) to the total cycles.
mte2_exe_time(us)	Time taken to execute MTE2 instructions (DDR-to-AI Core transfer), in μs.
mte2_exe_ratio	Ratio of cycles taken to execute MTE2 instructions (DDR-to-AI Core transfer) to the total cycles.
mte3_exe_time(us)	Time taken to execute MTE3 instructions (AI Core-to-DDR transfer), in μs.
mte3_exe_ratio	Ratio of cycles taken to execute MTE3 instructions (AI Core-to-DDR transfer) to the total cycles.
fixpipe_exe_time(us)	Time taken to execute fixpipe instructions (L0C-to-OUT/L1 transfer), in μs.
fixpipe_exe_ratio	Ratio of cycles taken to execute fixpipe instructions (L0C-to-OUT/L1 transfer) to the total cycles.
memory_bound	AI Core memory bound, calculated as: mte2_ratio/max(mac_ratio, vec_ratio). If the value is less than 1, no memory bound exists. If the value is greater than 1, the AI Core is mostly engaged in memory transfer instead of computation when executing tasks. A greater value indicates a more severe bound.
cube_utilization(%)	Cube operator utilization. Check whether the number of operations of the Cube operator in a unit time reaches the theoretical upper limit. A value closer to 100% indicates a value closer to the theoretical upper limit. Formula: cube_utilization = total_cycles/(freq * core_num * task_duration)
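
The cube_utilization formula in the preceding table can be checked by hand as follows. The table does not state the units of freq, so this sketch assumes that freq is in MHz and task_duration is in μs (so that their product is a cycle count) and that the result is multiplied by 100 to obtain a percentage. The function and parameter names are hypothetical.

```python
def cube_utilization(total_cycles, freq_mhz, core_num, task_duration_us):
    """Compute total_cycles / (freq * core_num * task_duration) as a percentage.

    Assumptions of this sketch: freq_mhz is the AI Core frequency in MHz and
    task_duration_us is Task Duration(us), so freq_mhz * task_duration_us is
    the number of cycles available on one core during the task.
    """
    available_cycles = freq_mhz * core_num * task_duration_us
    if available_cycles <= 0:
        return None
    return 100.0 * total_cycles / available_cycles


if __name__ == "__main__":
    # Hypothetical values: 10 cores at 1000 MHz running for 100 us.
    print(cube_utilization(900_000, freq_mhz=1000, core_num=10, task_duration_us=100))
```

If the AI Core frequency changes during the task, the assumed constant frequency no longer holds and the result is unreliable, as noted for aicore_time(us).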
