op_summary (Operator Details)

The operator summary data for the AI Core, AI Vector Core, and AI CPU does not contain timeline information. It is collected in the op_summary_*.csv file, which records the details and time consumption of each operator.

Availability

Atlas 200/500 A2 Inference Product

Atlas Inference Series Product

Atlas Training Series Product

Atlas A2 Training Series Product/Atlas 800I A2 Inference Product

Atlas A3 Training Series Product

op_summary_*.csv File

The file content is formatted as follows.

Figure 1 op_summary (example only)

The Task Duration field specifies the operator time consumption. You can sort operators by Task Duration to find time-consuming operators, or sort them by Task Type to view the time-consuming operators running on the AI Core or AI CPU.
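The sorting and grouping described above can be sketched with the Python standard library. This is a minimal sketch, not part of the tool: the sample rows are hypothetical, and only the column names Op Name, Task Type, and Task Duration(us) are taken from the field tables in this section.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; a real op_summary_*.csv file has many more columns.
SAMPLE = """Op Name,Task Type,Task Duration(us)
MatMul_1,AI_CORE,120.5
Cast_3,AI_VECTOR_CORE,8.2
TopK_7,AI_CPU,310.0
Add_2,AI_CORE,15.1
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def slowest_ops(rows, n=3):
    """Return the n rows with the largest Task Duration(us)."""
    return sorted(rows, key=lambda r: float(r["Task Duration(us)"]), reverse=True)[:n]


def duration_by_task_type(rows):
    """Sum Task Duration(us) for each Task Type."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["Task Type"]] += float(row["Task Duration(us)"])
    return dict(totals)


rows = load_rows(SAMPLE)
top = [row["Op Name"] for row in slowest_ops(rows, 2)]
totals = duration_by_task_type(rows)
print(top)
print(totals)
```

To run this against a real result file, replace the io.StringIO wrapper with an opened file handle. Rows whose duration is N/A would need to be filtered out before the float conversion.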

  • Supported fields may vary by product. Please refer to the actual result files for the final list of fields.
  • When task_time is set to l0 or off, op_summary_*.csv does not display the PMU data of the AI Core and AI Vector Core.
  • Atlas A2 Training Series Product/Atlas 800I A2 Inference Product: A MatMul operator is converted into a MIX operator when its input matrices a and b meet all of the following conditions: the inner axis is greater than 1000, the theoretical MAC calculation duration is greater than 50 μs, and the inner axis size is not 516-byte aligned. In this case, the number of MatMul operators in op_summary_*.csv decreases, and Task Type changes from AI_CORE to MIX_AIC.
  • Atlas A3 Training Series Product: A MatMul operator is converted into a MIX operator when its input matrices a and b meet all of the following conditions: the inner axis is greater than 1000, the theoretical MAC calculation duration is greater than 50 μs, and the inner axis size is not 516-byte aligned. In this case, the number of MatMul operators in op_summary_*.csv decreases, and Task Type changes from AI_CORE to MIX_AIC.
  • Some operators run for so long that their metrics become inaccurate. These metrics are marked as N/A and are not displayed.
  • An operator whose Task Type is communication usually contains a series of communication tasks. Each communication task has its own task ID and stream ID, which are not displayed here. Therefore, the Task ID and Stream ID of such operators are N/A.
  • If an input is scalar, the corresponding Input Shapes field is empty and formatted as ; ; ; ;. Each dimension is separated by a semicolon (;). This also applies to output shapes.
  • The tool checks operators for overflow/underflow. If an overflow/underflow is detected, the following alarm is displayed, and the computation result of the operator is unreliable.
    Figure 2 Operator overflow/underflow alarm
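Because several of the notes above say that a field can be N/A, any script that post-processes the file needs to tolerate non-numeric values. The following is a minimal sketch of one way to do that. The helper names to_float and mean_ignoring_na are hypothetical and are not part of the profiling tool.

```python
def to_float(value):
    """Return a field as a float, or None when it is empty, N/A, or not numeric."""
    text = (value or "").strip()
    if text == "" or text.upper() == "N/A":
        return None
    try:
        return float(text)
    except ValueError:
        return None


def mean_ignoring_na(values):
    """Average the numeric entries and skip N/A entries; None if nothing is numeric."""
    numbers = [v for v in map(to_float, values) if v is not None]
    return sum(numbers) / len(numbers) if numbers else None


print(to_float("N/A"))
print(mean_ignoring_na(["1.0", "N/A", "3.0"]))
```

Returning None instead of 0 keeps missing metrics from being mistaken for measured zeros in later aggregation.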

The content of the op_summary_*.csv file varies depending on the values of --aic-mode, --aic-metrics, and --task-time. The complete fields are as follows.

Table 1 Description of common fields

Field

Description

Device_id

Device ID.

Model Name

Model name. It may be left empty if no related data is collected. (This field is not displayed by default or in the single-operator scenario.)

Model ID

Model ID.

Task ID

Task ID.

Stream ID

ID of the stream where a task is located.

Infer ID

Inference iteration ID. (This field is not displayed by default or in the single-operator scenario.)

Op Name

Operator name.

OP Type

Operator type. If task_time is set to l0, this field is not collected and is displayed as N/A.

OP State

Whether the operator is dynamic or static. The value dynamic indicates a dynamic operator, and the value static indicates a static operator. Communication operators do not have this state, so N/A is displayed. This field is reported only when --task-time is set to l1. If --task-time is set to l0, N/A is displayed.

Task Type

Type of the accelerator that executes the task, including AI_CORE, AI_VECTOR_CORE, and AI_CPU. If task_time is set to l0, this field is not collected and is displayed as N/A.

Task Start Time(us)

Task start time, in μs.

Task Duration(us)

Task duration (μs), including scheduling time to the accelerator, execution time on the accelerator, and response end time.

Task Wait Time(us)

Interval between tasks, in μs.

Block Dim

Number of running task blocks, which corresponds to the number of cores during task running. If task_time is set to l0, this field is not collected and is displayed as 0.

HF32 Eligible

Whether the HF32 precision flag is used. YES indicates that the flag is used, and NO indicates that it is not used. This field is reported only when --task-time is set to l1. If --task-time is set to l0, this field is displayed as N/A.

Mix Block Dim

Some operators are executed simultaneously on both the AI Core and Vector Core. The Block Dim of the primary accelerator is described in the Block Dim field, and the Block Dim of the secondary accelerator is described in this field. If task_time is set to l0, this field is not collected and is displayed as N/A. (Atlas 200/500 A2 Inference Product) (Atlas A2 Training Series Product/Atlas 800I A2 Inference Product) (Atlas A3 Training Series Product)

Input Shapes

Input shapes. If task_time is set to l0, this field is not collected and is displayed as N/A.

Input Data Types

Input data types. If task_time is set to l0, this field is not collected and is displayed as N/A.

Input Formats

Input formats. If task_time is set to l0, this field is not collected and is displayed as N/A.

Output Shapes

Output shapes. If task_time is set to l0, this field is not collected and is displayed as N/A.

Output Data Types

Output data types. If task_time is set to l0, this field is not collected and is displayed as N/A.

Output Formats

Output data formats. If task_time is set to l0, this field is not collected and is displayed as N/A.

Context ID

Context ID, which identifies a small operator of a subtask. If no small operator exists, N/A is displayed. (Atlas 200/500 A2 Inference Product) (Atlas A2 Training Series Product/Atlas 800I A2 Inference Product) (Atlas A3 Training Series Product)

aiv_time(us)

Theoretical execution time of a task on the AI Vector Core when all blocks are scheduled simultaneously and each block has an equal execution duration. The unit is μs. Typically, the scheduling start time of each block is slightly different, so the value of this field is slightly less than the actual execution time of the task on the AI Vector Core. (Atlas A2 Training Series Product/Atlas 800I A2 Inference Product) (Atlas A3 Training Series Product)

aicore_time(us)

Theoretical execution time of a task on the AI Core when all blocks are scheduled simultaneously and each block has an equal execution duration. The unit is μs. Typically, the scheduling start time of each block is slightly different, so the value of this field is slightly less than the actual execution time of the task on the AI Core.

If the AI Core frequency changes (for example, through manual frequency adjustment, through dynamic frequency adjustment when the power consumption exceeds the threshold, or when the Atlas 300V/Atlas 300I Pro is involved), the data is inaccurate and should not be used as a reference.

Atlas 200/500 A2 Inference Product: For details about the frequency change, see AI Core Frequency Viewing.

Atlas A2 Training Series Product/Atlas 800I A2 Inference Product: For details about the frequency change, see AI Core Frequency Viewing.

Atlas A3 Training Series Product: For details about the frequency change, see AI Core Frequency Viewing.

total_cycles

Total number of execution cycles of a task on the AI Core, which is the sum of the execution cycles of all blocks.

For the Atlas 200/500 A2 Inference Product, this field is split into aic_total_cycles (total number of cycles executed by the task on the AI Cube Core) and aiv_total_cycles (total number of cycles executed by the task on the AI Vector Core).

For the Atlas A2 Training Series Product/Atlas 800I A2 Inference Product, this field is split into aic_total_cycles (total number of cycles executed by the task on the AI Cube Core) and aiv_total_cycles (total number of cycles executed by the task on the AI Vector Core).

For the Atlas A3 Training Series Product, this field is split into aic_total_cycles (total number of cycles executed by the task on the AI Cube Core) and aiv_total_cycles (total number of cycles executed by the task on the AI Vector Core).

Register value

Value of the custom register whose data is to be collected. It is configured by --aic-metrics.
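The descriptions of aicore_time(us), total_cycles, and Block Dim in Table 1 suggest a simple relationship: if all blocks start together and run equally long, the theoretical time is the per-block cycle count divided by the clock frequency. The sketch below illustrates that reading only. The exact formula is not given in this section, and the freq_mhz argument is an assumed input that is not a column of the CSV file.

```python
def theoretical_core_time_us(total_cycles, block_dim, freq_mhz):
    """Estimate the theoretical execution time in microseconds.

    Assumption: total_cycles is the sum over all blocks, every block runs for
    the same number of cycles, and freq_mhz is the core clock in MHz
    (that is, cycles per microsecond).
    """
    if block_dim <= 0 or freq_mhz <= 0:
        raise ValueError("block_dim and freq_mhz must be positive")
    cycles_per_block = total_cycles / block_dim
    return cycles_per_block / freq_mhz


# Hypothetical values: 1,800,000 cycles spread over 20 blocks at 1800 MHz.
estimate = theoretical_core_time_us(1_800_000, 20, 1800)
print(estimate)
```

As the aicore_time(us) description notes, the real execution time is usually slightly longer because blocks do not start at exactly the same moment, and the estimate is unreliable if the frequency changes during the run.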

The following fields are generated when --task-time is set to l1 and --aic-mode is set to task-based. If --task-time is set to l0, these fields are not collected and N/A is displayed. The fields that are generated depend on the --aic-metrics parameter.

Table 2 Field description (PipeUtilization)

Field

Description

*_vec_time(us)

Time taken to execute Vector instructions, in μs. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.

*_vec_ratio

Ratio of cycles taken to execute Vector instructions to the total cycles. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.

*_mac_time(us)

Time taken to execute Cube instructions, in μs.

*_mac_ratio

Ratio of cycles taken to execute Cube instructions to the total cycles.

*_scalar_time(us)

Time taken to execute Scalar instructions, in μs.

*_scalar_ratio

Ratio of cycles taken to execute Scalar instructions to the total cycles.

aic_fixpipe_time(us)

Time taken to execute fixpipe instructions (L0C-to-OUT/L1 transfer), in μs.

aic_fixpipe_ratio

Ratio of cycles taken to execute fixpipe instructions (L0C-to-OUT/L1 transfer) to the total cycles.

*_mte1_time(us)

Time taken to execute MTE1 instructions (L1-to-L0A/L0B transfer), in μs.

*_mte1_ratio

Ratio of cycles taken to execute MTE1 instructions (L1-to-L0A/L0B transfer) to the total cycles.

*_mte2_time(us)

Time taken to execute MTE2 instructions (DDR-to-AI Core transfer), in μs.

*_mte2_ratio

Ratio of cycles taken to execute MTE2 instructions (DDR-to-AI Core transfer) to the total cycles.

*_mte3_time(us)

Time taken to execute MTE3 instructions (AI Core-to-DDR transfer), in μs.

*_mte3_ratio

Ratio of cycles taken to execute MTE3 instructions (AI Core-to-DDR transfer) to the total cycles.

*_icache_miss_rate

Instruction cache (iCache) miss rate. iCache is the L2 cache reserved for instructions. A high value indicates that the AI Core reads instructions inefficiently.

memory_bound

AI Core memory bound, calculated as: mte2_ratio/max(mac_ratio, vec_ratio). If the value is less than 1, no memory bound exists. If the value is greater than 1, the AI Core is mostly engaged in memory transfer instead of computation when executing tasks. A greater value indicates a more severe bound.

cube_utilization(%)

Cube operator utilization, which indicates whether the number of Cube operations per unit time reaches the theoretical upper limit. The closer the value is to 100%, the closer the operator is to that limit. Formula: cube_utilization = total_cycles/(freq * core_num * task_duration)

Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is execution results on the Cube Core or Vector Core.
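The memory_bound and cube_utilization(%) formulas in the preceding table can be written out as a short sketch. The units are assumptions that the table does not state: freq is taken in MHz (cycles per μs) and task_duration in μs, so the quotient is a fraction that is scaled to a percentage. The handling of a zero denominator is also a choice made for this sketch.

```python
def memory_bound(mte2_ratio, mac_ratio, vec_ratio):
    """memory_bound = mte2_ratio / max(mac_ratio, vec_ratio).

    A result above 1 means the task spends more cycles on MTE2 transfers
    than on its busiest compute pipeline.
    """
    compute = max(mac_ratio, vec_ratio)
    if compute == 0:
        # Sketch-only convention: no compute cycles at all.
        return float("inf") if mte2_ratio > 0 else 0.0
    return mte2_ratio / compute


def cube_utilization_percent(total_cycles, freq_mhz, core_num, task_duration_us):
    """cube_utilization = total_cycles / (freq * core_num * task_duration), in percent.

    Assumption: freq_mhz is in MHz and task_duration_us is in microseconds.
    """
    available_cycles = freq_mhz * core_num * task_duration_us
    if available_cycles <= 0:
        raise ValueError("freq, core_num, and task_duration must be positive")
    return 100.0 * total_cycles / available_cycles


# Hypothetical values for illustration.
bound = memory_bound(mte2_ratio=0.6, mac_ratio=0.3, vec_ratio=0.1)
utilization = cube_utilization_percent(900_000, 1800, 20, 50)
print(bound, utilization)
```

With these sample values, the transfer ratio is twice the largest compute ratio, and the task uses half of the cycles available on 20 cores during its 50 μs duration.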

Table 3 Field description (ArithmeticUtilization)

Field

Description

*_mac_fp16_ratio

Ratio of cycles taken to execute Cube fp16 instructions to the total cycles.

*_mac_int8_ratio

Ratio of cycles taken to execute Cube int8 instructions to the total cycles.

*_vec_fp32_ratio

Ratio of cycles taken to execute Vector fp32 instructions to the total cycles. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.

*_vec_fp16_ratio

Ratio of cycles taken to execute Vector fp16 instructions to the total cycles.

*_vec_int32_ratio

Ratio of cycles taken to execute Vector int32 instructions to the total cycles. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.

*_vec_misc_ratio

Ratio of cycles taken to execute Vector misc instructions to the total cycles. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.

*_cube_fops

Floating-point operations (FLOPs, that is, fops in this field) of the Cube type, indicating the computation amount. This field can be used to measure the complexity of an algorithm or model.

*_vector_fops

Floating-point operations (FLOPs, that is, fops in this field) of the Vector type, indicating the computation amount. This field can be used to measure the complexity of an algorithm or model.

Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is execution results on the Cube Core or Vector Core.
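When reading a result file in a script, the asterisk convention from the note above can be resolved against the actual header row. The sketch below uses the standard fnmatch module. The helper name resolve_field is hypothetical, and the fallback to an unprefixed column name is an assumption for products whose fields carry no aic or aiv prefix.

```python
from fnmatch import fnmatchcase


def resolve_field(header, pattern):
    """Return the header columns that match a documented name such as '*_mac_ratio'."""
    matches = [name for name in header if fnmatchcase(name, pattern)]
    if not matches and pattern.startswith("*_"):
        # Assumed fallback: the same field without an aic/aiv prefix.
        bare = pattern[2:]
        matches = [name for name in header if name == bare]
    return matches


# Hypothetical header rows.
prefixed = ["Op Name", "aic_mac_ratio", "aiv_vec_ratio", "aic_mte2_ratio"]
unprefixed = ["Op Name", "mac_ratio", "vec_ratio"]

print(resolve_field(prefixed, "*_mac_ratio"))
print(resolve_field(unprefixed, "*_mac_ratio"))
```

Returning a list rather than a single name keeps both the aic and aiv variants when a file contains both.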

Table 4 Field description (Memory)

Field

Description

*_ub_read_bw(GB/s)

UB read bandwidth, in GB/s. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.

*_ub_write_bw(GB/s)

UB write bandwidth, in GB/s. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.

*_l1_read_bw(GB/s)

L1 read bandwidth, in GB/s.

*_l1_write_bw(GB/s)

L1 write bandwidth, in GB/s.

*_l2_read_bw

L2 read bandwidth, in GB/s.

*_l2_write_bw

L2 write bandwidth, in GB/s. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.

*_main_mem_read_bw(GB/s)

Main memory read bandwidth, in GB/s.

*_main_mem_write_bw(GB/s)

Main memory write bandwidth, in GB/s. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.

Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is execution results on the Cube Core or Vector Core.

Table 5 Field description (MemoryL0)

Field

Description

*_l0a_read_bw(GB/s)

L0A read bandwidth, in GB/s.

*_l0a_write_bw(GB/s)

L0A write bandwidth, in GB/s.

*_l0b_read_bw(GB/s)

L0B read bandwidth, in GB/s.

*_l0b_write_bw(GB/s)

L0B write bandwidth, in GB/s.

*_l0c_read_bw(GB/s)

Bandwidth for Vector to read data from L0C, in GB/s.

*_l0c_write_bw(GB/s)

Bandwidth for Vector to write data to L0C, in GB/s.

*_l0c_read_bw_cube(GB/s)

Bandwidth for Cube to read data from L0C, in GB/s.

*_l0c_write_bw_cube(GB/s)

Bandwidth for Cube to write data to L0C, in GB/s.

Note: For the AI Vector Core, all MemoryL0 metrics are 0.

Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is execution results on the Cube Core or Vector Core.

Table 6 Field description (MemoryUB)

Field

Description

*_ub_read_bw_vector(GB/s)

Bandwidth for Vector to read data from UB, in GB/s.

*_ub_write_bw_vector(GB/s)

Bandwidth for Vector to write data to UB, in GB/s.

*_ub_read_bw_scalar(GB/s)

Bandwidth for Scalar to read data from UB, in GB/s.

*_ub_write_bw_scalar(GB/s)

Bandwidth for Scalar to write data to UB, in GB/s.

Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is execution results on the Cube Core or Vector Core.

Table 7 Field description (ResourceConflictRatio)

Field

Description

*_vec_bankgroup_cflt_ratio

Ratio of cycles taken to execute vec_bankgroup_stall_cycles instructions to the total cycles. Improper block stride settings in Vector instructions can lead to bank group conflicts. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.

*_vec_bank_cflt_ratio

Ratio of cycles taken to execute vec_bank_stall_cycles instructions to the total cycles. Improper read/write pointer addresses for Vector instruction operands can lead to bank conflicts. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.

*_vec_resc_cflt_ratio

Ratio of cycles taken to execute vec_resc_cflt_ratio instructions to the total cycles. If an operator involves multiple compute units, ensure that they are concurrently scheduled. When a compute unit is working but the operator logic still delivers instructions to the unit, the overall computing power is not fully utilized. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.

Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is execution results on the Cube Core or Vector Core.

Table 8 Field description (MemoryAccess)

Field

Description

*_read_main_memory_datas(KB)

Amount of data read from the on-chip memory, in KB.

*_write_main_memory_datas(KB)

Amount of data written to the on-chip memory, in KB.

*_GM_to_L1_datas(KB)

Amount of data transferred from GM to L1, in KB.

*_L0C_to_L1_datas(KB)

Amount of data transferred from L0C to L1, in KB.

*_L0C_to_GM_datas(KB)

Amount of data transferred from L0C to GM, in KB.

*_GM_to_UB_datas(KB)

Amount of data transferred from GM to UB, in KB.

*_UB_to_GM_datas(KB)

Amount of data transferred from UB to GM, in KB.

Note: The asterisk (*) prefix of the fields in the preceding table represents aic or aiv, indicating that the data is execution results on the Cube Core or Vector Core.

Availability:

Atlas A2 Training Series Product/Atlas 800I A2 Inference Product

Atlas A3 Training Series Product

Table 9 Field description (L2Cache)

Field

Description

*_write_cache_hit

Write cache hits.

*_write_cache_miss_allocate

Cache re-allocations upon write misses.

*_r*_read_cache_hit

Read cache hits in the r* channel.

*_r*_read_cache_miss_allocate

Cache re-allocations upon read misses in the r* channel.

Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is execution results on the Cube Core or Vector Core.

Availability:

Atlas A2 Training Series Product/Atlas 800I A2 Inference Product

Atlas A3 Training Series Product

Atlas 200/500 A2 Inference Product
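The L2Cache counters in Table 9 are raw counts. One common way to summarize such counters is a hit rate, sketched below. This derived metric is not defined in this section: the sketch assumes that every access is counted either as a hit or as a miss allocation, which may not hold on all products.

```python
def hit_rate(hits, miss_allocates):
    """Return hits / (hits + miss_allocates), or None when there were no accesses.

    Assumption: hit and miss-allocate counts together cover all accesses.
    """
    total = hits + miss_allocates
    return hits / total if total else None


# Hypothetical values standing in for *_write_cache_hit and *_write_cache_miss_allocate.
write_hit_rate = hit_rate(75, 25)
print(write_hit_rate)
```

The same helper can be applied to the per-channel read counters, one r* channel at a time.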

Table 10 Field description (PipelineExecuteUtilization)

Field

Description

vec_exe_time(us)

Time taken to execute Vector instructions, in μs.

vec_exe_ratio

Ratio of cycles taken to execute Vector instructions to the total cycles. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A.

mac_exe_time(us)

Time taken to execute Cube instructions (fp16 and s16), in μs.

mac_exe_ratio

Ratio of cycles taken to execute Cube instructions (fp16 and s16) to the total cycles.

scalar_exe_time(us)

Time taken to execute Scalar instructions, in μs.

scalar_exe_ratio

Ratio of cycles taken to execute Scalar instructions to the total cycles.

mte1_exe_time(us)

Time taken to execute MTE1 instructions (L1-to-L0A/L0B transfer), in μs.

mte1_exe_ratio

Ratio of cycles taken to execute MTE1 instructions (L1-to-L0A/L0B transfer) to the total cycles.

mte2_exe_time(us)

Time taken to execute MTE2 instructions (DDR-to-AI Core transfer), in μs.

mte2_exe_ratio

Ratio of cycles taken to execute MTE2 instructions (DDR-to-AI Core transfer) to the total cycles.

mte3_exe_time(us)

Time taken to execute MTE3 instructions (AI Core-to-DDR transfer), in μs.

mte3_exe_ratio

Ratio of cycles taken to execute MTE3 instructions (AI Core-to-DDR transfer) to the total cycles.

fixpipe_exe_time(us)

Time taken to execute fixpipe instructions (L0C-to-OUT/L1 transfer), in μs.

fixpipe_exe_ratio

Ratio of cycles taken to execute fixpipe instructions (L0C-to-OUT/L1 transfer) to the total cycles.

memory_bound

AI Core memory bound, calculated as: mte2_ratio/max(mac_ratio, vec_ratio). If the value is less than 1, no memory bound exists. If the value is greater than 1, the AI Core is mostly engaged in memory transfer instead of computation when executing tasks. A greater value indicates a more severe bound.

cube_utilization(%)

Cube operator utilization, which indicates whether the number of Cube operations per unit time reaches the theoretical upper limit. The closer the value is to 100%, the closer the operator is to that limit. Formula: cube_utilization = total_cycles/(freq * core_num * task_duration)