op_summary (Operator Details)
The operator summary data for the AI Core, AI Vector Core, and AI CPU does not include timeline information. It is written to the op_summary_*.csv file, which records the details and time consumption of each operator.
Availability
Atlas 200/500 A2 Inference Product
Atlas Inference Series Product
Atlas Training Series Product
Atlas A2 Training Series Product/Atlas 800I A2 Inference Product
Atlas A3 Training Series Product
op_summary_*.csv File
The file content is formatted as follows.
The Task Duration field specifies the operator time consumption. You can sort operators by Task Duration to find time-consuming operators, or sort them by Task Type to view the time-consuming operators running on the AI Core or AI CPU.
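The sorting described above can be sketched with the Python standard library. This is a minimal illustration, not part of the profiling tool: the sample rows are made up, only a few columns are shown, and the helper names (load_rows, duration, top_by_duration, top_by_task_type) are hypothetical.

```python
import csv
import io

# Hypothetical excerpt of an op_summary_*.csv file; real files contain more columns.
SAMPLE = """Op Name,OP Type,Task Type,Task Duration(us)
conv1,Conv2D,AI_CORE,120.5
cast0,Cast,AI_CPU,300.0
add0,Add,AI_VECTOR_CORE,15.25
matmul0,MatMul,AI_CORE,480.0
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def duration(row):
    """Return Task Duration as a float; N/A or empty values count as 0."""
    value = row.get("Task Duration(us)", "").strip()
    try:
        return float(value)
    except ValueError:
        return 0.0


def top_by_duration(rows, n=3):
    """Return the n rows with the longest Task Duration, longest first."""
    return sorted(rows, key=duration, reverse=True)[:n]


def top_by_task_type(rows, n=3):
    """Group rows by Task Type and keep the n longest tasks in each group."""
    groups = {}
    for row in rows:
        groups.setdefault(row["Task Type"], []).append(row)
    return {task_type: top_by_duration(group, n) for task_type, group in groups.items()}


rows = load_rows(SAMPLE)
for row in top_by_duration(rows, n=2):
    print(row["Op Name"], row["Task Type"], duration(row))
```

To analyze a real result file, replace SAMPLE with the text read from that file. Column names may differ by product, so check the header row first.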
- Supported fields may vary by product. Please refer to the actual result files for the final list of fields.
- When task_time is set to l0 or off, op_summary_*.csv does not display the PMU data of the AI Core and AI Vector Core.
- Atlas A2 Training Series Product/Atlas 800I A2 Inference Product and Atlas A3 Training Series Product: A MatMul operator is converted into a MIX operator when its input matrices a and b meet all of the following conditions: the inner axis is greater than 1000, the theoretical MAC calculation duration is greater than 50 μs, and the inner axis size is not 516-byte aligned. In this case, the number of MatMul operators in op_summary.csv decreases, and Task Type changes from the original AI_Core to MIX_AIC.
- If an operator runs for too long, some of its metrics become inaccurate. These metrics are marked as N/A and are not displayed.
- Operators whose Task Type is communication usually contain a series of communication tasks. Each communication task has an independent task ID and stream ID, which are not displayed here. Therefore, the task IDs and stream IDs of this type of operators are N/A.
- If an input is scalar, the corresponding Input Shapes field is empty and formatted as ; ; ; ;. Each dimension is separated by a semicolon (;). This also applies to output shapes.
- The tool checks operators for overflow/underflow. If an overflow/underflow is detected, the following alarm is displayed. In this case, the operator computation result is unreliable.
Figure 2 Operator overflow/underflow alarm
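When post-processing the file, the N/A values and empty scalar entries described in the notes above need explicit handling. The following sketch parses a shapes field under one assumed layout: each semicolon-separated entry is the shape of one input or output, and the sizes inside one entry are separated by commas. That layout and the helper name parse_shapes are assumptions for illustration; check them against your actual result files.

```python
def parse_shapes(field):
    """Split a shapes field into one tuple of ints per tensor.

    Returns None when the field is N/A (for example, task_time set to l0).
    An empty entry, as produced for a scalar, becomes an empty tuple.
    Assumes ';' separates tensors and ',' separates sizes (not confirmed by the tool).
    """
    text = field.strip().strip('"')
    if text == "N/A":
        return None
    shapes = []
    for entry in text.split(";"):
        entry = entry.strip()
        shapes.append(tuple(int(size) for size in entry.split(",")) if entry else ())
    return shapes


print(parse_shapes('"16,32;32"'))
print(parse_shapes("; ; ; ;"))
print(parse_shapes("N/A"))
```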
The content of the op_summary_*.csv file varies depending on the values of --aic-mode, --aic-metrics, and --task-time. The complete fields are as follows.
| Field | Description |
|---|---|
| Device_id | Device ID. |
| Model Name | Model name. It may be left empty if no related data is collected. (This field is not displayed by default or in the single-operator scenario.) |
| Model ID | Model ID. |
| Task ID | Task ID. |
| Stream ID | ID of the stream where a task is located. |
| Infer ID | Inference iteration ID. (This field is not displayed by default or in the single-operator scenario.) |
| Op Name | Operator name. |
| OP Type | Operator type. If task_time is set to l0, this field is not collected and is displayed as N/A. |
| OP State | Whether an operator is dynamic or static. The value dynamic indicates a dynamic operator, and the value static indicates a static operator. Communication operators do not have this state, so N/A is displayed. This field is reported only when --task-time is set to l1. If --task-time is set to l0, N/A is displayed. |
| Task Type | Type of the accelerator that executes the task, including AI_CORE, AI_VECTOR_CORE, and AI_CPU. If task_time is set to l0, this field is not collected and is displayed as N/A. |
| Task Start Time(us) | Task start time, in μs. |
| Task Duration(us) | Task duration, in μs, including the time for scheduling the task to the accelerator, the execution time on the accelerator, and the response end time. |
| Task Wait Time(us) | Interval between tasks, in μs. |
| Block Dim | Number of running task blocks, which corresponds to the number of cores during task running. If task_time is set to l0, this field is not collected and is displayed as 0. |
| HF32 Eligible | Whether the HF32 precision flag is used. YES indicates that the flag is used, and NO indicates that it is not used. This field is reported only when --task-time is set to l1. If --task-time is set to l0, this field is displayed as N/A. |
| Mix Block Dim | Some operators are executed simultaneously on both the AI Core and Vector Core. The Block Dim of the primary accelerator is described in the Block Dim field, and the Block Dim of the secondary accelerator is described in this field. If task_time is set to l0, this field is not collected and is displayed as N/A. (Atlas 200/500 A2 Inference Product) (Atlas A2 Training Series Product/Atlas 800I A2 Inference Product) (Atlas A3 Training Series Product) |
| Input Shapes | Input shapes. If task_time is set to l0, this field is not collected and is displayed as N/A. |
| Input Data Types | Input data types. If task_time is set to l0, this field is not collected and is displayed as N/A. |
| Input Formats | Input formats. If task_time is set to l0, this field is not collected and is displayed as N/A. |
| Output Shapes | Output shapes. If task_time is set to l0, this field is not collected and is displayed as N/A. |
| Output Data Types | Output data types. If task_time is set to l0, this field is not collected and is displayed as N/A. |
| Output Formats | Output formats. If task_time is set to l0, this field is not collected and is displayed as N/A. |
| Context ID | Context ID, which identifies a small operator of a subtask. If no small operator exists, N/A is displayed. (Atlas 200/500 A2 Inference Product) (Atlas A2 Training Series Product/Atlas 800I A2 Inference Product) (Atlas A3 Training Series Product) |
| aiv_time(us) | Theoretical execution time of a task on the AI Vector Core when all blocks are scheduled simultaneously and each block has an equal execution duration, in μs. Because the scheduling start time of each block typically differs slightly, the value of this field is slightly less than the actual execution time of the task on the AI Vector Core. (Atlas A2 Training Series Product/Atlas 800I A2 Inference Product) (Atlas A3 Training Series Product) |
| aicore_time(us) | Theoretical execution time of a task on the AI Core when all blocks are scheduled simultaneously and each block has an equal execution duration, in μs. Because the scheduling start time of each block typically differs slightly, the value of this field is slightly less than the actual execution time of the task on the AI Core. If the AI Core frequency changes (for example, through manual frequency adjustment, through dynamic frequency adjustment when the power consumption exceeds the threshold, or when the Atlas 300V/Atlas 300I Pro is involved), the data is inaccurate and should not be used for reference. For details about the frequency change on the Atlas 200/500 A2 Inference Product, Atlas A2 Training Series Product/Atlas 800I A2 Inference Product, and Atlas A3 Training Series Product, see AI Core Frequency Viewing. |
| total_cycles | Total number of execution cycles of a task on the AI Core, which is the sum of the execution cycles of all blocks. On the Atlas 200/500 A2 Inference Product, Atlas A2 Training Series Product/Atlas 800I A2 Inference Product, and Atlas A3 Training Series Product, this field is split into aic_total_cycles (total number of cycles executed by the task on the AI Cube Core) and aiv_total_cycles (total number of cycles executed by the task on the AI Vector Core). |
| Register value | Value of the custom register whose data is to be collected. It is configured by --aic-metrics. |
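The relationship between total_cycles, Block Dim, and the theoretical aicore_time/aiv_time values can be illustrated as follows. This is one reading of the field descriptions (all blocks start together and run equally long), not the tool's documented formula. The function name and the freq_mhz input are assumptions, and the clock frequency is not a column of the file.

```python
def theoretical_core_time_us(total_cycles, block_dim, freq_mhz):
    """Estimate the ideal task time when all blocks run fully in parallel.

    total_cycles: sum of execution cycles over all blocks.
    block_dim:    number of blocks (cores) the task runs on.
    freq_mhz:     core clock in MHz, that is, cycles per microsecond (assumed input).
    """
    if block_dim <= 0 or freq_mhz <= 0:
        raise ValueError("block_dim and freq_mhz must be positive")
    # With equal work per block, each block executes total_cycles / block_dim cycles.
    cycles_per_block = total_cycles / block_dim
    # Cycles divided by cycles-per-microsecond gives microseconds.
    return cycles_per_block / freq_mhz


# Made-up numbers: 3.6 million cycles spread over 20 blocks at 1800 MHz.
print(theoretical_core_time_us(3_600_000, 20, 1800))
```

Under these assumptions the estimate is a lower bound on Task Duration, which also includes scheduling and response time.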
The following fields are generated when --task-time is set to l1 and --aic-mode is set to task-based. If --task-time is set to l0, these fields are not profiled and N/A is displayed. The generated data is controlled by the aic_metrics parameter.
| Field | Description |
|---|---|
| *_vec_time(us) | Time taken to execute Vector instructions, in μs. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A. |
| *_vec_ratio | Ratio of cycles taken to execute Vector instructions to the total cycles. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A. |
| *_mac_time(us) | Time taken to execute Cube instructions, in μs. |
| *_mac_ratio | Ratio of cycles taken to execute Cube instructions to the total cycles. |
| *_scalar_time(us) | Time taken to execute Scalar instructions, in μs. |
| *_scalar_ratio | Ratio of cycles taken to execute Scalar instructions to the total cycles. |
| aic_fixpipe_time(us) | Time taken to execute fixpipe instructions (L0C-to-OUT/L1 transfer), in μs. |
| aic_fixpipe_ratio | Ratio of cycles taken to execute fixpipe instructions (L0C-to-OUT/L1 transfer) to the total cycles. |
| *_mte1_time(us) | Time taken to execute MTE1 instructions (L1-to-L0A/L0B transfer), in μs. |
| *_mte1_ratio | Ratio of cycles taken to execute MTE1 instructions (L1-to-L0A/L0B transfer) to the total cycles. |
| *_mte2_time(us) | Time taken to execute MTE2 instructions (DDR-to-AI Core transfer), in μs. |
| *_mte2_ratio | Ratio of cycles taken to execute MTE2 instructions (DDR-to-AI Core transfer) to the total cycles. |
| *_mte3_time(us) | Time taken to execute MTE3 instructions (AI Core-to-DDR transfer), in μs. |
| *_mte3_ratio | Ratio of cycles taken to execute MTE3 instructions (AI Core-to-DDR transfer) to the total cycles. |
| *_icache_miss_rate | iCache is the L2 cache reserved for instructions. If the value of icache_miss_rate is high, the AI Core reads instructions at a low efficiency. |
| memory_bound | AI Core memory bound, calculated as: mte2_ratio/max(mac_ratio, vec_ratio). If the value is less than 1, no memory bound exists. If the value is greater than 1, the AI Core is mostly engaged in memory transfer instead of computation when executing tasks. A greater value indicates a more severe bound. |
| cube_utilization(%) | Cube operator utilization, which indicates whether the number of Cube operations per unit time reaches the theoretical upper limit. The closer the value is to 100%, the closer the operator is to that limit. Formula: cube_utilization = total_cycles/(freq * core_num * task_duration) |

Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is execution results on the Cube Core or Vector Core.
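The memory_bound and cube_utilization formulas from the preceding table can be worked through with a short sketch. The function names, the handling of a zero denominator, the choice of MHz and μs as units, and the multiplication by 100 to obtain a percentage are assumptions made for illustration. The input numbers are made up.

```python
def memory_bound(mte2_ratio, mac_ratio, vec_ratio):
    """Compute mte2_ratio / max(mac_ratio, vec_ratio) as given in the table."""
    compute_ratio = max(mac_ratio, vec_ratio)
    if compute_ratio == 0:
        # Assumption: with no compute activity, any transfer time counts as fully bound.
        return float("inf") if mte2_ratio > 0 else 0.0
    return mte2_ratio / compute_ratio


def cube_utilization_percent(total_cycles, freq_mhz, core_num, task_duration_us):
    """Compute total_cycles / (freq * core_num * task_duration) as a percentage.

    freq_mhz * task_duration_us is the number of cycles one core can execute
    during the task (assumed units: MHz and microseconds).
    """
    available_cycles = freq_mhz * core_num * task_duration_us
    return 100.0 * total_cycles / available_cycles


# A value greater than 1 means the task spends more time moving data than computing.
print(memory_bound(mte2_ratio=0.5, mac_ratio=0.25, vec_ratio=0.1))

# 1.8 million cycles used out of 3.6 million available on 20 cores in 100 us.
print(cube_utilization_percent(1_800_000, freq_mhz=1800, core_num=20, task_duration_us=100))
```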
| Field | Description |
|---|---|
| *_mac_fp16_ratio | Ratio of cycles taken to execute Cube fp16 instructions to the total cycles. |
| *_mac_int8_ratio | Ratio of cycles taken to execute Cube int8 instructions to the total cycles. |
| *_vec_fp32_ratio | Ratio of cycles taken to execute Vector fp32 instructions to the total cycles. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A. |
| *_vec_fp16_ratio | Ratio of cycles taken to execute Vector fp16 instructions to the total cycles. |
| *_vec_int32_ratio | Ratio of cycles taken to execute Vector int32 instructions to the total cycles. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A. |
| *_vec_misc_ratio | Ratio of cycles taken to execute Vector misc instructions to the total cycles. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A. |
| *_cube_fops | Floating-point operations (FLOPs, that is, fops in this field) of the Cube type, indicating the computation amount. This field can be used to measure the complexity of an algorithm or model. |
| *_vector_fops | Floating-point operations (FLOPs, that is, fops in this field) of the Vector type, indicating the computation amount. This field can be used to measure the complexity of an algorithm or model. |

Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is execution results on the Cube Core or Vector Core.
| Field | Description |
|---|---|
| *_ub_read_bw(GB/s) | UB read bandwidth, in GB/s. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A. |
| *_ub_write_bw(GB/s) | UB write bandwidth, in GB/s. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A. |
| *_l1_read_bw(GB/s) | L1 read bandwidth, in GB/s. |
| *_l1_write_bw(GB/s) | L1 write bandwidth, in GB/s. |
| *_l2_read_bw | L2 read bandwidth, in GB/s. |
| *_l2_write_bw | L2 write bandwidth, in GB/s. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A. |
| *_main_mem_read_bw(GB/s) | Main memory read bandwidth, in GB/s. |
| *_main_mem_write_bw(GB/s) | Main memory write bandwidth, in GB/s. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A. |

Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is execution results on the Cube Core or Vector Core.
| Field | Description |
|---|---|
| *_l0a_read_bw(GB/s) | L0A read bandwidth, in GB/s. |
| *_l0a_write_bw(GB/s) | L0A write bandwidth, in GB/s. |
| *_l0b_read_bw(GB/s) | L0B read bandwidth, in GB/s. |
| *_l0b_write_bw(GB/s) | L0B write bandwidth, in GB/s. |
| *_l0c_read_bw(GB/s) | Bandwidth for Vector to read data from L0C, in GB/s. |
| *_l0c_write_bw(GB/s) | Bandwidth for Vector to write data to L0C, in GB/s. |
| *_l0c_read_bw_cube(GB/s) | Bandwidth for Cube to read data from L0C, in GB/s. |
| *_l0c_write_bw_cube(GB/s) | Bandwidth for Cube to write data to L0C, in GB/s. |

Note: Data about the MemoryL0 metric of the AI Vector Core is 0.

Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is execution results on the Cube Core or Vector Core.
| Field | Description |
|---|---|
| *_ub_read_bw_vector(GB/s) | Bandwidth for Vector to read data from UB, in GB/s. |
| *_ub_write_bw_vector(GB/s) | Bandwidth for Vector to write data to UB, in GB/s. |
| *_ub_read_bw_scalar(GB/s) | Bandwidth for Scalar to read data from UB, in GB/s. |
| *_ub_write_bw_scalar(GB/s) | Bandwidth for Scalar to write data to UB, in GB/s. |

Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is execution results on the Cube Core or Vector Core.
| Field | Description |
|---|---|
| *_vec_bankgroup_cflt_ratio | Ratio of vec_bankgroup_stall_cycles to the total cycles. Improper block stride settings in Vector instructions can lead to bank group conflicts. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A. |
| *_vec_bank_cflt_ratio | Ratio of vec_bank_stall_cycles to the total cycles. Improper read/write pointer addresses for Vector instruction operands can lead to bank conflicts. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A. |
| *_vec_resc_cflt_ratio | Ratio of cycles lost to Vector resource conflicts to the total cycles. If an operator involves multiple compute units, ensure that they are scheduled concurrently. When the operator logic keeps delivering instructions to a compute unit that is already busy, the overall computing power is not fully utilized. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A. |

Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is execution results on the Cube Core or Vector Core.
| Field | Description |
|---|---|
| *_read_main_memory_datas(KB) | Amount of data read from the on-chip memory, in KB. |
| *_write_main_memory_datas(KB) | Amount of data written to the on-chip memory, in KB. |
| *_GM_to_L1_datas(KB) | Amount of data transferred from GM to L1, in KB. |
| *_L0C_to_L1_datas(KB) | Amount of data transferred from L0C to L1, in KB. |
| *_L0C_to_GM_datas(KB) | Amount of data transferred from L0C to GM, in KB. |
| *_GM_to_UB_datas(KB) | Amount of data transferred from GM to UB, in KB. |
| *_UB_to_GM_datas(KB) | Amount of data transferred from UB to GM, in KB. |

Note: The asterisk (*) prefix of the fields in the preceding table represents aic or aiv, indicating that the data is execution results on the Cube Core or Vector Core.

Availability:
- Atlas A2 Training Series Product/Atlas 800I A2 Inference Product
- Atlas A3 Training Series Product
| Field | Description |
|---|---|
| *_write_cache_hit | Write cache hits. |
| *_write_cache_miss_allocate | Cache re-allocations upon write misses. |
| *_r*_read_cache_hit | Read cache hits in the r* channel. |
| *_r*_read_cache_miss_allocate | Cache re-allocations upon read misses in the r* channel. |

Note: For some products, the asterisk (*) prefix of some fields in the preceding table represents aic or aiv, indicating that the data is execution results on the Cube Core or Vector Core.

Availability:
- Atlas A2 Training Series Product/Atlas 800I A2 Inference Product
- Atlas A3 Training Series Product
- Atlas 200/500 A2 Inference Product
| Field | Description |
|---|---|
| vec_exe_time(us) | Time taken to execute Vector instructions, in μs. |
| vec_exe_ratio | Ratio of cycles taken to execute Vector instructions to the total cycles. For the Atlas 200/500 A2 Inference Product, this field is not supported and defaults to N/A. |
| mac_exe_time(us) | Time taken to execute Cube instructions (fp16 and s16), in μs. |
| mac_exe_ratio | Ratio of cycles taken to execute Cube instructions (fp16 and s16) to the total cycles. |
| scalar_exe_time(us) | Time taken to execute Scalar instructions, in μs. |
| scalar_exe_ratio | Ratio of cycles taken to execute Scalar instructions to the total cycles. |
| mte1_exe_time(us) | Time taken to execute MTE1 instructions (L1-to-L0A/L0B transfer), in μs. |
| mte1_exe_ratio | Ratio of cycles taken to execute MTE1 instructions (L1-to-L0A/L0B transfer) to the total cycles. |
| mte2_exe_time(us) | Time taken to execute MTE2 instructions (DDR-to-AI Core transfer), in μs. |
| mte2_exe_ratio | Ratio of cycles taken to execute MTE2 instructions (DDR-to-AI Core transfer) to the total cycles. |
| mte3_exe_time(us) | Time taken to execute MTE3 instructions (AI Core-to-DDR transfer), in μs. |
| mte3_exe_ratio | Ratio of cycles taken to execute MTE3 instructions (AI Core-to-DDR transfer) to the total cycles. |
| fixpipe_exe_time(us) | Time taken to execute fixpipe instructions (L0C-to-OUT/L1 transfer), in μs. |
| fixpipe_exe_ratio | Ratio of cycles taken to execute fixpipe instructions (L0C-to-OUT/L1 transfer) to the total cycles. |
| memory_bound | AI Core memory bound, calculated as: mte2_ratio/max(mac_ratio, vec_ratio). If the value is less than 1, no memory bound exists. If the value is greater than 1, the AI Core is mostly engaged in memory transfer instead of computation when executing tasks. A greater value indicates a more severe bound. |
| cube_utilization(%) | Cube operator utilization, which indicates whether the number of Cube operations per unit time reaches the theoretical upper limit. The closer the value is to 100%, the closer the operator is to that limit. Formula: cube_utilization = total_cycles/(freq * core_num * task_duration) |