op_summary (Operator Details)
The AI Core and AI CPU operator summary data does not contain timeline information. It is collected in the op_summary_*.csv file, which records the details and time consumption of each operator.
Availability
op_summary_*.csv File
The file content is formatted as follows.
The Task Duration field specifies the operator time consumption. You can sort operators by Task Duration to find time-consuming operators, or sort them by Task Type to view the time-consuming operators running on the AI Core or AI CPU.
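The sorting workflow described above can be sketched with the Python standard library. This is a minimal illustration, not part of the profiling tool: the sample rows, the operator names, and the helper names `slowest_ops` and `slowest_by_task_type` are hypothetical, and a real op_summary_*.csv file contains many more columns.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; only the three columns used below are included.
SAMPLE_CSV = """Op Name,Task Type,Task Duration(us)
MatMul_1,AI Core,120.5
Cast_3,AI CPU,45.0
Add_2,AI Core,10.25
"""


def slowest_ops(csv_text, top_n=10):
    """Return rows sorted by Task Duration(us), longest first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: float(r["Task Duration(us)"]), reverse=True)
    return rows[:top_n]  # top_n=None keeps every row


def slowest_by_task_type(csv_text):
    """Group rows by Task Type; each group stays sorted longest first."""
    groups = defaultdict(list)
    for row in slowest_ops(csv_text, top_n=None):
        groups[row["Task Type"]].append(row)
    return dict(groups)


if __name__ == "__main__":
    for row in slowest_ops(SAMPLE_CSV, top_n=2):
        print(row["Op Name"], row["Task Duration(us)"])
```

To use it on a real file, read the file contents into a string and pass that string in place of `SAMPLE_CSV`.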
- When task_time is set to l0 or off, op_summary_*.csv does not display the PMU data of the AI Core.
- An operator whose Task Type is HCCL usually contains a series of communication tasks. Each communication task has its own task ID and stream ID, which are not displayed here. Therefore, the task ID and stream ID of such an operator are displayed as N/A.
- If an input is scalar, the corresponding Input Shapes field is empty and formatted as ; ; ; ;. Each dimension is separated by a semicolon (;). This also applies to output shapes.
- The tool checks operators for overflow/underflow. If an overflow/underflow is detected, the following alarm is displayed. In this case, the operator computation result is unreliable.
Figure 2 Operator overflow/underflow alarm
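Because several fields can hold N/A (for example, Task ID and Stream ID for HCCL operators), code that reads the file should not assume every cell is numeric. The following is a minimal sketch of one way to handle this; the helper names `parse_optional_int` and `task_key` are hypothetical and only the column names come from this document.

```python
def parse_optional_int(cell):
    """Convert a CSV cell to int, or return None when it is empty or N/A."""
    text = cell.strip()
    if text in ("", "N/A"):
        return None
    return int(text)


def task_key(row):
    """Return (stream_id, task_id), or None when either value is N/A.

    Rows whose Task Type is HCCL are expected to fall into the None case,
    because their task and stream IDs are not displayed.
    """
    stream_id = parse_optional_int(row["Stream ID"])
    task_id = parse_optional_int(row["Task ID"])
    if stream_id is None or task_id is None:
        return None
    return (stream_id, task_id)


if __name__ == "__main__":
    print(task_key({"Stream ID": "3", "Task ID": "17"}))
    print(task_key({"Stream ID": "N/A", "Task ID": "N/A"}))
```

Returning None instead of raising keeps HCCL rows in the data set while making it explicit that they cannot be keyed by task and stream ID.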
The op_summary_*.csv file content varies depending on the values of --aic-mode, --aic-metrics, and --task-time. The complete fields are as follows.

| Field | Description |
|---|---|
| Device_id | Device ID. |
| Model Name | Model name. It may be left empty if no related data is collected. (This field is not displayed by default or in the single-operator scenario.) |
| Model ID | Model ID. |
| Task ID | Task ID. |
| Stream ID | ID of the stream where a task is located. |
| Infer ID | Inference iteration ID. (This field is not displayed by default or in the single-operator scenario.) |
| Op Name | Operator name. |
| OP Type | Operator type. If task_time is set to l0, this field is not collected and is displayed as N/A. |
| OP State | Whether an operator is dynamic or static. The value dynamic indicates a dynamic operator, and the value static indicates a static operator. HCCL operators do not have this state, so N/A is displayed. This field is reported only when --task-time is l1. If --task-time is l0, N/A is displayed. |
| Task Type | Type of the accelerator that executes a task, including AI Core, AI Vector Core, and AI CPU. If task_time is set to l0, this field is not collected and is displayed as N/A. |
| Task Start Time(us) | Task start time (μs). |
| Task Duration(us) | Task duration (μs), including scheduling time to the accelerator, execution time on the accelerator, and response end time. |
| Task Wait Time(us) | Interval between tasks (μs). |
| Block Dim | Number of running task blocks, which corresponds to the number of cores during task running. If task_time is set to l0, this field is not collected and is displayed as 0. |
| HF32 Eligible | Whether the HF32 precision flag is used. YES indicates that the flag is used, and NO indicates that it is not used. |
| Input Shapes | Input shapes. If task_time is set to l0, this field is not collected and is displayed as N/A. |
| Input Data Types | Input data types. If task_time is set to l0, this field is not collected and is displayed as N/A. |
| Input Formats | Input formats. If task_time is set to l0, this field is not collected and is displayed as N/A. |
| Output Shapes | Output shapes. If task_time is set to l0, this field is not collected and is displayed as N/A. |
| Output Data Types | Output data types. If task_time is set to l0, this field is not collected and is displayed as N/A. |
| Output Formats | Output formats. If task_time is set to l0, this field is not collected and is displayed as N/A. |
| aicore_time(us) | Theoretical execution time (μs) of a task on the AI Core, assuming that all blocks are scheduled simultaneously and each block has an equal execution duration. In practice, the scheduling start time of each block differs slightly, so the value of this field is slightly less than the actual execution time of the task on the AI Core. If the AI Core frequency changes (for example, through manual frequency adjustment, or dynamic frequency adjustment when the power consumption exceeds the threshold), the data is inaccurate and is not recommended for reference. |
| total_cycles | Total number of execution cycles of a task on the AI Core, which is the sum of the execution cycles of all blocks. |
| Register value | Value of the custom register whose data is to be collected. It is configured by --aic-metrics. |
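The descriptions of total_cycles, Block Dim, and aicore_time(us) above suggest a simple relationship between the three fields. The sketch below shows one reading of it, under two assumptions that are not stated in this document: that the per-block cycle count is total_cycles divided by Block Dim, and that the AI Core runs at a constant frequency supplied by the caller. The function name and the `freq_mhz` parameter are hypothetical, and the profiler's own computation may differ.

```python
def estimate_aicore_time_us(total_cycles, block_dim, freq_mhz):
    """Estimate the theoretical AI Core time in microseconds.

    Assumes every block runs the same number of cycles and that all
    blocks start at the same moment (the idealized case described in
    the aicore_time(us) row).
    """
    if block_dim <= 0:
        # Block Dim is displayed as 0 when task_time is l0.
        raise ValueError("block_dim must be positive")
    if freq_mhz <= 0:
        raise ValueError("freq_mhz must be positive")
    cycles_per_block = total_cycles / block_dim
    # 1 MHz is one cycle per microsecond, so cycles / MHz gives microseconds.
    return cycles_per_block / freq_mhz


if __name__ == "__main__":
    print(estimate_aicore_time_us(1_000_000, 10, 1000.0))
```

As the table notes, any estimate of this kind stops being meaningful when the AI Core frequency changes during the run.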
The following fields are supported by the

| Field | Description |
|---|---|
| vec_time(us) | Time (μs) taken to execute Vector instructions. |
| vec_ratio | Ratio of cycles taken to execute Vector instructions to the total cycles. |
| mac_time(us) | Time (μs) taken to execute Cube instructions. |
| mac_ratio | Ratio of cycles taken to execute Cube instructions to the total cycles. |
| scalar_time(us) | Time (μs) taken to execute Scalar instructions. |
| scalar_ratio | Ratio of cycles taken to execute Scalar instructions to the total cycles. |
| mte1_time(us) | Time (μs) taken to execute MTE1 instructions (L1-to-L0A/L0B transfer). |
| mte1_ratio | Ratio of cycles taken to execute MTE1 instructions (L1-to-L0A/L0B transfer) to the total cycles. |
| mte2_time(us) | Time (μs) taken to execute MTE2 instructions (DDR-to-AI Core transfer). |
| mte2_ratio | Ratio of cycles taken to execute MTE2 instructions (DDR-to-AI Core transfer) to the total cycles. |
| mte3_time(us) | Time (μs) taken to execute MTE3 instructions (AI Core-to-DDR transfer). |
| mte3_ratio | Ratio of cycles taken to execute MTE3 instructions (AI Core-to-DDR transfer) to the total cycles. |
| icache_miss_rate | iCache is the L2 cache reserved for instructions. If the value of icache_miss_rate is high, the AI Core reads instructions at a low efficiency. |
| memory_bound | AI Core memory bound, calculated as: mte2_ratio/max(mac_ratio, vec_ratio). If the value is less than 1, no memory bound exists. If the value is greater than 1, the AI Core spends most of the time on memory transfers rather than computation. A larger value indicates a more severe memory bottleneck. |
| cube_utilization(%) | Cube operator utilization. It indicates whether the number of Cube operations per unit time reaches the theoretical upper limit. The closer the value is to 100%, the closer the operator is to that limit. Formula: cube_utilization = total_cycles/(freq * core_num * task_duration) |
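The two formulas given for memory_bound and cube_utilization(%) can be written out as follows. This is a sketch under stated assumptions: the document does not give the units of freq or task_duration, so the code assumes MHz and microseconds (their product is then a cycle count), and it returns a fraction that would be multiplied by 100 to obtain a percentage. The handling of a zero denominator is also a choice made here, not documented behavior.

```python
def memory_bound(mte2_ratio, mac_ratio, vec_ratio):
    """memory_bound = mte2_ratio / max(mac_ratio, vec_ratio)."""
    compute_ratio = max(mac_ratio, vec_ratio)
    if compute_ratio == 0:
        # Assumption: with no compute cycles, any transfer time is
        # treated as fully memory bound.
        return float("inf") if mte2_ratio > 0 else 0.0
    return mte2_ratio / compute_ratio


def cube_utilization(total_cycles, freq_mhz, core_num, task_duration_us):
    """cube_utilization = total_cycles / (freq * core_num * task_duration).

    Assumes freq in MHz and task_duration in microseconds. Returns a
    fraction; multiply by 100 for a percentage.
    """
    available_cycles = freq_mhz * core_num * task_duration_us
    if available_cycles <= 0:
        raise ValueError("freq, core_num and task_duration must be positive")
    return total_cycles / available_cycles


if __name__ == "__main__":
    # A value above 1 points to a memory bottleneck.
    print(memory_bound(mte2_ratio=0.5, mac_ratio=0.25, vec_ratio=0.1))
    print(cube_utilization(500_000, 1000.0, 10, 100.0))
```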

| Field | Description |
|---|---|
| mac_fp16_ratio | Ratio of cycles taken to execute Cube fp16 instructions to the total cycles. |
| mac_int8_ratio | Ratio of cycles taken to execute Cube int8 instructions to the total cycles. |
| vec_fp32_ratio | Ratio of cycles taken to execute Vector fp32 instructions to the total cycles. |
| vec_fp16_ratio | Ratio of cycles taken to execute Vector fp16 instructions to the total cycles. |
| vec_int32_ratio | Ratio of cycles taken to execute Vector int32 instructions to the total cycles. |
| vec_misc_ratio | Ratio of cycles taken to execute Vector misc instructions to the total cycles. |
| cube_fops | Floating-point operations (FLOPs, that is, fops in this field) of the Cube type, indicating the computation amount. This field can be used to measure the complexity of an algorithm or model. |
| vector_fops | Floating-point operations (FLOPs, that is, fops in this field) of the Vector type, indicating the computation amount. This field can be used to measure the complexity of an algorithm or model. |

| Field | Description |
|---|---|
| ub_read_bw(GB/s) | UB read bandwidth (GB/s). |
| ub_write_bw(GB/s) | UB write bandwidth (GB/s). |
| l1_read_bw(GB/s) | L1 read bandwidth (GB/s). |
| l1_write_bw(GB/s) | L1 write bandwidth (GB/s). |
| l2_read_bw | L2 read bandwidth (GB/s). It is supported only by the |
| l2_write_bw | L2 write bandwidth (GB/s). It is supported only by the |
| main_mem_read_bw(GB/s) | Main memory read bandwidth (GB/s). |
| main_mem_write_bw(GB/s) | Main memory write bandwidth (GB/s). |

| Field | Description |
|---|---|
| l0a_read_bw(GB/s) | L0A read bandwidth (GB/s). |
| l0a_write_bw(GB/s) | L0A write bandwidth (GB/s). |
| l0b_read_bw(GB/s) | L0B read bandwidth (GB/s). |
| l0b_write_bw(GB/s) | L0B write bandwidth (GB/s). |
| l0c_read_bw(GB/s) | Bandwidth rate for Vector to read data from L0C, in GB/s. |
| l0c_write_bw(GB/s) | Bandwidth rate for Vector to write data to L0C, in GB/s. |
| l0c_read_bw_cube(GB/s) | Bandwidth rate for Cube to read data from L0C, in GB/s. |
| l0c_write_bw_cube(GB/s) | Bandwidth rate for Cube to write data to L0C, in GB/s. |

| Field | Description |
|---|---|
| ub_read_bw_mte(GB/s) | Bandwidth rate for MTE to read data from UB, in GB/s. It is supported only by the |
| ub_write_bw_mte(GB/s) | Bandwidth rate for MTE to write data to UB, in GB/s. It is supported only by the |
| ub_read_bw_vector(GB/s) | Bandwidth rate for Vector to read data from UB, in GB/s. |
| ub_write_bw_vector(GB/s) | Bandwidth rate for Vector to write data to UB, in GB/s. |
| ub_read_bw_scalar(GB/s) | Bandwidth rate for Scalar to read data from UB, in GB/s. |
| ub_write_bw_scalar(GB/s) | Bandwidth rate for Scalar to write data to UB, in GB/s. |

| Field | Description |
|---|---|
| vec_bankgroup_cflt_ratio | Ratio of cycles taken to execute vec_bankgroup_stall_cycles instructions to the total cycles. Bankgroup conflicts occur when the block stride of Vector instructions is improperly set. |
| vec_bank_cflt_ratio | Ratio of cycles taken to execute vec_bank_stall_cycles instructions to the total cycles. Bank conflicts occur when the read/write pointer address of a Vector instruction operand is improperly set. |
| vec_resc_cflt_ratio | Ratio of cycles taken to execute vec_resc_cflt_ratio instructions to the total cycles. If an operator involves multiple compute units, ensure that they are scheduled concurrently. If the operator logic keeps delivering instructions to a compute unit that is already busy, the overall computing power is not fully utilized. |