GUI Description
Function
The Details tab page displays Base Info, Compute Workload Analysis, Core Occupancy, Roofline, and Memory Workload Analysis. The analysis results are displayed in charts and data panes.
GUI Display
Import the BIN file collected by msProf by referring to Importing Data. For details about how to obtain the file, see "msprof op" in Operator Tuning (msProf) in Operator Development Tool User Guide.
Only one BIN file can be imported at a time. Importing a folder of files is not supported.
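The import constraint above can be checked before opening the GUI. The following is a minimal sketch, not part of the tool: the helper name `check_import_path` is hypothetical, and it assumes that the collected file carries a `.bin` suffix, which the text does not state explicitly.

```python
from pathlib import Path


def check_import_path(path_str):
    """Return (ok, reason) for a candidate import path.

    Hypothetical pre-check, not a tool API. Assumption: the collected
    file uses a ".bin" suffix; adjust the check if yours differs.
    """
    path = Path(path_str)
    if path.is_dir():
        # Folders cannot be imported; exactly one file must be selected.
        return False, "folders cannot be imported; select a single BIN file"
    if not path.is_file():
        return False, "path does not exist or is not a regular file"
    if path.suffix.lower() != ".bin":
        return False, "expected a file with the .bin suffix"
    return True, "ok"
```

Call the helper once per file, because only one BIN file can be imported at a time.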
The Details tab page consists of five areas: Base Info (area 1), Core Occupancy (area 2), Roofline (area 3), Compute Workload Analysis (area 4), and Memory Workload Analysis (area 5), as shown in Figure 1.
- Area 1: Base Info. You can view the basic operator information, including the name, duration, and type. Table 1 describes the parameters.
Table 1 Base Info parameters

| Field | Description |
| --- | --- |
| Name | Operator name. |
| Duration (μs) | Total operator duration. |
| Op Type | Operator type. The options are mix, vector, cube, and AiCore. |
| Device Id | Device ID. |
| Pid | Process ID. |
| Block Dim | Number of sub blocks. This parameter is used when the operator type is vector, cube, or AiCore. |
| Mix Block Dim | Number of sub blocks. This parameter is used when the operator type is mix. |
| Block Detail | Duration details of sub blocks. This parameter is used when the operator type is vector, cube, or AiCore. Table 2 describes the fields. |
| Mix Block Detail | Duration details of sub blocks. This parameter is used when the operator type is mix. Table 3 describes the fields. |

- Area 2: Core Occupancy. The inter-core load is displayed and analyzed based on the number of clock cycles, total core throughput, and cache hit rate, as shown in Figure 2. Developers can select Cycles, Throughput, or Cache Hit Rate(%) to display the corresponding core usage and analysis result, which helps them locate and analyze exceptions.
- This module is supported only for the profile data exported from Atlas A3 Training Series Product and Atlas A2 Training Series Product/Atlas 800I A2 Inference Product.
- If computing units of the same type have similar colors, the load is well balanced. Colors are compared among the cube computing units of all cores, and separately among the vector computing units of all cores.
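The visual color comparison can also be expressed as a number. The sketch below is an illustration only: the max-to-mean ratio is a common imbalance measure chosen here as an assumption, not the metric the tool uses, and the per-core cycle counts are made-up values.

```python
def imbalance_ratio(values):
    """Max-to-mean ratio of a per-core metric; 1.0 means perfectly even load."""
    avg = sum(values) / len(values)
    if avg == 0:
        return 1.0
    return max(values) / avg


# Hypothetical per-core cycle counts, grouped by unit type because
# cube units are compared with cube units and vector with vector.
cube_cycles = [1000, 1020, 990, 1010]
vector_cycles = [500, 900, 480, 520]

cube_ratio = imbalance_ratio(cube_cycles)      # close to 1.0: well balanced
vector_ratio = imbalance_ratio(vector_cycles)  # noticeably above 1.0: one core is overloaded
```

A ratio near 1.0 corresponds to similar colors across cores; a larger ratio corresponds to one core standing out.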
- Area 3: Roofline. The Roofline model graph shows operator performance and provides a basis for performance optimization. In the graph, the X axis represents the arithmetic intensity (Ops/Byte), that is, the number of operations performed per byte of memory traffic. The Y axis represents the performance (TOPS/s), measured in trillions of operations per second. The graph also displays the computing power name, which describes the mix of instruction types that determines the maximum computing power. For example, a name combining Cube_INT(100.000000%), Vec_FP16(30.000000%), and Vec_FP32(70.000000%) indicates that the cube computing units process only INT instructions, and the vector computing units process 30% FP16 instructions and 70% FP32 instructions.
This module is supported only by the Atlas A2 Training Series Product/Atlas 800I A2 Inference Product, Atlas A3 Training Series Product, and Atlas Inference Series Product.
- If the hardware product is Atlas A3 Training Series Product or Atlas A2 Training Series Product/Atlas 800I A2 Inference Product, the Roofline performance model analysis includes the Memory Unit, Memory Transfer, and Pipeline tab pages.
Memory Unit: displays the HBM/L2 and Memory Unit model graphs, as shown in Figure 3. Table 4 describes the parameters.
Table 4 Memory Unit parameters

| Parameter | Description |
| --- | --- |
| HBM Read + Write | Read and write of the high bandwidth memory unit. |
| L2 Read + Write | Read and write of the L2 memory unit. |
| L1 Read + Write | Read and write of the L1 memory unit. |
| Write to L1 | Write to the L1 memory unit. |
| Read from L1 | Read from the L1 memory unit. |
| Write to L0A | Write to the L0A memory unit. |
| Write to L0B | Write to the L0B memory unit. |
| Read from L0C | Read from the L0C memory unit. |
| UB Read + Write | Read and write of the UB memory unit. |
| Read from UB | Read from the UB memory unit. |
| Write to UB | Write to the UB memory unit. |
| Vector Read UB | Read from the UB memory unit by the vector unit. |
| Vector Write UB | Write to the UB memory unit by the vector unit. |

Memory Transfer: displays the memory transfer path, as shown in Figure 4. Table 5 describes the parameters.
Table 5 Memory Channel parameters

| Parameter | Description |
| --- | --- |
| GM/L1 to L0A | Memory channel from GM/L1 to L0A. |
| GM/L1 to L0B | Memory channel from GM/L1 to L0B. |
| L0C to GM | Memory channel from L0C to GM. |
| L1 to GM | Memory channel from L1 to GM. |
| L0C to L1 | Memory channel from L0C to L1. |
| GM to UB | Memory channel from GM to UB. |
| UB to GM | Memory channel from UB to GM. |

Pipeline: displays the pipeline model graph, as shown in Figure 5. Table 6 describes the parameters.
- When the hardware product is Atlas Inference Series Product, only the memory unit content exists, as shown in Figure 6. Table 7 describes the parameters.
Table 7 Memory Unit parameters

| Parameter | Description |
| --- | --- |
| L1 Read + Write | Read and write of the L1 memory unit. |
| Read from L0C | Read from the L0C memory unit. |
| Read from L1 | Read from the L1 memory unit. |
| Read from UB | Read from the UB memory unit. |
| UB Read + Write | Read and write of the UB memory unit. |
| Vector Read UB | Read from the UB memory unit by the vector unit. |
| Vector Write UB | Write to the UB memory unit by the vector unit. |
| Write to L0A | Write to the L0A memory unit. |
| Write to L0B | Write to the L0B memory unit. |
| Write to L1 | Write to the L1 memory unit. |
| Write to UB | Write to the UB memory unit. |

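To make the two Roofline axes concrete, the sketch below uses the standard textbook roofline bound, where attainable performance is the smaller of the compute roof and the bandwidth roof multiplied by arithmetic intensity. It is not taken from the tool, and all hardware and operator numbers are hypothetical placeholders.

```python
def attainable_performance(intensity, peak_compute, peak_bandwidth):
    """Textbook roofline bound: min(compute roof, bandwidth * intensity).

    intensity      -- arithmetic intensity in Ops/Byte (X axis)
    peak_compute   -- compute roof in Ops/s
    peak_bandwidth -- memory roof in Bytes/s
    """
    return min(peak_compute, peak_bandwidth * intensity)


def bound_type(intensity, peak_compute, peak_bandwidth):
    """Classify an operator by comparing its intensity with the ridge point."""
    ridge = peak_compute / peak_bandwidth  # intensity where the two roofs meet
    return "memory-bound" if intensity < ridge else "compute-bound"


# Hypothetical values, for illustration only.
peak_compute = 100e12    # 100 tera-operations per second
peak_bandwidth = 1e12    # 1 TB/s
ops = 2e9                # operations executed by the operator
bytes_moved = 4e8        # bytes transferred by the operator

intensity = ops / bytes_moved  # 5.0 Ops/Byte
bound = attainable_performance(intensity, peak_compute, peak_bandwidth)
kind = bound_type(intensity, peak_compute, peak_bandwidth)
```

With these placeholder numbers the operator sits left of the ridge point, so moving less data per operation is more likely to help than speeding up the computation.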
- Area 4: Compute Workload Analysis. A bar chart and a data pane present the compute workload for analysis, as shown in Figure 7. Table 8 describes the fields. The content marked by the icon is the compute workload analysis result of each block.
Table 8 Compute Workload Analysis parameters

| Parameter | Description |
| --- | --- |
| Block ID | Sub block ID. You can switch the block ID to view the corresponding information. When the operator type is AiCore, this parameter is displayed as NA, and the multi-core average value is displayed. |
| Pipe Utilization | Pipe (instruction queue) visualization. It is displayed in a bar chart. Horizontal coordinate: Cycles percentage, calculated as follows: Cycles/Total cycles. Cycles indicates the clock cycles consumed by the instruction execution on the sub block. Vertical coordinate: operator instructions, provided by the data in the BIN file. |
| CUBE | Name of a cube instruction. This parameter is displayed when the operator type is cube. |
| CUBE0 | Name of a cube instruction. This parameter is displayed when the operator type is mix. |
| VECTOR | Name of a vector instruction. This parameter is displayed when the operator type is vector. |
| VECTOR0 | Name of a vector instruction. This parameter is displayed when the operator type is mix. |
| VECTOR1 | Name of a vector instruction. This parameter is displayed when the operator type is mix. |
| AICORE | Name of an AI Core instruction. This parameter is displayed when the operator type is AiCore. |
| Instructions | Number of operator instructions. |
| Duration(μs) | Duration of operator instructions. |
| Data Volume(byte) | Operator instruction data volume. |

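The horizontal coordinate of the Pipe Utilization chart is defined as Cycles/Total cycles. A minimal sketch of that calculation follows; the function name, the instruction names, and the cycle counts are hypothetical, and scaling by 100 to obtain a percentage is an assumption about how the value is displayed.

```python
def cycles_percentage(cycles_by_instruction, total_cycles):
    """Map each instruction name to Cycles / Total cycles, expressed in percent."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return {
        name: 100.0 * cycles / total_cycles
        for name, cycles in cycles_by_instruction.items()
    }


# Hypothetical instruction names and cycle counts for one sub block.
sample = {"instr_a": 250, "instr_b": 500}
result = cycles_percentage(sample, total_cycles=1000)
```

Each value in `result` corresponds to the length of one bar in the chart for the selected block.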
- Area 5: Memory Workload Analysis. You can view the memory workload analysis information in the memory heatmap and data pane, as shown in Figure 8. Table 9 describes the parameters. The Peak scale on the left of the heatmap gives the meaning of the arrow colors: each color corresponds to a peak bandwidth ratio (maximum bandwidth ratio). The content marked by the icon is the memory workload analysis result of each block.
Table 9 Parameter description

| Parameter | Description |
| --- | --- |
| Block ID | Sub block ID. You can select the sub block to be viewed from the Block ID drop-down list. When the operator type is AiCore, Block ID is displayed as NA, and the multi-core average value is displayed. |
| Show As | Optional. You can select the flow arrow content of the heatmap to display the number of requests or bandwidth. The arrow on the heatmap indicates the flow direction. The options are Num of Request and Bandwidth. |

The content displayed in the data pane varies according to the operator type. The content is the data parsing result based on the BIN file. The details are as follows:
- When the operator type is AiCore, the parameters in the table pane are described in Table 10.
Table 10 Parameters for the AiCore type

| Parameter | Description |
| --- | --- |
| Cache | L2 cache. |
| Cube | Cube computing unit. |
| HBM | High bandwidth memory unit. |
| L0A | L0A memory unit. |
| L0B | L0B memory unit. |
| L0C | L0C memory unit. |
| L1 | L1 memory unit. |
| Pipe | Computing channel. |
| UB | UB memory unit. |
| Vector | Vector computing unit. |
| Requests | Number of operations. |
| Throughput(GB/s) | Throughput, indicating the amount of data transferred per second by the channel, in GB/s. |

- When the operator type is mix, the parameters in the table pane are described in Table 11.
Table 11 Parameters for the mix type

| Parameter | Description |
| --- | --- |
| Cache | L2 cache. |
| Hit | Number of cache hits. |
| Miss | Number of times that the cache is reallocated after a cache miss. |
| Total | Total number of cache requests. |
| Hit Rate(%) | Cache hit rate. |
| Cube | Cube computing unit. |
| HBM Cube | High bandwidth memory unit of the cube unit. |
| HBM Vector Core0 | High bandwidth memory unit of the vector unit of core 0 in AI Core. |
| HBM Vector Core1 | High bandwidth memory unit of the vector unit of core 1 in AI Core. |
| L0A | L0A memory unit. |
| L0B | L0B memory unit. |
| L0C | L0C memory unit. |
| L1 | L1 memory unit. |
| Requests | Number of operations. |
| Throughput(GB/s) | Throughput, indicating the amount of data transferred per second by the channel, in GB/s. |
| Peak(%) | Ratio of the actual bandwidth to the theoretical bandwidth. |
| Pipe Cube | Computing channel of the cube unit. |
| Pipe Vector Core0 | Computing channel of the vector unit of core 0 in AI Core. |
| Pipe Vector Core1 | Computing channel of the vector unit of core 1 in AI Core. |
| Instructions | Number of instructions. |
| Cycle | Clock cycle consumed by the channel. |
| Wait Cycles | Number of blocked cycles on the corresponding pipe. |
| Active Rate(%) | Percentage of the running cycles to the total cycles. |
| UB Core0 | UB memory unit of core 0 in AI Core of the mix operator. |
| UB Core1 | UB memory unit of core 1 in AI Core of the mix operator. |
| Vector core0 | Vector computing unit. |

- When the operator type is vector, the parameters in the table pane are described in Table 12.
Table 12 Parameters for the vector type

| Parameter | Description |
| --- | --- |
| Cache | L2 cache. |
| Hit | Number of cache hits. |
| Miss | Number of times that the cache is reallocated after a cache miss. |
| Total | Total number of cache requests. |
| Hit Rate(%) | Cache hit rate. |
| HBM | High bandwidth memory unit. |
| Requests | Number of operations. |
| Throughput(GB/s) | Throughput, indicating the amount of data transferred per second by the channel, in GB/s. |
| Pipe | Computing channel. |
| Instructions | Number of instructions. |
| Cycle | Clock cycle consumed by the channel. |
| Wait Cycles | Number of blocked cycles on the corresponding pipe. |
| Active Rate(%) | Percentage of the running cycles to the total cycles. |
| UB | UB memory unit. |
| Vector | Vector computing unit. |
| Peak(%) | Ratio of the actual bandwidth to the theoretical bandwidth. |

- When the operator type is cube, the parameters in the table pane are described in Table 13.
Table 13 Parameters for the cube type

| Parameter | Description |
| --- | --- |
| Cache | L2 cache. |
| Hit | Number of cache hits. |
| Miss | Number of times that the cache is reallocated after a cache miss. |
| Total | Total number of cache requests. |
| Hit Rate(%) | Cache hit rate. |
| Cube | Cube computing unit. |
| HBM | High bandwidth memory unit. |
| L0A | L0A memory unit. |
| L0B | L0B memory unit. |
| L0C | L0C memory unit. |
| L1 | L1 memory unit. |
| Requests | Number of operations. |
| Throughput(GB/s) | Throughput, indicating the amount of data transferred per second by the channel, in GB/s. |
| Peak(%) | Ratio of the actual bandwidth to the theoretical bandwidth. |
| Pipe | Computing channel. |
| Instructions | Number of instructions. |
| Cycle | Clock cycle consumed by the channel. |
| Wait Cycles | Number of blocked cycles on the corresponding pipe. |
| Active Rate(%) | Percentage of the running cycles to the total cycles. |

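The percentage fields in the data pane are simple ratios. The sketch below restates them under stated assumptions: Peak(%) and Active Rate(%) follow the definitions given in the tables, while computing Hit Rate(%) as Hit divided by Total is an assumption, because the tables only describe it as the cache hit rate. Function names and input values are hypothetical.

```python
def percent(part, whole):
    """Return part / whole as a percentage, or 0.0 when whole is zero."""
    return 0.0 if whole == 0 else 100.0 * part / whole


def hit_rate(hit, total):
    # Assumption: hit rate is cache hits over total cache requests.
    return percent(hit, total)


def peak_ratio(actual_bandwidth, theoretical_bandwidth):
    # Peak(%): ratio of the actual bandwidth to the theoretical bandwidth.
    return percent(actual_bandwidth, theoretical_bandwidth)


def active_rate(running_cycles, total_cycles):
    # Active Rate(%): percentage of the running cycles to the total cycles.
    return percent(running_cycles, total_cycles)


# Hypothetical values, for illustration only.
example_hit_rate = hit_rate(hit=900, total=1000)
example_peak = peak_ratio(actual_bandwidth=300.0, theoretical_bandwidth=1200.0)
example_active = active_rate(running_cycles=750, total_cycles=1000)
```

Recomputing these ratios from the raw counts is a quick way to sanity-check values read from the data pane.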







