GUI Description

Function

The Details tab page displays Base Info, Compute Workload Analysis, Core Occupancy, Roofline, and Memory Workload Analysis. The analysis results are displayed in charts and data panes.

GUI Display

Import the BIN file collected by msProf by referring to Importing Data. For details about how to obtain the file, see "msprof op" in Operator Tuning (msProf) in Operator Development Tool User Guide.

Only one BIN file can be imported at a time. Importing a folder of files is not supported.

The Details tab page consists of five areas: Base Info (area 1), Core Occupancy (area 2), Roofline (area 3), Compute Workload Analysis (area 4), and Memory Workload Analysis (area 5), as shown in Figure 1.

Figure 1 Details page

Area 1: Base Info. You can view the basic operator information, including the name, duration, and type. Table 1 describes the parameters.

**Table 1** Base Info parameters
Field	Description
Name	Operator name.
Duration (μs)	Total operator duration.
Op Type	Operator type. The options are mix, vector, cube, and AiCore.
Device Id	Device ID.
Pid	Process ID.
Block Dim	Number of sub blocks. This parameter is used when the operator type is vector, cube, or AiCore.
Mix Block Dim	Number of sub blocks. This parameter is used when the operator type is mix.
Block Detail	Duration details of sub blocks. This parameter is used when the operator type is vector, cube, or AiCore. Table 2 describes the fields.
Mix Block Detail	Duration details of sub blocks. This parameter is used when the operator type is mix. Table 3 describes the fields.

**Table 2** Block details fields
Field	Description
Block ID	Sub block ID. This parameter is not available when the operator type is AiCore.
Core Type	Sub block type.
Duration (μs)	Duration of sub blocks.

**Table 3** Mixed block detail fields
Field	Description
Block ID	Sub block ID.
Cube0 Duration (μs)	Duration of the cube core in AI Core.
Vector0 Duration (μs)	Duration of one vector core in AI Core.
Vector1 Duration (μs)	Duration of another vector core in AI Core.

Area 2: Core Occupancy. The inter-core load is displayed and analyzed based on the number of clock cycles, total core throughput, and cache hit rate, as shown in Figure 2.
Developers can select Cycles, Throughput, or Cache Hit Rate(%) to display the core usage and analysis result, helping them locate and analyze exceptions.
Figure 2 Core Occupancy
- This module is supported only for the profile data exported from Atlas A3 Training Series Product and Atlas A2 Training Series Product/Atlas 800I A2 Inference Product.
- The more similar the colors of computing units of the same type, the better the load is balanced. Colors are compared within each type only: cube computing units against the cube computing units of all other cores, and vector computing units against the vector computing units of all other cores.
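As a rough numeric counterpart to comparing colors by eye, the following sketch computes a max-to-mean ratio over per-core values of one unit type. The function name, the choice of metric, and the sample cycle counts are assumptions made for illustration; they are not produced by or taken from the tool.

```python
from statistics import mean


def imbalance_ratio(per_core_values):
    """Return max/mean for one unit type; 1.0 means perfectly balanced load."""
    values = list(per_core_values)
    if not values:
        raise ValueError("need at least one per-core value")
    avg = mean(values)
    if avg == 0:
        return 1.0  # every core idle: treat as balanced
    return max(values) / avg


# Hypothetical cycle counts. Cube and vector units are compared separately,
# mirroring how the colors are compared in the Core Occupancy view.
cube_cycles = [1000, 1000, 1000, 1000]
vector_cycles = [500, 1500, 1000, 1000]
print("cube imbalance:", imbalance_ratio(cube_cycles))
print("vector imbalance:", imbalance_ratio(vector_cycles))
```

A ratio close to 1.0 corresponds to similar colors; a larger ratio suggests that at least one core carries noticeably more load than the average.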

Area 3: Roofline. The Roofline model graph shows operator performance against the hardware limits and provides a basis for performance optimization. The X axis represents the arithmetic intensity (Ops/Byte), that is, the number of operations performed per byte of memory accessed. The Y axis represents the performance (TOps/s), that is, trillions of operations per second.
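The relationship between the two axes can be sketched with the standard Roofline bound: attainable performance is the smaller of the compute roof and the memory bandwidth multiplied by the arithmetic intensity. The function names and the sample peak and bandwidth values below are hypothetical and do not describe any specific product.

```python
def arithmetic_intensity(total_ops, total_bytes):
    """Operations performed per byte of memory accessed (Ops/Byte)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_ops / total_bytes


def attainable_performance(intensity, peak_tops, bandwidth_tb_per_s):
    """Roofline bound in TOps/s: min(compute roof, bandwidth * intensity)."""
    if intensity < 0:
        raise ValueError("intensity must be non-negative")
    # TB/s * Ops/Byte = TOps/s
    memory_bound = bandwidth_tb_per_s * intensity
    return min(peak_tops, memory_bound)


# Hypothetical hardware: 100 TOps/s compute roof, 1 TB/s memory bandwidth.
PEAK = 100.0
BANDWIDTH = 1.0
for ai in (50.0, 100.0, 200.0):
    print(ai, attainable_performance(ai, PEAK, BANDWIDTH))
```

With these assumed numbers the ridge point sits at 100 Ops/Byte: an operator to the left of it is limited by memory bandwidth, and one to the right of it is limited by compute.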

The Roofline model graph displays the computing power name, which describes the mix of instruction types used to determine the maximum computing power. For example, Cube_INT(100.000000%) + Vec_FP16(30.000000%) + Vec_FP32(70.000000%) indicates that the cube computing units process only INT instructions, and the vector computing units process 30% FP16 instructions and 70% FP32 instructions.
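If such a name needs to be processed outside the GUI, a small parser is enough. The sketch below assumes the name consists of terms in the form Unit_Type(percent%), as in the example above; the function name and the returned dictionary layout are illustrative choices, not part of the tool.

```python
import re

# One term looks like "Vec_FP16(30.000000%)": unit, data type, percentage.
_TERM = re.compile(
    r"(?P<unit>[A-Za-z]+)_(?P<dtype>[A-Za-z0-9]+)\((?P<pct>\d+(?:\.\d+)?)%\)"
)


def parse_compute_name(name):
    """Split a computing power name into {unit: {data_type: percent}}."""
    result = {}
    for match in _TERM.finditer(name):
        unit_types = result.setdefault(match["unit"], {})
        unit_types[match["dtype"]] = float(match["pct"])
    return result


name = "Cube_INT(100.000000%) + Vec_FP16(30.000000%) + Vec_FP32(70.000000%)"
print(parse_compute_name(name))
```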

This module is supported only by the Atlas A2 Training Series Product/Atlas 800I A2 Inference Product, Atlas A3 Training Series Product, and Atlas Inference Series Product.

If the hardware product is Atlas A3 Training Series Product or Atlas A2 Training Series Product/Atlas 800I A2 Inference Product, the Roofline performance model analysis includes the Memory Unit, Memory Transfer, and Pipeline tab pages.

Memory Unit: displays the HBM/L2 and Memory Unit model graphs, as shown in Figure 3. Table 4 describes the parameters.

Figure 3 Memory Unit

**Table 4** Memory Unit parameters
Parameter	Description
HBM Read + Write	Read and write of the high bandwidth memory unit.
L2 Read + Write	Read and write of the L2 memory unit.
L1 Read + Write	Read and write of the L1 memory unit.
Write to L1	Write to the L1 memory unit.
Read from L1	Read from the L1 memory unit.
Write to L0A	Write to the L0A memory unit.
Write to L0B	Write to the L0B memory unit.
Read from L0C	Read from the L0C memory unit.
UB Read + Write	Read and write of the UB memory unit.
Read from UB	Read from the UB memory unit.
Write to UB	Write to the UB memory unit.
Vector Read UB	Read from the UB memory unit by the vector unit.
Vector Write UB	Write to the UB memory unit by the vector unit.

Memory Transfer: displays the memory transfer path, as shown in Figure 4. Table 5 describes the parameters.

Figure 4 Memory Channel

**Table 5** Memory Channel parameters
Parameter	Description
GM/L1 to L0A	Memory channel from GM/L1 to L0A.
GM/L1 to L0B	Memory channel from GM/L1 to L0B.
L0C to GM	Memory channel from L0C to GM.
L1 to GM	Memory channel from L1 to GM.
L0C to L1	Memory channel from L0C to L1.
GM to UB	Memory channel from GM to UB.
UB to GM	Memory channel from UB to GM.

Pipeline: displays the pipeline model graph, as shown in Figure 5. Table 6 describes the parameters.

Figure 5 Transfer Unit

**Table 6** Transfer Unit parameters
Parameter	Description
MTE1	MTE1 channel.
MTE2	MTE2 channel.
MTE3	MTE3 channel.
FIXP	FIXP channel.
MTE2 vector	MTE2 channel of the vector computing unit.
MTE3 vector	MTE3 channel of the vector computing unit.

If the hardware product is Atlas Inference Series Product, only the Memory Unit content is available, as shown in Figure 6. Table 7 describes the parameters.

Figure 6 Memory unit model graph

**Table 7** Memory Unit parameters
Parameter	Description
L1 Read + Write	Read and write of the L1 memory unit.
Read from L0C	Read from the L0C memory unit.
Read from L1	Read from the L1 memory unit.
Read from UB	Read from the UB memory unit.
UB Read + Write	Read and write of the UB memory unit.
Vector Read UB	Read from the UB memory unit by the vector unit.
Vector Write UB	Write to the UB memory unit by the vector unit.
Write to L0A	Write to the L0A memory unit.
Write to L0B	Write to the L0B memory unit.
Write to L1	Write to the L1 memory unit.
Write to UB	Write to the UB memory unit.

Area 4: Compute Workload Analysis. Developers can view the compute workload in a bar chart and data pane, as shown in Figure 7. Table 8 describes the fields. The content marked with the icon is the compute workload analysis result of each block.

Figure 7 Compute Workload Analysis

**Table 8** Compute Workload Analysis parameters
Parameter	Description
Block ID	Sub block ID. You can switch the block ID to view the corresponding information. When the operator type is AiCore, this parameter is displayed as NA, and the multi-core average value is displayed.
Pipe Utilization	Pipe (instruction queue) visualization. It is displayed in a bar chart. Horizontal coordinate: Cycles percentage, calculated as follows: Cycles/Total cycles. Cycles indicates the clock cycles consumed by the instruction execution on the sub block. Vertical coordinate: operator instructions, provided by the data in the BIN file.
CUBE	Name of a cube instruction. This parameter is displayed when the operator type is cube.
CUBE0	Name of a cube instruction. This parameter is displayed when the operator type is mix.
VECTOR	Name of a vector instruction. This parameter is displayed when the operator type is vector.
VECTOR0	Name of a vector instruction. This parameter is displayed when the operator type is mix.
VECTOR1	Name of a vector instruction. This parameter is displayed when the operator type is mix.
AICORE	Name of an AI Core instruction. This parameter is displayed when the operator type is AiCore.
Instructions	Number of operator instructions.
Duration(μs)	Duration of operator instructions.
Data Volume(byte)	Operator instruction data volume.
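The horizontal coordinate of the Pipe Utilization chart follows the formula Cycles/Total cycles. A minimal sketch of that calculation is shown below, assuming the total cycle count of the sub block is supplied by the caller; the function name and the instruction names are hypothetical placeholders.

```python
def cycle_percentages(cycles_by_instruction, total_cycles):
    """Return {instruction: Cycles / Total cycles * 100} for one sub block."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return {
        name: 100.0 * cycles / total_cycles
        for name, cycles in cycles_by_instruction.items()
    }


# Hypothetical values for one sub block.
sample = {"instr_a": 600, "instr_b": 300}
for name, pct in cycle_percentages(sample, total_cycles=1000).items():
    print(f"{name}: {pct:.1f}%")
```

Because the total is passed in separately rather than summed from the entries, the percentages do not have to add up to 100.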

Area 5: Memory Workload Analysis. You can view the memory workload analysis information in the memory heatmap and data pane, as shown in Figure 8. Table 9 describes the parameters. Peak, on the left of the heatmap, indicates the arrow color; its value is the peak bandwidth ratio (maximum bandwidth ratio). The content marked with the icon is the memory workload analysis result of each block.

Figure 8 Memory Workload Analysis

**Table 9** Parameter description
Parameter	Description
Block ID	Sub block ID. You can select the sub block to be viewed from the Block ID drop-down list. When the operator type is AiCore, Block ID is displayed as NA, and the multi-core average value is displayed.
Show As	Optional. Selects what the flow arrows on the heatmap display. The options are Num of Request and Bandwidth. The arrows on the heatmap indicate the flow direction.

The content displayed in the data pane is parsed from the BIN file and varies according to the operator type, as follows:

When the operator type is AiCore, the parameters in the table pane are described in Table 10.

**Table 10** Parameters for the AiCore type
Parameter	Description
Cache	L2 cache.
Cube	Cube computing unit.
HBM	High bandwidth memory unit.
L0A	L0A memory unit.
L0B	L0B memory unit.
L0C	L0C memory unit.
L1	L1 memory unit.
Pipe	Computing channel.
UB	UB memory unit.
Vector	Vector computing unit.
Requests	Number of operations.
Throughput(GB/s)	Throughput, indicating the amount of data transferred per second by the channel, in GB/s.

When the operator type is mix, the parameters in the table pane are described in Table 11.

**Table 11** Parameters for the mix type
Parameter	Description
Cache	L2 cache.
Hit	Number of cache hits.
Miss	Number of times that the cache is reallocated after a cache miss.
Total	Total number of cache requests.
Hit Rate(%)	Cache hit rate.
Cube	Cube computing unit.
HBM Cube	High bandwidth memory unit of the cube unit.
HBM Vector Core0	High bandwidth memory unit of the vector unit of core 0 in AI Core.
HBM Vector Core1	High bandwidth memory unit of the vector unit of core 1 in AI Core.
L0A	L0A memory unit.
L0B	L0B memory unit.
L0C	L0C memory unit.
L1	L1 memory unit.
Requests	Number of operations.
Throughput(GB/s)	Throughput, indicating the amount of data transferred per second by the channel, in GB/s.
Peak(%)	Ratio of the actual bandwidth to the theoretical bandwidth.
Pipe Cube	Computing channel of the cube unit.
Pipe Vector Core0	Computing channel of the vector unit of core 0 in AI Core.
Pipe Vector Core1	Computing channel of the vector unit of core 1 in AI Core.
Instructions	Number of instructions.
Cycle	Number of clock cycles consumed by the channel.
Wait Cycles	Number of blocked cycles on the corresponding pipe.
Active Rate(%)	Percentage of the running cycles to the total cycles.
UB Core0	UB memory unit of core 0 in AI Core of the mix operator.
UB Core1	UB memory unit of core 1 in AI Core of the mix operator.
Vector core0	Vector computing unit.

When the operator type is vector, the parameters in the table pane are described in Table 12.

**Table 12** Parameters for the vector type
Parameter	Description
Cache	L2 cache.
Hit	Number of cache hits.
Miss	Number of times that the cache is reallocated after a cache miss.
Total	Total number of cache requests.
Hit Rate(%)	Cache hit rate.
HBM	High bandwidth memory unit.
Requests	Number of operations.
Throughput(GB/s)	Throughput, indicating the amount of data transferred per second by the channel, in GB/s.
Pipe	Computing channel.
Instructions	Number of instructions.
Cycle	Number of clock cycles consumed by the channel.
Wait Cycles	Number of blocked cycles on the corresponding pipe.
Active Rate(%)	Percentage of the running cycles to the total cycles.
UB	UB memory unit.
Vector	Vector computing unit.
Peak(%)	Ratio of the actual bandwidth to the theoretical bandwidth.

When the operator type is cube, the parameters in the table pane are described in Table 13.

**Table 13** Parameters for the cube type
Parameter	Description
Cache	L2 cache.
Hit	Number of cache hits.
Miss	Number of times that the cache is reallocated after a cache miss.
Total	Total number of cache requests.
Hit Rate(%)	Cache hit rate.
Cube	Cube computing unit.
HBM	High bandwidth memory unit.
L0A	L0A memory unit.
L0B	L0B memory unit.
L0C	L0C memory unit.
L1	L1 memory unit.
Requests	Number of operations.
Throughput(GB/s)	Throughput, indicating the amount of data transferred per second by the channel, in GB/s.
Peak(%)	Ratio of the actual bandwidth to the theoretical bandwidth.
Pipe	Computing channel.
Instructions	Number of instructions.
Cycle	Number of clock cycles consumed by the channel.
Wait Cycles	Number of blocked cycles on the corresponding pipe.
Active Rate(%)	Percentage of the running cycles to the total cycles.
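The percentage fields in the tables above are all ratios of two counters. The sketch below shows one way to reproduce them from raw values. It assumes that Hit Rate(%) is Hit divided by Total, which the tables imply but do not state, and all function names and sample numbers are hypothetical.

```python
def percent(part, whole):
    """Return part/whole as a percentage; 0.0 when whole is zero."""
    return 0.0 if whole == 0 else 100.0 * part / whole


def hit_rate(hit, total):
    """Assumed definition: cache hits over total cache requests."""
    return percent(hit, total)


def peak_ratio(actual_bandwidth, theoretical_bandwidth):
    """Peak(%): actual bandwidth over theoretical bandwidth."""
    return percent(actual_bandwidth, theoretical_bandwidth)


def active_rate(running_cycles, total_cycles):
    """Active Rate(%): running cycles over total cycles."""
    return percent(running_cycles, total_cycles)


# Hypothetical counters for one sub block.
print("Hit Rate(%):", hit_rate(hit=75, total=100))
print("Peak(%):", peak_ratio(actual_bandwidth=400, theoretical_bandwidth=1600))
print("Active Rate(%):", active_rate(running_cycles=900, total_cycles=1000))
```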
