GUI Description

Function

The Details tab page displays Base Info, Core Occupancy, Roofline, Compute Workload Analysis, and Memory Workload Analysis. The analysis results are presented in charts and data panes.

GUI Display

Import the BIN file collected by msProf by referring to Importing Data. For details about how to obtain the file, see "msprof op" in Operator Tuning (msProf) in Operator Development Tool User Guide.

Only one BIN file can be imported at a time. Importing a folder of files is not supported.

The Details tab page consists of five areas: Base Info (area 1), Core Occupancy (area 2), Roofline (area 3), Compute Workload Analysis (area 4), and Memory Workload Analysis (area 5), as shown in Figure 1.

Figure 1 Details page
  • Area 1: Base Info. You can view the basic operator information, including the name, duration, and type. Table 1 describes the parameters.
    Table 1 Base Info parameters

    Field               Description
    Name                Operator name.
    Duration (μs)       Total operator duration.
    Op Type             Operator type. The options are mix, vector, cube, and AiCore.
    Device Id           Device ID.
    Pid                 Process ID.
    Block Dim           Number of sub blocks. This parameter is used when the operator type is vector, cube, or AiCore.
    Mix Block Dim       Number of sub blocks. This parameter is used when the operator type is mix.
    Block Detail        Duration details of sub blocks. This parameter is used when the operator type is vector, cube, or AiCore. Table 2 describes the fields.
    Mix Block Detail    Duration details of sub blocks. This parameter is used when the operator type is mix. Table 3 describes the fields.

    Table 2 Block Detail fields

    Field               Description
    Block ID            Sub block ID. This parameter is not available when the operator type is AiCore.
    Core Type           Sub block type.
    Duration (μs)       Duration of sub blocks.

    Table 3 Mix Block Detail fields

    Field                    Description
    Block ID                 Sub block ID.
    Cube0 Duration (μs)      Duration of the cube core in AI Core.
    Vector0 Duration (μs)    Duration of one vector core in AI Core.
    Vector1 Duration (μs)    Duration of another vector core in AI Core.
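
The way Table 1 ties fields to the operator type, and the per-core durations of Table 3, can be sketched in Python as follows. This is a minimal illustration and not part of the tool: block_fields, MixBlockRow, and slowest_block are hypothetical names, and taking the longest-running core as the duration of a mix sub block is an assumption.

```python
from dataclasses import dataclass

# Op Type values listed in Table 1.
OP_TYPES = ("mix", "vector", "cube", "AiCore")


def block_fields(op_type: str) -> tuple[str, str]:
    """Return the (dim field, detail field) pair that applies to an operator type."""
    if op_type not in OP_TYPES:
        raise ValueError(f"unknown operator type: {op_type!r}")
    if op_type == "mix":
        return ("Mix Block Dim", "Mix Block Detail")
    return ("Block Dim", "Block Detail")


@dataclass
class MixBlockRow:
    """One row of Mix Block Detail (Table 3); durations in microseconds."""
    block_id: int
    cube0_us: float
    vector0_us: float
    vector1_us: float

    def longest_core_us(self) -> float:
        # Assumption: the sub block is as slow as its longest-running core.
        return max(self.cube0_us, self.vector0_us, self.vector1_us)


def slowest_block(rows: list[MixBlockRow]) -> MixBlockRow:
    """Pick the sub block whose longest-running core took the most time."""
    return max(rows, key=MixBlockRow.longest_core_us)
```

A call such as slowest_block(rows) over rows read from the Mix Block Detail pane gives a starting point for checking whether a single sub block dominates the total duration.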

  • Area 2: Core Occupancy. The inter-core load is displayed and analyzed based on the number of clock cycles, total core throughput, and cache hit rate, as shown in Figure 2.
    Developers can select Cycles, Throughput, or Cache Hit Rate(%) to display the core usage and analysis result, helping them locate and analyze exceptions.
    Figure 2 Core Occupancy
    • This module is supported only for the profile data exported from Atlas A3 Training Series Product and Atlas A2 Training Series Product/Atlas 800I A2 Inference Product.
    • The more similar the colors of computing units of the same type, the better the load is balanced across cores. Colors are compared among the cube computing units of all cores and, separately, among the vector computing units of all cores.
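
One way to put a number on the color comparison above is a coefficient of variation over the per-core values of one unit type. The sketch below is an assumption for illustration only: the guide does not state how values are mapped to colors, and imbalance and the sample cycle counts are hypothetical.

```python
from statistics import mean, pstdev


def imbalance(values: list[float]) -> float:
    """Coefficient of variation (population std / mean); 0.0 means perfectly even."""
    if not values:
        raise ValueError("no per-core values given")
    avg = mean(values)
    if avg == 0:
        return 0.0
    return pstdev(values) / avg


# Hypothetical per-core cycle counts, grouped by unit type as in the GUI:
# cube units are compared with cube units, vector units with vector units.
cube_cycles = [1000.0, 1000.0, 1000.0, 1000.0]
vector_cycles = [500.0, 1500.0, 500.0, 1500.0]

print(f"cube imbalance:   {imbalance(cube_cycles):.2f}")
print(f"vector imbalance: {imbalance(vector_cycles):.2f}")
```

A value near 0 corresponds to similar colors (balanced load); larger values suggest that some cores carry noticeably more work than others.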
  • Area 3: Roofline. Developers can use the Roofline model graph to view operator performance and use the analysis results as a basis for performance optimization. In the Roofline model graph, the X axis represents the arithmetic intensity (Ops/Byte), that is, the number of operations performed per byte of memory accessed. The Y axis represents the performance (TOPS/s), that is, the number of operations completed per second, in trillions.
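
For orientation, the two axes are related through the standard Roofline bound: attainable performance is the smaller of the compute peak and the memory bandwidth multiplied by the arithmetic intensity. The sketch below uses hypothetical peak and bandwidth values; the function names are illustrative and are not part of msProf or the GUI.

```python
def arithmetic_intensity(operations: float, bytes_moved: float) -> float:
    """Ops/Byte: operations performed per byte of memory accessed."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return operations / bytes_moved


def attainable_tops(intensity: float, peak_tops: float, bandwidth_tb_s: float) -> float:
    """Roofline bound in tera-operations per second.

    Bandwidth (TB/s) times intensity (Ops/Byte) gives the memory-bound roof;
    the compute peak caps it.
    """
    if intensity < 0 or peak_tops <= 0 or bandwidth_tb_s <= 0:
        raise ValueError("inputs must be non-negative (intensity) or positive")
    return min(peak_tops, bandwidth_tb_s * intensity)


def ridge_point(peak_tops: float, bandwidth_tb_s: float) -> float:
    """Intensity at which the memory roof meets the compute roof."""
    return peak_tops / bandwidth_tb_s


# Hypothetical hardware numbers, chosen only to make the arithmetic easy.
PEAK_TOPS = 100.0
BANDWIDTH_TB_S = 1.0

for ai in (10.0, 200.0):
    bound = attainable_tops(ai, PEAK_TOPS, BANDWIDTH_TB_S)
    kind = "memory-bound" if ai < ridge_point(PEAK_TOPS, BANDWIDTH_TB_S) else "compute-bound"
    print(f"intensity {ai:>6.1f} Ops/Byte -> at most {bound:.1f} TOPS ({kind})")
```

An operator plotted left of the ridge point is limited by memory bandwidth, and one plotted right of it is limited by compute throughput.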
    The Roofline model graph displays the computing power name, which lists the instruction types behind the maximum computing power and their proportions. For example, Cube_INT(100.000000%) + Vec_FP16(30.000000%) + Vec_FP32(70.000000%) indicates that cube computing units process only INT instructions, and vector computing units process 30% FP16 instructions and 70% FP32 instructions.
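
A computing power name in this format can be split into its terms with a short regular expression. The sketch below is an illustration only: parse_power_name is a hypothetical helper, and it assumes that every term has the form Unit_Type(percentage%), whatever separator appears between terms.

```python
import re

# One term of the computing power name, for example "Vec_FP16(30.000000%)".
_TERM = re.compile(
    r"(?P<unit>[A-Za-z0-9]+)_(?P<dtype>[A-Za-z0-9]+)\((?P<pct>\d+(?:\.\d+)?)%\)"
)


def parse_power_name(name: str) -> dict[str, dict[str, float]]:
    """Group the percentages in a computing power name by computing unit."""
    result: dict[str, dict[str, float]] = {}
    for match in _TERM.finditer(name):
        unit_shares = result.setdefault(match["unit"], {})
        unit_shares[match["dtype"]] = float(match["pct"])
    return result


name = "Cube_INT(100.000000%) + Vec_FP16(30.000000%) + Vec_FP32(70.000000%)"
for unit, shares in parse_power_name(name).items():
    print(unit, shares)
```

Grouping by unit makes it easy to check that the shares of each computing unit add up to 100%.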

    This module is supported only by the Atlas A2 Training Series Product/Atlas 800I A2 Inference Product, Atlas A3 Training Series Product, and Atlas Inference Series Product.

    • If the hardware product is Atlas A3 Training Series Product or Atlas A2 Training Series Product/Atlas 800I A2 Inference Product, the Roofline performance model analysis includes the Memory Unit, Memory Transfer, and Pipeline tab pages.

      Memory Unit: displays the HBM/L2 and Memory Unit model graphs, as shown in Figure 3. Table 4 describes the parameters.

      Figure 3 Memory Unit
      Table 4 Memory Unit parameters

      Parameter           Description
      HBM Read + Write    Read and write of the high bandwidth memory unit.
      L2 Read + Write     Read and write of the L2 memory unit.
      L1 Read + Write     Read and write of the L1 memory unit.
      Write to L1         Write to the L1 memory unit.
      Read from L1        Read from the L1 memory unit.
      Write to L0A        Write to the L0A memory unit.
      Write to L0B        Write to the L0B memory unit.
      Read from L0C       Read from the L0C memory unit.
      UB Read + Write     Read and write of the UB memory unit.
      Read from UB        Read from the UB memory unit.
      Write to UB         Write to the UB memory unit.
      Vector Read UB      Read from the UB memory unit by the vector unit.
      Vector Write UB     Write to the UB memory unit by the vector unit.

      Memory Transfer: displays the memory transfer path, as shown in Figure 4. Table 5 describes the parameters.

      Figure 4 Memory Channel
      Table 5 Memory Channel parameters

      Parameter       Description
      GM/L1 to L0A    Memory channel from GM/L1 to L0A.
      GM/L1 to L0B    Memory channel from GM/L1 to L0B.
      L0C to GM       Memory channel from L0C to GM.
      L1 to GM        Memory channel from L1 to GM.
      L0C to L1       Memory channel from L0C to L1.
      GM to UB        Memory channel from GM to UB.
      UB to GM        Memory channel from UB to GM.

      Pipeline: displays the pipeline model graph, as shown in Figure 5. Table 6 describes the parameters.

      Figure 5 Transport Unit
      Table 6 Transfer Unit parameters

      Parameter       Description
      MTE1            MTE1 channel.
      MTE2            MTE2 channel.
      MTE3            MTE3 channel.
      FIXP            FIXP channel.
      MTE2 vector     MTE2 channel of the vector computing unit.
      MTE3 vector     MTE3 channel of the vector computing unit.

    • If the hardware product is Atlas Inference Series Product, only the Memory Unit model graph is available, as shown in Figure 6. Table 7 describes the parameters.
      Figure 6 Memory unit model graph
      Table 7 Memory Unit parameters

      Parameter           Description
      L1 Read + Write     Read and write of the L1 memory unit.
      Read from L0C       Read from the L0C memory unit.
      Read from L1        Read from the L1 memory unit.
      Read from UB        Read from the UB memory unit.
      UB Read + Write     Read and write of the UB memory unit.
      Vector Read UB      Read from the UB memory unit by the vector unit.
      Vector Write UB     Write to the UB memory unit by the vector unit.
      Write to L0A        Write to the L0A memory unit.
      Write to L0B        Write to the L0B memory unit.
      Write to L1         Write to the L1 memory unit.
      Write to UB         Write to the UB memory unit.

  • Area 4: Compute Workload Analysis. A bar chart and a data pane present the information needed to analyze the compute workload, as shown in Figure 7. Table 8 describes the fields. The content indicated by the icon is the compute workload analysis result of each block.
    Figure 7 Compute Workload Analysis
    Table 8 Compute Workload Analysis parameters

    Parameter           Description
    Block ID            Sub block ID. You can switch the block ID to view the corresponding information. When the operator type is AiCore, this parameter is displayed as NA, and the multi-core average value is displayed.
    Pipe Utilization    Pipe (instruction queue) visualization. It is displayed in a bar chart.
                        • Horizontal coordinate: Cycles percentage, calculated as follows: Cycles/Total cycles. Cycles indicates the clock cycles consumed by the instruction execution on the sub block.
                        • Vertical coordinate: operator instructions, provided by the data in the BIN file.
    CUBE                Name of a cube instruction. This parameter is displayed when the operator type is cube.
    CUBE0               Name of a cube instruction. This parameter is displayed when the operator type is mix.
    VECTOR              Name of a vector instruction. This parameter is displayed when the operator type is vector.
    VECTOR0             Name of a vector instruction. This parameter is displayed when the operator type is mix.
    VECTOR1             Name of a vector instruction. This parameter is displayed when the operator type is mix.
    AICORE              Name of an AI Core instruction. This parameter is displayed when the operator type is AiCore.
    Instructions        Number of operator instructions.
    Duration(μs)        Duration of operator instructions.
    Data Volume(byte)   Operator instruction data volume.
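
The horizontal coordinate of the Pipe Utilization chart, Cycles/Total cycles expressed as a percentage, can be reproduced as in the following sketch. The instruction names and cycle counts are hypothetical, pipe_utilization is an illustrative helper, and it assumes that Total cycles is the total for the selected sub block.

```python
def pipe_utilization(cycles_by_instruction: dict[str, int],
                     total_cycles: int) -> dict[str, float]:
    """Return Cycles / Total cycles as a percentage for each instruction."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return {
        name: 100.0 * cycles / total_cycles
        for name, cycles in cycles_by_instruction.items()
    }


# Hypothetical instruction names and cycle counts for one sub block.
sample = {"instr_a": 250, "instr_b": 600}
for name, pct in pipe_utilization(sample, total_cycles=1000).items():
    print(f"{name}: {pct:.1f}%")
```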

  • Area 5: Memory Workload Analysis. You can view the memory workload analysis information in the memory heatmap and data pane, as shown in Figure 8. Table 9 describes the parameters. The Peak legend on the left of the heatmap gives the arrow colors; its values are the peak bandwidth ratio (maximum bandwidth ratio). The content indicated by the icon is the memory workload analysis result of each block.
    Figure 8 Memory Workload Analysis
    Table 9 Memory Workload Analysis parameters

    Parameter           Description
    Block ID            Sub block ID. You can select the sub block to be viewed from the Block ID drop-down list. When the operator type is AiCore, Block ID is displayed as NA, and the multi-core average value is displayed.
    Show As             Optional. Selects what the flow arrows of the heatmap display: the number of requests or the bandwidth. The arrows on the heatmap indicate the flow direction.
                        • Num of Request
                        • Bandwidth

    The content of the data pane varies with the operator type and is parsed from the BIN file. The details are as follows:

    • When the operator type is AiCore, the parameters in the table pane are described in Table 10.
      Table 10 Parameters for the AiCore type

      Parameter           Description
      Cache               L2 cache.
      Cube                Cube computing unit.
      HBM                 High bandwidth memory unit.
      L0A                 L0A memory unit.
      L0B                 L0B memory unit.
      L0C                 L0C memory unit.
      L1                  L1 memory unit.
      Pipe                Computing channel.
      UB                  UB memory unit.
      Vector              Vector computing unit.
      Requests            Number of operations.
      Throughput(GB/s)    Throughput, indicating the amount of data transferred per second by the channel, in GB/s.
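
As a unit check on Throughput(GB/s), the following sketch converts a byte count and a duration in microseconds into GB/s. It is an illustration only: throughput_gb_per_s is a hypothetical helper, the sample numbers are made up, and it assumes 1 GB = 10^9 bytes, which the guide does not state.

```python
def throughput_gb_per_s(data_bytes: float, duration_us: float) -> float:
    """Data transferred per second, in GB/s (assuming 1 GB = 1e9 bytes).

    1 byte/us equals 1e6 bytes/s, that is, 1e-3 GB/s, so the result is
    (bytes / microseconds) / 1000.
    """
    if duration_us <= 0:
        raise ValueError("duration_us must be positive")
    return data_bytes / (duration_us * 1000.0)


# Hypothetical transfer: 2,000,000 bytes moved in 100 microseconds.
print(f"{throughput_gb_per_s(2_000_000, 100.0):.1f} GB/s")
```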

    • When the operator type is mix, the parameters in the table pane are described in Table 11.
      Table 11 Parameters for the mix type

      Parameter           Description
      Cache               L2 cache.
      Hit                 Number of cache hits.
      Miss                Number of times that the cache is reallocated after a cache miss.
      Total               Total number of cache requests.
      Hit Rate(%)         Cache hit rate.
      Cube                Cube computing unit.
      HBM Cube            High bandwidth memory unit of the cube unit.
      HBM Vector Core0    High bandwidth memory unit of the vector unit of core 0 in AI Core.
      HBM Vector Core1    High bandwidth memory unit of the vector unit of core 1 in AI Core.
      L0A                 L0A memory unit.
      L0B                 L0B memory unit.
      L0C                 L0C memory unit.
      L1                  L1 memory unit.
      Requests            Number of operations.
      Throughput(GB/s)    Throughput, indicating the amount of data transferred per second by the channel, in GB/s.
      Peak(%)             Ratio of the actual bandwidth to the theoretical bandwidth.
      Pipe Cube           Computing channel of the cube unit.
      Pipe Vector Core0   Computing channel of the vector unit of core 0 in AI Core.
      Pipe Vector Core1   Computing channel of the vector unit of core 1 in AI Core.
      Instructions        Number of instructions.
      Cycle               Clock cycles consumed by the channel.
      Wait Cycles         Number of blocked cycles on the corresponding pipe.
      Active Rate(%)      Percentage of running cycles in the total cycles.
      UB Core0            UB memory unit of core 0 in AI Core of the mix operator.
      UB Core1            UB memory unit of core 1 in AI Core of the mix operator.
      Vector core0        Vector computing unit.

    • When the operator type is vector, the parameters in the table pane are described in Table 12.
      Table 12 Parameters for the vector type

      Parameter           Description
      Cache               L2 cache.
      Hit                 Number of cache hits.
      Miss                Number of times that the cache is reallocated after a cache miss.
      Total               Total number of cache requests.
      Hit Rate(%)         Cache hit rate.
      HBM                 High bandwidth memory unit.
      Requests            Number of operations.
      Throughput(GB/s)    Throughput, indicating the amount of data transferred per second by the channel, in GB/s.
      Pipe                Computing channel.
      Instructions        Number of instructions.
      Cycle               Clock cycles consumed by the channel.
      Wait Cycles         Number of blocked cycles on the corresponding pipe.
      Active Rate(%)      Percentage of running cycles in the total cycles.
      UB                  UB memory unit.
      Vector              Vector computing unit.
      Peak(%)             Ratio of the actual bandwidth to the theoretical bandwidth.

    • When the operator type is cube, the parameters in the table pane are described in Table 13.
      Table 13 Parameters for the cube type

      Parameter           Description
      Cache               L2 cache.
      Hit                 Number of cache hits.
      Miss                Number of times that the cache is reallocated after a cache miss.
      Total               Total number of cache requests.
      Hit Rate(%)         Cache hit rate.
      Cube                Cube computing unit.
      HBM                 High bandwidth memory unit.
      L0A                 L0A memory unit.
      L0B                 L0B memory unit.
      L0C                 L0C memory unit.
      L1                  L1 memory unit.
      Requests            Number of operations.
      Throughput(GB/s)    Throughput, indicating the amount of data transferred per second by the channel, in GB/s.
      Peak(%)             Ratio of the actual bandwidth to the theoretical bandwidth.
      Pipe                Computing channel.
      Instructions        Number of instructions.
      Cycle               Clock cycles consumed by the channel.
      Wait Cycles         Number of blocked cycles on the corresponding pipe.
      Active Rate(%)      Percentage of running cycles in the total cycles.
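
The percentage columns in the tables above (Hit Rate(%), Peak(%), and Active Rate(%)) are all ratios of two counters. The sketch below shows the arithmetic under stated assumptions: the helper names and sample values are hypothetical, the hit rate is taken as Hit divided by Total, and the running cycles are passed in explicitly because the guide does not say how they are derived from Cycle and Wait Cycles.

```python
def percent(part: float, whole: float) -> float:
    """Return part / whole as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


def hit_rate(hits: int, total_requests: int) -> float:
    """Hit Rate(%): cache hits over total cache requests (assumed definition)."""
    return percent(hits, total_requests)


def peak_ratio(actual_gb_s: float, theoretical_gb_s: float) -> float:
    """Peak(%): actual bandwidth over theoretical bandwidth."""
    return percent(actual_gb_s, theoretical_gb_s)


def active_rate(running_cycles: int, total_cycles: int) -> float:
    """Active Rate(%): running cycles over total cycles."""
    return percent(running_cycles, total_cycles)


# Hypothetical counters, for illustration only.
print(f"Hit Rate:    {hit_rate(900, 1000):.1f}%")
print(f"Peak:        {peak_ratio(300.0, 1200.0):.1f}%")
print(f"Active Rate: {active_rate(750, 1000):.1f}%")
```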