Tool Usage

The msProf tool provides two usage modes, msprof op and msprof op simulator, which help you locate issues in operator memory, operator code, and operator instructions for comprehensive operator tuning. Table 1 describes the two modes.

Table 1 msprof op and msprof op simulator functions

Function name: msprof op

  • Application scenario: Performance analysis in the actual operating environment. It helps you locate operator memory and performance bottlenecks.
  • Usage: Analyzes running operators without additional configuration, which makes it suitable for quickly locating operator performance issues in the board environment.
  • Displayed graphs: Computing Memory Heatmap, Roofline Bottleneck Analysis Chart, Cache Heatmap, Communication and Computing Pipeline Chart, and Operator Code Hot Spot Map.
  • NOTE: To enable cache heatmap redirection, configure it by referring to Configuration of msprof op.

Function name: msprof op simulator

  • Application scenario: Detailed simulation tuning during the development and debugging phases. It helps you analyze operator instructions and code hotspots.
  • Usage: Requires you to configure environment variables (such as LD_LIBRARY_PATH) and compilation options (such as adding -g to generate debugging information) by referring to Configurations of msprof op simulator. It is suitable for analyzing operator behavior in detail in the simulation environment.
  • Displayed graphs: Instruction Pipeline Chart, Operator Code Hot Spot Map, and Memory Channel Throughput Waveform.
  • NOTE: The msprof op simulator results shown in this document are for reference only. The actual running status of an operator is subject to the actual simulation data.

  • The msProf tool depends on the msopprof executable file in the CANN package. The API usage in this file is the same as that in msprof op. This file is provided by the CANN package and does not need to be installed separately.
  • Do not start more than one profile data collection task on the same device at the same time.
  • Before using msprof op or msprof op simulator, ensure that the application runs properly.
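
The last two notes can be enforced with a small wrapper. The following is a minimal sketch and not part of the msProf tool: the function name with_device_lock, the lock directory location, and passing the device ID by hand are all assumptions made for illustration. It relies on mkdir being atomic, so it only guards against collection tasks that are started through this same wrapper on the same host.

```shell
#!/bin/sh
# Hypothetical helper (not provided by msProf): run a command while holding a
# per-device lock, so that a second collection task on the same device is
# refused. The lock path and the device ID argument are assumptions.
with_device_lock() {
    device_id=$1
    shift
    lock_dir="${TMPDIR:-/tmp}/msprof_op_device_${device_id}.lock"
    # mkdir is atomic: it fails if another run already created the directory.
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "device ${device_id} is already being profiled" >&2
        return 1
    fi
    "$@"
    status=$?
    rmdir "$lock_dir"
    return "$status"
}

# Usage (not executed here): run the application once on its own to confirm
# that it works, then profile it under the lock for device 0.
#   ./add_custom_npu && with_device_lock 0 msprof op ./add_custom_npu
```

If a run is killed before rmdir executes, the lock directory must be removed by hand before the next collection.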

msprof op

  1. Log in to the operating environment and run msprof op with the optional parameters followed by the application (app [arguments]) to enable on-board operator tuning. For details about the optional parameters, see Table 2. An example command is as follows:
    msprof op --output=$HOME/projects/output $HOME/projects/MyApp/out/main  # --output is optional. $HOME/projects/MyApp/out/main is the application to be tuned.
  2. Perform operator tuning in either of the following ways:
    • Executable file-based method
      • Single-operator scenario (using add_custom_npu as an example)
        Example 1:
        msprof op ./add_custom_npu
        Example 2:
        msprof op --aic-metrics=<select_metrics> --output=./output_data ./add_custom_npu 
      • Multi-operator scenario
        If the test executable contains the Add, MatMul, and Sub operators, you can use --launch-count and --kernel-name to collect data only for the Add and Sub operators.
        msprof op --launch-count=10 --kernel-name="Add|Sub" --output=./output_data ./test  # ./test is the user binary file and must be placed at the end of the command.
    • JSON configuration file-based method, which takes the operator binary file (*.o) as input. For details, see JSON Configuration File Description.
      msprof op --config=./add_test.json --aic-metrics=<select_metrics> --output=./output_data
  3. After the command is executed, a folder named OPPROF_{timestamp}_XXX is generated in the default path or the specified --output directory. When all --aic-metrics are enabled, the structure is as follows:
    • Collecting data in the multi-device multi-operator scenario.

      When tuning MC2 or LCCL fused operators in multi-device parallel mode, several subdirectories named after device IDs will exist in the result directory, depending on the specified number of NPUs. The tuning results of each NPU are stored in the corresponding device ID directory.

      OPPROF_{timestamp}_XXX
      ├── device0                      // ID of the Ascend AI Processor used during running
      └── device1
          ├── OpName0                  // OpName0 is the name of the collected operator.
          │   └── 0                    // Sequence in which operators are scheduled.
          │       ├── dump             // Folder for storing the process files. It has the same meaning as in single-operator collection.
          │       ├── xxx_yyy.csv      // xxx indicates the type of the metric generated by an operator, for example, L2Cache. For details about the metric types, see the description of the .csv files. yyy indicates the timestamp suffix of the .csv file, for example, L2Cache_20240603022812284.csv.
          │       └── visualize_data.bin
          ├── OpName1
          │   └── 0
          │       ├── dump
          │       ├── xxx_yyy.csv
          │       └── visualize_data.bin
          └── OpName2
              └── 0
                  ├── dump
                  ├── xxx_yyy.csv
                  ├── visualize_data.bin
                  └── trace.json       // Applicable only to MC2 and LCCL operators.
      
    • Collecting data of multiple operators on a single device
      OPPROF_{timestamp}_XXX
      ├── OpName0                      // OpName0 is the name of the collected operator.
      │   ├── 0                        // Sequence in which operators are scheduled.
      │   │   ├── dump                 // Folder for storing the process files. It has the same meaning as in single-operator collection.
      │   │   ├── xxx_yyy.csv          // xxx indicates the type of the metric generated by an operator, for example, L2Cache. For details about the metric types, see the description of the .csv files. yyy indicates the timestamp suffix of the .csv file, for example, L2Cache_20240603022812284.csv.
      │   │   └── visualize_data.bin
      │   └── 1
      │       ├── dump
      │       ├── xxx_yyy.csv
      │       └── visualize_data.bin
      └── OpName1
          └── 0
              ├── dump
              ├── xxx_yyy.csv
              └── visualize_data.bin
      
    • Collecting data of a single operator on a single device
      OPPROF_{timestamp}_XXX
      ├── dump
      ├── ArithmeticUtilization.csv
      ├── L2Cache.csv
      ├── Memory.csv
      ├── MemoryL0.csv
      ├── MemoryUB.csv
      ├── OpBasicInfo.csv
      ├── PipeUtilization.csv
      ├── ResourceConflictRatio.csv
      └── visualize_data.bin
      
    Table 2 msprof op files

    • dump folder: Raw profile data, which can be ignored.
    • ArithmeticUtilization.csv: Time consumptions and ratios of Cube and Vector instructions. For details, see ArithmeticUtilization (Time Consumptions and Percentages of Cube and Vector Instructions).
    • L2Cache.csv: L2 cache hit ratio. For details, see L2Cache (L2 Cache Hit Ratio).
    • Memory.csv: UB/L1/L2/main memory read/write bandwidth rate. For details, see Memory (Memory Read/Write Bandwidth Rate).
    • MemoryL0.csv: L0A/L0B/L0C memory read/write bandwidth rate. For details, see MemoryL0 (L0 Read/Write Bandwidth Rate).
    • MemoryUB.csv: MTE/Vector/Scalar UB read/write bandwidth rate. For details, see MemoryUB (UB Read/Write Bandwidth Rate).
    • PipeUtilization.csv: Time consumptions and ratios of compute units and MTE units. For details, see PipeUtilization (Percentages of Time Taken by Compute Units and MTEs).
    • ResourceConflictRatio.csv: Ratios of bank groups, bank conflicts, and resource conflicts in the UB to all instructions. For details, see ResourceConflictRatio (Resource Conflict Ratio).
    • OpBasicInfo.csv: Basic operator information, including the operator names, block dim, and time consumptions. For details, see OpBasicInfo (Basic Operator Information).
    • visualize_data.bin: File for visualizing basic operator information, compute unit load, hotspot functions, and Roofline bottleneck analysis. For details, see Computing Memory Heatmap, Roofline Bottleneck Analysis Chart, Cache Heatmap, Communication and Computing Pipeline Chart, and Operator Code Hot Spot Map.
    • trace.json: File for visualizing the communication and computing pipelines. For details about how to display the file in the Chrome browser, see Communication and Computing Pipeline Chart.

  4. Import the visualize_data.bin file into MindStudio Insight to display the Computing Memory Heatmap, Roofline Bottleneck Analysis Chart, Cache Heatmap, Communication and Computing Pipeline Chart, and Operator Code Hot Spot Map.
  5. Import the trace.json file into the Chrome browser or MindStudio Insight to display the Communication and Computing Pipeline Chart.
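
Because the result layout in step 3 differs between the single-operator, multi-operator, and multi-device cases, it can help to list the report files recursively instead of hard-coding paths. The following is a minimal sketch and not part of msProf: the function name list_opprof_reports is hypothetical, and it only assumes the file names shown in the directory structures above (*.csv, visualize_data.bin, trace.json, and a dump folder that can be ignored).

```shell
#!/bin/sh
# Hypothetical helper (not provided by msProf): print every report file under
# a result directory, skipping the raw data inside any folder named dump.
list_opprof_reports() {
    result_dir=$1
    find "$result_dir" -type d -name dump -prune -o \
        -type f \( -name '*.csv' -o -name 'visualize_data.bin' -o -name 'trace.json' \) \
        -print | LC_ALL=C sort
}

# Usage (the directory name is a placeholder, not a real path):
#   list_opprof_reports ./output_data/OPPROF_{timestamp}_XXX
```

The sorted output groups the files by device and operator directory, which makes it easy to see which visualize_data.bin or trace.json file belongs to which operator before importing it.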

msprof op simulator

The operator tuning tool supports profile data collection and automatic parsing in a simulation environment.

  • The collection of MC2 and HCCL operators is not supported in the simulation environment.
  • The number of simulation cores set by the user cannot exceed the number of physical cores.
  • To tune the performance of specific operator code on a single core of Atlas A3 training products/Atlas A3 inference products, Atlas inference products, or Atlas A2 training products/Atlas A2 inference products, call the TRACE_START and TRACE_STOP APIs in the operator code and add -DASCENDC_TRACE_ON to the compilation configuration file, as described in the next item. The tool can then generate the pipeline chart. For details about the chart content, see Instruction Pipeline Chart.
  • You need to add -DASCENDC_TRACE_ON to the compilation configuration file. For details, see the following sample project.
    For the AddKernelInvocationNeo operator project, add the following code to the ${git_clone_path}/samples/operator/ascendc/0_introduction/3_add_kernellaunch/AddKernelInvocationNeo/cmake/npu_lib.cmake file:
    ascendc_compile_definitions(
        ...
        -DASCENDC_TRACE_ON
    )
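
Before starting a simulation run, you may want to confirm that the definition was actually added. The following is a minimal sketch and not part of msProf: the function name has_trace_define is hypothetical, and the check is purely textual. It reports whether -DASCENDC_TRACE_ON appears on a non-comment line of the given CMake file, and does not verify that the definition sits inside ascendc_compile_definitions.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not provided by msProf): succeed only if the
# given CMake file mentions -DASCENDC_TRACE_ON outside of a line comment.
has_trace_define() {
    cmake_file=$1
    # Strip CMake line comments first, then search for the definition.
    sed 's/#.*$//' "$cmake_file" | grep -q -e '-DASCENDC_TRACE_ON'
}

# Usage (the path is the sample project file named above):
#   has_trace_define "${git_clone_path}/samples/operator/ascendc/0_introduction/3_add_kernellaunch/AddKernelInvocationNeo/cmake/npu_lib.cmake" \
#       || echo "add -DASCENDC_TRACE_ON before running msprof op simulator" >&2
```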
  1. Log in to the operating environment and run msprof op simulator with the optional simulation parameters followed by the application to be tuned (app [arguments]) to enable operator simulation tuning. For details about the optional simulation parameters, see Table 3. You can use either of the following methods:
    • Executable file-based method
      • Single-operator scenario (using add_custom_npu as an example)
        msprof op simulator --soc-version=Ascendxxxyy --output=./output_data ./add_custom_npu  # xxxyy indicates the type of the processor in use.
      • Multi-operator scenario
        If the test executable contains the Add, MatMul, and Sub operators, you can use --launch-count and --kernel-name to collect data only for the Add and Sub operators.
        msprof op simulator --soc-version=Ascendxxxyy --launch-count=10 --kernel-name="Add|Sub" --output=./output_data ./test  # xxxyy indicates the type of the processor in use. ./test must be placed at the end of the command.
    • Method based on the JSON configuration file of the input operator binary file *.o

      When you use --config, the --soc-version parameter is not supported. Specify the simulator through the LD_LIBRARY_PATH environment variable instead.

      export LD_LIBRARY_PATH=${INSTALL_DIR}/tools/simulator/Ascendxxxyy/lib:$LD_LIBRARY_PATH  # xxxyy indicates the processor type.
      msprof op simulator --config=./add_test.json --output=./output_data
  2. After the command is executed, a folder named OPPROF_{timestamp}_XXX is generated in the specified --output directory. An example of the folder structure is as follows:
    • Collecting data of a single operator
      OPPROF_{timestamp}_XXX
      ├── dump
      └── simulator
          ├── core0.veccore0       // The data files of each core are stored in the core*.veccore* or core*.cubecore* directory.
          │   ├── core0.veccore0_code_exe.csv
          │   ├── core0.veccore0_instr_exe.csv
          │   └── trace.json       // Simulation instruction pipeline chart file of the core.
          ├── core0.veccore1
          │   ├── core0.veccore1_code_exe.csv
          │   ├── core0.veccore1_instr_exe.csv
          │   └── trace.json
          ├── core1.veccore0
          │   ├── core1.veccore0_code_exe.csv
          │   ├── core1.veccore0_instr_exe.csv
          │   └── trace.json
          ├── ...
          ├── visualize_data.bin
          └── trace.json           // Simulation instruction pipeline chart file of all cores.
      
    • Collecting data of multiple operators
      OPPROF_{timestamp}_XXX
      ├── OpName1                  // OpName1 is the name of the operator to be collected.
      │   ├── 0                    // Sequence in which operators are scheduled.
      │   │   ├── dump             // Folder for storing the process files. The meaning of this parameter is the same as that in single-operator collection.
      │   │   └── simulator        // The content is the same as that in the single-operator simulator folder, but the .csv files in the simulator folder have timestamp suffixes added, for example, core*_code_exe_20240429111143146.csv.
      │   ├── 1
      │   │   ├── dump
      │   │   └── simulator
      │   └── dump                 // Folder that stores the process files.
      └── OpName2
          ├── 0
          │   ├── dump
          │   └── simulator
          └── dump
      
    Table 3 msprof op simulator files

    • dump folder: Folder for storing the raw dump data generated by the simulation.
    • simulator folder: Folder for storing the analysis results of the dump data files.
    • core*_code_exe.csv: Time consumed by each line of code, where the asterisk (*) indicates cores 0 to n. This file helps you quickly identify the most time-consuming parts of your code. For details, see Code Line Time Consumption Data File.
    • core*_instr_exe.csv: Detailed information about code instructions, where the asterisk (*) indicates cores 0 to n. This file helps you quickly identify the most time-consuming instructions. For details, see Code Instruction Information File.
    • visualize_data.bin: File for visualizing the simulation pipelines and simulation hotspot functions. For details, see Instruction Pipeline Chart, Operator Code Hot Spot Map, and Memory Channel Throughput Waveform.
      NOTE: The generated visualize_data.bin file can be displayed in MindStudio Insight. For details, see MindStudio Insight User Guide.
    • trace.json: Simulation instruction pipeline chart file, including the subfile of each core and the summary file of all cores. For details, see Instruction Pipeline Chart and Memory Channel Throughput Waveform.

  3. Optional: Import the visualize_data.bin file into MindStudio Insight to display the Instruction Pipeline Chart, Operator Code Hot Spot Map, and Memory Channel Throughput Waveform.
  4. Import the trace.json file into the Chrome browser or MindStudio Insight to display the Instruction Pipeline Chart and Memory Channel Throughput Waveform.
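
For the --config method in step 1, the export line appends the previous LD_LIBRARY_PATH value unconditionally. If the variable was empty, the result ends with a colon, and the dynamic linker treats an empty entry as the current working directory. The following is a minimal sketch that avoids this. The function name prepend_simulator_lib is hypothetical, and the directory layout is taken from the export line in step 1, with Ascendxxxyy left as the same placeholder used there.

```shell
#!/bin/sh
# Hypothetical helper (not provided by msProf): put the simulator library
# directory first in LD_LIBRARY_PATH without leaving an empty entry behind.
prepend_simulator_lib() {
    install_dir=$1    # CANN installation directory (INSTALL_DIR in step 1)
    soc=$2            # processor type, for example the placeholder Ascendxxxyy
    sim_lib="${install_dir}/tools/simulator/${soc}/lib"
    if [ -n "${LD_LIBRARY_PATH:-}" ]; then
        LD_LIBRARY_PATH="${sim_lib}:${LD_LIBRARY_PATH}"
    else
        LD_LIBRARY_PATH="${sim_lib}"
    fi
    export LD_LIBRARY_PATH
}

# Usage (not executed here):
#   prepend_simulator_lib "${INSTALL_DIR}" Ascendxxxyy
#   msprof op simulator --config=./add_test.json --output=./output_data
```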