Overview

The msProf performance analysis tool is used to collect and analyze key performance metrics of operators running on Ascend AI Processors. You can efficiently locate software and hardware performance bottlenecks of operators based on the output profile data, thereby enhancing the overall efficiency of operator performance analysis.

Profile data can currently be collected and automatically parsed based on various running modes (onboard or simulation) and file formats (executable files or operator binary .o files).

The msProf tool depends on the msopprof executable file in the CANN package. The API usage in this file is the same as that in msprof op. This file is provided by the CANN package and does not need to be installed separately.

Features

For details about how to use the operator tuning tool, see Tool Usage. For single-operator tuning, MindStudio Insight displays the computing memory heatmap, Roofline bottleneck analysis chart, cache heatmap, communication and computing pipeline chart (MC2 fused operator), instruction pipeline chart, operator code hot spot map, memory channel throughput waveform, and profile data files. For details, see Table 1.

**Table 1** msProf functions
Function	Link
Computing memory heatmap	Computing Memory Heatmap
Roofline bottleneck analysis chart	Roofline Bottleneck Analysis Chart
Cache heatmap	Cache Heatmap
Communication and computing pipeline chart	Communication and Computing Pipeline Chart
Instruction pipeline chart	Instruction Pipeline Chart
Operator code hot spot map	Operator Code Hot Spot Map
Memory channel throughput waveform	Memory Channel Throughput Waveform
Profile data file	Profile Data Files

After you press Ctrl+C, the operator execution stops, and the tool generates profile data files based on the information collected so far. If you do not need the files, press Ctrl+C again.
If the --output parameter is not specified, ensure that users in the group and other groups do not have the write permission on the parent directory of the current path.
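
The permission requirement above can be checked before the tool is started. The following is a minimal sketch, not part of msProf; the function names are hypothetical, and it assumes a POSIX system where permissions are expressed as mode bits.

```python
import os
import stat


def group_or_other_writable(path: str) -> bool:
    """Return True if the group or other users have write permission on path."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IWGRP | stat.S_IWOTH))


def parent_of_current_path() -> str:
    """Return the parent directory of the current working directory."""
    return os.path.dirname(os.getcwd())
```

For example, `group_or_other_writable(parent_of_current_path())` returning True would suggest tightening the directory permissions (for instance with `chmod go-w`) before running msprof op without --output.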

Commands

You need to ensure the execution security of executable files or applications.

You are advised to restrict the operation permission on executable files or applications to avoid privilege escalation risks.
You are not advised to perform high-risk operations (such as deleting files, deleting directories, changing passwords, and running privilege escalation commands) to avoid security risks.

msprof op mode

Log in to the operating environment and run msprof op [optional parameters] app [arguments]. For details about the optional parameters, see Table 2. An example command is as follows:

msprof op --output=$HOME/projects/output $HOME/projects/MyApp/out/main blockdim 1 //  --output is optional. $HOME/projects/MyApp/out/main is the application in use. blockdim 1 is an optional parameter of the application.
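
When the command is launched from a script, the ordering matters: msprof op options come first, then the application, then the application's own arguments. The sketch below assembles such an argument list. The helper name is hypothetical and only the --output option from the example above is modeled.

```python
def build_msprof_op_command(app, app_args=(), output=None):
    """Assemble the argv list for `msprof op [options] app [arguments]`."""
    cmd = ["msprof", "op"]
    if output is not None:
        # msprof op options must precede the application path.
        cmd.append(f"--output={output}")
    cmd.append(app)
    # Everything after the application is passed to the application itself.
    cmd.extend(app_args)
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` in an environment where msprof is on the PATH.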

**Table 2** Options of msprof op
Option	Description	Mandatory (Yes/No)
--application NOTE: This option is still accepted for compatibility and will be replaced by ./app [arguments] later.	You are advised to run *msprof op [msprof op parameters] ./app* to launch the application. app is the specified executable file. If no path is specified for app, the current path is used by default. NOTE: When using ./app, add the msprof op parameters before ./app to ensure that the related functions take effect.	Yes. Select either the specified executable file or --config.
--config	This option sets the .o binary file obtained by operator compilation. It can be set to an absolute path or a relative path. For details, see JSON Configuration File Description. NOTE: Before operator tuning, you can obtain the operator binary .o file in either of the following ways: Obtain the executable file on the NPU and extract the .o file from the executable file. For details, see "Kernel Launch". The .o file is automatically generated during operator compilation. For details, see Compiling and Deploying Operators. Ensure that users in the group and other groups do not have the write permission on the JSON file specified by --config and the parent directory. In addition, ensure that the owner of the parent directory of the JSON file is the current user.
--kernel-name	This option specifies the name of the operator whose data is to be collected. Fuzzy match using the operator name prefix is supported. If this option is not specified, only data of the first operator scheduled during program running is collected. NOTE: This option must be used together with --application. The value contains a maximum of 1024 characters. Only one or more characters in A-Za-z0-9_ are supported. If multiple operators need to be collected, use vertical bars (|) to combine them. For example, --kernel-name="add|abs" indicates that operators whose prefixes are add and abs are collected. The number of operators to be collected is determined by the value of --launch-count. The wildcard (*) can be used to match strings of any length.	No
--launch-count	This option sets the maximum number of operators that can be collected. The value is an integer ranging from 1 to 5000. The default value is 1.	No
--launch-skip-before-match	This option specifies the number of operators to skip, counted from the first operator launched. Data collection begins only after the specified number of operators has been skipped. NOTE: Every launched operator counts toward this number, regardless of whether it matches the operators specified by --kernel-name, and skipped operators are not collected. The value of this option is an integer ranging from 0 to 1000.	No
--aic-metrics	This option enables the collection of operator metrics. The following values are supported:
- ArithmeticUtilization, L2Cache, Memory, MemoryL0, MemoryUB, PipeUtilization, ResourceConflictRatio, and Default: enable the collection of the corresponding operator metrics. You can select one or more metrics and separate them with commas (,), for example, --aic-metrics=Memory,MemoryL0. The default value is Default, indicating that ArithmeticUtilization, L2Cache, Memory, MemoryL0, MemoryUB, PipeUtilization, and ResourceConflictRatio are all collected. Example: --aic-metrics=Default.
- KernelScale: enables the collection of metrics within a specified code segment on the operator kernel, so that the segment can be tuned. Configure KernelScale first, and then select one or more operator metrics, separated by commas (,), for example, --aic-metrics=KernelScale,Memory,MemoryL0. If no metric is listed, all operator metrics are collected, for example, --aic-metrics=KernelScale. NOTE: To define the code segment range, mark the start and the end of the corresponding code segment on the operator kernel. For details, see MetricsProfStart and MetricsProfStop APIs. This function is supported only by Atlas A3 training products/Atlas A3 inference products and Atlas A2 training products/Atlas A2 inference products.
- Roofline: enables the generation of Roofline bottleneck analysis charts and displays them on MindStudio Insight. Example: --aic-metrics=Roofline. For details, see Roofline Bottleneck Analysis Chart. NOTE: Roofline is bound with Default. Enabling Roofline enables both Roofline and Default.
- TimelineDetail: enables the generation of the instruction pipeline chart and operator code hot spot map for visualization, for example, --aic-metrics=TimelineDetail. For details, see Instruction Pipeline Chart and Operator Code Hot Spot Map. NOTE: To enable this function, see Configurations of msprof op simulator. This function is supported only by Atlas A2 training products/Atlas A2 inference products and Atlas A3 training products/Atlas A3 inference products. This function applies only to third-party framework operator calling, that is, the PyTorch framework scenario where single-operator APIs are used to call operators internally. This function does not support the collection of level-2 pointer operators, Triton operators, or MC2 fused operators. It cannot be enabled together with --replay-mode=application or --replay-mode=range. To generate a CSV file or display the Computing Memory Heatmap, also enable Default when starting the operator, for example, msprof op --aic-metrics=TimelineDetail,Default.
- Occupancy: enables the generation of the inter-core load analysis chart and displays it on MindStudio Insight. Example: --aic-metrics=Occupancy. For details, see inter-core load analysis chart. The time consumption, data throughput, and cache hit ratio of each physical core are compared. If the difference between the maximum value and the minimum value is greater than 10%, the load is unbalanced, and the CLI provides tuning suggestions. NOTE: This function is supported only by Atlas A3 training products/Atlas A3 inference products and Atlas A2 training products/Atlas A2 inference products.
- MemoryDetail: enables the L2 cache-related functions (the L2 cache-L0A/L0B connection in the Compute Workload Analysis, and the L2 cache hit ratio and GM-related data transfer volume in the Cache Heatmap and Operator Code Hot Spot Map), for example, --aic-metrics=MemoryDetail. When dynamic instrumentation is enabled, the active bandwidth of MTE1 and MTE2 in the Cube unit on the AI Core is displayed in the Memory Workload Analysis. If the instrumentation fails, the corresponding fields in the Memory Workload Analysis are displayed as NA, and aic_mte1_active_bw(GB/s) and aic_mte2_active_bw(GB/s) are not displayed in PipeUtilization (Percentages of Time Taken by Compute Units and MTEs). NOTE: This function cannot be enabled together with --replay-mode=range. This function is supported only by Atlas A3 training products/Atlas A3 inference products and Atlas A2 training products/Atlas A2 inference products. MemoryDetail is bound with Default. Enabling MemoryDetail enables both MemoryDetail and Default.
- BasicInfo: enables basic information collection. Only basic operator information is saved to the drive, for example, --aic-metrics=BasicInfo. For details about the saved content, see OpBasicInfo (Basic Operator Information).
- Source: enables the operator code hot spot map, for example, --aic-metrics=Source. For details, see Operator Code Hot Spot Map. NOTE: This function is supported only by Atlas A3 training products/Atlas A3 inference products and Atlas A2 training products/Atlas A2 inference products. To view the code call stack, add the -g compilation option when compiling the operator. For details, see Adding -g Compilation Option. This function cannot be enabled together with --replay-mode=range.	No
--kill	The value can be on or off. The default value is off, indicating that the function is disabled. If you set --kill to on to enable this function, the application automatically stops after collecting the number of operators specified by --launch-count. NOTE: After --kill is set to on, error logs may be generated because the application ends in advance. You can determine whether to use this function. For a multi-threaded process, the configuration of the --kill option takes effect only for subprocesses. Using this option prevents the last executed MC2 fused operator from properly obtaining the API calling pipeline. For details, see Communication and Computing Pipeline Chart. You are advised not to enable this function together with --replay-mode=range. Otherwise, collected operator data may be missing.	No
--mstx	This option determines whether the operator tuning tool enables the mstx APIs used in the user code. The default value is off, indicating that the mstx APIs are disabled. If --mstx is set to on, the operator tuning tool enables the mstx APIs used in the user code. Example: msprof op --mstx=on ./add_custom NOTE: Currently, the mstxRangeStartA and mstxRangeEnd APIs are supported to enable the specified range for operator tuning. For details about the parameters, see the mstxRangeStartA and mstxRangeEnd APIs in MindStudio mstx API Reference. When used together with --replay-mode=range, the mstxRangeStartA and mstxRangeEnd APIs must be called in pairs, and the pairs must not cross-nest. The operators contained in each pair of mstx APIs form a replay range. The streams of the operators in the replay range cannot be changed. In addition, the number of operators that can be collected is limited by the number of operator block dims in OpBasicInfo (Basic Operator Information). It is recommended that the number be less than or equal to 50.	No
--mstx-include	This option can be used to enable only the specified mstx APIs when the mstx APIs are enabled in the operator tuning tool. If this option is not configured, all mstx APIs used in user code are enabled by default. If this option is configured, --mstx-include enables only the specified mstx APIs. The input of --mstx-include is the message character string transferred when the user calls the mstx function. Multiple character strings are combined using vertical bars (|). Example: --mstx=on --mstx-include="hello|hi" //Enable only the mstx APIs whose message parameters are hello and hi in the mstx function passed by the user. NOTE: This option cannot be configured independently and must be used together with --mstx. A message can contain only A-Z a-z 0-9_ characters. Use vertical bars (|) to combine the messages.	No
--replay-mode	This option specifies the replay mode of operator data collection. The value can be kernel, application, or range. The default value is kernel. If the value is set to application, the application is replayed multiple times. NOTE: In application mode, enabling only some aic-metrics may lead to missing data in the visualize_data.bin file. To view complete visualize_data.bin data, you are advised to add Default to --aic-metrics. If the value is set to kernel, the kernel function of a single operator within the specified collection range is replayed multiple times. If the value is set to range, multiple operators within the specified range are replayed multiple times as a whole. Multiple ranges can be specified, and ranges are independent of each other. NOTE: In the multi-device multi-operator scenario, this option cannot be set to application. Range-level replay must be used together with --mstx=on and applies only to Atlas A3 training products/Atlas A3 inference products and Atlas A2 training products/Atlas A2 inference products. Range-level replay does not support collection of MC2 and LCCL fused operators and cannot be enabled together with --kill=on, --aic-metrics=MemoryDetail, --aic-metrics=TimelineDetail, or --aic-metrics=Source.	No
--warm-up	When msprof op collects data of some operators, the task may be too short for the processor to raise its frequency, so the operator runs at a reduced frequency and the collected results are skewed. In this case, you can use --warm-up to specify the number of warm-up runs, which raises the running frequency of the Ascend AI Processor in advance and makes the onboard data more accurate. NOTE: The default value is 5. The value range is [0,500]. This option does not take effect for the MC2 operator.	No
--output	This option specifies the path for storing the collected profile data. By default, the profile data is stored in the current directory. NOTE: Ensure that users in the group and other groups do not have the write permission on the parent directory of the path specified by --output. In addition, ensure that the owner of the parent directory of the directory specified by --output is the current user.	No
--dump	This option specifies whether to generate the dump file of the simulator. The value can be on or off. The default value is off, indicating that the simulator dump file is not generated. NOTE: This option is valid only when --aic-metrics=TimelineDetail is used. It takes effect only for Atlas A2 training products/Atlas A2 inference products and Atlas A3 training products/Atlas A3 inference products. It does not take effect for Atlas inference products. This option applies only to the single-process scenario and does not support the scenario where two operators run at the same time.	No
--core-id	This option applies to the scenario where operators are evenly distributed. You can use the --core-id option to specify the IDs of some logical cores and parse the simulation data of these cores. The value range of the core ID is [0,49]. NOTE: To parse the simulation data of multiple cores, use vertical bars (|) to combine the data. For example, --core-id="0|31" indicates to parse simulation data of cores whose IDs are 0 and 31. This option is valid only when --aic-metrics=TimelineDetail is used. It takes effect only for Instruction Pipeline Chart and Operator Code Hot Spot Map and applies only to Atlas A2 training products/Atlas A2 inference products and Atlas A3 training products/Atlas A3 inference products.	No
-h, --help	This option outputs the help information.	No
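
The value constraints listed in Table 2 can be checked before a command is issued. The following sketch encodes the documented ranges and character rules. The function names are hypothetical, and treating `*` as an allowed character alongside A-Za-z0-9_ in --kernel-name is an assumption based on the wildcard note. msProf performs its own validation, so this is only a convenience for scripts.

```python
import re

# Limits as stated in Table 2.
KERNEL_NAME_MAX_LEN = 1024
# Names separated by vertical bars; '*' is assumed to be allowed as a wildcard.
KERNEL_NAME_RE = re.compile(r"[A-Za-z0-9_*]+(\|[A-Za-z0-9_*]+)*")

INT_RANGES = {
    "--launch-count": (1, 5000),
    "--launch-skip-before-match": (0, 1000),
    "--warm-up": (0, 500),
}


def check_kernel_name(value: str) -> bool:
    """Check length and character set of a --kernel-name value."""
    return (len(value) <= KERNEL_NAME_MAX_LEN
            and KERNEL_NAME_RE.fullmatch(value) is not None)


def check_int_option(option: str, value: int) -> bool:
    """Check an integer option against its documented inclusive range."""
    low, high = INT_RANGES[option]
    return low <= value <= high


def check_core_id(value: str) -> bool:
    """Check a --core-id value such as "0|31"; each ID must be in [0, 49]."""
    parts = value.split("|")
    return all(re.fullmatch(r"[0-9]+", p) is not None and int(p) <= 49
               for p in parts)
```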

msprof op simulator mode

Log in to the operating environment and run msprof op simulator with the optional simulation parameters, followed by the application to be tuned and its arguments (for example, blockdim 1). For details about the optional simulation parameters, see Table 3. An example command is as follows:

msprof op simulator --soc-version=Ascendxxxyy --output=/home/projects/output /home/projects/MyApp/out/main blockdim 1 // --output is an optional parameter, /home/projects/MyApp/out/main indicates the used app, blockdim 1 is an optional parameter of the user application, and xxxyy indicates the processor type.

**Table 3** Options of msprof op simulator

Option

Description

Mandatory (Yes/No)

--application

NOTE:

This option is still accepted for compatibility and will be replaced by ./app [arguments] later.

You are advised to run msprof op simulator --soc-version=Ascendxxxyy [msprof op simulator parameters] ./app to launch the application. app indicates the specified executable file. If no app path is specified, the current path is used by default. xxxyy indicates the processor type.

NOTE:

When using ./app, add msprof op simulator parameters before ./app to ensure that the related functions take effect.

Yes. Select one of the specified executable file, --config, and --export.

--config

This option sets the binary file *.o obtained by operator compilation. It can be set to an absolute path or a relative path. For details, see JSON Configuration File Description.

NOTE:

Before operator tuning, you can obtain the operator binary *.o file in either of the following ways:

Obtain the executable file on the NPU and extract the *.o file from the executable file. For details, see "Kernel Launch".
The *.o file is automatically generated during operator compilation. For details, see Compiling and Deploying Operators.
Ensure that users in the group and other groups do not have the write permission on the JSON file specified by --config and the parent directory. In addition, ensure that the owner of the parent directory of the JSON file is the current user.

You need to use the LD_LIBRARY_PATH environment variable to set the simulator type.

export LD_LIBRARY_PATH=${INSTALL_DIR}/tools/simulator/Ascendxxxyy/lib:$LD_LIBRARY_PATH // xxxyy indicates the processor type.

--export

This option specifies the folder that contains the single-operator simulation result. The simulation result is directly parsed, and the single-core or multi-core instruction pipeline chart of the single-operator is displayed on MindStudio Insight.

NOTE:

The specified folder can store only multi-core data and the operator kernel function file aicore_binary.o. Therefore, you need to manually change the binary file name (*.o) configured in --config to aicore_binary.o.
If you provide only the dump file, the code line mapping cannot be generated in the instruction pipeline chart. To view the code line, you need to store the operator kernel function file named aicore_binary.o in the dump file.
Ensure that users in the group and other groups do not have the write permission on the directory specified by --export and all files in the directory specified by --export. In addition, ensure that the owner of the specified directory is the current user.

--kernel-name

This option specifies the name of the operator whose data is to be collected. Fuzzy match using the operator name prefix is supported. If this option is not specified, only data of the first operator scheduled during program running is collected.

NOTE:

This option must be used together with --application. The value contains a maximum of 1024 characters. Only one or more characters in A-Za-z0-9_ are supported.
If multiple operators need to be collected, use vertical bars (|) to combine them. For example, --kernel-name="add|abs" indicates that operators whose prefixes are add and abs are collected.
The number of operators to be collected is determined by the value of --launch-count.
The wildcard (*) can be used to match strings of any length.

--launch-count

This option sets the maximum number of operators that can be collected. The value is an integer ranging from 1 to 5000. The default value is 1.

--aic-metrics

This option enables the collection of operator performance metrics. The following performance metrics can be collected.

PipeUtilization (collected by default)
NOTE:
- PipeUtilization: indicates the computing and transfer instruction pipeline.
- When --aic-metrics=PipeUtilization is configured, ResourceConflictRatio is disabled. That is, only the instruction pipeline is displayed, and the details of synchronization event instructions are not included.
ResourceConflictRatio (collected by default)
NOTE:
- ResourceConflictRatio: displays details about synchronization event instructions.
  - For the Atlas A3 training products/Atlas A3 inference products and Atlas A2 training products/Atlas A2 inference products, the SET_FLAG/WAIT_FLAG instructions are displayed.
  - For the Atlas inference products, the set_event/wait_event instructions are displayed.

PMSampling: enables and visualizes the memory channel throughput waveform, for example, --aic-metrics=PMSampling. For details, see Memory Channel Throughput Waveform.
NOTE:
- --core-id does not take effect for the PMSampling parameter. PMSampling parses all cores.
- This function is disabled by default.

--core-id

This option applies to the scenario where operators are evenly distributed. You can use the --core-id option to specify the IDs of some logical cores and parse the simulation data of these cores.

The value range of the core ID is [0,49].

NOTE:

To parse the simulation data of multiple cores, use vertical bars (|) to combine the data. For example, --core-id="0|31" indicates to parse simulation data of cores whose IDs are 0 and 31.
--core-id does not take effect for the PMSampling parameter. PMSampling parses all cores.

--timeout

This option applies to operators that process a large amount of data with repeated computation and therefore take a long time to run. Because the necessary information can often be obtained from a partial pipeline chart, you can set --timeout to shorten the operator running duration and still obtain that information. The behavior is as follows:

When the simulation duration reaches the value of --timeout, msProf terminates the simulation and starts parsing. Only part of the simulation data is analyzed. In addition, msProf displays the following information:
[INFO] The timeout has reached and the application will be forcibly killed.
If the timeout value is not reached when the process ends normally, the simulation program ends normally and the parsing process starts.

The value is an integer ranging from 1 to 2880, in minutes. Example:

msprof op simulator --soc-version=Ascendxxxyy --timeout=1 ./add_custom // xxxyy indicates the processor type.

--mstx

This option determines whether the operator tuning tool enables the mstx APIs used in the user code program.

The default value is off, indicating that the mstx APIs are disabled.

If --mstx is set to on, the operator tuning tool enables the mstx API used in the user code program.

Example:

msprof op simulator --soc-version=Ascendxxxyy --mstx=on ./add_custom // xxxyy indicates the processor type.

NOTE:

Currently, the mstxRangeStartA and mstxRangeEnd APIs are supported to enable the specified range for operator tuning. For details about the parameters, see the mstxRangeStartA and mstxRangeEnd APIs in MindStudio mstx API Reference.

--mstx-include

This option can be used to enable the specified mstx APIs in the msProf tool.

If this option is not configured, all mstx APIs used in user code are enabled by default.

If this option is configured, --mstx-include enables only the specified mstx APIs. The input of --mstx-include is the message character string transferred when the user calls the mstx function. Multiple character strings must be separated by vertical bars (|).

Example:

--mstx=on --mstx-include="hello|hi" //Enable only the mstx APIs whose message parameters are hello and hi in the mstx function passed by the user.

NOTE:

This option cannot be configured independently and must be used together with --mstx.
A message can contain only A-Z a-z 0-9_ characters. Use vertical bars (|) to combine the messages.

--soc-version

You can use --soc-version or the LD_LIBRARY_PATH environment variable to specify the simulator type. Either of them must be used. The details are as follows:

--soc-version: specifies the simulator type in --application and --export modes. For details about the value range, see the simulator types in the ${INSTALL_DIR}/tools/simulator directory.
LD_LIBRARY_PATH environment variable: specifies the simulator type in --config mode or when --soc-version is not used.
```
export LD_LIBRARY_PATH=${INSTALL_DIR}/tools/simulator/Ascendxxxyy/lib:$LD_LIBRARY_PATH 
```
NOTE:
Replace ${INSTALL_DIR} with the CANN component directory. For example, if the installation is performed by the root user, the default file storage path is /usr/local/Ascend/cann.

--output

This option specifies the path for storing the collected profile data. By default, the profile data is stored in the current directory.

NOTE:

Ensure that users in the group and other groups do not have the write permission on the parent directory of the path specified by --output. In addition, ensure that the owner of the parent directory of the directory specified by --output is the current user.

--dump

This option specifies whether to generate the dump file of the simulator.

The value can be on or off. The default value is off, indicating that the simulator dump file is not generated.

NOTE:

This option takes effect only for Atlas A2 training products/Atlas A2 inference products and Atlas A3 training products/Atlas A3 inference products. This option does not take effect for Atlas inference products. The dump files are saved to drives as usual.
This option applies only to the single-process scenario and does not support the scenario where two operators run at the same time.

-h, --help

This option outputs the help information.
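
The --soc-version row above points to the ${INSTALL_DIR}/tools/simulator directory for the available simulator types and shows the LD_LIBRARY_PATH form used in --config mode. The sketch below lists that directory and builds the LD_LIBRARY_PATH value. The function names are hypothetical, and it assumes that each subdirectory of tools/simulator corresponds to one simulator type.

```python
import os


def list_simulator_types(install_dir: str) -> list:
    """List subdirectories of <install_dir>/tools/simulator (assumed simulator types)."""
    sim_root = os.path.join(install_dir, "tools", "simulator")
    return sorted(
        entry for entry in os.listdir(sim_root)
        if os.path.isdir(os.path.join(sim_root, entry))
    )


def simulator_ld_library_path(install_dir: str, soc_version: str, current: str = "") -> str:
    """Build the LD_LIBRARY_PATH value with the simulator lib directory first."""
    lib_dir = os.path.join(install_dir, "tools", "simulator", soc_version, "lib")
    return f"{lib_dir}:{current}" if current else lib_dir
```

A script could pass `os.environ.get("LD_LIBRARY_PATH", "")` as `current` and export the result before running msprof op simulator with --config.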

Segment-based Tuning Principles of msprof op

1. Use the --launch-skip-before-match option to filter the operator tuning range. The filtering principles are as follows:
   - If --launch-skip-before-match is configured, the specified number of operators, counted from the first operator, are skipped, and only the operators after them are collected.
   - If --launch-skip-before-match is not configured, no filtering is performed.
2. On the basis of Step 1, use the --mstx option to filter the operator tuning range. The filtering principles are as follows:
   - If --mstx is configured, only the operators within the scope of the mstxRangeStartA and mstxRangeEnd APIs are collected.
   - If --mstx is not configured, no filtering is performed.
3. On the basis of Step 2, use the --kernel-name option to filter the operator tuning range. The filtering principles are as follows:
   - If --kernel-name is configured, only operators within the range specified by --kernel-name are collected.
   - If --kernel-name is not configured, only the first operator scheduled during program running is collected.
4. On the basis of Step 3, use the --aic-metrics option to select the operator metrics for tuning. The filtering principles are as follows:
   - If --aic-metrics is configured, the specified operator performance metrics are collected.
   - If --aic-metrics is not configured, the operator performance metrics in the Default section are collected. Performance metrics in the KernelScale, TimelineDetail, Roofline, and Occupancy sections are not collected.
5. Perform Step 1 to Step 4 to obtain the actual number of tuned operators and the collection range of metrics.
6. With --kill=on, compare the actual number of tuned operators with the value of --launch-count to determine whether to automatically stop the program. If the number of tuned operators is less than or equal to the value of --launch-count, no automatic stop is triggered. Otherwise, the program automatically stops when the number of tuned operators reaches the value specified by --launch-count.
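
The filtering order described above can be modeled as a small pipeline. The sketch below is an illustrative model only, not msProf's implementation. The function name and the `(name, in_mstx_range)` tuple representation are hypothetical, and it assumes that "the first operator" in the --kernel-name step means the first operator remaining after the earlier filters.

```python
from fnmatch import fnmatchcase


def select_operators(launches, skip=0, mstx=False, kernel_name=None, launch_count=1):
    """Model which operator launches are collected.

    launches: list of (name, in_mstx_range) tuples in launch order.
    """
    # Step 1: --launch-skip-before-match skips the first `skip` launches.
    candidates = launches[skip:]
    # Step 2: with --mstx=on, keep only launches inside an mstx range.
    if mstx:
        candidates = [op for op in candidates if op[1]]
    # Step 3: --kernel-name keeps prefix matches; without it, only the first launch.
    if kernel_name is None:
        candidates = candidates[:1]
    else:
        patterns = [prefix + "*" for prefix in kernel_name.split("|")]
        candidates = [
            op for op in candidates
            if any(fnmatchcase(op[0], pattern) for pattern in patterns)
        ]
    # --launch-count caps the number of collected operators.
    return [op[0] for op in candidates[:launch_count]]
```

With five launches `add_custom` (outside any range) followed by `abs_custom`, `add_v2`, `mul`, and `add_v3` (all inside a range), calling the function with `skip=1, mstx=True, kernel_name="add|abs", launch_count=2` selects `abs_custom` and `add_v2`.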

Call Scenarios

The following operator calling scenarios are supported. For details, see Collecting Profile Data of Ascend C Operators and Collecting Profile Data of MC2 Operators.

Kernel launch operator development: kernel launch
- For details about kernel launch, see "Kernel Launch Operator Development".
- In the kernel launch scenario, configure the prerequisites and then run the following command:
```
msprof op simulator --soc-version=Ascendxxxyy ./main  // main indicates the name of the user operator program, including the program name of the operator to be tuned. xxxyy indicates the type of the processor used by the user.
```
- Optional: If you need to perform simulation tuning on an operator that runs on the board without recompilation, perform the following steps:
  - Create a soft link named libruntime.so that points to libruntime_camodel.so in any directory.
    ln -s /{simulator_path}/lib/libruntime_camodel.so /{so_path}/libruntime.so //For example, if the CANN package is installed in the default path of the root user, simulator_path is /usr/local/Ascend/cann/tools/simulator/Ascendxxxyy.
  - Add the parent directory of the created soft link to the environment variable LD_LIBRARY_PATH.
    export LD_LIBRARY_PATH={so_path}:$LD_LIBRARY_PATH
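
The two optional steps above (creating the libruntime.so soft link and prepending its directory to LD_LIBRARY_PATH) can also be scripted. The following is a minimal sketch with hypothetical function names; `simulator_path` and `so_path` stand for the same placeholders used in the commands above.

```python
import os


def link_camodel_runtime(simulator_path: str, so_path: str) -> str:
    """Create so_path/libruntime.so pointing to the simulator's libruntime_camodel.so."""
    target = os.path.join(simulator_path, "lib", "libruntime_camodel.so")
    link = os.path.join(so_path, "libruntime.so")
    os.makedirs(so_path, exist_ok=True)
    # Replace a stale soft link left over from an earlier run.
    if os.path.islink(link):
        os.remove(link)
    os.symlink(target, link)
    return link


def prepend_ld_library_path(so_path: str, env: dict) -> dict:
    """Return a copy of env with so_path placed first in LD_LIBRARY_PATH."""
    updated = dict(env)
    current = updated.get("LD_LIBRARY_PATH", "")
    updated["LD_LIBRARY_PATH"] = f"{so_path}:{current}" if current else so_path
    return updated
```

The returned environment can be passed as `env=` to `subprocess.run` when launching msprof op simulator, which leaves the calling shell's environment unchanged.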
Project-based operator development: single-operator API calling
- For details about single-operator API call, see "Single-Operator API Calling".
- In the single-operator API execution scenario, configure the prerequisites and then run the following command:
```
msprof op simulator --soc-version=Ascendxxxyy ./main  // main indicates the name of the user operator program, including the program name of the operator to be tuned. xxxyy indicates the type of the processor used by the user.
```
AI framework operator adaptation: PyTorch framework
- When the msProf tool is used for simulation tuning of the operators in a PyTorch script on Atlas inference products, only the kernels-based operator package calling mode is supported. Install the binary kernels operator package by referring to "Installing CANN", and then modify the script entry file (for example, main.py) by adding the torch_npu.npu.set_compile_mode(jit_compile=False) line under import torch_npu so that the operators in the kernels operator package are used.
```
import torch
import torch_npu
torch_npu.npu.set_compile_mode(jit_compile=False)
......
```
- For details about single-operator execution in the PyTorch framework, see Adapting OpPlugin to a Single Operator.
- When the PyTorch framework is used to call a single-operator, configure the prerequisites and then run the following command:
```
msprof op simulator --soc-version=Ascendxxxyy python a.py  // a.py indicates the name of the user operator program, including the program name of the operator to be tuned. xxxyy indicates the type of the processor used by the user.
```
Triton operator development: Triton operator calling
- Ensure that Triton and the Triton-Ascend plug-in have been installed and configured. For details, see link.
- The Triton operator calling scenario does not apply to Atlas inference products.
