Methods for Locating Operator Performance Problems

Operator performance is a key challenge in deep learning models. When some operators, the basic compute units of a model, execute inefficiently, the overall model runtime increases and hardware resources are wasted. Locating and fixing such issues requires dedicated analysis tools and code tuning techniques. For example, when evaluating a fused operator, you can compare metrics such as compute time and memory usage under different configurations.
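The comparison of compute time and memory usage across configurations can be sketched as follows. This is a minimal, host-side illustration only: `unfused` and `fused` are hypothetical stand-ins for the same computation under two configurations, and `tracemalloc` reports Python heap usage, not device memory. On a real device you would use the profiling tools described in this section instead.

```python
import time
import tracemalloc
from typing import Callable, Dict


def measure(fn: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Return the best wall-clock time (s) and peak Python heap usage (bytes) of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    # Trace memory in a separate run so tracing overhead does not skew the timing.
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"seconds": best, "peak_bytes": float(peak)}


# Hypothetical stand-ins for one computation under two configurations.
def unfused(n: int = 20000) -> float:
    scaled = [x * 2.0 for x in range(n)]   # first intermediate result
    shifted = [x + 1.0 for x in scaled]    # second intermediate result
    return sum(shifted)


def fused(n: int = 20000) -> float:
    # Same arithmetic in a single pass, without intermediate lists.
    return sum(x * 2.0 + 1.0 for x in range(n))


results = {name: measure(fn) for name, fn in {"unfused": unfused, "fused": fused}.items()}
for name, r in results.items():
    print(f"{name}: {r['seconds'] * 1e3:.3f} ms, peak {r['peak_bytes'] / 1024:.1f} KiB")
```

Both variants return the same value, so any difference in the printed metrics comes from how the work is organized rather than from what is computed.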

Figure 1 Locating operator performance issues

Table 1 Methods for locating operator performance issues

Analysis Method

Analysis Purpose

Troubleshooting Process

Advisor analysis

Long execution time of the AI CPU operator

Reducing the time consumed by the AI CPU operator

Locate the AI CPU operator on the Timeline page by its operator name, find where it is called in the code using the call stack, and try to replace it with an operator that implements the same logic. If the replacement fails, record the operator's shape and type and contact the operator owner to check whether this case is supported.

Operator build error

Add the following code to the Python script before training starts to enable binary mode. If the error persists, record the operator's shape and type and contact the operator owner to check whether this case is supported.

import torch
import torch_npu
torch_npu.npu.set_compile_mode(jit_compile=False)  # use binary mode instead of JIT compilation
torch_npu.npu.config.allow_internal_format = False  # do not use internal (private) data formats

Single-operator analysis

Vector operator analysis

  • Vector operators are tuned by modifying the code logic. The tuning approaches are as follows:
    • Operator affinity tuning: For details, see Affinity Operator Tuning Strategy.
    • Model code tuning: Based on the operator analysis, tune how operators are called at the model code layer using methods such as redundancy elimination, shape tuning, and equivalent replacement. For details, see Model Code Tuning Strategy.
    • Version update: Contact the Ascend community to check whether a newer version already includes the tuning or whether tuning is planned.
  • Check whether fusion operators are necessary based on the advisor fusion operator analysis. If necessary, you can develop a fusion operator for replacement.

Cube operator analysis

View the proportion of time consumed by each operator on the Operator tab page (for details, see In-depth Analysis for Model Tuning (MindStudio Insight)). Select the top N operators with the highest time consumption, analyze the average AI Core performance for each input shape, record the abnormal operators and their shapes, and contact the operator owner to confirm the tuning plan.

  • MAC ratio: indicates whether the cube computing unit is fully used. The ideal value is 80%.
  • MTE ratio: If the MTE ratio is too high, a memory transfer bottleneck exists.

If the operator performance does not meet expectations, try the following:

  • Operator affinity tuning: For details, see Affinity Operator Tuning Strategy.
  • Model code tuning: Based on the operator analysis, tune how operators are called at the model code layer using methods such as redundancy elimination, shape tuning, and equivalent replacement. For details, see Model Code Tuning Strategy.
  • Version update: Contact the Ascend community to check whether a newer version already includes the tuning or whether tuning is planned.
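The top-N selection and ratio checks described for cube operators can be sketched as follows. This is an illustration under stated assumptions, not the tool's own logic: the record layout, the sample values, and the `MTE_LIMIT` threshold are hypothetical, and treating a MAC ratio below the 80% ideal as abnormal is a simplification.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical per-call records, for example exported from a profiling tool:
# (operator name, input shape, duration in microseconds, MAC ratio, MTE ratio)
Record = Tuple[str, str, float, float, float]

MAC_TARGET = 0.80  # ideal MAC ratio stated in the text
MTE_LIMIT = 0.50   # assumed threshold for "too high"; adjust for your workload


def top_n_operators(records: List[Record], n: int) -> List[dict]:
    """Group records by (name, shape) and return the n most time-consuming groups."""
    groups: Dict[Tuple[str, str], List[Record]] = defaultdict(list)
    for rec in records:
        groups[(rec[0], rec[1])].append(rec)
    total_us = sum(rec[2] for rec in records)
    rows = []
    for (name, shape), recs in groups.items():
        time_us = sum(r[2] for r in recs)
        avg_mac = sum(r[3] for r in recs) / len(recs)
        avg_mte = sum(r[4] for r in recs) / len(recs)
        rows.append({
            "name": name,
            "shape": shape,
            "time_us": time_us,
            "share": time_us / total_us,  # proportion of total operator time
            "avg_mac": avg_mac,
            "avg_mte": avg_mte,
            "abnormal": avg_mac < MAC_TARGET or avg_mte > MTE_LIMIT,
        })
    rows.sort(key=lambda r: r["time_us"], reverse=True)
    return rows[:n]


# Made-up sample data for illustration only.
records: List[Record] = [
    ("MatMul", "[1024,1024]x[1024,1024]", 400.0, 0.85, 0.20),
    ("MatMul", "[1024,1024]x[1024,1024]", 400.0, 0.83, 0.22),
    ("MatMul", "[7,1000]x[1000,13]", 150.0, 0.30, 0.60),
    ("Conv2D", "[8,3,224,224]", 50.0, 0.82, 0.25),
]

top = top_n_operators(records, n=2)
for row in top:
    flag = "ABNORMAL" if row["abnormal"] else "ok"
    print(f"{row['name']} {row['shape']}: {row['share']:.0%} of time, "
          f"MAC {row['avg_mac']:.2f}, MTE {row['avg_mte']:.2f} ({flag})")
```

Grouping by both name and shape matters because the same operator can perform well for one input shape and poorly for another. The flagged rows are the operators and shapes to record and raise with the operator owner.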

Fusion operator/Affinity API replacement

Use fusion operators or affinity APIs to reduce the delivery of unnecessary small operators and improve the AI Core utilization.

The Affinity API issues analyzer in Advisor can automatically identify fused operators. Locate the corresponding code using the call stack and replace it with fused operators or affinity APIs.

Fused operator development

To further improve the model performance, you can develop fused operators to reduce the delivery of small operators and the proportion of free time.

The host bottleneck and MTE bottleneck are displayed and marked in the operator sequence analysis results of the advisor CSV deliverable. You are advised to analyze the code logic and determine whether the bottleneck can be alleviated by means of operator combination.
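To judge whether combining operators is worthwhile, it can help to quantify the proportion of free time that fused operators are meant to reduce. The following is a minimal sketch that assumes you have extracted operator start and end timestamps from the timeline. The interval values and the duration of the fused operator are made up for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_us, end_us) of one operator on the device


def free_time_ratio(intervals: List[Interval]) -> float:
    """Fraction of the profiled span during which no operator was running."""
    if not intervals:
        return 0.0
    ordered = sorted(intervals)
    span_start = ordered[0][0]
    span_end = max(end for _, end in ordered)
    busy = 0.0
    cur_start, cur_end = ordered[0]
    for start, end in ordered[1:]:
        if start > cur_end:
            # Gap found: close the current busy block and start a new one.
            busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent operator: extend the current busy block.
            cur_end = max(cur_end, end)
    busy += cur_end - cur_start
    span = span_end - span_start
    return 0.0 if span == 0 else 1.0 - busy / span


# Four small operators separated by delivery gaps (hypothetical timestamps).
small_ops = [(0.0, 10.0), (30.0, 40.0), (60.0, 70.0), (90.0, 100.0)]
# One hypothetical fused operator covering the same work without gaps.
fused_op = [(0.0, 35.0)]

print(f"small operators: {free_time_ratio(small_ops):.0%} free time")
print(f"fused operator:  {free_time_ratio(fused_op):.0%} free time")
```

A high free-time ratio over a sequence of small operators suggests that delivery overhead dominates, which is the situation where operator combination is most likely to help.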