TBE Operator Performance Tuning

TBE DSL Operators

If the performance of a DSL operator does not meet your requirements, optimize the operator as follows:

  1. Tune the operator with the AOE tool. For details, see AOE Instructions.
  2. If AOE tuning does not bring the operator to the required performance, optimize the DSL implementation of the operator itself. For details, see DSL Performance Optimization.

TBE TIK Operators

The following figure shows the workflow of tuning TBE TIK operator performance.

Figure 1 TBE TIK operator tuning
  1. Check whether the target operator is among the operators supported by AOE. If yes, use AOE to tune the operator.

    For details about the operators supported by AOE and its usage, see AOE Instructions.

  2. After the tuning, analyze how the operator is split across multiple cores to check whether the split is appropriate and whether double buffering is enabled. For details, see AI Core Parallelism and Double Buffering.
  3. Use MindStudio to run unit tests (UT) on the operator, then use the UT performance simulation tool to display the operator's scheduling pipeline and profile it in detail.
    • MTE instruction pipeline analysis

      If MTE1–MTE3 instructions are executing for more than 80% of the total clock cycles, the DMA transfer performance is poor.

      If the MTE instruction streams execute discontinuously (with idle gaps between them), the degree of parallelism during data transfer is low.

      For details about the optimization, see Data Tiling for Computation.

    • Vector instruction pipeline analysis

      If the Vector instruction streams execute discontinuously, the Vector Unit is not used to full capacity. In this case, inspect the use of synchronization instructions and the degree of instruction parallelism.

      If Vector instructions are executing for more than 80% of the total clock cycles, the Vector Unit is fully used. To improve performance further, increase instruction parallelism where possible and optimize the algorithm.

      For details, see Data Tiling for Computation and Synchronization Instruction Analysis.
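
The two pipeline checks above (the 80% occupancy ratio and the discontinuity of instruction streams) can be illustrated with a small offline calculation. The following is a minimal sketch, not part of MindStudio or any CANN tool: it assumes you have transcribed the busy spans of one pipeline from the simulation view into (start_cycle, end_cycle) pairs by hand. The trace format, the function names, and the sample numbers are all hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_cycle, end_cycle), end exclusive

# Threshold taken from the guidance above (80% of the total clock cycles).
THRESHOLD = 0.8


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching busy spans into disjoint spans."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def occupancy(intervals: List[Interval], total_cycles: int) -> float:
    """Fraction of total_cycles during which the pipeline was busy."""
    busy = sum(end - start for start, end in merge_intervals(intervals))
    return busy / total_cycles


def idle_gaps(intervals: List[Interval]) -> List[Interval]:
    """Idle spans between consecutive busy spans (the discontinuities)."""
    merged = merge_intervals(intervals)
    return [(a_end, b_start)
            for (_, a_end), (b_start, _) in zip(merged, merged[1:])]


# Hypothetical trace: Vector pipeline busy spans over a 1000-cycle run.
vector_busy = [(0, 200), (350, 500), (480, 600), (900, 1000)]
ratio = occupancy(vector_busy, 1000)
gaps = idle_gaps(vector_busy)

print(f"Vector occupancy: {ratio:.0%}, above threshold: {ratio > THRESHOLD}")
print(f"Idle gaps (cycles): {gaps}")
```

With this sample trace, the occupancy is below the 80% threshold and there are two idle gaps, which corresponds to the case where the Vector Unit is not used to full capacity and synchronization and instruction parallelism should be inspected. The same functions can be applied to a transcribed MTE trace.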