DSL Performance Optimization

When the operator code is implemented in DSL mode, performance can be optimized in the following ways:

Avoid using APIs that need long processing time.
In the current version, instructions such as vrec, vsel, and vcmp are time-consuming. In performance-critical scenarios, you can avoid these APIs by converting the formula so that the same result is computed with other instructions. For example, 1/exp(x) can be replaced with exp(-x): compute -x first and then the exponent, which avoids calculating the reciprocal.

Note: Accuracy must be considered during instruction replacement.

Sample code
```
# Before replacement
res = tbe.dsl.vrec(vsqrt_res)
# After replacement
cosh_one = tvm.const(NUM_ONE, "float32")
tensor_one = tbe.dsl.broadcast(cosh_one, data_y.shape)
res = tbe.dsl.vdiv(tensor_one, vsqrt_res)
```
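
The accuracy note above can be checked numerically before an instruction is swapped. The following is a minimal sketch in plain Python that uses only the standard `math` module, not the TBE DSL; the helper name `max_rel_error` and the sample inputs are illustrative assumptions. It compares 1/exp(x) with exp(-x) in double precision. Results in float16 or float32 on the device may differ more, so a real check should use the operator's actual data type.

```python
import math

def max_rel_error(reference, candidate, samples):
    """Return the largest relative error of candidate against reference."""
    worst = 0.0
    for x in samples:
        ref = reference(x)
        err = abs(candidate(x) - ref) / abs(ref)
        worst = max(worst, err)
    return worst

# Hypothetical sample inputs; pick values representative of the real operator.
samples = [-10.0, -1.5, -0.1, 0.0, 0.1, 1.5, 10.0]

# Original formula: exponent followed by a reciprocal.
def with_reciprocal(x):
    return 1.0 / math.exp(x)

# Converted formula: negate first, then take the exponent (no reciprocal).
def without_reciprocal(x):
    return math.exp(-x)

error = max_rel_error(with_reciprocal, without_reciprocal, samples)
print(f"max relative error: {error:.3e}")
```

If the reported error exceeds the tolerance that the operator must meet, keep the original instruction.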

Reduce the total number of computations.
Changing the formula can reduce the number of computations, which shortens the build time and improves performance.

The new formula must be correct and the accuracy must be within the acceptable range.

For example, (1/vsqrt(x))*data_dy can be computed directly as data_dy/vsqrt(x).
```
# Before modification
vsqrt_res = tbe.dsl.vsqrt(num_to_vrsqrt)
res = tbe.dsl.vdiv(tvm.const(NUM_ONE, "float32"), vsqrt_res)
res = tbe.dsl.vmul(res, data_dy)
# After modification
vsqrt_res = tbe.dsl.vsqrt(num_to_vrsqrt)
res = tbe.dsl.vdiv(data_dy, vsqrt_res)
```
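
The reduction in step count can be illustrated outside the DSL. The sketch below uses plain Python scalars as stand-ins for tensors; the function names `before` and `after` and the input pairs are assumptions made for illustration only. It shows that the two-step form gives the same value as the three-step form for these inputs.

```python
import math

def before(x, dy):
    # Three element-wise steps: sqrt, reciprocal (1 / sqrt), multiply.
    root = math.sqrt(x)
    inv = 1.0 / root
    return inv * dy

def after(x, dy):
    # Two element-wise steps: sqrt, divide.
    root = math.sqrt(x)
    return dy / root

# Hypothetical (x, data_dy) pairs; x must be positive for sqrt and division.
pairs = [(4.0, 2.0), (0.25, -3.0), (9.0, 1.5)]
max_diff = max(abs(before(x, dy) - after(x, dy)) for x, dy in pairs)
print(f"largest difference between the two forms: {max_diff:.3e}")
```

Because the two forms round at different points, small differences are possible for other inputs, which is why the accuracy must be confirmed against the acceptable range.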

Reduce function encapsulation.

Reduce function encapsulation to cut the overhead of function calls and returns.

Note: It is advisable to use no more than 15 variables per function. As a best practice, you can reuse variable names for intermediate variables that are used only once.

```
# Before replacement
def _cosh_rec_cloud(data):
    exp_pos = tbe.dsl.vexp(data)
    neg_exp = tbe.dsl.vmuls(data, tvm.const(NUM_MINUS_ONE, "float32"))
    neg_exp_pos = tbe.dsl.vexp(neg_exp)
    base = tbe.dsl.vadd(exp_pos, neg_exp_pos)
    base_rec = tbe.dsl.vrec(base)
    res = tbe.dsl.vmuls(base_rec, tvm.const(NUM_TWO, "float32"))
    return res
cosh_value_rec = _cosh_rec_cloud(data_y)
# After replacement
exp_pos = tbe.dsl.vexp(data_y)
neg_exp = tbe.dsl.vmuls(data_y, tvm.const(NUM_MINUS_ONE, "float32"))
neg_exp_pos = tbe.dsl.vexp(neg_exp)
base = tbe.dsl.vadd(exp_pos, neg_exp_pos)
base_rec = tbe.dsl.vrec(base)
cosh_value_rec = tbe.dsl.vmuls(base_rec, tvm.const(NUM_TWO, "float32"))
```
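
For reference, the sample above appears to compute 2/(exp(x) + exp(-x)), that is, the reciprocal of cosh(x). The sketch below mirrors the same steps with plain Python scalars instead of DSL tensors. It assumes that NUM_MINUS_ONE is -1 and NUM_TWO is 2, which the sample does not state, and the function name `cosh_rec` is hypothetical.

```python
import math

NUM_MINUS_ONE = -1.0  # assumed value of the constant used in the sample
NUM_TWO = 2.0         # assumed value of the constant used in the sample

def cosh_rec(x):
    # Same step order as the DSL sample, with scalars instead of tensors.
    exp_pos = math.exp(x)
    neg_exp_pos = math.exp(x * NUM_MINUS_ONE)
    base = exp_pos + neg_exp_pos
    base_rec = 1.0 / base
    return base_rec * NUM_TWO

# Compare against the standard-library cosh for a few hypothetical inputs.
checks = [abs(cosh_rec(x) - 1.0 / math.cosh(x)) for x in (-2.0, 0.0, 0.5, 3.0)]
print(f"largest deviation from 1/cosh(x): {max(checks):.3e}")
```

Inlining the helper leaves this arithmetic unchanged; only the call and return are removed.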

Avoid defining tvm.const separately for DSL operators.
Defining tvm.const in a separate step is unnecessary, especially for values that are used only once. Define the constant inline where it is used instead. See the following example:
```
# Before replacement
cosh_one = tvm.const(NUM_ONE, "float32")
tensor_one = tbe.dsl.broadcast(cosh_one, data_y.shape)
# After replacement
tensor_one = tbe.dsl.broadcast(tvm.const(NUM_ONE, "float32"), data_y.shape)
```
