DSL Performance Optimization
When operator code is implemented in DSL mode, performance can be optimized in the following four ways:
- Avoid APIs that take a long time to process.
In the current version, instructions such as vrec, vsel, and vcmp are time-consuming. In performance-critical scenarios, you can rewrite the formula so that the same result is computed without them. For example, 1/exp(x) can be replaced with exp(-x): compute -x first and then take the exponential, which avoids the reciprocal.
Note: Accuracy must be considered during instruction replacement.
Sample code:
    # Before replacement
    res = tbe.dsl.vrec(vsqrt_res)

    # After replacement
    cosh_one = tvm.const(NUM_ONE, "float32")
    tensor_one = tbe.dsl.broadcast(cosh_one, data_y.shape)
    res = tbe.dsl.vdiv(tensor_one, vsqrt_res)
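Before replacing an instruction, it can help to check offline that the rewritten formula stays within the accuracy budget. The following is a minimal sketch for the exp(-x) example above, using only the Python standard library rather than the TBE DSL. The helper name max_rel_error, the sample range, and the TOLERANCE value are assumptions made for illustration. Double-precision host arithmetic does not reproduce on-device float16/float32 behavior, so treat the result as a sanity check only.

```python
import math


def max_rel_error(xs):
    """Largest relative difference between exp(-x) and 1/exp(x) over xs."""
    worst = 0.0
    for x in xs:
        reference = 1.0 / math.exp(x)  # original formula (needs a reciprocal)
        rewritten = math.exp(-x)       # converted formula (no reciprocal)
        worst = max(worst, abs(rewritten - reference) / abs(reference))
    return worst


# Assumed input range for the check: -50.0 to 50.0 in steps of 0.5.
sample_inputs = [i * 0.5 for i in range(-100, 101)]
TOLERANCE = 1e-12  # assumed acceptance threshold, not a TBE requirement

error = max_rel_error(sample_inputs)
print(f"max relative error: {error:.3e}, within tolerance: {error < TOLERANCE}")
```

The same pattern applies to any other conversion: evaluate both formulas over inputs that are representative of the operator and compare against the accuracy the operator must deliver.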
- Reduce the total number of computations.
Rewriting the formula can reduce the number of computations, which shortens the build time and improves performance.
The new formula must be correct, and its accuracy must stay within the acceptable range.
For example, (1/vsqrt(x))*data_dy can be computed directly as data_dy/vsqrt(x).
    # Before modification
    vsqrt_res = tbe.dsl.vsqrt(num_to_vrsqrt)
    res = tbe.dsl.vdiv(tvm.const(NUM_ONE, "float32"), vsqrt_res)
    res = tbe.dsl.vmul(res, data_dy)

    # After modification
    vsqrt_res = tbe.dsl.vsqrt(num_to_vrsqrt)
    res = tbe.dsl.vdiv(data_dy, vsqrt_res)
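The accuracy requirement for this rewrite can be explored on the host with a rough float32 emulation. The sketch below is an illustration under stated assumptions, not TBE code: f32 rounds each intermediate result to float32 through the struct module, and the helper names (f32, three_ops, two_ops, worst_rel_error) and the sample data are hypothetical. Actual on-device rounding may differ.

```python
import math
import struct


def f32(value):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def three_ops(x, dy):
    """(1 / sqrt(x)) * dy: sqrt, divide, multiply."""
    root = f32(math.sqrt(x))
    rec = f32(1.0 / root)
    return f32(rec * dy)


def two_ops(x, dy):
    """dy / sqrt(x): sqrt, divide."""
    root = f32(math.sqrt(x))
    return f32(dy / root)


def worst_rel_error(func, samples):
    """Largest relative error of func against a double-precision reference."""
    worst = 0.0
    for x, dy in samples:
        exact = dy / math.sqrt(x)
        worst = max(worst, abs(func(x, dy) - exact) / abs(exact))
    return worst


# Assumed sample data: positive x values and nonzero dy values.
samples = [(f32(0.1 * i), f32(0.37 * i - 5.0)) for i in range(1, 200)]

print("three operations:", worst_rel_error(three_ops, samples))
print("two operations:  ", worst_rel_error(two_ops, samples))
```

Each emulated operation rounds once, so the two-operation form goes through one fewer rounding step than the three-operation form. Whether that matters for a given operator depends on its accuracy target.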
- Reduce function encapsulation.
Avoid wrapping short computations in helper functions. Inlining them removes function call and return overhead.
Note: Keep each function to at most 15 variables. As a best practice, reuse variable names for intermediate variables that are used only once.
    # Before replacement
    def _cosh_rec_cloud(data):
        exp_pos = tbe.dsl.vexp(data)
        neg_exp = tbe.dsl.vmuls(data, tvm.const(NUM_MINUS_ONE, "float32"))
        neg_exp_pos = tbe.dsl.vexp(neg_exp)
        base = tbe.dsl.vadd(exp_pos, neg_exp_pos)
        base_rec = tbe.dsl.vrec(base)
        res = tbe.dsl.vmuls(base_rec, tvm.const(NUM_TWO, "float32"))
        return res

    cosh_value_rec = _cosh_rec_cloud(data_y)

    # After replacement
    exp_pos = tbe.dsl.vexp(data_y)
    neg_exp = tbe.dsl.vmuls(data_y, tvm.const(NUM_MINUS_ONE, "float32"))
    neg_exp_pos = tbe.dsl.vexp(neg_exp)
    base = tbe.dsl.vadd(exp_pos, neg_exp_pos)
    base_rec = tbe.dsl.vrec(base)
    cosh_value_rec = tbe.dsl.vmuls(base_rec, tvm.const(NUM_TWO, "float32"))
- Avoid separately defining tvm.const for DSL operators to improve performance.
Skip the separate tvm.const definition, especially for values that are used only once, and define the constant inline where it is used. See the following example:
    # Before replacement
    cosh_one = tvm.const(NUM_ONE, "float32")
    tensor_one = tbe.dsl.broadcast(cosh_one, data_y.shape)

    # After replacement
    tensor_one = tbe.dsl.broadcast(tvm.const(NUM_ONE, "float32"), data_y.shape)