DSL Performance Optimization

When the operator code is implemented in DSL mode, performance can be optimized in the following four ways:

  • Avoid time-consuming APIs.

    In the current version, instructions such as vrec, vsel, and vcmp are time-consuming. In performance-critical scenarios, you can often avoid them by rewriting the formula so that the same result is computed with cheaper instructions. For example, 1/exp(x) can be rewritten as exp(-x): compute -x first and then the exponential, which avoids the reciprocal.

    Note: Check that the accuracy remains acceptable after an instruction is replaced.

    Sample code

    # Before replacement
    res = tbe.dsl.vrec(vsqrt_res)
    # After replacement
    cosh_one = tvm.const(NUM_ONE, "float32")
    tensor_one = tbe.dsl.broadcast(cosh_one, data_y.shape)
    res = tbe.dsl.vdiv(tensor_one, vsqrt_res)
  • Reduce the total number of computations.

    Rewriting the formula can reduce the number of compute operations, which shortens the build time and improves performance.

    The new formula must be mathematically equivalent to the original, and its accuracy must stay within the acceptable range.

    For example, (1/vsqrt(x))*data_dy can be computed directly as data_dy/vsqrt(x).

    # Before modification
    vsqrt_res = tbe.dsl.vsqrt(num_to_vrsqrt)
    res = tbe.dsl.vdiv(tvm.const(NUM_ONE, "float32"), vsqrt_res)
    res = tbe.dsl.vmul(res, data_dy)
    # After modification
    vsqrt_res = tbe.dsl.vsqrt(num_to_vrsqrt)
    res = tbe.dsl.vdiv(data_dy, vsqrt_res)
  • Reduce function encapsulation.

    Avoid wrapping short computation sequences in helper functions. Inlining them removes the function call and return overhead.

    Note: It is advisable to keep each function to 15 variables or fewer. As a best practice, reuse variable names for intermediate variables that are used only once.

    # Before replacement
    def _cosh_rec_cloud(data):
        exp_pos = tbe.dsl.vexp(data)
        neg_exp = tbe.dsl.vmuls(data, tvm.const(NUM_MINUS_ONE, "float32"))
        neg_exp_pos = tbe.dsl.vexp(neg_exp)
        base = tbe.dsl.vadd(exp_pos, neg_exp_pos)
        base_rec = tbe.dsl.vrec(base)
        res = tbe.dsl.vmuls(base_rec, tvm.const(NUM_TWO, "float32"))
        return res
    cosh_value_rec = _cosh_rec_cloud(data_y)
    # After replacement
    exp_pos = tbe.dsl.vexp(data_y)
    neg_exp = tbe.dsl.vmuls(data_y, tvm.const(NUM_MINUS_ONE, "float32"))
    neg_exp_pos = tbe.dsl.vexp(neg_exp)
    base = tbe.dsl.vadd(exp_pos, neg_exp_pos)
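    base_rec = tbe.dsl.vrec(base)
    # Final step of the former helper, inlined so the result matches the "before" version
    cosh_value_rec = tbe.dsl.vmuls(base_rec, tvm.const(NUM_TWO, "float32"))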
  • Avoid defining tvm.const separately for DSL operators.

    Skip the separate tvm.const definition, especially for values that are used only once, and create the constant inline where it is used. See the following example:

    # Before replacement
    cosh_one = tvm.const(NUM_ONE, "float32")
    tensor_one = tbe.dsl.broadcast(cosh_one, data_y.shape)
    # After replacement
    tensor_one = tbe.dsl.broadcast(tvm.const(NUM_ONE, "float32"), data_y.shape)
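
Because the formula rewrites described above (1/exp(x) to exp(-x), and (1/vsqrt(x))*data_dy to data_dy/vsqrt(x)) must keep accuracy within an acceptable range, it can help to compare the two forms numerically before changing the operator. The following is only a minimal host-side sketch in plain Python, not TBE DSL code. The helper name max_rel_error, the sample range, the value of dy, and the TOLERANCE threshold are all hypothetical. The check runs in double precision, so the float32 or float16 behavior on the device may differ and should be verified separately.

```python
import math


def max_rel_error(reference, candidate, inputs):
    """Return the largest relative error of candidate against reference over inputs."""
    worst = 0.0
    for x in inputs:
        ref = reference(x)
        got = candidate(x)
        worst = max(worst, abs(got - ref) / abs(ref))
    return worst


# Hypothetical sample range; choose values that match the operator's real inputs.
samples = [i / 10.0 for i in range(-50, 51)]

# Rewrite 1: 1/exp(x) replaced by exp(-x).
err_exp = max_rel_error(
    lambda x: 1.0 / math.exp(x),
    lambda x: math.exp(-x),
    samples,
)

# Rewrite 2: (1/sqrt(x)) * dy replaced by dy / sqrt(x).
# dy stands in for data_dy and is a hypothetical fixed value; sqrt needs x > 0.
dy = 3.0
positive_samples = [s for s in samples if s > 0]
err_sqrt = max_rel_error(
    lambda x: (1.0 / math.sqrt(x)) * dy,
    lambda x: dy / math.sqrt(x),
    positive_samples,
)

# Hypothetical acceptance threshold for this double-precision check.
TOLERANCE = 1e-12

print("exp rewrite within tolerance:", err_exp <= TOLERANCE)
print("sqrt rewrite within tolerance:", err_sqrt <= TOLERANCE)
```

A check of this kind only shows that the two forms agree mathematically on the sampled inputs. The final decision should still be based on the operator's accuracy tests on the target hardware.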