How Do I Address Low-Performance ReduceSum Operator During Network Debugging?

Symptom

During network debugging, the overall performance is low, and the network's profiling result shows that the ReduceSum operator performs poorly (see the Performance Tuning Tool User Guide for instructions on using the Profiling tool).

The profiling result of the ReduceSum operator shows that the input data type is DT_FLOAT16 and the value of block_dim is 1, which indicates that multiple blocks are not enabled for the operator.

Solution

For the built-in ReduceSum operator adapted to the Ascend AI Processor, or a custom ReduceSum operator developed with the TBE DSL API reduce_sum, multiple blocks cannot be enabled for float16 inputs due to hardware restrictions.

Take the ReduceSum operator as an example. If its input data is float16, there are two solutions:

  • Mixed precision is not enabled during network debugging, and ReduceSum's input is of type float16. In this case, if ReduceSum's performance is poor, insert a Cast operator before the ReduceSum operator to cast the input from float16 to float32.

    Multiple blocks can be enabled when the input data is of the float32 type, which improves the operator's performance.
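
    The inserted Cast can be sketched in TensorFlow as follows (a minimal illustration; the tensor values and variable names are made up for this example, not taken from any real network):

    ```python
    import tensorflow as tf

    # Hypothetical float16 input to ReduceSum (values are illustrative).
    x_fp16 = tf.constant([[1.0, 2.0], [3.0, 4.0]], dtype=tf.float16)

    # Inserted Cast operator: float16 -> float32, so that multiple blocks
    # can be enabled for the downstream ReduceSum.
    x_fp32 = tf.cast(x_fp16, tf.float32)

    # ReduceSum now computes on float32 input.
    y = tf.reduce_sum(x_fp32, axis=1)
    ```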

  • Mixed precision is enabled during network debugging, and ReduceSum's input data type is cast from float32 to float16. In this case, add the ReduceSum operator to the blocklist for mixed precision so that its input is not cast to float16 during network debugging, preventing performance deterioration of the ReduceSum operator.

    To add the ReduceSum operator to the blocklist for mixed precision, perform the following steps:

    1. Specify the operator on the blocklist for mixed precision by using modify_mixlist.

      Example:

      # In Estimator mode
      npu_config = NPURunConfig(
        ...
        precision_mode="allow_mix_precision",
        modify_mixlist="/home/test/ops_info.json"
      )

      # In sess.run mode
      config = tf.ConfigProto()
      custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
      custom_op.name = "NpuOptimizer"
      custom_op.parameter_map["use_off_line"].b = True
      custom_op.parameter_map["precision_mode"].s = tf.compat.as_bytes("allow_mix_precision")
      custom_op.parameter_map["modify_mixlist"].s = tf.compat.as_bytes("/home/test/ops_info.json")
      ...
    2. Configure the operator blocklist in the ops_info.json file. The following is a configuration example.
      {
          "black-list": {
              "to-add": ["ReduceSumD"]
          }
      }

      For details, see "Manual Porting and Training" > "Additional Features" > "Training with Mixed Precision" in TensorFlow 1.15 Model Porting Guide.

The preceding solutions improve the ReduceSum operator's performance only under the conditions described in Symptom.