Loss Scaling

Overview

In mixed precision computing, using the float16 data type narrows the dynamic range of data. Gradient calculations can then overflow or underflow, and some parameter updates fail as a result. Loss scaling prevents the resulting divergence during mixed precision training.

Loss scaling is a method that amplifies gradients during backward propagation by multiplying the loss obtained from forward computation by a loss scale factor S. This effectively prevents underflow caused by small gradient values being unrepresentable in float16 during floating-point computation. After the parameter gradient aggregation and before the optimizer updates parameters, the aggregated parameter gradient is multiplied by 1/S.
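The underflow described above can be reproduced without any framework. The following is a minimal sketch that rounds values to float16 using the Python standard library. The helper to_float16, the gradient value 1e-8, and the scale factor 1024 are illustrative assumptions and are not part of any NPU or TensorFlow API.

```python
import struct

def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

S = 1024.0        # illustrative loss scale factor (assumption)
true_grad = 1e-8  # smaller than the smallest positive float16 value (about 6e-8)

# Without scaling, the gradient cannot be represented in float16 and becomes zero.
unscaled = to_float16(true_grad)

# Scaling the loss by S scales the gradient by S, which moves it into float16 range.
scaled = to_float16(true_grad * S)

# Multiplying by 1/S in higher precision recovers a close approximation.
recovered = scaled / S

print(unscaled, scaled, recovered)
```

The recovered value is not exact because the scaled gradient is still rounded to float16, but it is no longer lost entirely.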

Dynamic loss scaling checks gradients for floating-point exceptions during training and adapts the loss scale factor S as the gradients change over the course of training.

In specific implementation:

For the Atlas A3 training products/Atlas A3 inference products and the Atlas A2 training products/Atlas A2 inference products, the overflow/underflow mode of floating-point computation can be saturation or Inf/NaN. Retain the default Inf/NaN mode. The saturation mode is provided only for compatibility with earlier versions and will not evolve in the future. In addition, the computing accuracy in this mode may be unreliable.

For Atlas training products, the default overflow/underflow mode of floating-point computation is saturation mode, and only the saturation mode is supported. This means that when an overflow occurs during computation, the computation result is saturated to a floating-point extreme value (±MAX).

In saturation mode, operations such as floating-point exception check of the Ascend AI Processor are different from those of the GPU due to various floating-point computation features. In this scenario, you need to enable loss scaling or port scripts based on the original loss scaling by referring to this section.
In Inf/NaN mode, directly use the native loss scaling of TensorFlow, without porting the function. If you have ported loss scaling by referring to this section, your network scripts can still run properly.
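The difference between the two modes matters for overflow detection. The following minimal sketch models both behaviors for float16 in plain Python. The function names are hypothetical, and the model is simplified (it ignores rounding near the limit); it does not describe how the Ascend AI Processor is implemented.

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value

def overflow_inf_nan(x):
    """Simplified Inf/NaN mode: out-of-range magnitudes become +/-inf."""
    if math.isnan(x):
        return x
    if abs(x) > FP16_MAX:
        return math.copysign(math.inf, x)
    return x

def overflow_saturate(x):
    """Simplified saturation mode: out-of-range magnitudes are clamped to +/-MAX."""
    if math.isnan(x):
        return x
    return max(-FP16_MAX, min(FP16_MAX, x))

big = 1.0e6  # a value outside the float16 range
print(overflow_inf_nan(big), overflow_saturate(big))
```

A saturated result is finite, so a check that only looks for Inf or NaN in the gradients would not flag it. This is one way to see why the exception check in saturation mode has to work differently.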

Principles

Compute procedure with dynamic loss scaling
1. Maintain a primary copy of weights in float32.
2. Initialize the loss scaling factor S to a large value.
3. For each iteration:
  1. Cast the primary copy of weights from float32 to float16.
  2. Perform forward propagation to obtain the loss.
  3. Multiply the resulting loss with S.
  4. Perform backpropagation to obtain the gradients.
  5. Perform gradient aggregation in distributed training.
  6. If Inf or NaN is detected in the gradients, reduce S, skip the parameter update, and proceed to the next iteration.
  7. Multiply the weight gradient with 1/S.
  8. Update weights using the optimizer.
  9. If no Inf or NaN is found in the last N iterations, increase S. N is configurable.
  Figure 1 Compute procedure with dynamic loss scale
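
The bookkeeping in steps 6 to 9 can be sketched in plain Python. The class below is a hypothetical illustration, not the ExponentialUpdateLossScaleManager implementation. It assumes that S is halved on every overflow and doubled after N consecutive clean iterations; the real managers expose more parameters.

```python
import math

class DynamicLossScale:
    """Toy tracker for the loss scale factor S (hypothetical helper)."""

    def __init__(self, init_scale=2.0 ** 15, incr_every_n_steps=1000, factor=2.0):
        self.scale = init_scale
        self.incr_every_n_steps = incr_every_n_steps
        self.factor = factor
        self.good_steps = 0

    def step(self, scaled_grads):
        """Return unscaled gradients, or None if the update must be skipped."""
        # Step 6: on Inf/NaN, reduce S and skip the parameter update.
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            self.scale /= self.factor
            self.good_steps = 0
            return None
        # Step 7: multiply the gradients by 1/S.
        unscaled = [g / self.scale for g in scaled_grads]
        # Step 9: after N clean iterations, increase S.
        self.good_steps += 1
        if self.good_steps >= self.incr_every_n_steps:
            self.scale *= self.factor
            self.good_steps = 0
        return unscaled

scaler = DynamicLossScale(init_scale=1024.0, incr_every_n_steps=2)
first = scaler.step([float("inf"), 1.0])  # overflow: update skipped
second = scaler.step([512.0])             # clean step at the reduced scale
```

The optimizer update in step 8 would consume the returned list; a None result means the iteration is discarded.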

Using Loss Scale

Automated porting
If loss scaling is enabled on the original network, in automated porting scenarios, the tool automatically ports LossScaleManager of TensorFlow to ExponentialUpdateLossScaleManager or FixedLossScaleManager of NPUs. If loss scaling is not used on the original network, you can add it as required by referring to this section.

Manual porting

If loss scaling is enabled on the original network, you need to port LossScaleOptimizer to the NPULossScaleOptimizer or NPUOptimizer constructor. The following uses NPULossScaleOptimizer as an example.

Static loss scaling: You can use a fixed loss scale factor during mixed precision training.
When enabling static loss scaling, instantiate a FixedLossScaleManager class before creating NPULossScaleOptimizer to specify the loss scaling parameters.
Dynamic loss scaling: You can adjust the loss scale factor based on the abnormal status of floating-point computation during mixed precision training.
When enabling dynamic loss scaling, instantiate an ExponentialUpdateLossScaleManager class before creating NPULossScaleOptimizer to dynamically manage the loss scaling parameters.

The objects of the ExponentialUpdateLossScaleManager class cannot be constructed within the influence range of the tf.control_dependencies() interface. Otherwise, the graph structure execution sequence may be different from the expected sequence. For details, see What Do I Do If an NPULossScaleOptimizer Error Occur?.

In distributed training, set is_distributed in NPULossScaleOptimizer to True so that loss scaling works across devices. In single-device training, retain the default value False for is_distributed. Otherwise, training exceptions may occur.

Original TensorFlow code:

if FLAGS.use_fp16 and (FLAGS.bert_loss_scale not in [None, -1]):
  opt_tmp = opt
  if FLAGS.bert_loss_scale == 0:
    loss_scale_manager = tf.contrib.mixed_precision.ExponentialUpdateLossScaleManager(init_loss_scale=2**32, incr_every_n_steps=1000, decr_every_n_nan_or_inf=2, decr_ratio=0.5)
  elif FLAGS.bert_loss_scale >= 1:
    loss_scale_manager = tf.contrib.mixed_precision.FixedLossScaleManager(loss_scale=FLAGS.bert_loss_scale)
  else:
    raise ValueError("Invalid loss scale: %d" % FLAGS.bert_loss_scale)
  opt = tf.contrib.mixed_precision.LossScaleOptimizer(opt_tmp, loss_scale_manager)

Code after porting:

from npu_bridge.npu_init import *

if FLAGS.use_fp16 and (FLAGS.bert_loss_scale not in [None, -1]):
  opt_tmp = opt
  if FLAGS.bert_loss_scale == 0:
    loss_scale_manager = ExponentialUpdateLossScaleManager(init_loss_scale=2**32, incr_every_n_steps=1000, decr_every_n_nan_or_inf=2, decr_ratio=0.5)
  elif FLAGS.bert_loss_scale >= 1:
    loss_scale_manager = FixedLossScaleManager(loss_scale=FLAGS.bert_loss_scale)
  else:
    raise ValueError("Invalid loss scale: %d" % FLAGS.bert_loss_scale)
  # Check whether the number of devices is greater than 1. If yes, perform distributed training.
  if ops_adapter.size() > 1:
    opt_tmp = npu_distributed_optimizer_wrapper(opt_tmp)
    opt = NPULossScaleOptimizer(opt_tmp, loss_scale_manager, is_distributed=True)
  else:
    opt = NPULossScaleOptimizer(opt_tmp, loss_scale_manager)

In addition, if loss scaling is not enabled in the original code, add the following lines, which use static loss scaling as an example:

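loss_scale_manager = FixedLossScaleManager(loss_scale=1024)
# Wrap the existing optimizer so that the loss is scaled and the gradients are unscaled.
optimizer = NPULossScaleOptimizer(optimizer, loss_scale_manager)
# minimize() returns the training op, so keep it under a separate name.
train_op = optimizer.minimize(self.loss)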

You may need to modify LossScaleManager parameters, as the NPU differs from the GPU in mixed precision computing. If training with the default loss scale parameters results in too many overflow iterations and affects final accuracy, you need to adjust the loss scaling parameters accordingly to reduce floating-point exceptions.

Modification method: Print the loss scale value by following Printing the Loss Scale Value, check the number of overflows based on the value and adjust the LossScaleManager parameters.
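
As a rough aid for this check, the printed loss scale values can be scanned for decreases. The following is a minimal sketch with a hypothetical helper and made-up log values. It assumes dynamic loss scaling, where a drop in the loss scale indicates that overflow was detected. If decr_every_n_nan_or_inf is greater than 1, each decrease may correspond to more than one overflow step, so the count is a lower bound.

```python
def count_scale_decreases(loss_scale_log):
    """Count steps at which the logged loss scale dropped below the previous value."""
    pairs = zip(loss_scale_log, loss_scale_log[1:])
    return sum(1 for prev, cur in pairs if cur < prev)

# Hypothetical values collected from the printed loss scale, one per step.
log = [65536.0, 65536.0, 32768.0, 32768.0, 16384.0, 32768.0]

decreases = count_scale_decreases(log)
decrease_ratio = decreases / (len(log) - 1)
print(decreases, decrease_ratio)
```

A high ratio suggests that the initial loss scale is too large or is increased too often for the network.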

Updating the Global Step

After loss scaling is enabled, the step where the loss scaling overflow/underflow occurs needs to be discarded. For details, see the update step logic of the optimizer.

In most cases, tf.train.MomentumOptimizer used in networks such as ResNet-50HC updates the global step in apply_gradients. This ensures the step is not updated when overflow/underflow occurs, so no script modifications are required.
However, for some networks (such as BERT), the global step update, including the judgment logic, is implemented in create_optimizer. In this case, the global step update needs to be moved to the optimizer. The following is a porting example:

In the original TensorFlow code, the global step is updated in create_optimizer, including the judgment logic.

def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, hvd=None, manual_fp16=False, use_fp16=False, num_accumulation_steps=1,
                     optimizer_type="adam", allreduce_post_accumulation=False):
  ...
      if tf.flags.FLAGS.npu_bert_clip_by_global_norm:
        new_global_step = tf.cond(all_are_finite, lambda: global_step + 1, lambda: global_step)
      else:
        new_global_step = global_step + 1
      new_global_step = tf.identity(new_global_step, name='step_update')
      train_op = tf.group(train_op, [global_step.assign(new_global_step)])
  return train_op

During the porting to the Ascend platform, you need to update the global step in the optimizer as follows:

Comment out the global step update logic implemented in create_optimizer in the script.

def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, hvd=None, manual_fp16=False, use_fp16=False, num_accumulation_steps=1,
                     optimizer_type="adam", allreduce_post_accumulation=False):
  ...
      #if tf.flags.FLAGS.npu_bert_clip_by_global_norm:
      #  new_global_step = tf.cond(all_are_finite, lambda: global_step + 1, lambda: global_step)
      #else:
      #  new_global_step = global_step + 1
      #new_global_step = tf.identity(new_global_step, name='step_update')
      #train_op = tf.group(train_op, [global_step.assign(new_global_step)])
  return train_op

Before the last return statement of the apply_gradients function in both the AdamWeightDecayOptimizer and LAMBOptimizer classes, add the logic for updating the global step. The apply_gradients function is called only when the status check detects no loss scaling overflow/underflow.

  def apply_gradients(self, grads_and_vars, global_step=None, name=None,
      manual_fp16=False):
    assignments = []
    for (grad, param) in grads_and_vars:
        ...
    new_global_step = global_step + 1
    new_global_step = tf.identity(new_global_step, name='step_update')
    assignments.extend([global_step.assign(new_global_step)])
    return tf.group(*assignments, name=name)
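
The effect of this change can be checked with a toy simulation in plain Python. The function below is a hypothetical stand-in for the training loop and does not use TensorFlow. It assumes, as stated above, that apply_gradients runs only on iterations without overflow.

```python
def simulate_global_step(overflow_flags):
    """Return the final global step when the step is incremented inside apply_gradients."""
    global_step = 0
    for overflowed in overflow_flags:
        if overflowed:
            # The update is skipped, so apply_gradients is never called.
            continue
        # Stand-in for the increment added to apply_gradients.
        global_step += 1
    return global_step

# Five iterations, two of which overflow.
final_step = simulate_global_step([False, True, False, False, True])
print(final_step)
```

Only the three iterations that applied an update advance the global step, so step-dependent logic such as a learning rate schedule is not advanced by discarded iterations.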

Printing the Loss Scale Value

In Estimator mode, the loss scale value can be printed by adding a hook.

class _LogSessionRunHook(tf.train.SessionRunHook):
    def before_run(self, run_context):
        # Fetch the loss scale tensor by name on every step.
        return tf.estimator.SessionRunArgs(
            fetches=['loss_scale:0'])

    def after_run(self, run_context, run_values):
        print('loss scale value=%d' % run_values.results[0], flush=True)
  
...

if 'train' in params.exec_mode:
    training_hooks = get_hooks(params, logger)
    training_hooks.append(_LogSessionRunHook())
    estimator.train(
        input_fn = dataset.train_fn,
        steps = max_steps,
        hooks = training_hooks)

Note that the preceding hook does not apply to all networks because it fetches the loss scale value by tensor name ('loss_scale:0'). If operator names in the network are changed by a scope or similar mechanism, change the fetched name in the hook to that of the desired operator.
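
One way to find the right name is to filter the graph's node names for a keyword. The following minimal sketch uses a hypothetical helper and made-up node names. The ':0' suffix assumes that the first output of the node is the tensor of interest.

```python
def find_tensor_names(node_names, keyword="loss_scale"):
    """Return candidate tensor names (first output) for nodes whose name contains keyword."""
    return [name + ":0" for name in node_names if keyword in name]

# Hypothetical node names. In a TensorFlow 1.x script they could be collected with
# [n.name for n in tf.get_default_graph().as_graph_def().node]
node_names = ["global_step", "loss_scale", "tower_0/loss_scale", "train/Adam"]

candidates = find_tensor_names(node_names)
print(candidates)
```

Each candidate can then be tried as the fetches entry in before_run.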

In sess.run mode, you can call the get_loss_scale API to obtain the loss scale value from the loss scaling optimizer of the NPU.

# Original code
for step in range(restore_step, FLAGS.max_steps):
    data = next(data_generator)
    inputs_padded = data[0]
    bbox_padded = pad_bbox(data[1],FLAGS.num_bbox)
    input_image_np = inputs_padded
    input_bbox_np = bbox_padded

    ml, tl,ce_loss, bbox_loss, _, summary_str = sess.run([
                                       model_loss,
                                       total_loss, 
                                       rpn_cross_entropy,
                                       rpn_loss_box,
                                       train_op, summary_op],
                                       feed_dict={input_image: input_image_np,input_bbox: input_bbox_np})
    summary_writer.add_summary(summary_str, global_step=step)

# Tweaked code
for step in range(restore_step, FLAGS.max_steps):
    data = next(data_generator)
    inputs_padded = data[0]
    bbox_padded = pad_bbox(data[1],FLAGS.num_bbox)
    input_image_np = inputs_padded
    input_bbox_np = bbox_padded
    lossScale = loss_scale_manager.get_loss_scale()
    # Store the fetched step in global_step_val so that the global_step tensor
    # is not overwritten and can be fetched again in the next iteration.
    l_s, global_step_val, ml, tl, ce_loss, bbox_loss, _, summary_str = sess.run(
                                      [lossScale,
                                       global_step,
                                       model_loss,
                                       total_loss,
                                       rpn_cross_entropy,
                                       rpn_loss_box,
                                       train_op, summary_op],
                                       feed_dict={input_image: input_image_np, input_bbox: input_bbox_np})
    summary_writer.add_summary(summary_str, global_step=step)
    print('loss_scale is: ', l_s)
    print("global_step:", global_step_val)
