Loss Scaling
Overview
In mixed precision computing, the float16 data type has a narrow dynamic range. Gradient calculations can therefore overflow or underflow, causing some parameter updates to fail. Loss scaling prevents this kind of divergence during mixed precision training.
Loss scaling refers to multiplying the loss obtained in the forward pass by a loss scaling factor S before backpropagation, so that gradient values do not become unrepresentable in float16. After the parameter gradients are aggregated and before the optimizer updates the parameters, the aggregated gradients are multiplied by 1/S.
Dynamic loss scaling checks for floating-point exceptions in the gradients during training and adapts the loss scaling factor S as the gradients change over the course of training.
In specific implementations, the behavior depends on the floating-point exception mode:
- In saturation mode, operations such as the floating-point exception check on the Ascend AI Processor differ from those on the GPU because of differences in floating-point computation behavior. In this scenario, enable loss scaling or port scripts based on the original loss scaling by referring to this section.
- In INF/NAN mode, use the native loss scaling of TensorFlow directly; no porting is required. If you have already ported loss scaling by referring to this section, your network scripts can still run properly.
Principles
- Dynamic loss scaling works as follows (see the sketch after Figure 1):
- Maintain a primary copy of weights in float32.
- Initialize the loss scaling factor S to a large value.
- For each iteration:
- Cast the primary copy of weights from float32 to float16.
- Perform forward propagation to obtain the loss.
- Multiply the resulting loss with the scaling factor S.
- Perform backpropagation to obtain the gradients.
- Perform gradient aggregation in distributed training.
- If the weight gradients contain an Inf or NaN, reduce S, skip the weight update, and move to the next iteration.
- Otherwise, multiply the weight gradients by 1/S.
- Update weights using the optimizer.
- If no Inf or NaN is found in the last N iterations, increase S. N is configurable.
Figure 1 Compute procedure with dynamic loss scaling
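The following is a minimal Python sketch of this procedure. The toy gradient, learning rate, and function name are illustrative assumptions and not part of any framework API; in real training, the framework's loss scale manager performs these steps.
import numpy as np

def train_with_dynamic_loss_scale(num_steps, init_loss_scale=2.0**32,
                                  incr_every_n_steps=1000, decr_ratio=0.5, lr=0.01):
    weights = np.zeros(4, dtype=np.float32)      # primary float32 copy of the weights
    loss_scale = init_loss_scale                 # initialize S to a large value
    good_steps = 0                               # consecutive iterations without overflow
    for _ in range(num_steps):
        w_fp16 = weights.astype(np.float16)      # cast the primary copy to float16
        # Placeholder forward/backward pass: gradient of a toy quadratic loss,
        # already multiplied by S (equivalent to scaling the loss before backprop).
        grads = (2 * w_fp16 - 1) * np.float16(loss_scale)
        if not np.all(np.isfinite(grads)):       # Inf/NaN found: reduce S, skip the update
            loss_scale *= decr_ratio
            good_steps = 0
            continue
        weights -= lr * (grads.astype(np.float32) / loss_scale)  # multiply by 1/S, then update
        good_steps += 1
        if good_steps >= incr_every_n_steps:     # no Inf/NaN for N iterations: increase S
            loss_scale *= 2.0
            good_steps = 0
    return weights, loss_scale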
Using Loss Scaling
- Automated porting
If loss scaling is enabled on the original network, the automated porting tool automatically migrates TensorFlow's LossScaleManager to the NPU's ExponentialUpdateLossScaleManager or FixedLossScaleManager. If loss scaling is not used on the original network, add it as required by referring to this section.
- Manual porting
If loss scaling is enabled on the original network, you need to port LossScaleOptimizer to the NPULossScaleOptimizer or NPUOptimizer constructor. The following uses NPULossScaleOptimizer as an example.
- Static loss scaling: You can use a fixed loss scaling factor during mixed precision training.
When enabling static loss scaling, instantiate a FixedLossScaleManager class before creating NPULossScaleOptimizer to specify loss scaling.
- Dynamic loss scaling: You can adjust the loss scaling factor based on the abnormal status of floating-point computation during mixed precision training.
When enabling dynamic loss scaling, instantiate an ExponentialUpdateLossScaleManager class before creating NPULossScaleOptimizer to dynamically specify loss scaling.
ExponentialUpdateLossScaleManager objects must not be constructed within the scope of the tf.control_dependencies() interface; otherwise, the graph execution order may differ from the expected order. For details, see What Do I Do If an NPULossScaleOptimizer Error Occurs?.
In distributed training, set is_distributed in NPULossScaleOptimizer to True to include loss scaling support in distributed training. In single-device training, retain the default value False for is_distributed in NPULossScaleOptimizer. Otherwise, training exceptions may occur.
Original TensorFlow code:
if FLAGS.use_fp16 and (FLAGS.bert_loss_scale not in [None, -1]):
    opt_tmp = opt
    if FLAGS.bert_loss_scale == 0:
        loss_scale_manager = tf.contrib.mixed_precision.ExponentialUpdateLossScaleManager(init_loss_scale=2**32, incr_every_n_steps=1000, decr_every_n_nan_or_inf=2, decr_ratio=0.5)
    elif FLAGS.bert_loss_scale >= 1:
        loss_scale_manager = tf.contrib.mixed_precision.FixedLossScaleManager(loss_scale=FLAGS.bert_loss_scale)
    else:
        raise ValueError("Invalid loss scale: %d" % FLAGS.bert_loss_scale)
    opt = tf.contrib.mixed_precision.LossScaleOptimizer(opt_tmp, loss_scale_manager)
Code after porting:
from npu_bridge.npu_init import *

if FLAGS.use_fp16 and (FLAGS.bert_loss_scale not in [None, -1]):
    opt_tmp = opt
    if FLAGS.bert_loss_scale == 0:
        loss_scale_manager = ExponentialUpdateLossScaleManager(init_loss_scale=2**32, incr_every_n_steps=1000, decr_every_n_nan_or_inf=2, decr_ratio=0.5)
    elif FLAGS.bert_loss_scale >= 1:
        loss_scale_manager = FixedLossScaleManager(loss_scale=FLAGS.bert_loss_scale)
    else:
        raise ValueError("Invalid loss scale: %d" % FLAGS.bert_loss_scale)
    # Check whether the number of devices is greater than 1. If yes, perform distributed training.
    if ops_adapter.size() > 1:
        opt_tmp = npu_distributed_optimizer_wrapper(opt_tmp)
        opt = NPULossScaleOptimizer(opt_tmp, loss_scale_manager, is_distributed=True)
    else:
        opt = NPULossScaleOptimizer(opt_tmp, loss_scale_manager)
In addition, if loss scaling is not enabled in the original code, add the following lines, which use static loss scaling as an example:
loss_scale_manager = FixedLossScaleManager(loss_scale=1024)
optimizer = NPULossScaleOptimizer(optimizer, loss_scale_manager)
optimizer = optimizer.minimize(self.loss)
Because the NPU differs from the GPU in mixed precision computing, you may need to adjust the LossScaleManager parameters. If accuracy loss occurs because overflow/underflow is detected in too many iterations under the default loss scaling parameters, adjust the parameters to reduce floating-point exceptions.
Modification method: Print the loss scaling value by following Printing the Loss Scaling Value, check how often overflow/underflow occurs based on that value, and then adjust the LossScaleManager parameters.
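For example, assuming the dynamic loss scaling setup from the porting example above (npu_bridge imported and an existing optimizer opt_tmp), a more conservative configuration might look like the following sketch; the parameter values are illustrative assumptions, not recommendations:
# Illustrative adjustment of ExponentialUpdateLossScaleManager parameters;
# tune the values according to how often overflow/underflow appears in the
# printed loss scaling value.
loss_scale_manager = ExponentialUpdateLossScaleManager(
    init_loss_scale=2**16,        # start lower than 2**32 to avoid early overflow
    incr_every_n_steps=2000,      # increase the scale less aggressively
    decr_every_n_nan_or_inf=1,    # decrease as soon as an overflow is detected
    decr_ratio=0.5)
opt = NPULossScaleOptimizer(opt_tmp, loss_scale_manager)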
Updating the Global Step
After loss scaling is enabled, any iteration in which a loss scaling overflow/underflow occurs must be discarded. Whether the script needs to be modified depends on where the optimizer updates the global step.
- In most cases, the global step is updated in apply_gradients (for example, tf.train.MomentumOptimizer used on the ResNet-50HC network). Because apply_gradients is skipped when overflow/underflow occurs, the step is not updated in that iteration and the script does not need to be modified.
- However, on the BERT network, the global step update, including the overflow judgment logic, is implemented in create_optimizer. In this case, the global step update needs to be moved into the optimizer. The following is a porting example:
In the original TensorFlow code, the global step is updated in create_optimizer, including the judgment logic.
def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, hvd=None, manual_fp16=False, use_fp16=False, num_accumulation_steps=1, optimizer_type="adam", allreduce_post_accumulation=False):
    ...
    if tf.flags.FLAGS.npu_bert_clip_by_global_norm:
        new_global_step = tf.cond(all_are_finite, lambda: global_step + 1, lambda: global_step)
    else:
        new_global_step = global_step + 1
    new_global_step = tf.identity(new_global_step, name='step_update')
    train_op = tf.group(train_op, [global_step.assign(new_global_step)])
    return train_op
During the porting to the Ascend platform, you need to update the global step in the optimizer as follows:
- Comment out the global step update logic implemented in create_optimizer in the script.
def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, hvd=None, manual_fp16=False, use_fp16=False, num_accumulation_steps=1, optimizer_type="adam", allreduce_post_accumulation=False):
    ...
    #if tf.flags.FLAGS.npu_bert_clip_by_global_norm:
    #    new_global_step = tf.cond(all_are_finite, lambda: global_step + 1, lambda: global_step)
    #else:
    #    new_global_step = global_step + 1
    #new_global_step = tf.identity(new_global_step, name='step_update')
    #train_op = tf.group(train_op, [global_step.assign(new_global_step)])
    return train_op
- In the AdamWeightDecayOptimizer and LAMBOptimizer classes, add the global step update logic before the last return statement of the apply_gradients function. apply_gradients is called only when no loss scaling overflow/underflow is detected in the status check.
def apply_gradients(self, grads_and_vars, global_step=None, name=None, manual_fp16=False):
    assignments = []
    for (grad, param) in grads_and_vars:
        ...
    new_global_step = global_step + 1
    new_global_step = tf.identity(new_global_step, name='step_update')
    assignments.extend([global_step.assign(new_global_step)])
    return tf.group(*assignments, name=name)
Printing the Loss Scaling Value
In Estimator mode, the loss scaling value can be printed by adding a hook.
class _LogSessionRunHook(tf.train.SessionRunHook):
    def before_run(self, run_context):
        return tf.estimator.SessionRunArgs(fetches=['loss_scale:0'])

    def after_run(self, run_context, run_values):
        print('loss scale value=%d' % run_values.results[0], flush=True)

...
if 'train' in params.exec_mode:
    training_hooks = get_hooks(params, logger)
    training_hooks.append(_LogSessionRunHook())
    estimator.train(
        input_fn=dataset.train_fn,
        steps=max_steps,
        hooks=training_hooks)
Note that the preceding hook does not apply to all networks because it fetches the loss scaling value by tensor name ('loss_scale:0'). If operator names in the network are altered by scopes or similar mechanisms, change the fetched name to that of the desired tensor.
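For example, if the loss scale tensor is created under a name scope, the hook needs to fetch the scoped name instead; the scope name below is a hypothetical placeholder, not a name produced by any framework:
class _ScopedLogSessionRunHook(tf.train.SessionRunHook):
    # 'my_scope/loss_scale:0' is a hypothetical tensor name; replace it with the
    # actual name of the loss scale tensor in your graph.
    def before_run(self, run_context):
        return tf.estimator.SessionRunArgs(fetches=['my_scope/loss_scale:0'])

    def after_run(self, run_context, run_values):
        print('loss scale value=%d' % run_values.results[0], flush=True)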
In sess.run mode, you can call the get_loss_scale interface of the NPU loss scale manager to obtain the loss scaling value.
# Original code
for step in range(restore_step, FLAGS.max_steps):
    data = next(data_generator)
    inputs_padded = data[0]
    bbox_padded = pad_bbox(data[1], FLAGS.num_bbox)
    input_image_np = inputs_padded
    input_bbox_np = bbox_padded

    ml, tl, ce_loss, bbox_loss, _, summary_str = sess.run(
        [model_loss, total_loss, rpn_cross_entropy, rpn_loss_box, train_op, summary_op],
        feed_dict={input_image: input_image_np, input_bbox: input_bbox_np})
    summary_writer.add_summary(summary_str, global_step=step)

# Tweaked code
for step in range(restore_step, FLAGS.max_steps):
    data = next(data_generator)
    inputs_padded = data[0]
    bbox_padded = pad_bbox(data[1], FLAGS.num_bbox)
    input_image_np = inputs_padded
    input_bbox_np = bbox_padded

    lossScale = loss_scale_manager.get_loss_scale()
    l_s, global_steppp, ml, tl, ce_loss, bbox_loss, _, summary_str = sess.run(
        [lossScale, global_step, model_loss, total_loss, rpn_cross_entropy, rpn_loss_box, train_op, summary_op],
        feed_dict={input_image: input_image_np, input_bbox: input_bbox_np})
    summary_writer.add_summary(summary_str, global_step=step)
    print('loss_scale is: ', l_s)
    print("global_step:", global_steppp)