NPUOptimizer Constructor
Description
- Loss scaling: Loss scaling can be enabled during mixed precision training to mitigate the gradient underflow caused by the narrow representable range of float16.
- Distributed training: Wrapping a single-server training optimizer produces an NPU distributed training optimizer that aggregates the calculated gradients in single-server single-device, single-server multi-device, and multi-server multi-device networking modes.
- Communication tailing optimization: By adjusting computation dependencies, computation operations that do not depend on the last AR (gradient aggregation fragment) are scheduled to run in parallel with that AR, reducing the communication tail at the end of each step.
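To see why loss scaling is needed, the underflow described above can be reproduced with plain Python: the `struct` format `'e'` round-trips a value through IEEE-754 half precision (float16). This is an illustrative sketch only, not part of the npu_bridge API; the scale factor `2.0 ** 15` is an arbitrary example value.

```python
import struct

def to_fp16(x):
    # Round-trip a Python float through IEEE-754 half precision (float16).
    return struct.unpack('e', struct.pack('e', x))[0]

# A gradient below the smallest float16 subnormal (about 6e-8) rounds to zero.
grad = 1e-8
assert to_fp16(grad) == 0.0  # underflow: the gradient is lost

# Multiplying the loss (and hence every gradient) by a large scale keeps the
# value inside the float16 range; the optimizer divides the scale back out
# before applying the weight update.
loss_scale = 2.0 ** 15
scaled = to_fp16(grad * loss_scale)
assert scaled != 0.0          # representable after scaling
recovered = scaled / loss_scale
```

Static update keeps `loss_scale` fixed for the whole run; dynamic update adjusts it during training.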
Prototype
def __init__(self, opt, loss_scale_manager=None, is_distributed=False, is_loss_scale=False, is_tailing_optimization=False, name=None)
Options
| Option | Input/Output | Description |
|---|---|---|
| opt | Input | Single-server training optimizer used for gradient calculation and weight update. |
| loss_scale_manager | Input | Required only when is_loss_scale is set to True, that is, when loss scaling is enabled. Determines the update mode of loss scaling: static update or dynamic update. |
| is_distributed | Input | Whether to enable distributed training. |
| is_loss_scale | Input | Whether to enable loss scaling. |
| is_tailing_optimization | Input | Whether to enable communication tailing optimization to improve training performance. Takes effect only when is_distributed is set to True. The value must be the same as that set in the NPURunConfig constructor. |
| name | Input | Name of the optimizer. |
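The dynamic update mode mentioned for loss_scale_manager can be pictured as the following policy: halve the scale after a run of overflowed (NaN/Inf) steps, and grow it after a run of clean steps. This is a hedged, illustrative re-implementation of that policy in plain Python, not the ExponentialUpdateLossScaleManager class itself, whose exact counter-reset behavior may differ.

```python
class DynamicLossScale:
    """Sketch of a dynamic loss-scale policy (illustrative, not npu_bridge)."""

    def __init__(self, init_loss_scale=2.0 ** 32, incr_every_n_steps=1000,
                 decr_every_n_nan_or_inf=2, incr_ratio=2.0, decr_ratio=0.5):
        self.loss_scale = init_loss_scale
        self.incr_every_n_steps = incr_every_n_steps
        self.decr_every_n_nan_or_inf = decr_every_n_nan_or_inf
        self.incr_ratio = incr_ratio
        self.decr_ratio = decr_ratio
        self._good_steps = 0
        self._bad_steps = 0

    def update(self, grads_finite):
        # Called once per training step with whether all gradients were finite.
        if grads_finite:
            self._good_steps += 1
            self._bad_steps = 0
            if self._good_steps >= self.incr_every_n_steps:
                # Enough consecutive clean steps: try a larger scale.
                self.loss_scale *= self.incr_ratio
                self._good_steps = 0
        else:
            self._bad_steps += 1
            self._good_steps = 0
            if self._bad_steps >= self.decr_every_n_nan_or_inf:
                # Repeated overflow: shrink the scale (never below 1).
                self.loss_scale = max(1.0, self.loss_scale * self.decr_ratio)
                self._bad_steps = 0
```

The static mode (FixedLossScaleManager in the example below) simply keeps `loss_scale` constant.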
Returns
An object of the NPUOptimizer class
Example
```python
import tensorflow as tf
from npu_bridge.npu_init import *

# Define a single-server optimizer.
optimizer = LAMBOptimizer(
    learning_rate=learning_rate,
    weight_decay_rate=0.01,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-6,
    exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])

# Enable loss scaling.
if tf.flags.FLAGS.npu_bert_loss_scale not in [None, -1]:
    if tf.flags.FLAGS.npu_bert_loss_scale == 0:
        # Dynamic loss scale update.
        loss_scale_manager = ExponentialUpdateLossScaleManager(
            init_loss_scale=tf.flags.FLAGS.init_loss_scale_value,
            incr_every_n_steps=1000,
            decr_every_n_nan_or_inf=2,
            decr_ratio=0.5)
    elif tf.flags.FLAGS.npu_bert_loss_scale >= 1:
        # Static loss scale update.
        loss_scale_manager = FixedLossScaleManager(
            loss_scale=tf.flags.FLAGS.npu_bert_loss_scale)
    else:
        raise ValueError("Invalid loss scale: %d" % tf.flags.FLAGS.npu_bert_loss_scale)
    optimizer = NPUOptimizer(optimizer, loss_scale_manager,
                             is_distributed=tf.flags.FLAGS.distributed,
                             is_loss_scale=True,
                             is_tailing_optimization=True)
# Disable loss scaling.
else:
    optimizer = NPUOptimizer(optimizer,
                             is_distributed=tf.flags.FLAGS.distributed)
```