Loss Scaling on NPU

Objective

In mixed precision training, the narrow dynamic range of float16 can cause floating-point overflow or underflow during gradient computation, which in turn causes parameter updates to fail. Loss scaling helps prevent such divergence during mixed precision training.
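The following minimal NumPy illustration shows the dynamic-range problem: a gradient value that is perfectly representable in float32 underflows to zero in float16, and a moderately large intermediate value overflows to infinity.

```python
import numpy as np

# float16 range: smallest subnormal ~5.96e-8, largest finite value 65504.
print(np.finfo(np.float16).tiny)   # 6.104e-05 (smallest normal number)
print(np.finfo(np.float16).max)    # 65504.0

# A small gradient representable in float32 underflows to zero in float16,
# so the corresponding parameter update is silently lost.
grad = np.float32(1e-8)
print(np.float16(grad))            # 0.0 (underflow)

# A large intermediate value overflows to infinity.
print(np.float16(70000.0))         # inf (overflow)
```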

Loss scaling multiplies the loss computed in the forward pass by a loss scale S before backpropagation, so that gradient values do not become unrepresentable in float16. After the parameter gradients are aggregated and before the optimizer updates the parameters, the aggregated gradients are multiplied by 1/S to restore their original magnitude.
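The sketch below shows this static loss scaling flow in PyTorch-style Python. It is illustrative only: the model, optimizer, and loss function are placeholders, and the value of S is an arbitrary example.

```python
import torch

S = 1024.0  # static loss scale (illustrative value)

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    (loss * S).backward()              # scale the loss before backpropagation
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.grad.mul_(1.0 / S)   # unscale gradients before the update
    optimizer.step()
    return loss.item()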

Dynamic loss scaling checks for floating-point exceptions in the gradients during training and adapts the loss scale S to the changing gradient magnitudes: when an overflow or invalid value is detected, the parameter update is skipped and S is reduced; after a number of consecutive exception-free steps, S is increased again.
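A minimal sketch of this update rule is shown below. The class name, parameter names, and default values (initial scale, scale factor, scale window) are assumptions for illustration, not a specific framework API.

```python
import torch

class DynamicLossScaler:
    """Illustrative dynamic loss scaling: shrink S on overflow, grow it
    after `scale_window` consecutive overflow-free steps."""

    def __init__(self, init_scale=2.0**16, scale_factor=2.0, scale_window=2000):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.good_steps = 0

    def has_overflow(self, params):
        # A floating-point exception shows up as inf/NaN in the gradients.
        return any(p.grad is not None and not torch.isfinite(p.grad).all()
                   for p in params)

    def update(self, overflow):
        if overflow:
            # Overflow detected: reduce S; the caller skips this update.
            self.scale = max(self.scale / self.scale_factor, 1.0)
            self.good_steps = 0
        else:
            # Stable step: after scale_window good steps, grow S again.
            self.good_steps += 1
            if self.good_steps % self.scale_window == 0:
                self.scale *= self.scale_factor
```

In each training step, if has_overflow() is true the optimizer update is skipped and update(True) is called; otherwise the gradients are unscaled by 1/S and applied as in the static case.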

In practice, floating-point computation on the Ascend AI Processor differs from that on the GPU, so floating-point exception detection may produce different results. You therefore need to configure loss scaling properly on the NPU.
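As one example of such configuration, the sketch below wraps a TensorFlow 1.x optimizer with the loss-scale classes from the Ascend TensorFlow adapter (npu_bridge). The module paths and parameters shown are assumptions based on published Ascend samples (they mirror tf.contrib.mixed_precision) and should be verified against your CANN release.

```python
# Hedged sketch: module paths and parameters are assumptions based on
# Ascend TensorFlow (npu_bridge) samples; verify against your CANN version.
import tensorflow as tf
from npu_bridge.estimator.npu.npu_loss_scale_manager import (
    ExponentialUpdateLossScaleManager,  # dynamic loss scaling
)
from npu_bridge.estimator.npu.npu_loss_scale_optimizer import NPULossScaleOptimizer

opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)

# Dynamic loss scaling: start from a large S, shrink it on repeated
# inf/NaN detections, and grow it after a run of stable steps.
manager = ExponentialUpdateLossScaleManager(
    init_loss_scale=2**32,
    incr_every_n_steps=1000,
    decr_every_n_nan_or_inf=2,
    decr_ratio=0.5,
)
opt = NPULossScaleOptimizer(opt, manager)  # handles scaling, unscaling,
                                           # and skip-on-overflow internally
```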