Loss Scaling on NPU
Objective
In mixed precision training, float16 has a narrower dynamic range than float32, so gradient computation can overflow or underflow and parameter updates can fail. Loss scaling helps prevent the divergence that results.
Loss scaling multiplies the loss obtained from forward computation by a loss scaling factor S, which amplifies the gradients by the same factor during backward propagation. This prevents underflow caused by small gradient values that cannot be represented in float16. After the parameter gradients are aggregated and before the optimizer updates the parameters, the aggregated gradients are divided by S to restore their original values.
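The effect of scaling and unscaling can be illustrated with a minimal sketch that uses only the Python standard library to round values to float16. The factor S = 1024 and the gradient value 1e-8 are assumptions chosen for illustration, not values taken from the NPU software.

```python
import struct


def to_float16(x: float) -> float:
    """Round x to the nearest float16 value and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


S = 1024.0        # assumed loss scaling factor (illustrative)
true_grad = 1e-8  # assumed gradient, below the smallest positive float16 value

# Without scaling, the gradient is rounded to zero in float16.
unscaled = to_float16(true_grad)

# Scaling the loss by S scales the gradient by S (chain rule),
# which moves it into the representable float16 range.
scaled = to_float16(true_grad * S)

# The division by S is done in higher precision before the update.
restored = scaled / S

print(unscaled, scaled, restored)
```

In this sketch the unscaled gradient is lost entirely, while the scaled gradient survives the float16 round trip and is restored to within float16 rounding error of the original value.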
Dynamic loss scaling checks the gradients for floating-point exceptions during training and adjusts the loss scaling factor S adaptively as the gradients change.
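One common adjustment policy is sketched below: the factor is reduced when a non-finite gradient is detected and the step is skipped, and it is increased again after a run of clean steps. The class name, field names, and default values are hypothetical and are not the LossScaleManager API; the sketch only illustrates the idea.

```python
import math
from dataclasses import dataclass


@dataclass
class DynamicLossScale:
    """Hypothetical dynamic loss scale policy (illustrative only)."""
    scale: float = 2.0 ** 15   # current loss scaling factor S
    growth_interval: int = 3   # clean steps required before S is increased
    factor: float = 2.0        # multiplicative change applied to S
    good_steps: int = 0        # consecutive steps without an exception

    def update(self, grads) -> bool:
        """Adjust S from one step's gradients; return True if the step should be applied."""
        if any(not math.isfinite(g) for g in grads):
            # Exception detected: shrink S and skip this update.
            self.scale = max(self.scale / self.factor, 1.0)
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            # Enough clean steps in a row: try a larger S again.
            self.scale *= self.factor
            self.good_steps = 0
        return True


scaler = DynamicLossScale(scale=2.0 ** 15, growth_interval=3)
steps = ([0.1], [float("inf")], [0.2], [0.3], [0.1])  # made-up gradients
applied = [scaler.update(g) for g in steps]
print(applied, scaler.scale)
```

The second simulated step contains an infinite gradient, so it is skipped and S is halved; the three clean steps that follow bring S back to its starting value.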
In practice, floating-point computing on the Ascend AI Processor differs from that on the GPU or CPU, so floating-point exception detection may produce different results. You therefore need to configure loss scaling properly on the NPU.
Procedure
- Follow Using Loss Scale to modify manually ported scripts.
- If the scripts were ported by the tool, the related APIs have already been ported by default.
- You may need to modify the LossScaleManager parameters, because the NPU differs from the GPU in mixed precision computing. If training with the default loss scaling parameters detects floating-point exceptions on too many iterations and accuracy suffers, adjust the parameters to reduce the number of exceptions. To do so, print the loss scale value by following Printing the Loss Scale Value, determine the number of overflows from the printed values, and adjust the LossScaleManager parameters accordingly.
- If, after loss scaling is enabled, a step on which a loss scaling overflow (or underflow) occurs needs to be discarded, follow the procedure described in Updating the Global Step.
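As a rough aid for the parameter adjustment described in the list above, the printed loss scale values can be post-processed to estimate how often an overflow occurred. The sketch below assumes that the values have already been collected into a Python list, one per step, and that the loss scale decreases only when an exception is detected. The function name and the sample values are hypothetical.

```python
def count_scale_drops(loss_scales):
    """Count the steps at which the loss scale is lower than on the previous step."""
    return sum(1 for prev, cur in zip(loss_scales, loss_scales[1:]) if cur < prev)


# Made-up loss scale values, one per training step.
recorded = [65536.0, 65536.0, 32768.0, 32768.0, 16384.0, 16384.0, 16384.0, 32768.0]

drops = count_scale_drops(recorded)
drop_ratio = drops / (len(recorded) - 1)
print(f"{drops} drops over {len(recorded) - 1} transitions ({drop_ratio:.0%})")
```

If the manager is configured to decrease the scale only after several exceptions, each drop corresponds to more than one overflow, so the count should be read as a lower bound.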