Loss Scaling on NPU
Objective
In mixed precision computing, the narrow dynamic range of float16 can cause floating-point overflow or underflow during gradient calculation, which in turn makes parameter updates fail. Loss scaling prevents such divergence during mixed precision training.
Loss scaling multiplies the loss computed in the forward pass by a loss scaling factor S before backpropagation, so that gradient values remain representable in float16. After the parameter gradients are aggregated, and before the optimizer updates the parameters, the aggregated gradients are divided by the same factor S.
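As a minimal, self-contained illustration of this mechanism (the factor S = 4096 below is an arbitrary example value, not a recommended setting), consider a gradient that underflows in float16 unless it is scaled first:

```python
import numpy as np

S = 4096.0                   # example loss scaling factor

tiny_grad = 1e-8             # a gradient value below the float16 range
print(np.float16(tiny_grad))         # 0.0: underflow, the update is lost

scaled = np.float16(tiny_grad * S)   # about 4.1e-05, representable in float16
recovered = np.float32(scaled) / S   # unscale in float32 before the update
print(recovered)                     # about 1e-08, the gradient survives
```

Because backpropagation is linear in the loss, scaling the loss by S scales every gradient by S, so dividing the aggregated gradients by S before the update recovers the original values.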
Dynamic loss scaling checks for floating-point exceptions in the gradients during training and adapts the loss scaling factor S as the gradients change.
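The adaptive policy can be sketched as follows. This class is illustrative only, not an API of any framework or porting tool; the initial scale, factor, and window values are placeholders, though they mirror common defaults for exponential-update loss scaling.

```python
class DynamicLossScale:
    """Sketch of a dynamic loss scaling policy (illustrative only)."""

    def __init__(self, init_scale=2.0**16, factor=2.0, window=2000):
        self.scale = init_scale   # current loss scaling factor S
        self.factor = factor      # multiplicative step for S
        self.window = window      # clean steps required before raising S
        self.good_steps = 0

    def update(self, overflow: bool):
        if overflow:
            # Gradients contained inf/NaN: shrink S and restart the window.
            self.scale = max(self.scale / self.factor, 1.0)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.window:
                # A full window without exceptions: it is safe to raise S.
                self.scale *= self.factor
                self.good_steps = 0
```

On each step, the training loop would scale the loss by `scale`, check the gradients for inf/NaN, and call `update(overflow)` before applying (or skipping) the parameter update.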
In practice, because floating-point computing on the Ascend AI Processor differs from that on GPUs and CPUs, floating-point exception detection may produce different results. You therefore need to configure loss scaling properly on the NPU.
Procedure
- If your scripts were ported manually, modify them by following Using Loss Scaling.
- If your scripts were ported by the migration tool, the related APIs have already been ported by default.
- You may need to modify the LossScaleManager parameters, because mixed precision computing on the NPU differs from that on the GPU. If accuracy degrades because underflow is detected on too many iterations under the default loss scaling parameters, adjust those parameters to reduce floating-point exceptions. To do so, print the loss scaling value by following Printing the Loss Scaling Value, check how often overflow or underflow occurs based on the printed values, and then adjust the LossScaleManager parameters accordingly (see the sketch after this list).
- If, after loss scaling is enabled, you need to discard the steps on which overflow (or underflow) occurs, follow the procedure described in Updating the Global Step.
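This section does not name the training framework, so as an illustration only, the sketch below assumes a TensorFlow 1.x script running with the Ascend npu_bridge plugin, where loss scaling is configured through ExponentialUpdateLossScaleManager and NPULossScaleOptimizer; import paths, class names, and defaults may differ across CANN versions, and the numeric values are starting points to tune against the printed loss scale values, not recommendations.

```python
import tensorflow as tf
# Assumed npu_bridge import paths; verify against your CANN version.
from npu_bridge.estimator.npu.npu_loss_scale_manager import \
    ExponentialUpdateLossScaleManager
from npu_bridge.estimator.npu.npu_loss_scale_optimizer import \
    NPULossScaleOptimizer

opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)

# If underflow is reported on too many iterations, start from a lower
# initial scale and raise S less eagerly.
loss_scale_manager = ExponentialUpdateLossScaleManager(
    init_loss_scale=2**16,       # placeholder: lower than a large default
    incr_every_n_steps=2000,     # raise S only after 2000 clean steps
    decr_every_n_nan_or_inf=2,   # lower S after 2 overflow steps
    incr_ratio=2,
    decr_ratio=0.5)

# Wrapping the optimizer lets steps with overflow skip the parameter
# update, which corresponds to discarding those steps as described above.
opt = NPULossScaleOptimizer(opt, loss_scale_manager)
```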