Mixed Precision Training

Objective

If the benchmark model is trained in float32 on the GPU or CPU, verify that its prediction accuracy does not change when it is trained in mixed precision (float16) on the GPU.

If the benchmark model is trained in mixed precision (float16) on the GPU, verify that its prediction accuracy does not change when it is trained in float32 on the GPU or CPU.

If the high-precision training result cannot be obtained due to memory or hardware limitations, verify instead that the model accuracy is stable across repeated mixed precision training runs.

Principle

Because the Ascend AI Processor (NPU) hardware architecture supports only mixed precision training, the user model must first be trained with mixed precision on the GPU to obtain a converged benchmark model. If the model has not been validated to converge under mixed precision training on the GPU, it may also fail to converge after being ported to the NPU.
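Float16's narrow dynamic range is what makes mixed precision fragile: gradient values below roughly 6e-8 flush to zero when cast to half precision. A minimal sketch using only the standard library (the values are illustrative, and `to_fp16` is a hypothetical helper, not part of any framework) shows the underflow and how loss scaling avoids it:

```python
# Round-trip a value through IEEE half precision (float16) using the
# standard library's "e" struct format.
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE float16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8                    # a typical small gradient
print(to_fp16(grad))           # 0.0 -- the update vanishes in float16

scale = 1024.0                 # loss scaling shifts it into float16 range
print(to_fp16(grad * scale))   # ~1.0e-05 -- survives the cast
# Dividing by `scale` after the backward pass (in float32) then recovers
# the true gradient magnitude.
```

This is why the procedure below insists on loss scaling: the loss (and hence every gradient) is multiplied by a scale before the float16 backward pass and divided out again in float32 before the weight update.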

If the comparison between mixed precision training and high-precision training does not meet the accuracy analysis requirements, the model is not qualified as a benchmark: float16 mixed precision training significantly perturbs its accuracy. In this case, adjust the model structure to remove the accuracy risk before running mixed precision training on either the GPU or the NPU.

Procedure

  1. Check that training with mixed precision is enabled for the benchmark model.
  2. Check that dynamic loss scaling is enabled.

    You can enable static loss scaling instead, but this is not recommended: a loss scaling value tuned on the GPU usually has to be re-adjusted on the NPU to avoid frequent overflow or underflow.

  3. Instrument the user model to print floating-point exception information. For details, see Loss Scaling on NPU.
  4. Check the proportion of training steps that report floating-point exceptions during mixed precision training. A proportion below 0.5% (0.1% for a large global batch size) is recommended.
  5. Modify the initial value and scaling factor of the loss scaling to minimize the number of floating-point exceptions.
  6. Run training at least three times and check that the validation accuracy is stable. If any condition is not met, adjust the hyperparameters and model structure and repeat the preceding steps until all of them are.
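Steps 2, 4, and 5 can be sketched together. The class and parameter names below are illustrative (they mirror the common dynamic loss scaling scheme, not a specific Ascend API): the scale backs off when a step hits a floating-point exception, grows again after a run of clean steps, and the running overflow ratio is the quantity step 4 asks you to keep below 0.5%.

```python
# A minimal sketch of dynamic loss scaling (hypothetical helper):
# halve the scale on a floating-point exception and double it after
# `growth_interval` consecutive clean steps. The initial value and
# growth/backoff factors are the knobs referred to in step 5.
class DynamicLossScaler:
    def __init__(self, init_scale=2.0 ** 15, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._clean_steps = 0
        self.overflow_steps = 0
        self.total_steps = 0

    def update(self, found_overflow):
        """Post-step update; the optimizer step is skipped on overflow."""
        self.total_steps += 1
        if found_overflow:
            self.overflow_steps += 1
            self.scale *= self.backoff_factor     # back off on exception
            self._clean_steps = 0
        else:
            self._clean_steps += 1
            if self._clean_steps == self.growth_interval:
                self.scale *= self.growth_factor  # probe a larger scale
                self._clean_steps = 0

    def overflow_ratio(self):
        """Proportion of steps that raised a floating-point exception."""
        return self.overflow_steps / max(self.total_steps, 1)

scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=100)
for step in range(1000):
    overflow = (step % 500 == 499)  # simulate 2 overflows in 1000 steps
    scaler.update(overflow)
print(f"overflow ratio: {scaler.overflow_ratio():.2%}")  # 0.20% < 0.5%
```

If the ratio stays above the recommended threshold, lower the initial scale or increase the growth interval so the scaler probes larger values less aggressively.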