Mixed Precision Training

Objective

If the benchmark model was trained in high-precision mode (GPU-fp32 or CPU-fp32), switch it to mixed precision training (GPU-fp16) and confirm that it reaches the same accuracy.

If the benchmark model was trained in mixed precision mode (GPU-fp16), also train it in high-precision mode (GPU-fp32 or CPU-fp32) and confirm that both modes reach the same accuracy.

If a high-precision training result cannot be obtained because of memory or hardware limitations, ensure that mixed precision training reaches a stable model accuracy across repeated runs.
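As a rough illustration of the accuracy checks described above, the following Python sketch compares validation accuracies collected from several runs. The function names `accuracy_matches` and `is_stable`, the sample accuracy values, and the 0.005 tolerance are all hypothetical; the actual thresholds should come from your own accuracy analysis requirements.

```python
from statistics import mean

def accuracy_matches(fp32_accs, fp16_accs, tolerance=0.005):
    """Return True if the mean accuracies of the two modes differ by at most `tolerance`.

    `tolerance` is an assumed value, not one prescribed by this guide.
    """
    return abs(mean(fp32_accs) - mean(fp16_accs)) <= tolerance

def is_stable(accs, max_spread=0.005):
    """Return True if repeated runs of one mode stay within `max_spread` of each other.

    `max_spread` is an assumed value, not one prescribed by this guide.
    """
    return max(accs) - min(accs) <= max_spread

# Hypothetical validation accuracies from three runs per mode.
fp32_runs = [0.761, 0.763, 0.762]
fp16_runs = [0.760, 0.762, 0.761]

same_accuracy = accuracy_matches(fp32_runs, fp16_runs)
fp16_stable = is_stable(fp16_runs)
```

When only mixed precision results are available, `is_stable` alone can serve as the check.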

Principle

Because the hardware architecture of the Ascend AI Processor (NPU) supports only mixed precision training, the user model must first be trained on the GPU with mixed precision to obtain a converged benchmark model. If convergence under mixed precision training has not been verified on the GPU, the model may fail to converge after being ported to the NPU.

A model does not qualify as a benchmark if the comparison between mixed precision training and high-precision training fails to meet the accuracy analysis requirements, because this means that computing in float16 significantly perturbs the model accuracy. In this case, adjust the model structure to remove the accuracy risk in mixed precision training on both the GPU and the NPU.
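To make the float16 perturbation concrete, here is a minimal sketch using only the Python standard library (the `struct` format `e` is IEEE 754 half precision). It shows a small value underflowing to zero in float16, the same value surviving once it is multiplied by a loss scale, and a large value overflowing. The helper names and the sample gradient and scale values are hypothetical and chosen only for illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def overflows_fp16(x: float) -> bool:
    """Return True if x is too large in magnitude to be stored in float16."""
    try:
        struct.pack("<e", x)
        return False
    except OverflowError:
        return True

grad = 1e-8      # hypothetical tiny gradient, below the smallest float16 subnormal
scale = 1024.0   # hypothetical loss scale

unscaled = to_fp16(grad)          # the gradient is lost: it rounds to zero
scaled = to_fp16(grad * scale)    # scaling moves it into the representable range
recovered = scaled / scale        # unscaling is done in higher precision

too_large = overflows_fp16(70000.0)  # above the float16 maximum of 65504
```

This is why loss scaling matters in mixed precision training: a scale that is too small loses gradients to underflow, while one that is too large pushes values past the float16 maximum.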

Procedure

  1. Check that training with mixed precision is enabled for the benchmark model.
  2. Check that dynamic loss scaling is enabled.

    Static loss scaling can also be enabled, but it is not recommended: a loss scale value tuned on the GPU may cause frequent overflow or underflow on the NPU, so you would need to adjust it again there.

  3. Modify the user model to print floating-point exception information. For details, see Loss Scaling on NPU.
  4. Check the proportion of floating-point exceptions reported during mixed precision training. A percentage less than 0.5% (0.1% for a large global batch size) is recommended.
  5. Adjust the initial value and scaling factor of the loss scale to minimize the number of floating-point exceptions.
  6. Run training more than three times and check that the validation accuracy is consistent across runs. If any of the preceding conditions is not met, adjust the hyperparameters and model structure, and repeat the preceding steps until all of them are met.
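The loss scaling steps in this procedure can be sketched in plain Python as follows. This is a toy model of dynamic loss scaling, not the implementation used by any particular framework or by the NPU: the class `DynamicLossScaler`, its parameters, and the sample gradients are hypothetical. It also assumes that the exception proportion is measured as the fraction of training steps in which a non-finite gradient was detected.

```python
import math

class DynamicLossScaler:
    """Toy dynamic loss scaler that also counts floating-point exceptions."""

    def __init__(self, init_scale=2.0**15, factor=2.0, growth_interval=1000):
        self.scale = init_scale                 # initial loss scale value
        self.factor = factor                    # scaling factor for growth and backoff
        self.growth_interval = growth_interval  # clean steps required before growing
        self.good_steps = 0
        self.total_steps = 0
        self.overflow_steps = 0

    def update(self, grads):
        """Return True if the step can be applied, False if it must be skipped."""
        self.total_steps += 1
        if any(not math.isfinite(g) for g in grads):
            # Floating-point exception: skip the step and reduce the scale.
            self.overflow_steps += 1
            self.scale = max(self.scale / self.factor, 1.0)
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            # Enough clean steps in a row: try a larger scale again.
            self.scale *= self.factor
            self.good_steps = 0
        return True

    @property
    def exception_ratio(self):
        """Fraction of steps that reported a floating-point exception."""
        return self.overflow_steps / self.total_steps if self.total_steps else 0.0

# Hypothetical gradients for four steps; the second step overflows.
scaler = DynamicLossScaler(init_scale=2.0**15, factor=2.0, growth_interval=2)
steps = [[0.1, 0.2], [float("inf"), 0.3], [0.1, 0.1], [0.2, 0.1]]
applied = [scaler.update(g) for g in steps]

# Compare against the recommended 0.5% limit from the procedure.
within_limit = scaler.exception_ratio < 0.005
```

In this short example one step in four overflows, so `within_limit` is false. In a real run, the initial scale and the scaling factor would be tuned until the ratio falls below the recommended limit.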