Training Procedure

Objective

After the model is ported, the following must remain the same as in the benchmark: the initial state, the intermediate steps, and the results of training, as well as the validation samples and the validation process.

Principle

Process errors, such as failing to clear intermediate activations, occur frequently during training and cause the accuracy to differ from that of the benchmark model.

Get familiar with the training process and check your training and validation steps.
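
One way to check the training steps is to log the loss at every step for both the benchmark and the ported model, and then find the first step at which the two logs disagree. The following is a minimal sketch in plain Python. The function name `first_divergent_step` and the tolerance values are assumptions made for illustration; they are not part of any framework.

```python
import math


def first_divergent_step(benchmark_losses, ported_losses,
                         rel_tol=1e-3, abs_tol=1e-6):
    """Return the index of the first step whose losses differ beyond the
    tolerance, or None if the two logs agree at every step.

    The tolerances are illustrative; choose values that suit your model.
    """
    if len(benchmark_losses) != len(ported_losses):
        raise ValueError("the two loss logs must cover the same number of steps")
    for step, (ref, got) in enumerate(zip(benchmark_losses, ported_losses)):
        if not math.isclose(ref, got, rel_tol=rel_tol, abs_tol=abs_tol):
            return step
    return None
```

If the function returns a step index, the process error was most likely introduced at or just before that step, which narrows down where to inspect the training loop.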

Procedure

  1. Check the weight initialization mode. If the weights are initialized randomly, ensure that the random initialization (for example, the seed and the distribution) is consistent with the benchmark. If the weights are initialized by loading a pre-trained weight file, ensure that the weight file is consistent with the one the benchmark uses.
  2. Check the startup script and the parameters set in it against those of the benchmark.
  3. Ensure that distributed training is configured correctly. In particular, avoid the common problem in which each node trains independently without synchronizing any information with the other nodes.
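
The three checks above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the API of any training framework: the helper names (`init_weights`, `file_sha256`, `diff_params`, `replicas_in_sync`) are hypothetical, weights are represented as flat lists of floats, and the Gaussian initializer with a standard deviation of 0.02 is an arbitrary stand-in for a real one.

```python
import hashlib
import random


def init_weights(seed, n):
    """Hypothetical initializer: draw n weights from a seeded generator.
    The same seed always yields the same weights (step 1)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.02) for _ in range(n)]


def file_sha256(path, chunk_size=1 << 20):
    """Hash a pre-trained weight file so that it can be compared with the
    hash of the file the benchmark loads (step 1)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def diff_params(benchmark, ported):
    """Return {name: (benchmark_value, ported_value)} for every startup
    parameter that differs or is present on one side only (step 2)."""
    missing = object()
    return {
        key: (benchmark.get(key), ported.get(key))
        for key in sorted(set(benchmark) | set(ported))
        if benchmark.get(key, missing) != ported.get(key, missing)
    }


def replicas_in_sync(replica_params, tol=0.0):
    """Return True if every replica holds the same parameters as the first
    one, within tol (step 3). Replicas that train independently drift apart."""
    reference = replica_params[0]
    return all(
        len(params) == len(reference)
        and all(abs(a - b) <= tol for a, b in zip(params, reference))
        for params in replica_params[1:]
    )
```

For example, `diff_params({"lr": 0.1, "batch_size": 32}, {"lr": 0.01, "batch_size": 32})` reports only the `lr` entry. In a real distributed job, the parameter lists passed to `replicas_in_sync` would have to be gathered from each node after a training step; how to gather them depends on the framework and is not shown here.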