Training Procedure
Objective
After the model is ported, the following must remain unchanged: the initial state, intermediate steps, and results of training, as well as the validation samples and the validation process.
Principle
Process errors during training, such as failing to clear intermediate activations, are common and cause accuracy to deviate from that of the benchmark model.
Get familiar with the training process and check your training and validation steps.
Procedure
- Check the weight initialization mode. If the weights are initialized randomly, ensure that the random behavior (for example, the initialization method and random seed) is consistent with the benchmark. If the weights are initialized by loading a pre-trained weight file, ensure that the weight file is consistent with the one used by the benchmark.
- Check the startup script and the parameters set in it against those used by the benchmark.
- For cluster training, check that the related parameters are included and set correctly. In particular, ensure that all nodes in the cluster participate in collective communication.
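The checks above can be partly automated. The following is a minimal sketch in standard-library Python, not part of any porting toolkit: the helper names (`same_weight_file`, `seeded_init`, `diff_params`, `missing_ranks`) and the parameter names in the usage note are hypothetical. It assumes the benchmark and ported runs load weight files in the same format, so a byte-level hash comparison is meaningful; if the weights were converted to another format, compare the loaded tensors instead.

```python
import hashlib
import random


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_weight_file(benchmark_path, ported_path):
    """True if both weight files are byte-identical (same-format assumption)."""
    return file_sha256(benchmark_path) == file_sha256(ported_path)


def seeded_init(seed, n=4, std=0.02):
    """Toy random initializer: the same seed must yield the same values."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(n)]


def diff_params(benchmark, ported):
    """Return {name: (benchmark_value, ported_value)} for parameters that differ."""
    names = sorted(set(benchmark) | set(ported))
    return {
        name: (benchmark.get(name), ported.get(name))
        for name in names
        if benchmark.get(name) != ported.get(name)
    }


def missing_ranks(world_size, reported_ranks):
    """Return ranks in [0, world_size) that did not report in."""
    return sorted(set(range(world_size)) - set(reported_ranks))
```

For example, `diff_params({"lr": 0.1, "batch_size": 32}, {"lr": 0.01, "batch_size": 32})` reports only the `lr` mismatch, and `missing_ranks(4, [0, 1, 3])` flags rank 2 as not participating. How the list of reporting ranks is collected depends on the framework and is left out of this sketch.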