Confirming the Overflow Steps
Locating an overflow depends on dumped accuracy data. If the NaN in the model output can be reproduced reliably at a specific iteration step once randomness is fixed, you can specify which step to dump during training.
- Modify the config.py file in the precision_tool/lib/config directory to specify the steps for dumping data.
  # Dump the data of specific steps. Comparing and analyzing the dump data of the first step is usually enough, so you can keep the default value.
  # To dump other steps, change the value, for example, to 0, 5, or 10.
  TF_DUMP_STEP = '0'
- Set TF_DUMP_STEP to the index of the step where NaN occurs. Steps are counted from 0, so TF_DUMP_STEP=0 corresponds to the first step of model training.
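Because TF_DUMP_STEP counts from 0, it helps to record the step index with the same convention. The following is a minimal sketch, not part of precision_tool: it assumes you have logged one loss value per training step into a list, and the names first_nan_step and loss_history are hypothetical. It treats any non-finite loss (NaN or infinity) as the failing step.

```python
import math


def first_nan_step(losses):
    """Return the zero-based index of the first non-finite loss, or None."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Hypothetical loss history recorded from a run with randomness fixed.
loss_history = [0.93, 0.71, 0.64, float("nan"), float("nan")]

nan_step = first_nan_step(loss_history)
if nan_step is not None:
    # Value to place in config.py, written as a string like the default '0'.
    print(f"TF_DUMP_STEP = '{nan_step}'")
```

The zero-based index returned by enumerate matches the convention that TF_DUMP_STEP=0 is the first training step, so no offset is needed.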
If the loss NaN problem cannot be reproduced at a fixed iteration step, you can set TF_DUMP_STEP to a range of steps or repeat the run several times, as your situation requires. Make sure that the accuracy data of the step where NaN occurs has been dumped before you move on to the next stage of analysis. Dump data takes up a large amount of storage space, so avoid dumping more steps than you need and promptly delete dump data that is no longer useful.
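To keep an eye on how much space the dump data takes, you can measure the dump directory after each run. This is a minimal sketch using only the Python standard library. The directory name ./dump_data is a placeholder assumption, not a path defined by precision_tool, so replace it with the location your run actually writes to.

```python
import os


def dir_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links so linked files are not counted.
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total


# Placeholder path (assumption): point this at your real dump directory.
dump_root = "./dump_data"
size_gib = dir_size_bytes(dump_root) / (1024 ** 3)
print(f"{dump_root}: {size_gib:.2f} GiB")
```

If the directory does not exist, os.walk yields nothing and the reported size is 0.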