Reproducing an Issue
Before using the tool to analyze and locate an issue, ensure that the issue can be reproduced and that the symptom stays the same across repeated training runs.
- Fix randomness.
  - Fix the random seed.
  - Disable shuffle when loading data batches. Set shuffle to False.
  - Disable the Dropout layer.
- Enable deterministic computing.
torch.use_deterministic_algorithms(True)
- Enable deterministic communication.
export HCCL_DETERMINISTIC=TRUE
- If the issue cannot be reproduced, use the accuracy collection tool to collect MD5 data for troubleshooting.
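The manual steps in the list above can be sketched roughly as follows. This is a minimal illustration and not the tool's own implementation: the helper name `fix_randomness` is hypothetical, NumPy and PyTorch are treated as optional so the sketch also runs without them, and setting `HCCL_DETERMINISTIC` from Python assumes the variable is read when communication is initialized. Disabling shuffle and Dropout depends on your data loader and model, so it is not covered here.

```python
import os
import random


def fix_randomness(seed: int = 1234) -> None:
    """Seed the common random number generators and request determinism."""
    # PYTHONHASHSEED only affects interpreters started after this point
    # (for example, worker subprocesses).
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

    # Assumption: HCCL reads this variable when communication is initialized,
    # so it has to be set before the process group is created.
    os.environ["HCCL_DETERMINISTIC"] = "TRUE"

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional in this sketch

    try:
        import torch
    except ImportError:
        return  # lets the sketch run without PyTorch installed
    torch.manual_seed(seed)
    torch.use_deterministic_algorithms(True)


# Two runs with the same seed should draw the same numbers.
fix_randomness(1234)
first = [random.random() for _ in range(3)]
fix_randomness(1234)
second = [random.random() for _ in range(3)]
```

Call such a helper once at the very start of the training script, before the model, data loader, and communication group are created.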
Except for data loading, fixing randomness and enabling deterministic computing and communication can all be done automatically with the seed_all tool.
from msprobe.pytorch import seed_all
seed_all(seed=1234, mode=True, rm_dropout=True)
For some operators, the same random seed produces different random numbers on different hardware, or deterministic computing is not supported. In these cases, manually generate the random numbers on the CPU and transfer them to the NPU, or replace these operators with equivalent smaller operators.
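Generating random numbers on the CPU and moving them to the device could look like the sketch below. The helper name `randn_from_cpu` is hypothetical, and the example defaults to `device="cpu"` so it runs anywhere; passing `"npu"` assumes an environment where that device is registered with PyTorch.

```python
import torch


def randn_from_cpu(shape, seed, device="cpu"):
    """Draw normal random numbers on the CPU, then move them to `device`."""
    gen = torch.Generator(device="cpu")
    gen.manual_seed(seed)
    # The values depend only on the CPU generator, not on the target hardware.
    values = torch.randn(shape, generator=gen, device="cpu")
    return values.to(device)


a = randn_from_cpu((2, 3), seed=1234)
b = randn_from_cpu((2, 3), seed=1234)
```

Because the numbers are always drawn by the CPU generator, two platforms that use the same seed should start from identical values, at the cost of an extra host-to-device copy.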
If an accuracy issue occurs on a cluster of thousands or even tens of thousands of cards, reduce the training scale to reproduce and locate the issue.
The common practice is to keep the tensor parallel and pipeline parallel parameters unchanged and either reduce the data parallel parameters or directly reduce the number of model layers. Run experiments to confirm that the issue can still be reproduced after the reduction, and choose the smallest set of training parameters for which the issue persists.
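The scale reduction can be planned with a small helper like the one below. It is only a sketch: it assumes the number of cards equals tensor parallel size × pipeline parallel size × data parallel size, and `reproduces` is a hypothetical callback standing in for an actual training run that reports whether the issue still appears.

```python
from typing import Callable, Iterator, Optional, Tuple


def candidate_scales(tp: int, pp: int, dp: int) -> Iterator[Tuple[int, int]]:
    """Yield (dp, cards) pairs, halving dp while keeping tp and pp fixed."""
    while dp >= 1:
        yield dp, tp * pp * dp
        if dp == 1:
            break
        dp //= 2


def smallest_reproducing(
    tp: int, pp: int, dp: int, reproduces: Callable[[int], bool]
) -> Optional[Tuple[int, int]]:
    """Return the smallest (dp, cards) for which reproduces(dp) is True."""
    smallest = None
    # Every candidate is tried, because reproduction is not guaranteed
    # to be monotonic in the data parallel size.
    for cand_dp, cards in candidate_scales(tp, pp, dp):
        if reproduces(cand_dp):
            smallest = (cand_dp, cards)
    return smallest


# Example: tp=8, pp=16, dp=64 corresponds to 8192 cards.
scales = list(candidate_scales(8, 16, 64))
# Stand-in result: pretend the issue needs at least dp=2 to appear.
result = smallest_reproducing(8, 16, 64, lambda dp: dp >= 2)
```

With these example values, dropping the data parallel size from 64 to 1 shrinks the job from 8192 cards to 128 while leaving the tensor and pipeline parallel settings untouched.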