Checking the Environment
- Check the following configuration items:
- Training hyperparameters and environment variables
Use a diff tool such as Beyond Compare to compare the training hyperparameters and environment variables recorded in the training logs or startup scripts of the runs being compared.
- Third-party library versions
Check the Megatron and DeepSpeed versions from the git branch of each source checkout, and check the versions of torch and the other installed PyTorch packages with pip list.
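The configuration checks above can also be scripted so that each environment produces a snapshot that is easy to diff. The following is a minimal sketch, not a tool referenced by this guide: the helper names (package_versions, git_revision, env_snapshot, diff_dicts) are hypothetical, and the package names and environment-variable prefixes in the usage section are assumptions to be replaced with whatever your setup actually uses.

```python
import os
import subprocess
from importlib import metadata


def package_versions(names):
    """Return {distribution name: installed version, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def git_revision(repo_path):
    """Return 'branch@commit' for a source checkout, or None if git cannot read it."""
    try:
        branch = subprocess.run(
            ["git", "-C", repo_path, "rev-parse", "--abbrev-ref", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        commit = subprocess.run(
            ["git", "-C", repo_path, "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None
    return f"{branch}@{commit}"


def env_snapshot(prefixes):
    """Return the environment variables whose names start with one of the prefixes."""
    return {k: v for k, v in os.environ.items() if k.startswith(tuple(prefixes))}


def diff_dicts(left, right):
    """Return {key: (left value, right value)} for every key whose values differ."""
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}


if __name__ == "__main__":
    # Assumed package names and prefixes; adjust them to your environment.
    print(package_versions(["torch", "torch_npu", "deepspeed"]))
    print(git_revision("."))
    print(env_snapshot(["HCCL_", "ASCEND_"]))
```

Run the script in both environments, save each result (for example as JSON), and pass the two dictionaries to diff_dicts to list only the entries that differ.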
- Check the input data read from the dataset.
Use the accuracy collection tool to capture the initial input data, or save or print the input tensor at the model's forward call in the code, and use the result to verify the data read from the dataset.
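If you choose to save the input at the forward call rather than use the collection tool, a compact per-step fingerprint is often enough to find where two runs start reading different data. The sketch below is an illustration under stated assumptions: fingerprint and first_mismatch are hypothetical helpers, and the input is assumed to be available as a flat list of numbers (with a PyTorch tensor, tensor.flatten().tolist() produces one).

```python
import hashlib
import json


def fingerprint(step, values, shape=None, head=8):
    """Summarize one flattened input batch as a small JSON-serializable record."""
    values = list(values)
    digest = hashlib.sha256(json.dumps(values).encode("utf-8")).hexdigest()
    return {
        "step": step,
        "shape": list(shape) if shape is not None else None,
        "count": len(values),
        "head": values[:head],  # first few elements, for reading by eye
        "sha256": digest,       # detects any difference in the full batch
    }


def first_mismatch(records_a, records_b):
    """Return the first step whose records differ, or None if all compared steps match."""
    for a, b in zip(records_a, records_b):
        if a != b:
            return a["step"]
    return None
```

Collect one record per training step in each run, write the records to a file, and call first_mismatch on the two lists. Only the steps present in both lists are compared.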
- Check the model structure.
- Check weight initialization.
Check whether the initial weights before training are consistent. Ensure that the same pre-trained model is loaded or that the same random seed is used for initialization.
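One way to check that two runs start from the same weights is to compare a checksum of every parameter before the first training step. This is a minimal sketch with hypothetical helper names (weight_checksums, compare_weights). It assumes each parameter is available as raw bytes; with PyTorch, one possible conversion is t.detach().cpu().float().numpy().tobytes() for each tensor t in model.state_dict(). The comparison is byte-exact, so it is only meaningful when both runs are converted to bytes in the same way.

```python
import hashlib


def weight_checksums(named_bytes):
    """Map each parameter name to the SHA-256 digest of its raw bytes."""
    return {name: hashlib.sha256(raw).hexdigest() for name, raw in named_bytes.items()}


def compare_weights(left, right):
    """Return (only_left, only_right, mismatched) parameter names, each sorted."""
    only_left = sorted(set(left) - set(right))
    only_right = sorted(set(right) - set(left))
    mismatched = sorted(n for n in set(left) & set(right) if left[n] != right[n])
    return only_left, only_right, mismatched
```

Names that appear in only one run point to a difference in model structure, while mismatched names point to a different checkpoint or a different initialization seed.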
- Update the environment version.
If possible, install the latest versions of the CANN, driver, and PyTorch packages.