Checking the Environment

  • Check the following configuration items:
    • Training hyperparameters and environment variables

      Use a diff tool such as Beyond Compare to compare the training hyperparameters and environment variables recorded in the training logs or startup scripts.

    • Third-party library versions

      Check the Megatron and DeepSpeed versions from their git branches, and check the versions of torch and torch_npu with pip list.

  • Check the input data read from the dataset.

    Use the accuracy collection tool to capture the initial input data, or save or print the input tensor where the code calls the model's forward function, and confirm that the data read from the dataset is the same.

  • Check the model structure.

    Print the model structure during training and compare the printed output.

  • Check weight initialization.

    Check whether the initial weights before training are consistent. Ensure that the same pre-trained model is loaded or that the same random seed is used for initialization.

  • Update the environment version.

    If possible, install the latest versions of the CANN, driver, and PyTorch packages.
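
As a scripted alternative to comparing logs by hand, the configuration check can be sketched as a snapshot that is taken in each environment and then diffed. This is a minimal sketch using only the Python standard library. The package list and the environment variable prefixes are assumptions chosen for illustration, not an official list, and should be adjusted to the job being checked. Libraries installed from a git checkout may still need to be checked through their git branch, as described above.

```python
import difflib
import json
import os
import sys
from importlib import metadata

# Assumed package list for illustration; extend it to match your training job.
PACKAGES = ("torch", "torch_npu")


def package_version(name: str) -> str:
    """Return the installed version of a distribution, or a marker if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def snapshot(env_prefixes=("ASCEND", "HCCL")) -> dict:
    """Collect the Python version, package versions, and selected env variables.

    The default prefixes are an assumption; pass the ones relevant to your setup.
    """
    prefixes = tuple(env_prefixes)
    env = {k: v for k, v in os.environ.items() if k.startswith(prefixes)}
    return {
        "python": sys.version.split()[0],
        "packages": {name: package_version(name) for name in PACKAGES},
        "env": env,
    }


def diff_snapshots(a: dict, b: dict) -> list:
    """Return unified-diff lines between two snapshots (empty if identical)."""
    a_lines = json.dumps(a, indent=2, sort_keys=True).splitlines()
    b_lines = json.dumps(b, indent=2, sort_keys=True).splitlines()
    return list(difflib.unified_diff(a_lines, b_lines, "run_a", "run_b", lineterm=""))
```

One possible workflow is to write snapshot() to a JSON file in each environment with json.dump, load both files on one machine, and print the lines returned by diff_snapshots.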
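
For the input data and weight initialization checks, one lightweight approach is to reduce each tensor to a few summary values and compare the summaries between runs. The sketch below assumes the tensors are real-valued PyTorch tensors. The helper names summarize and compare are hypothetical, and this is not the accuracy collection tool mentioned above. Matching summaries are only a screening result: they do not prove that two tensors are bitwise identical.

```python
import math


def summarize(named_tensors) -> dict:
    """Map each name to shape, dtype, sum, and absolute maximum.

    `named_tensors` is an iterable of (name, torch.Tensor) pairs, for example
    model.state_dict().items() or [("input_ids", batch)].
    """
    out = {}
    for name, t in named_tensors:
        x = t.detach().double().cpu()  # reduce in float64 on the CPU
        out[name] = {
            "shape": list(x.shape),
            "dtype": str(t.dtype),
            "sum": x.sum().item(),
            "absmax": x.abs().max().item() if x.numel() else 0.0,
        }
    return out


def compare(a: dict, b: dict, rel_tol: float = 0.0) -> list:
    """List names missing from one side or whose summaries differ."""
    problems = []
    for name in sorted(a.keys() | b.keys()):
        if name not in a or name not in b:
            problems.append(f"{name}: present in only one run")
            continue
        sa, sb = a[name], b[name]
        if sa["shape"] != sb["shape"] or sa["dtype"] != sb["dtype"]:
            problems.append(f"{name}: shape/dtype mismatch")
        elif not all(
            math.isclose(sa[k], sb[k], rel_tol=rel_tol, abs_tol=0.0)
            for k in ("sum", "absmax")
        ):
            problems.append(f"{name}: values differ")
    return problems
```

Calling summarize(model.state_dict().items()) before the first training step in each environment, saving the result as JSON, and passing both results to compare gives a quick list of parameters to inspect. With the default rel_tol of 0.0 the summaries must match exactly, which suits identical checkpoints or identical seeds on the same device type; a small tolerance may be more appropriate across different hardware.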
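
When no pre-trained model is loaded, using the same random seed usually means seeding every random number generator before the model is constructed. The following is a minimal sketch; the function name seed_everything is hypothetical. It seeds Python's random module and, when they are importable, NumPy and PyTorch. Whether a particular accelerator back end needs an additional device-specific seeding call is an assumption to verify against that back end's documentation.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed the Python, NumPy, and PyTorch RNGs that are available."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)  # accepts seeds in the range [0, 2**32 - 1]
    except ImportError:
        pass  # NumPy is not installed; nothing to seed
    try:
        import torch

        torch.manual_seed(seed)  # seeds PyTorch's default generators
    except ImportError:
        pass  # PyTorch is not installed; nothing to seed
```

Call seed_everything with the same value at the start of each run, before the model and data loaders are created, so that weight initialization and data shuffling draw from the same random sequence.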