Adapting Torch to X1

  1. Log in to the compute node.
  2. Go to the X1 installation directory.
    cd {X1_installation_directory}/Megatron-LM/megatron
  3. Modify the checkpointing.py file.
    1. Open the checkpointing.py file.
      vim checkpointing.py
    2. Press i to enter insert mode and make the following changes:
      • Add the following import as the first line of the file:
        import mindio_acp
      • Replace calls to torch.load with mindio_acp.load.
        Before:
        optim_checkpoint = torch.load(optim_load_path,
                                      map_location=torch.device('cpu'))

        After:
        optim_checkpoint = mindio_acp.load(optim_load_path, map_location='cpu')
      • Replace calls to torch.save with mindio_acp.save.
        Before:
        torch.save(state, save_path)

        After:
        mindio_acp.save(state, save_path)
      • Replace each with open block that wraps a torch.save call with a single mindio_acp.save call that takes the file path directly.
        Before:
        with open(self._get_optimizer_ckpt_name(save_dir, tag, expp_rank), 'wb') as fd:
            torch.save(optimizer_state, fd)
            fd.flush()

        After:
        mindio_acp.save(optimizer_state, self._get_optimizer_ckpt_name(save_dir, tag, expp_rank))
    3. Press Esc, type :wq!, and press Enter to save the changes and exit.
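
After saving, it can help to confirm that no torch.load or torch.save call was missed. The following is a minimal sketch of such a check, not part of the official procedure: it uses only the Python standard library to parse the edited file and list any remaining calls. The helper names find_torch_io_calls and has_mindio_import are hypothetical and introduced here for illustration only.

```python
import ast
import sys

# Calls the steps above replace with their mindio_acp equivalents.
TARGETS = {("torch", "load"), ("torch", "save")}


def find_torch_io_calls(source):
    """Return sorted (line_number, call_name) pairs for torch.load/torch.save calls."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match plain attribute calls of the form torch.load(...) or torch.save(...).
        if (isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and (func.value.id, func.attr) in TARGETS):
            hits.append((node.lineno, f"{func.value.id}.{func.attr}"))
    return sorted(hits)


def has_mindio_import(source):
    """Return True if the source contains an 'import mindio_acp' statement."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            if any(alias.name == "mindio_acp" for alias in node.names):
                return True
    return False


if __name__ == "__main__":
    # Hypothetical usage: python check_ckpt.py checkpointing.py
    path = sys.argv[1] if len(sys.argv) > 1 else "checkpointing.py"
    with open(path, encoding="utf-8") as fh:
        text = fh.read()
    if not has_mindio_import(text):
        print("missing: import mindio_acp")
    for lineno, name in find_torch_io_calls(text):
        print(f"line {lineno}: {name} still present")
```

Any line the script reports is a candidate for review against the replacements described above. The check only matches calls written literally as torch.load(...) or torch.save(...), so aliased imports would not be detected.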