Adapting Torch to X1
- Log in to the compute node.
- Go to the X1 installation directory.
cd {X1_installation_directory}/Megatron-LM/megatron
- Modify the checkpointing.py file.
- Open the checkpointing.py file.
vim checkpointing.py
- Press i to enter insert mode and make the following changes:
- Add the following import statement as the first line of the file:
import mindio_acp
- Replace the torch.load function with the mindio_acp.load function.
Before:
optim_checkpoint = torch.load(optim_load_path, map_location=torch.device('cpu'))
After:
optim_checkpoint = mindio_acp.load(optim_load_path, map_location='cpu')
- Replace the torch.save function with the mindio_acp.save function.
Before:
torch.save(state, save_path)
After:
mindio_acp.save(state, save_path)
- Replace the with open statement that contains the torch.save function with the mindio_acp.save function.
Before:
with open(self._get_optimizer_ckpt_name(save_dir, tag, expp_rank), 'wb') as fd:
    torch.save(optimizer_state, fd)
    fd.flush()
After:
mindio_acp.save(optimizer_state, self._get_optimizer_ckpt_name(save_dir, tag, expp_rank))
- Press Esc, type :wq!, and press Enter to save the changes and exit.
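The two simple renames above (torch.load to mindio_acp.load and torch.save to mindio_acp.save, plus the import line) can also be sketched as a text transformation, which may help when the same change has to be applied on several nodes. This is only an illustrative sketch using the Python standard library: the function names patch_source and find_with_open_lines are hypothetical and are not part of mindio_acp or Megatron-LM, and the sketch assumes the calls appear in the file exactly as shown in the Before examples.

```python
import re

IMPORT_LINE = "import mindio_acp\n"


def patch_source(src: str) -> str:
    """Apply the simple renames from the steps above to source text.

    Hypothetical helper for illustration; it edits text only and never
    imports torch or mindio_acp.
    """
    # map_location=torch.device('cpu') becomes map_location='cpu',
    # matching the Before/After pair for torch.load.
    src = re.sub(
        r"map_location=torch\.device\((['\"])cpu\1\)",
        "map_location='cpu'",
        src,
    )
    # Rename torch.load( and torch.save( calls; \b keeps names that
    # merely end in "torch" untouched.
    src = re.sub(r"\btorch\.(load|save)\(", r"mindio_acp.\1(", src)
    # Add the import as the first line unless it is already there.
    if not src.startswith(IMPORT_LINE):
        src = IMPORT_LINE + src
    return src


def find_with_open_lines(src: str) -> list[int]:
    """Return 1-based line numbers of 'with open(' statements.

    These blocks are NOT rewritten by patch_source and need the manual
    edit described above (pass the file path to mindio_acp.save).
    """
    return [
        number
        for number, line in enumerate(src.splitlines(), start=1)
        if re.search(r"\bwith\s+open\(", line)
    ]
```

The with open pattern is deliberately left for manual review: patch_source would only rename the inner call, which would still pass the file object fd instead of the checkpoint path. Run find_with_open_lines on the original file first, rewrite those blocks by hand as shown in the steps, and back up checkpointing.py before writing any patched text to it.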