Adapting Torch to DeepSpeed
- Log in to the compute node as a service user.
The service user is an account chosen according to your environment; it is not {MindIO-install-user}, HwHiAiUser, or hwMindX.
- Go to the DeepSpeed installation directory.
cd {DeepSpeed_installation_directory}/runtime
- Modify the engine.py file.
- Open the engine.py file.
vim engine.py
- Press i to enter the insert mode and modify the following content:
- Add the following content to the first line of the file:
import mindio_acp
- Replace the torch.load function with the mindio_acp.load function.
Before:
optim_checkpoint = torch.load(optim_load_path, map_location=torch.device('cpu'))
After:
optim_checkpoint = mindio_acp.load(optim_load_path, map_location='cpu')
- Replace the torch.save function with the mindio_acp.save function.
Before:
torch.save(state, save_path)
After:
mindio_acp.save(state, save_path)
- Replace the with open statement that contains the torch.save function with the mindio_acp.save function.
Before:
with open(self._get_optimizer_ckpt_name(save_dir, tag, expp_rank), 'wb') as fd:
    torch.save(optimizer_state, fd)
    fd.flush()
After:
mindio_acp.save(optimizer_state, self._get_optimizer_ckpt_name(save_dir, tag, expp_rank))
- Replace the torch.load call that wraps DeepSpeedEngine._get_expert_ckpt_name with the mindio_acp.load function.
Before:
expert_state_dict = torch.load(DeepSpeedEngine._get_expert_ckpt_name(
    checkpoint_path,
    -1,  # -1 means ignore layer_id
    global_expert_id,
    tag,
    mpu),
    map_location=torch.device('cpu'))
After:
expert_state_dict = mindio_acp.load(DeepSpeedEngine._get_expert_ckpt_name(
    checkpoint_path,
    -1,  # -1 means ignore layer_id
    global_expert_id,
    tag,
    mpu),
    map_location='cpu')
- Press Esc, type :wq!, and press Enter to save the changes and exit.
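After saving engine.py, it can be useful to confirm that no direct torch.load or torch.save call was missed. The following is a minimal sketch, not part of MindIO ACP: the function name find_torch_io_calls and the script name check_torch_io.py are hypothetical, and the comment handling is deliberately naive (a "#" inside a string literal also cuts the line).

```python
import re
import sys

# Matches direct calls such as torch.load(...) or torch.save(...);
# mindio_acp.load(...) and mindio_acp.save(...) do not match.
TORCH_IO_CALL = re.compile(r"\btorch\.(load|save)\s*\(")


def find_torch_io_calls(source):
    """Return (line_number, stripped_line) for each line that still calls
    torch.load or torch.save outside of a trailing comment."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # naive: drops everything after the first '#'
        if TORCH_IO_CALL.search(code):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical usage: python check_torch_io.py engine.py
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            for number, text in find_torch_io_calls(handle.read()):
                print(f"{path}:{number}: {text}")
```

An empty result only means that no call of this textual form remains. It does not prove that every replacement matches the Before/After pairs listed above, so a manual review of the edited lines is still advisable.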
- Modify the module.py file.
- Modify the state_dict_factory.py file.
- After the .py files in steps 3 to 5 are modified, DeepSpeed can use the MindIO ACP service.
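The value of {DeepSpeed_installation_directory} used in the cd step depends on the environment. As a hedged sketch, assuming that this directory is the deepspeed package importable from the active Python environment (the source does not state this), the runtime subdirectory can be located with the standard library. The helper name package_subdir is hypothetical.

```python
import importlib.util
import os


def package_subdir(package_name, subdir):
    """Return the path of `subdir` inside an installed top-level package,
    located through its import spec rather than by importing it."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        raise ModuleNotFoundError(f"{package_name!r} is not an installed package")
    package_dir = list(spec.submodule_search_locations)[0]
    return os.path.join(package_dir, subdir)


if __name__ == "__main__":
    try:
        # Assumption: DeepSpeed is installed under the package name "deepspeed".
        print(package_subdir("deepspeed", "runtime"))
    except ModuleNotFoundError as error:
        print(error)
```

If several Python environments exist on the compute node, run the sketch with the same interpreter that launches training, so that the path points at the DeepSpeed copy that is actually used.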