Adapting Torch to DeepSpeed

  1. Log in to the compute node as a service user.

    The service user must not be {MindIO-install-user}, HwHiAiUser, or hwMindX. Choose the service user based on your environment.

  2. Go to the runtime directory under the DeepSpeed installation directory.
    cd {DeepSpeed_installation_directory}/runtime
  3. Modify the engine.py file.
    1. Open the engine.py file.
      vim engine.py
    2. Press i to enter insert mode and make the following changes:
      • Add the following content to the first line of the file:
        import mindio_acp
      • Replace the torch.load function with the mindio_acp.load function.
        Before:
        optim_checkpoint = torch.load(optim_load_path,
                                      map_location=torch.device('cpu'))

        After:

        optim_checkpoint = mindio_acp.load(optim_load_path, map_location='cpu')
      • Replace the torch.save function with the mindio_acp.save function.
        Before:
        torch.save(state, save_path)

        After:

        mindio_acp.save(state, save_path)
      • Replace the with open statement that wraps torch.save with a direct mindio_acp.save call, passing the file path instead of the file object.
        Before:
        with open(self._get_optimizer_ckpt_name(save_dir, tag, expp_rank), 'wb') as fd:
            torch.save(optimizer_state, fd)
            fd.flush()

        After:

        mindio_acp.save(optimizer_state, self._get_optimizer_ckpt_name(save_dir, tag, expp_rank))
      • Replace the torch.load call that wraps DeepSpeedEngine._get_expert_ckpt_name with mindio_acp.load.

        Before:

                        expert_state_dict = torch.load(DeepSpeedEngine._get_expert_ckpt_name(
                            checkpoint_path,
                            -1, # -1 means ignore layer_id
                            global_expert_id,
                            tag,
                            mpu),
                            map_location=torch.device('cpu'))

        After:

                        expert_state_dict = mindio_acp.load(DeepSpeedEngine._get_expert_ckpt_name(
                            checkpoint_path,
                            -1, # -1 means ignore layer_id
                            global_expert_id,
                            tag,
                            mpu),
                            map_location='cpu')
    3. Press Esc, type :wq!, and press Enter to save the changes and exit.
  4. Modify the module.py file.
    1. Open the module.py file.
      vim pipe/module.py
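    2. Replace torch.save and torch.load with mindio_acp.save and mindio_acp.load (including the import mindio_acp line), then save the file and exit. For details, see substeps 2 and 3 of step 3.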
  5. Modify the state_dict_factory.py file.
    1. Open the state_dict_factory.py file.
      vim state_dict_factory.py
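    2. Replace torch.save and torch.load with mindio_acp.save and mindio_acp.load (including the import mindio_acp line), then save the file and exit. For details, see substeps 2 and 3 of step 3.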
  6. After the .py files in steps 3 to 5 are modified, DeepSpeed can use the MindIO ACP service.
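
The manual edits in steps 3 to 5 can optionally be double-checked with a small script. The following is a minimal sketch, not part of MindIO ACP or DeepSpeed: the helper names check_source and check_runtime_dir are hypothetical, and the script assumes that every direct torch.load or torch.save call in the three files is meant to be replaced. It only reads the files and reports any remaining torch.load or torch.save calls, or a missing import mindio_acp line.

```python
import re
import sys
from pathlib import Path

# Files named in steps 3 to 5, relative to the DeepSpeed runtime directory.
PATCHED_FILES = ["engine.py", "pipe/module.py", "state_dict_factory.py"]

# Matches direct calls such as torch.load(...) or torch.save(...).
TORCH_IO_CALL = re.compile(r"\btorch\.(load|save)\s*\(")


def check_source(text):
    """Return a list of problems found in one file's source text."""
    problems = []
    lines = text.splitlines()
    if not any(line.strip() == "import mindio_acp" for line in lines):
        problems.append("missing 'import mindio_acp'")
    for number, line in enumerate(lines, start=1):
        code = line.split("#", 1)[0]  # drop comments before matching
        if TORCH_IO_CALL.search(code):
            problems.append(f"line {number}: {line.strip()}")
    return problems


def check_runtime_dir(runtime_dir):
    """Check each patched file and return {relative_path: [problems]}."""
    report = {}
    for rel in PATCHED_FILES:
        path = Path(runtime_dir) / rel
        if not path.is_file():
            report[rel] = ["file not found"]
            continue
        problems = check_source(path.read_text(encoding="utf-8"))
        if problems:
            report[rel] = problems
    return report


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: pass the runtime directory as the only argument.
    result = check_runtime_dir(sys.argv[1])
    for rel, problems in result.items():
        for problem in problems:
            print(f"{rel}: {problem}")
    if not result:
        print("No remaining torch.load or torch.save calls found.")
```

Save the script under any name and run it with the runtime directory from step 2 as its only argument. An empty report only means that the text patterns were not found; it does not confirm that the MindIO ACP service itself is working.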