mindx_elastic.restore_module.restore_manager.restore_manager.RestoreCheckpoint

restore_exception_checkpoint(CheckpointRestoreParas)

Loads the dying gasp checkpoint of the restoration policy for hybrid parallel training jobs.

Table 1 CheckpointRestoreParas structure

Item

Type

Description

args_param

NameSpace

Parameter configuration for the training jobs

sink_size

int

Sink size

dataset

mindspore.dataset: MindSpore data type

Dataset used for training

model

mindspore.train.model: MindSpore training type

Train the model.

network

mindspore.nn.cell: MindSpore cell type

Model network used for training

epoch

int

Number of training epochs.

Table 2 Return value

Parameter

Type

Description

bool

boolean

Indicates whether the checkpoint is successfully loaded after training recovery

Example:

from mindx_elastic.restore_module.restore_manager.restore_checkpoint import RestoreCheckpoint, CheckpointRestoreParas
res_ckpt = RestoreCheckpoint()
input_checkpoint_paras = CheckpointRestoreParas(args_opt, args_opt.sink_size, ds, model,pangu_alpha_with_grads, epoch=actual_epoch_num)
flag = res_ckpt.restore_exception_checkpoint(input_checkpoint_paras)