mindx_elastic.restore_module.restore_manager.restore_manager.RestoreCheckpoint
restore_exception_checkpoint(CheckpointRestoreParas)
Loads the dying gasp checkpoint of the restoration policy for hybrid parallel training jobs.
| Item | Type | Description |
|---|---|---|
| args_param | NameSpace | Parameter configuration for the training job |
| sink_size | int | Sink size |
| dataset | mindspore.dataset: MindSpore data type | Dataset used for training |
| model | mindspore.train.model: MindSpore training type | Model used for training |
| network | mindspore.nn.cell: MindSpore cell type | Model network used for training |
| epoch | int | Number of training epochs |
| Return value | Type | Description |
|---|---|---|
| bool | boolean | Indicates whether the checkpoint was loaded successfully after training recovery |
Example:

    from mindx_elastic.restore_module.restore_manager.restore_checkpoint import (
        RestoreCheckpoint,
        CheckpointRestoreParas,
    )

    res_ckpt = RestoreCheckpoint()
    input_checkpoint_paras = CheckpointRestoreParas(
        args_opt, args_opt.sink_size, ds, model, pangu_alpha_with_grads,
        epoch=actual_epoch_num)
    flag = res_ckpt.restore_exception_checkpoint(input_checkpoint_paras)
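Because the call returns a bool, the training script usually needs to branch on it. The following is a minimal sketch of one way to do that, not part of the mindx_elastic API: `resume_or_start_fresh`, `FakeRestorer`, and the `SimpleNamespace` parameters are hypothetical stand-ins so the snippet runs without mindx_elastic installed. Treating a raised exception as "not loaded" is also an assumption; this reference does not state whether `restore_exception_checkpoint` can raise.

```python
from types import SimpleNamespace


def resume_or_start_fresh(restorer, paras, log=print):
    """Try to load the dying gasp checkpoint and report which path was taken.

    `restorer` is any object exposing restore_exception_checkpoint(paras),
    for example a RestoreCheckpoint instance. Returns True when the
    checkpoint was loaded and False when training should start from scratch.
    """
    try:
        loaded = bool(restorer.restore_exception_checkpoint(paras))
    except Exception as err:  # assumption: any load error means "not loaded"
        log(f"checkpoint restore raised {err!r}; starting from scratch")
        return False
    if loaded:
        log("checkpoint loaded; resuming training")
    else:
        log("no usable checkpoint; starting from scratch")
    return loaded


class FakeRestorer:
    """Hypothetical stand-in for RestoreCheckpoint, used only for this demo."""

    def __init__(self, result):
        self.result = result

    def restore_exception_checkpoint(self, paras):
        # Return the canned result instead of touching any checkpoint file.
        return self.result


# Hypothetical stand-in for a CheckpointRestoreParas instance.
paras = SimpleNamespace(sink_size=2, epoch=1)

resumed = resume_or_start_fresh(FakeRestorer(True), paras)
fresh = resume_or_start_fresh(FakeRestorer(False), paras)
```

In a real job, `res_ckpt` and `input_checkpoint_paras` from the example above would be passed in place of the stand-ins.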