mindx_elastic.terminating_message.ExceptionCheckpoint(prefix='CKP', directory=None, config=None, partial_save=False, replicas=1)
Function Description
Fixed action performed at each training epoch or iteration. It captures the INT and TERM signals (SIGINT and SIGTERM) and saves a dying gasp checkpoint, that is, a last checkpoint written before the training process exits.
Parameters:
- prefix (str): prefix of the checkpoint file name. Default: 'CKP'.
- directory (str): path of the directory in which the checkpoint file is stored. Default: None, in which case the file is saved in the current directory.
- config (CheckpointConfig): checkpoint policy configuration. Default: None.
- partial_save (bool): whether to enable partial saving. Default: False.
- replicas (int): number of partially saved copies. Value range: 1 to 5. Default: 1.
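The mechanism described above (catch INT or TERM, then write one last checkpoint) can be illustrated with a minimal standard-library sketch. This is not the mindx_elastic implementation: the class name DyingGaspSaver, its install and step_end methods, the JSON payload, and the "<prefix>-<step>.json" file name are all hypothetical and chosen only for illustration. Only the prefix and directory arguments, and the current-directory default, mirror the parameters documented here.

```python
import json
import os
import signal


class DyingGaspSaver:
    """Hypothetical stand-in for illustration; NOT the mindx_elastic class."""

    def __init__(self, prefix="CKP", directory=None):
        self.prefix = prefix
        # Mirror the documented default: fall back to the current directory.
        self.directory = directory if directory is not None else os.getcwd()
        self.stop_requested = False

    def install(self):
        """Register handlers for INT and TERM (call from the main thread)."""
        signal.signal(signal.SIGINT, self._on_signal)
        signal.signal(signal.SIGTERM, self._on_signal)

    def _on_signal(self, signum, frame):
        # Only record the request here; the save itself happens at the next
        # step boundary, where the training state is consistent.
        self.stop_requested = True

    def step_end(self, step, state):
        """Call once per iteration; returns the checkpoint path or None."""
        if not self.stop_requested:
            return None
        os.makedirs(self.directory, exist_ok=True)
        # Assumed naming scheme: "<prefix>-<step>.json".
        path = os.path.join(self.directory, f"{self.prefix}-{step}.json")
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"step": step, "state": state}, f)
        # Rename after writing so a partially written file never carries
        # the final checkpoint name.
        os.replace(tmp_path, path)
        return path
```

In a training loop built on this sketch, install() would be called once before training, and step_end(step, state) after every iteration; the loop can stop as soon as step_end returns a path. Deferring the write to the step boundary instead of saving inside the signal handler is a design choice of this sketch, not a statement about how ExceptionCheckpoint works internally.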