Constraints
- If the training framework fails to save a checkpoint to MindIO ACP, that checkpoint cannot be used as a recovery point for training. The training framework must revert to the last successfully saved checkpoint to restore training.
- If MindIO ACP becomes faulty during training, after services have been delivered, the MindIO ACP SDK retries the connection three times, with a maximum retry waiting time of 60 seconds. If all three attempts fail, the SDK falls back to the native storage mode. If MindIO ACP is faulty before training starts, the SDK skips the interconnection with MindIO ACP and saves checkpoints directly in the native data storage mode.
- This feature is incompatible with MindIO TFT (fast fault recovery).
- This feature is incompatible with MindSpore versions earlier than 2.7.0.
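
The first two constraints can be illustrated with a minimal Python sketch. It is not the MindIO ACP SDK: the function names, the callables passed in, and the 20-second delay between attempts are hypothetical and chosen only for illustration. Only the three attempts, the 60-second cap, the fallback to native storage, and the use of the last successfully saved checkpoint come from the constraints above. How the SDK actually counts and spaces its retries is an assumption here.

```python
import time
from typing import Callable, List, Optional, Tuple

MAX_ATTEMPTS = 3         # from the constraint: the connection is tried three times
MAX_TOTAL_WAIT_S = 60.0  # from the constraint: maximum retry waiting time


def save_with_fallback(
    accelerated_save: Callable[[], None],  # hypothetical: save through the accelerated path
    native_save: Callable[[], None],       # hypothetical: save through native storage
    retry_delay_s: float = 20.0,           # assumed spacing between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try the accelerated path up to three times, then fall back to native storage."""
    waited = 0.0
    for attempt in range(MAX_ATTEMPTS):
        try:
            accelerated_save()
            return "accelerated"
        except ConnectionError:
            # Wait before the next attempt without exceeding the total budget.
            delay = min(retry_delay_s, MAX_TOTAL_WAIT_S - waited)
            if attempt < MAX_ATTEMPTS - 1 and delay > 0:
                sleep(delay)
                waited += delay
    native_save()
    return "native"


def last_successful_checkpoint(records: List[Tuple[int, bool]]) -> Optional[int]:
    """Return the highest step whose save succeeded, or None if no save succeeded.

    `records` holds (step, saved_ok) pairs. A checkpoint whose save failed is
    never returned, because it cannot serve as a recovery point.
    """
    ok_steps = [step for step, saved_ok in records if saved_ok]
    return max(ok_steps) if ok_steps else None
```

For example, with saves recorded at steps 100 and 200 succeeding and the save at step 300 failing, `last_successful_checkpoint` returns 200, so training would resume from step 200.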