Feature Description
- The following table lists the system specifications supported by resumable training.
Table 1 System specifications

Type: Server
Configuration:
- Atlas 800 training server (model 9000) (full configuration of NPUs)
- Atlas 800 training server (model 9010) (full configuration of NPUs)
- Atlas 800 training server (model 9000) (half configuration of NPUs)
- Atlas 800 training server (model 9010) (half configuration of NPUs)

Type: Training framework
Configuration: MindSpore/TensorFlow/PyTorch. Only the MindSpore framework supports the dying gasp feature of resumable training.

- The dying gasp function does not support encryption and decryption of saved checkpoint files.
- The dying gasp function uses the SIGTERM and SIGINT signals.
- Before enabling resumable training, check that the storage device has enough free disk space for the checkpoints.
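
The last two notes can be illustrated with a short Python sketch. It is a generic illustration, not the product's implementation: the names `has_enough_space` and `DyingGaspFlag`, the `keep` and `margin` parameters, and the 20% safety margin are all assumptions made for this example. The first function checks free space on the file system that holds the checkpoint directory. The class shows the common pattern of recording SIGTERM or SIGINT so that a training loop can save a final checkpoint before it exits.

```python
import shutil
import signal


def has_enough_space(ckpt_dir, ckpt_bytes, keep=1, margin=1.2):
    """Return True if the file system holding ckpt_dir can store `keep`
    checkpoints of `ckpt_bytes` each, with a safety margin.

    `keep` and `margin` are hypothetical knobs chosen for this sketch.
    """
    free = shutil.disk_usage(ckpt_dir).free
    return free >= ckpt_bytes * keep * margin


class DyingGaspFlag:
    """Records that SIGTERM or SIGINT arrived (hypothetical helper).

    The handler only stores the signal number; the training loop is
    expected to poll `received` and save a checkpoint before exiting.
    """

    def __init__(self):
        self.received = None

    def install(self):
        # signal.signal() must be called from the main thread.
        for sig in (signal.SIGTERM, signal.SIGINT):
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        self.received = signum
```

A caller might run `has_enough_space("/path/to/ckpt", estimated_size, keep=3)` before submitting a job, where the path and size are placeholders. The signal helper is best read as a standalone illustration of the pattern rather than something to add to a job that already relies on the dying gasp function, since that function uses the same two signals.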