Feature Description

  • The following table lists the system specifications supported by resumable training.
    Table 1 System specifications

    Type                 Configuration
    Server               • Atlas 800 training server (model 9000) (full configuration of NPUs)
                         • Atlas 800 training server (model 9010) (full configuration of NPUs)
                         • Atlas 800 training server (model 9000) (half configuration of NPUs)
                         • Atlas 800 training server (model 9010) (half configuration of NPUs)
    Training framework   MindSpore/TensorFlow/PyTorch. Only the MindSpore framework supports the dying gasp feature of resumable training.


  • The dying gasp function does not support encryption and decryption of saved checkpoint files.
  • The dying gasp function uses the SIGTERM and SIGINT signals.
  • Before enabling resumable training, check that the storage device has enough free drive space for the checkpoints.
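
The drive space check in the last item can be scripted before a job is submitted. The following is a minimal sketch in Python, not part of the resumable training feature: the function name has_room_for_checkpoints and its parameters (ckpt_size_bytes, keep_count, margin) are hypothetical, and the checkpoint size and retention count must be estimated for your own model.

```python
import shutil


def has_room_for_checkpoints(ckpt_dir, ckpt_size_bytes, keep_count, margin=1.2):
    """Return True if the file system holding ckpt_dir can store the checkpoints.

    All parameters are illustrative assumptions:
      ckpt_size_bytes  estimated size of one checkpoint file
      keep_count       number of checkpoints kept at the same time
      margin           safety factor for temporary files and size growth
    """
    required = int(ckpt_size_bytes * keep_count * margin)
    # shutil.disk_usage reports total, used, and free bytes for the
    # file system that contains the given path.
    free = shutil.disk_usage(ckpt_dir).free
    return free >= required


if __name__ == "__main__":
    # Example: five checkpoints of about 2 GiB each in the current directory.
    if has_room_for_checkpoints(".", 2 * 1024**3, keep_count=5):
        print("Enough free space for the checkpoints.")
    else:
        print("Not enough free space for the checkpoints.")
```

Run the check against the directory where checkpoints are actually written, for example a mounted shared storage path, because free space is reported per file system.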