Preparing Kubernetes and Shared Storage

Resumable training is an advanced feature of the MindCluster cluster scheduling components. It works together with the Ascend full-stack software and hardware to recover training from faults. Before using this feature, make sure the following prerequisites are met.

  • A shared storage system is available.

    Several stages of resumable training, such as checkpoint loading, training startup, and cache building and loading, read data from storage, so storage performance directly affects the overall recovery time. To keep the recovery time from degrading, you are advised to optimize storage performance. The following reference values use a 10,000-card cluster as an example.

    • 8 KB I/O read IOPS > 10.24 million
    • 8 KB I/O write IOPS > 1.28 million
    • Sequential read bandwidth of large files > 288 GB/s
    • Write bandwidth when creating large files > 173 GB/s
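
    The reference values above can be turned into a simple acceptance check for storage benchmark results. The following is a minimal sketch, not part of MindCluster: the metric names, the unmet_requirements helper, and the sample measurements are all hypothetical, and the IOPS thresholds assume that "W" in the original figures denotes 10,000 (so 1024 W = 10,240,000).

```python
# Reference thresholds for a 10,000-card cluster, taken from the list above.
# Assumption: "W" denotes 10,000, so 1024 W = 10,240,000 and 128 W = 1,280,000.
# The metric names are hypothetical and chosen only for this sketch.
THRESHOLDS = {
    "read_iops_8k": 10_240_000,   # 8 KB I/O read IOPS
    "write_iops_8k": 1_280_000,   # 8 KB I/O write IOPS
    "seq_read_gb_s": 288.0,       # sequential read bandwidth, large files (GB/s)
    "create_write_gb_s": 173.0,   # write bandwidth on large-file creation (GB/s)
}


def unmet_requirements(measured):
    """Return the names of metrics that do not exceed their threshold.

    A metric missing from `measured` is treated as unmet.
    """
    return [
        name
        for name, limit in THRESHOLDS.items()
        if measured.get(name, 0) <= limit
    ]


if __name__ == "__main__":
    # Hypothetical benchmark results, for illustration only.
    measured = {
        "read_iops_8k": 11_000_000,
        "write_iops_8k": 1_000_000,
        "seq_read_gb_s": 300.0,
        "create_write_gb_s": 180.0,
    }
    print(unmet_requirements(measured))  # prints ['write_iops_8k']
```

    An empty list means that every measured value exceeds its reference value. The measurements themselves would come from whatever storage benchmark tool is used in your environment.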