Product Description

Overview

MindCluster MindIO Async Checkpoint Persistence (MindIO ACP for short) accelerates the saving and loading of checkpoints during foundation model training. Checkpoint data is first written to the memory system of a training server and is then asynchronously written to a reliable backend storage device. This document describes vertical acceleration, including the checkpoint write and read processes in the system.

Product Benefit

Large language models (LLMs) are a major focus of global scientific and technological competition. Training an LLM can take from several days to several months, so checkpoints are crucial for recovering the model after a training interruption. The density of checkpoints and their storage and recovery performance are critical factors that can significantly improve the effective throughput of the training system. The MindIO ACP checkpoint acceleration solution helps Ascend products expand their market share in the LLM field.

This solution improves the training throughput of LLMs on the Ascend platform, with performance surpassing the Microsoft Azure Nebula solution.

MindIO ACP Architecture

Key points for MindIO ACP to accelerate LLM checkpoint saving and loading:

  • Asynchronous persistence: After a training framework saves a checkpoint to MindIO ACP through the save and load APIs of mindio_acp or MindSpore, MindIO ACP returns immediately so that training can continue. This step takes seconds. MindIO ACP then asynchronously writes the checkpoint to persistent distributed storage, which takes minutes.
  • High-performance memory file system (MemFS): MindIO ACP implements a user-mode, memory-based file system for fast checkpoint writing. It eliminates the system calls of standard file systems and the memory copy from user mode to kernel mode.
  • Efficient checkpoint saving and loading: MindIO ACP provides efficient checkpoint saving and loading modes for fast checkpoint writes and recovery.
  • Auto fault tolerance: When data read/write fails or experiences a timeout due to MindIO ACP service exceptions, the system automatically switches to the native data storage mode to ensure service continuity.

    MindIO ACP stores only checkpoint data during training and does not store or process sensitive data. If sensitive data needs to be stored, desensitize the data before using it to avoid information security issues.
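
The asynchronous persistence and auto fault tolerance points above can be illustrated with a minimal sketch. It is not the MindIO ACP implementation and does not use the mindio_acp or MindSpore APIs. The class AsyncCheckpointWriter, the function save_with_fallback, and all parameter names are hypothetical. A Python dict stands in for MemFS, a local directory stands in for persistent distributed storage, and pickle stands in for the real checkpoint serializer.

```python
import pickle
import queue
import tempfile
import threading
from pathlib import Path


class AsyncCheckpointWriter:
    """Toy model of asynchronous persistence (hypothetical, not MindIO ACP)."""

    def __init__(self, storage_dir):
        self.storage_dir = Path(storage_dir)
        self.storage_dir.mkdir(parents=True, exist_ok=True)
        self._memory = {}               # stands in for the in-memory file system
        self._pending = queue.Queue()   # checkpoint names waiting to be persisted
        self._lock = threading.Lock()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def save(self, name, state):
        """Stage the checkpoint in memory and return without waiting for disk."""
        blob = pickle.dumps(state)
        with self._lock:
            self._memory[name] = blob
        self._pending.put(name)

    def load(self, name):
        """Read the staged copy if present, otherwise read persistent storage."""
        with self._lock:
            blob = self._memory.get(name)
        if blob is None:
            blob = (self.storage_dir / name).read_bytes()
        return pickle.loads(blob)

    def wait(self):
        """Block until every staged checkpoint has been handled by the worker."""
        self._pending.join()

    def _flush_loop(self):
        # Background worker: copy staged checkpoints to persistent storage.
        while True:
            name = self._pending.get()
            try:
                with self._lock:
                    blob = self._memory[name]
                tmp = self.storage_dir / (name + ".tmp")
                tmp.write_bytes(blob)
                tmp.replace(self.storage_dir / name)  # publish the finished file
            except OSError:
                pass  # keep the in-memory copy; a real system would retry
            finally:
                self._pending.task_done()


def save_with_fallback(writer, name, state, native_dir):
    """Try the accelerated path; on any failure, write synchronously instead."""
    try:
        writer.save(name, state)
        return "accelerated"
    except Exception:
        (Path(native_dir) / name).write_bytes(pickle.dumps(state))
        return "native"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        writer = AsyncCheckpointWriter(d)
        print(save_with_fallback(writer, "step_100.ckpt", {"step": 100}, d))
        writer.wait()  # only needed here so the demo can inspect the file
        print((Path(d) / "step_100.ckpt").exists())
```

In this sketch, save returns as soon as the checkpoint is staged in memory, so the training loop is blocked only for serialization. The slower write to storage happens on the background thread, and save_with_fallback switches to a plain synchronous write when the accelerated path raises an exception.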