Constraints

  • MindIO TFT provides the TTP, UCE, and ARF features. MindIO TTP can be used on Atlas 800 training servers (model 9000); MindIO UCE and MindIO ARF do not support this server model.
  • Many large model frameworks support Zero Redundancy Optimizer (ZeRO) to reduce on-chip memory usage. Currently, MindIO TFT supports only ZeRO-1 and requires an even Data Parallelism Size (DP Size). The restrictions on DP Size vary by feature:
    • MindIO TTP
      • To ensure that complete optimizer state data is available after a fault occurs, the DP Size must be exactly divisible by the number of replicas.
      • When Mixture of Experts (MoE) is not enabled, the DP Size of the dense layer must be greater than 1. When MoE is enabled, the DP Size of both the dense layer and the sparse layer must be greater than 1.
      • For distributed optimizers, MindIO TFT builds on ZeRO-1 and re-partitions the optimizer's ZeRO-1 range across the DP Group. The optimizer data replicas are produced by computation rather than by transmission.
    • MindIO UCE and MindIO ARF
      • To restore training from the current step, the DP Size is subject to the same restrictions as for MindIO TTP.
      • If on-chip memory is limited and no replica is required (DP Size = 1), model weights and optimizer parameters can be loaded online from periodic checkpoints to restore training after a UCE or node fault. Training then resumes from the last periodic checkpoint, so the progress made between that checkpoint and the current step is lost.
    • Once the ZeRO feature is enabled for the distributed optimizer, only a single global copy of the optimizer state data exists, so there is no data redundancy. MindIO TFT introduces redundant replicas of the optimizer state to ensure data integrity in fault scenarios, which increases on-chip memory usage. If MindIO TFT is enabled on top of the original model configuration, an out-of-memory (OOM) exception may occur when model training starts. In this case, increase the total on-chip memory allocated to the training job.

      Formula for calculating the on-chip memory size when replicas are added: Total on-chip memory (GB) = Model parameter quantity N (B) × 12 × Number of replicas, where the model parameter quantity N is expressed in billions (B). Use this formula to calculate the additional on-chip memory, expand the memory accordingly, and then enable MindIO TFT.

  • Training Fault Tolerance (TFT) includes one Active Controller and two Backup Controllers. To ensure that the Backup Controllers can save the dying gasp checkpoint when multiple NPUs (including the Active Controller) fail, the number of functioning NPUs must be greater than half of world_size.
  • MindIO TFT creates replicas of the optimizer state data, and MindIO UCE or MindIO ARF restores the faulty NPU by searching for valid replicas. If a training cluster has many faults and a complete data replica cannot be obtained, recovery falls back to online periodic checkpoint loading rather than step-level restoration.
  • When generating the dying gasp checkpoint data, MindIO TFT not only creates a complete data replica but also checks that the data is consistent. After a fault occurs, if an optimizer state (OS) data shard remains in the updating state for a long time, or if the training iterations of different OS data shards do not match, the global data is deemed inconsistent and the latest checkpoint data cannot be generated.
  • MindIO TTP does not use MindIO ACP (Async Checkpoint Persistence). After MindIO TTP saves the dying gasp checkpoint, the training process ends. To ensure that the dying gasp checkpoint has reached persistent storage before the process exits, MindIO TTP writes data directly to persistent storage instead of using the asynchronous checkpoint.
  • Currently, MindIO TFT does not support cascading fault scenarios. For example, if another fault occurs when MindIO TTP is in the process of saving data, the saving operation will fail.
  • MindIO TFT increases memory usage. For details, see Table 1, "Theoretical value changes of optimizer parameters between the native optimizer and optimizer with MindIO TFT enabled".
  • The Transport Layer Security (TLS) feature is enabled by default. Disabling it exposes the training process to forged Controller connections, which can disrupt training.
  • MindIO ARF requires at least two nodes. Faults on the Controller node and cascading faults are not supported. If MindIO ARF fails, MindCluster controls subsequent processes.
  • By default, logs are stored in the logs/ttp_log.log file in the same directory as the running script. You can configure the log level in the running script; the default level is INFO. Logs are written in append-only mode, and a single log file can grow to at most 10 MB. When a log file reaches this limit, a rotating log file is created. At most 5 rotating log files are kept; once that number is reached, the files are written cyclically and the oldest file is overwritten.
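
The numeric constraints above (even DP Size, divisibility by the number of replicas, the on-chip memory formula, and the more-than-half rule for functioning NPUs) can be checked before a job is submitted. The following Python sketch is illustrative only: the function names and parameters are hypothetical and are not part of any MindIO API. It assumes the step-level recovery path, where a replica is required and DP Size must therefore be greater than 1.

```python
from typing import List

# Multiplier taken from the formula above:
# Total on-chip memory (GB) = N (B) x 12 x Number of replicas
GB_PER_BILLION_PARAMS = 12


def check_dp_size(dp_size: int, num_replicas: int) -> List[str]:
    """Return the DP Size constraints that are violated (hypothetical helper).

    Assumes a replica is required, that is, step-level recovery is wanted.
    """
    problems = []
    if dp_size <= 1:
        problems.append("DP Size must be greater than 1")
    if dp_size % 2 != 0:
        problems.append("DP Size must be even")
    if num_replicas < 1 or dp_size % num_replicas != 0:
        problems.append("DP Size must be exactly divisible by the number of replicas")
    return problems


def replica_memory_gb(params_billion: float, num_replicas: int) -> float:
    """Evaluate the on-chip memory formula; params_billion is N in billions."""
    return params_billion * GB_PER_BILLION_PARAMS * num_replicas


def enough_npus_for_dying_gasp(functioning_npus: int, world_size: int) -> bool:
    """True if strictly more than half of world_size NPUs are still functioning."""
    return functioning_npus * 2 > world_size


if __name__ == "__main__":
    # Example values chosen for illustration only.
    print(check_dp_size(dp_size=8, num_replicas=2))      # no violations: []
    print(check_dp_size(dp_size=3, num_replicas=2))      # odd and not divisible
    print(replica_memory_gb(params_billion=7, num_replicas=2))   # 7 x 12 x 2 = 168
    print(enough_npus_for_dying_gasp(functioning_npus=4, world_size=8))  # exactly half: False
```

Note that exactly half of world_size is not sufficient: the rule requires strictly more than half, which is why the check compares `functioning_npus * 2 > world_size` in integer arithmetic.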