Adapting to a Non-MindSpeed-LLM Framework

Prerequisites

Before you start, read the constraints of MindIO TFT.

  • The release package is compatible with Megatron-like frameworks. You must prepare the environment, code, and dataset yourself, and you are responsible for their security.
  • This section provides adaptation guidance only. You must complete the detailed implementation yourself.

Feature Reference

Table 1 lists the adaptation points that each feature requires, and Table 2 provides a code reference link for each adaptation point.

Table 1 Features and adaptation points

Feature                                             Adaptation Point No.
--------------------------------------------------  ----------------------------
Dying gasp                                          1, 2, 3, 4, 5, 6, 7
Quick UCE recovery                                  1, 2, 3, 4, 5, 6, 8, 10, 11
Quick network recovery                              1, 2, 5, 6, 11
Quick process recovery                              1, 2, 3, 4, 5, 6, 9, 10, 11
Subhealth hot switchover                            1, 2, 3, 4, 5, 9, 10, 11, 12
Online stress testing/Link failover and switchback  1, 2, 12

Table 2 Code reference links of related functions

No.  Adaptation Point                                Reference Code
---  ----------------------------------------------  --------------
1    Boot while initializing                         See here.
2    Optimizer update status reporting
3    DP replica group creation                       See here.
4    Optimizer replica                               See here.
5    Exception capture decorator                     See here.
6    Operator resource clearing                      See here.
7    Dying gasp checkpoint                           See here.
8    UCE model optimizer rebuilding                  See here.
9    Node restart and communication re-establishment See here.
10   Online parameter plane repair                   See here.
11   Status rollback                                 See here.
12   Graceful suspension                             See here.
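
When several features are enabled together, the adaptation work is the union of their rows in Table 1. The following is a minimal planning sketch, not part of MindIO TFT: it transcribes Table 1 and Table 2 into dictionaries and computes which adaptation points a chosen set of features requires. The names FEATURE_POINTS, POINT_NAMES, and required_points are hypothetical and exist only for this illustration.

```python
# Planning helper only; it does not call any MindIO TFT API.
# Data below is transcribed from Table 1 and Table 2 of this section.

# Table 1: feature -> adaptation point numbers
FEATURE_POINTS = {
    "Dying gasp": [1, 2, 3, 4, 5, 6, 7],
    "Quick UCE recovery": [1, 2, 3, 4, 5, 6, 8, 10, 11],
    "Quick network recovery": [1, 2, 5, 6, 11],
    "Quick process recovery": [1, 2, 3, 4, 5, 6, 9, 10, 11],
    "Subhealth hot switchover": [1, 2, 3, 4, 5, 9, 10, 11, 12],
    "Online stress testing/Link failover and switchback": [1, 2, 12],
}

# Table 2: adaptation point number -> adaptation point name
POINT_NAMES = {
    1: "Boot while initializing",
    2: "Optimizer update status reporting",
    3: "DP replica group creation",
    4: "Optimizer replica",
    5: "Exception capture decorator",
    6: "Operator resource clearing",
    7: "Dying gasp checkpoint",
    8: "UCE model optimizer rebuilding",
    9: "Node restart and communication re-establishment",
    10: "Online parameter plane repair",
    11: "Status rollback",
    12: "Graceful suspension",
}


def required_points(features):
    """Return the sorted union of adaptation point numbers for the given features."""
    unknown = [name for name in features if name not in FEATURE_POINTS]
    if unknown:
        raise KeyError(f"Features not listed in Table 1: {unknown}")
    points = set()
    for name in features:
        points.update(FEATURE_POINTS[name])
    return sorted(points)


if __name__ == "__main__":
    # Example: enable dying gasp together with quick network recovery.
    for number in required_points(["Dying gasp", "Quick network recovery"]):
        print(f"{number}: {POINT_NAMES[number]}")
```

Adaptation points 1 and 2 appear in every row of Table 1, so they are required whichever features are chosen.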