Adapting to a Non-MindSpeed-LLM Framework
Prerequisites
- Review the Constraints of MindIO TFT.
- The release package is compatible with Megatron-like frameworks. You must prepare the environment, code, and dataset yourself and ensure their security.
- This section provides adaptation guidance only. You must carry out the detailed implementation yourself.
Feature Reference
Table 1 lists the adaptation points that each feature requires, and Table 2 names each adaptation point and provides its reference code link.
| Feature | Adaptation Point No. |
|---|---|
| Dying gasp | 1, 2, 3, 4, 5, 6, 7 |
| Quick UCE recovery | 1, 2, 3, 4, 5, 6, 8, 10, 11 |
| Quick network recovery | 1, 2, 5, 6, 11 |
| Quick process recovery | 1, 2, 3, 4, 5, 6, 9, 10, 11 |
| Subhealth hot switchover | 1, 2, 3, 4, 5, 9, 10, 11, 12 |
| Online stress testing/Link failover and switchback | 1, 2, 12 |

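
When several features are enabled together, the adaptation work is the union of the adaptation points listed for each feature. The following is a minimal sketch of that lookup, not part of MindIO TFT: the dictionary simply transcribes the feature table above, and the names `REQUIRED_POINTS` and `required_adaptation_points` are hypothetical helpers introduced for illustration.

```python
# Feature-to-adaptation-point mapping transcribed from the feature table.
# REQUIRED_POINTS and required_adaptation_points are illustrative names,
# not MindIO TFT APIs.
REQUIRED_POINTS = {
    "Dying gasp": {1, 2, 3, 4, 5, 6, 7},
    "Quick UCE recovery": {1, 2, 3, 4, 5, 6, 8, 10, 11},
    "Quick network recovery": {1, 2, 5, 6, 11},
    "Quick process recovery": {1, 2, 3, 4, 5, 6, 9, 10, 11},
    "Subhealth hot switchover": {1, 2, 3, 4, 5, 9, 10, 11, 12},
    "Online stress testing/Link failover and switchback": {1, 2, 12},
}


def required_adaptation_points(features):
    """Return the sorted union of adaptation point numbers for the given features."""
    unknown = [name for name in features if name not in REQUIRED_POINTS]
    if unknown:
        raise KeyError(f"Unknown feature(s): {unknown}")
    points = set()
    for name in features:
        points |= REQUIRED_POINTS[name]
    return sorted(points)


if __name__ == "__main__":
    # Dying gasp plus quick network recovery.
    print(required_adaptation_points(["Dying gasp", "Quick network recovery"]))
    # prints [1, 2, 3, 4, 5, 6, 7, 11]
```

Adaptation points 1 and 2 appear in every row, so they are needed regardless of which features are selected.
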

| No. | Adaptation Point | Reference Code |
|---|---|---|
| 1 | Boot while initializing | |
| 2 | Optimizer update status reporting | |
| 3 | DP replica group creation | |
| 4 | Optimizer replica | |
| 5 | Exception capture decorator | |
| 6 | Operator resource clearing | |
| 7 | Dying gasp checkpoint | |
| 8 | UCE model optimizer rebuilding | |
| 9 | Node restart and communication re-establishment | |
| 10 | Online parameter plane repair | |
| 11 | Status rollback | |
| 12 | Graceful suspension | |