Connecting to a Third-Party AI Platform's Recovery Policy
Description
An AI platform can use Pod Group annotations to control the fault recovery process and policy. For example, if the platform writes the Pod Group annotation key ProcessRecoverStrategy with an empty value, fault recovery is suspended. Recovery resumes only after the platform writes a specific recovery policy.
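The way a consumer might interpret the ProcessRecoverStrategy annotation can be sketched in Python. This is a minimal illustration based on the values in the Pod Group Annotation table, not the actual ClusterD logic: the function name `recovery_action` and the returned labels `disabled`, `wait`, and `unknown` are hypothetical, and how values outside the table are handled is an assumption.

```python
from typing import Mapping, Optional

STRATEGY_KEY = "ProcessRecoverStrategy"
# Strategy values taken from the Pod Group Annotation table.
KNOWN_STRATEGIES = {"retry", "recover", "dump"}


def recovery_action(annotations: Mapping[str, Optional[str]]) -> str:
    """Map Pod Group annotations to a coarse action label.

    The labels "disabled", "wait" and "unknown" are hypothetical names
    chosen for this sketch; they are not part of the product.
    """
    if STRATEGY_KEY not in annotations:
        # Field does not exist: process-level recovery is disabled.
        return "disabled"
    value = (annotations[STRATEGY_KEY] or "").strip()
    if value in ("", "none"):
        # Empty or "none": recovery is suspended until the platform decides.
        return "wait"
    if value in KNOWN_STRATEGIES:
        return value
    # Assumption: the source does not say how other values are treated.
    return "unknown"
```

For example, `recovery_action({"ProcessRecoverStrategy": ""})` yields the `wait` label, which corresponds to the suspended state described above.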
Pod Group Annotation
| Parameter | Value | Description |
|---|---|---|
| ProcessRecoverStrategy | retry | The platform triggers process-level online recovery. |
| ProcessRecoverStrategy | recover | The platform triggers online recovery. |
| ProcessRecoverStrategy | dump | The platform triggers dying gasp saving. |
| ProcessRecoverStrategy | Null or none | Wait for the platform to make a decision. |
| ProcessRecoverStrategy | Field does not exist | Process-level recovery is disabled. |
| ProcessConfirmFault | String | List of fault key-value pairs refreshed by ClusterD. The value is a character string in the format "id1:type1,id2:type2", where id is the global rank ID and type is the fault type. type = 0 indicates only on-chip memory faults, and type = 1 indicates at least one non-on-chip memory fault. |
| ProcessResultFault | String | List of fault key-value pairs confirmed by the platform. The value is a character string in the format "id1:type1,id2:type2", where id is the global rank ID and type is the fault type. type = 0 indicates only on-chip memory faults, and type = 1 indicates at least one non-on-chip memory fault. |
| RankTableReady | true | The platform has generated RankTable. |
| RankTableReady | false or other values | The platform has not generated RankTable. |
| RankTableReady | Field does not exist | Non-RankTable mode. |
| ProcessRecoverStatus | retry-success | Process-level online recovery succeeded. |
| ProcessRecoverStatus | retry-failed | Process-level online recovery failed. |
| ProcessRecoverStatus | recover-success | Online recovery succeeded. |
| ProcessRecoverStatus | recover-failed | Online recovery failed. |
| ProcessRecoverStatus | dump-success | The dying gasp was saved successfully. |
| ProcessRecoverStatus | dump-failed | The dying gasp failed to be saved. |
| ProcessRecoverStatus | exit-completed | - |
| ProcessRecoverStatus | Null or other values | Recovery is in progress. |
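
The "id1:type1,id2:type2" format used by ProcessConfirmFault and ProcessResultFault can be parsed and rebuilt with a few lines of Python. The following is a minimal sketch under stated assumptions: the helper names `parse_fault_list` and `format_fault_list` are hypothetical, and treating the global rank ID as an integer is an assumption not confirmed by the table.

```python
from typing import Dict

# Fault types from the table: 0 = only on-chip memory faults,
# 1 = at least one non-on-chip memory fault.
VALID_FAULT_TYPES = {0, 1}


def parse_fault_list(value: str) -> Dict[int, int]:
    """Parse "id1:type1,id2:type2" into {global_rank_id: fault_type}.

    Assumption: rank IDs are integers; the source only calls them
    "global rank ID".
    """
    faults: Dict[int, int] = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate an empty string or a trailing comma
        rank_id, sep, fault_type = item.partition(":")
        if not sep:
            raise ValueError(f"malformed fault entry: {item!r}")
        type_code = int(fault_type)
        if type_code not in VALID_FAULT_TYPES:
            raise ValueError(f"unknown fault type in entry: {item!r}")
        faults[int(rank_id)] = type_code
    return faults


def format_fault_list(faults: Dict[int, int]) -> str:
    """Serialize {global_rank_id: fault_type} back to the annotation format."""
    return ",".join(f"{rank}:{ftype}" for rank, ftype in sorted(faults.items()))
```

A platform could, for instance, parse the ProcessConfirmFault value refreshed by ClusterD, decide which faults it confirms, and write the result back as ProcessResultFault using `format_fault_list`.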