Connecting to a Third-Party AI Platform's Recovery Policy

Description

An AI platform can use Pod Group annotations to control the fault recovery process and policy. For example, if the platform writes the Pod Group annotation key ProcessRecoverStrategy with an empty value, fault recovery is suspended. The recovery process does not continue until the platform writes a specific recovery policy.
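The suspend-then-decide sequence described above could be sketched as follows. This is a minimal illustration, not the platform's actual integration code: it assumes the annotation lives under the standard Kubernetes metadata.annotations field of the Pod Group object and that the platform updates it with a JSON merge patch. The helper name strategy_patch is hypothetical, and how the patch is sent to the API server is left out.

```python
import json

STRATEGY_KEY = "ProcessRecoverStrategy"
# Policy values taken from the parameter table in this section.
VALID_STRATEGIES = {"retry", "recover", "dump"}


def strategy_patch(strategy: str = "") -> dict:
    """Build a JSON merge patch body that sets the strategy annotation.

    Hypothetical helper. An empty string writes the key with an empty
    value, which suspends recovery until a policy is written.
    """
    if strategy and strategy not in VALID_STRATEGIES:
        raise ValueError(f"unsupported recovery strategy: {strategy!r}")
    return {"metadata": {"annotations": {STRATEGY_KEY: strategy}}}


# Step 1: suspend recovery while the platform decides.
suspend_body = json.dumps(strategy_patch())
# Step 2: write a specific policy so that recovery continues.
resume_body = json.dumps(strategy_patch("retry"))

print(suspend_body)
print(resume_body)
```

Validating the value before building the patch keeps a typo from being written as an annotation value that this section does not define.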

Pod Group Annotation

Table 1 Parameters

| Parameter | Value | Description |
|---|---|---|
| ProcessRecoverStrategy | retry | The platform triggers process-level online recovery. |
| ProcessRecoverStrategy | recover | The platform triggers online recovery. |
| ProcessRecoverStrategy | dump | The platform triggers dying gasp saving. |
| ProcessRecoverStrategy | Null or none | Recovery waits for the platform to make a decision. |
| ProcessRecoverStrategy | Field does not exist | Process-level recovery is disabled. |
| ProcessConfirmFault | String | List of fault key-value pairs refreshed by ClusterD. The value is a string in the format "id1:type1,id2:type2", where id is the global rank ID and type is the fault type. type = 0 indicates only on-chip memory faults, and type = 1 indicates at least one non-on-chip memory fault. |
| ProcessResultFault | String | List of fault key-value pairs confirmed by the platform. The value is a string in the format "id1:type1,id2:type2", where id is the global rank ID and type is the fault type. type = 0 indicates only on-chip memory faults, and type = 1 indicates at least one non-on-chip memory fault. |
| RankTableReady | true | The platform has generated RankTable. |
| RankTableReady | false or other values | The platform has not generated RankTable. |
| RankTableReady | Field does not exist | Non-RankTable mode. |
| ProcessRecoverStatus | retry-success | Process-level online recovery succeeded. |
| ProcessRecoverStatus | retry-failed | Process-level online recovery failed. |
| ProcessRecoverStatus | recover-success | Online recovery succeeded. |
| ProcessRecoverStatus | recover-failed | Online recovery failed. |
| ProcessRecoverStatus | dump-success | Dying gasp was saved successfully. |
| ProcessRecoverStatus | dump-failed | Dying gasp could not be saved. |
| ProcessRecoverStatus | exit-completed | - |
| ProcessRecoverStatus | Null or other values | Recovery is in progress. |
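
The value rules in Table 1 can be read as a small decision procedure. The following sketch shows one way a platform-side component might interpret the annotations once they are available as a plain dictionary. All function names and returned labels are hypothetical. It also makes two assumptions that this section does not state: "none" is treated as a literal string value, and any ProcessRecoverStrategy value not listed in the table is reported as unrecognized.

```python
from typing import Dict, Mapping, Optional

KNOWN_STATUSES = {
    "retry-success", "retry-failed",
    "recover-success", "recover-failed",
    "dump-success", "dump-failed",
    "exit-completed",
}


def parse_fault_list(value: Optional[str]) -> Dict[int, int]:
    """Parse "id1:type1,id2:type2" into {global_rank_id: fault_type}."""
    faults: Dict[int, int] = {}
    for item in (value or "").split(","):
        item = item.strip()
        if not item:
            continue  # assumption: tolerate empty values and stray commas
        rank_id, sep, fault_type = item.partition(":")
        if not sep:
            raise ValueError(f"malformed fault entry: {item!r}")
        faults[int(rank_id)] = int(fault_type)
    return faults


def strategy_state(annotations: Mapping[str, Optional[str]]) -> str:
    """Classify ProcessRecoverStrategy according to Table 1."""
    if "ProcessRecoverStrategy" not in annotations:
        return "disabled"  # field absent: process-level recovery is off
    value = annotations["ProcessRecoverStrategy"]
    if value in ("retry", "recover", "dump"):
        return value
    if value in (None, "", "none"):
        return "waiting"  # the platform has not decided yet
    return "unrecognized"  # assumption: not covered by the table


def rank_table_state(annotations: Mapping[str, Optional[str]]) -> str:
    """Classify RankTableReady according to Table 1."""
    if "RankTableReady" not in annotations:
        return "non-ranktable-mode"
    return "ready" if annotations["RankTableReady"] == "true" else "not-ready"


def recovery_state(annotations: Mapping[str, Optional[str]]) -> str:
    """Return a listed ProcessRecoverStatus value, else "in-progress"."""
    status = annotations.get("ProcessRecoverStatus")
    return status if status in KNOWN_STATUSES else "in-progress"


annotations = {
    "ProcessRecoverStrategy": "",
    "ProcessConfirmFault": "0:0,3:1",
    "RankTableReady": "true",
}
print(strategy_state(annotations))                              # waiting
print(parse_fault_list(annotations.get("ProcessConfirmFault")))  # {0: 0, 3: 1}
print(rank_table_state(annotations))                            # ready
print(recovery_state(annotations))                              # in-progress
```

Checking for key presence separately from the value matters here, because an absent field and an empty value mean different things for both ProcessRecoverStrategy and RankTableReady.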