tft_register_repair_handler
Function
Registers the repair callback function.
- For MindSpeed-LLM, MindIO TFT already provides an adapted callback function. For other frameworks, you must ensure the security of the callback function yourself.
- In the adapted callback function, MindIO TFT rebuilds and overwrites the variables of the model optimizer. Any other custom variables involved in computation must be rebuilt and overwritten in your repair function.
Format
mindio_ttp.framework_ttp.tft_register_repair_handler(func: Callable, ctx = None)
Parameters
| Parameter | Mandatory/Optional | Description | Value |
|---|---|---|---|
| func | Mandatory | Repair callback function, used to repair data such as the optimizer data. The callback must finish within a timeout interval, which is 180 seconds by default. If execution times out, the repair process fails. The timeout interval can be changed with the environment variable TTP_NORMAL_ACTION_TIME_LIMIT. | Cannot be empty. For the input parameters of the callback function, see Table 1. The callback function has no return value. If execution fails, it throws an exception. |
| ctx | Optional | Callback function context. | Defaults to None. |

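
The timeout override mentioned for `func` can be sketched as follows. This is a minimal sketch that assumes the variable takes an integer number of seconds (matching the 180-second default) and that MindIO TFT reads it after this code has run; the helper name `set_repair_timeout` is hypothetical and not part of MindIO TFT.

```python
import os


def set_repair_timeout(seconds: int) -> None:
    """Set TTP_NORMAL_ACTION_TIME_LIMIT for the current process.

    Assumption: the value is an integer number of seconds.
    """
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    # Environment variables are strings, so convert before storing.
    os.environ["TTP_NORMAL_ACTION_TIME_LIMIT"] = str(seconds)
```

Calling `set_repair_timeout(300)` early in the launch script, or exporting the variable in the shell that starts training, would give a slow repair callback more time to finish.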
Table 1 Input parameters of the callback function

| Parameter | Mandatory/Optional | Description | Value |
|---|---|---|---|
| step | - | Step to which the repair applies. | Positive integer |
| need_rebuild | - | Whether the model and optimizer need to be rebuilt. | |
| error_ranks | - | List of faulty NPUs to be repaired. | List |
| repair_info | - | Repair policy dictionary. The optimizer type takes the value 0 for ATTENTION and 1 for MOE. | Dictionary with the following keys: "type": int, optimizer type; "repair_type": Enum, see 5.34-RepairType; "src": list, source NPUs from which the optimizer repairs data; "dst": list, destination NPUs to which the optimizer repairs data; "rank_list": list, NPUs required for repairing a communication group. |
| args | - | Parameter set by tft_set_step_args. | Determined by the registering party. |
| ctx | - | Callback function context. | Determined by the registering party. |
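
The `repair_info` dictionary described above can be inspected inside a callback along these lines. This is only a sketch: it assumes that the entries of `src`, `dst`, and `rank_list` are integer rank IDs, and the names `OptimizerType` and `describe_role` are hypothetical helpers, not part of MindIO TFT. The `repair_type` value is left untouched because its members are defined in 5.34-RepairType.

```python
from enum import IntEnum
from typing import Any, Dict


class OptimizerType(IntEnum):
    """Optimizer type codes as given in the table: ATTENTION (0), MOE (1)."""
    ATTENTION = 0
    MOE = 1


def describe_role(rank: int, repair_info: Dict[str, Any]) -> str:
    """Return the role of `rank` in the repair described by `repair_info`.

    Assumption: the lists hold integer rank IDs.
    """
    if rank in repair_info.get("src", []):
        return "source"
    if rank in repair_info.get("dst", []):
        return "destination"
    if rank in repair_info.get("rank_list", []):
        return "group-member"
    return "uninvolved"
```

For example, with `repair_info = {"type": 1, "src": [0], "dst": [4], "rank_list": [0, 4]}`, `OptimizerType(repair_info["type"])` is `OptimizerType.MOE` and `describe_role(4, repair_info)` returns `"destination"`.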
Return Value
No return value. If an error occurs, an error log is recorded and an exception is thrown.
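
A registration might look like the following sketch. It assumes that the callback receives the input parameters listed above positionally and in that order, that `need_rebuild` is a Boolean, and that `ctx` is a dictionary chosen by the caller. The keys `rebuild_fn`, `custom_state`, and `last_repaired_step` and the function names are hypothetical. Only the import path and the `tft_register_repair_handler(func, ctx)` signature come from this page.

```python
import logging
from typing import Any, Dict, List, Optional

logger = logging.getLogger("repair_example")


def repair_callback(step: int, need_rebuild: bool, error_ranks: List[int],
                    repair_info: Dict[str, Any], args: Any,
                    ctx: Optional[Dict[str, Any]]) -> None:
    """Hypothetical repair callback that restores custom (non-optimizer) state.

    Returns nothing; any failure propagates as an exception.
    """
    if ctx is None:
        raise RuntimeError("repair callback needs the context passed at registration")
    logger.info("repair at step %s, need_rebuild=%s, faulty=%s",
                step, need_rebuild, error_ranks)
    # Rebuild and overwrite custom variables that take part in computation.
    # `rebuild_fn` is a user-supplied callable (an assumption of this sketch).
    ctx["custom_state"] = ctx["rebuild_fn"](step)
    ctx["last_repaired_step"] = step


def register_repair_handler(ctx: Dict[str, Any]) -> bool:
    """Register `repair_callback` if MindIO TFT is importable; report success."""
    try:
        from mindio_ttp.framework_ttp import tft_register_repair_handler
    except ImportError:
        logger.warning("mindio_ttp is not installed; repair handler not registered")
        return False
    tft_register_repair_handler(repair_callback, ctx=ctx)
    return True
```

Calling `register_repair_handler(ctx)` once during training initialization would be the natural place for it in this sketch. Rebuilding the model and optimizer themselves when `need_rebuild` is set is left to framework-specific code and is not shown here.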