tft_register_rebuild_group_handler

Function

Registers the callback function that MindIO ARF calls to re-create a communication group.

For MindSpeed-LLM, MindIO TFT already provides an adapted callback function. For other frameworks, you are responsible for ensuring that the callback function is secure.

Format

mindio_ttp.framework_ttp.tft_register_rebuild_group_handler(func: Callable, ctx = None)

Parameters

| Parameter | Mandatory/Optional | Description | Value |
| --- | --- | --- | --- |
| func | Mandatory | Callback function that MindIO ARF calls to re-create a group. It clears the old communication group and creates a new communication group on both the normal and the restarted nodes. By default, the callback function times out after 180 seconds; if it times out, the execution fails. Set the environment variable TTP_NORMAL_ACTION_TIME_LIMIT to change the timeout interval. | Cannot be empty. For the input parameters of the callback function, see Table 1. The callback function has no return value. If its execution fails, an exception is thrown. |
| ctx | Optional | Callback function context. | Left empty (None) by default. |

Table 1 Parameters of the callback function

| Parameter | Mandatory/Optional | Description | Value |
| --- | --- | --- | --- |
| fault_ranks | - | Collection of faulty NPUs. | List |
| args | - | Parameters set by tft_set_step_args. | Determined by the registrant. |
| ctx | - | Callback function context. | Determined by the registrant. |
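As an illustration of the callback contract in Table 1, the sketch below shows one possible shape for such a function. It is not the MindSpeed-LLM implementation. The helper names `destroy_groups` and `create_groups`, the use of a dict as `ctx`, the integer rank values, and the `{"step": 100}` arguments are all hypothetical. A real callback would call the framework's own routines to tear down and re-create communication groups.

```python
from typing import Any, Callable, Dict, List


def rebuild_group(fault_ranks: List[int], args: Any, ctx: Dict[str, Callable]) -> None:
    """Hypothetical rebuild-group callback following Table 1."""
    if not isinstance(fault_ranks, list):
        raise TypeError("fault_ranks is expected to be a list")
    # Both helpers are supplied by the registrant through ctx (hypothetical names).
    ctx["destroy_groups"]()      # clear the old communication group
    ctx["create_groups"](args)   # create the new communication group
    # No return value; any exception raised above propagates to the caller.


# Stand-ins that only record calls, so the sketch runs without a training framework.
calls = []
demo_ctx = {
    "destroy_groups": lambda: calls.append("destroy"),
    "create_groups": lambda step_args: calls.append(("create", step_args)),
}

# Simulate the invocation made after rank 3 is reported faulty.
result = rebuild_group([3], {"step": 100}, demo_ctx)
```

Keeping the group-management routines in `ctx` rather than in module globals is one way to keep the callback easy to test. Whether that suits a given framework is a design choice for the registrant.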

Return Value

None. If an error occurs, an error is logged and an exception is thrown.