tft_register_mindx_callback

Function

MindCluster calls this API to register repair-process callback functions with MindIO TFT.

Format

mindio_ttp.controller_ttp.tft_register_mindx_callback(action: str, func: Callable)

Parameters

Parameter

Mandatory/Optional

Description

Value

action

Mandatory

Name of the action for which the callback function is registered.

The value is of the string type. The following action names are supported:

  • report_fault_ranks
  • report_stop_complete
  • report_strategies
  • report_result

func

Mandatory

Function to be registered.

The callback function cannot be empty. For details about the input parameters of the callback function, see Table 1 to Table 4.

Table 1 Callback function parameters when action is set to report_fault_ranks

Parameter

Mandatory/Optional

Description

Value

error_rank_dict

-

Information about the faulty NPU.

<int key, int errorType> dictionary:

  • key: rank ID of the faulty NPU
  • errorType: fault type
    • 0: UCE
    • 1: non-UCE fault
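
As an illustration of Table 1, a report_fault_ranks callback might split the dictionary by fault type. This is a minimal sketch, assuming the dictionary is passed as a single positional argument. The names split_fault_ranks and on_report_fault_ranks are hypothetical and not part of MindIO TFT.

```python
from typing import Dict, List, Tuple

# Fault type codes from Table 1.
UCE = 0
NON_UCE = 1


def split_fault_ranks(error_rank_dict: Dict[int, int]) -> Tuple[List[int], List[int]]:
    """Return (uce_ranks, non_uce_ranks), each sorted by rank ID."""
    uce = sorted(r for r, t in error_rank_dict.items() if t == UCE)
    non_uce = sorted(r for r, t in error_rank_dict.items() if t == NON_UCE)
    return uce, non_uce


def on_report_fault_ranks(error_rank_dict: Dict[int, int]) -> None:
    """Hypothetical callback: log which ranks reported which fault type."""
    uce, non_uce = split_fault_ranks(error_rank_dict)
    print(f"UCE ranks: {uce}; non-UCE ranks: {non_uce}")
```

Keeping the decoding in a separate pure function makes it testable without a running cluster.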

Table 2 Callback function parameters when action is set to report_stop_complete

Parameter

Mandatory/Optional

Description

Value

code

-

Action execution result.

  • 0: success
  • 400: common error
  • 401: MindCluster task ID not found
  • 402: model error
  • 403: sequence error
  • 404: processor not ready

msg

-

Message indicating whether training has stopped.

String

error_rank_dict

-

Information about the faulty NPU.

<int key, int errorType> dictionary:

  • key: rank ID of the faulty NPU
  • errorType: fault type
    • 0: UCE
    • 1: non-UCE fault
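
The result codes in Table 2 can be turned into a readable log line. The sketch below assumes the three parameters arrive positionally in the order listed in the table. The names describe_stop_result and on_report_stop_complete are hypothetical.

```python
from typing import Dict

# Result codes from Table 2.
STOP_CODES = {
    0: "success",
    400: "common error",
    401: "MindCluster task ID not found",
    402: "model error",
    403: "sequence error",
    404: "processor not ready",
}


def describe_stop_result(code: int, msg: str) -> str:
    """Format the stop result; codes not listed in Table 2 are labeled unknown."""
    label = STOP_CODES.get(code, "unknown code")
    status = "succeeded" if code == 0 else "failed"
    return f"stop {status} ({code}: {label}): {msg}"


def on_report_stop_complete(code: int, msg: str, error_rank_dict: Dict[int, int]) -> None:
    """Hypothetical callback: log the stop result and the faulty ranks."""
    print(describe_stop_result(code, msg))
    print(f"faulty ranks: {sorted(error_rank_dict)}")
```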

Table 3 Callback function parameters when action is set to report_strategies

Parameter

Mandatory/Optional

Description

Value

error_rank_dict

-

Information about the faulty NPU.

<int key, int errorType> dictionary:

  • key: rank ID of the faulty NPU
  • errorType: fault type
    • 0: UCE
    • 1: non-UCE fault

strategy_list

-

List of repair policies supported by MindIO TFT based on the current available replica information.

The value is of the list type. The supported repair policies (string) are as follows:

  • retry: UCE repair
  • recover: ARF repair
  • dump: dying gasp
  • exit
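
A report_strategies callback typically has to pick one entry from strategy_list. The sketch below shows one way to do that. The preference order (least to most disruptive) is an assumption made for illustration, not documented behavior, and the names choose_strategy and on_report_strategies are hypothetical. How the chosen policy is sent back to MindIO TFT is outside the scope of this section and is not shown.

```python
from typing import Dict, List, Optional

# Assumed preference order, least to most disruptive (not from the source).
PREFERENCE = ("retry", "recover", "dump", "exit")


def choose_strategy(strategy_list: List[str]) -> Optional[str]:
    """Return the most preferred policy present in strategy_list, or None."""
    for name in PREFERENCE:
        if name in strategy_list:
            return name
    return None


def on_report_strategies(error_rank_dict: Dict[int, int], strategy_list: List[str]) -> None:
    """Hypothetical callback: log the offered policies and the one selected."""
    chosen = choose_strategy(strategy_list)
    print(f"offered: {strategy_list}; chosen: {chosen}")
```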

Table 4 Callback function parameters when action is set to report_result

Parameter

Mandatory/Optional

Description

Value

code

-

Action execution result.

  • 0: The repair is successful.
  • 405: The retry policy failed. The recover, dump, and exit policies remain available.
  • 406: The repair failed. Only the dump and exit policies remain available.
  • 499: The repair failed. Only the exit policy remains available.

msg

-

Message indicating repair success or failure.

String

error_rank_dict

-

Information about the faulty NPU.

<int key, int errorType> dictionary:

  • key: rank ID of the faulty NPU
  • errorType: fault type
    • 0: UCE
    • 1: non-UCE fault

curr_strategy

-

Current repair policy.

The value is of the string type. For details about the value, see strategy_list in Table 3.
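
The codes in Table 4 narrow the set of policies that can still be tried. A minimal sketch of that mapping follows, assuming the four parameters arrive positionally in table order. The names remaining_strategies and on_report_result are hypothetical.

```python
from typing import Dict, List

# Policies still available after each result code, per Table 4.
REMAINING = {
    0: [],                              # repair succeeded, nothing left to try
    405: ["recover", "dump", "exit"],   # retry failed
    406: ["dump", "exit"],              # repair failed
    499: ["exit"],                      # repair failed, exit only
}


def remaining_strategies(code: int) -> List[str]:
    """Return the policies Table 4 lists for this code; empty for other codes."""
    return list(REMAINING.get(code, []))


def on_report_result(code: int, msg: str,
                     error_rank_dict: Dict[int, int], curr_strategy: str) -> None:
    """Hypothetical callback: log the outcome and any fallback policies."""
    outcome = "succeeded" if code == 0 else "failed"
    print(f"{curr_strategy} {outcome} ({code}): {msg}")
    print(f"fallback policies: {remaining_strategies(code)}")
```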

Return Value

  • 0: API call succeeded.
  • 1: API call failed.
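
Because the API returns 0 or 1 for each call, a caller can register the four actions in a loop and collect any failures. The sketch below takes the registration function as a parameter so that it runs without MindIO TFT installed. The names register_callbacks and fake_register are hypothetical, and the failure conditions in fake_register are assumptions for local testing only.

```python
from typing import Callable, Dict, List

SUPPORTED_ACTIONS = {
    "report_fault_ranks",
    "report_stop_complete",
    "report_strategies",
    "report_result",
}


def register_callbacks(register: Callable[[str, Callable], int],
                       callbacks: Dict[str, Callable]) -> List[str]:
    """Register each callback; return the actions whose registration failed."""
    failed = []
    for action, func in callbacks.items():
        if register(action, func) != 0:  # 0: success, 1: failure
            failed.append(action)
    return failed


def fake_register(action: str, func: Callable) -> int:
    """Stand-in for tft_register_mindx_callback (assumed failure conditions)."""
    return 0 if action in SUPPORTED_ACTIONS and callable(func) else 1


if __name__ == "__main__":
    failed = register_callbacks(fake_register, {"report_result": print})
    print(f"failed registrations: {failed}")
```

In a real deployment, mindio_ttp.controller_ttp.tft_register_mindx_callback would be passed as the register argument in place of fake_register.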