tft_notify_controller_stop_train

Function

MindCluster calls this API to instruct MindIO TFT to proactively stop training and to pass it information about the faulty NPUs.

Format

mindio_ttp.controller_ttp.tft_notify_controller_stop_train(fault_ranks: dict, stop_type: str = "stop", timeout: int = None)

Parameters

Parameter: fault_ranks
Mandatory/Optional: Mandatory
Description: Information about the faulty NPUs.
Value: <int key, int errorType> dictionary:

  • key: rank ID of the faulty NPU
  • errorType: fault type
    • 0: UCE
    • 1: non-UCE fault

Parameter: stop_type
Mandatory/Optional: Optional
Description: Mode of stopping training.
Value: A string that is one of the following (the default is "stop"):

  • stop: taskabort mode
  • pause: non-taskabort mode

Parameter: timeout
Mandatory/Optional: Optional
Description: Timeout interval for MindCluster to issue a notification after training is paused.
Value: 0 or a positive integer (the default is None)
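The value constraints above can be checked before the API is called. The following is a minimal sketch, not part of MindIO TFT: the helper name check_stop_train_args and the constants UCE and NON_UCE are hypothetical, and the checks cover only what the parameter descriptions state. Whether an empty dictionary is accepted is not documented, so the sketch does not reject it.

```python
# Hypothetical names for the two documented errorType values.
UCE = 0       # errorType 0: UCE
NON_UCE = 1   # errorType 1: non-UCE fault

VALID_STOP_TYPES = ("stop", "pause")


def _is_plain_int(value):
    """True for int values, excluding bool (bool is a subclass of int)."""
    return isinstance(value, int) and not isinstance(value, bool)


def check_stop_train_args(fault_ranks, stop_type="stop", timeout=None):
    """Raise TypeError or ValueError if an argument falls outside the
    documented value range. Returns None when everything is acceptable."""
    if not isinstance(fault_ranks, dict):
        raise TypeError("fault_ranks must be a dict of {rank ID: errorType}")
    for rank, error_type in fault_ranks.items():
        if not _is_plain_int(rank):
            raise TypeError(f"rank ID must be an int, got {rank!r}")
        if not _is_plain_int(error_type) or error_type not in (UCE, NON_UCE):
            raise ValueError(f"errorType must be 0 or 1, got {error_type!r}")
    if stop_type not in VALID_STOP_TYPES:
        raise ValueError(f"stop_type must be 'stop' or 'pause', got {stop_type!r}")
    if timeout is not None:
        if not _is_plain_int(timeout) or timeout < 0:
            raise ValueError("timeout must be 0 or a positive integer")


# Rank 3 reports a UCE fault, rank 7 a non-UCE fault; this passes silently.
check_stop_train_args({3: UCE, 7: NON_UCE}, stop_type="pause", timeout=0)
```

Excluding bool is a design choice of this sketch: True and False compare equal to 1 and 0 in Python, so they would otherwise slip through as valid error types.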

Return Value

  • 0: API call succeeded.
  • 1: API call failed.
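A caller would typically translate the return code into a success flag. The following is a minimal sketch under stated assumptions: request_stop and fake_notify are hypothetical names, and the notification function is passed in as an argument so that the sketch runs without MindIO TFT installed. On a real node, the function passed in would be tft_notify_controller_stop_train, imported from mindio_ttp.controller_ttp as shown in the Format section.

```python
def request_stop(notify, fault_ranks, stop_type="stop", timeout=None):
    """Call the notification function and map the documented return
    code (0: succeeded, 1: failed) to a bool."""
    ret = notify(fault_ranks=fault_ranks, stop_type=stop_type, timeout=timeout)
    return ret == 0


# Stand-in with the documented signature, used here only so the sketch
# can run anywhere. It is NOT the behavior of the real API.
def fake_notify(fault_ranks, stop_type="stop", timeout=None):
    return 0 if fault_ranks else 1


# Rank 3 has a UCE fault (0), rank 7 a non-UCE fault (1).
ok = request_stop(fake_notify, {3: 0, 7: 1}, stop_type="pause", timeout=30)
print("stop request accepted" if ok else "stop request failed")

# With MindIO TFT available, the real function would be passed instead:
#   from mindio_ttp.controller_ttp import tft_notify_controller_stop_train
#   ok = request_stop(tft_notify_controller_stop_train, {3: 0, 7: 1},
#                     stop_type="pause", timeout=30)
```

The timeout value 30 is illustrative only; the unit of timeout is not stated in this section.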