Parameter Description

The parameters to configure depend on the fault handling mode, as shown in Table 1. For the meaning and setting of each parameter, see Table 2. In the process-level rescheduling, process-level online recovery, process-level in-place recovery, and elastic training scenarios, Ascend Operator injects environment variables according to the recover-strategy and pod-rescheduling values configured by the user, and automatically adds the process-recover-enable=on label to the job to enable process-level recovery. For details about the injected environment variables, see Table 3.
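As a minimal sketch of the user-side configuration described above, the following Python helper assembles the metadata fragment of a training job. The function name build_job_metadata and the returned dictionary layout are assumptions made for illustration; only the key names (pod-rescheduling under labels, recover-strategy under annotations) and the allowed values come from Table 2.

```python
# Allowed recover-strategy values, as listed in Table 2.
RECOVER_STRATEGIES = {
    "retry", "recover", "recover-in-place",
    "elastic-training", "dump", "exit",
}


def build_job_metadata(recover_strategy, pod_rescheduling=False):
    """Return a metadata fragment for a training job (hypothetical helper)."""
    if recover_strategy not in RECOVER_STRATEGIES:
        raise ValueError(f"unknown recover-strategy: {recover_strategy!r}")

    labels = {}
    if pod_rescheduling:
        # Table 2: "on" enables pod-level rescheduling; omitting the
        # label leaves it disabled.
        labels["pod-rescheduling"] = "on"

    # process-recover-enable=on is deliberately not set here: the text
    # above states that Ascend Operator adds that label automatically.
    return {
        "labels": labels,
        "annotations": {"recover-strategy": recover_strategy},
    }
```

Calling build_job_metadata("recover", pod_rescheduling=True) yields one label and one annotation; the result would still need to be merged into a complete job YAML file.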

Table 1 Parameters required for fault handling

-

Job-Level Rescheduling

Pod-Level Rescheduling

Process-Level Rescheduling (recover)

Process-Level In-Place Recovery (recover-in-place)

Process-Level Online Recovery

Graceful Fault Tolerance

Elastic Training

hotReset

-

-

-

-

-

-

fault-scheduling

-

pod-rescheduling

-

-

-

-

-

-

process-recover-enable

-

-

-

recover-strategy

-

-

-

PROCESS_RECOVER

-

-

-

ENABLE_RESTART_FAULT_PROCESS

-

-

-

-

-

-

ELASTIC_PROCESS_RECOVER_ENABLE

-

-

-

-

--enable-high-availability (required by MindSpeed-LLM)

-

-

-

--enable-hbmfault-repair (required by MindSpeed-LLM)

-

-

-

-

-

-

--enable-worker-reboot (required by MindSpeed-LLM)

-

-

-

-

-

--enable-elastic-training (required by MindSpeed-LLM)

-

-

-

-

-

-

max_restarts

-

-

-

monitor_interval

-

-

-

fault-retry-times

-

-

-

Table 2 Parameter description

Parameter

Location

Description

hotReset

Ascend Device Plugin startup YAML

Controls graceful fault tolerance.

  • 1: Hot reset is enabled to trigger graceful fault tolerance in addition to job-level or pod-level rescheduling during resumable training.
  • 2: Process-level recovery is enabled to trigger offline recovery.

NOTE:

The value 1 can no longer be used because the corresponding function is no longer available. Set this parameter to another value.

pod-rescheduling

metadata.labels of the training job YAML file

  • on: Enable pod-level rescheduling.
  • Any other value, or field omitted: Disable pod-level rescheduling.

fault-scheduling

metadata.labels of the training job YAML file

Controls rescheduling.

process-recover-enable

metadata.labels of the training job YAML file

  • on: Enable process-level rescheduling and process-level online recovery.

    Process-level rescheduling and graceful fault tolerance cannot be enabled at the same time. If both of them are enabled, training is resumed through job-level rescheduling.

  • pause: Temporarily disable process-level rescheduling and process-level online recovery.
  • off, or field omitted: Disable process-level rescheduling and process-level online recovery.

recover-strategy

metadata.annotations of the training job YAML file

Available recovery strategies.

  • retry: process-level online recovery
  • recover: process-level rescheduling
  • recover-in-place: process-level in-place recovery
  • elastic-training: elastic training
  • dump: saving the dying gasp checkpoint
  • exit: exiting training

PROCESS_RECOVER

spec.replicaSpecs.{Master|Scheduler|Worker}.template.spec.containers.env of the training job YAML file

Controls process-level rescheduling and process-level online recovery on Elastic Agent/TaskD.

  • on: enabled
  • off: disabled

ELASTIC_PROCESS_RECOVER_ENABLE

spec.replicaSpecs.{Master|Scheduler|Worker}.template.spec.containers.args in the training startup YAML file

Controls process-level rescheduling, process-level online recovery, and dying gasp checkpoint recovery on Elastic Agent.

  • 1: enabled
  • Other values: disabled

    If this function is disabled, the related MindIO functions must also be disabled.

NOTE:

Elastic Agent has reached end of life, and its documentation will be removed on December 30, 2026. This environment variable will also be removed.

ENABLE_RESTART_FAULT_PROCESS

spec.replicaSpecs.{Master|Scheduler|Worker}.template.spec.containers.args in the training startup YAML file

Controls process-level in-place recovery on Elastic Agent/TaskD.

  • on: enabled
  • Other values: disabled

--enable-high-availability

Startup parameter of pretrain_gpt.py

Controls fast fault recovery, which is disabled by default. After this function is enabled, the dying gasp function is also enabled.

--enable-hbmfault-repair

Startup parameter of pretrain_gpt.py

Controls process-level online recovery, which is disabled by default. When this function is enabled, on-chip memory faults are detected and recovered online. --enable-high-availability must also be enabled.

--enable-worker-reboot

Startup parameter of pretrain_gpt.py

Controls process-level rescheduling, which is disabled by default. When this function is enabled, process-level rescheduling is performed when a general fault occurs. --enable-high-availability must also be enabled.

--enable-elastic-training

Startup parameter of pretrain_gpt.py

Enables or disables elastic training. By default, this function is disabled.

max_restarts

Shell script, for example, train_start.sh, for starting training

Specifies the maximum number of faults allowed in a container. The value is an integer. If the number of faults exceeds this limit, the PyTorch training process exits directly. If this parameter is not set, the default value 32767 is used.

monitor_interval

Shell script, for example, train_start.sh, for starting training

Specifies the interval for monitoring the training process status. The value is an integer, in seconds. If this parameter is not set, the default value 5 is used.

HIGH_AVAILABILITY

Environment variable injected by Ascend Operator to the container

Ascend Operator automatically injects this environment variable based on the job type. MindSpeed-LLM 2.3.0 reads this environment variable automatically, so you do not need to manually add --enable-high-availability, --enable-hbmfault-repair, --enable-worker-reboot, or --enable-elastic-training to train_start.sh to enable the corresponding functions.
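When the pretrain_gpt.py startup parameters are added manually, Table 2 states that --enable-hbmfault-repair and --enable-worker-reboot each require --enable-high-availability. The following Python sketch checks an argument list for that dependency before training is launched. The helper name missing_high_availability is an assumption made for illustration and is not part of MindSpeed-LLM.

```python
from typing import List

# Flags that, per Table 2, must be combined with --enable-high-availability.
REQUIRES_HIGH_AVAILABILITY = (
    "--enable-hbmfault-repair",
    "--enable-worker-reboot",
)


def missing_high_availability(args: List[str]) -> List[str]:
    """Return the flags in args that lack --enable-high-availability.

    Hypothetical pre-launch check; an empty list means the documented
    dependency is satisfied.
    """
    present = set(args)
    if "--enable-high-availability" in present:
        return []
    return [flag for flag in REQUIRES_HIGH_AVAILABILITY if flag in present]
```

A launcher script could refuse to start when the returned list is not empty, for example when only --enable-worker-reboot was passed.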

Table 3 Environment variables injected by Ascend Operator

PyTorch

  • recover: PROCESS_RECOVER=on, ELASTIC_PROCESS_RECOVER_ENABLE=1, HIGH_AVAILABILITY=recover
  • retry: PROCESS_RECOVER=on, ELASTIC_PROCESS_RECOVER_ENABLE=1, HIGH_AVAILABILITY=retry
  • recover-in-place: PROCESS_RECOVER=on, ELASTIC_PROCESS_RECOVER_ENABLE=1, ENABLE_RESTART_FAULT_PROCESS=on, HIGH_AVAILABILITY=recover
  • elastic-training: PROCESS_RECOVER=on, HIGH_AVAILABILITY=elastic-training
  • dump: PROCESS_RECOVER=on, ELASTIC_PROCESS_RECOVER_ENABLE=1, HIGH_AVAILABILITY=dump
  • exit: -
  • pod-rescheduling: -

MindSpore

  • recover: PROCESS_RECOVER=on, ELASTIC_PROCESS_RECOVER_ENABLE=1, MINDIO_FOR_MINDSPORE=1, MS_ENABLE_TFT='{ ARF:1}'
  • retry: PROCESS_RECOVER=on, ELASTIC_PROCESS_RECOVER_ENABLE=1, MINDIO_FOR_MINDSPORE=1, MS_ENABLE_TFT='{ UCE:1, HCCE:1}'
  • recover-in-place: PROCESS_RECOVER=on, ELASTIC_PROCESS_RECOVER_ENABLE=1, ENABLE_RESTART_FAULT_PROCESS=on, MINDIO_FOR_MINDSPORE=1, MS_ENABLE_TFT='{ ARF:1}'
  • elastic-training: -
  • dump: PROCESS_RECOVER=on, ELASTIC_PROCESS_RECOVER_ENABLE=1, MINDIO_FOR_MINDSPORE=1, MS_ENABLE_TFT='{ TTP:1}'
  • exit: -
  • pod-rescheduling: MS_ENABLE_TFT='{ RSC:1}'
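For reference, the content of Table 3 can be expressed as a lookup structure, which may help when checking which variables to expect inside a container. This is only a sketch: the names INJECTED_ENV and injected_env are assumptions made for illustration, the values are copied as printed in Table 3 (including the quoting of MS_ENABLE_TFT), and cells shown as "-" are represented by an empty dictionary. It does not reproduce the actual logic of Ascend Operator.

```python
# Table 3 as data: framework -> table column -> environment variables.
_BASE = {"PROCESS_RECOVER": "on", "ELASTIC_PROCESS_RECOVER_ENABLE": "1"}
_MS = {**_BASE, "MINDIO_FOR_MINDSPORE": "1"}

INJECTED_ENV = {
    "PyTorch": {
        "recover": {**_BASE, "HIGH_AVAILABILITY": "recover"},
        "retry": {**_BASE, "HIGH_AVAILABILITY": "retry"},
        "recover-in-place": {
            **_BASE,
            "ENABLE_RESTART_FAULT_PROCESS": "on",
            "HIGH_AVAILABILITY": "recover",
        },
        "elastic-training": {
            "PROCESS_RECOVER": "on",
            "HIGH_AVAILABILITY": "elastic-training",
        },
        "dump": {**_BASE, "HIGH_AVAILABILITY": "dump"},
        "exit": {},
        "pod-rescheduling": {},
    },
    "MindSpore": {
        "recover": {**_MS, "MS_ENABLE_TFT": "'{ ARF:1}'"},
        "retry": {**_MS, "MS_ENABLE_TFT": "'{ UCE:1, HCCE:1}'"},
        "recover-in-place": {
            **_MS,
            "ENABLE_RESTART_FAULT_PROCESS": "on",
            "MS_ENABLE_TFT": "'{ ARF:1}'",
        },
        "elastic-training": {},
        "dump": {**_MS, "MS_ENABLE_TFT": "'{ TTP:1}'"},
        "exit": {},
        "pod-rescheduling": {"MS_ENABLE_TFT": "'{ RSC:1}'"},
    },
}


def injected_env(framework, column):
    """Return a copy of the variables Table 3 lists for one cell."""
    return dict(INJECTED_ENV[framework][column])
```

For example, injected_env("PyTorch", "elastic-training") returns only PROCESS_RECOVER and HIGH_AVAILABILITY, matching the corresponding cell of the table.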