Parameter Description

The parameters you need to configure depend on the fault handling mode, as listed in Table 1. Table 2 describes the meaning and setting of each parameter. In the process-level rescheduling, process-level online recovery, process-level in-place recovery, and elastic training scenarios, Ascend Operator injects different environment variables according to the recover-strategy and pod-rescheduling values configured by the user, and automatically adds the process-recover-enable=on label to the job to enable process-level recovery. Table 3 lists the injected environment variables.

**Table 1** Parameters required for fault handling
Parameter	Job-Level Rescheduling	Pod-Level Rescheduling	Process-Level Rescheduling (recover)	Process-Level In-Place Recovery (recover-in-place)	Process-Level Online Recovery (retry)	Graceful Fault Tolerance	Elastic Training (elastic-training)
hotReset	-	-	-	-	-	√	-
fault-scheduling	√	√	√	√	√	-	√
pod-rescheduling	-	√	-	-	-	-	-
process-recover-enable	-	-	√	√	√	-	√
recover-strategy	-	-	√	√	√	-	√
PROCESS_RECOVER	-	-	√	√	√	-	√
ENABLE_RESTART_FAULT_PROCESS	-	-	-	√	-	-	-
ELASTIC_PROCESS_RECOVER_ENABLE	-	-	√	√	√	-	-
--enable-high-availability (required by MindSpeed-LLM)	-	-	√	√	√	-	√
--enable-hbmfault-repair (required by MindSpeed-LLM)	-	-	-	-	√	-	-
--enable-worker-reboot (required by MindSpeed-LLM)	-	-	√	√	-	-	-
--enable-elastic-training (required by MindSpeed-LLM)	-	-	-	-	-	-	√
max_restarts	-	√	√	√	√	-	-
monitor_interval	-	√	√	√	√	-	-
fault-retry-times	√	√	√	-	-	-	√
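
Table 1 can also be read as a lookup from a fault handling mode to the parameters it requires. The following is a minimal sketch of that lookup, not part of any MindCluster tool. The mode keys `recover`, `recover-in-place`, `retry`, and `elastic-training` are the recover-strategy values from Table 2; the keys `job`, `pod`, and `graceful`, and the names `REQUIRED_IN` and `required_parameters`, are hypothetical and exist only for this illustration.

```python
# Column keys for Table 1. "recover", "recover-in-place", "retry", and
# "elastic-training" are recover-strategy values from Table 2; "job", "pod",
# and "graceful" are hypothetical short names used only in this sketch.
MODES = ("job", "pod", "recover", "recover-in-place", "retry",
         "graceful", "elastic-training")

PROCESS_LEVEL = {"recover", "recover-in-place", "retry"}
PROCESS_AND_ELASTIC = PROCESS_LEVEL | {"elastic-training"}

# One entry per row of Table 1: parameter -> modes marked with a check mark.
REQUIRED_IN = {
    "hotReset": {"graceful"},
    "fault-scheduling": {"job", "pod"} | PROCESS_AND_ELASTIC,
    "pod-rescheduling": {"pod"},
    "process-recover-enable": PROCESS_AND_ELASTIC,
    "recover-strategy": PROCESS_AND_ELASTIC,
    "PROCESS_RECOVER": PROCESS_AND_ELASTIC,
    "ENABLE_RESTART_FAULT_PROCESS": {"recover-in-place"},
    "ELASTIC_PROCESS_RECOVER_ENABLE": PROCESS_LEVEL,
    "--enable-high-availability": PROCESS_AND_ELASTIC,
    "--enable-hbmfault-repair": {"retry"},
    "--enable-worker-reboot": {"recover", "recover-in-place"},
    "--enable-elastic-training": {"elastic-training"},
    "max_restarts": {"pod"} | PROCESS_LEVEL,
    "monitor_interval": {"pod"} | PROCESS_LEVEL,
    "fault-retry-times": {"job", "pod", "recover", "elastic-training"},
}


def required_parameters(mode):
    """Return the Table 1 parameters marked as required for one mode."""
    if mode not in MODES:
        raise ValueError(f"unknown fault handling mode: {mode!r}")
    # Dict insertion order matches the row order of Table 1.
    return [name for name, modes in REQUIRED_IN.items() if mode in modes]


if __name__ == "__main__":
    for mode in MODES:
        print(f"{mode}: {', '.join(required_parameters(mode))}")
```

For example, `required_parameters("job")` yields only fault-scheduling and fault-retry-times, which matches the Job-Level Rescheduling column.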

**Table 2** Parameter description
Parameter	Location	Description
hotReset	Ascend Device Plugin startup YAML	Controls graceful fault tolerance. 1: enables hot reset, which triggers graceful fault tolerance in addition to job-level or pod-level rescheduling during resumable training. 2: enables process-level recovery, which triggers offline recovery. NOTE: The value 1 can no longer be used because the function is unavailable. Set this parameter to another value.
pod-rescheduling	metadata.labels of the training job YAML file	on: enables pod-level rescheduling. Any other value, or field omitted: disables pod-level rescheduling.
fault-scheduling	metadata.labels of the training job YAML file	Controls rescheduling.
process-recover-enable	metadata.labels of the training job YAML file	on: enables process-level rescheduling and process-level online recovery. Process-level rescheduling and graceful fault tolerance cannot be enabled at the same time; if both are enabled, training is resumed through job-level rescheduling. pause: temporarily disables process-level rescheduling and process-level online recovery. off, or field omitted: disables process-level rescheduling and process-level online recovery.
recover-strategy	metadata.annotations of the training job YAML file	Available recovery policies. retry: process-level online recovery; recover: process-level rescheduling; recover-in-place: process-level in-place recovery; elastic-training: elastic training; dump: saving the dying gasp checkpoint; exit: exiting training.
PROCESS_RECOVER	spec.replicaSpecs.{Master|Scheduler|Worker}.template.spec.containers.env of the training job YAML file	Controls process-level rescheduling and process-level online recovery on Elastic Agent/TaskD. on: enabled; off: disabled.
ELASTIC_PROCESS_RECOVER_ENABLE	spec.replicaSpecs.{Master|Scheduler|Worker}.template.spec.containers.args in the training startup YAML file	Controls process-level rescheduling, process-level online recovery, and dying gasp checkpoint recovery on Elastic Agent. 1: enabled; other values: disabled. If this is disabled, the related MindIO functions must be disabled at the same time. NOTE: Elastic Agent has reached its end of life, and its documentation will be removed on December 30, 2026. This environment variable will also be removed.
ENABLE_RESTART_FAULT_PROCESS	spec.replicaSpecs.{Master|Scheduler|Worker}.template.spec.containers.args in the training startup YAML file	Controls process-level in-place recovery on Elastic Agent/TaskD. on: enabled; other values: disabled.
--enable-high-availability	Startup parameter of pretrain_gpt.py	Controls fast fault recovery, which is disabled by default. After this function is enabled, the dying gasp function is also enabled.
--enable-hbmfault-repair	Startup parameter of pretrain_gpt.py	Controls process-level online recovery, which is disabled by default. After this function is enabled, on-chip memory faults are detected and recovered online. --enable-high-availability must be enabled at the same time.
--enable-worker-reboot	Startup parameter of pretrain_gpt.py	Controls process-level rescheduling, which is disabled by default. After this function is enabled, process-level rescheduling is performed when a general fault occurs. --enable-high-availability must be enabled at the same time.
--enable-elastic-training	Startup parameter of pretrain_gpt.py	Enables or disables elastic training. By default, this function is disabled.
max_restarts	Shell script, for example, train_start.sh, for starting training	Specifies the maximum number of faults that can be triggered in a container. The value is an integer. If the number of faults exceeds this limit, the PyTorch training process exits immediately. If this parameter is not set, the default value 32767 is used.
monitor_interval	Shell script, for example, train_start.sh, for starting training	Specifies the interval for monitoring the training process status. The value is an integer, in seconds. If this parameter is not set, the default value 5 is used.
HIGH_AVAILABILITY	Environment variable injected by Ascend Operator to the container	Ascend Operator automatically injects this environment variable based on the job type. MindSpeed-LLM 2.3.0 reads this environment variable automatically, so you do not need to manually add --enable-high-availability, --enable-hbmfault-repair, --enable-worker-reboot, or --enable-elastic-training to train_start.sh to enable the corresponding functions.
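
Table 2 states that --enable-hbmfault-repair and --enable-worker-reboot each require --enable-high-availability. When these flags are added manually, a small pre-launch check can catch a missing prerequisite. The sketch below is illustrative only: the names `PREREQUISITES` and `missing_prerequisites` are hypothetical, and it assumes the flags are passed as bare switches in the argument list for pretrain_gpt.py.

```python
# Prerequisites stated in Table 2 for the MindSpeed-LLM startup flags.
# Assumption: each flag appears as a bare switch in the argument list.
PREREQUISITES = {
    "--enable-hbmfault-repair": "--enable-high-availability",
    "--enable-worker-reboot": "--enable-high-availability",
}


def missing_prerequisites(args):
    """Return the prerequisite flags that args needs but does not contain."""
    present = set(args)
    missing = {
        required
        for flag, required in PREREQUISITES.items()
        if flag in present and required not in present
    }
    return sorted(missing)


if __name__ == "__main__":
    args = ["--enable-worker-reboot"]
    for flag in missing_prerequisites(args):
        print(f"missing prerequisite: {flag}")
```

An empty result means the flag combination satisfies the dependencies listed in Table 2. It does not check any other constraint.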

**Table 3** Environment variables injected by Ascend Operator
Framework	recover	retry	recover-in-place	elastic-training	dump	exit	pod-rescheduling
PyTorch	PROCESS_RECOVER=on; ELASTIC_PROCESS_RECOVER_ENABLE=1; HIGH_AVAILABILITY=recover	PROCESS_RECOVER=on; ELASTIC_PROCESS_RECOVER_ENABLE=1; HIGH_AVAILABILITY=retry	PROCESS_RECOVER=on; ELASTIC_PROCESS_RECOVER_ENABLE=1; ENABLE_RESTART_FAULT_PROCESS=on; HIGH_AVAILABILITY=recover	PROCESS_RECOVER=on; HIGH_AVAILABILITY=elastic-training	PROCESS_RECOVER=on; ELASTIC_PROCESS_RECOVER_ENABLE=1; HIGH_AVAILABILITY=dump	-	-
MindSpore	PROCESS_RECOVER=on; ELASTIC_PROCESS_RECOVER_ENABLE=1; MINDIO_FOR_MINDSPORE=1; MS_ENABLE_TFT='{ ARF:1}'	PROCESS_RECOVER=on; ELASTIC_PROCESS_RECOVER_ENABLE=1; MINDIO_FOR_MINDSPORE=1; MS_ENABLE_TFT='{ UCE:1, HCCE:1}'	PROCESS_RECOVER=on; ELASTIC_PROCESS_RECOVER_ENABLE=1; ENABLE_RESTART_FAULT_PROCESS=on; MINDIO_FOR_MINDSPORE=1; MS_ENABLE_TFT='{ ARF:1}'	-	PROCESS_RECOVER=on; ELASTIC_PROCESS_RECOVER_ENABLE=1; MINDIO_FOR_MINDSPORE=1; MS_ENABLE_TFT='{ TTP:1}'	-	MS_ENABLE_TFT='{ RSC:1}'
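
To check which variables a container should contain for a given configuration, Table 3 can be expressed as a nested lookup. This is a sketch of the table contents only, not of how Ascend Operator is implemented. The names `INJECTED`, `COLUMNS`, and `injected_env` are hypothetical, and the values are copied from Table 3 without the surrounding shell quotes.

```python
# Variable groups that recur across Table 3 cells.
_BASE = {"PROCESS_RECOVER": "on", "ELASTIC_PROCESS_RECOVER_ENABLE": "1"}
_IN_PLACE = {"ENABLE_RESTART_FAULT_PROCESS": "on"}
_MINDIO = {"MINDIO_FOR_MINDSPORE": "1"}

# Column headers of Table 3.
COLUMNS = ("recover", "retry", "recover-in-place", "elastic-training",
           "dump", "exit", "pod-rescheduling")

# framework -> column -> injected variables. Cells shown as "-" are omitted.
INJECTED = {
    "PyTorch": {
        "recover": {**_BASE, "HIGH_AVAILABILITY": "recover"},
        "retry": {**_BASE, "HIGH_AVAILABILITY": "retry"},
        "recover-in-place": {**_BASE, **_IN_PLACE,
                             "HIGH_AVAILABILITY": "recover"},
        "elastic-training": {"PROCESS_RECOVER": "on",
                             "HIGH_AVAILABILITY": "elastic-training"},
        "dump": {**_BASE, "HIGH_AVAILABILITY": "dump"},
    },
    "MindSpore": {
        "recover": {**_BASE, **_MINDIO, "MS_ENABLE_TFT": "{ ARF:1}"},
        "retry": {**_BASE, **_MINDIO, "MS_ENABLE_TFT": "{ UCE:1, HCCE:1}"},
        "recover-in-place": {**_BASE, **_IN_PLACE, **_MINDIO,
                             "MS_ENABLE_TFT": "{ ARF:1}"},
        "dump": {**_BASE, **_MINDIO, "MS_ENABLE_TFT": "{ TTP:1}"},
        "pod-rescheduling": {"MS_ENABLE_TFT": "{ RSC:1}"},
    },
}


def injected_env(framework, column):
    """Return a copy of the Table 3 cell; an empty dict stands for "-"."""
    if framework not in INJECTED:
        raise ValueError(f"unknown framework: {framework!r}")
    if column not in COLUMNS:
        raise ValueError(f"unknown Table 3 column: {column!r}")
    return dict(INJECTED[framework].get(column, {}))


if __name__ == "__main__":
    for name, value in injected_env("MindSpore", "retry").items():
        print(f"{name}={value}")
```

Comparing the result with the environment of a running container (for example, the output of `env`) is one way to confirm that the expected variables are present.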
