Parameter Description
The parameters that must be configured depend on the fault handling mode, as shown in Table 1. For the meaning and settings of each parameter, see Table 2. In the process-level rescheduling, process-level online recovery, process-level in-place recovery, and elastic training scenarios, Ascend Operator injects different environment variables depending on the user-configured recover-strategy and pod-rescheduling values, and automatically adds the process-recover-enable=on label to the job to enable process-level recovery. For details about these environment variables, see Table 3.
| Parameter | Job-Level Rescheduling | Pod-Level Rescheduling | Process-Level Rescheduling (recover) | Process-Level In-Place Recovery (recover-in-place) | Process-Level Online Recovery | Graceful Fault Tolerance | Elastic Training |
| --- | --- | --- | --- | --- | --- | --- | --- |
| hotReset | - | - | - | - | - | √ | - |
| fault-scheduling | √ | √ | √ | √ | √ | - | √ |
| pod-rescheduling | - | √ | - | - | - | - | - |
| process-recover-enable | - | - | √ | √ | √ | - | √ |
| recover-strategy | - | - | √ | √ | √ | - | √ |
| PROCESS_RECOVER | - | - | √ | √ | √ | - | √ |
| ENABLE_RESTART_FAULT_PROCESS | - | - | - | √ | - | - | - |
| ELASTIC_PROCESS_RECOVER_ENABLE | - | - | √ | √ | √ | - | - |
| --enable-high-availability (required by MindSpeed-LLM) | - | - | √ | √ | √ | - | √ |
| --enable-hbmfault-repair (required by MindSpeed-LLM) | - | - | - | - | √ | - | - |
| --enable-worker-reboot (required by MindSpeed-LLM) | - | - | √ | √ | - | - | - |
| --enable-elastic-training (required by MindSpeed-LLM) | - | - | - | - | - | - | √ |
| max_restarts | - | √ | √ | √ | √ | - | - |
| monitor_interval | - | √ | √ | √ | √ | - | - |
| fault-retry-times | √ | √ | √ | - | - | - | √ |

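
The parameter matrix above can also be encoded as a small lookup, which may be convenient when validating a job configuration before submission. The following is a minimal sketch, not part of any Ascend tooling: the names `MODES`, `TABLE_1`, and `required_parameters` are hypothetical, and the flags are transcribed from the table, with "1" standing for √ and "0" for -.

```python
# Hypothetical helper: column order follows the fault handling modes in the table.
MODES = (
    "job-level rescheduling",
    "pod-level rescheduling",
    "process-level rescheduling",
    "process-level in-place recovery",
    "process-level online recovery",
    "graceful fault tolerance",
    "elastic training",
)

# One entry per parameter, in table order. Each string holds one flag per
# mode in MODES order: "1" means the cell is marked √, "0" means it is "-".
TABLE_1 = {
    "hotReset": "0000010",
    "fault-scheduling": "1111101",
    "pod-rescheduling": "0100000",
    "process-recover-enable": "0011101",
    "recover-strategy": "0011101",
    "PROCESS_RECOVER": "0011101",
    "ENABLE_RESTART_FAULT_PROCESS": "0001000",
    "ELASTIC_PROCESS_RECOVER_ENABLE": "0011100",
    "--enable-high-availability": "0011101",
    "--enable-hbmfault-repair": "0000100",
    "--enable-worker-reboot": "0011000",
    "--enable-elastic-training": "0000001",
    "max_restarts": "0111100",
    "monitor_interval": "0111100",
    "fault-retry-times": "1110001",
}


def required_parameters(mode: str) -> list[str]:
    """Return the parameters marked √ for the given mode, in table order.

    Raises ValueError if the mode name is not one of MODES.
    """
    column = MODES.index(mode.lower())
    return [name for name, flags in TABLE_1.items() if flags[column] == "1"]
```

For example, `required_parameters("Graceful Fault Tolerance")` returns `["hotReset"]`, which matches the single √ in that column.
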
| Framework | recover | retry | recover-in-place | elastic-training | dump | exit | pod-rescheduling |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PyTorch | | | | - | - | | |
| MindSpore | - | - | MS_ENABLE_TFT='{ RSC:1}' |