Rescheduling Mode
Rescheduling
When a node becomes faulty, Volcano reschedules the training jobs running on that node to other nodes that meet the requirements.
Log in to the management node and run the following command to check the running statuses of the training jobs:
kubectl get pods -A -o wide
Assume that the training job pods run on node1 and node2 before the fault occurs. After node1 becomes faulty, Volcano reschedules the pods to node2 and node3. The following shows a sample command output after rescheduling:
NAMESPACE   NAME                            READY   STATUS    RESTARTS   AGE   IP                NODE    NOMINATED NODE   READINESS GATES
default     default-test-pytorch-master-0   1/1     Running   0          5s    xxx.xxx.xxx.xxx   node2   <none>           <none>
default     default-test-pytorch-worker-0   1/1     Running   0          5s    xxx.xxx.xxx.xxx   node3   <none>           <none>
...
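The placement check above can also be scripted. The following is a minimal sketch, not part of the product: it assumes the pod list is obtained with `kubectl get pods -A -o json` instead of `-o wide`, and it reads only the standard `metadata.namespace`, `metadata.name`, and `spec.nodeName` fields. The function names and the trimmed `SAMPLE_POD_LIST` (which mirrors the sample output above) are hypothetical.

```python
import json

# Trimmed, hypothetical stand-in for `kubectl get pods -A -o json`,
# mirroring the two pods in the sample output after rescheduling.
SAMPLE_POD_LIST = json.dumps({
    "items": [
        {"metadata": {"namespace": "default", "name": "default-test-pytorch-master-0"},
         "spec": {"nodeName": "node2"}},
        {"metadata": {"namespace": "default", "name": "default-test-pytorch-worker-0"},
         "spec": {"nodeName": "node3"}},
    ]
})


def pods_by_node(pod_list_json):
    """Group "namespace/name" pod identifiers by the node each pod is bound to."""
    pod_list = json.loads(pod_list_json)
    placement = {}
    for pod in pod_list.get("items", []):
        # spec.nodeName is absent while a pod has not been scheduled yet.
        node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
        meta = pod["metadata"]
        placement.setdefault(node, []).append(f"{meta['namespace']}/{meta['name']}")
    return placement


def pods_left_on(placement, faulty_node):
    """Return the pods still bound to the faulty node (empty once rescheduling is done)."""
    return placement.get(faulty_node, [])


if __name__ == "__main__":
    placement = pods_by_node(SAMPLE_POD_LIST)
    print(placement)
    print("pods still on node1:", pods_left_on(placement, "node1"))
```

On a real cluster, the JSON text would come from the kubectl command rather than from the sample constant. An empty result from `pods_left_on` for the faulty node indicates that no pod remains bound to it.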
Checking the Job Running Status on a Single Pod
Run the following command to check the running status of the training job on a single pod:
kubectl logs default-test-pytorch-worker-0 -n default -f
The sample output below shows that, after the fault occurs, training resumes at iteration 10 from the latest checkpoint file, which was saved at iteration 9.
2025-09-08 11:34:00.400331 warn 1900637 [77840][PYH tft_replica_optimizer.py:659] Replica optimizer increase Memory On Chip Usage by:0.6572 GB!
2025-09-08 11:34:00.401841 warn 1900631 [28432][PYH tft_replica_optimizer.py:659] Replica optimizer increase Memory On Chip Usage by:0.6572 GB!
2025-09-08 11:34:00.402489 warn 1900639 [10928][PYH tft_replica_optimizer.py:659] Replica optimizer increase Memory On Chip Usage by:0.6572 GB!
2025-09-08 11:34:00.426989 warn 1900627 [98608][PYH tft_replica_optimizer.py:659] Replica optimizer increase Memory On Chip Usage by:0.6572 GB!
2025-09-08 11:34:00.429141 warn 1900634 [24592][PYH tft_replica_optimizer.py:659] Replica optimizer increase Memory On Chip Usage by:0.6572 GB!
(min, max) time across ranks (ms):
    load-checkpoint ................................: (32107.12, 32108.53)
(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (32528.79, 32544.35)
    train/valid/test-data-iterators-setup ..........: (72.68, 656.79)
[rank16]:[W908 11:34:01.252908110 compiler_depend.ts:335] Warning: Cannot create tensor with interal format while allow_internel_format=False, tensor will be created with base format. (function operator())
[rank24]:[W908 11:34:01.254614170 compiler_depend.ts:335] Warning: Cannot create tensor with interal format while allow_internel_format=False, tensor will be created with base format. (function operator())
[rank17]:[W908 11:34:01.421349990 compiler_depend.ts:335] Warning: Cannot create tensor with interal format while allow_internel_format=False, tensor will be created with base format. (function operator())
[rank20]:[W908 11:34:01.431165020 compiler_depend.ts:335] Warning: Cannot create tensor with interal format while allow_internel_format=False, tensor will be created with base format. (function operator())
[rank19]:[W908 11:34:01.431240250 compiler_depend.ts:335] Warning: Cannot create tensor with interal format while allow_internel_format=False, tensor will be created with base format. (function operator())
[rank30]:[W908 11:34:01.431707980 compiler_depend.ts:335] Warning: Cannot create tensor with interal format while allow_internel_format=False, tensor will be created with base format. (function operator())
...
/root/MindSpeed/mindspeed/core/fp8_utils.py:11: UserWarning: Currently, it is not supported to Cast shard fp32 main params to fp8 model params
  warnings.warn("Currently, it is not supported to Cast shard fp32 main params to fp8 model params")
/root/MindSpeed/mindspeed/core/fp8_utils.py:11: UserWarning: Currently, it is not supported to Cast shard fp32 main params to fp8 model params
  warnings.warn("Currently, it is not supported to Cast shard fp32 main params to fp8 model params")
[2025-09-08 11:37:00] iteration 10/ 5000 | consumed samples: 640 | elapsed time per iteration (ms): 6932.5 | learning rate: 2.500000E-07 | global batch size: 64 | lm loss: 1.053084E+01 | loss scale: 1.0 | grad norm: 56.739 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-09-08 11:37:03] iteration 11/ 5000 | consumed samples: 704 | elapsed time per iteration (ms): 1981.0 | learning rate: 2.750000E-07 | global batch size: 64 | lm loss: 1.044677E+01 | loss scale: 1.0 | grad norm: 57.590 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
...
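To confirm the resume point without reading the whole log, the iteration lines can be extracted programmatically. The sketch below is illustrative only: it assumes the progress lines keep the `iteration N/ TOTAL |` shape seen in the sample output, and the function names and `SAMPLE_LOG` constant are hypothetical.

```python
import re

# Matches progress lines such as "iteration 10/ 5000 | consumed samples: ...".
# It does not match "elapsed time per iteration (ms)" or "skipped iterations",
# because those are not followed by "<number>/<number> |".
ITERATION_RE = re.compile(r"iteration\s+(\d+)\s*/\s*(\d+)\s*\|")

# Two progress lines shortened from the sample output.
SAMPLE_LOG = """\
(min, max) time across ranks (ms):
    load-checkpoint ................................: (32107.12, 32108.53)
[2025-09-08 11:37:00] iteration 10/ 5000 | consumed samples: 640 | elapsed time per iteration (ms): 6932.5 | number of skipped iterations: 0 |
[2025-09-08 11:37:03] iteration 11/ 5000 | consumed samples: 704 | elapsed time per iteration (ms): 1981.0 | number of skipped iterations: 0 |
"""


def logged_iterations(log_text):
    """Return the iteration numbers found in the log, in order of appearance."""
    return [int(match.group(1)) for match in ITERATION_RE.finditer(log_text)]


def resumed_at(log_text):
    """Return the first iteration logged after startup, or None if there is none."""
    iterations = logged_iterations(log_text)
    return iterations[0] if iterations else None


if __name__ == "__main__":
    print("first logged iteration:", resumed_at(SAMPLE_LOG))
```

If the first logged iteration after the restart is one greater than the iteration of the latest saved checkpoint, training resumed from that checkpoint rather than from scratch. The log text could be captured from the kubectl logs command shown earlier, without the `-f` flag so that the command exits.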
Viewing Task Rescheduling Records
Run the following command to view the task rescheduling records:
kubectl describe cm -n mindx-dl job-reschedule-reason
Command output:
Name:         job-reschedule-reason
Namespace:    mindx-dl
Labels:       <none>
Annotations:  <none>
Data
====
recent-reschedule-records:
----
{"default/default-test-pytorch-141274b7-ce93-4d31-adde-6c24456a8a3b":{"JobID":"default/default-test-pytorch-141274b7-ce93-4d31-adde-6c24456a8a3b","TotalRescheduleTimes":1,"RescheduleRecords":[{"LogFileFormatTime":"I0908 11:36:10","RescheduleTimeStamp":1759683370,"ReasonOfTask":[{"RescheduleReason":"pod-failed","PodName":"default-test-pytorch-worker-0","NodeName":"node2","NodeRankIndex":"1"}]}]}}
Events:  <none>
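The value of `recent-reschedule-records` is a JSON string, so it can be summarized with a short script. The following sketch is an illustration under stated assumptions: the field names (`TotalRescheduleTimes`, `RescheduleRecords`, `ReasonOfTask`, and so on) are taken only from the sample output above and may differ in other versions, and the function name and `SAMPLE_RECORDS` constant are hypothetical.

```python
import json

# The JSON value copied from the sample command output.
SAMPLE_RECORDS = (
    '{"default/default-test-pytorch-141274b7-ce93-4d31-adde-6c24456a8a3b":'
    '{"JobID":"default/default-test-pytorch-141274b7-ce93-4d31-adde-6c24456a8a3b",'
    '"TotalRescheduleTimes":1,"RescheduleRecords":[{"LogFileFormatTime":"I0908 11:36:10",'
    '"RescheduleTimeStamp":1759683370,"ReasonOfTask":[{"RescheduleReason":"pod-failed",'
    '"PodName":"default-test-pytorch-worker-0","NodeName":"node2","NodeRankIndex":"1"}]}]}}'
)


def summarize_reschedules(records_json):
    """Flatten the records into one dict per affected pod."""
    records = json.loads(records_json)
    summary = []
    for job_id, job in records.items():
        for record in job.get("RescheduleRecords", []):
            for task in record.get("ReasonOfTask", []):
                summary.append({
                    "job": job_id,
                    "total_reschedules": job.get("TotalRescheduleTimes"),
                    "logged_at": record.get("LogFileFormatTime"),
                    "pod": task.get("PodName"),
                    "node": task.get("NodeName"),
                    "reason": task.get("RescheduleReason"),
                })
    return summary


if __name__ == "__main__":
    for entry in summarize_reschedules(SAMPLE_RECORDS):
        print(entry)
```

On a cluster, the JSON string would be read from the `data` field of the ConfigMap, for example from the output of `kubectl get cm job-reschedule-reason -n mindx-dl -o json`, instead of from the sample constant.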