Rescheduling Mode
Rescheduling Behavior
When a node fails, Volcano reschedules the training job to other nodes that meet the scheduling requirements, and training continues there.
Log in to the management node and run the following command to check the status of the training job.
kubectl get pods -A -o wide
Suppose that before the fault the training job was scheduled on node1 and node2. When a fault occurs on node1, Volcano reschedules the training job pods from node1 and node2 to node2 and node3. An example of the output after rescheduling is as follows.
NAMESPACE   NAME                            READY   STATUS    RESTARTS   AGE   IP                NODE    NOMINATED NODE   READINESS GATES
default     default-test-pytorch-master-0   1/1     Running   0          5s    xxx.xxx.xxx.xxx   node2   <none>           <none>
default     default-test-pytorch-worker-0   1/1     Running   0          5s    xxx.xxx.xxx.xxx   node3   <none>           <none>
……
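To confirm where each pod landed after rescheduling, the NODE column of the listing can be extracted programmatically instead of read by eye. A minimal sketch, assuming `kubectl get pods -A -o wide` output in the shape shown above; `pod_to_node` is an illustrative helper, not part of any MindX DL or Volcano tooling:

```python
def pod_to_node(kubectl_wide_output):
    """Map pod name -> node from `kubectl get pods -A -o wide` output."""
    mapping = {}
    for line in kubectl_wide_output.strip().splitlines()[1:]:  # skip header row
        cols = line.split()
        if len(cols) >= 8:
            mapping[cols[1]] = cols[7]  # NAME and NODE columns
    return mapping

# Sample abbreviated from the rescheduled-job listing above.
sample = """NAMESPACE  NAME                           READY  STATUS   RESTARTS  AGE  IP               NODE   NOMINATED NODE  READINESS GATES
default    default-test-pytorch-master-0  1/1    Running  0         5s   xxx.xxx.xxx.xxx  node2  <none>          <none>
default    default-test-pytorch-worker-0  1/1    Running  0         5s   xxx.xxx.xxx.xxx  node3  <none>          <none>"""

print(pod_to_node(sample))
```

In practice the output would be piped in, for example via `subprocess.run(["kubectl", "get", "pods", "-A", "-o", "wide"], capture_output=True, text=True).stdout`.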
Viewing the Status of a Single Pod
Run the following command to view the training task status of a single pod.
kubectl logs default-test-pytorch-worker-0 -n default -f
The following output indicates that after the fault, the job was restored from the most recently saved checkpoint (step 9) and training resumed from iteration 10.
2025-09-08 11:34:00.400331 warn 1900637 [77840][PYH tft_replica_optimizer.py:659] Replica optimizer increase Memory On Chip Usage by:0.6572 GB!
2025-09-08 11:34:00.401841 warn 1900631 [28432][PYH tft_replica_optimizer.py:659] Replica optimizer increase Memory On Chip Usage by:0.6572 GB!
2025-09-08 11:34:00.402489 warn 1900639 [10928][PYH tft_replica_optimizer.py:659] Replica optimizer increase Memory On Chip Usage by:0.6572 GB!
2025-09-08 11:34:00.426989 warn 1900627 [98608][PYH tft_replica_optimizer.py:659] Replica optimizer increase Memory On Chip Usage by:0.6572 GB!
2025-09-08 11:34:00.429141 warn 1900634 [24592][PYH tft_replica_optimizer.py:659] Replica optimizer increase Memory On Chip Usage by:0.6572 GB!
(min, max) time across ranks (ms):
    load-checkpoint ................................: (32107.12, 32108.53)
(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (32528.79, 32544.35)
    train/valid/test-data-iterators-setup ..........: (72.68, 656.79)
[rank16]:[W908 11:34:01.252908110 compiler_depend.ts:335] Warning: Cannot create tensor with interal format while allow_internel_format=False, tensor will be created with base format. (function operator())
[rank24]:[W908 11:34:01.254614170 compiler_depend.ts:335] Warning: Cannot create tensor with interal format while allow_internel_format=False, tensor will be created with base format. (function operator())
[rank17]:[W908 11:34:01.421349990 compiler_depend.ts:335] Warning: Cannot create tensor with interal format while allow_internel_format=False, tensor will be created with base format. (function operator())
[rank20]:[W908 11:34:01.431165020 compiler_depend.ts:335] Warning: Cannot create tensor with interal format while allow_internel_format=False, tensor will be created with base format. (function operator())
[rank19]:[W908 11:34:01.431240250 compiler_depend.ts:335] Warning: Cannot create tensor with interal format while allow_internel_format=False, tensor will be created with base format. (function operator())
[rank30]:[W908 11:34:01.431707980 compiler_depend.ts:335] Warning: Cannot create tensor with interal format while allow_internel_format=False, tensor will be created with base format. (function operator())
...
/root/MindSpeed/mindspeed/core/fp8_utils.py:11: UserWarning: Currently, it is not supported to Cast shard fp32 main params to fp8 model params
  warnings.warn("Currently, it is not supported to Cast shard fp32 main params to fp8 model params")
/root/MindSpeed/mindspeed/core/fp8_utils.py:11: UserWarning: Currently, it is not supported to Cast shard fp32 main params to fp8 model params
  warnings.warn("Currently, it is not supported to Cast shard fp32 main params to fp8 model params")
[2025-09-08 11:37:00] iteration       10/    5000 | consumed samples: 640 | elapsed time per iteration (ms): 6932.5 | learning rate: 2.500000E-07 | global batch size: 64 | lm loss: 1.053084E+01 | loss scale: 1.0 | grad norm: 56.739 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-09-08 11:37:03] iteration       11/    5000 | consumed samples: 704 | elapsed time per iteration (ms): 1981.0 | learning rate: 2.750000E-07 | global batch size: 64 | lm loss: 1.044677E+01 | loss scale: 1.0 | grad norm: 57.590 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
...
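When verifying a resume, the key signal in the log is the first iteration number printed after the checkpoint load. A hedged sketch that pulls iteration numbers out of Megatron-style log lines with a regular expression; the sample lines are taken from the output above, and the helper name is illustrative:

```python
import re

# Matches Megatron-style "iteration       10/    5000" fields.
ITER_RE = re.compile(r"iteration\s+(\d+)/\s*\d+")

def resumed_iteration(log_text):
    """Return the first iteration number found in the log, or None."""
    m = ITER_RE.search(log_text)
    return int(m.group(1)) if m else None

log = (
    "[2025-09-08 11:37:00] iteration       10/    5000 | consumed samples: 640 |\n"
    "[2025-09-08 11:37:03] iteration       11/    5000 | consumed samples: 704 |\n"
)
print(resumed_iteration(log))  # → 10
```

If the job saved its last checkpoint at step 9, a first logged iteration of 10 confirms that no completed step was lost.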
Viewing Job Reschedule Records
Run the following command to view the job reschedule records.
kubectl describe cm -n mindx-dl job-reschedule-reason
An example of the output is as follows.
Name:         job-reschedule-reason
Namespace:    mindx-dl
Labels:       <none>
Annotations:  <none>

Data
====
recent-reschedule-records:
----
{"default/default-test-pytorch-141274b7-ce93-4d31-adde-6c24456a8a3b":{"JobID":"default/default-test-pytorch-141274b7-ce93-4d31-adde-6c24456a8a3b","TotalRescheduleTimes":1,"RescheduleRecords":[{"LogFileFormatTime":"I0908 11:36:10","RescheduleTimeStamp":1759683370,"ReasonOfTask":[{"RescheduleReason":"pod-failed","PodName":"default-test-pytorch-worker-0","NodeName":"node2","NodeRankIndex":"1"}]}]}}

Events:  <none>
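The `recent-reschedule-records` value is a single JSON string, so it can be parsed to answer questions such as why a job was rescheduled and from which node. A minimal sketch, assuming the record shape shown in the sample output above (the job ID and field names are copied from that sample; the shape is not otherwise guaranteed here):

```python
import json

# One record per job ID, abbreviated from the ConfigMap output above.
raw = '{"default/default-test-pytorch-141274b7-ce93-4d31-adde-6c24456a8a3b":{"JobID":"default/default-test-pytorch-141274b7-ce93-4d31-adde-6c24456a8a3b","TotalRescheduleTimes":1,"RescheduleRecords":[{"LogFileFormatTime":"I0908 11:36:10","RescheduleTimeStamp":1759683370,"ReasonOfTask":[{"RescheduleReason":"pod-failed","PodName":"default-test-pytorch-worker-0","NodeName":"node2","NodeRankIndex":"1"}]}]}}'

records = json.loads(raw)
for job_id, rec in records.items():
    print(job_id, "rescheduled", rec["TotalRescheduleTimes"], "time(s)")
    for event in rec["RescheduleRecords"]:
        for task in event["ReasonOfTask"]:
            # e.g. pod-failed / default-test-pytorch-worker-0 / node2
            print(" ", task["RescheduleReason"], task["PodName"], task["NodeName"])
```

In practice, `raw` would come from the ConfigMap itself, for example `kubectl get cm job-reschedule-reason -n mindx-dl -o jsonpath='{.data.recent-reschedule-records}'`.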
Parent topic: Viewing Training Results