When a node fails, Volcano reschedules the training job onto another node that satisfies its requirements so that training can continue.
~# kubectl get pods -A -o wide
NAMESPACE   NAME                            READY   STATUS              RESTARTS   AGE   IP       NODE    NOMINATED NODE   READINESS GATES
……
vcjob       mindx-dls-test-default-test-0   0/1     ContainerCreating   0          7s    <none>   node1   <none>           <none>
vcjob       mindx-dls-test-default-test-1   0/1     ContainerCreating   0          7s    <none>   node2   <none>           <none>
……
~# kubectl logs mindx-dls-test-default-test-0 -n vcjob -f
...
time stamp 2023.03.22-15:46:08 pre trained ckpt model /job/code/output/./checkpoint/ckpt_0/resnet-25_24.ckpt loading
[WARNING] ME(587:140550222669632,MainProcess):2023-03-22-15:46:10.751.140 [mindspore/train/model.py:1095] For LossCallBack callback, {'step_end'} methods may not be supported in later version, Use methods prefixed with 'on_train' or 'on_eval' instead when using customized callbacks.
[WARNING] MD(587,7fd02affd700,python):2023-03-22-15:48:23.424.793 [mindspore/ccsrc/minddata/dataset/engine/datasetops/data_queue_op.cc:896] DetectPerBatchTime] Bad performance attention, it takes more than 25 seconds to fetch a batch of data from dataset pipeline, which might result `GetNext` timeout problem. You may test dataset processing performance(with creating dataset iterator) and optimize it.
epoch: 26 step: 24, loss is 4.916446
Train epoch time: 136417.633 ms, per step time: 5684.068 ms
epoch: 27 step: 24, loss is 5.306696
Train epoch time: 5546.347 ms, per step time: 231.098 ms
epoch: 28 step: 24, loss is 5.1335387
Train epoch time: 4439.452 ms, per step time: 184.977 ms
epoch: 29 step: 24, loss is 4.938741
Train epoch time: 5313.017 ms, per step time: 221.376 ms
epoch: 30 step: 24, loss is 5.128438
Train epoch time: 8922.200 ms, per step time: 371.758 ms
...
...
time stamp 2023.03.22-19:40:54 pre trained ckpt model /job/code/output/./checkpoint/ckpt_0/resnet_1-12_24_breakpoint.ckpt loading
[WARNING] MD(858,7f2e467fc700,python):2023-03-22-19:43:11.947.889 [mindspore/ccsrc/minddata/dataset/engine/datasetops/data_queue_op.cc:896] DetectPerBatchTime] Bad performance attention, it takes more than 25 seconds to fetch a batch of data from dataset pipeline, which might result `GetNext` timeout problem. You may test dataset processing performance(with creating dataset iterator) and optimize it.
epoch: 13 step: 24, loss is 6.4576316
Train epoch time: 141532.140 ms, per step time: 5897.172 ms
epoch: 14 step: 24, loss is 6.228643
Train epoch time: 1431.532 ms, per step time: 59.647 ms
epoch: 15 step: 24, loss is 6.267328
Train epoch time: 2965.660 ms, per step time: 123.569 ms
...
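The logs above show the pattern that makes resumption work: on each start the training script loads the newest checkpoint under the job's checkpoint directory and continues from the following epoch (resnet-25_24.ckpt resumes at epoch 26; resnet_1-12_24_breakpoint.ckpt resumes at epoch 13). The following is a minimal sketch of such resume logic, assuming standard MindSpore checkpoint APIs; the helper name resume_from_latest_ckpt, the default directory, and the file-name parsing convention are illustrative assumptions, not the exact code shipped with the sample.

# Sketch only: restore the newest checkpoint, if any, before training starts.
import os
import glob
import mindspore as ms

def resume_from_latest_ckpt(net, ckpt_dir="/job/code/output/checkpoint/ckpt_0"):
    """Load the newest *.ckpt in ckpt_dir into `net` and return the epoch to resume from.

    Returns 0 when no checkpoint exists (first run, train from scratch).
    """
    ckpts = glob.glob(os.path.join(ckpt_dir, "*.ckpt"))
    if not ckpts:
        return 0
    latest = max(ckpts, key=os.path.getmtime)     # most recently written checkpoint
    param_dict = ms.load_checkpoint(latest)       # read parameters from disk
    ms.load_param_into_net(net, param_dict)       # restore them into the network
    # Assumed file-name convention: <prefix>-<epoch>_<step>[...].ckpt,
    # e.g. resnet-25_24.ckpt -> last completed epoch 25, so resume at 26.
    name = os.path.basename(latest)
    try:
        last_epoch = int(name.split("-")[-1].split("_")[0])
    except ValueError:
        last_epoch = 0
    return last_epoch + 1

In the faulted-and-rescheduled run, the same logic picks up the breakpoint checkpoint written before eviction, which is why the second log starts again at epoch 13 rather than epoch 1.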