When a node fails, Volcano reschedules the training job onto another node that satisfies its requirements so that training can continue.
~# kubectl get pods -A -o wide
NAMESPACE   NAME                            READY   STATUS              RESTARTS   AGE   IP       NODE    NOMINATED NODE   READINESS GATES
……
vcjob       mindx-dls-test-default-test-0   0/1     ContainerCreating   0          7s    <none>   node1   <none>           <none>
vcjob       mindx-dls-test-default-test-1   0/1     ContainerCreating   0          7s    <none>   node2   <none>           <none>
……
~# kubectl logs mindx-dls-test-default-test-0 -n vcjob -f
...
time stamp 2023.03.22-15:46:08 pre trained ckpt model /job/code/output/./checkpoint/ckpt_0/resnet-25_24.ckpt loading
[WARNING] ME(587:140550222669632,MainProcess):2023-03-22-15:46:10.751.140 [mindspore/train/model.py:1095] For LossCallBack callback, {'step_end'} methods may not be supported in later version, Use methods prefixed with 'on_train' or 'on_eval' instead when using customized callbacks.
[WARNING] MD(587,7fd02affd700,python):2023-03-22-15:48:23.424.793 [mindspore/ccsrc/minddata/dataset/engine/datasetops/data_queue_op.cc:896] DetectPerBatchTime] Bad performance attention, it takes more than 25 seconds to fetch a batch of data from dataset pipeline, which might result `GetNext` timeout problem. You may test dataset processing performance(with creating dataset iterator) and optimize it.
epoch: 26 step: 24, loss is 4.916446
Train epoch time: 136417.633 ms, per step time: 5684.068 ms
epoch: 27 step: 24, loss is 5.306696
Train epoch time: 5546.347 ms, per step time: 231.098 ms
epoch: 28 step: 24, loss is 5.1335387
Train epoch time: 4439.452 ms, per step time: 184.977 ms
epoch: 29 step: 24, loss is 4.938741
Train epoch time: 5313.017 ms, per step time: 221.376 ms
epoch: 30 step: 24, loss is 5.128438
Train epoch time: 8922.200 ms, per step time: 371.758 ms
...
...
time stamp 2023.03.22-19:40:54 pre trained ckpt model /job/code/output/./checkpoint/ckpt_0/resnet_1-12_24_breakpoint.ckpt loading
[WARNING] MD(858,7f2e467fc700,python):2023-03-22-19:43:11.947.889 [mindspore/ccsrc/minddata/dataset/engine/datasetops/data_queue_op.cc:896] DetectPerBatchTime] Bad performance attention, it takes more than 25 seconds to fetch a batch of data from dataset pipeline, which might result `GetNext` timeout problem. You may test dataset processing performance(with creating dataset iterator) and optimize it.
epoch: 13 step: 24, loss is 6.4576316
Train epoch time: 141532.140 ms, per step time: 5897.172 ms
epoch: 14 step: 24, loss is 6.228643
Train epoch time: 1431.532 ms, per step time: 59.647 ms
epoch: 15 step: 24, loss is 6.267328
Train epoch time: 2965.660 ms, per step time: 123.569 ms
...
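The logs above show the pattern that makes resumption work: on each start the training script loads the newest checkpoint under the job's checkpoint directory and continues from the following epoch (resnet-25_24.ckpt resumes at epoch 26; resnet_1-12_24_breakpoint.ckpt resumes at epoch 13). The following is a minimal sketch of such resume logic, assuming standard MindSpore checkpoint APIs; the helper name resume_from_latest_ckpt, the default directory, and the file-name parsing convention are illustrative assumptions, not the exact code shipped with the sample.

# Sketch only: restore the newest checkpoint, if any, before training starts.
import os
import glob
import mindspore as ms

def resume_from_latest_ckpt(net, ckpt_dir="/job/code/output/checkpoint/ckpt_0"):
    """Load the newest *.ckpt in ckpt_dir into `net` and return the epoch to resume from.

    Returns 0 when no checkpoint exists (first run, train from scratch).
    """
    ckpts = glob.glob(os.path.join(ckpt_dir, "*.ckpt"))
    if not ckpts:
        return 0
    latest = max(ckpts, key=os.path.getmtime)     # most recently written checkpoint
    param_dict = ms.load_checkpoint(latest)       # read parameters from disk
    ms.load_param_into_net(net, param_dict)       # restore them into the network
    # Assumed file-name convention: <prefix>-<epoch>_<step>[...].ckpt,
    # e.g. resnet-25_24.ckpt -> last completed epoch 25, so resume at 26.
    name = os.path.basename(latest)
    try:
        last_epoch = int(name.split("-")[-1].split("_")[0])
    except ValueError:
        last_epoch = 0
    return last_epoch + 1

In the faulted-and-rescheduled run, the same logic picks up the breakpoint checkpoint written before eviction, which is why the second log starts again at epoch 13 rather than epoch 1.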