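When a node fails, Volcano reschedules the training job onto other nodes that meet the requirements, and the job continues running there. Run the following command to check which nodes the pods were rescheduled to: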
kubectl get pods -A -o wide
NAMESPACE   NAME                            READY   STATUS              RESTARTS   AGE   IP       NODE    NOMINATED NODE   READINESS GATES
...
vcjob       mindx-dls-test-default-test-0   0/1     ContainerCreating   0          7s    <none>   node2   <none>           <none>
vcjob       mindx-dls-test-default-test-1   0/1     ContainerCreating   0          7s    <none>   node3   <none>           <none>
...
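View the training log of a rescheduled pod to confirm that training continues from the saved checkpoint:
kubectl logs mindx-dls-test-default-test-0 -n vcjob -f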
...
2023-06-09 22:17:33,441:INFO:--> pre_trained: /job/code/mindspore/output/resnet50/imagenet2012/ckpt_0/resnet50-39_48.ckpt
2023-06-09 22:17:33,441:INFO:--> run_eval: False
2023-06-09 22:17:33,441:INFO:--> eval_dataset_path:
2023-06-09 22:17:33,441:INFO:--> parameter_server: False
2023-06-09 22:17:33,441:INFO:--> filter_weight: False
2023-06-09 22:17:33,441:INFO:--> save_best_ckpt: True
2023-06-09 22:17:33,441:INFO:--> eval_start_epoch: 40
2023-06-09 22:17:33,441:INFO:--> eval_interval: 1
2023-06-09 22:17:33,441:INFO:--> enable_cache: False
2023-06-09 22:17:33,441:INFO:--> cache_session_id:
2023-06-09 22:17:33,441:INFO:--> mode_name: GRAPH
2023-06-09 22:17:33,441:INFO:--> boost_mode: O0
2023-06-09 22:17:33,441:INFO:--> conv_init: XavierUniform
2023-06-09 22:17:33,441:INFO:--> dense_init: TruncatedNormal
2023-06-09 22:17:33,442:INFO:--> all_reduce_fusion_config: [85, 160]
2023-06-09 22:17:33,442:INFO:--> train_image_size: 224
2023-06-09 22:17:33,442:INFO:--> eval_image_size: 224
2023-06-09 22:17:33,442:INFO:--> device_id: 0
2023-06-09 22:17:33,442:INFO:--> width: 224
2023-06-09 22:17:33,442:INFO:--> height: 224
2023-06-09 22:17:33,442:INFO:--> file_name: resnet50
2023-06-09 22:17:33,442:INFO:--> file_format: MINDIR
2023-06-09 22:17:33,442:INFO:--> ckpt_file:
2023-06-09 22:17:33,442:INFO:--> network_dataset: resnet50_imagenet2012
2023-06-09 22:17:33,442:INFO:--> save_graphs: False
2023-06-09 22:17:33,442:INFO:--> save_graphs_path: ./graphs
2023-06-09 22:17:33,442:INFO:--> has_trained_epoch: 0
2023-06-09 22:17:33,442:INFO:--> has_trained_step: 0
2023-06-09 22:17:33,442:INFO:--> result_path:
2023-06-09 22:17:33,442:INFO:--> label_path:
2023-06-09 22:17:33,442:INFO:--> config_path: /job/code/mindspore/config/resnet50_imagenet2012_config.yaml
2023-06-09 22:17:33,442:INFO:--> rank_id: 0
2023-06-09 22:17:33,442:INFO:--> save_ckpt_dir: /job/code/mindspore/output/resnet50/imagenet2012/ckpt
2023-06-09 22:17:33,442:INFO:--> log_dir: /job/code/mindspore/output/resnet50/imagenet2012/log
2023-06-09 22:17:33,442:INFO:--> logger: <LOGGER resnet (NOTSET)>
2023-06-09 22:17:33,442:INFO:
[WARNING] DEVICE(312,fffd6e363470,python):2023-06-09-22:17:33.999.925 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:128] RunGEInitGraph] Can not find init_subgraph.kernel_graph_0 sub graph, don't need data init subgraph in INFER mode.
[WARNING] DEVICE(312,fffd6e363470,python):2023-06-09-22:17:43.733.157 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:128] RunGEInitGraph] Can not find init_subgraph.kernel_graph_1 sub graph, don't need data init subgraph in INFER mode.
....2023-06-09 22:18:45,025:INFO:epoch: [40/90] loss: 3.465011, epoch time: 71.582 s, per step time: 1491.285 ms
2023-06-09 22:18:49,453:INFO:epoch: [41/90] loss: 3.396700, epoch time: 4.428 s, per step time: 92.245 ms
.2023-06-09 22:19:02,685:INFO:epoch: [42/90] loss: 3.297215, epoch time: 13.232 s, per step time: 275.659 ms
2023-06-09 22:19:07,323:INFO:epoch: [43/90] loss: 3.289656, epoch time: 4.638 s, per step time: 96.622 ms
2023-06-09 22:19:11,746:INFO:epoch: [44/90] loss: 3.266534, epoch time: 4.423 s, per step time: 92.139 ms
2023-06-09 22:19:16,913:INFO:epoch: [45/90] loss: 3.180886, epoch time: 5.167 s, per step time: 107.650 ms
2023-06-09 22:19:21,377:INFO:epoch: [46/90] loss: 2.895963, epoch time: 4.464 s, per step time: 92.997 ms
2023-06-09 22:19:25,798:INFO:epoch: [47/90] loss: 2.815258, epoch time: 4.420 s, per step time: 92.090 ms
2023-06-09 22:19:31,122:INFO:epoch: [48/90] loss: 2.826911, epoch time: 5.324 s, per step time: 110.918 ms
2023-06-09 22:19:35,591:INFO:epoch: [49/90] loss: 2.712467, epoch time: 4.469 s, per step time: 93.098 ms
...
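In the log above, pre_trained points to resnet50-39_48.ckpt and the first reported epoch is 40, which is consistent with training resuming from the checkpoint rather than starting over. The following is a minimal sketch of how that check could be automated on saved log text. It assumes the log format shown above and that the checkpoint file name follows a prefix-epoch_step.ckpt pattern; the function names and the "first epoch equals checkpoint epoch plus one" rule are illustrative assumptions, not part of the product.

```python
import re

# Assumed log formats, taken from the sample output above.
EPOCH_RE = re.compile(r"epoch: \[(\d+)/(\d+)\] loss: ([\d.]+)")
CKPT_RE = re.compile(r"pre_trained: \S*-(\d+)_(\d+)\.ckpt")


def checkpoint_epoch(log_text):
    """Return the epoch encoded in the pre_trained checkpoint name, or None."""
    match = CKPT_RE.search(log_text)
    return int(match.group(1)) if match else None


def first_logged_epoch(log_text):
    """Return the first epoch number reported in the log, or None."""
    match = EPOCH_RE.search(log_text)
    return int(match.group(1)) if match else None


def resumed_from_checkpoint(log_text):
    """Hypothetical check: the first logged epoch directly follows the checkpoint epoch."""
    ckpt = checkpoint_epoch(log_text)
    first = first_logged_epoch(log_text)
    if ckpt is None or first is None:
        return False
    return first == ckpt + 1


# A few lines copied from the log above.
sample = """\
2023-06-09 22:17:33,441:INFO:--> pre_trained: /job/code/mindspore/output/resnet50/imagenet2012/ckpt_0/resnet50-39_48.ckpt
2023-06-09 22:17:33,441:INFO:--> run_eval: False
....2023-06-09 22:18:45,025:INFO:epoch: [40/90] loss: 3.465011, epoch time: 71.582 s, per step time: 1491.285 ms
2023-06-09 22:18:49,453:INFO:epoch: [41/90] loss: 3.396700, epoch time: 4.428 s, per step time: 92.245 ms
"""

print(checkpoint_epoch(sample), first_logged_epoch(sample), resumed_from_checkpoint(sample))
```

To use it on a real job, the output of the kubectl logs command can be saved to a file and its contents passed to resumed_from_checkpoint.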