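Viewing the Run Results
When a node fails, Volcano deletes the training job. The Resilience Controller then adjusts the job's resource requests to match the resources that are still available, and Volcano reschedules the job onto those resources so that training continues.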
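The resizing step can be pictured with a toy calculation. The sketch below is not the Resilience Controller's actual logic, which this page does not describe. It only assumes that a job is shrunk to the number of healthy nodes and that every node contributes the same number of NPUs. The function name `shrink_job` and its parameters are hypothetical.

```python
from __future__ import annotations


def shrink_job(requested_nodes: int, npus_per_node: int, healthy_nodes: int) -> tuple[int, int]:
    """Return (nodes, total_npus) for a job resized to fit the healthy nodes.

    Assumption: the job simply drops to the number of healthy nodes and
    keeps the same NPU count per node. The real controller may apply
    additional constraints that are not modelled here.
    """
    if healthy_nodes < 1:
        raise ValueError("no healthy node left; the job cannot be rescheduled")
    nodes = min(requested_nodes, healthy_nodes)
    return nodes, nodes * npus_per_node


if __name__ == "__main__":
    # A 2-node, 16-NPU job on a cluster where one of the two nodes has failed.
    print(shrink_job(requested_nodes=2, npus_per_node=8, healthy_nodes=1))
```

With one healthy node left, the 2-node, 16-NPU job in this toy model becomes a 1-node, 8-NPU job.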
Elastic training status
- Log in to the management node and run the following command to view the status of the training job.
kubectl get pods -A -o wide
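The following example output assumes a cluster with 2 nodes and 16 NPUs in total, and a submitted job that requests 2 nodes and 16 NPUs. It shows the pod status while the training job is running normally.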
NAMESPACE   NAME                            READY   STATUS    RESTARTS   AGE   IP              NODE     NOMINATED NODE   READINESS GATES
……
vcjob       mindx-dls-test-default-test-0   1/1     Running   0          47s   192.168.70.82   Node-1   <none>           <none>
vcjob       mindx-dls-test-default-test-1   1/1     Running   0          47s   192.168.39.9    Node-2   <none>           <none>
……
- When an NPU network fault occurs on Node-1, Volcano deletes the job. Run the following command to check that the training job is being terminated.
kubectl get pods -A -o wide
The following example output shows that the training job has been deleted.
NAMESPACE   NAME                            READY   STATUS        RESTARTS   AGE     IP              NODE     NOMINATED NODE   READINESS GATES
……
vcjob       mindx-dls-test-default-test-0   0/1     Terminating   0          6m59s   192.168.70.82   Node-1   <none>           <none>
vcjob       mindx-dls-test-default-test-1   1/1     Terminating   0          6m59s   192.168.39.9    Node-2   <none>           <none>
……
- Wait a while, then run the following command to check how the training job has been rescaled.
kubectl get pods -A -o wide
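The following example output shows that the training job was scaled from 2 nodes and 16 NPUs down to 1 node and 8 NPUs, based on the number of nodes currently available.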
NAMESPACE   NAME                            READY   STATUS    RESTARTS   AGE    IP              NODE     NOMINATED NODE   READINESS GATES
……
vcjob       mindx-dls-test-default-test-0   1/1     Running   0          107s   192.168.70.86   Node-2   <none>           <none>
……
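If you want to compare the listings programmatically rather than by eye, something like the following Python sketch may help. It parses the text output of `kubectl get pods -A -o wide` and counts pods per status in the `vcjob` namespace. The helper names `parse_pods` and `status_counts` are hypothetical, the sample rows are copied from the example output, and the parser assumes the RESTARTS column is a plain integer with no extra text.

```python
from collections import Counter

# Sample listings copied from the example output; real output has more rows.
BEFORE = """\
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
vcjob mindx-dls-test-default-test-0 1/1 Running 0 47s 192.168.70.82 Node-1 <none> <none>
vcjob mindx-dls-test-default-test-1 1/1 Running 0 47s 192.168.39.9 Node-2 <none> <none>
"""
AFTER = """\
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
vcjob mindx-dls-test-default-test-0 1/1 Running 0 107s 192.168.70.86 Node-2 <none> <none>
"""


def parse_pods(listing, namespace="vcjob"):
    """Extract name, readiness, status and node for pods in one namespace."""
    pods = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip the header, elision markers and rows from other namespaces.
        if len(fields) < 8 or fields[0] != namespace:
            continue
        pods.append({
            "name": fields[1],
            "ready": fields[2],
            "status": fields[3],
            "node": fields[7],
        })
    return pods


def status_counts(pods):
    """Count pods per STATUS value, for example Running or Terminating."""
    return Counter(pod["status"] for pod in pods)


if __name__ == "__main__":
    for label, listing in (("before fault", BEFORE), ("after rescale", AFTER)):
        pods = parse_pods(listing)
        nodes = sorted({pod["node"] for pod in pods})
        print(label, dict(status_counts(pods)), nodes)
```

To run it against a live cluster, pass in the captured command output instead of the sample strings. For anything beyond a quick check, `kubectl get pods -o json` is likely a more robust input than column splitting.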
Viewing the status of a single Pod
Run the following command to view the training job's log in a single Pod.
kubectl logs mindx-dls-test-default-test-0 -n vcjob -f
- The following example output shows that, after the fault, training is restored from the most recently saved checkpoint file (resnet50-39_48.ckpt, saved at epoch 39) and resumes from epoch 40.
...
2023-06-09 22:17:33,441:INFO:--> pre_trained: /job/code/mindspore/output/resnet50/imagenet2012/ckpt_0/resnet50-39_48.ckpt
2023-06-09 22:17:33,441:INFO:--> run_eval: False
2023-06-09 22:17:33,441:INFO:--> eval_dataset_path:
2023-06-09 22:17:33,441:INFO:--> parameter_server: False
2023-06-09 22:17:33,441:INFO:--> filter_weight: False
2023-06-09 22:17:33,441:INFO:--> save_best_ckpt: True
2023-06-09 22:17:33,441:INFO:--> eval_start_epoch: 40
2023-06-09 22:17:33,441:INFO:--> eval_interval: 1
2023-06-09 22:17:33,441:INFO:--> enable_cache: False
2023-06-09 22:17:33,441:INFO:--> cache_session_id:
2023-06-09 22:17:33,441:INFO:--> mode_name: GRAPH
2023-06-09 22:17:33,441:INFO:--> boost_mode: O0
2023-06-09 22:17:33,441:INFO:--> conv_init: XavierUniform
2023-06-09 22:17:33,441:INFO:--> dense_init: TruncatedNormal
2023-06-09 22:17:33,442:INFO:--> all_reduce_fusion_config: [85, 160]
2023-06-09 22:17:33,442:INFO:--> train_image_size: 224
2023-06-09 22:17:33,442:INFO:--> eval_image_size: 224
2023-06-09 22:17:33,442:INFO:--> device_id: 0
2023-06-09 22:17:33,442:INFO:--> width: 224
2023-06-09 22:17:33,442:INFO:--> height: 224
2023-06-09 22:17:33,442:INFO:--> file_name: resnet50
2023-06-09 22:17:33,442:INFO:--> file_format: MINDIR
2023-06-09 22:17:33,442:INFO:--> ckpt_file:
2023-06-09 22:17:33,442:INFO:--> network_dataset: resnet50_imagenet2012
2023-06-09 22:17:33,442:INFO:--> save_graphs: False
2023-06-09 22:17:33,442:INFO:--> save_graphs_path: ./graphs
2023-06-09 22:17:33,442:INFO:--> has_trained_epoch: 0
2023-06-09 22:17:33,442:INFO:--> has_trained_step: 0
2023-06-09 22:17:33,442:INFO:--> result_path:
2023-06-09 22:17:33,442:INFO:--> label_path:
2023-06-09 22:17:33,442:INFO:--> config_path: /job/code/mindspore/config/resnet50_imagenet2012_config.yaml
2023-06-09 22:17:33,442:INFO:--> rank_id: 0
2023-06-09 22:17:33,442:INFO:--> save_ckpt_dir: /job/code/mindspore/output/resnet50/imagenet2012/ckpt
2023-06-09 22:17:33,442:INFO:--> log_dir: /job/code/mindspore/output/resnet50/imagenet2012/log
2023-06-09 22:17:33,442:INFO:--> logger: <LOGGER resnet (NOTSET)>
2023-06-09 22:17:33,442:INFO:
[WARNING] DEVICE(312,fffd6e363470,python):2023-06-09-22:17:33.999.925 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:128] RunGEInitGraph] Can not find init_subgraph.kernel_graph_0 subgraph, don't need data init subgraph in INFER mode.
[WARNING] DEVICE(312,fffd6e363470,python):2023-06-09-22:17:43.733.157 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:128] RunGEInitGraph] Can not find init_subgraph.kernel_graph_1 sub graph, don't need data init subgraph in INFER mode.
....2023-06-09 22:18:45,025:INFO:epoch: [40/90] loss: 3.465011, epoch time: 71.582 s, per step time: 1491.285 ms
2023-06-09 22:18:49,453:INFO:epoch: [41/90] loss: 3.396700, epoch time: 4.428 s, per step time: 92.245 ms
.2023-06-09 22:19:02,685:INFO:epoch: [42/90] loss: 3.297215, epoch time: 13.232 s, per step time: 275.659 ms
2023-06-09 22:19:07,323:INFO:epoch: [43/90] loss: 3.289656, epoch time: 4.638 s, per step time: 96.622 ms
2023-06-09 22:19:11,746:INFO:epoch: [44/90] loss: 3.266534, epoch time: 4.423 s, per step time: 92.139 ms
2023-06-09 22:19:16,913:INFO:epoch: [45/90] loss: 3.180886, epoch time: 5.167 s, per step time: 107.650 ms
2023-06-09 22:19:21,377:INFO:epoch: [46/90] loss: 2.895963, epoch time: 4.464 s, per step time: 92.997 ms
2023-06-09 22:19:25,798:INFO:epoch: [47/90] loss: 2.815258, epoch time: 4.420 s, per step time: 92.090 ms
2023-06-09 22:19:31,122:INFO:epoch: [48/90] loss: 2.826911, epoch time: 5.324 s, per step time: 110.918 ms
2023-06-09 22:19:35,591:INFO:epoch: [49/90] loss: 2.712467, epoch time: 4.469 s, per step time: 93.098 ms
...
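To confirm the resume point without reading the whole log, a small script along these lines could extract it. This is a sketch that assumes the checkpoint file name follows the `<prefix>-<epoch>_<step>.ckpt` pattern seen in the log above, and that the first `epoch: [n/total]` line marks where training resumed. The function name `resume_point` is hypothetical, and the sample lines are copied from the example output.

```python
import re

# Lines copied from the example log; a real log contains many more entries.
LOG = """\
2023-06-09 22:17:33,441:INFO:--> pre_trained: /job/code/mindspore/output/resnet50/imagenet2012/ckpt_0/resnet50-39_48.ckpt
2023-06-09 22:17:33,441:INFO:--> eval_start_epoch: 40
....2023-06-09 22:18:45,025:INFO:epoch: [40/90] loss: 3.465011, epoch time: 71.582 s, per step time: 1491.285 ms
2023-06-09 22:18:49,453:INFO:epoch: [41/90] loss: 3.396700, epoch time: 4.428 s, per step time: 92.245 ms
"""

# Assumed naming pattern: <prefix>-<epoch>_<step>.ckpt
CKPT_RE = re.compile(r"pre_trained:\s*\S*-(\d+)_(\d+)\.ckpt")
# Matches progress lines such as "epoch: [40/90]".
EPOCH_RE = re.compile(r"epoch: \[(\d+)/(\d+)\]")


def resume_point(log_text):
    """Return the checkpoint epoch/step and the first epoch logged after it."""
    ckpt = CKPT_RE.search(log_text)
    first = EPOCH_RE.search(log_text)
    if ckpt is None or first is None:
        return None  # The log does not show a restore followed by training.
    return {
        "ckpt_epoch": int(ckpt.group(1)),
        "ckpt_step": int(ckpt.group(2)),
        "first_epoch": int(first.group(1)),
        "total_epochs": int(first.group(2)),
    }


if __name__ == "__main__":
    info = resume_point(LOG)
    print(info)
    # A clean resume starts one epoch after the restored checkpoint.
    print("resumed correctly:", info["first_epoch"] == info["ckpt_epoch"] + 1)
```

In practice, the input would be the saved output of `kubectl logs mindx-dls-test-default-test-0 -n vcjob` rather than the embedded sample.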