Viewing the Running Result
If a node becomes faulty, Volcano deletes the training job, Resilience Controller adjusts the job's resource requirements to match the resources that are still available, and Volcano reschedules the job on the remaining resources so that training continues.
Elastic Training Status
- Log in to the management node and run the following command to check the running statuses of the training jobs:
kubectl get pods -A -o wide
Assume that a 16-processor job is delivered on two nodes. The following command output shows the job status when the training job is running properly.
NAMESPACE   NAME                            READY   STATUS    RESTARTS   AGE   IP              NODE     NOMINATED NODE   READINESS GATES
......
vcjob       mindx-dls-test-default-test-0   1/1     Running   0          47s   192.168.70.82   Node-1   <none>           <none>
vcjob       mindx-dls-test-default-test-1   1/1     Running   0          47s   192.168.39.9    Node-2   <none>           <none>
......
- When an NPU network fault occurs on Node-1, Volcano deletes the job. Run the following command to check the termination status of the training job:
kubectl get pods -A -o wide
If the following information is displayed, the training job is being deleted.
NAMESPACE   NAME                            READY   STATUS        RESTARTS   AGE     IP              NODE     NOMINATED NODE   READINESS GATES
......
vcjob       mindx-dls-test-default-test-0   0/1     Terminating   0          6m59s   192.168.70.82   Node-1   <none>           <none>
vcjob       mindx-dls-test-default-test-1   1/1     Terminating   0          6m59s   192.168.39.9    Node-2   <none>           <none>
......
- Wait for a while and run the following command to check the scaling status of the job:
kubectl get pods -A -o wide
If the following information is displayed, the original 16-processor job is scaled to an 8-processor job on one node based on the number of available nodes.
NAMESPACE   NAME                            READY   STATUS    RESTARTS   AGE    IP              NODE     NOMINATED NODE   READINESS GATES
......
vcjob       mindx-dls-test-default-test-0   1/1     Running   0          107s   192.168.70.86   Node-2   <none>           <none>
......
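The three checks above all read the same columns of the kubectl get pods -A -o wide output, so they can be summarized with a small script. The following is a minimal sketch, not part of the product: the function name summarize_job_pods is hypothetical, and it assumes that the output columns are whitespace-separated in the order shown above and that the RESTARTS column is a plain number.

```python
from collections import Counter


def summarize_job_pods(kubectl_output, namespace="vcjob"):
    """Count pod statuses and collect node names for one namespace.

    Hypothetical helper: assumes the column order
    NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE ...
    as printed by `kubectl get pods -A -o wide` in the samples above.
    """
    statuses = Counter()
    nodes = set()
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip the header, the "......" filler lines, and other namespaces.
        if len(fields) < 8 or fields[0] != namespace:
            continue
        statuses[fields[3]] += 1  # STATUS column
        nodes.add(fields[7])      # NODE column
    return statuses, nodes


# Sample taken from the scaled-down output shown above.
sample = """\
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
vcjob mindx-dls-test-default-test-0 1/1 Running 0 107s 192.168.70.86 Node-2 <none> <none>
"""

statuses, nodes = summarize_job_pods(sample)
print(dict(statuses), sorted(nodes))
```

One Running pod on a single node, where two pods on two nodes were running before the fault, corresponds to the job having been scaled from 16 processors to 8. To use the sketch against a live cluster, the text passed in can be captured from the kubectl command, for example with subprocess.run and capture_output=True.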
Viewing the Running Status of a Single Pod
Run the following command to check the running status of the training job on a single pod:
kubectl logs mindx-dls-test-default-test-0 -n vcjob -f
- The following is a sample command output, which shows that the latest checkpoint file saved in the 39th epoch is used to resume the training job after a fault occurs, and then the training job continues from the 40th epoch.
...
2023-06-09 22:17:33,441:INFO:--> pre_trained: /job/code/mindspore/output/resnet50/imagenet2012/ckpt_0/resnet50-39_48.ckpt
2023-06-09 22:17:33,441:INFO:--> run_eval: False
2023-06-09 22:17:33,441:INFO:--> eval_dataset_path:
2023-06-09 22:17:33,441:INFO:--> parameter_server: False
2023-06-09 22:17:33,441:INFO:--> filter_weight: False
2023-06-09 22:17:33,441:INFO:--> save_best_ckpt: True
2023-06-09 22:17:33,441:INFO:--> eval_start_epoch: 40
2023-06-09 22:17:33,441:INFO:--> eval_interval: 1
2023-06-09 22:17:33,441:INFO:--> enable_cache: False
2023-06-09 22:17:33,441:INFO:--> cache_session_id:
2023-06-09 22:17:33,441:INFO:--> mode_name: GRAPH
2023-06-09 22:17:33,441:INFO:--> boost_mode: O0
2023-06-09 22:17:33,441:INFO:--> conv_init: XavierUniform
2023-06-09 22:17:33,441:INFO:--> dense_init: TruncatedNormal
2023-06-09 22:17:33,442:INFO:--> all_reduce_fusion_config: [85, 160]
2023-06-09 22:17:33,442:INFO:--> train_image_size: 224
2023-06-09 22:17:33,442:INFO:--> eval_image_size: 224
2023-06-09 22:17:33,442:INFO:--> device_id: 0
2023-06-09 22:17:33,442:INFO:--> width: 224
2023-06-09 22:17:33,442:INFO:--> height: 224
2023-06-09 22:17:33,442:INFO:--> file_name: resnet50
2023-06-09 22:17:33,442:INFO:--> file_format: MINDIR
2023-06-09 22:17:33,442:INFO:--> ckpt_file:
2023-06-09 22:17:33,442:INFO:--> network_dataset: resnet50_imagenet2012
2023-06-09 22:17:33,442:INFO:--> save_graphs: False
2023-06-09 22:17:33,442:INFO:--> save_graphs_path: ./graphs
2023-06-09 22:17:33,442:INFO:--> has_trained_epoch: 0
2023-06-09 22:17:33,442:INFO:--> has_trained_step: 0
2023-06-09 22:17:33,442:INFO:--> result_path:
2023-06-09 22:17:33,442:INFO:--> label_path:
2023-06-09 22:17:33,442:INFO:--> config_path: /job/code/mindspore/config/resnet50_imagenet2012_config.yaml
2023-06-09 22:17:33,442:INFO:--> rank_id: 0
2023-06-09 22:17:33,442:INFO:--> save_ckpt_dir: /job/code/mindspore/output/resnet50/imagenet2012/ckpt
2023-06-09 22:17:33,442:INFO:--> log_dir: /job/code/mindspore/output/resnet50/imagenet2012/log
2023-06-09 22:17:33,442:INFO:--> logger: <LOGGER resnet (NOTSET)>
2023-06-09 22:17:33,442:INFO:
[WARNING] DEVICE(312,fffd6e363470,python):2023-06-09-22:17:33.999.925 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:128] RunGEInitGraph] Can not find init_subgraph.kernel_graph_0 subgraph, don't need data init subgraph in INFER mode.
[WARNING] DEVICE(312,fffd6e363470,python):2023-06-09-22:17:43.733.157 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:128] RunGEInitGraph] Can not find init_subgraph.kernel_graph_1 sub graph, don't need data init subgraph in INFER mode.
....2023-06-09 22:18:45,025:INFO:epoch: [40/90] loss: 3.465011, epoch time: 71.582 s, per step time: 1491.285 ms
2023-06-09 22:18:49,453:INFO:epoch: [41/90] loss: 3.396700, epoch time: 4.428 s, per step time: 92.245 ms
.2023-06-09 22:19:02,685:INFO:epoch: [42/90] loss: 3.297215, epoch time: 13.232 s, per step time: 275.659 ms
2023-06-09 22:19:07,323:INFO:epoch: [43/90] loss: 3.289656, epoch time: 4.638 s, per step time: 96.622 ms
2023-06-09 22:19:11,746:INFO:epoch: [44/90] loss: 3.266534, epoch time: 4.423 s, per step time: 92.139 ms
2023-06-09 22:19:16,913:INFO:epoch: [45/90] loss: 3.180886, epoch time: 5.167 s, per step time: 107.650 ms
2023-06-09 22:19:21,377:INFO:epoch: [46/90] loss: 2.895963, epoch time: 4.464 s, per step time: 92.997 ms
2023-06-09 22:19:25,798:INFO:epoch: [47/90] loss: 2.815258, epoch time: 4.420 s, per step time: 92.090 ms
2023-06-09 22:19:31,122:INFO:epoch: [48/90] loss: 2.826911, epoch time: 5.324 s, per step time: 110.918 ms
2023-06-09 22:19:35,591:INFO:epoch: [49/90] loss: 2.712467, epoch time: 4.469 s, per step time: 93.098 ms
...
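Whether training resumed from the expected checkpoint can also be checked from the log text itself. The following is a minimal sketch under stated assumptions: the function name resume_info is hypothetical, the checkpoint file name is assumed to follow the <prefix>-<epoch>_<step>.ckpt pattern seen in the pre_trained line above, and epoch lines are assumed to contain "epoch: [N/M]" as in the sample output.

```python
import re

# Patterns are assumptions based on the sample log above, not a documented format.
CKPT_RE = re.compile(r"pre_trained:\s*\S*?-(\d+)_(\d+)\.ckpt")
EPOCH_RE = re.compile(r"epoch: \[(\d+)/(\d+)\]")


def resume_info(log_text):
    """Return (checkpoint epoch, first epoch logged after resuming).

    Either value is None if the corresponding line is not found.
    """
    ckpt = CKPT_RE.search(log_text)
    epoch = EPOCH_RE.search(log_text)  # first epoch line in the log
    return (
        int(ckpt.group(1)) if ckpt else None,
        int(epoch.group(1)) if epoch else None,
    )


# Two lines copied from the sample command output above.
sample_log = (
    "2023-06-09 22:17:33,441:INFO:--> pre_trained: "
    "/job/code/mindspore/output/resnet50/imagenet2012/ckpt_0/resnet50-39_48.ckpt\n"
    "....2023-06-09 22:18:45,025:INFO:epoch: [40/90] loss: 3.465011, "
    "epoch time: 71.582 s, per step time: 1491.285 ms\n"
)

print(resume_info(sample_log))  # prints (39, 40)
```

A result in which the second value is one greater than the first, as here, matches the behavior described above: the checkpoint from the 39th epoch is loaded and training continues from the 40th epoch.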