Viewing the Running Result

If a node becomes faulty, Volcano deletes the training job, Resilience Controller adjusts the job's resource requirements to match the resources that are still available, and Volcano reschedules the job onto those resources so that training continues.

Elastic Training Status

Assume that a 16-processor job is dispatched to two nodes. Run the following command to check the status of the training job:

kubectl get pods -A -o wide

If the following information is displayed, the training job is running properly.

NAMESPACE   NAME                            READY   STATUS    RESTARTS   AGE   IP              NODE     NOMINATED NODE   READINESS GATES
......
vcjob       mindx-dls-test-default-test-0   1/1     Running   0          47s   192.168.70.82   Node-1   <none>           <none>
vcjob       mindx-dls-test-default-test-1   1/1     Running   0          47s   192.168.39.9    Node-2   <none>           <none>
......
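
Because the -A option lists pods from every namespace, the pods of the training job can be hard to spot in a busy cluster. As a minimal sketch, the following shell function counts the pods of one namespace per STATUS value. The function name pod_status_counts is hypothetical and is not part of kubectl or Volcano. It assumes the default column layout of the command output, in which the first field is NAMESPACE and the fourth field is STATUS.

```shell
# Hypothetical helper (not part of kubectl or Volcano).
# Reads `kubectl get pods -A -o wide` output on stdin and prints one
# "STATUS COUNT" line per status for the given namespace (default: vcjob).
pod_status_counts() {
  awk -v ns="${1:-vcjob}" '
    $1 == ns { count[$4]++ }                      # field 1: NAMESPACE, field 4: STATUS
    END      { for (s in count) print s, count[s] }
  ' | sort                                        # awk iteration order is unspecified
}
```

Example: kubectl get pods -A -o wide | pod_status_counts vcjob
For the sample output of the two-node job, this prints "Running 2".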

When an NPU network fault occurs on Node-1, Volcano deletes the job. Run the following command to check the termination status of the training job:

kubectl get pods -A -o wide

If the following information is displayed, the pods of the training job are being terminated, which indicates that the job is being deleted.

NAMESPACE   NAME                            READY   STATUS        RESTARTS   AGE     IP              NODE     NOMINATED NODE   READINESS GATES
......
vcjob       mindx-dls-test-default-test-0   0/1     Terminating   0          6m59s   192.168.70.82   Node-1   <none>           <none>
vcjob       mindx-dls-test-default-test-1   1/1     Terminating   0          6m59s   192.168.39.9    Node-2   <none>           <none>
......

Wait until the job has been rescheduled, and then run the following command to check the scaling status of the job:

kubectl get pods -A -o wide

If the following information is displayed, the original 16-processor job is scaled to an 8-processor job on one node based on the number of available nodes.

NAMESPACE   NAME                            READY   STATUS    RESTARTS   AGE    IP              NODE     NOMINATED NODE   READINESS GATES
......
vcjob       mindx-dls-test-default-test-0   1/1     Running   0          107s   192.168.70.86   Node-2   <none>           <none>
......
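
Instead of rerunning the command by hand, the check can be polled. The following is a minimal sketch, not a MindX DL or Volcano tool: the function name wait_until_running and the variables LIST_PODS_CMD and POLL_INTERVAL are hypothetical. It assumes that the job has been rescheduled once the expected number of pods in the namespace is Running and none is Terminating.

```shell
# Hypothetical helper (not part of kubectl, Volcano, or MindX DL).
# Usage: wait_until_running NAMESPACE EXPECTED_RUNNING [TIMEOUT_SECONDS]
# Returns 0 once EXPECTED_RUNNING pods are Running and none is Terminating,
# or 1 if that state is not reached before the timeout.
wait_until_running() {
  namespace=$1
  expected=$2
  timeout=${3:-300}
  interval=${POLL_INTERVAL:-5}   # seconds between polls
  elapsed=0
  while :; do
    # LIST_PODS_CMD can be overridden, for example to replay saved output.
    state=$(${LIST_PODS_CMD:-kubectl get pods -A -o wide} |
      awk -v ns="$namespace" -v want="$expected" '
        $1 == ns && $4 == "Running"     { running++ }
        $1 == ns && $4 == "Terminating" { terminating++ }
        END {
          if (running == want && terminating == 0) print "ok"
          else print "waiting"
        }')
    if [ "$state" = "ok" ]; then
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

Example: wait_until_running vcjob 1 600
In the scenario above, one Running pod is expected after the 16-processor job has been scaled to an 8-processor job on one node.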

Viewing the Running Status of a Single Pod

Run the following command to check the running status of the training job on a single pod:

kubectl logs mindx-dls-test-default-test-0 -n vcjob -f

The following is a sample command output. It shows that, after the fault occurs, the training job resumes from the latest checkpoint file, which was saved in the 39th epoch (resnet50-39_48.ckpt), and continues training from the 40th epoch.

...
2023-06-09 22:17:33,441:INFO:--> pre_trained: /job/code/mindspore/output/resnet50/imagenet2012/ckpt_0/resnet50-39_48.ckpt
2023-06-09 22:17:33,441:INFO:--> run_eval: False
2023-06-09 22:17:33,441:INFO:--> eval_dataset_path: 
2023-06-09 22:17:33,441:INFO:--> parameter_server: False
2023-06-09 22:17:33,441:INFO:--> filter_weight: False
2023-06-09 22:17:33,441:INFO:--> save_best_ckpt: True
2023-06-09 22:17:33,441:INFO:--> eval_start_epoch: 40
2023-06-09 22:17:33,441:INFO:--> eval_interval: 1
2023-06-09 22:17:33,441:INFO:--> enable_cache: False
2023-06-09 22:17:33,441:INFO:--> cache_session_id: 
2023-06-09 22:17:33,441:INFO:--> mode_name: GRAPH
2023-06-09 22:17:33,441:INFO:--> boost_mode: O0
2023-06-09 22:17:33,441:INFO:--> conv_init: XavierUniform
2023-06-09 22:17:33,441:INFO:--> dense_init: TruncatedNormal
2023-06-09 22:17:33,442:INFO:--> all_reduce_fusion_config: [85, 160]
2023-06-09 22:17:33,442:INFO:--> train_image_size: 224
2023-06-09 22:17:33,442:INFO:--> eval_image_size: 224
2023-06-09 22:17:33,442:INFO:--> device_id: 0
2023-06-09 22:17:33,442:INFO:--> width: 224
2023-06-09 22:17:33,442:INFO:--> height: 224
2023-06-09 22:17:33,442:INFO:--> file_name: resnet50
2023-06-09 22:17:33,442:INFO:--> file_format: MINDIR
2023-06-09 22:17:33,442:INFO:--> ckpt_file: 
2023-06-09 22:17:33,442:INFO:--> network_dataset: resnet50_imagenet2012
2023-06-09 22:17:33,442:INFO:--> save_graphs: False
2023-06-09 22:17:33,442:INFO:--> save_graphs_path: ./graphs
2023-06-09 22:17:33,442:INFO:--> has_trained_epoch: 0
2023-06-09 22:17:33,442:INFO:--> has_trained_step: 0
2023-06-09 22:17:33,442:INFO:--> result_path: 
2023-06-09 22:17:33,442:INFO:--> label_path: 
2023-06-09 22:17:33,442:INFO:--> config_path: /job/code/mindspore/config/resnet50_imagenet2012_config.yaml
2023-06-09 22:17:33,442:INFO:--> rank_id: 0
2023-06-09 22:17:33,442:INFO:--> save_ckpt_dir: /job/code/mindspore/output/resnet50/imagenet2012/ckpt
2023-06-09 22:17:33,442:INFO:--> log_dir: /job/code/mindspore/output/resnet50/imagenet2012/log
2023-06-09 22:17:33,442:INFO:--> logger: <LOGGER resnet (NOTSET)>
2023-06-09 22:17:33,442:INFO:
[WARNING] DEVICE(312,fffd6e363470,python):2023-06-09-22:17:33.999.925 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:128] RunGEInitGraph] Can not find init_subgraph.kernel_graph_0 subgraph, don't need data init subgraph in INFER mode.
[WARNING] DEVICE(312,fffd6e363470,python):2023-06-09-22:17:43.733.157 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:128] RunGEInitGraph] Can not find init_subgraph.kernel_graph_1 sub graph, don't need data init subgraph in INFER mode.
....2023-06-09 22:18:45,025:INFO:epoch: [40/90] loss: 3.465011, epoch time: 71.582 s, per step time: 1491.285 ms
2023-06-09 22:18:49,453:INFO:epoch: [41/90] loss: 3.396700, epoch time: 4.428 s, per step time: 92.245 ms
.2023-06-09 22:19:02,685:INFO:epoch: [42/90] loss: 3.297215, epoch time: 13.232 s, per step time: 275.659 ms
2023-06-09 22:19:07,323:INFO:epoch: [43/90] loss: 3.289656, epoch time: 4.638 s, per step time: 96.622 ms
2023-06-09 22:19:11,746:INFO:epoch: [44/90] loss: 3.266534, epoch time: 4.423 s, per step time: 92.139 ms
2023-06-09 22:19:16,913:INFO:epoch: [45/90] loss: 3.180886, epoch time: 5.167 s, per step time: 107.650 ms
2023-06-09 22:19:21,377:INFO:epoch: [46/90] loss: 2.895963, epoch time: 4.464 s, per step time: 92.997 ms
2023-06-09 22:19:25,798:INFO:epoch: [47/90] loss: 2.815258, epoch time: 4.420 s, per step time: 92.090 ms
2023-06-09 22:19:31,122:INFO:epoch: [48/90] loss: 2.826911, epoch time: 5.324 s, per step time: 110.918 ms
2023-06-09 22:19:35,591:INFO:epoch: [49/90] loss: 2.712467, epoch time: 4.469 s, per step time: 93.098 ms
...
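
The resume point can also be checked automatically. The following is a minimal sketch, assuming that the checkpoint file name follows the prefix-epoch_step.ckpt pattern seen in the pre_trained line (resnet50-39_48.ckpt) and that epoch progress is logged as "epoch: [N/TOTAL]", as in the sample output. The function name check_resume is hypothetical.

```shell
# Hypothetical helper (not part of MindSpore or MindX DL).
# Reads a training log on stdin, prints the checkpoint epoch and the first
# epoch logged after the restart, and returns 0 only if the first logged
# epoch directly follows the checkpoint epoch.
check_resume() {
  awk '
    # Checkpoint epoch from a name such as resnet50-39_48.ckpt
    /pre_trained:/ && match($0, /-[0-9]+_[0-9]+\.ckpt/) {
      ckpt = substr($0, RSTART + 1, RLENGTH - 1) + 0   # "39_48.ckpt" -> 39
    }
    # First progress line such as "epoch: [40/90] loss: ..."
    first == "" && match($0, /epoch: \[[0-9]+\//) {
      first = substr($0, RSTART + 8, RLENGTH - 9) + 0
    }
    END {
      if (ckpt == "" || first == "") { print "incomplete log"; exit 2 }
      print "checkpoint_epoch=" ckpt " first_logged_epoch=" first
      exit ((first == ckpt + 1) ? 0 : 1)
    }
  '
}
```

Example: kubectl logs mindx-dls-test-default-test-0 -n vcjob | check_resume
For the sample output above, this prints "checkpoint_epoch=39 first_logged_epoch=40" and returns 0.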
