kubectl logs -n {namespace} {pod-name}
For example:
kubectl logs -n vcjob mindx-dls-test-default-test-0
...
I1123 16:20:11.016411 139889740781376 controller.py:458] train | step: 112 | steps/sec: 4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.339745}
train | step: 112 | steps/sec: 4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.339745}
2022-11-23 16:20:11.541361: I core/op_executors/npu_concrete_graph.cpp:84] Start consume iterator resource AnonymousIterator0 4 times
2022-11-23 16:20:11.541499: I core/op_executors/npu_concrete_graph.cpp:118] Start run ge graph 445 pin to cpu, loop size 4
2022-11-23 16:20:11.565552: I core/op_executors/npu_concrete_graph.cpp:92] Iterator resource AnonymousIterator0 consume 4 times done with status OK
I1123 16:20:12.046172 139889740781376 controller.py:458] train | step: 116 | steps/sec: 4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.389724}
train | step: 116 | steps/sec: 4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.389724}
2022-11-23 16:20:12.542817: I core/op_executors/npu_concrete_graph.cpp:84] Start consume iterator resource AnonymousIterator0 4 times
2022-11-23 16:20:12.542937: I core/op_executors/npu_concrete_graph.cpp:118] Start run ge graph 445 pin to cpu, loop size 4
2022-11-23 16:20:12.571535: I core/op_executors/npu_concrete_graph.cpp:92] Iterator resource AnonymousIterator0 consume 4 times done with status OK
I1123 16:20:13.038832 139889740781376 controller.py:458] train | step: 120 | steps/sec: 4.2 | output: {'train_accuracy': 0.0, 'train_loss': 12.421794}
train | step: 120 | steps/sec: 4.2 | output: {'train_accuracy': 0.0, 'train_loss': 12.421794}
2022-11-23 16:20:13.559254: I core/op_executors/npu_concrete_graph.cpp:84] Start consume iterator resource AnonymousIterator0 4 times
2022-11-23 16:20:13.559394: I core/op_executors/npu_concrete_graph.cpp:118] Start run ge graph 445 pin to cpu, loop size 4
2022-11-23 16:20:13.604791: I core/op_executors/npu_concrete_graph.cpp:92] Iterator resource AnonymousIterator0 consume 4 times done with status OK
I1123 16:20:14.052418 139889740781376 controller.py:458] train | step: 124 | steps/sec: 4.1 | output: {'train_accuracy': 0.0, 'train_loss': 12.335646}
train | step: 124 | steps/sec: 4.1 | output: {'train_accuracy': 0.0, 'train_loss': 12.335646}
2022-11-23 16:20:14.555126: I core/op_executors/npu_concrete_graph.cpp:84] Start consume iterator resource AnonymousIterator0 4 times
2022-11-23 16:20:14.555217: I core/op_executors/npu_concrete_graph.cpp:118] Start run ge graph 445 pin to cpu, loop size 4
2022-11-23 16:20:14.601171: I core/op_executors/npu_concrete_graph.cpp:92] Iterator resource AnonymousIterator0 consume 4 times done with status OK
I1123 16:20:15.058790 139889740781376 controller.py:458] train | step: 128 | steps/sec: 4.1 | output: {'train_accuracy': 0.0, 'train_loss': 12.415506}
train | step: 128 | steps/sec: 4.1 | output: {'train_accuracy': 0.0, 'train_loss': 12.415506}
I1123 16:20:15.228246 139889740781376 resnet_ctl_imagenet_main.py:191] Run stats: {'step_timestamp_log': ['BatchTimestamp<batch_index: 0, timestamp: 1669191532.9730577>', 'BatchTimestamp<batch_index: 100, timestamp: 1669191607.7925153>'], 'train_finish_time': 1669191615.2273297, 'avg_exp_per_second': 24.973437296848516}
2022-11-23 16:20:15.232802: I core/npu_logger.cpp:58] Stopping npu stdout receiver of device 0
2022-11-23 16:20:15.232901: I core/npu_device.cpp:122] Stopping iterator resource provider for AnonymousMultiDeviceIterator0
2022-11-23 16:20:15.233013: I core/npu_device.cpp:122] Stopping iterator resource provider for AnonymousIterator0
2022-11-23 16:20:15.235151: I core/npu_wrapper.cpp:230] Stop tensorflow model parser succeed
2022-11-23 16:20:18.289648: I core/npu_wrapper.cpp:240] Stop graph engine succeed
...
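The per-step metrics in the log above can be pulled out with a small script. The following is a minimal sketch, assuming the log text has already been saved (for example by redirecting the kubectl logs output to a file); the function name parse_train_metrics is hypothetical and not part of any MindX DL tooling.

```python
import ast
import re

# Matches entries such as:
#   train | step: 112 | steps/sec: 4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.339745}
TRAIN_ENTRY = re.compile(
    r"train\s*\|\s*step:\s*(\d+)\s*\|\s*steps/sec:\s*([\d.]+)\s*\|\s*output:\s*(\{[^}]*\})"
)


def parse_train_metrics(log_text):
    """Return {step: metrics} for every 'train | step: ...' entry in the log.

    Each entry is printed twice in the log (with and without the logging
    prefix), so results are keyed by step to drop the duplicate.
    """
    metrics = {}
    for match in TRAIN_ENTRY.finditer(log_text):
        step = int(match.group(1))
        # The 'output' field is a Python dict literal, so literal_eval is enough.
        output = ast.literal_eval(match.group(3))
        metrics[step] = {"steps_per_sec": float(match.group(2)), **output}
    return metrics


if __name__ == "__main__":
    sample = (
        "I1123 16:20:11.016411 139889740781376 controller.py:458] "
        "train | step: 112 | steps/sec: 4.0 | output: "
        "{'train_accuracy': 0.0, 'train_loss': 12.339745}\n"
        "train | step: 112 | steps/sec: 4.0 | output: "
        "{'train_accuracy': 0.0, 'train_loss': 12.339745}\n"
    )
    for step, values in sorted(parse_train_metrics(sample).items()):
        print(step, values)
```

The pattern searches the whole text rather than line by line, so it still works if the log has lost its line breaks.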
drwxr-xr-x 1 root root 4096 Dec 2 11:36 ./
drwxrwxrwx 1 root root 4096 Dec 2 11:36 ../
-rw-r--r--. 1 root root 999 Dec 2 11:36 checkpoint
-rw-r--r--. 1 root root 306986892 Dec 2 11:35 ckpt-111.data-00000-of-00001
-rw-r--r--. 1 root root 44311 Dec 2 11:35 ckpt-111.index
-rw-r--r--. 1 root root 306986892 Dec 2 11:36 ckpt-128.data-00000-of-00001
-rw-r--r--. 1 root root 44311 Dec 2 11:36 ckpt-128.index
kubectl logs -n {namespace} {pod-name}
For example:
kubectl logs -n vcjob mindx-dls-test-default-test-0
[gpu id: 0 ] Test: [77/85] Time 0.117 ( 0.281) Loss 1.073741e+01 (1.078090e+01) Acc@1 0.00 ( 0.02) Acc@5 0.00 ( 0.12)
[gpu id: 0 ] Test: [78/85] Time 0.114 ( 0.279) Loss 1.072909e+01 (1.078015e+01) Acc@1 0.00 ( 0.02) Acc@5 0.00 ( 0.12)
[gpu id: 0 ] Test: [79/85] Time 0.115 ( 0.277) Loss 1.073733e+01 (1.077953e+01) Acc@1 0.00 ( 0.02) Acc@5 0.20 ( 0.12)
[gpu id: 0 ] Test: [80/85] Time 2.385 ( 0.306) Loss 1.087646e+01 (1.078090e+01) Acc@1 0.00 ( 0.02) Acc@5 0.00 ( 0.12)
[gpu id: 0 ] Test: [81/85] Time 1.139 ( 0.318) Loss 1.075754e+01 (1.078058e+01) Acc@1 0.00 ( 0.02) Acc@5 0.39 ( 0.12)
[gpu id: 0 ] Test: [82/85] Time 0.115 ( 0.315) Loss 1.068419e+01 (1.077925e+01) Acc@1 0.00 ( 0.02) Acc@5 0.20 ( 0.13)
[gpu id: 0 ] Test: [83/85] Time 0.129 ( 0.313) Loss 1.075079e+01 (1.077887e+01) Acc@1 0.00 ( 0.02) Acc@5 0.20 ( 0.13)
[gpu id: 0 ] Test: [84/85] Time 0.134 ( 0.310) Loss 1.093459e+01 (1.078095e+01) Acc@1 0.00 ( 0.02) Acc@5 0.39 ( 0.13)
[gpu id: 0 ] [AVG-ACC] * Acc@1 0.016 Acc@5 0.130
validate acc1 tensor(0.0156, device='npu:0')
Complete 90 epoch training, take time:1.05h
...
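To check the final validation result without reading the whole log, the [AVG-ACC] summary entry can be extracted programmatically. This is a minimal sketch, assuming the log text is already available as a string; the function name final_accuracy is hypothetical.

```python
import re

# Matches the summary entry, for example:
#   [gpu id: 0 ] [AVG-ACC] * Acc@1 0.016 Acc@5 0.130
AVG_ACC = re.compile(r"\[AVG-ACC\]\s*\*\s*Acc@1\s+([\d.]+)\s+Acc@5\s+([\d.]+)")


def final_accuracy(log_text):
    """Return (acc1, acc5) from the last [AVG-ACC] entry, or None if absent."""
    matches = AVG_ACC.findall(log_text)
    if not matches:
        return None
    # A long run may print one summary per epoch; the last one is the final result.
    acc1, acc5 = matches[-1]
    return float(acc1), float(acc5)


if __name__ == "__main__":
    sample = "[gpu id: 0 ] [AVG-ACC] * Acc@1 0.016 Acc@5 0.130"
    print(final_accuracy(sample))
```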
drwxrwx--- 2 root root 4096 Mar 4 19:28 ./
drwxrwx--- 4 root root 4096 Mar 4 19:28 ../
-rw-rw---- 1 root root 102489869 Mar 4 19:28 checkpoint_npu0model_best.pth.tar
-rw-rw---- 1 root root 102489869 Mar 4 19:28 checkpoint_npu0.pth.tar
...
To convert the generated model files, refer to the "Model Inference" section of the PyTorch ResNet-50 model on ModelZoo.
kubectl logs -n {namespace} {pod-name}
For example:
kubectl logs -n vcjob mindx-dls-test-default-test-0
...
Train epoch time: 58349.597 ms, per step time: 249.357 ms
epoch: 83 step: 234, loss is 7.74848e-05
Train epoch time: 3995.383 ms, per step time: 17.074 ms
epoch: 84 step: 234, loss is 0.00019958502
Train epoch time: 4005.581 ms, per step time: 17.118 ms
epoch: 85 step: 234, loss is 0.020869248
Train epoch time: 7434.552 ms, per step time: 31.772 ms
epoch: 86 step: 234, loss is 6.217403e-05
Train epoch time: 4048.635 ms, per step time: 17.302 ms
epoch: 87 step: 234, loss is 6.641826e-05
Train epoch time: 4036.470 ms, per step time: 17.250 ms
epoch: 88 step: 234, loss is 0.00022485439
Train epoch time: 4032.634 ms, per step time: 17.233 ms
epoch: 89 step: 234, loss is 8.325573e-06
Train epoch time: 4009.894 ms, per step time: 17.136 ms
epoch: 90 step: 234, loss is 7.480194e-05
Train epoch time: 7445.271 ms, per step time: 31.817 ms
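The per-epoch loss values in this log can be collected the same way. This is a minimal sketch, assuming the log text is available as a string; the function name parse_epoch_losses is hypothetical, and the timing entries are deliberately left out because how they pair with epochs is not stated here.

```python
import re

# Matches entries such as:
#   epoch: 83 step: 234, loss is 7.74848e-05
# It does not match "Train epoch time: ..." because that text has no "epoch:".
EPOCH_LOSS = re.compile(r"epoch:\s*(\d+)\s+step:\s*(\d+),\s*loss is\s*([0-9.eE+-]+)")


def parse_epoch_losses(log_text):
    """Return a list of (epoch, step, loss) tuples in log order."""
    return [
        (int(epoch), int(step), float(loss))
        for epoch, step, loss in EPOCH_LOSS.findall(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "epoch: 83 step: 234, loss is 7.74848e-05\n"
        "Train epoch time: 3995.383 ms, per step time: 17.074 ms\n"
        "epoch: 84 step: 234, loss is 0.00019958502\n"
    )
    for epoch, step, loss in parse_epoch_losses(sample):
        print(epoch, step, loss)
```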
drwx------ 2 root root 4096 Dec 21 15:35 ./
drwxrwxrwx 10 root root 4096 Dec 21 15:26 ../
-r-------- 1 root root 188546464 Dec 21 15:31 resnet-45_234.ckpt
-r-------- 1 root root 188546464 Dec 21 15:31 resnet-50_234.ckpt
-r-------- 1 root root 188546464 Dec 21 15:32 resnet-55_234.ckpt
-r-------- 1 root root 188546464 Dec 21 15:32 resnet-60_234.ckpt
-r-------- 1 root root 188546464 Dec 21 15:33 resnet-65_234.ckpt
-r-------- 1 root root 188546464 Dec 21 15:33 resnet-70_234.ckpt
-r-------- 1 root root 188546464 Dec 21 15:33 resnet-75_234.ckpt
-r-------- 1 root root 188546464 Dec 21 15:34 resnet-80_234.ckpt
-r-------- 1 root root 188546464 Dec 21 15:34 resnet-85_234.ckpt
-r-------- 1 root root 188546464 Dec 21 15:35 resnet-90_234.ckpt
-rw------- 1 root root 769071 Dec 21 15:28 resnet-graph.meta
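When several checkpoints are kept, the most recent one can be picked by file name. This is a minimal sketch, assuming the two numbers in names such as resnet-90_234.ckpt are the epoch and the step (which is consistent with the "epoch: 90 step: 234" entries in the log, but is an assumption here); the function name latest_checkpoint is hypothetical.

```python
import re

# Assumed naming scheme: <prefix>-<epoch>_<step>.ckpt, e.g. resnet-90_234.ckpt
CKPT_NAME = re.compile(r"-(\d+)_(\d+)\.ckpt$")


def latest_checkpoint(file_names):
    """Return the .ckpt name with the highest (epoch, step), or None if none match."""
    best_name = None
    best_key = None
    for name in file_names:
        match = CKPT_NAME.search(name)
        if match is None:
            continue  # skips files such as resnet-graph.meta
        key = (int(match.group(1)), int(match.group(2)))
        if best_key is None or key > best_key:
            best_name, best_key = name, key
    return best_name


if __name__ == "__main__":
    names = ["resnet-45_234.ckpt", "resnet-90_234.ckpt", "resnet-graph.meta"]
    print(latest_checkpoint(names))
```

Comparing the numbers as integers avoids the ordering problems of a plain string sort, where for example "100" would sort before "45".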