kubectl logs -n {namespace} {pod-name}
For example:
kubectl logs -n vcjob mindx-dls-test-default-test-0
...
I1123 16:20:11.016411 139889740781376 controller.py:458] train | step: 112 | steps/sec: 4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.339745}
train | step: 112 | steps/sec: 4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.339745}
2022-11-23 16:20:11.541361: I core/op_executors/npu_concrete_graph.cpp:84] Start consume iterator resource AnonymousIterator0 4 times
2022-11-23 16:20:11.541499: I core/op_executors/npu_concrete_graph.cpp:118] Start run ge graph 445 pin to cpu, loop size 4
2022-11-23 16:20:11.565552: I core/op_executors/npu_concrete_graph.cpp:92] Iterator resource AnonymousIterator0 consume 4 times done with status OK
I1123 16:20:12.046172 139889740781376 controller.py:458] train | step: 116 | steps/sec: 4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.389724}
train | step: 116 | steps/sec: 4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.389724}
2022-11-23 16:20:12.542817: I core/op_executors/npu_concrete_graph.cpp:84] Start consume iterator resource AnonymousIterator0 4 times
2022-11-23 16:20:12.542937: I core/op_executors/npu_concrete_graph.cpp:118] Start run ge graph 445 pin to cpu, loop size 4
2022-11-23 16:20:12.571535: I core/op_executors/npu_concrete_graph.cpp:92] Iterator resource AnonymousIterator0 consume 4 times done with status OK
I1123 16:20:13.038832 139889740781376 controller.py:458] train | step: 120 | steps/sec: 4.2 | output: {'train_accuracy': 0.0, 'train_loss': 12.421794}
train | step: 120 | steps/sec: 4.2 | output: {'train_accuracy': 0.0, 'train_loss': 12.421794}
2022-11-23 16:20:13.559254: I core/op_executors/npu_concrete_graph.cpp:84] Start consume iterator resource AnonymousIterator0 4 times
2022-11-23 16:20:13.559394: I core/op_executors/npu_concrete_graph.cpp:118] Start run ge graph 445 pin to cpu, loop size 4
2022-11-23 16:20:13.604791: I core/op_executors/npu_concrete_graph.cpp:92] Iterator resource AnonymousIterator0 consume 4 times done with status OK
I1123 16:20:14.052418 139889740781376 controller.py:458] train | step: 124 | steps/sec: 4.1 | output: {'train_accuracy': 0.0, 'train_loss': 12.335646}
train | step: 124 | steps/sec: 4.1 | output: {'train_accuracy': 0.0, 'train_loss': 12.335646}
2022-11-23 16:20:14.555126: I core/op_executors/npu_concrete_graph.cpp:84] Start consume iterator resource AnonymousIterator0 4 times
2022-11-23 16:20:14.555217: I core/op_executors/npu_concrete_graph.cpp:118] Start run ge graph 445 pin to cpu, loop size 4
2022-11-23 16:20:14.601171: I core/op_executors/npu_concrete_graph.cpp:92] Iterator resource AnonymousIterator0 consume 4 times done with status OK
I1123 16:20:15.058790 139889740781376 controller.py:458] train | step: 128 | steps/sec: 4.1 | output: {'train_accuracy': 0.0, 'train_loss': 12.415506}
train | step: 128 | steps/sec: 4.1 | output: {'train_accuracy': 0.0, 'train_loss': 12.415506}
I1123 16:20:15.228246 139889740781376 resnet_ctl_imagenet_main.py:191] Run stats: {'step_timestamp_log': ['BatchTimestamp<batch_index: 0, timestamp: 1669191532.9730577>', 'BatchTimestamp<batch_index: 100, timestamp: 1669191607.7925153>'], 'train_finish_time': 1669191615.2273297, 'avg_exp_per_second': 24.973437296848516}
2022-11-23 16:20:15.232802: I core/npu_logger.cpp:58] Stopping npu stdout receiver of device 0
2022-11-23 16:20:15.232901: I core/npu_device.cpp:122] Stopping iterator resource provider for AnonymousMultiDeviceIterator0
2022-11-23 16:20:15.233013: I core/npu_device.cpp:122] Stopping iterator resource provider for AnonymousIterator0
2022-11-23 16:20:15.235151: I core/npu_wrapper.cpp:230] Stop tensorflow model parser succeed
2022-11-23 16:20:18.289648: I core/npu_wrapper.cpp:240] Stop graph engine succeed
...
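The training progress in a log like the one above can be pulled out programmatically. The following is a minimal sketch, assuming the log has been saved to a file (for example by redirecting the `kubectl logs` output) and that the `train | step: ... | output: {...}` line format matches the sample shown here; the function name `parse_train_steps` is hypothetical and not part of any tool mentioned in this document.

```python
import re

# Matches the "train | step: N | steps/sec: X | output: {... 'train_loss': L}"
# portion of a line, with or without the leading glog-style prefix.
TRAIN_LINE = re.compile(
    r"train \| step:\s*(\d+) \| steps/sec:\s*([\d.]+) \| output: .*?'train_loss': ([\d.]+)"
)


def parse_train_steps(log_text):
    """Return {step: train_loss}. Each step is printed twice in the sample
    log (once with a prefix, once without), so duplicates collapse here."""
    losses = {}
    for match in TRAIN_LINE.finditer(log_text):
        step, _steps_per_sec, loss = match.groups()
        losses[int(step)] = float(loss)
    return losses


# Two steps copied from the sample output above.
sample = (
    "I1123 16:20:11.016411 139889740781376 controller.py:458] "
    "train | step: 112 | steps/sec: 4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.339745}\n"
    "train | step: 112 | steps/sec: 4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.339745}\n"
    "I1123 16:20:12.046172 139889740781376 controller.py:458] "
    "train | step: 116 | steps/sec: 4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.389724}\n"
)
print(parse_train_steps(sample))
```

Keying the result by step number is a deliberate choice: it avoids counting each step twice when the same line appears both with and without the logger prefix.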
drwxr-xr-x 1 root root 4096 Dec 2 11:36 ./
drwxrwxrwx 1 root root 4096 Dec 2 11:36 ../
-rw-r--r--. 1 root root 999 Dec 2 11:36 checkpoint
-rw-r--r--. 1 root root 306986892 Dec 2 11:35 ckpt-111.data-00000-of-00001
-rw-r--r--. 1 root root 44311 Dec 2 11:35 ckpt-111.index
-rw-r--r--. 1 root root 306986892 Dec 2 11:36 ckpt-128.data-00000-of-00001
-rw-r--r--. 1 root root 44311 Dec 2 11:36 ckpt-128.index
kubectl logs -n {namespace} {pod-name}
For example:
kubectl logs -n vcjob mindx-dls-test-default-test-0
[gpu id: 0 ] Test: [77/85] Time 0.117 ( 0.281) Loss 1.073741e+01 (1.078090e+01) Acc@1 0.00 ( 0.02) Acc@5 0.00 ( 0.12)
[gpu id: 0 ] Test: [78/85] Time 0.114 ( 0.279) Loss 1.072909e+01 (1.078015e+01) Acc@1 0.00 ( 0.02) Acc@5 0.00 ( 0.12)
[gpu id: 0 ] Test: [79/85] Time 0.115 ( 0.277) Loss 1.073733e+01 (1.077953e+01) Acc@1 0.00 ( 0.02) Acc@5 0.20 ( 0.12)
[gpu id: 0 ] Test: [80/85] Time 2.385 ( 0.306) Loss 1.087646e+01 (1.078090e+01) Acc@1 0.00 ( 0.02) Acc@5 0.00 ( 0.12)
[gpu id: 0 ] Test: [81/85] Time 1.139 ( 0.318) Loss 1.075754e+01 (1.078058e+01) Acc@1 0.00 ( 0.02) Acc@5 0.39 ( 0.12)
[gpu id: 0 ] Test: [82/85] Time 0.115 ( 0.315) Loss 1.068419e+01 (1.077925e+01) Acc@1 0.00 ( 0.02) Acc@5 0.20 ( 0.13)
[gpu id: 0 ] Test: [83/85] Time 0.129 ( 0.313) Loss 1.075079e+01 (1.077887e+01) Acc@1 0.00 ( 0.02) Acc@5 0.20 ( 0.13)
[gpu id: 0 ] Test: [84/85] Time 0.134 ( 0.310) Loss 1.093459e+01 (1.078095e+01) Acc@1 0.00 ( 0.02) Acc@5 0.39 ( 0.13)
[gpu id: 0 ] [AVG-ACC] * Acc@1 0.016 Acc@5 0.130
validate acc1 tensor(0.0156, device='npu:0')
Complete 90 epoch training, take time:1.05h
...
drwxrwx--- 2 root root 4096 Mar 4 19:28 ./
drwxrwx--- 4 root root 4096 Mar 4 19:28 ../
-rw-rw---- 1 root root 102489869 Mar 4 19:28 checkpoint_npu0model_best.pth.tar
-rw-rw---- 1 root root 102489869 Mar 4 19:28 checkpoint_npu0.pth.tar
...
To convert the generated model files, refer to the "Model Inference" section of the PyTorch ResNet-50 model on ModelZoo.
kubectl logs -n {namespace} {pod-name}
For example:
kubectl logs -n vcjob mindx-dls-test-default-test-0
...
2023-06-09 17:55:04,837:INFO:epoch: [70/90] loss: 1.541062, epoch time: 7.563 s, per step time: 157.554 ms
2023-06-09 17:55:10,540:INFO:epoch: [71/90] loss: 1.544771, epoch time: 5.702 s, per step time: 118.796 ms
2023-06-09 17:55:16,347:INFO:epoch: [72/90] loss: 1.506525, epoch time: 5.807 s, per step time: 120.979 ms
2023-06-09 17:55:24,904:INFO:epoch: [73/90] loss: 1.519342, epoch time: 8.556 s, per step time: 178.260 ms
2023-06-09 17:55:29,887:INFO:epoch: [74/90] loss: 1.387423, epoch time: 4.982 s, per step time: 103.783 ms
2023-06-09 17:55:39,785:INFO:epoch: [75/90] loss: 1.440862, epoch time: 9.897 s, per step time: 206.194 ms
2023-06-09 17:55:48,780:INFO:epoch: [76/90] loss: 1.431275, epoch time: 8.995 s, per step time: 187.399 ms
2023-06-09 17:55:55,764:INFO:epoch: [77/90] loss: 1.411003, epoch time: 6.984 s, per step time: 145.492 ms
2023-06-09 17:56:03,962:INFO:epoch: [78/90] loss: 1.457689, epoch time: 8.198 s, per step time: 170.783 ms
2023-06-09 17:56:11,517:INFO:epoch: [79/90] loss: 1.410896, epoch time: 7.554 s, per step time: 157.372 ms
2023-06-09 17:56:16,643:INFO:epoch: [80/90] loss: 1.517990, epoch time: 5.126 s, per step time: 106.789 ms
2023-06-09 17:56:23,364:INFO:epoch: [81/90] loss: 1.342399, epoch time: 6.720 s, per step time: 140.005 ms
2023-06-09 17:56:31,835:INFO:epoch: [82/90] loss: 1.352396, epoch time: 8.471 s, per step time: 176.470 ms
2023-06-09 17:56:36,971:INFO:epoch: [83/90] loss: 1.358075, epoch time: 5.135 s, per step time: 106.984 ms
2023-06-09 17:56:44,259:INFO:epoch: [84/90] loss: 1.400720, epoch time: 7.288 s, per step time: 151.838 ms
2023-06-09 17:56:52,868:INFO:epoch: [85/90] loss: 1.371813, epoch time: 8.608 s, per step time: 179.339 ms
2023-06-09 17:56:57,613:INFO:epoch: [86/90] loss: 1.303416, epoch time: 4.745 s, per step time: 98.858 ms
2023-06-09 17:57:04,177:INFO:epoch: [87/90] loss: 1.290425, epoch time: 6.564 s, per step time: 136.744 ms
2023-06-09 17:57:11,797:INFO:epoch: [88/90] loss: 1.298486, epoch time: 7.619 s, per step time: 158.738 ms
2023-06-09 17:57:16,807:INFO:epoch: [89/90] loss: 1.297104, epoch time: 5.009 s, per step time: 104.363 ms
2023-06-09 17:57:25,568:INFO:epoch: [90/90] loss: 1.401816, epoch time: 8.759 s, per step time: 182.486 ms
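To check at a glance whether a job reached its final epoch, the per-epoch lines in a log like the one above can be summarized. This is only a sketch: it assumes the `epoch: [n/N] loss: ..., epoch time: ... s, per step time: ... ms` format seen in this sample, and `summarize_epochs` is a hypothetical helper name introduced for illustration.

```python
import re

# One match per epoch line: current epoch, total epochs, loss,
# epoch time in seconds, per-step time in milliseconds.
EPOCH_LINE = re.compile(
    r"epoch: \[(\d+)/(\d+)\] loss: ([\d.]+), "
    r"epoch time: ([\d.]+) s, per step time: ([\d.]+) ms"
)


def summarize_epochs(log_text):
    """Return a small summary dict, or None if no epoch line is found."""
    rows = [
        (int(epoch), int(total), float(loss), float(step_ms))
        for epoch, total, loss, _epoch_s, step_ms in EPOCH_LINE.findall(log_text)
    ]
    if not rows:
        return None
    last_epoch, total_epochs, last_loss, _ = rows[-1]
    return {
        "last_epoch": last_epoch,
        "finished": last_epoch == total_epochs,
        "last_loss": last_loss,
        "mean_step_ms": sum(row[3] for row in rows) / len(rows),
    }


# The last two epochs from the sample output above.
sample_epochs = (
    "2023-06-09 17:57:16,807:INFO:epoch: [89/90] loss: 1.297104, "
    "epoch time: 5.009 s, per step time: 104.363 ms\n"
    "2023-06-09 17:57:25,568:INFO:epoch: [90/90] loss: 1.401816, "
    "epoch time: 8.759 s, per step time: 182.486 ms\n"
)
summary = summarize_epochs(sample_epochs)
print(summary)
```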
drwx------ 2 root root 4096 Dec 21 15:35 ./
drwxrwxrwx 10 root root 4096 Dec 21 15:26 ../
-r-------- 1 root root 188546464 Dec 21 15:31 resnet50-45_234.ckpt
-r-------- 1 root root 188546464 Dec 21 15:31 resnet50-50_234.ckpt
-r-------- 1 root root 188546464 Dec 21 15:32 resnet50-55_234.ckpt
-r-------- 1 root root 188546464 Dec 21 15:32 resnet50-60_234.ckpt
-r-------- 1 root root 188546464 Dec 21 15:33 resnet50-65_234.ckpt
-r-------- 1 root root 188546464 Dec 21 15:33 resnet50-70_234.ckpt
-r-------- 1 root root 188546464 Dec 21 15:33 resnet50-75_234.ckpt
-r-------- 1 root root 188546464 Dec 21 15:34 resnet50-80_234.ckpt
-r-------- 1 root root 188546464 Dec 21 15:34 resnet50-85_234.ckpt
-r-------- 1 root root 188546464 Dec 21 15:35 resnet50-90_234.ckpt
-rw------- 1 root root 769071 Dec 21 15:28 resnet50-graph.meta
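When several checkpoints are saved, as in the listing above, a script often needs the most recent one. The sketch below assumes the file names follow the pattern `resnet50-{epoch}_{step}.ckpt`, which is inferred from this listing rather than stated in the document; `latest_checkpoint` is a hypothetical helper name.

```python
import re

# Assumed naming pattern, inferred from the listing: resnet50-<epoch>_<step>.ckpt
CKPT_NAME = re.compile(r"^resnet50-(\d+)_(\d+)\.ckpt$")


def latest_checkpoint(filenames):
    """Return the name with the highest (epoch, step), or None if none match."""
    best_key, best_name = None, None
    for name in filenames:
        match = CKPT_NAME.match(name)
        if match is None:
            continue  # skips entries such as resnet50-graph.meta
        key = (int(match.group(1)), int(match.group(2)))
        if best_key is None or key > best_key:
            best_key, best_name = key, name
    return best_name


# A few names copied from the listing above.
names = [
    "resnet50-45_234.ckpt",
    "resnet50-90_234.ckpt",
    "resnet50-85_234.ckpt",
    "resnet50-graph.meta",
]
print(latest_checkpoint(names))
```

Comparing the numeric (epoch, step) pair rather than the raw string avoids the lexical ordering problem in which, for example, an epoch 100 file would sort before an epoch 90 file.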