kubectl logs -n {namespace} {pod-name}
For example:
kubectl logs -n vcjob mindx-dls-test-default-test-0
...
INFO:tensorflow:global_step/sec: 62.5989
I0225 15:18:04.421721 139926009867136 basic_session_run_hooks.py:692] global_step/sec: 62.5989
INFO:tensorflow:Benchmark metric: {'name': 'average_examples_per_sec', 'value': 578.4834594239737, 'unit': None, 'global_step': 108236, 'timestamp': '2023-02-25T07:18:04.422173Z', 'extras': []}
I0225 15:18:04.422242 139926009867136 logger.py:147] Benchmark metric: {'name': 'average_examples_per_sec', 'value': 578.4834594239737, 'unit': None, 'global_step': 108236, 'timestamp': '2023-02-25T07:18:04.422173Z', 'extras': []}
INFO:tensorflow:Benchmark metric: {'name': 'current_examples_per_sec', 'value': 2003.1516069064082, 'unit': None, 'global_step': 108236, 'timestamp': '2023-02-25T07:18:04.422376Z', 'extras': []}
I0225 15:18:04.422401 139926009867136 logger.py:147] Benchmark metric: {'name': 'current_examples_per_sec', 'value': 2003.1516069064082, 'unit': None, 'global_step': 108236, 'timestamp': '2023-02-25T07:18:04.422376Z', 'extras': []}
INFO:tensorflow:global_step...108236
I0225 15:18:04.423494 139926009867136 npu_hook.py:134] global_step...108236
INFO:tensorflow:global_step/sec: 62.5636
I0225 15:18:06.020128 139926009867136 basic_session_run_hooks.py:692] global_step/sec: 62.5636
INFO:tensorflow:Benchmark metric: {'name': 'average_examples_per_sec', 'value': 604.1929530633639, 'unit': None, 'global_step': 108336, 'timestamp': '2023-02-25T07:18:06.020630Z', 'extras': []}
I0225 15:18:06.020709 139926009867136 logger.py:147] Benchmark metric: {'name': 'average_examples_per_sec', 'value': 604.1929530633639, 'unit': None, 'global_step': 108336, 'timestamp': '2023-02-25T07:18:06.020630Z', 'extras': []}
INFO:tensorflow:Benchmark metric: {'name': 'current_examples_per_sec', 'value': 2001.9531828566544, 'unit': None, 'global_step': 108336, 'timestamp': '2023-02-25T07:18:06.020906Z', 'extras': []}
I0225 15:18:06.020947 139926009867136 logger.py:147] Benchmark metric: {'name': 'current_examples_per_sec', 'value': 2001.9531828566544, 'unit': None, 'global_step': 108336, 'timestamp': '2023-02-25T07:18:06.020906Z', 'extras': []}
INFO:tensorflow:global_step...108336
I0225 15:18:06.022154 139926009867136 npu_hook.py:134] global_step...108336
INFO:tensorflow:global_step/sec: 62.5896
I0225 15:18:07.617834 139926009867136 basic_session_run_hooks.py:692] global_step/sec: 62.5896
INFO:tensorflow:Benchmark metric: {'name': 'average_examples_per_sec', 'value': 628.9956940950959, 'unit': None, 'global_step': 108436, 'timestamp': '2023-02-25T07:18:07.618267Z', 'extras': []}
I0225 15:18:07.618332 139926009867136 logger.py:147] Benchmark metric: {'name': 'average_examples_per_sec', 'value': 628.9956940950959, 'unit': None, 'global_step': 108436, 'timestamp': '2023-02-25T07:18:07.618267Z', 'extras': []}
INFO:tensorflow:Benchmark metric: {'name': 'current_examples_per_sec', 'value': 2002.932789737743, 'unit': None, 'global_step': 108436, 'timestamp': '2023-02-25T07:18:07.618460Z', 'extras': []}
I0225 15:18:07.618484 139926009867136 logger.py:147] Benchmark metric: {'name': 'current_examples_per_sec', 'value': 2002.932789737743, 'unit': None, 'global_step': 108436, 'timestamp': '2023-02-25T07:18:07.618460Z', 'extras': []}
INFO:tensorflow:global_step...108436
I0225 15:18:07.619483 139926009867136 npu_hook.py:134] global_step...108436
INFO:tensorflow:global_step/sec: 62.5702
I0225 15:18:09.216053 139926009867136 basic_session_run_hooks.py:692] global_step/sec: 62.5702
INFO:tensorflow:Benchmark metric: {'name': 'average_examples_per_sec', 'value': 652.9284031222464, 'unit': None, 'global_step': 108536, 'timestamp': '2023-02-25T07:18:09.216748Z', 'extras': []}
I0225 15:18:09.216830 139926009867136 logger.py:147] Benchmark metric: {'name': 'average_examples_per_sec', 'value': 652.9284031222464, 'unit': None, 'global_step': 108536, 'timestamp': '2023-02-25T07:18:09.216748Z', 'extras': []}
INFO:tensorflow:Benchmark metric: {'name': 'current_examples_per_sec', 'value': 2001.9221283557783, 'unit': None, 'global_step': 108536, 'timestamp': '2023-02-25T07:18:09.216962Z', 'extras': []}
I0225 15:18:09.216987 139926009867136 logger.py:147] Benchmark metric: {'name': 'current_examples_per_sec', 'value': 2001.9221283557783, 'unit': None, 'global_step': 108536, 'timestamp': '2023-02-25T07:18:09.216962Z', 'extras': []}
INFO:tensorflow:global_step...108536
I0225 15:18:09.218228 139926009867136 npu_hook.py:134] global_step...108536
...
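To read only the training throughput out of a TensorFlow log like the one above, the output can be filtered. The following is a minimal sketch, assuming a POSIX shell with grep and awk available; the function name `steps_per_sec` is hypothetical, and the pattern is derived only from the `INFO:tensorflow:global_step/sec` records shown in this excerpt.

```shell
# Hypothetical helper: read log text on stdin and print each
# global_step/sec value on its own line.
# Only the "INFO:tensorflow:" copy of the record is matched, because the
# same value is repeated in the "I0225 ..." record that follows it.
steps_per_sec() {
  grep -o 'INFO:tensorflow:global_step/sec: [0-9.]*' | awk '{print $2}'
}
```

It could be used as `kubectl logs -n vcjob mindx-dls-test-default-test-0 | steps_per_sec`.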
drwxr-xr-x 3 root root 4096 Feb 27 21:36 ./
drwxr-xr-x 26 root root 4096 Feb 27 21:16 ../
-rw-r--r-- 1 root root 89 Feb 27 21:36 checkpoint
drwxr-xr-x 2 root root 4096 Feb 27 21:19 eval/
-rw-r--r-- 1 root root 48825019 Feb 27 21:36 events.out.tfevents.1677503855.default-test-tensorflow-chief-0
-rw-r--r-- 1 root root 4872113 Feb 27 21:35 graph.pbtxt
-rw-r--r-- 1 root root 204685136 Feb 27 21:17 model.ckpt-0.data-00000-of-00001
-rw-r--r-- 1 root root 16366 Feb 27 21:17 model.ckpt-0.index
-rw-r--r-- 1 root root 2159934 Feb 27 21:17 model.ckpt-0.meta
-rw-r--r-- 1 root root 204685136 Feb 27 21:19 model.ckpt-1172.data-00000-of-00001
-rw-r--r-- 1 root root 16366 Feb 27 21:19 model.ckpt-1172.index
-rw-r--r-- 1 root root 2159958 Feb 27 21:20 model.ckpt-1172.meta
-rw-r--r-- 1 root root 204685136 Feb 27 21:21 model.ckpt-2343.data-00000-of-00001
-rw-r--r-- 1 root root 16366 Feb 27 21:21 model.ckpt-2343.index
-rw-r--r-- 1 root root 2159958 Feb 27 21:21 model.ckpt-2343.meta
kubectl logs -n {namespace} {pod-name}
For example:
kubectl logs -n vcjob mindx-dls-test-default-test-0
[npu id: 0 ] Test: [1362/1369] Time 0.015 ( 0.025) Loss 6.911377e+00 (6.912836e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)
[npu id: 0 ] Test: [1363/1369] Time 0.010 ( 0.024) Loss 6.904175e+00 (6.912830e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)
[npu id: 0 ] Test: [1364/1369] Time 0.010 ( 0.024) Loss 6.907715e+00 (6.912826e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)
[npu id: 0 ] Test: [1365/1369] Time 0.010 ( 0.024) Loss 6.904297e+00 (6.912820e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)
[npu id: 0 ] Test: [1366/1369] Time 0.014 ( 0.024) Loss 6.907349e+00 (6.912816e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)
[npu id: 0 ] Test: [1367/1369] Time 0.014 ( 0.024) Loss 6.904785e+00 (6.912810e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)
[npu id: 0 ] Test: [1368/1369] Time 0.011 ( 0.024) Loss 6.908813e+00 (6.912807e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)
[npu id: 0 ] [AVG-ACC] * Acc@1 0.000 Acc@5 0.000
THPModule_npu_shutdown success.
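The summary record at the end of the PyTorch log carries the final evaluation accuracy. As a rough sketch, assuming a POSIX shell with grep and awk, the two values can be picked out as follows; the function name `final_accuracy` is hypothetical, and the pattern relies only on the `[AVG-ACC]` record format shown in this excerpt.

```shell
# Hypothetical helper: read log text on stdin and print the averaged
# Acc@1 and Acc@5 values from each "[AVG-ACC]" record, separated by a space.
final_accuracy() {
  grep -o '\[AVG-ACC] \* Acc@1 [0-9.]* Acc@5 [0-9.]*' | awk '{print $4, $6}'
}
```

It could be used as `kubectl logs -n vcjob mindx-dls-test-default-test-0 | final_accuracy`.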
drwxrwx--- 2 root root 4096 Mar 4 19:28 ./
drwxrwx--- 4 root root 4096 Mar 4 19:28 ../
-rw-rw---- 1 root root 102489869 Mar 4 19:28 checkpoint_npu0model_best.pth.tar
-rw-rw---- 1 root root 102489869 Mar 4 19:28 checkpoint_npu0.pth.tar
...
To convert the generated model files, refer to the "Model Inference" section of the ResNet-50 model for the PyTorch framework on ModelZoo.
kubectl logs -n {namespace} {pod-name}
For example:
kubectl logs -n vcjob mindx-dls-test-default-test-0
...
Train epoch time: 14614.743 ms, per step time: 2435.791 ms
epoch: 83 step: 6, loss is 6.024412
Train epoch time: 8068.264 ms, per step time: 1344.711 ms
epoch: 84 step: 6, loss is 6.022005
Train epoch time: 6966.450 ms, per step time: 1161.075 ms
epoch: 85 step: 6, loss is 6.052724
Train epoch time: 8519.337 ms, per step time: 1419.889 ms
epoch: 86 step: 6, loss is 5.9838204
Train epoch time: 2567.209 ms, per step time: 427.868 ms
epoch: 87 step: 6, loss is 5.8725405
Train epoch time: 2248.745 ms, per step time: 374.791 ms
epoch: 88 step: 6, loss is 5.9476185
Train epoch time: 1757.267 ms, per step time: 292.878 ms
epoch: 89 step: 6, loss is 5.899315
Train epoch time: 2016.577 ms, per step time: 336.096 ms
epoch: 90 step: 6, loss is 5.9367642
Train epoch time: 6509.458 ms, per step time: 1084.910 ms
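The per-step times in the MindSpore log vary considerably between epochs, so an average can be easier to read than the raw records. The following is a minimal sketch, assuming a POSIX shell with grep and awk; the function name `avg_step_time_ms` is hypothetical, and the pattern relies only on the `per step time: ... ms` format shown in this excerpt.

```shell
# Hypothetical helper: read log text on stdin and print the mean of all
# "per step time" values in milliseconds, with three decimal places.
# Prints nothing if no matching record is found.
avg_step_time_ms() {
  grep -o 'per step time: [0-9.]* ms' |
    awk '{ sum += $4; n++ } END { if (n > 0) printf "%.3f\n", sum / n }'
}
```

It could be used as `kubectl logs -n vcjob mindx-dls-test-default-test-0 | avg_step_time_ms`.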
total 739548
drwx------ 2 root root 4096 Feb 17 17:11 ./
drwx------ 3 root root 4096 Feb 17 17:00 ../
-r-------- 1 root root 189137224 Feb 17 17:06 resnet-10_1875.ckpt
-r-------- 1 root root 189137224 Feb 17 17:08 resnet-15_1875.ckpt
-r-------- 1 root root 189137224 Feb 17 17:11 resnet-20_1875.ckpt
-r-------- 1 root root 189137224 Feb 17 17:03 resnet-5_1875.ckpt
-rw------- 1 root root 720701 Feb 17 17:01 resnet-graph.meta