Viewing Training Results

TensorFlow

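  1. Run the following command to view the training results.

    kubectl logs -n {namespace} {pod-name}

    For example:

    kubectl logs -n vcjob mindx-dls-test-default-test-0

  2. Check the training log. Output similar to the following indicates that training succeeded.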

    ...
    INFO:tensorflow:global_step/sec: 62.5989
    I0225 15:18:04.421721 139926009867136 basic_session_run_hooks.py:692] global_step/sec: 62.5989
    INFO:tensorflow:Benchmark metric: {'name': 'average_examples_per_sec', 'value': 578.4834594239737, 'unit': None, 'global_step': 108236, 'timestamp': '2023-02-25T07:18:04.422173Z', 'extras': []}
    I0225 15:18:04.422242 139926009867136 logger.py:147] Benchmark metric: {'name': 'average_examples_per_sec', 'value': 578.4834594239737, 'unit': None, 'global_step': 108236, 'timestamp': '2023-02-25T07:18:04.422173Z', 'extras': []}
    INFO:tensorflow:Benchmark metric: {'name': 'current_examples_per_sec', 'value': 2003.1516069064082, 'unit': None, 'global_step': 108236, 'timestamp': '2023-02-25T07:18:04.422376Z', 'extras': []}
    I0225 15:18:04.422401 139926009867136 logger.py:147] Benchmark metric: {'name': 'current_examples_per_sec', 'value': 2003.1516069064082, 'unit': None, 'global_step': 108236, 'timestamp': '2023-02-25T07:18:04.422376Z', 'extras': []}
    INFO:tensorflow:global_step...108236
    I0225 15:18:04.423494 139926009867136 npu_hook.py:134] global_step...108236
    INFO:tensorflow:global_step/sec: 62.5636
    I0225 15:18:06.020128 139926009867136 basic_session_run_hooks.py:692] global_step/sec: 62.5636
    INFO:tensorflow:Benchmark metric: {'name': 'average_examples_per_sec', 'value': 604.1929530633639, 'unit': None, 'global_step': 108336, 'timestamp': '2023-02-25T07:18:06.020630Z', 'extras': []}
    I0225 15:18:06.020709 139926009867136 logger.py:147] Benchmark metric: {'name': 'average_examples_per_sec', 'value': 604.1929530633639, 'unit': None, 'global_step': 108336, 'timestamp': '2023-02-25T07:18:06.020630Z', 'extras': []}
    INFO:tensorflow:Benchmark metric: {'name': 'current_examples_per_sec', 'value': 2001.9531828566544, 'unit': None, 'global_step': 108336, 'timestamp': '2023-02-25T07:18:06.020906Z', 'extras': []}
    I0225 15:18:06.020947 139926009867136 logger.py:147] Benchmark metric: {'name': 'current_examples_per_sec', 'value': 2001.9531828566544, 'unit': None, 'global_step': 108336, 'timestamp': '2023-02-25T07:18:06.020906Z', 'extras': []}
    INFO:tensorflow:global_step...108336
    I0225 15:18:06.022154 139926009867136 npu_hook.py:134] global_step...108336
    INFO:tensorflow:global_step/sec: 62.5896
    I0225 15:18:07.617834 139926009867136 basic_session_run_hooks.py:692] global_step/sec: 62.5896
    INFO:tensorflow:Benchmark metric: {'name': 'average_examples_per_sec', 'value': 628.9956940950959, 'unit': None, 'global_step': 108436, 'timestamp': '2023-02-25T07:18:07.618267Z', 'extras': []}
    I0225 15:18:07.618332 139926009867136 logger.py:147] Benchmark metric: {'name': 'average_examples_per_sec', 'value': 628.9956940950959, 'unit': None, 'global_step': 108436, 'timestamp': '2023-02-25T07:18:07.618267Z', 'extras': []}
    INFO:tensorflow:Benchmark metric: {'name': 'current_examples_per_sec', 'value': 2002.932789737743, 'unit': None, 'global_step': 108436, 'timestamp': '2023-02-25T07:18:07.618460Z', 'extras': []}
    I0225 15:18:07.618484 139926009867136 logger.py:147] Benchmark metric: {'name': 'current_examples_per_sec', 'value': 2002.932789737743, 'unit': None, 'global_step': 108436, 'timestamp': '2023-02-25T07:18:07.618460Z', 'extras': []}
    INFO:tensorflow:global_step...108436
    I0225 15:18:07.619483 139926009867136 npu_hook.py:134] global_step...108436
    INFO:tensorflow:global_step/sec: 62.5702
    I0225 15:18:09.216053 139926009867136 basic_session_run_hooks.py:692] global_step/sec: 62.5702
    INFO:tensorflow:Benchmark metric: {'name': 'average_examples_per_sec', 'value': 652.9284031222464, 'unit': None, 'global_step': 108536, 'timestamp': '2023-02-25T07:18:09.216748Z', 'extras': []}
    I0225 15:18:09.216830 139926009867136 logger.py:147] Benchmark metric: {'name': 'average_examples_per_sec', 'value': 652.9284031222464, 'unit': None, 'global_step': 108536, 'timestamp': '2023-02-25T07:18:09.216748Z', 'extras': []}
    INFO:tensorflow:Benchmark metric: {'name': 'current_examples_per_sec', 'value': 2001.9221283557783, 'unit': None, 'global_step': 108536, 'timestamp': '2023-02-25T07:18:09.216962Z', 'extras': []}
    I0225 15:18:09.216987 139926009867136 logger.py:147] Benchmark metric: {'name': 'current_examples_per_sec', 'value': 2001.9221283557783, 'unit': None, 'global_step': 108536, 'timestamp': '2023-02-25T07:18:09.216962Z', 'extras': []}
    INFO:tensorflow:global_step...108536
    I0225 15:18:09.218228 139926009867136 npu_hook.py:134] global_step...108536
    ...

  3. Go to the model output directory and check the generated model files.

    drwxr-xr-x  3 root root      4096 Feb 27 21:36 ./
    drwxr-xr-x 26 root root      4096 Feb 27 21:16 ../
    -rw-r--r--  1 root root        89 Feb 27 21:36 checkpoint
    drwxr-xr-x  2 root root      4096 Feb 27 21:19 eval/
    -rw-r--r--  1 root root  48825019 Feb 27 21:36 events.out.tfevents.1677503855.default-test-tensorflow-chief-0
    -rw-r--r--  1 root root   4872113 Feb 27 21:35 graph.pbtxt
    -rw-r--r--  1 root root 204685136 Feb 27 21:17 model.ckpt-0.data-00000-of-00001
    -rw-r--r--  1 root root     16366 Feb 27 21:17 model.ckpt-0.index
    -rw-r--r--  1 root root   2159934 Feb 27 21:17 model.ckpt-0.meta
    -rw-r--r--  1 root root 204685136 Feb 27 21:19 model.ckpt-1172.data-00000-of-00001
    -rw-r--r--  1 root root     16366 Feb 27 21:19 model.ckpt-1172.index
    -rw-r--r--  1 root root   2159958 Feb 27 21:20 model.ckpt-1172.meta
    -rw-r--r--  1 root root 204685136 Feb 27 21:21 model.ckpt-2343.data-00000-of-00001
    -rw-r--r--  1 root root     16366 Feb 27 21:21 model.ckpt-2343.index
    -rw-r--r--  1 root root   2159958 Feb 27 21:21 model.ckpt-2343.meta
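
The log check in step 2 can also be scripted. The following is a minimal sketch, assuming the output of kubectl logs has been captured as text and that the "Benchmark metric" lines keep the format shown in the excerpt. The names parse_benchmark_metrics and SAMPLE_LOG are illustrative only and are not part of TensorFlow or of the training scripts.

```python
import ast
import re

# Captures the dict literal that follows "Benchmark metric:" on each log line.
METRIC_RE = re.compile(r"Benchmark metric: (\{.*?\})")


def parse_benchmark_metrics(log_text):
    """Return de-duplicated (global_step, name, value) tuples in log order."""
    metrics = []
    for payload in METRIC_RE.findall(log_text):
        record = ast.literal_eval(payload)  # the payload is a Python dict literal
        entry = (record["global_step"], record["name"], record["value"])
        if entry not in metrics:  # each metric appears twice in the excerpt
            metrics.append(entry)
    return metrics


# Two lines copied from the log excerpt (same metric, two logger prefixes).
SAMPLE_LOG = (
    "INFO:tensorflow:Benchmark metric: {'name': 'current_examples_per_sec', "
    "'value': 2003.1516069064082, 'unit': None, 'global_step': 108236, "
    "'timestamp': '2023-02-25T07:18:04.422376Z', 'extras': []}\n"
    "I0225 15:18:04.422401 139926009867136 logger.py:147] Benchmark metric: "
    "{'name': 'current_examples_per_sec', 'value': 2003.1516069064082, "
    "'unit': None, 'global_step': 108236, "
    "'timestamp': '2023-02-25T07:18:04.422376Z', 'extras': []}\n"
)

if __name__ == "__main__":
    for step, name, value in parse_benchmark_metrics(SAMPLE_LOG):
        print(f"step={step} {name}={value:.1f}")
```

An empty result means that no benchmark lines were found in the captured text. That can indicate that training has not reached the logging stage yet, so it is worth reading the raw log before drawing a conclusion.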

PyTorch

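  1. Run the following command to view the training results.

    kubectl logs -n {namespace} {pod-name}

    For example:

    kubectl logs -n vcjob mindx-dls-test-default-test-0

  2. Check the training log. Output similar to the following indicates that training succeeded.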

    [npu id: 0 ] Test: [1362/1369] Time  0.015 ( 0.025) Loss 6.911377e+00 (6.912836e+00) Acc@1   0.00 (  0.00) Acc@5   0.00 (  0.00)
    [npu id: 0 ] Test: [1363/1369] Time  0.010 ( 0.024) Loss 6.904175e+00 (6.912830e+00) Acc@1   0.00 (  0.00) Acc@5   0.00 (  0.00)
    [npu id: 0 ] Test: [1364/1369] Time  0.010 ( 0.024) Loss 6.907715e+00 (6.912826e+00) Acc@1   0.00 (  0.00) Acc@5   0.00 (  0.00)
    [npu id: 0 ] Test: [1365/1369] Time  0.010 ( 0.024) Loss 6.904297e+00 (6.912820e+00) Acc@1   0.00 (  0.00) Acc@5   0.00 (  0.00)
    [npu id: 0 ] Test: [1366/1369] Time  0.014 ( 0.024) Loss 6.907349e+00 (6.912816e+00) Acc@1   0.00 (  0.00) Acc@5   0.00 (  0.00)
    [npu id: 0 ] Test: [1367/1369] Time  0.014 ( 0.024) Loss 6.904785e+00 (6.912810e+00) Acc@1   0.00 (  0.00) Acc@5   0.00 (  0.00)
    [npu id: 0 ] Test: [1368/1369] Time  0.011 ( 0.024) Loss 6.908813e+00 (6.912807e+00) Acc@1   0.00 (  0.00) Acc@5   0.00 (  0.00)
    [npu id: 0 ] [AVG-ACC] * Acc@1 0.000 Acc@5 0.000
    THPModule_npu_shutdown success.

  3. Go to the model output directory and check the generated model files.

    drwxrwx--- 2 root root      4096 Mar  4 19:28 ./
    drwxrwx--- 4 root root      4096 Mar  4 19:28 ../
    -rw-rw---- 1 root root 102489869 Mar  4 19:28 checkpoint_npu0model_best.pth.tar
    -rw-rw---- 1 root root 102489869 Mar  4 19:28 checkpoint_npu0.pth.tar
    ...

    To convert the generated model files, refer to the "Model Inference" section of the PyTorch ResNet-50 model on ModelZoo.
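
The summary line at the end of the log in step 2 can be extracted with a short script. The following is a minimal sketch, assuming the log text has been captured from kubectl logs and that the "[AVG-ACC]" line keeps the format shown in the excerpt. The names parse_avg_acc and SAMPLE_LOG are illustrative and do not come from PyTorch or from the training script.

```python
import re

# Matches the summary line, e.g. "[npu id: 0 ] [AVG-ACC] * Acc@1 0.000 Acc@5 0.000".
AVG_ACC_RE = re.compile(r"\[AVG-ACC\] \* Acc@1\s+([0-9.]+)\s+Acc@5\s+([0-9.]+)")


def parse_avg_acc(log_text):
    """Return (acc1, acc5) from the last [AVG-ACC] line, or None if absent."""
    matches = AVG_ACC_RE.findall(log_text)
    if not matches:
        return None
    acc1, acc5 = matches[-1]
    return float(acc1), float(acc5)


# The last two lines of the log excerpt, copied verbatim.
SAMPLE_LOG = (
    "[npu id: 0 ] [AVG-ACC] * Acc@1 0.000 Acc@5 0.000\n"
    "THPModule_npu_shutdown success.\n"
)

if __name__ == "__main__":
    result = parse_avg_acc(SAMPLE_LOG)
    if result is None:
        print("no [AVG-ACC] summary found")
    else:
        print(f"Acc@1={result[0]:.3f} Acc@5={result[1]:.3f}")
```

The function returns the last match so that a log containing several evaluation rounds reports the most recent one.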

MindSpore

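  1. Run the following command to view the training results.

    kubectl logs -n {namespace} {pod-name}

    For example:

    kubectl logs -n vcjob mindx-dls-test-default-test-0

  2. Check the training log. Output similar to the following indicates that training succeeded.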

    ...
    Train epoch time: 14614.743 ms, per step time: 2435.791 ms
    epoch: 83 step: 6, loss is 6.024412
    Train epoch time: 8068.264 ms, per step time: 1344.711 ms
    epoch: 84 step: 6, loss is 6.022005
    Train epoch time: 6966.450 ms, per step time: 1161.075 ms
    epoch: 85 step: 6, loss is 6.052724
    Train epoch time: 8519.337 ms, per step time: 1419.889 ms
    epoch: 86 step: 6, loss is 5.9838204
    Train epoch time: 2567.209 ms, per step time: 427.868 ms
    epoch: 87 step: 6, loss is 5.8725405
    Train epoch time: 2248.745 ms, per step time: 374.791 ms
    epoch: 88 step: 6, loss is 5.9476185
    Train epoch time: 1757.267 ms, per step time: 292.878 ms
    epoch: 89 step: 6, loss is 5.899315
    Train epoch time: 2016.577 ms, per step time: 336.096 ms
    epoch: 90 step: 6, loss is 5.9367642
    Train epoch time: 6509.458 ms, per step time: 1084.910 ms

  3. Go to the model output directory and check the generated model files.

    total 739548
    drwx------ 2 root root      4096 Feb 17 17:11 ./
    drwx------ 3 root root      4096 Feb 17 17:00 ../
    -r-------- 1 root root 189137224 Feb 17 17:06 resnet-10_1875.ckpt
    -r-------- 1 root root 189137224 Feb 17 17:08 resnet-15_1875.ckpt
    -r-------- 1 root root 189137224 Feb 17 17:11 resnet-20_1875.ckpt
    -r-------- 1 root root 189137224 Feb 17 17:03 resnet-5_1875.ckpt
    -rw------- 1 root root    720701 Feb 17 17:01 resnet-graph.meta
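
The per-epoch loss lines from step 2 can be collected to see how the loss changes over time. The following is a minimal sketch, assuming the log text has been captured from kubectl logs and that the "epoch: N step: M, loss is X" lines keep the format shown in the excerpt. The names parse_losses and SAMPLE_LOG are illustrative and are not part of MindSpore.

```python
import re

# Matches lines such as "epoch: 83 step: 6, loss is 6.024412".
LOSS_RE = re.compile(r"epoch: (\d+) step: (\d+), loss is ([0-9.eE+-]+)")


def parse_losses(log_text):
    """Return (epoch, step, loss) tuples in the order they appear in the log."""
    return [
        (int(epoch), int(step), float(loss))
        for epoch, step, loss in LOSS_RE.findall(log_text)
    ]


# Three lines copied from the log excerpt.
SAMPLE_LOG = (
    "epoch: 83 step: 6, loss is 6.024412\n"
    "Train epoch time: 8068.264 ms, per step time: 1344.711 ms\n"
    "epoch: 84 step: 6, loss is 6.022005\n"
)

if __name__ == "__main__":
    for epoch, step, loss in parse_losses(SAMPLE_LOG):
        print(f"epoch={epoch} step={step} loss={loss}")
```

The returned list can be passed to a plotting tool or compared against a threshold. What counts as an acceptable loss depends on the model and dataset and is not defined here.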