Viewing Training Results

TensorFlow

  1. Run the following command to view the training result.

    kubectl logs -n {namespace} {pod-name}

    For example:

    kubectl logs -n vcjob mindx-dls-test-default-test-0

  2. Check the training log. Output similar to the following indicates that training succeeded.

    ...
    I1123 16:20:11.016411 139889740781376 controller.py:458] train | step:    112 | steps/sec:    4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.339745}
    train | step:    112 | steps/sec:    4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.339745}
    2022-11-23 16:20:11.541361: I core/op_executors/npu_concrete_graph.cpp:84] Start consume iterator resource AnonymousIterator0 4 times
    2022-11-23 16:20:11.541499: I core/op_executors/npu_concrete_graph.cpp:118] Start run ge graph 445 pin to cpu, loop size 4
    2022-11-23 16:20:11.565552: I core/op_executors/npu_concrete_graph.cpp:92] Iterator resource AnonymousIterator0 consume 4 times done with status OK
    I1123 16:20:12.046172 139889740781376 controller.py:458] train | step:    116 | steps/sec:    4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.389724}
    train | step:    116 | steps/sec:    4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.389724}
    2022-11-23 16:20:12.542817: I core/op_executors/npu_concrete_graph.cpp:84] Start consume iterator resource AnonymousIterator0 4 times
    2022-11-23 16:20:12.542937: I core/op_executors/npu_concrete_graph.cpp:118] Start run ge graph 445 pin to cpu, loop size 4
    2022-11-23 16:20:12.571535: I core/op_executors/npu_concrete_graph.cpp:92] Iterator resource AnonymousIterator0 consume 4 times done with status OK
    I1123 16:20:13.038832 139889740781376 controller.py:458] train | step:    120 | steps/sec:    4.2 | output: {'train_accuracy': 0.0, 'train_loss': 12.421794}
    train | step:    120 | steps/sec:    4.2 | output: {'train_accuracy': 0.0, 'train_loss': 12.421794}
    2022-11-23 16:20:13.559254: I core/op_executors/npu_concrete_graph.cpp:84] Start consume iterator resource AnonymousIterator0 4 times
    2022-11-23 16:20:13.559394: I core/op_executors/npu_concrete_graph.cpp:118] Start run ge graph 445 pin to cpu, loop size 4
    2022-11-23 16:20:13.604791: I core/op_executors/npu_concrete_graph.cpp:92] Iterator resource AnonymousIterator0 consume 4 times done with status OK
    I1123 16:20:14.052418 139889740781376 controller.py:458] train | step:    124 | steps/sec:    4.1 | output: {'train_accuracy': 0.0, 'train_loss': 12.335646}
    train | step:    124 | steps/sec:    4.1 | output: {'train_accuracy': 0.0, 'train_loss': 12.335646}
    2022-11-23 16:20:14.555126: I core/op_executors/npu_concrete_graph.cpp:84] Start consume iterator resource AnonymousIterator0 4 times
    2022-11-23 16:20:14.555217: I core/op_executors/npu_concrete_graph.cpp:118] Start run ge graph 445 pin to cpu, loop size 4
    2022-11-23 16:20:14.601171: I core/op_executors/npu_concrete_graph.cpp:92] Iterator resource AnonymousIterator0 consume 4 times done with status OK
    I1123 16:20:15.058790 139889740781376 controller.py:458] train | step:    128 | steps/sec:    4.1 | output: {'train_accuracy': 0.0, 'train_loss': 12.415506}
    train | step:    128 | steps/sec:    4.1 | output: {'train_accuracy': 0.0, 'train_loss': 12.415506}
    I1123 16:20:15.228246 139889740781376 resnet_ctl_imagenet_main.py:191] Run stats:
    {'step_timestamp_log': ['BatchTimestamp<batch_index: 0, timestamp: 1669191532.9730577>', 'BatchTimestamp<batch_index: 100, timestamp: 1669191607.7925153>'], 'train_finish_time': 1669191615.2273297, 'avg_exp_per_second': 24.973437296848516}
    2022-11-23 16:20:15.232802: I core/npu_logger.cpp:58] Stopping npu stdout receiver of device 0
    2022-11-23 16:20:15.232901: I core/npu_device.cpp:122] Stopping iterator resource provider for AnonymousMultiDeviceIterator0
    2022-11-23 16:20:15.233013: I core/npu_device.cpp:122] Stopping iterator resource provider for AnonymousIterator0
    2022-11-23 16:20:15.235151: I core/npu_wrapper.cpp:230] Stop tensorflow model parser succeed
    2022-11-23 16:20:18.289648: I core/npu_wrapper.cpp:240] Stop graph engine succeed
    ...

  3. Go to the model output directory and check the generated model files.

    drwxr-xr-x 1 root root       4096 Dec  2 11:36 ./
    drwxrwxrwx 1 root root       4096 Dec  2 11:36 ../
    -rw-r--r--. 1 root root       999 Dec  2 11:36 checkpoint
    -rw-r--r--. 1 root root 306986892 Dec  2 11:35 ckpt-111.data-00000-of-00001
    -rw-r--r--. 1 root root     44311 Dec  2 11:35 ckpt-111.index
    -rw-r--r--. 1 root root 306986892 Dec  2 11:36 ckpt-128.data-00000-of-00001
    -rw-r--r--. 1 root root     44311 Dec  2 11:36 ckpt-128.index
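
When the log is long, it can help to pull out the last reported step and loss instead of reading it by eye. The following is a minimal sketch in Python, assuming the log has been saved as text (for example by redirecting the output of kubectl logs to a file). The function name summarize_tf_log and the regular expression are hypothetical and only match the line format shown in the sample log above. Treating the "Run stats:" line as the end-of-training marker is likewise an assumption drawn from that sample.

```python
import re

# Matches lines such as:
#   train | step:    128 | steps/sec:    4.1 | output: {... 'train_loss': 12.415506}
# The pattern is an assumption based on the sample log, not a documented format.
STEP_RE = re.compile(r"train \| step:\s*(\d+) \|.*?'train_loss': ([0-9.eE+-]+)")


def summarize_tf_log(text):
    """Return (last_step, last_loss, saw_run_stats) for a TensorFlow training log."""
    last_step, last_loss = None, None
    for match in STEP_RE.finditer(text):
        last_step = int(match.group(1))
        last_loss = float(match.group(2))
    # "Run stats:" is printed once after the final step in the sample log.
    saw_run_stats = "Run stats:" in text
    return last_step, last_loss, saw_run_stats


sample = """\
I1123 16:20:15.058790 139889740781376 controller.py:458] train | step:    128 | steps/sec:    4.1 | output: {'train_accuracy': 0.0, 'train_loss': 12.415506}
train | step:    128 | steps/sec:    4.1 | output: {'train_accuracy': 0.0, 'train_loss': 12.415506}
I1123 16:20:15.228246 139889740781376 resnet_ctl_imagenet_main.py:191] Run stats:
"""

print(summarize_tf_log(sample))
```

In the sample output the last step is 128, which matches the ckpt-128 files in the model output directory.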

PyTorch

  1. Run the following command to view the training result.

    kubectl logs -n {namespace} {pod-name}

    For example:

    kubectl logs -n vcjob mindx-dls-test-default-test-0

  2. Check the training log. Output similar to the following indicates that training succeeded.

    [gpu id: 0 ] Test: [77/85]      Time  0.117 ( 0.281)    Loss 1.073741e+01 (1.078090e+01)        Acc@1   0.00 (  0.02)   Acc@5   0.00 (  0.12)
    [gpu id: 0 ] Test: [78/85]      Time  0.114 ( 0.279)    Loss 1.072909e+01 (1.078015e+01)        Acc@1   0.00 (  0.02)   Acc@5   0.00 (  0.12)
    [gpu id: 0 ] Test: [79/85]      Time  0.115 ( 0.277)    Loss 1.073733e+01 (1.077953e+01)        Acc@1   0.00 (  0.02)   Acc@5   0.20 (  0.12)
    [gpu id: 0 ] Test: [80/85]      Time  2.385 ( 0.306)    Loss 1.087646e+01 (1.078090e+01)        Acc@1   0.00 (  0.02)   Acc@5   0.00 (  0.12)
    [gpu id: 0 ] Test: [81/85]      Time  1.139 ( 0.318)    Loss 1.075754e+01 (1.078058e+01)        Acc@1   0.00 (  0.02)   Acc@5   0.39 (  0.12)
    [gpu id: 0 ] Test: [82/85]      Time  0.115 ( 0.315)    Loss 1.068419e+01 (1.077925e+01)        Acc@1   0.00 (  0.02)   Acc@5   0.20 (  0.13)
    [gpu id: 0 ] Test: [83/85]      Time  0.129 ( 0.313)    Loss 1.075079e+01 (1.077887e+01)        Acc@1   0.00 (  0.02)   Acc@5   0.20 (  0.13)
    [gpu id: 0 ] Test: [84/85]      Time  0.134 ( 0.310)    Loss 1.093459e+01 (1.078095e+01)        Acc@1   0.00 (  0.02)   Acc@5   0.39 (  0.13)
    [gpu id: 0 ] [AVG-ACC] * Acc@1 0.016 Acc@5 0.130
    validate acc1 tensor(0.0156, device='npu:0')
    Complete 90 epoch training, take time:1.05h
    ...

  3. Go to the model output directory and check the generated model files.

    drwxrwx--- 2 root root      4096 Mar  4 19:28 ./
    drwxrwx--- 4 root root      4096 Mar  4 19:28 ../
    -rw-rw---- 1 root root 102489869 Mar  4 19:28 checkpoint_npu0model_best.pth.tar
    -rw-rw---- 1 root root 102489869 Mar  4 19:28 checkpoint_npu0.pth.tar
    ...

    To convert the generated model files, refer to the "Model Inference" section of the PyTorch ResNet-50 model on ModelZoo.
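
The two lines that matter most in the PyTorch log are the "[AVG-ACC]" summary and the "Complete ... epoch training" line. The sketch below, in Python, extracts them from saved log text. The function name summarize_pytorch_log and both regular expressions are hypothetical, written only against the sample output shown above. Other training scripts may print different text.

```python
import re

# Both patterns are assumptions derived from the sample log lines:
#   [gpu id: 0 ] [AVG-ACC] * Acc@1 0.016 Acc@5 0.130
#   Complete 90 epoch training, take time:1.05h
AVG_ACC_RE = re.compile(r"\[AVG-ACC\] \* Acc@1\s+([0-9.]+)\s+Acc@5\s+([0-9.]+)")
COMPLETE_RE = re.compile(r"Complete (\d+) epoch training")


def summarize_pytorch_log(text):
    """Return the last average accuracy and the completed epoch count, if present."""
    acc = AVG_ACC_RE.findall(text)
    done = COMPLETE_RE.search(text)
    return {
        "acc1": float(acc[-1][0]) if acc else None,
        "acc5": float(acc[-1][1]) if acc else None,
        "epochs_completed": int(done.group(1)) if done else None,
    }


sample = """\
[gpu id: 0 ] [AVG-ACC] * Acc@1 0.016 Acc@5 0.130
validate acc1 tensor(0.0156, device='npu:0')
Complete 90 epoch training, take time:1.05h
"""

print(summarize_pytorch_log(sample))
```

If "epochs_completed" is None, the completion line was not found, so the job may still be running or may have exited early.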

MindSpore

  1. Run the following command to view the training result.

    kubectl logs -n {namespace} {pod-name}

    For example:

    kubectl logs -n vcjob mindx-dls-test-default-test-0

  2. Check the training log. Output similar to the following indicates that training succeeded.

    ...
    Train epoch time: 58349.597 ms, per step time: 249.357 ms
    epoch: 83 step: 234, loss is 7.74848e-05
    Train epoch time: 3995.383 ms, per step time: 17.074 ms
    epoch: 84 step: 234, loss is 0.00019958502
    Train epoch time: 4005.581 ms, per step time: 17.118 ms
    epoch: 85 step: 234, loss is 0.020869248
    Train epoch time: 7434.552 ms, per step time: 31.772 ms
    epoch: 86 step: 234, loss is 6.217403e-05
    Train epoch time: 4048.635 ms, per step time: 17.302 ms
    epoch: 87 step: 234, loss is 6.641826e-05
    Train epoch time: 4036.470 ms, per step time: 17.250 ms
    epoch: 88 step: 234, loss is 0.00022485439
    Train epoch time: 4032.634 ms, per step time: 17.233 ms
    epoch: 89 step: 234, loss is 8.325573e-06
    Train epoch time: 4009.894 ms, per step time: 17.136 ms
    epoch: 90 step: 234, loss is 7.480194e-05
    Train epoch time: 7445.271 ms, per step time: 31.817 ms

  3. Go to the model output directory and check the generated model files.

    drwx------  2 root root      4096 Dec 21 15:35 ./
    drwxrwxrwx 10 root root      4096 Dec 21 15:26 ../
    -r--------  1 root root 188546464 Dec 21 15:31 resnet-45_234.ckpt
    -r--------  1 root root 188546464 Dec 21 15:31 resnet-50_234.ckpt
    -r--------  1 root root 188546464 Dec 21 15:32 resnet-55_234.ckpt
    -r--------  1 root root 188546464 Dec 21 15:32 resnet-60_234.ckpt
    -r--------  1 root root 188546464 Dec 21 15:33 resnet-65_234.ckpt
    -r--------  1 root root 188546464 Dec 21 15:33 resnet-70_234.ckpt
    -r--------  1 root root 188546464 Dec 21 15:33 resnet-75_234.ckpt
    -r--------  1 root root 188546464 Dec 21 15:34 resnet-80_234.ckpt
    -r--------  1 root root 188546464 Dec 21 15:34 resnet-85_234.ckpt
    -r--------  1 root root 188546464 Dec 21 15:35 resnet-90_234.ckpt
    -rw-------  1 root root    769071 Dec 21 15:28 resnet-graph.meta
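
The MindSpore output directory holds one checkpoint per saved epoch, so a later step usually needs only the most recent one. The following Python sketch picks it from a list of file names. It assumes the names follow the prefix-epoch_step.ckpt pattern seen in the listing above. The function name latest_checkpoint is hypothetical, and the comparison is numeric so that epoch 90 ranks above epoch 85 regardless of string ordering.

```python
import re

# Assumed naming scheme, inferred from the directory listing: <prefix>-<epoch>_<step>.ckpt
CKPT_RE = re.compile(r"^(?P<prefix>.+)-(?P<epoch>\d+)_(?P<step>\d+)\.ckpt$")


def latest_checkpoint(names):
    """Return the file name with the highest (epoch, step), or None if none match."""
    best_key, best_name = None, None
    for name in names:
        match = CKPT_RE.match(name)
        if match is None:
            continue  # skip files such as resnet-graph.meta
        key = (int(match["epoch"]), int(match["step"]))
        if best_key is None or key > best_key:
            best_key, best_name = key, name
    return best_name


names = [
    "resnet-45_234.ckpt",
    "resnet-90_234.ckpt",
    "resnet-graph.meta",
    "resnet-85_234.ckpt",
]

print(latest_checkpoint(names))
```

On a real directory, the list of names could come from os.listdir on the model output path.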