Viewing the Results of Whole-Card Scheduling or Static vNPU Scheduling

TensorFlow

  1. Run the following command to view the training results.

    kubectl logs -n {namespace} {pod-name}

    For example:

    kubectl logs -n vcjob mindx-dls-test-default-test-0

  2. Check the training log. Output similar to the following indicates that training succeeded.

    ...
    I1123 16:20:11.016411 139889740781376 controller.py:458] train | step:    112 | steps/sec:    4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.339745}
    train | step:    112 | steps/sec:    4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.339745}
    2022-11-23 16:20:11.541361: I core/op_executors/npu_concrete_graph.cpp:84] Start consume iterator resource AnonymousIterator0 4 times
    2022-11-23 16:20:11.541499: I core/op_executors/npu_concrete_graph.cpp:118] Start run ge graph 445 pin to cpu, loop size 4
    2022-11-23 16:20:11.565552: I core/op_executors/npu_concrete_graph.cpp:92] Iterator resource AnonymousIterator0 consume 4 times done with status OK
    I1123 16:20:12.046172 139889740781376 controller.py:458] train | step:    116 | steps/sec:    4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.389724}
    train | step:    116 | steps/sec:    4.0 | output: {'train_accuracy': 0.0, 'train_loss': 12.389724}
    2022-11-23 16:20:12.542817: I core/op_executors/npu_concrete_graph.cpp:84] Start consume iterator resource AnonymousIterator0 4 times
    2022-11-23 16:20:12.542937: I core/op_executors/npu_concrete_graph.cpp:118] Start run ge graph 445 pin to cpu, loop size 4
    2022-11-23 16:20:12.571535: I core/op_executors/npu_concrete_graph.cpp:92] Iterator resource AnonymousIterator0 consume 4 times done with status OK
    I1123 16:20:13.038832 139889740781376 controller.py:458] train | step:    120 | steps/sec:    4.2 | output: {'train_accuracy': 0.0, 'train_loss': 12.421794}
    train | step:    120 | steps/sec:    4.2 | output: {'train_accuracy': 0.0, 'train_loss': 12.421794}
    2022-11-23 16:20:13.559254: I core/op_executors/npu_concrete_graph.cpp:84] Start consume iterator resource AnonymousIterator0 4 times
    2022-11-23 16:20:13.559394: I core/op_executors/npu_concrete_graph.cpp:118] Start run ge graph 445 pin to cpu, loop size 4
    2022-11-23 16:20:13.604791: I core/op_executors/npu_concrete_graph.cpp:92] Iterator resource AnonymousIterator0 consume 4 times done with status OK
    I1123 16:20:14.052418 139889740781376 controller.py:458] train | step:    124 | steps/sec:    4.1 | output: {'train_accuracy': 0.0, 'train_loss': 12.335646}
    train | step:    124 | steps/sec:    4.1 | output: {'train_accuracy': 0.0, 'train_loss': 12.335646}
    2022-11-23 16:20:14.555126: I core/op_executors/npu_concrete_graph.cpp:84] Start consume iterator resource AnonymousIterator0 4 times
    2022-11-23 16:20:14.555217: I core/op_executors/npu_concrete_graph.cpp:118] Start run ge graph 445 pin to cpu, loop size 4
    2022-11-23 16:20:14.601171: I core/op_executors/npu_concrete_graph.cpp:92] Iterator resource AnonymousIterator0 consume 4 times done with status OK
    I1123 16:20:15.058790 139889740781376 controller.py:458] train | step:    128 | steps/sec:    4.1 | output: {'train_accuracy': 0.0, 'train_loss': 12.415506}
    train | step:    128 | steps/sec:    4.1 | output: {'train_accuracy': 0.0, 'train_loss': 12.415506}
    I1123 16:20:15.228246 139889740781376 resnet_ctl_imagenet_main.py:191] Run stats:
    {'step_timestamp_log': ['BatchTimestamp<batch_index: 0, timestamp: 1669191532.9730577>', 'BatchTimestamp<batch_index: 100, timestamp: 1669191607.7925153>'], 'train_finish_time': 1669191615.2273297, 'avg_exp_per_second': 24.973437296848516}
    2022-11-23 16:20:15.232802: I core/npu_logger.cpp:58] Stopping npu stdout receiver of device 0
    2022-11-23 16:20:15.232901: I core/npu_device.cpp:122] Stopping iterator resource provider for AnonymousMultiDeviceIterator0
    2022-11-23 16:20:15.233013: I core/npu_device.cpp:122] Stopping iterator resource provider for AnonymousIterator0
    2022-11-23 16:20:15.235151: I core/npu_wrapper.cpp:230] Stop tensorflow model parser succeed
    2022-11-23 16:20:18.289648: I core/npu_wrapper.cpp:240] Stop graph engine succeed
    ...

  3. Go to the model output directory and check the generated model files.

    drwxr-xr-x 1 root root       4096 Dec  2 11:36 ./
    drwxrwxrwx 1 root root       4096 Dec  2 11:36 ../
    -rw-r--r--. 1 root root       999 Dec  2 11:36 checkpoint
    -rw-r--r--. 1 root root 306986892 Dec  2 11:35 ckpt-111.data-00000-of-00001
    -rw-r--r--. 1 root root     44311 Dec  2 11:35 ckpt-111.index
    -rw-r--r--. 1 root root 306986892 Dec  2 11:36 ckpt-128.data-00000-of-00001
    -rw-r--r--. 1 root root     44311 Dec  2 11:36 ckpt-128.index
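
The log check in step 2 can also be scripted. The following is a minimal sketch, not part of the product: the function names `tf_last_step` and `tf_finished_cleanly` are hypothetical, and the patterns are taken only from the sample log above, so they may need adjusting for other models or versions.

```shell
# Print the last training step reported in a TensorFlow training log (read from stdin).
# Assumes lines shaped like: "train | step:    128 | steps/sec: ..."
tf_last_step() {
    sed -n 's/.*train | step: *\([0-9][0-9]*\) .*/\1/p' | tail -n 1
}

# Succeed only if the log (read from stdin) contains the graph engine
# shutdown message that appears at the end of the sample log.
tf_finished_cleanly() {
    grep -q 'Stop graph engine succeed'
}
```

With these functions loaded in the current shell, the example Pod from step 1 could be checked like this:

    kubectl logs -n vcjob mindx-dls-test-default-test-0 | tf_last_step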

PyTorch

  1. Run the following command to view the training results.

    kubectl logs -n {namespace} {pod-name}

    For example:

    kubectl logs -n vcjob mindx-dls-test-default-test-0

  2. Check the training log. Output similar to the following indicates that training succeeded.

    [gpu id: 0 ] Test: [77/85]      Time  0.117 ( 0.281)    Loss 1.073741e+01 (1.078090e+01)        Acc@1   0.00 (  0.02)   Acc@5   0.00 (  0.12)
    [gpu id: 0 ] Test: [78/85]      Time  0.114 ( 0.279)    Loss 1.072909e+01 (1.078015e+01)        Acc@1   0.00 (  0.02)   Acc@5   0.00 (  0.12)
    [gpu id: 0 ] Test: [79/85]      Time  0.115 ( 0.277)    Loss 1.073733e+01 (1.077953e+01)        Acc@1   0.00 (  0.02)   Acc@5   0.20 (  0.12)
    [gpu id: 0 ] Test: [80/85]      Time  2.385 ( 0.306)    Loss 1.087646e+01 (1.078090e+01)        Acc@1   0.00 (  0.02)   Acc@5   0.00 (  0.12)
    [gpu id: 0 ] Test: [81/85]      Time  1.139 ( 0.318)    Loss 1.075754e+01 (1.078058e+01)        Acc@1   0.00 (  0.02)   Acc@5   0.39 (  0.12)
    [gpu id: 0 ] Test: [82/85]      Time  0.115 ( 0.315)    Loss 1.068419e+01 (1.077925e+01)        Acc@1   0.00 (  0.02)   Acc@5   0.20 (  0.13)
    [gpu id: 0 ] Test: [83/85]      Time  0.129 ( 0.313)    Loss 1.075079e+01 (1.077887e+01)        Acc@1   0.00 (  0.02)   Acc@5   0.20 (  0.13)
    [gpu id: 0 ] Test: [84/85]      Time  0.134 ( 0.310)    Loss 1.093459e+01 (1.078095e+01)        Acc@1   0.00 (  0.02)   Acc@5   0.39 (  0.13)
    [gpu id: 0 ] [AVG-ACC] * Acc@1 0.016 Acc@5 0.130
    validate acc1 tensor(0.0156, device='npu:0')
    Complete 90 epoch training, take time:1.05h
    ...

  3. Go to the model output directory and check the generated model files.

    drwxrwx--- 2 root root      4096 Mar  4 19:28 ./
    drwxrwx--- 4 root root      4096 Mar  4 19:28 ../
    -rw-rw---- 1 root root 102489869 Mar  4 19:28 checkpoint_npu0model_best.pth.tar
    -rw-rw---- 1 root root 102489869 Mar  4 19:28 checkpoint_npu0.pth.tar
    ...

    To convert the generated model files, see the "Model Inference" section of the PyTorch ResNet-50 model on ModelZoo.
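
For the PyTorch log in step 2, a similar check can be sketched in shell. The function names `pt_final_acc` and `pt_training_complete` are hypothetical, and the patterns assume the `[AVG-ACC]` and `Complete ... epoch training` lines look exactly like those in the sample log above.

```shell
# Print "<Acc@1> <Acc@5>" from the last [AVG-ACC] summary line of a
# PyTorch training log read from stdin.
pt_final_acc() {
    sed -n 's/.*\[AVG-ACC\] \* Acc@1 \([0-9.]*\) Acc@5 \([0-9.]*\).*/\1 \2/p' | tail -n 1
}

# Succeed only if the log (read from stdin) contains the completion line,
# for example "Complete 90 epoch training, take time:1.05h".
pt_training_complete() {
    grep -Eq 'Complete [0-9]+ epoch training'
}
```

With these functions loaded in the current shell, the example Pod from step 1 could be checked like this:

    kubectl logs -n vcjob mindx-dls-test-default-test-0 | pt_final_acc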

MindSpore

  1. Run the following command to view the training results.

    kubectl logs -n {namespace} {pod-name}

    For example:

    kubectl logs -n vcjob mindx-dls-test-default-test-0

  2. Check the training log. Output similar to the following indicates that training succeeded.

    ...
    2023-06-09 17:55:04,837:INFO:epoch: [70/90] loss: 1.541062, epoch time: 7.563 s, per step time: 157.554 ms
    2023-06-09 17:55:10,540:INFO:epoch: [71/90] loss: 1.544771, epoch time: 5.702 s, per step time: 118.796 ms
    2023-06-09 17:55:16,347:INFO:epoch: [72/90] loss: 1.506525, epoch time: 5.807 s, per step time: 120.979 ms
    2023-06-09 17:55:24,904:INFO:epoch: [73/90] loss: 1.519342, epoch time: 8.556 s, per step time: 178.260 ms
    2023-06-09 17:55:29,887:INFO:epoch: [74/90] loss: 1.387423, epoch time: 4.982 s, per step time: 103.783 ms
    2023-06-09 17:55:39,785:INFO:epoch: [75/90] loss: 1.440862, epoch time: 9.897 s, per step time: 206.194 ms
    2023-06-09 17:55:48,780:INFO:epoch: [76/90] loss: 1.431275, epoch time: 8.995 s, per step time: 187.399 ms
    2023-06-09 17:55:55,764:INFO:epoch: [77/90] loss: 1.411003, epoch time: 6.984 s, per step time: 145.492 ms
    2023-06-09 17:56:03,962:INFO:epoch: [78/90] loss: 1.457689, epoch time: 8.198 s, per step time: 170.783 ms
    2023-06-09 17:56:11,517:INFO:epoch: [79/90] loss: 1.410896, epoch time: 7.554 s, per step time: 157.372 ms
    2023-06-09 17:56:16,643:INFO:epoch: [80/90] loss: 1.517990, epoch time: 5.126 s, per step time: 106.789 ms
    2023-06-09 17:56:23,364:INFO:epoch: [81/90] loss: 1.342399, epoch time: 6.720 s, per step time: 140.005 ms
    2023-06-09 17:56:31,835:INFO:epoch: [82/90] loss: 1.352396, epoch time: 8.471 s, per step time: 176.470 ms
    2023-06-09 17:56:36,971:INFO:epoch: [83/90] loss: 1.358075, epoch time: 5.135 s, per step time: 106.984 ms
    2023-06-09 17:56:44,259:INFO:epoch: [84/90] loss: 1.400720, epoch time: 7.288 s, per step time: 151.838 ms
    2023-06-09 17:56:52,868:INFO:epoch: [85/90] loss: 1.371813, epoch time: 8.608 s, per step time: 179.339 ms
    2023-06-09 17:56:57,613:INFO:epoch: [86/90] loss: 1.303416, epoch time: 4.745 s, per step time: 98.858 ms
    2023-06-09 17:57:04,177:INFO:epoch: [87/90] loss: 1.290425, epoch time: 6.564 s, per step time: 136.744 ms
    2023-06-09 17:57:11,797:INFO:epoch: [88/90] loss: 1.298486, epoch time: 7.619 s, per step time: 158.738 ms
    2023-06-09 17:57:16,807:INFO:epoch: [89/90] loss: 1.297104, epoch time: 5.009 s, per step time: 104.363 ms
    2023-06-09 17:57:25,568:INFO:epoch: [90/90] loss: 1.401816, epoch time: 8.759 s, per step time: 182.486 ms

  3. Go to the model output directory and check the generated model files.

    drwx------  2 root root      4096 Dec 21 15:35 ./
    drwxrwxrwx 10 root root      4096 Dec 21 15:26 ../
    -r--------  1 root root 188546464 Dec 21 15:31 resnet50-45_234.ckpt
    -r--------  1 root root 188546464 Dec 21 15:31 resnet50-50_234.ckpt
    -r--------  1 root root 188546464 Dec 21 15:32 resnet50-55_234.ckpt
    -r--------  1 root root 188546464 Dec 21 15:32 resnet50-60_234.ckpt
    -r--------  1 root root 188546464 Dec 21 15:33 resnet50-65_234.ckpt
    -r--------  1 root root 188546464 Dec 21 15:33 resnet50-70_234.ckpt
    -r--------  1 root root 188546464 Dec 21 15:33 resnet50-75_234.ckpt
    -r--------  1 root root 188546464 Dec 21 15:34 resnet50-80_234.ckpt
    -r--------  1 root root 188546464 Dec 21 15:34 resnet50-85_234.ckpt
    -r--------  1 root root 188546464 Dec 21 15:35 resnet50-90_234.ckpt
    -rw-------  1 root root    769071 Dec 21 15:28 resnet50-graph.meta
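
The MindSpore log reports progress as `epoch: [current/total]`, so one way to confirm that the job ran to the end is to compare the two counters on the last such line. The sketch below assumes the log format shown in step 2. The function names `ms_last_epoch` and `ms_training_complete` are hypothetical.

```shell
# Print "<current> <total> <loss>" from the last epoch line of a
# MindSpore training log read from stdin.
ms_last_epoch() {
    sed -n 's/.*epoch: \[\([0-9][0-9]*\)\/\([0-9][0-9]*\)\] loss: \([0-9.]*\),.*/\1 \2 \3/p' | tail -n 1
}

# Succeed only if the last reported epoch equals the total epoch count.
# Fails if no epoch line is found at all.
ms_training_complete() {
    ms_last_epoch | {
        read -r current total rest && [ "$current" -eq "$total" ]
    }
}
```

With these functions loaded in the current shell, the example Pod from step 1 could be checked like this:

    kubectl logs -n vcjob mindx-dls-test-default-test-0 | ms_training_complete && echo "training finished"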