This example uses local NFS storage and a host named Ubuntu.
tail -f /data/atlas_dls/output/logs
[gpu id: 0 ] Test: [77/85] Time 0.117 ( 0.281) Loss 1.073741e+01 (1.078090e+01) Acc@1 0.00 ( 0.02) Acc@5 0.00 ( 0.12)
[gpu id: 0 ] Test: [78/85] Time 0.114 ( 0.279) Loss 1.072909e+01 (1.078015e+01) Acc@1 0.00 ( 0.02) Acc@5 0.00 ( 0.12)
[gpu id: 0 ] Test: [79/85] Time 0.115 ( 0.277) Loss 1.073733e+01 (1.077953e+01) Acc@1 0.00 ( 0.02) Acc@5 0.20 ( 0.12)
[gpu id: 0 ] Test: [80/85] Time 2.385 ( 0.306) Loss 1.087646e+01 (1.078090e+01) Acc@1 0.00 ( 0.02) Acc@5 0.00 ( 0.12)
[gpu id: 0 ] Test: [81/85] Time 1.139 ( 0.318) Loss 1.075754e+01 (1.078058e+01) Acc@1 0.00 ( 0.02) Acc@5 0.39 ( 0.12)
[gpu id: 0 ] Test: [82/85] Time 0.115 ( 0.315) Loss 1.068419e+01 (1.077925e+01) Acc@1 0.00 ( 0.02) Acc@5 0.20 ( 0.13)
[gpu id: 0 ] Test: [83/85] Time 0.129 ( 0.313) Loss 1.075079e+01 (1.077887e+01) Acc@1 0.00 ( 0.02) Acc@5 0.20 ( 0.13)
[gpu id: 0 ] Test: [84/85] Time 0.134 ( 0.310) Loss 1.093459e+01 (1.078095e+01) Acc@1 0.00 ( 0.02) Acc@5 0.39 ( 0.13)
[gpu id: 0 ] [AVG-ACC] * Acc@1 0.016 Acc@5 0.130
validate acc1 tensor(0.0156, device='npu:0')
Complete 90 epoch training, take time:1.05h
...
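Rather than scrolling through the `tail -f` output, you can pull the final averaged validation accuracy and the completion line out of the log with `grep`. The sketch below assumes the log is a single text file whose lines follow the format shown above. The function name `summarize_log` is hypothetical and is not part of the training scripts.

```shell
# Hypothetical helper: print the last [AVG-ACC] summary and the
# "Complete N epoch training" line from a training log (format as shown above).
summarize_log() {
  local log_file="$1"
  local acc finish_line

  # Last averaged accuracy line, e.g. "[AVG-ACC] * Acc@1 0.016 Acc@5 0.130",
  # reduced to "Acc@1=<top-1> Acc@5=<top-5>".
  acc=$(grep -o '\[AVG-ACC] \* Acc@1 [0-9.]* Acc@5 [0-9.]*' "$log_file" | tail -n 1 \
        | awk '{print "Acc@1=" $4 " Acc@5=" $6}')

  # Last completion line, e.g. "Complete 90 epoch training, take time:1.05h".
  finish_line=$(grep -o 'Complete [0-9]* epoch training, take time:[0-9.]*h' "$log_file" | tail -n 1)

  if [ -z "$acc" ]; then
    echo "no [AVG-ACC] line found in $log_file" >&2
    return 1
  fi
  echo "$acc"
  if [ -n "$finish_line" ]; then
    echo "$finish_line"
  fi
  return 0
}
```

Run against the log excerpt above, this prints `Acc@1=0.016 Acc@5=0.130` followed by `Complete 90 epoch training, take time:1.05h`.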
ls -alF /data/atlas_dls/code/ResNet50_for_PyTorch_1.4_code
drwxrwx--- 2 root root 4096 Mar 4 19:28 ./
drwxrwx--- 4 root root 4096 Mar 4 19:28 ../
-rw-rw---- 1 root root 102489869 Mar 4 19:28 checkpoint_npu0model_best.pth.tar
-rw-rw---- 1 root root 102489869 Mar 4 19:28 checkpoint_npu0.pth.tar
...
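Before converting the model, you can confirm that the checkpoint files listed above exist and are not empty. This is a minimal sketch. The function name `check_checkpoints` is hypothetical. The two file names are copied from the listing above, which came from a single-device run on `npu0`, so runs on other devices may name their files differently (an assumption).

```shell
# Hypothetical pre-conversion check: verify that the expected checkpoint
# files exist and are non-empty in the given directory.
check_checkpoints() {
  local ckpt_dir="$1"
  local missing=0 f

  # File names taken from the single-device (npu0) listing; adjust for other runs.
  for f in checkpoint_npu0.pth.tar checkpoint_npu0model_best.pth.tar; do
    if [ -s "$ckpt_dir/$f" ]; then
      # -s: file exists and has a size greater than zero
      echo "OK      $f ($(wc -c < "$ckpt_dir/$f") bytes)"
    else
      echo "MISSING $f (absent or empty)" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_checkpoints /data/atlas_dls/code/ResNet50_for_PyTorch_1.4_code` returns 0 only when both files are present and non-empty.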
To convert the generated model files, refer to the "Model Inference" section of the PyTorch ResNet-50 model on ModelZoo.