The following example uses a local NFS, with Ubuntu as the host name.
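The "logs" file in the "/data/atlas_dls/output/" directory records the FPS values of the training job. In this example, the directory structure is the same for single-node and distributed training.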
root@ubuntu:/home# ll /data/atlas_dls/output/
total 16896
drwxr-x--- 2 HwHiAiUser HwHiAiUser 4096 Oct 7 16:06 ./
drwxr-x--- 4 hwMindX HwHiAiUser 4096 Oct 7 15:26 ../
...
-rwxr-x--- 1 HwHiAiUser HwHiAiUser 682 Oct 7 16:06 logs
cat /data/atlas_dls/output/logs
If the command output contains FPS values, the training succeeded.
step: 100 epoch: 0.0 FPS: 496.4 loss: 6.605 total_loss: 7.922 lr:0.10000
step: 200 epoch: 0.0 FPS: 1819.2 loss: 6.375 total_loss: 7.672 lr:0.10000
step: 300 epoch: 0.1 FPS: 1898.2 loss: 6.277 total_loss: 7.551 lr:0.10000
step: 400 epoch: 0.1 FPS: 2126.8 loss: 6.242 total_loss: 7.492 lr:0.10000
step: 500 epoch: 0.1 FPS: 2357.4 loss: 6.090 total_loss: 7.320 lr:0.10000
step: 600 epoch: 0.1 FPS: 2370.7 loss: 5.863 total_loss: 7.074 lr:0.10000
step: 700 epoch: 0.1 FPS: 2368.6 loss: 5.902 total_loss: 7.094 lr:0.10000
step: 800 epoch: 0.2 FPS: 2370.0 loss: 5.746 total_loss: 6.918 lr:0.10000
step: 900 epoch: 0.2 FPS: 2371.0 loss: 5.605 total_loss: 6.758 lr:0.10000
step: 1000 epoch: 0.2 FPS: 2365.9 loss: 5.750 total_loss: 6.887 lr:0.10000
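The FPS check can also be scripted instead of read by eye. The following is a minimal sketch, assuming the log entries keep the "step: ... epoch: ... FPS: ..." layout shown in this example; the function names `parse_fps` and `training_reported_fps` are hypothetical helpers, not part of the training scripts.

```python
import re

# Pattern modeled on the sample log entries in this example (assumption:
# other training scripts may use a different log layout).
ENTRY = re.compile(r"step:\s*(\d+)\s+epoch:\s*[\d.]+\s+FPS:\s*([\d.]+)")


def parse_fps(log_text):
    """Return a list of (step, fps) tuples found in the log text."""
    return [(int(step), float(fps)) for step, fps in ENTRY.findall(log_text)]


def training_reported_fps(log_text):
    """Return True if the log contains at least one FPS entry."""
    return len(parse_fps(log_text)) > 0
```

To apply it to this example, read "/data/atlas_dls/output/logs" into a string and pass it to `parse_fps`. The regular expression scans the whole text, so it does not matter whether the entries are on separate lines.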
ls -l /data/atlas_dls/code/ResNet50_for_TensorFlow_1.7_code/scripts/model_dir
drwxr-xr-x 2 root root 4096 Jan 15 17:58 ./
drwxrwxrwx 5 root root 4096 Jan 15 18:03 ../
-rw-r--r-- 1 root root 81 Jan 15 17:58 checkpoint
-rw-r--r-- 1 root root 18649801 Jan 15 18:02 events.out.tfevents.1642240674.mindx-dls-test-default-test-0
-rw-r--r-- 1 root root 8475459 Jan 15 17:58 graph.pbtxt
-rw-r--r-- 1 root root 204685136 Jan 15 17:58 model.ckpt-0.data-00000-of-00001
-rw-r--r-- 1 root root 16262 Jan 15 17:58 model.ckpt-0.index
-rw-r--r-- 1 root root 4977000 Jan 15 17:58 model.ckpt-0.meta
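The listing above contains the three kinds of files that make up a TensorFlow 1.x checkpoint: ".index", ".meta", and ".data-XXXXX-of-YYYYY". The following is a minimal sketch for checking file names such as these, assuming the "model.ckpt-N" naming seen in this example; `complete_checkpoints` is a hypothetical helper, and it does not verify that every data shard is present or that the files are readable.

```python
import re

# Matches the file names of a checkpoint written as "model.ckpt-N"
# (assumption: the naming used in this example's model_dir).
CKPT_FILE = re.compile(r"^(model\.ckpt-\d+)\.(index|meta|data-\d{5}-of-\d{5})$")


def complete_checkpoints(filenames):
    """Return checkpoint prefixes that have index, meta and data files."""
    parts = {}
    for name in filenames:
        match = CKPT_FILE.match(name)
        if match:
            prefix, suffix = match.groups()
            # Collapse all data shards into a single "data" marker.
            kind = "data" if suffix.startswith("data-") else suffix
            parts.setdefault(prefix, set()).add(kind)
    complete = [p for p, kinds in parts.items()
                if kinds == {"index", "meta", "data"}]
    # Sort by the numeric training step at the end of the prefix.
    return sorted(complete, key=lambda p: int(p.rsplit("-", 1)[1]))
```

For example, passing the result of `os.listdir()` on the model_dir path used in the command above would report which checkpoints appear to be complete.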