Starting Single-Device Training
We skip the Distributed Training Script Adaptation (Single Device) step for now in order to first validate the single-device porting result.
First, determine the parameters in the startup script. To use the same parameters as single-CPU training, set distribution_strategy to one_device.
Add the models directory to PYTHONPATH as described in official/vision/image_classification/resnet/README.md. The following example assumes that the models repository is located at /path/to/models:
export PYTHONPATH=$PYTHONPATH:/path/to/models
As a best practice, offload one training epoch to the device per iteration in iteration offload mode by setting steps_per_loop to the dataset size divided by the batch size. For example, if the dataset size is 64, the batch size is 2, and the evaluation phase is skipped, steps_per_loop should be 32 (= 64/2), and the matching environment variable is set with export NPU_LOOP_SIZE=32. The final startup parameters are as follows (replace /path/to/imagenet_TF/ with your dataset directory). In normal cases training is organized by epoch; the train_steps argument is used here only to shorten the run for validation.
cd official/vision/image_classification/resnet/
export PYTHONPATH=$PYTHONPATH:/path/to/models
export NPU_LOOP_SIZE=32
python3 resnet_ctl_imagenet_main.py \
  --data_dir=/path/to/imagenet_TF/ \
  --train_steps=128 \
  --distribution_strategy=one_device \
  --use_tf_while_loop=true \
  --steps_per_loop=32 \
  --batch_size=2 \
  --epochs_between_evals=1 \
  --skip_eval
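For a dataset of a different size, the loop size can be derived in the launch script rather than hard-coded; a minimal sketch, where DATASET_SIZE and BATCH_SIZE are illustrative placeholders for your own values (they are not options of the training script):

```shell
# Illustrative values; substitute your actual dataset size and batch size.
DATASET_SIZE=64
BATCH_SIZE=2

# steps_per_loop = dataset size / batch size (integer division)
export NPU_LOOP_SIZE=$((DATASET_SIZE / BATCH_SIZE))
echo "NPU_LOOP_SIZE=$NPU_LOOP_SIZE"
```

Pass the same value to --steps_per_loop so the offloaded loop size and the environment variable stay consistent.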
After this command finishes executing, the script has completed training on the NPU.