Starting Single-Device Training

We skip Distributed Training Script Adaptation (Single Device) for now in order to first validate the porting result on a single device.

First, determine the parameters for the startup script. To run single-device training with the same parameters, set distribution_strategy to one_device.

Add the models directory to PYTHONPATH according to the description in official/vision/image_classification/resnet/README.md. The following example assumes that the current directory is /path/to/models:

export PYTHONPATH=$PYTHONPATH:/path/to/models
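
As an optional sanity check (not part of the original guide), you can confirm that the models directory actually appears in PYTHONPATH before launching training; /path/to/models here is the placeholder path from the export above:

```shell
# Optional sanity check: verify /path/to/models appears as a PYTHONPATH entry.
export PYTHONPATH=$PYTHONPATH:/path/to/models
case ":$PYTHONPATH:" in
  *":/path/to/models:"*) echo "models directory is on PYTHONPATH" ;;
  *) echo "models directory is MISSING from PYTHONPATH" ;;
esac
```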

As a best practice, we offload one training epoch to the device per iteration in iteration offload mode, so steps_per_loop should equal the dataset size divided by the batch size. For example, if the dataset size is 64, the batch size is 2, and the evaluation phase is not wanted, steps_per_loop should be 32 (= 64/2), and the environment variable should be set with export NPU_LOOP_SIZE=32. The final startup parameters are as follows (replace /path/to/imagenet_TF/ with your dataset directory). Training is normally organized by epoch; the train_steps argument is used here only to shorten the run for validation.

cd official/vision/image_classification/resnet/
export PYTHONPATH=$PYTHONPATH:/path/to/models
export NPU_LOOP_SIZE=32
python3 resnet_ctl_imagenet_main.py \
--data_dir=/path/to/imagenet_TF/ \
--train_steps=128 \
--distribution_strategy=one_device \
--use_tf_while_loop=true \
--steps_per_loop=32 \
--batch_size=2 \
--epochs_between_evals=1 \
--skip_eval

After this command is executed, the training script runs to completion on the NPU.
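
For reference, the steps_per_loop / NPU_LOOP_SIZE value used above can be derived with simple shell arithmetic; the variable names are illustrative, and the dataset and batch sizes are the example values from the text:

```shell
# Derive steps_per_loop from dataset size and batch size (example values from the text).
DATASET_SIZE=64   # number of training samples (example value, not a real dataset size)
BATCH_SIZE=2      # per-step batch size, matching --batch_size=2
NPU_LOOP_SIZE=$((DATASET_SIZE / BATCH_SIZE))
export NPU_LOOP_SIZE
echo "steps_per_loop = $NPU_LOOP_SIZE"   # prints: steps_per_loop = 32
```

With a real ImageNet-sized dataset, substitute the actual sample count so that one loop iteration covers exactly one epoch.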