Starting Distributed Training

You can run single-device training and distributed training using the same script.

Distributed training requires some adjustments to the startup parameters, but these adjustments are not unique to NPU training. Whenever you scale up the number of training devices, you need to increase the global batch size proportionally. For example, if you used a batch size of 32 for single-device training and then connect the device with another seven devices to form a cluster, you can increase the global batch size to 256 (= 32 x 8) to accelerate training.

The preceding single-device training example uses a batch size of 2; for eight-device training, we increase the global batch size to 16 (= 8 x 2). Changing the batch size directly affects the number of training steps per epoch. Assume that 64 samples are processed in every epoch. With a batch size of 2, steps_per_loop is set to 32 (= 64/2), meaning that 32 steps on a single device complete one training epoch. In 8-device training, the batch size is increased to 16, so steps_per_loop should be set to 4 (= 64/16), meaning that 4 steps complete one training epoch. Each epoch therefore takes roughly one eighth of the steps, which is where the speedup comes from.
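
To make the relationship explicit, the arithmetic can be scripted. The following is a minimal sketch; the per-epoch sample count of 64 and the global batch size of 16 are simply the values assumed in this example:

# Derive the number of steps needed for one epoch from the per-epoch
# sample count and the global batch size (values from the example above).
SAMPLES_PER_EPOCH=64
GLOBAL_BATCH_SIZE=16
STEPS_PER_EPOCH=$((SAMPLES_PER_EPOCH / GLOBAL_BATCH_SIZE))
echo "steps_per_loop=${STEPS_PER_EPOCH}"   # prints steps_per_loop=4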

Because distributed training starts multiple training processes, it is convenient to write the startup command line into a script. The following 8-device training script (train.sh) is for reference only.

# Path to the ranktable (cluster resource configuration) file
export RANK_TABLE_FILE=/path/to/rank_table.json
# Total number of devices participating in training
export RANK_SIZE=8
# Rank of this process, passed as the first script argument
export RANK_ID=$1
# Device used by this process, passed as the second script argument
export ASCEND_DEVICE_ID=$2
# Number of steps executed per training loop on the device (kept equal to --steps_per_loop)
export NPU_LOOP_SIZE=4
python3 resnet_ctl_imagenet_main.py \
--data_dir=/path/to/imagenet_TF/ \
--train_steps=16 \
--distribution_strategy=one_device \
--use_tf_while_loop=true \
--steps_per_loop=4 \
--batch_size=16 \
--epochs_between_evals=1 \
--skip_eval
  • Replace /path/to/rank_table.json with the path to the NPU distributed configuration file that matches your cluster setup.
  • Replace /path/to/imagenet_TF/ with the actual dataset directory.
  • In this example, the resource information of the Ascend AI Processor is configured in a ranktable file. For details about this configuration file, see Preparing the Ranktable Resource Configuration File. Alternatively, you can specify the resource information of the Ascend AI Processor through environment variables. For details, see Training Execution (Setting Environment Variables).

Add the models directory to PYTHONPATH according to the description in official/vision/image_classification/resnet/README.md. For example, if the models directory is /path/to/models, set the environment variable as follows:

export PYTHONPATH=$PYTHONPATH:/path/to/models
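
To confirm that PYTHONPATH is set correctly, a quick check (assuming the standard layout of the TensorFlow models repository, where the official package sits directly under /path/to/models) is to import the package:

# If this import fails, the training script will also fail to find its dependencies.
python3 -c "import official; print(official.__file__)"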

Next, run the following commands to start 8-device NPU training. The two arguments passed to train.sh are the RANK_ID and the ASCEND_DEVICE_ID of each process, respectively.

nohup bash train.sh 0 0 &
nohup bash train.sh 1 1 &
nohup bash train.sh 2 2 &
nohup bash train.sh 3 3 &
nohup bash train.sh 4 4 &
nohup bash train.sh 5 5 &
nohup bash train.sh 6 6 &
nohup bash train.sh 7 7 &
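
Instead of typing eight commands, you can start the processes in a small loop. This is a minimal sketch under the same assumption as the commands above, namely that RANK_ID and ASCEND_DEVICE_ID are identical on a single 8-device server; the log file names are only illustrative:

# Launch one training process per device; $i serves as both RANK_ID and
# ASCEND_DEVICE_ID, and each process writes to its own log file.
for i in $(seq 0 7); do
    nohup bash train.sh ${i} ${i} > train_${i}.log 2>&1 &
done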