How Do I Save Checkpoints on a Particular Device?
In distributed training, every device saves checkpoints by default. To save checkpoints only on a particular device (typically the device with rank 0), edit your training script as follows:
Original TensorFlow code:

```python
self._classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn,
    model_dir=self._model_dir,
    config=tf.estimator.RunConfig(
        # Save a checkpoint every 50 steps on Horovod rank 0 only.
        save_checkpoints_steps=50 if hvd.rank() == 0 else None,
        keep_checkpoint_max=1))
```
Code after porting (`NPURunConfig` and `NPUEstimator` come from the NPU adapter package, not the `tf.estimator` namespace):

```python
from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator.npu.npu_estimator import NPUEstimator
from hccl.manage.api import get_rank_id

self._classifier = NPUEstimator(
    model_fn=cnn_model_fn,
    model_dir=self._model_dir,
    config=NPURunConfig(
        # Save a checkpoint every 50 steps on rank 0 only;
        # 0 disables checkpoint saving on all other ranks.
        save_checkpoints_steps=50 if get_rank_id() == 0 else 0,
        keep_checkpoint_max=1))
```
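The rank-conditional argument logic can be isolated in a small helper so it is easy to verify without a framework installed. This is a minimal sketch; `checkpoint_config_args` is a hypothetical name, not part of the TensorFlow or Ascend APIs, and the rank is passed in explicitly instead of being queried from `get_rank_id()`:

```python
def checkpoint_config_args(rank, steps=50):
    """Build run-config keyword arguments so that only the device
    with rank 0 writes checkpoints.

    On rank 0, a checkpoint is saved every `steps` steps; on all
    other ranks, save_checkpoints_steps is 0, which disables saving.
    """
    return {
        "save_checkpoints_steps": steps if rank == 0 else 0,
        "keep_checkpoint_max": 1,
    }

# The resulting dict can be unpacked into the run config, e.g.
# NPURunConfig(**checkpoint_config_args(get_rank_id())).
```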
Parent topic: Common Operations