Supported RunConfig Options
This section describes which native TensorFlow RunConfig options the NPURunConfig class supports.
Options Supported by NPURunConfig
Option |
Description |
|---|---|
model_dir |
Directory where the model is saved. Default value: None. If NPURunConfig and NPUEstimator are configured with different model_dir values, an error is reported. If only one of them is configured with model_dir, the configured path applies. If neither is configured with model_dir, a model_dir_xxxxxxxxxx directory is created in the current script execution path to save the model file. |
tf_random_seed |
Random seed used for variable initialization. Default value: None. |
save_summary_steps |
Interval (in steps) for saving the summary. Defaults to 0. Applies only to the scenario where iterations_per_loop = 1. If iterations_per_loop > 1, the summary may not be saved at the configured interval. For details about how to save the information in that case, see "Log and Summary Operators." |
save_checkpoints_steps |
Interval (in steps) for saving the checkpoints. Default value: None.
To save the checkpoint data on only a specific device, modify the training script as follows.
Original TensorFlow code:
self._classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn,
    model_dir=self._model_dir,
    config=tf.estimator.RunConfig(
        save_checkpoints_steps=50 if hvd.rank() == 0 else None,
        keep_checkpoint_max=1))
Code after porting:
self._classifier = NPUEstimator(
    model_fn=cnn_model_fn,
    model_dir=self._model_dir,
    config=NPURunConfig(
        save_checkpoints_steps=50 if get_rank_id() == 0 else 0,
        keep_checkpoint_max=1))
|
save_checkpoints_secs |
Interval (in seconds) for saving the checkpoints. Default value: None. This option is mutually exclusive with save_checkpoints_steps. |
session_config |
ConfigProto object used to configure the session. Default value: None. |
keep_checkpoint_max |
Maximum number of checkpoint files that can be stored. Defaults to 5. |
keep_checkpoint_every_n_hours |
Interval (in hours) between checkpoints to be kept. Defaults to 10000, a value that effectively disables this function. To use this function, set keep_checkpoint_max to a sufficiently large value. |
log_step_count_steps |
Interval (in steps) for recording global_step and loss values. Defaults to 100. Applies only to the scenario where iterations_per_loop = 1. If iterations_per_loop > 1, the values may not be recorded at the configured interval. For details about how to save the information in that case, see "Log and Summary Operators." |
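
The model_dir rules in the table above can be read as a small decision function. The following is a minimal sketch in plain Python, not the NPUEstimator implementation. The helper name resolve_model_dir, its parameters, and the timestamp used as the directory suffix are assumptions made for illustration; the source only shows the pattern model_dir_xxxxxxxxxx.

```python
import os
import time


def resolve_model_dir(config_model_dir=None, estimator_model_dir=None, base_dir="."):
    """Hypothetical helper that mirrors the model_dir rules described above.

    Illustration only: this is not how NPUEstimator is actually implemented.
    """
    # Both values are set but differ: the table says an error is reported.
    if config_model_dir and estimator_model_dir and config_model_dir != estimator_model_dir:
        raise ValueError(
            "model_dir in NPURunConfig (%r) differs from model_dir in NPUEstimator (%r)"
            % (config_model_dir, estimator_model_dir))
    # Only one value is set (or both are equal): the configured path applies.
    if config_model_dir or estimator_model_dir:
        return config_model_dir or estimator_model_dir
    # Neither is set: build a model_dir_<suffix> path under base_dir.
    # The timestamp suffix is an assumption, not the documented format.
    return os.path.join(base_dir, "model_dir_%d" % int(time.time()))


# Example: only the run configuration sets model_dir, so that path is used.
print(resolve_model_dir(config_model_dir="/tmp/ckpt"))
```

The sketch only computes a path and does not create the directory, which keeps it free of side effects when it is used to check a configuration before training starts.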
Options Not Supported by NPURunConfig
The following options in RunConfig are not supported in NPURunConfig.
Option |
Description |
|---|---|
train_distribute |
Enables distributed training. The distributed configuration is specified by experimental_distribute. This option is used only by TensorFlow Adapter. You are advised not to set it. |
device_fn |
Callable invoked for each operation to set its Device field. |
protocol |
(Optional) Protocol used to start the server. If this option is left empty, gRPC is used by default. |
eval_distribute |
Enables distributed evaluation. The distributed configuration is specified by experimental_distribute. This option is used only by TensorFlow Adapter. You are advised not to set it. |
experimental_distribute |
Distributed configuration. |
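
When porting a script, it can help to check the arguments passed to RunConfig against the unsupported options listed above before switching to NPURunConfig. The following is a minimal sketch under that assumption. The helper find_unsupported_options and the sample kwargs dictionary are hypothetical and are not part of TensorFlow or TensorFlow Adapter; only the option names come from the table.

```python
# Option names copied from the "Options Not Supported by NPURunConfig" table.
UNSUPPORTED_RUNCONFIG_OPTIONS = frozenset([
    "train_distribute",
    "device_fn",
    "protocol",
    "eval_distribute",
    "experimental_distribute",
])


def find_unsupported_options(run_config_kwargs):
    """Return the sorted names of options that NPURunConfig does not support.

    Hypothetical helper for illustration; it only inspects the key names.
    """
    return sorted(
        name for name in run_config_kwargs
        if name in UNSUPPORTED_RUNCONFIG_OPTIONS)


# Sample arguments taken from an imaginary script that is being ported.
kwargs = {
    "model_dir": "/tmp/model",
    "save_checkpoints_steps": 50,
    "protocol": "grpc",
    "device_fn": None,
}
unsupported = find_unsupported_options(kwargs)
print(unsupported)  # prints ['device_fn', 'protocol']
```

The check compares key names only, so an unsupported option is flagged even when its value is None.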