init

Function

Initializes the Rec SDK model training framework.

Prototype

def init(**kwargs)

**kwargs Parameters

Parameter

Type

Mandatory/Optional

Description

max_steps

int

Optional

Total number of training steps. The default value is -1, indicating that training ends when the training data is used up. The value ranges from -1 to 2147483647.

train_steps

int

Optional

Number of training steps to run before each round of test prediction (evaluation). The default value is -1, indicating that prediction is performed after the entire training dataset has been trained. The value ranges from -1 to 2147483647.

eval_steps

int

Optional

Number of test prediction (evaluation) steps in each round. The default value is -1, indicating that training continues after the entire test dataset has been predicted. The value ranges from -1 to 2147483647.

if_load

bool

Optional

Whether to load a model. The default value is False.

Value:

  • True: model loaded
  • False: model not loaded

use_dynamic

bool

Optional

Whether to use the dynamic shape function. The default value is True.

Value:

  • True: dynamic shape enabled
  • False: dynamic shape disabled

use_dynamic_expansion

bool

Optional

Whether to enable dynamic capacity expansion of the on-chip memory. The default value is False.

Value:

  • True: dynamic capacity expansion enabled
  • False: dynamic capacity expansion disabled

bind_cpu

bool

Optional

Whether to enable automatic CPU core binding. The default value is True.

Value:

  • True: automatic CPU core binding enabled
  • False: automatic CPU core binding disabled

save_steps

int

Optional

Number of training steps between model saves. The value ranges from -1 to 2147483647. The default value -1 indicates that the model is saved once, after all training data has been trained.

save_checkpoint_due_time

int

Optional

Interval for saving the full model, in seconds.

The value ranges from 1 to 2147483647. Generally, the value of save_checkpoint_due_time is greater than that of save_delta_checkpoints_secs.

This parameter is mandatory when is_incremental_checkpoint is set to True.

NOTE:

When both incremental saving/loading and SSD mode are enabled, setting this parameter to a small value may cause data races that lead to segmentation faults.

save_delta_checkpoints_secs

int

Optional

Interval for saving the incremental model, in seconds.

The value ranges from 1 to 2147483647. Generally, the value of save_checkpoint_due_time is greater than that of save_delta_checkpoints_secs.

This parameter is mandatory when is_incremental_checkpoint is set to True.

NOTE:

When both incremental saving/loading and SSD mode are enabled, setting this parameter to a small value may cause data races that lead to segmentation faults.

is_incremental_checkpoint

bool

Optional

Whether to save and load the incremental model. The default value is False.

  • True: enabled
  • False: disabled

restore_model_version

int

Optional

Step of the model to be loaded. If this parameter is not passed, the latest model is loaded by default. If this parameter is set to a specific step, the model at the corresponding step is loaded.

The value ranges from 0 to 2147483647.

recent_key_count_threshold

int

Optional

Minimum number of key occurrences during the incremental saving period. This parameter is used for low-frequency filtering: when the incremental model is saved, keys that occurred fewer times than this value are filtered out. The default value is 0. The value ranges from 0 to 2147483647.

use_lccl

bool

Optional

Whether to enable the Low Latency Collective Communication Library (LCCL) function. When a multi-device job is running and the communication bandwidth usage is low, LCCL can accelerate collective communication. Only the non-scale-out mode of the single-server on-chip memory is supported. For details about how to use this function, see LCCL Communication Optimization Operators and Samples. After this function is enabled, the following LCCL operators are enabled in some scenarios:

  • All2All operator
  • GatherAll operator (fused Gather&AllToAll operator)
  • GatherUss operator (fused Gather&UnsortedSegmentSum operator)

The default value is False, indicating that this function is disabled.

  • When sess.run is used for training, the numbers of train, eval, and save steps that the session performs must be the same as the values of train_steps, eval_steps, and save_steps, respectively.
  • When Estimator is used for training:
    • The value of save_steps must be the same as that of save_checkpoints_steps set when the NPURunConfig object is defined, and cannot be set to -1 in TensorFlow.
    • The value of max_steps must be the same as that of max_steps passed to est.train()/tf.estimator.TrainSpec(), and cannot be set to -1 in TensorFlow.
    • In train_and_evaluate mode, the requirements for save_steps and max_steps are the same as those described above. In addition, the value of train_steps must be the same as that of save_steps, and the value of eval_steps must be the same as that of steps passed to tf.estimator.EvalSpec() and cannot be set to -1 in TensorFlow.
  • If kwargs is used to pass other parameters that are not described, Rec SDK does not use these parameters.
  • Set max_steps, train_steps, and eval_steps to the values actually used in training. They cannot all be 0 at the same time.
  • If use_dynamic_expansion is set to True, select an optimizer of the ByAddr type, such as SGDByAddr and LazyAdamByAddress.
  • Multi-round evaluation is not supported in the train_and_evaluate scenario.
  • The values of max_steps, train_steps, eval_steps, and save_steps must be the same as those in the actual training process. If they are inconsistent, the training may fail or the training accuracy may be affected.
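The value ranges and cross-parameter rules described above can be checked before init is called. The following is a minimal sketch in plain Python. The helper check_init_kwargs and the INT_RANGES table are hypothetical names introduced for illustration and are not part of Rec SDK. The sketch only covers the documented ranges, the rule that max_steps, train_steps, and eval_steps cannot all be 0, and the rule that both saving intervals are required when is_incremental_checkpoint is True.

```python
INT32_MAX = 2147483647

# Inclusive ranges taken from the parameter descriptions of init().
INT_RANGES = {
    "max_steps": (-1, INT32_MAX),
    "train_steps": (-1, INT32_MAX),
    "eval_steps": (-1, INT32_MAX),
    "save_steps": (-1, INT32_MAX),
    "save_checkpoint_due_time": (1, INT32_MAX),
    "save_delta_checkpoints_secs": (1, INT32_MAX),
    "restore_model_version": (0, INT32_MAX),
    "recent_key_count_threshold": (0, INT32_MAX),
}


def check_init_kwargs(**kwargs):
    """Hypothetical helper (not a Rec SDK API): list problems found in kwargs."""
    problems = []

    # Range check for every integer parameter that was actually passed.
    for name, (low, high) in INT_RANGES.items():
        if name in kwargs and not low <= kwargs[name] <= high:
            problems.append(f"{name}={kwargs[name]} is outside [{low}, {high}]")

    # max_steps, train_steps, and eval_steps cannot all be 0.
    # A parameter that is not passed keeps its documented default of -1.
    step_names = ("max_steps", "train_steps", "eval_steps")
    if all(kwargs.get(name, -1) == 0 for name in step_names):
        problems.append("max_steps, train_steps, and eval_steps cannot all be 0")

    # Both intervals are mandatory when incremental checkpointing is enabled.
    if kwargs.get("is_incremental_checkpoint", False):
        for name in ("save_checkpoint_due_time", "save_delta_checkpoints_secs"):
            if name not in kwargs:
                problems.append(
                    f"{name} is required when is_incremental_checkpoint=True"
                )

    return problems
```

An empty list means that none of the checked rules is violated. It does not guarantee that init succeeds, because the consistency requirements with sess.run or Estimator can only be verified against the actual training script.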

Return Value

  • Success: None
  • Failure: An exception is thrown.
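Because a failure is reported by an exception rather than by a return value, callers may want to log the arguments before the exception propagates. The following is a minimal sketch. The wrapper init_or_raise is a hypothetical name, and it catches the generic Exception type because this page does not state which exception classes init throws.

```python
import logging


def init_or_raise(init_fn, **kwargs):
    """Hypothetical wrapper: call init_fn(**kwargs), log on failure, re-raise."""
    try:
        # On success init returns None, so there is nothing to inspect.
        return init_fn(**kwargs)
    except Exception:
        # The exception type is not documented here, so catch broadly,
        # record the arguments that were used, and let the error propagate.
        logging.exception("init failed with kwargs=%s", kwargs)
        raise
```

With Rec SDK installed, the wrapper would be called as init_or_raise(init, max_steps=200, train_steps=100), where init is imported from mx_rec.util.initialize.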

Example

from mx_rec.util.initialize import init
init(max_steps=200, train_steps=100, eval_steps=10, save_steps=100, use_dynamic=True, use_dynamic_expansion=False)
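The example above does not enable incremental saving. The following sketch shows how the related keyword arguments might be combined, assuming the parameter semantics described in the table. All concrete values are illustrative assumptions, not recommendations, and call_init is a hypothetical wrapper that is defined but not invoked here, so the block can be read without Rec SDK installed.

```python
# Keyword arguments for a run that saves full and incremental models on timers.
init_kwargs = dict(
    max_steps=200,
    train_steps=100,
    eval_steps=10,
    save_steps=100,
    is_incremental_checkpoint=True,
    save_checkpoint_due_time=3600,    # full model interval, in seconds
    save_delta_checkpoints_secs=600,  # incremental model interval, in seconds
    recent_key_count_threshold=2,     # filter keys that occur fewer than 2 times
)

# The full-model interval is generally greater than the incremental interval.
intervals_ok = (
    init_kwargs["save_checkpoint_due_time"]
    > init_kwargs["save_delta_checkpoints_secs"]
)


def call_init():
    """Hypothetical wrapper: import init only when it is really needed."""
    from mx_rec.util.initialize import init

    init(**init_kwargs)
```

Both intervals are passed because they are mandatory when is_incremental_checkpoint is set to True.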

See Also

For details about the API call sequence and examples, see Porting and Training.