API List

TF Adapter provides APIs for users to develop training or online inference scripts based on the deep learning framework TensorFlow 1.15.

Figure 1 TF Adapter

API path: ${install_path}/python/site-packages/npu_bridge

Table 1 TF Adapter APIs

APIs are grouped by module; each entry lists the API name followed by its description.

Session Configuration
  • Session Configuration Options: TF Adapter provides a series of session configurations for function debugging, performance improvement, and precision improvement. Developers can use these session configurations when performing model training or online inference on the Ascend AI Processor.

npu.npu_config
  • NPURunConfig Constructor: When performing model training or online inference in Estimator mode on the Ascend AI Processor, use the constructor of the NPURunConfig class to specify the running configuration of the Estimator.
  • ProfilingConfig Constructor: Configures the profiling function.
  • MemoryConfig Constructor: Configures the system memory usage mode.
  • DumpConfig Constructor: Configures the dump function.
  • ExperimentalConfig Constructor: Extended parameters for debugging, which may change in later versions. They cannot be used in commercial products.

npu.npu_estimator
  • NPUEstimator Constructor: Constructor of the NPUEstimator class. NPUEstimator inherits the TensorFlow Estimator class and can call the native APIs of the base class to train and evaluate TensorFlow models.
  • NPUEstimatorSpec Constructor: Constructor of the NPUEstimatorSpec class. NPUEstimatorSpec inherits the TensorFlow EstimatorSpec class and can call the native APIs of the base class to define specific model objects.
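As a concrete illustration of how the npu_config and npu_estimator classes fit together, the following configuration sketch shows an Estimator-mode setup. It is a non-authoritative sketch that assumes an Ascend environment with npu_bridge installed; the parameter values are illustrative, model_fn and input_fn stand for user-defined functions, and the exact constructor signatures should be checked against the NPURunConfig and NPUEstimator reference pages.

```python
# Illustrative configuration sketch only: assumes TensorFlow 1.15 and an
# Ascend environment with npu_bridge installed. Parameter values are
# example assumptions, not recommended settings.
from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator.npu.npu_estimator import NPUEstimator

run_config = NPURunConfig(
    model_dir="/tmp/model",      # checkpoint/summary directory
    save_checkpoints_steps=100,  # how often checkpoints are saved
    iterations_per_loop=10,      # device-side iterations per loop
)

# model_fn is a user-defined function returning an NPUEstimatorSpec.
estimator = NPUEstimator(model_fn=model_fn, config=run_config)
# estimator.train(input_fn=input_fn, max_steps=1000)
```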

npu_strategy
  • NPUStrategy Constructor: Constructs an object of the NPUStrategy class. NPUStrategy inherits the tf.distribute.Strategy class and can call the native APIs of the base class to implement distributed training in the NPU environment.

npu_hook
  • NPUCheckpointSaverHook Constructor: Constructs an object of the NPUCheckpointSaverHook class, which saves checkpoint files. NPUCheckpointSaverHook inherits the CheckpointSaverHook class and can call the native APIs of the base class to record model information during training.
  • NPUOutputTensorHook Constructor: Constructs an object of the NPUOutputTensorHook class, a hook for training, evaluation, and prediction with NPUEstimator. It calls the user-defined output_fn every N steps, or at the end of the run, to print the output tensors. NPUOutputTensorHook inherits the LoggingTensorHook class and can call the native APIs of the base class.
  • TellMeStepOrLossHook Constructor: Constructs an object of the TellMeStepOrLossHook class, which notifies the bottom-layer software of the current step number and the total number of steps, or of the current loss and the target loss.

npu_optimizer
  • NPUDistributedOptimizer Constructor: Constructs an object of the NPUDistributedOptimizer class, which wraps a single-server training optimizer into an NPU distributed training optimizer.
  • NPUOptimizer Constructor: Constructs an object of the NPUOptimizer class, which combines the NPUDistributedOptimizer and NPULossScaleOptimizer optimizers. It provides the following functions:
      - Loss scaling: can be enabled during mixed-precision training to solve the underflow problem caused by the small float16 representation range.
      - Distributed training: with the single-server training optimizer wrapped into an NPU distributed training optimizer, computed gradients can be aggregated in single-server single-device, single-server multi-device, and multi-server multi-device networking modes.
      - Communication-tail optimization: by changing a computation dependency, a computation operation that does not depend on the last AR (gradient aggregation fragment) is scheduled in parallel with the last AR.
  • KerasDistributeOptimizer Constructor: Constructs an object of the KerasDistributeOptimizer class, which wraps a single-server training optimizer constructed with tf.keras into an NPU distributed training optimizer.

npu_distributed_optimizer_wrapper
  • Adds the NPU AllReduce operation to the gradient function of the input optimizer, combines them into one function, and returns the optimizer. This API is used only in distributed scenarios.

npu_allreduce
  • Performs AllReduce and update operations on gradients after gradient computation is complete.
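npu_allreduce itself runs as a collective operation across Ascend devices, but its effect on gradients is easy to picture. The sketch below is a plain-Python illustration of what an averaging AllReduce does to per-worker gradients; the function name and the choice of averaging are illustrative assumptions, not the npu_bridge implementation.

```python
# Plain-Python illustration of an averaging AllReduce over per-worker
# gradients. This models the semantics only; the real npu_allreduce is a
# collective operation executed across devices.
def allreduce_mean(worker_grads):
    """worker_grads: list of per-worker gradient lists of equal length.
    Returns the element-wise mean, which every worker would receive."""
    n_workers = len(worker_grads)
    return [sum(col) / n_workers for col in zip(*worker_grads)]

# Two workers computed different gradients for the same three parameters.
grads_w0 = [0.25, -0.5, 1.0]
grads_w1 = [0.75, -0.25, 0.0]
reduced = allreduce_mean([grads_w0, grads_w1])
print(reduced)  # [0.5, -0.375, 0.5] — identical on every worker afterwards
```

After the reduction, every worker applies the same aggregated gradient, which keeps model replicas synchronized across devices.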

npu_callbacks
  • NPUBroadcastGlobalVariablesCallback Constructor: Broadcasts variables in Keras scenarios to ensure that the initial values of the variables on each device are the same in distributed scenarios.

npu_bridge.estimator.npu.npu_loss_scale_optimizer
  • NPULossScaleOptimizer Constructor: Constructor of the NPULossScaleOptimizer class, used to enable loss scaling in mixed-precision training when the overflow/underflow mode of floating-point computation is saturation mode. Loss scaling solves the underflow problem caused by the small float16 representation range.

npu.npu_loss_scale_manager
  • FixedLossScaleManager Constructor: Constructor of the FixedLossScaleManager class, used to define a static LossScale parameter during training when the overflow/underflow mode of floating-point computation is saturation mode.
  • ExponentialUpdateLossScaleManager Constructor: Constructor of the ExponentialUpdateLossScaleManager class, used to define a dynamic LossScale parameter during training. It dynamically obtains and updates the value of LossScale through the loss_scale variable when the overflow/underflow mode of floating-point computation is saturation mode.
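The underflow problem that loss scaling addresses can be demonstrated without an NPU at all. The sketch below uses Python's half-precision struct format to round values to float16, showing a tiny gradient that vanishes when stored unscaled but survives once the loss (and hence every gradient) is multiplied by a fixed scale, in the spirit of what FixedLossScaleManager configures; the scale value 1024 and the helper name are illustrative assumptions.

```python
import struct

def to_float16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 1e-8      # a tiny gradient from mixed-precision training
scale = 1024.0   # illustrative fixed loss scale

# Unscaled, the gradient underflows to zero in float16 ...
assert to_float16(grad) == 0.0
# ... but scaling the loss scales every gradient by the same factor,
# keeping it representable; dividing afterwards recovers the value.
scaled = to_float16(grad * scale)
assert scaled > 0.0
recovered = scaled / scale
print(recovered)  # approximately 1e-8 again
```

A dynamic manager such as ExponentialUpdateLossScaleManager adjusts this scale during training instead of fixing it up front, raising it while no overflow occurs and lowering it when one does.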

npu_ops
  • dropout: Functionally the same as tf.nn.dropout. With probability keep_prob, each element of the input tensor is kept and scaled by 1/keep_prob; otherwise it is set to 0. The output tensor has the same shape as the input tensor.
  • LARSV2: Scales gradients based on the norm of the weights and the norm of the gradients, using different learning rates at different levels. It improves training accuracy in large-batch-size scenarios and is used in large-scale cluster training to reduce training time.
  • initialize_system: Initializes the system so that GE initialization time is excluded from training time statistics. Generally, this API is not required for training; however, before using the collective communication APIs, call this API to initialize collective communication.
  • shutdown_system: Shuts down all devices. This API is used in conjunction with initialize_system.
  • npu_onnx_graph_op: Loads an ONNX model as an operator and executes it on the Ascend AI Processor through the TensorFlow framework.
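The dropout semantics described above (the same as tf.nn.dropout) can be sketched in plain Python: each element is either zeroed or scaled by 1/keep_prob, so the expected value of the output equals the input. The helper name and the use of the random module are illustrative assumptions, not the npu_ops kernel.

```python
import random

def dropout(xs, keep_prob, rng=random.random):
    """Plain-Python model of the tf.nn.dropout / npu_ops.dropout
    semantics: keep each element with probability keep_prob and scale it
    by 1/keep_prob; set the rest to 0. Output has the input's shape."""
    return [x / keep_prob if rng() < keep_prob else 0.0 for x in xs]

xs = [1.0, 2.0, 3.0, 4.0]
out = dropout(xs, keep_prob=0.5)
# Every surviving element is doubled (1/0.5); dropped ones are 0.
assert all(o == 0.0 or o == x / 0.5 for o, x in zip(out, xs))
```

The 1/keep_prob rescaling is what lets inference simply skip dropout: the training-time output already has the same expectation as the input.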

npu_rnn
  • npu_dynamic_rnn: Creates a high-performance recurrent neural network specified by RNNCell.

npu_dynamic_rnn
  • DynamicRNN Constructor: Used for RNN training and inference with TensorFlow.
  • DynamicGRUV2 Constructor: Used for RNN training and inference with TensorFlow.

npu_scope
  • without_npu_compile_scope: Configures operators to be built on the host in mixed computing scenarios.
  • keep_dtype_scope: Specifies the operators that keep their original precision. If an operator's precision in the original network model is not supported by the Ascend AI Processor, the system automatically computes with the highest precision the operator supports.
  • npu_weight_prefetch_scope: Identifies the operators whose weight data will be prefetched into a buffer pool, and specifies the ID and size of the buffer pool.
  • subgraph_multi_dims_scope: Specifies the scope of the operators to which subgraph-wide dynamic-shape profiles are applied in the online inference scenario.

util
  • set_iteration_per_loop: Sets the number of iterations per training loop in sess.run mode, that is, the number of training iterations executed on the device side per sess.run() call. This saves unnecessary host-device interactions and reduces training time.
  • create_iteration_per_loop_var: Used in conjunction with load_iteration_per_loop_var to set the number of iterations executed on the device side per sess.run() call. This API modifies the graph and creates the iterations-per-loop variable, whose value is then set by load_iteration_per_loop_var.
  • load_iteration_per_loop_var: Used in conjunction with create_iteration_per_loop_var to set the number of iterations executed on the device side per sess.run() call.
  • set_graph_exec_config: Sets the compilation and execution options of a computational graph. After this API is called, the configured attributes are added to the fetch node.
  • keep_tensors_dtypes: Specifies the operators that keep their original precision.
  • set_op_input_tensor_multi_dims: Applies to subgraph-wide dynamic dimension sizes; specifies the input shape of an operator and its dimension-size profiles in the online inference scenario.
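The benefit of set_iteration_per_loop is easy to quantify with a toy cost model: each sess.run() call pays a fixed host-device launch overhead, so running N iterations per loop divides the number of interactions by N. The function and the overhead figure below are illustrative assumptions, not measurements of a real device.

```python
import math

def host_device_interactions(total_steps, iterations_per_loop):
    """Number of sess.run() calls (host-device interactions) needed to
    run total_steps training iterations."""
    return math.ceil(total_steps / iterations_per_loop)

# 1000 training steps: one sess.run() per step vs. 100 steps per loop.
assert host_device_interactions(1000, 1) == 1000
assert host_device_interactions(1000, 100) == 10
# With an illustrative 5 ms launch overhead per interaction, the launch
# overhead drops from 5000 ms to 50 ms; device-side compute is unchanged.
overhead_ms = 5
print(host_device_interactions(1000, 1) * overhead_ms)    # 5000
print(host_device_interactions(1000, 100) * overhead_ms)  # 50
```

The trade-off is that results (for example, the loss) are only returned to the host at loop boundaries, which is why the loop size is configurable rather than simply maximal.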

keras_to_npu
  • model_to_npu_estimator: Converts a model constructed with Keras to an NPUEstimator object.

npu_plugin
  • set_device_sat_mode: Sets the process-level overflow mode for floating-point computation.

profiler
  • Profiler Constructor: Constructor of the Profiler class, used to enable the profiling function locally, for example, to collect profile data of a local subgraph of a TensorFlow network or of a specified step.