API List

TF Adapter provides APIs for users to develop training or online inference scripts based on the deep learning framework TensorFlow 1.15.

Figure 1 TF Adapter

API path: ${TFPLUGIN_INSTALL_PATH}/python/site-packages/npu_bridge
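As a minimal sketch of how the API path above might be made importable from a script: the helper below builds the site-packages directory from the install root and puts it on `sys.path`. It assumes that `TFPLUGIN_INSTALL_PATH` is exported as an environment variable (the source only uses it as a placeholder), and the function names are hypothetical, not part of TF Adapter.

```python
import os
import sys


def npu_bridge_site_packages(install_path=None):
    """Return the site-packages directory expected to contain npu_bridge."""
    # Assumption: TFPLUGIN_INSTALL_PATH is set in the environment when no
    # explicit install path is passed in.
    root = install_path or os.environ.get("TFPLUGIN_INSTALL_PATH")
    if not root:
        raise RuntimeError("TFPLUGIN_INSTALL_PATH is not set")
    return os.path.join(root, "python", "site-packages")


def add_to_sys_path(site_packages):
    """Put the directory on sys.path once and report whether it holds npu_bridge."""
    if site_packages not in sys.path:
        sys.path.insert(0, site_packages)
    return os.path.isdir(os.path.join(site_packages, "npu_bridge"))
```

If `add_to_sys_path` returns `False`, the directory does not contain an `npu_bridge` package, which usually means the install root is wrong.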

**Table 1** TF Adapter APIs
Category	API	Description
Session configuration	Session Configuration	TF Adapter provides a series of session configurations for function debugging, performance improvement, and precision improvement. Developers can use these session configurations when performing model training or online inference on the Ascend AI Processor.
npu.npu_config	NPURunConfig Constructor	When performing model training or online inference in Estimator mode on the Ascend AI Processor, you can use the constructor of the NPURunConfig class to specify the running configuration of the Estimator.
	ProfilingConfig Constructor	Configures the profiling function.
	MemoryConfig Constructor	Configures the system memory usage mode.
	DumpConfig Constructor	Configures the dump function.
	ExperimentalConfig Constructor	Configures extended parameters for debugging. This API may change in later versions and is not supported for use in commercial products.
npu.npu_estimator	NPUEstimator Constructor	Constructor of the NPUEstimator class. The NPUEstimator class inherits the Estimator class of TensorFlow and can call the native APIs of the base class for the training, evaluation, and inference of TensorFlow models.
npu.npu_estimator	NPUEstimatorSpec Constructor	Constructor of the NPUEstimatorSpec class. The NPUEstimatorSpec class inherits the EstimatorSpec class of TensorFlow and can call the native APIs of the base class to define specific model objects.
npu_strategy	NPUStrategy Constructor	Constructs an object of class NPUStrategy. NPUStrategy inherits the tf.distribute.Strategy class and can call the native APIs of the base class to implement distributed training in the NPU environment.
npu_hook	NPUCheckpointSaverHook Constructor	Constructs an object of class NPUCheckpointSaverHook, which is used to save the checkpoint file. The NPUCheckpointSaverHook class inherits the CheckpointSaverHook class and can call the native APIs of the base class to record model information during training.
	NPUOutputTensorHook Constructor	Constructs an object of class NPUOutputTensorHook. NPUOutputTensorHook is a hook for training, evaluation, and prediction of NPUEstimator, and it can call the user-defined output_fn every N steps or at the end to print the output tensors. The NPUOutputTensorHook class inherits the LoggingTensorHook class and can call native APIs of the base class.
	TellMeStepOrLossHook Constructor	Constructs an object of class TellMeStepOrLossHook, which notifies the underlying software of either the index of the current step and the total number of steps, or the current loss and the target loss.
npu_optimizer	NPUDistributedOptimizer Constructor	Constructs an object of class NPUDistributedOptimizer, which wraps a single-server training optimizer into an NPU distributed training optimizer.
	NPUOptimizer Constructor	Constructs an object of class NPUOptimizer, which combines the NPUDistributedOptimizer and NPULossScaleOptimizer optimizers. It provides three functions. (1) Loss scaling: can be enabled during mixed precision training to counter the underflow caused by the narrow representable range of float16. (2) Distributed training: the single-server training optimizer is wrapped into an NPU distributed training optimizer, so computed gradients can be aggregated in single-server single-device, single-server multi-device, and multi-server multi-device networking modes. (3) Communication tailing optimization: the computation dependencies are changed so that computation that does not depend on the last AR (gradient aggregation fragment) runs in parallel with the last AR.
	KerasDistributeOptimizer Constructor	Constructs an object of class KerasDistributeOptimizer, which wraps a single-server training optimizer built with tf.keras into an NPU distributed training optimizer.
	npu_distributed_optimizer_wrapper	Adds the NPU AllReduce operation to the gradient computation function of the input optimizer, merges the two into one function, and returns the optimizer. This API is used only in distributed scenarios.
	npu_allreduce	Performs AllReduce and update operations on gradients after the gradient computing is complete.
npu_callbacks	NPUBroadcastGlobalVariablesCallback Constructor	Broadcasts variables in Keras scenarios to ensure that the initial values of variables on each device are the same in distributed scenarios.
npu_bridge.estimator.npu.npu_loss_scale_optimizer	NPULossScaleOptimizer Constructor	Constructs an object of class NPULossScaleOptimizer, which is used to enable loss scaling in mixed precision training when the overflow/underflow mode of floating-point computation is saturation mode. Loss scaling counters the underflow caused by the narrow representable range of float16.
npu.npu_loss_scale_manager	FixedLossScaleManager Constructor	Constructs an object of class FixedLossScaleManager, which is used to define the static LossScale parameter during training when the overflow/underflow mode of floating-point computation is saturation mode.
npu.npu_loss_scale_manager	ExponentialUpdateLossScaleManager Constructor	Constructs an object of class ExponentialUpdateLossScaleManager, which is used to define the dynamic LossScale parameter during training and dynamically obtain and update the value of LossScale by defining the loss_scale variable when the overflow/underflow mode of floating-point computation is saturation mode.
npu_ops	dropout	Has the same functionality as tf.nn.dropout. Elements of the input tensor are randomly set to zero with a probability of 1 - keep_prob, and the remaining elements are scaled by a factor of 1/keep_prob so that the expected sum is unchanged. The output tensor has the same shape as the input tensor.
	LARSV2	Scales gradients layer by layer, using a different learning rate for each layer based on the norm of its weights and the norm of its gradients. It improves training precision in large batch size scenarios and reduces training time in large-scale cluster training.
	initialize_system	Generally, this API is not required for training. Call it when the GE initialization time should be excluded from the training time statistics. It must also be called before any collective communication API is used, to initialize collective communication.
	shutdown_system	Disables all devices. This API is used in conjunction with initialize_system.
	npu_onnx_graph_op	Loads an ONNX model as an operator and executes it on the Ascend AI Processor through the TensorFlow framework.
npu_rnn	npu_dynamic_rnn	Creates a high-performance neural network specified by RNNCell.
npu_dynamic_rnn	DynamicRNN Constructor	Used for RNN training and inference with TensorFlow.
npu_dynamic_rnn	DynamicGRUV2 Constructor	Used for RNN training and inference with TensorFlow.
npu_scope	without_npu_compile_scope	Marks the operators that are built and run on the host, rather than compiled for the NPU, in mixed computing scenarios.
	keep_dtype_scope	Specifies the operators that preserve the original precision. If the operator precision in the original network model is not supported by the Ascend AI Processor, the system automatically computes in a higher precision that the operators support.
	npu_weight_prefetch_scope	Identifies the operators whose weight data will be prefetched into a buffer pool and specifies the ID and size of the buffer pool.
	subgraph_multi_dims_scope	Specifies the scope of the operator for which subgraph-wide dynamic shape profiles are to be applied in the online inference scenario.
util	set_iteration_per_loop	Sets the number of iterations per training loop in sess.run mode, that is, the number of training iterations executed on the device side in each sess.run() call. This reduces host-device interactions and shortens training time.
	create_iteration_per_loop_var	Used with load_iteration_per_loop_var to set the number of training iterations executed on the device side in each sess.run() call. This API modifies the graph; the number of iterations per loop is then set through load_iteration_per_loop_var.
	load_iteration_per_loop_var	Used with create_iteration_per_loop_var to set the number of training iterations executed on the device side in each sess.run() call.
	set_graph_exec_config	Sets the compilation and execution options for a computational graph. After this API is called, configured attributes are added to the fetch node.
	keep_tensors_dtypes	Specifies the operators that preserve the original precision.
	set_op_input_tensor_multi_dims	Applies to subgraph-wide dynamic shape profiles. Specifies the input shape of the operator and shape profiles.
keras_to_npu	model_to_npu_estimator	Converts the model constructed by using Keras to an NPUEstimator object.
npu_plugin	set_device_sat_mode	Sets the process-level overflow/underflow mode for floating-point computation.
scoped_graph_manager	ScopedGraphManager	Unloads the variable initialization graph in one go and releases the memory held by constant nodes in the graph.
profiler	Profiler Constructor	Constructs an object of the Profiler class, which is used to enable the profiling function locally. For example, you can collect the profile data of a local subgraph on the TensorFlow network or a specified step.
hccl.hccl_ops	allreduce	Performs the reduction operation on the input data of all ranks in a group and sends the result to the output buffer of every rank. The reduction operation type is specified by the reduction parameter. This API invokes the AllReduce collective communication operator.
	allgather	Orders the inputs of all ranks in the communicator by rank ID, concatenates them, and sends the result to the output of every rank.
	broadcast	Broadcasts the data of the root rank to other ranks in the communicator.
	reduce_scatter	Performs the sum operation (or another reduction operation) on the inputs of all ranks, and then distributes the result evenly to the output buffers of the ranks according to their rank IDs. Each process receives a 1/rank_size portion of the data from the other processes for reduction.
	reduce	Performs the sum operation (or other reduction operations) on the data of all ranks and sends the result to the specified position on the root rank.
	send	Sends data to a rank within a collective communication group.
	receive	Receives data from a rank within a collective communication group.
	alltoallv	Sends data (with the customized data size) to all ranks in the collective communicator and receives data from all ranks.
	alltoallvc	Sends data (with the customized data size) to all ranks in the collective communicator and receives data from all ranks. alltoallvc passes the send and receive parameters of all ranks through the send_count_matrix argument and therefore performs better than alltoallv.
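
The semantics of the collective operations listed under hccl.hccl_ops can be illustrated with plain Python lists, one list per rank. This is a host-only sketch of what each operation computes, not the HCCL implementation and not the signatures of the `hccl_ops` functions; the function names and the `op` and `root` parameters below are illustrative. It assumes that every rank contributes a list of the same length and, for `reduce_scatter`, that this length is divisible by the number of ranks.

```python
import operator
from functools import reduce as fold


def allreduce(inputs, op=operator.add):
    """Every rank receives the element-wise reduction over all ranks."""
    reduced = [fold(op, column) for column in zip(*inputs)]
    return [list(reduced) for _ in inputs]


def allgather(inputs):
    """Every rank receives all inputs concatenated in rank-ID order."""
    gathered = [x for rank_input in inputs for x in rank_input]
    return [list(gathered) for _ in inputs]


def broadcast(inputs, root=0):
    """Every rank receives a copy of the root rank's data."""
    return [list(inputs[root]) for _ in inputs]


def reduce_scatter(inputs, op=operator.add):
    """Reduce element-wise, then give rank r the r-th 1/rank_size slice."""
    rank_size = len(inputs)
    reduced = [fold(op, column) for column in zip(*inputs)]
    chunk = len(reduced) // rank_size
    return [reduced[r * chunk:(r + 1) * chunk] for r in range(rank_size)]


# Two ranks, each holding two values.
ranks = [[1, 2], [3, 4]]
print(allreduce(ranks))       # [[4, 6], [4, 6]]
print(allgather(ranks))       # [[1, 2, 3, 4], [1, 2, 3, 4]]
print(broadcast(ranks, 1))    # [[3, 4], [3, 4]]
print(reduce_scatter(ranks))  # [[4], [6]]
```

The index of each inner list plays the role of the rank ID, which is why `allgather` preserves rank order and `reduce_scatter` hands slice r to rank r.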