Overview

In deep learning, as datasets and model parameters grow, so do the time and hardware resources required for training, creating a potential bottleneck. Distributed training is a popular way to relieve this bottleneck because it lowers the memory and compute requirements placed on each device: a training job is partitioned and distributed across multiple Ascend AI Processors for improved training efficiency. The distributed workers exchange and summarize information (for example, gradients) via collective communication.
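The "exchange and summarize" step is typically gradient aggregation: each device computes gradients on its own data shard, then an AllReduce averages them so every device applies the same weight update. A minimal pure-Python sketch of that data-parallel pattern (ranks are simulated as lists; no real framework or HCCL calls are involved):

```python
def allreduce_mean(per_rank_grads):
    """Simulate AllReduce-with-mean: every rank ends up holding the
    elementwise average of all ranks' gradients."""
    n = len(per_rank_grads)
    dim = len(per_rank_grads[0])
    avg = [sum(g[i] for g in per_rank_grads) / n for i in range(dim)]
    # After the collective, all ranks hold identical averaged gradients.
    return [list(avg) for _ in range(n)]

# Two simulated devices, each with gradients from its own data shard.
local_grads = [[1.0, 3.0], [3.0, 5.0]]
synced = allreduce_mean(local_grads)
# Every rank now holds [2.0, 4.0], so subsequent weight updates stay identical.
```

Because every rank applies the same averaged gradient, the model replicas never drift apart; this is the invariant the NPU AllReduce operations described below maintain.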

Precautions

Before porting distributed training scripts, complete the code adaptation for basic processes such as data preprocessing, model building, and training execution by referring to Using Single-Server Single-Device Scripts.

Distributed APIs Supported by TF Adapter

In TensorFlow, tf.distribute.Strategy is generally used for distributed training. For details, visit https://www.tensorflow.org/guide/distributed_training. The Ascend AI Processor does not currently support these distribution strategies, so the original TensorFlow training script must be modified to support distributed training on the Ascend AI Processor.

TF Adapter supports the following distributed APIs:

  • npu_distributed_optimizer_wrapper: Combines the native TensorFlow gradient optimizer and NPU AllReduce operations into a single wrapper, so that gradient calculation and cross-device aggregation are performed together.
  • npu_allreduce: Performs AllReduce and update operations on gradients after gradient calculation completes, for scenarios where the original TensorFlow script calls a gradient calculation interface directly, for example, tf.gradients.
  • Communicator management APIs: Include create_group, destroy_group, get_rank_size, and get_rank_id. For details, see HCCL API (Python).
  • Collective communication APIs: Include allreduce, allgather, broadcast, reduce_scatter, reduce, alltoallv, and alltoallvc. For details, see HCCL API (Python).
  • Point-to-point communication APIs: Include send and receive. For details, see HCCL API (Python).
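The semantics of the collective communication APIs listed above can be illustrated with simulated ranks. The helpers below are illustrative sketches only, not the HCCL API: they model each rank's buffer as a list and show what each collective leaves on every rank after it completes.

```python
def allreduce(chunks):
    # Every rank receives the elementwise sum of all ranks' data.
    total = [sum(c[i] for c in chunks) for i in range(len(chunks[0]))]
    return [list(total) for _ in chunks]

def allgather(chunks):
    # Every rank receives the concatenation of all ranks' data.
    gathered = [x for c in chunks for x in c]
    return [list(gathered) for _ in chunks]

def reduce_scatter(chunks):
    # Sum like allreduce, then each rank keeps only its own slice of the result.
    total = [sum(c[i] for c in chunks) for i in range(len(chunks[0]))]
    size = len(total) // len(chunks)
    return [total[r * size:(r + 1) * size] for r in range(len(chunks))]

def broadcast(chunks, root=0):
    # The root rank's data is copied to every rank.
    return [list(chunks[root]) for _ in chunks]

ranks = [[1, 2], [3, 4]]
# allreduce(ranks)      -> every rank holds [4, 6]
# allgather(ranks)      -> every rank holds [1, 2, 3, 4]
# reduce_scatter(ranks) -> rank 0 holds [4], rank 1 holds [6]
# broadcast(ranks)      -> every rank holds rank 0's [1, 2]
```

Note that reduce_scatter followed by allgather reproduces allreduce, which is how ring-style AllReduce implementations are commonly decomposed.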