APIs

HCCL Python APIs are used to implement framework adaptation in graph mode. Currently, they are used only for distributed optimization of TensorFlow networks on the Ascend AI Processor. The distributed optimizers NPUDistributedOptimizer and npu_distributed_optimizer_wrapper provided by TF Adapter let users complete gradient aggregation automatically, without handling AllReduce themselves, to implement data parallel training. In addition, to give users more flexibility, HCCL provides a set of common APIs for rank management, gradient splitting, and collective communication primitives.
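
For illustration, a minimal sketch of enabling data-parallel training this way, assuming a TF1-style training script and the usual TF Adapter import path (npu_bridge.estimator.npu.npu_optimizer); verify the path against your installation:

```python
import tensorflow as tf
from npu_bridge.estimator.npu.npu_optimizer import npu_distributed_optimizer_wrapper

# Any standard TensorFlow optimizer can be wrapped; the hyperparameters
# here are arbitrary examples.
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)

# The wrapper inserts AllReduce-based gradient aggregation automatically,
# so the training script never calls HCCL collectives directly.
optimizer = npu_distributed_optimizer_wrapper(optimizer)
```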

APIs

Table 1 lists the Python APIs provided by HCCL.

  • The rank management APIs are defined in the api.py file in ${install_path}/python/site-packages/hccl/manage.
  • The gradient splitting APIs are defined in the api.py file in ${install_path}/python/site-packages/hccl/split.
  • The collective communication APIs are defined in the hccl_ops.py file in ${TFPLUGIN_INSTALL_PATH}/npu_bridge/hccl.
Table 1 HCCL (Python) API list

Rank management

  • create_group: Creates a user-defined group for collective communication.
  • destroy_group: Destroys a user-defined group for collective communication.
  • get_rank_size: Obtains the number of ranks (that is, the number of devices) in a group.
  • get_local_rank_size: Obtains the number of ranks of a group that reside on the local server (the server where the calling process runs).
  • get_rank_id: Obtains the rank ID of a device in a group.
  • get_local_rank_id: Obtains the local rank ID of a device in a group.
  • get_world_rank_from_group_rank: Obtains the world rank ID of a process from its rank ID in a user-defined group.
  • get_group_rank_from_world_rank: Obtains the rank ID of a process in a group from its world rank ID.
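
For illustration, a minimal sketch of the rank management APIs, assuming an initialized HCCL environment (ranktable configured) and the import path hccl.manage.api from the file locations above; the group name and rank list are arbitrary examples, and the create_group signature shown here should be verified against api.py:

```python
from hccl.manage.api import (create_group, destroy_group, get_local_rank_id,
                             get_rank_id, get_rank_size)

world_size = get_rank_size()      # number of devices in hccl_world_group
world_rank = get_rank_id()        # this process's rank in hccl_world_group
local_rank = get_local_rank_id()  # this process's rank on its own server

# Build a user-defined sub-group from the first two world ranks,
# assuming the signature create_group(group, rank_num, rank_ids).
create_group("pair_group", 2, [0, 1])
# ... run collectives over "pair_group" ...
destroy_group("pair_group")
```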

Gradient splitting

  • set_split_strategy_by_idx: Sets the backward gradient splitting strategy of a collective communication group by gradient index ID, enabling AllReduce fusion to optimize collective communication performance.
  • set_split_strategy_by_size: Sets the backward gradient splitting strategy of a collective communication group by proportion of the gradient data, enabling AllReduce fusion to optimize collective communication performance.
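
For illustration, a minimal sketch of both strategies, assuming the import path hccl.split.api from the file locations above; the index and percentage values are arbitrary examples, and only one strategy would normally be set per group:

```python
from hccl.split.api import set_split_strategy_by_idx, set_split_strategy_by_size

# Fuse backward gradients into three AllReduce segments, cutting after
# the gradients with index 19, 100, and 159 (model-specific examples).
set_split_strategy_by_idx([19, 100, 159])

# Alternative: split into three segments holding roughly 60%, 20%, and
# 20% of the total gradient data.
set_split_strategy_by_size([60, 20, 20])
```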

Collective communication

  • allreduce: Performs a reduction operation (specified by the reduction parameter) on the input data of all ranks in a group and writes the result to the output buffer of every rank. This API maps to the AllReduce collective communication operator.
  • allgather: Gathers the inputs of all ranks in the communicator, ordered by rank ID, and sends the combined result to the outputs of all ranks.
  • broadcast: Broadcasts the data of the root rank in the communicator to all other ranks.
  • reduce_scatter: Performs a sum (or another reduction operation) on the inputs of all ranks, then scatters the result evenly to the ranks' output buffers by rank ID. Each rank receives a 1/rank_size share of the reduced data.
  • reduce: Performs a sum (or another reduction operation) on the data of all ranks and writes the result to the specified position on the root rank.
  • alltoallv: Sends data of customizable sizes to all ranks in the communicator and receives data from all ranks.
  • alltoallvc: Like alltoallv, sends data of customizable sizes to all ranks in the communicator and receives data from all ranks. Because alltoallvc passes the send and receive counts of all ranks through the send_count_matrix argument, it performs better than alltoallv.
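
For illustration, a minimal sketch that adds an AllReduce node to a TensorFlow graph, assuming the import path under ${TFPLUGIN_INSTALL_PATH}/npu_bridge/hccl listed above and a configured ranktable; it must run inside an NPU training session:

```python
import tensorflow as tf
from npu_bridge.hccl import hccl_ops

local_tensor = tf.constant([1.0, 2.0, 3.0])

# Sum local_tensor across all ranks of the default global group;
# after execution, every rank holds the same reduced result.
summed = hccl_ops.allreduce(local_tensor, "sum")
```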

Point-to-point communication

  • send: Sends data to a specified rank within a collective communication group.
  • receive: Receives data from a specified rank within a collective communication group.
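
For illustration, a hypothetical sketch that pairs a send on rank 0 with a matching receive on rank 1. The parameter names (sr_tag, dest_rank, src_rank) follow the attributes of the underlying HcomSend/HcomReceive operators and are an assumption; check them against hccl_ops.py before use:

```python
import tensorflow as tf
from hccl.manage.api import get_rank_id
from npu_bridge.hccl import hccl_ops

rank_id = get_rank_id()
payload = tf.constant([1.0, 2.0, 3.0])

# sr_tag identifies a send/receive pair; both sides must use the same tag.
if rank_id == 0:
    hccl_ops.send(payload, sr_tag=0, dest_rank=1)           # assumed signature
elif rank_id == 1:
    received = hccl_ops.receive(payload.shape, payload.dtype,
                                sr_tag=0, src_rank=0)       # assumed signature
```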

Concepts

group

Indicates the process groups that participate in collective communication. The process groups include:

  • hccl_world_group: the default global group, which includes all ranks that participate in collective communication. This group is created from the ranktable file.
  • User-defined group: a subset of hccl_world_group. Ranks listed in the ranktable file can be assigned to different groups through the create_group API, allowing collective communication operations to run in parallel across groups.

rank

A communication entity in a group. Each rank is assigned a unique ID ranging from 0 to n – 1, where n is the number of NPUs.

rank size

  • Rank size: indicates the number of ranks in a group.
  • Local rank size: indicates the number of ranks in a group on the server where the processes are located.

rank id

  • Rank ID: indicates the ID of a process in a group. The value ranges from 0 to (rank size – 1). For a user-defined group, rank IDs start from 0 within that group. For hccl_world_group, the rank ID is identical to the world rank ID.
  • World rank ID: indicates the rank ID of a process in hccl_world_group. The value ranges from 0 to (rank size – 1).
  • Local rank ID: indicates the rank ID of a process in a group among the group's processes on its local server. The value ranges from 0 to (local rank size – 1).
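
To make these rank ID notions concrete, a hypothetical sketch for an 8-device job in which world ranks 4 to 7 form a user-defined group (the group name is an arbitrary example; the API signatures should be verified against api.py):

```python
from hccl.manage.api import (create_group, get_group_rank_from_world_rank,
                             get_rank_id)

# 8 devices in hccl_world_group; put world ranks 4..7 into a sub-group.
create_group("rear_half", 4, [4, 5, 6, 7])

world_rank = get_rank_id()  # world rank ID, 0..7
if world_rank >= 4:
    # Within "rear_half", world rank 4 maps to group rank 0,
    # world rank 5 to group rank 1, and so on.
    group_rank = get_group_rank_from_world_rank("rear_half", world_rank)
```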