create_group

Description

Creates a user-defined group for collective communication.

If no user-defined group is created by calling this API, all devices involved in cluster training belong to the global group hccl_world_group by default.

A group is the set of processes that participate in collective communication. There are two kinds of groups:
  • hccl_world_group: default global group (created by HCCL automatically), including all ranks that participate in collective communication.
  • Customized groups: subsets of the processes contained in hccl_world_group.

Prototype

def create_group(group, rank_num, rank_ids)

Parameters

Parameter | Input/Output | Description
--- | --- | ---
group | Input | A string containing a maximum of 128 bytes, including the end character. Group name, which is the identifier of a collective communication group. The group name cannot be the default global group name hccl_world_group. If the group name specified by the user is hccl_world_group, the group fails to be created.
rank_num | Input | An int. Number of ranks in a group. The maximum value is 32768.
rank_ids | Input | A list. List of world_rank_ids that form the group.

Different types of boards have different restrictions.

For the Atlas Training Series Product:
  • In single-server scenarios, rank_ids must meet the following requirements:

    The number of ranks must be 1, 2, 4, or 8. Devices 0-3 and devices 4-7 form two separate networks. If the number of ranks is 2 or 4, the selected Ascend AI Processors must all belong to the same network.

  • In server cluster scenarios, rank_ids must meet the following requirements:
    • The number of ranks selected for each server must be the same (the number must be 1, 2, 4, or 8).
    • When the number of ranks for each server is 2 or 4, the selected Ascend AI Processors must belong to the same network (devices 0-3 or devices 4-7). That is, the remainders of the rank IDs divided by 8 must all be less than 4, or all be greater than or equal to 4.

    Example:

    Assume that a group is created for three servers. The rank IDs of the three servers are as follows:

    {0,1,2,3,4,5,6,7}

    {8,9,10,11,12,13,14,15}

    {16,17,18,19,20,21,22,23}

    Valid values of rank_ids include:

    rank_ids=[1,9,17]

    rank_ids=[1,2,9,10,17,18]

    rank_ids=[4,5,6,7,12,13,14,15,20,21,22,23]

Supplementary notes:

It is recommended that rank_ids be sorted based on the physical connection sequence of devices, that is, devices that are physically close to each other are arranged together. For example, if device_ip is set in ascending order based on the physical connection sequence, you are advised to set rank_ids in ascending order.
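
The parameter restrictions above for the Atlas Training Series Product can be sketched as a pre-check that runs before create_group is called. This is only an illustrative sketch and not part of the HCCL API: check_create_group_args and its constants are hypothetical names. It assumes that each server holds 8 consecutively numbered ranks (as in the example above), that rank_num is expected to equal the length of rank_ids, and that the "same network" rule applies to all rank IDs in the list taken together.

```python
WORLD_GROUP = "hccl_world_group"
MAX_GROUP_NAME_BYTES = 128  # limit stated above, including the end character
MAX_RANK_NUM = 32768        # limit stated above
DEVICES_PER_SERVER = 8      # assumption: 8 consecutively numbered ranks per server
ALLOWED_PER_SERVER = {1, 2, 4, 8}


def check_create_group_args(group, rank_num, rank_ids):
    """Hypothetical pre-check; returns a list of problems (empty if none found)."""
    problems = []

    if group == WORLD_GROUP:
        problems.append("group must not be hccl_world_group")
    # One byte is reserved for the end character.
    if len(group.encode("utf-8")) + 1 > MAX_GROUP_NAME_BYTES:
        problems.append("group name is longer than 128 bytes")

    if rank_num > MAX_RANK_NUM:
        problems.append("rank_num must not exceed 32768")
    # Assumption: rank_num is expected to equal the length of rank_ids.
    if rank_num != len(rank_ids):
        problems.append("rank_num does not match len(rank_ids)")

    # Count how many ranks each server contributes.
    per_server = {}
    for rank_id in rank_ids:
        server = rank_id // DEVICES_PER_SERVER
        per_server[server] = per_server.get(server, 0) + 1

    counts = set(per_server.values())
    if len(counts) > 1:
        problems.append("every server must contribute the same number of ranks")
    elif counts and not counts <= ALLOWED_PER_SERVER:
        problems.append("ranks per server must be 1, 2, 4, or 8")

    # With 2 or 4 ranks per server, all remainders modulo 8 must fall in one half.
    if counts & {2, 4}:
        in_lower_half = {rank_id % DEVICES_PER_SERVER < 4 for rank_id in rank_ids}
        if len(in_lower_half) > 1:
            problems.append("ranks mix devices 0-3 with devices 4-7")

    return problems


# One of the valid rank_ids values listed above, then a list that mixes both networks.
print(check_create_group_args("myGroup", 6, [1, 2, 9, 10, 17, 18]))
print(check_create_group_args("myGroup", 2, [3, 4]))
```

Returning a list of problems rather than raising on the first one makes it easier to report every violated restriction at once. The real API may apply further checks that this sketch does not cover.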

Returns

None

Constraints

  • This API must be called after the initialization of collective communication is complete.
  • The calling rank must belong to the group being created by this API call. Otherwise, the API call fails.
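
The second constraint can be handled by guarding the call so that only member ranks attempt to create the group. The sketch below reads the constraint as meaning that the calling rank's ID must appear in rank_ids. The wrapper create_group_if_member is a hypothetical helper, not an HCCL API. It takes the creation function as an argument, so the demonstration uses a stand-in function in place of the real create_group.

```python
def create_group_if_member(create_group_fn, my_rank_id, group, rank_ids):
    """Call create_group_fn only if my_rank_id is listed in rank_ids.

    Returns True if the call was made on this rank, False if it was skipped.
    """
    if my_rank_id not in rank_ids:
        return False
    create_group_fn(group, len(rank_ids), rank_ids)
    return True


# Stand-in for the real create_group so the sketch runs without NPU hardware.
calls = []


def fake_create_group(group, rank_num, rank_ids):
    calls.append((group, rank_num, list(rank_ids)))


created_on_rank_2 = create_group_if_member(fake_create_group, 2, "myGroup", [0, 1, 2, 3])
created_on_rank_5 = create_group_if_member(fake_create_group, 5, "myGroup", [0, 1, 2, 3])
print(created_on_rank_2, created_on_rank_5, calls)
```

In a real job, create_group_fn would be create_group itself, called after collective communication has been initialized, and my_rank_id would be the world rank ID of the current process.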

Applicability

Atlas Training Series Product

Example

The following is only a code snippet and cannot be executed on its own. For details about how to call the HCCL Python APIs to perform collective communication, see Sample Code.

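# Import the NPU bridge package, through which create_group is made available.
from npu_bridge.npu_init import *
# Create a group named "myGroup" that contains the four ranks 0-3.
create_group("myGroup", 4, [0, 1, 2, 3])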