HcclCreateSubCommConfig

Applicability

Product

Supported

Atlas A3 training products / Atlas A3 inference products

Atlas A2 training products / Atlas A2 inference products

Atlas 200I/500 A2 inference products

Atlas inference products

Atlas training products

For Atlas A2 training products / Atlas A2 inference products, only the Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit, and Atlas 200T A2 Box16 heterogeneous subrack are supported.

Description

Splits an existing global communicator into sub-communicators with specific configurations.

A sub-communicator created this way requires no socket link setup or rank information exchange, so this API can be used to quickly re-create a communicator after a service fault.

If the load is unbalanced across devices on the network, link setup for a sub-communicator created with this API may time out because the devices do not communicate in step with each other. In this case, set the environment variable HCCL_CONNECT_TIMEOUT to a larger value to extend the timeout for link setup between devices. For example:

export HCCL_CONNECT_TIMEOUT=600

Prototype

HcclResult HcclCreateSubCommConfig(HcclComm *comm, uint32_t rankNum, uint32_t *rankIds, uint64_t subCommId, uint32_t subCommRankId, HcclCommConfig *config, HcclComm *subComm)

Parameters

Parameter

Input/Output

Description

comm

Input

Global communicator to be split.

For details about the definition of the HcclComm type, see HcclComm.

rankNum

Input

Number of ranks in the sub-communicator to be created.

rankIds

Input

Array of the global-communicator rank IDs of the ranks that make up the sub-communicator.

Note: The array should be ordered. The index of each rank in the array becomes its rank ID in the sub-communicator.

subCommId

Input

ID of the current sub-communicator, which is user-defined.

  • If the sub-communicator name hcclCommName is not configured in the config parameter, the system uses {Global communicator name}_sub_{subCommId} as the sub-communicator name. In this case, ensure that the value of subCommId is unique in the global communicator.
  • If the sub-communicator name hcclCommName is configured in the config parameter, the name in config is used and subCommId is no longer verified.

subCommRankId

Input

Rank ID of the current rank in the sub-communicator.

Set this parameter to the index of the current rank in the rankIds array.

config

Input

Communicator configuration options, including the buffer size, the deterministic computing switch, the communicator name, and the location where the communication algorithm orchestration is expanded. Each configuration parameter must fall within its valid value range. For details on the parameters and their priorities, see HcclCommConfig.

Note that the input config must be initialized by calling HcclCommConfigInit first.

subComm

Output

Pointer to the initialized sub-communicator.

For details about the definition of the HcclComm type, see HcclComm.
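
The relationship between rankIds, subCommRankId, and the default sub-communicator name can be sketched with two small helpers. This is only an illustration of the rules in the table above, not part of the HCCL API: the function names FindSubCommRankId and DefaultSubCommName are hypothetical, and the global communicator name is assumed to be available to the caller as a string.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not an HCCL function): returns the index of
// globalRank in rankIds, which is the value to pass as subCommRankId.
// Returns -1 if the rank is not a member of the sub-communicator.
int64_t FindSubCommRankId(const std::vector<uint32_t>& rankIds, uint32_t globalRank) {
    for (size_t i = 0; i < rankIds.size(); ++i) {
        if (rankIds[i] == globalRank) {
            return static_cast<int64_t>(i);
        }
    }
    return -1;
}

// Hypothetical helper (not an HCCL function): builds the name described
// for the case where hcclCommName is not set in config, that is
// {Global communicator name}_sub_{subCommId}.
std::string DefaultSubCommName(const std::string& globalName, uint64_t subCommId) {
    return globalName + "_sub_" + std::to_string(subCommId);
}
```

For example, with rankIds = {4, 5, 6, 7}, the rank whose ID in the global communicator is 6 would pass subCommRankId = 2.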

Returns

HcclResult: HCCL_SUCCESS on success; any other value indicates failure.

Constraints

  • When ranks in the same sub-communicator call this API, the rankNum, rankIds, subCommId, and config parameters passed must be the same.
  • For ranks that do not need to create a sub-communicator, pass rankIds=nullptr and subCommId=0xFFFFFFFF. In this scenario, the subCommId parameter is not verified.
  • Sub-communicators can only be generated by splitting the global communicator. Sub-communicators cannot be further split.
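
The first two constraints can be sketched as a per-rank argument selection step that runs before the API call. This is a minimal sketch under stated assumptions: SubCommArgs, MakeSubCommArgs, and kNoSubCommId are hypothetical names, not HCCL definitions, and the values 0 used for rankNum and subCommRankId on non-participating ranks are placeholders chosen for illustration, since this page does not specify them.

```cpp
#include <cstdint>
#include <vector>

// Value this page gives for subCommId on ranks that do not create a sub-communicator.
constexpr uint64_t kNoSubCommId = 0xFFFFFFFF;

// Hypothetical container for the rank-dependent arguments of HcclCreateSubCommConfig.
struct SubCommArgs {
    uint32_t rankNum;
    uint32_t* rankIds;       // nullptr when this rank does not participate
    uint64_t subCommId;
    uint32_t subCommRankId;
    bool participates;
};

// Hypothetical helper: every member rank receives the same rankNum, rankIds,
// and subCommId; only subCommRankId differs (its index in members).
// The returned rankIds pointer refers to members' storage, so members must
// outlive any use of the result.
SubCommArgs MakeSubCommArgs(std::vector<uint32_t>& members, uint64_t subCommId,
                            uint32_t globalRank) {
    for (uint32_t i = 0; i < members.size(); ++i) {
        if (members[i] == globalRank) {
            return {static_cast<uint32_t>(members.size()), members.data(),
                    subCommId, i, true};
        }
    }
    // Non-participating rank: rankIds = nullptr and subCommId = 0xFFFFFFFF.
    // rankNum and subCommRankId are placeholder zeros (assumption).
    return {0, nullptr, kNoSubCommId, 0, false};
}
```

Each rank would then pass the selected fields, together with an identical config, to HcclCreateSubCommConfig.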

Example

// Initialize the global communicator.
HcclComm globalHcclComm;
HcclCommInitClusterInfo(rankTableFile, devId, &globalHcclComm);
// Configure the communicator.
HcclCommConfig config;
HcclCommConfigInit(&config);
config.hcclBufferSize = 50;
strcpy(config.hcclCommName, "comm_1");
// Create the sub-communicator.
HcclComm hcclComm;
uint32_t rankIds[4] = {0, 1, 2, 3};  // Global rank IDs of the ranks in the sub-communicator.
// The current rank is at index 0 of rankIds, so its rank ID in the sub-communicator is 0.
HcclCreateSubCommConfig(&globalHcclComm, 4, rankIds, 1, 0, &config, &hcclComm);