Configuring HCCL Link Setup
If a fault occurs while HCCL links are being set up, process-level rescheduling and process-level online recovery fail. If your training job sets up HCCL links in phases other than training initialization, you can set up those links in advance, during initialization, so that no fault occurs while a link is being set up later in training.
PyTorch Single-Operator Scenario
In the PyTorch single-operator scenario, HCCL links are set up in lazy loading mode. After a Torch communication group is created, the first communication operator dispatched to that group triggers the creation of the HCCL communicator, and the inter-card links are set up once the communicator has been created. Therefore, to ensure that all communicators have their links set up during training initialization, you must dispatch a communication operator to each communication group at that stage.
Example of creating a communication group and dispatching an operator to it:
import torch
import torch.distributed as dist
import torch_npu # Ascend Extension for PyTorch; makes the 'npu' device available.
rank = 0 # Set the rank of the current process.
sub_ranks = [0, 1, 2] # Ranks that belong to communication group X (ranks 0, 1, and 2 in this example).
groupX = dist.new_group(ranks=sub_ranks,...) # Create communication group X.
test_tensor = torch.ones(1).to(f'npu:{rank}') * (rank + 1) # Build a test tensor on this rank's NPU.
dist.all_reduce(test_tensor, op=dist.ReduceOp.SUM, group=groupX) # Dispatch AllReduce to group X; as the group's first operator, it triggers HCCL communicator creation and link setup.
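When a job uses several communication groups, the same idea can be applied to each of them in a loop. The following is a minimal sketch, not an official API: warm_up_groups and its probe parameter are hypothetical names introduced here for illustration. It assumes the process group has already been initialized, that each group was created with torch.distributed.new_group, and that every member rank walks the groups in the same order.

```python
from typing import Any, Callable, Dict


def _torch_all_reduce_probe(group: Any, device: str) -> float:
    """Run one tiny SUM all_reduce on `group` and return the reduced value."""
    # Imported lazily so the loop below can be exercised without PyTorch.
    import torch
    import torch.distributed as dist

    probe = torch.ones(1, device=device)
    dist.all_reduce(probe, op=dist.ReduceOp.SUM, group=group)
    return probe.item()


def warm_up_groups(
    groups: Dict[str, Any],
    device: str,
    probe: Callable[[Any, str], float] = _torch_all_reduce_probe,
) -> Dict[str, float]:
    """Dispatch one collective to every group so its communicator is created now.

    `groups` maps a label chosen by the caller to a communication group.
    `probe` is a hypothetical hook; the default issues a real all_reduce.
    """
    results = {}
    # Sort by label so every rank visits the groups in the same order.
    for name in sorted(groups):
        results[name] = probe(groups[name], device)
    return results
```

A call such as warm_up_groups({"dp": dp_group, "tp": tp_group}, f"npu:{rank}") would be placed at the end of training initialization, where dp_group and tp_group stand for whatever groups the job creates. Because each member rank contributes a tensor of ones, the value returned for a group on a member rank equals the size of that group, which gives a simple check that the collective completed.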