HCCL Cluster Communication Failure in Multi-Server Training
Symptom
The HCCL cluster communication failed.
Possible Cause
- The NPU device IP addresses of multiple nodes cannot ping each other.
- The TLS configurations of NPU devices on multiple nodes are different.
- Other causes
Solution
- Check whether the NPU device IP addresses of multiple nodes can ping each other. The following uses a two-node cluster (nodes A and B) with eight devices on each node as an example.
- Query the device IP address of node A.
for i in {0..7}; do hccn_tool -i $i -ip -g; done
- Ping the device IP address of node A on node B.
hccn_tool -i 0 -ping -g address 192.x.x.x
In this command, 192.x.x.x is the device IP address of rank 0 on node A, and the 0 after -i selects the rank 0 device on node B as the source of the ping.
If the command output contains "0.00% packet loss", the IP address is reachable. Otherwise, check the network configuration.
If the device IP address is an IPv6 address, the commands for querying and pinging the device IP address differ from the IPv4 commands above. The following is an example:
- Query the device IP address:
for i in {0..7}; do hccn_tool -i $i -ip -inet6 -g; done
- Ping a specified device IP address:
hccn_tool -i 0 -ping -inet6 -g ipv6_address x:x:x:x
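The ping check can be scripted so that every device pair is tested in one pass. The following is a minimal sketch, not an official tool: it uses only the hccn_tool ping invocation shown in this section, and the helper names ping_ok and check_peer and the PEER_IPS variable are hypothetical. It assumes the ping output contains a phrase of the form "0.00% packet loss", as described above.

```shell
#!/usr/bin/env bash
# Sketch under stated assumptions: helper names and PEER_IPS are illustrative,
# and the output phrase "N.NN% packet loss" is taken from the text above.

# Succeed only if the text on stdin reports exactly 0.00% packet loss.
# The leading non-digit guard stops "100.00% packet loss" or
# "10.00% packet loss" from matching as a substring.
ping_ok() {
    grep -Eq '(^|[^0-9])0\.00% packet loss'
}

# Ping peer device IP $2 from local device $1 and print a one-line verdict.
# Returns non-zero if the peer is not reachable without loss.
check_peer() {
    local dev="$1" ip="$2"
    if hccn_tool -i "$dev" -ping -g address "$ip" 2>&1 | ping_ok; then
        echo "device $dev -> $ip: OK"
    else
        echo "device $dev -> $ip: FAILED"
        return 1
    fi
}

# Example usage on node B, with PEER_IPS holding node A's device IP
# addresses in device order (fill in the addresses queried on node A):
#   PEER_IPS=(192.x.x.x ...)
#   for i in "${!PEER_IPS[@]}"; do check_peer "$i" "${PEER_IPS[$i]}"; done
```

A plain substring search for "0.00% packet loss" would also match "100.00% packet loss", which is why the sketch checks the character before the figure.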
- Check whether the TLS configurations of the NPU devices on multiple nodes are the same.
Run the following command on the two nodes to check the configurations:
for i in {0..7}; do hccn_tool -i $i -tls -g | grep switch; done
If they are different, modify the TLS configuration.

For details about how to set the TLS status and modify certificate information, see "HCCL Initialization Fails Due to Inconsistent TLS Information on Servers Involved in Collective Communication" in the HCCL User Guide.
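Comparing the TLS switch lines by eye across 16 devices is error-prone, so the check can be captured to a file on each node and compared with diff. The following is a minimal sketch: it relies only on the hccn_tool -tls -g query shown above, the function names dump_tls_switch and compare_tls and the report file names are hypothetical, and it assumes the "switch" lines contain no node-specific text that would differ even when the settings match.

```shell
#!/usr/bin/env bash
# Sketch under stated assumptions: function and file names are illustrative.

# Print the "switch" line(s) of the TLS query for devices 0..7,
# each prefixed with the device index so the reports line up.
dump_tls_switch() {
    local i
    for i in 0 1 2 3 4 5 6 7; do
        hccn_tool -i "$i" -tls -g | grep switch | sed "s/^/dev $i: /"
    done
}

# Compare two captured reports. Prints the differing lines (via diff)
# and returns non-zero if the two nodes disagree.
compare_tls() {
    if diff "$1" "$2"; then
        echo "TLS switch settings match"
    else
        echo "TLS switch settings differ" >&2
        return 1
    fi
}

# Example usage:
#   on node A:  dump_tls_switch > tls_nodeA.txt
#   on node B:  dump_tls_switch > tls_nodeB.txt
#   after copying both files to one node:
#               compare_tls tls_nodeA.txt tls_nodeB.txt
```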
- For other collective communication issues, see "FAQs" in the HCCL User Guide.