Link-setup timeout (EI0006)
HCCL link setup is bounded by the timeout configured through the HCCL_CONNECT_TIMEOUT environment variable. If the peer does not respond to a link-setup request within this window, a "socket timeout" error is reported. In addition, if the remote end exits because of a timeout or another fault, links that were already established may also report "recv fail" while waiting for data exchange.
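The window can be adjusted per job by exporting the variable before launch; a minimal sketch (the value 600 is illustrative, the default is 120 seconds):

```shell
# Raise the HCCL link-setup timeout for this job before launching it.
# 600 is an illustrative value; the default is 120 seconds.
export HCCL_CONNECT_TIMEOUT=600
echo "HCCL_CONNECT_TIMEOUT=${HCCL_CONNECT_TIMEOUT}"
```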
Symptoms
The CANN logs contain the keyword "wait socket establish timeout", for example:
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:04.447.996 [hccl_socket_manager.cc:816] [3977663][Wait][LinkEstablish]wait socket establish timeout, role[0] rank[0] timeout[120 s]
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:04.448.799 [hccl_socket_manager.cc:880] [3977663][Wait][LinksEstablishCompleted] is failed. ret[9].
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:04.448.942 [hccl_socket_manager.cc:642] [3977663] _________________________LINK_ERROR_INFO___________________________
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:04.448.947 [hccl_socket_manager.cc:643] [3977663] | comm error, device[0]
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:04.448.950 [hccl_socket_manager.cc:645] [3977663] | dest_ip(user_rank) | dest_port | src_ip(user_rank) | src_port | MyRole | Status | TlsStatus |
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:04.448.954 [hccl_socket_manager.cc:647] [3977663] |----------------------|---------------|----------------------|--------------|------------|------------|----------------|
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:04.448.967 [hccl_socket_manager.cc:599] [3977663] | 192.168.2.198(1) | 16666 | 192.168.2.199(0) | 0 | server | time out | DISABLE |
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:04.448.974 [hccl_socket_manager.cc:855] [3977663][Create][Sockets]Wait links establish completed failed, local role is client. ret[9]
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:04.448.980 [transport_manager.cc:962] [3977663][SetMachinePara]call trace: hcclRet -> 9
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:04.448.987 [transport_manager.cc:839] [3977663][CreateLink]errNo[0x0000000005000009]SetMachinePara error.
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:04.449.290 [detect_connect_anomalies.cc:79] [3977663]-------------------CONNECT TIMEOUT DETECT RESULT-----------------------
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:04.449.296 [detect_connect_anomalies.cc:80] [3977663]if BELOW DETECT EVENT num ≥ 1:
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:04.449.299 [detect_connect_anomalies.cc:82] [3977663] The error above was caused by a failure at the site in the cluster where the events happened.Please confirm whether the link between SRCRANK and DSTRANK or the DSTRANK process is healthy.
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:04.449.303 [detect_connect_anomalies.cc:83] [3977663]if BELOW DETECT EVENT num = 0:
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:04.449.306 [detect_connect_anomalies.cc:84] [3977663] please prioritize investigating the consistency of cluster script behaviors.
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:04.449.309 [detect_connect_anomalies.cc:87] [3977663]NOTE: The detection results are only used to assist in locating the problem and may not represent the actual fault site in some complex scenarios. Please continue to analyze and confirm based on the current detected fault site(if BELOW DETECT EVENT num ≥ 1).
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:04.449.312 [detect_connect_anomalies.cc:88] [3977663]-------------------------DETECT EVENT LIST-----------------------------------------
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:04.449.315 [detect_connect_anomalies.cc:89] [3977663]-----------------------------------------------------------------------------------
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:25.551.918 [detect_connect_anomalies.cc:529] [3978631]DETECT EVENT[1]:Rank[127.10.0.1/0]: srcRank[127.10.0.1/0] connect destRank[127.10.0.1/1] fail.
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:25.551.938 [detect_connect_anomalies.cc:530] [3978631]-----------------------------------------------------------------------------------------------------
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:25.551.947 [detect_connect_anomalies.cc:462] [3978631][CreateClientConnect]GetStatus fail, ret[9]
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:34.466.044 [transport_manager.cc:886] [3977663][Create][DestLink]Transport init error! createLink para:rank[0]-localUserrank[0]-localIpAddr[127.10.0.1/0], remoteRank[1]-remoteUserrank[1]-remoteIpAddr[127.10.0.1/1], machineType[0], linkMode[1], isUsedRdma[0], tag[AllReduce_127.10.0.1%enp_60000_0_1763804103901731]
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:34.466.312 [transport_manager.cc:460] [3977573][Create]errNo[0x0000000005000006] transport create fail in thread, local rank[0] remote rank[1]
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:34.567.220 [hccl_communicator_host.cc:5777] [3977573][AllocAlgResource]Alloc transports failed, tag[AllReduce_127.10.0.1%enp_60000_0_1763804103901731ringAllReduceMeshSmallCountExecutor_host]
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:34.567.234 [hccl_communicator_host.cc:3991] [3977573][HcclCommunicator][ExecOp] AllocAlgResource failed, algName=[AllReduceMeshSmallCountExecutor]
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:34.567.259 [hccl_communicator_host.cc:2706] [3977573][AllReduceOutPlace]call trace: hcclRet -> 6
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:34.567.280 [hccl_comm.cc:278] [3977573][HcclComm][AllReduce_90.91.65.186%enp189s0f0_60000_0_1763804103901731]errNo[0x0000000000000006] index[0]
[ERROR] HCCL(3977573,all_reduce_test):2025-11-22-17:37:34.567.290 [op_base_host.cc:153] [3977573][HcclAllReduce]call trace: hcclRet -> 6
Identify the link peer to investigate from the logs
- If the error log contains a "DETECT EVENT LIST" printout, focus first on the failed link pairs it lists. In the example above, the "DETECT EVENT[1]" entry shows a link-setup failure between device 0 and device 1 on node 127.10.0.1, so the root cause of that link failure should be investigated first.
- If the error log contains no "DETECT EVENT LIST" printout, obtain the device IPs of the two link endpoints from the "LINK_ERROR_INFO" table in the error log, and obtain the node information of the local and remote ends, in the format [hostIp/deviceId], from the key log line "Transport init error! createLink para:", as shown below:
grep -r "Transport init error! createLink para:"
[ERROR] HCCL(3215542,all_reduce_test):2025-11-20-18:18:03.114.306 [transport_manager.cc:886] [3215599][Create][DestLink]Transport init error! createLink para:rank[2]-localUserrank[2]-localIpAddr[127.10.0.1/2], remoteRank[1]-remoteUserrank[1]-remoteIpAddr[127.10.0.1/1], machineType[1], linkMode[1], isUsedRdma[0], tag[AllReduce_127.10.0.1%enp_60000_0_1763633852475745
- localUserrank: local rank number
- localIpAddr: local node IP information
- remoteUserrank: remote rank number
- remoteIpAddr: remote node IP information
- tag: communication operator identifier
Once the endpoints of the failed link have been identified, combine the CANN logs from both ends for further analysis.
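The [hostIp/deviceId] endpoints can be pulled out of a "createLink para" line with standard text tools; a minimal sketch (the sample line is abridged from the log above):

```shell
# Extract the local and remote [hostIp/deviceId] endpoints from a
# "createLink para" error line (sample abridged from the log above).
line='Transport init error! createLink para:rank[2]-localUserrank[2]-localIpAddr[127.10.0.1/2], remoteRank[1]-remoteUserrank[1]-remoteIpAddr[127.10.0.1/1]'
local_ep=$(printf '%s\n' "$line" | sed -n 's/.*localIpAddr\[\([^]]*\)\].*/\1/p')
remote_ep=$(printf '%s\n' "$line" | sed -n 's/.*remoteIpAddr\[\([^]]*\)\].*/\1/p')
echo "local=$local_ep remote=$remote_ep"
```

On real logs, feed the output of the grep shown above into the same sed expressions.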
Check the peer's behavior for inconsistency across devices
Parameter-plane link setup is a two-sided handshake: both ends must issue their setup requests within the timeout window for the link to be created, otherwise the wait times out and an error is reported. Therefore, locate the peer node information in the local error messages and examine the peer's logs to make a further judgment:

Check 1:
If the peer has no error logs at all, it may never have launched the corresponding communication operator, so the local end cannot receive the peer's link-setup request and eventually times out.
Check at the workload level whether the two ends launch communication operators consistently.
Check 2:
If the peer reported errors other than a parameter-plane link-setup timeout, investigate the peer's errors first.
Check 3:
If the peer also reported a parameter-plane link-setup timeout, but its error messages show it was setting up a link with some other node rather than with the local end, first investigate the peer's link-setup timeout following this same procedure.
Check 4:
If the peer also timed out while setting up a link with the local end, first check whether the gap between the two ends' error times exceeds the link-setup wait time. If it does, investigate at the workload level why communication operator launch diverged by more than the timeout between the two ends.
The link-setup wait time is specified by HCCL_CONNECT_TIMEOUT and defaults to 120 seconds. You can query the timeout configured for the current job by running grep -r "HCCL_CONNECT_TIMEOUT" run/plog/ in the CANN log run directory.
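The comparison in check 4 can be sketched as follows. The two timestamps are illustrative stand-ins for the first link-setup error on each end, and GNU `date -d` is assumed:

```shell
# Compare the first link-setup error times on the two ends against the
# configured timeout. Timestamps are illustrative; GNU `date -d` is assumed.
t_local="2025-11-22-17:37:04"   # first error on the local end
t_peer="2025-11-22-17:40:30"    # first error on the peer end (hypothetical)
timeout_s=120                   # HCCL_CONNECT_TIMEOUT default

# Convert CANN's YYYY-MM-DD-HH:MM:SS stamps to epoch seconds.
to_epoch() { date -d "$(printf '%s\n' "$1" | sed 's/-\([0-9][0-9]:\)/ \1/')" +%s; }

gap=$(( $(to_epoch "$t_peer") - $(to_epoch "$t_local") ))
gap=${gap#-}   # absolute value
if [ "$gap" -gt "$timeout_s" ]; then
  echo "gap ${gap}s exceeds timeout: check why operator launch diverged"
else
  echo "gap ${gap}s within timeout: check network connectivity next"
fi
```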
Check 5:
If the two ends' parameter-plane link-setup timeouts occurred within the timeout window of each other, further check the network connectivity between them:
- Check whether the TLS switches on the two ends are consistent. If they differ, socket creation fails its verification and both ends time out. The TLS switch on each end can be confirmed as follows:
- In the LINK_ERROR_INFO table of the error log, the TlsStatus column shows the TLS state of the current device: UNKNOWN means the state could not be obtained, DISABLE means TLS is off, and ENABLE means it is on.
- Run grep -r "TLS SWITCH" log/run/device-* in the node's logs to obtain the TLS state:
run/device-0/device-2849330_20251024153927364.log:[INFO] HCCP(2988,hccp_service.bin):2025-10-24-15:39:26.133.826 [rs_ssl.c:1529]tid:2988,rs_ssl_init(1529) : TLS SWITCH (1)
run/device-1/device-2849331_20251024153928174.log:[INFO] HCCP(30877,hccp_service.bin):2025-10-24-15:39:25.142.466 [rs_ssl.c:1529]tid:30877,rs_ssl_init(1529) : TLS SWITCH (0)
- Query the node's TLS configuration with the hccn_tool utility: for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch
# for i in {0..1}; do hccn_tool -i $i -tls -g ; done | grep switch
dev_id:0, tls switch[0](0:disable, 1:enable), tls preconfigured[1](0:non-preset, 1:preset), tls alarm time threshold[60]days
dev_id:1, tls switch[1](0:disable, 1:enable), tls preconfigured[1](0:non-preset, 1:preset), tls alarm time threshold[60]days
- If the two link endpoints are on different nodes, check the network connectivity between the local and remote device NICs. Use the hccn_tool command on one node to ping the other node's device IP:
hccn_tool -i {node} -ping -g address {peer ip}
If the two ranks cannot ping each other, or a NIC port is down, contact the lab administrator to check the configuration of the corresponding NIC and switch.
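To spot a TLS mismatch across devices quickly, the `tls switch[...]` values in hccn_tool output can be compared mechanically. A sketch that parses sample text mimicking the output shown above (on a live node, populate `output` with the real hccn_tool loop instead):

```shell
# Flag a TLS switch mismatch across devices. `output` mimics the
# hccn_tool output shown above; on a real node, populate it with:
#   output=$(for i in {0..7}; do hccn_tool -i $i -tls -g ; done | grep switch)
output='dev_id:0, tls switch[0](0:disable, 1:enable), tls preconfigured[1](0:non-preset, 1:preset)
dev_id:1, tls switch[1](0:disable, 1:enable), tls preconfigured[1](0:non-preset, 1:preset)'

switches=$(printf '%s\n' "$output" | sed -n 's/.*tls switch\[\([01]\)\].*/\1/p' | sort -u)
if [ "$(printf '%s\n' "$switches" | wc -l)" -gt 1 ]; then
  echo "TLS switch mismatch across devices: align them before retrying"
else
  echo "TLS switch consistent: $switches"
fi
```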
Notes:
- The default threshold for raising a detection-failure event on a faulty link is 20 s. It can be adjusted through the "connection_fault_detction_time" field of the HCCL_DFS_CONFIG environment variable; setting it to 0 disables the feature. In very large clusters, or when severe inter-device desynchronization is present, this value may need to be increased to keep the detection results accurate.
- In some complex workloads, link-setup timeouts and execution timeouts may occur within a single job, and several hops through the detection results may be needed to reach the fault site. Use the logs of each detected node to confirm whether the root node has been reached: the root node typically has other errors, has no abnormal logs at all, or is mutually timing out with other ranks.