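After the HCCL operator retry feature is enabled (through the environment variable HCCL_OP_RETRY_ENABLE), link setup fails at run time and the ERROR-level log contains the message "[OpRetryConnection][RecvAckTag] Recv unmatched ack". A sample log follows: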
[OpRetryConnection][RecvAckTag] Recv unmatched ack [1381978191] expect [64]
call trace: hcclRet -> 4
[OpRetryConnection][Init] Connect to server failed, serverIp_[90.90.97.29] serverPort [60000]
[OpRetryConnection][Init] group[hccl_world_group] rankId [9] rankSize [64] serverIp [90.90.97.29] localIp [90.90.97.29] rootRank [0] failed
There maybe some reasons to cause this error:
1. The port may have been used so we will bind error. OpRetry used port range [60000-60015]
2. Somebody may have already listen on those ports, so we connect to wrong server and we will meet: 'Recv unmatched ack' error
You may can set system reserved port to avoid this error by sysctl -w net ipv4.ip_local_reserved_ports=60000-60015
call trace: hcclRet -> 4
call trace: hcclRet
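The log above comes from a 64-card scenario in which link setup for the communicator failed, and one of the processes (rank 9 here) hit the "Recv unmatched ack" error while connecting.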
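When many processes write logs, it can help to list which ranks reported the failed OpRetry connection. The following is a minimal sketch that assumes the log lines have the same shape as the sample above; the function name `failed_ranks` and the log path in the usage comment are hypothetical.

```shell
# Read HCCL ERROR log text on stdin and print the rank IDs whose
# OpRetry connection initialisation failed, one per line, sorted.
# Assumes lines of the form:
#   [OpRetryConnection][Init] group[...] rankId [N] ... failed
failed_ranks() {
    grep -F '[OpRetryConnection][Init] group[' \
        | sed -n 's/.*rankId \[\([0-9][0-9]*\)\].*/\1/p' \
        | sort -un
}

# Usage (the path is a placeholder for wherever the logs are collected):
#   cat /path/to/hccl/logs/*.log | failed_ranks
```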
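The usual cause is that HCCL communicates on its default ports and one of those ports is already in use, so HCCL ends up connected to the wrong server. As the log suggests, reserving the default port range in the operating system avoids this: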
sysctl -w net.ipv4.ip_local_reserved_ports=60000-60015
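To confirm that something else is already listening in the range before reserving it, the output of `ss` can be filtered by port. This is a sketch, not part of HCCL: the helper name `listeners_in_range` is hypothetical, and it assumes the `ss -Htln` column layout from iproute2, where the fourth column is the local address and port.

```shell
# Read `ss -Htln`-style lines on stdin and print those whose local port
# lies within [first, last] (both inclusive).
#   $1 = first port, $2 = last port
listeners_in_range() {
    awk -v lo="$1" -v hi="$2" '{
        n = split($4, parts, ":")   # 4th column: Local Address:Port
        port = parts[n] + 0         # last ":"-separated field is the port
        if (port >= lo && port <= hi) print
    }'
}

# Usage on a real host (add -p, as root, to also show the owning process):
#   ss -Htln | listeners_in_range 60000 60015
```

Any line printed is a listener inside the OpRetry range reported in the log.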
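HCCL can also be moved to a different port range through the environment variable HCCL_IF_BASE_PORT, with that range reserved instead:
# For example, have HCCL use 16 consecutive ports starting at port 17777
export HCCL_IF_BASE_PORT=17777
# Reserve the 16 ports 17777-17792
sysctl -w net.ipv4.ip_local_reserved_ports=17777-17792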
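The reserved range has to match the base port exactly: 16 consecutive ports starting at the base, so the last port is base + 15. The sketch below derives the range from HCCL_IF_BASE_PORT and prints the matching `sysctl` command rather than running it. The helper name `port_range` is hypothetical, and the fallback of 60000 is an assumption taken from the port range shown in the sample log.

```shell
# Print "first-last" for `count` consecutive ports starting at `base`.
#   $1 = base port, $2 = count
port_range() {
    echo "$1-$(( $1 + $2 - 1 ))"
}

# Fall back to 60000 when HCCL_IF_BASE_PORT is unset (assumed default,
# based on the range reported in the log).
base="${HCCL_IF_BASE_PORT:-60000}"
range="$(port_range "$base" 16)"

# Print the command for review; running it requires root.
echo "sysctl -w net.ipv4.ip_local_reserved_ports=$range"
```

Note that `sysctl -w` replaces the current value of `net.ipv4.ip_local_reserved_ports` and does not persist across reboots, so any ranges already reserved need to be included in the new value.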