Parameter Consistency Check (EI0005)
Symptom
The screen log contains the keyword "The arguments for collective communication are inconsistent between ranks", as shown below:
EI0005: 2024-04-24-06:32:27.781.599 The arguments for collective communication are inconsistent between ranks:tag[HcomAllReduce_6629421139219749105_0], parameter[count], local[16512], remote [8320]
Solution: Check whether the training script and ranktable of each NPU are consistent.
TraceBack (most recent call last):
Transport init error. Reason: [Create] [DestLink]Create Dest error! createLink para:rank[5]-localUserrank[4]-localIpAddr[127.10.0.1], dst_rank[6]-remoteUserrank[7]-remote_ip_addr[127.10.0.1]
Transport init error. Reason: [Create] [DestLink]Create Dest error! createLink para:rank[5]-localUserrank[4]-localIpAddr[127.10.0.1], dst_rank[4]-remoteUserrank[5]-remote_ip_addr[127.10.0.1]
call hccl op:HcomAllReduce(HcomAllReduce) load task fail[FUNC:Distribute][FILE:hccl_task_info.cc] [LINE:329]
[[{[node Ge0p3_0]}]]
Alternatively, the CANN log contains the keyword "CMD information *** check fail", as shown below:
[ERROR] HCCL(3743927,all_reduce_test):2025-10-25-16:11:16.831.640 [rank_consistentcy_checker.cc:429] [3743951][RankConsistentcyChecker][ReportCmdInfoCheckFailed]CMD information tag check fail. local[AllGather_127.10.0.1%enp_60000_0_1761379874757928], remote[AllReduce_127.10.0.1%enp_60000_0_1761379874757928]
[ERROR] HCCL(3743927,all_reduce_test):2025-10-25-16:11:16.831.666 [rank_consistentcy_checker.cc:439] [3743951][RankConsistentcyChecker][ReportCmdInfoCheckFailed]CMD information cmdType check fail. local[6], remote[2]
[ERROR] HCCL(3743927,all_reduce_test):2025-10-25-16:11:16.831.679 [rank_consistentcy_checker.cc:439] [3743951][RankConsistentcyChecker][ReportCmdInfoCheckFailed]CMD information op check fail. local[255], remote[0]
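Root Cause
During link setup on the parameter plane, once the socket has been established, the two ends check that their parameters are consistent. The check covers the operator identifier (tag), operator type (opType), data volume (dataSize), HCCL buffer size (cclbufferSize), data type (datatype), and so on. The error message shows which item is inconsistent. For example, in the CANN log example above, the operator identifier (tag) differs between the two ends, so the communication operator fails the consistency check during link setup. The local and remote fields hold the mismatched values of the two ends.
The two nodes whose parameters are inconsistent can be identified from the "Transport init error! createLink para:" error log. For example, run grep -r "Transport init error! createLink para:" and check the result, which looks like this: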
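To list the mismatched items quickly, the check-fail lines can be extracted with a small script. The following is a minimal sketch, not an official tool: the function name find_mismatches and both regular expressions are assumptions derived only from the two log formats shown in the symptom section, and they may need adjusting if the log format differs in other versions.

```python
import re

# Assumed format (from the CANN log sample):
#   CMD information <field> check fail. local[<value>], remote[<value>]
CMD_CHECK_RE = re.compile(
    r"CMD information (?P<field>\w+) check fail\.\s*"
    r"local\[(?P<local>[^\]]*)\],\s*remote\s*\[(?P<remote>[^\]]*)\]"
)

# Assumed format (from the EI0005 screen log sample):
#   tag[<tag>], parameter[<field>], local[<value>], remote [<value>]
EI0005_RE = re.compile(
    r"tag\[(?P<tag>[^\]]*)\],\s*parameter\[(?P<field>[^\]]*)\],\s*"
    r"local\[(?P<local>[^\]]*)\],\s*remote\s*\[(?P<remote>[^\]]*)\]"
)


def find_mismatches(log_text):
    """Return (field, local, remote) tuples for every mismatch found in log_text.

    Matches of the CANN log format are listed first, followed by matches
    of the EI0005 screen log format.
    """
    results = []
    for pattern in (CMD_CHECK_RE, EI0005_RE):
        for m in pattern.finditer(log_text):
            results.append((m.group("field"), m.group("local"), m.group("remote")))
    return results
```

Each returned tuple names the inconsistent item together with the local and remote values, which is the information needed to start comparing what the two ranks delivered.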
[ERROR] HCCL(3215542,all_reduce_test):2025-11-20-18:18:03.114.306 [transport_manager.cc:886] [3215599][Create][DestLink]Transport init error! createLink para:rank[2]-localUserrank[2]-localIpAddr[127.10.0.1/2], remoteRank[1]-remoteUserrank[1]-remoteIpAddr[127.10.0.1/1], machineType[1], linkMode[1], isUsedRdma[0], tag[AllReduce_127.10.0.1%enp_60000_0_1763633852475745
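- localUserrank: rank ID of the local end.
- localIpAddr: node IP address of the local end.
- remoteUserrank: rank ID of the remote end.
- remoteIpAddr: node IP address of the remote end.
- tag: communication operator identifier.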
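When many such lines are returned, the endpoint fields described above can be pulled out programmatically. The sketch below is only an illustration: the function name parse_create_link and the regular expression are assumptions based on the single CANN log line shown above (the remoteRank/remoteIpAddr layout), and the variant seen in the screen log (dst_rank/remote_ip_addr) is not covered.

```python
import re

# Assumed layout, taken from the sample CANN log line:
#   createLink para:rank[..]-localUserrank[..]-localIpAddr[..],
#   remoteRank[..]-remoteUserrank[..]-remoteIpAddr[..], ...
LINK_RE = re.compile(
    r"createLink para:rank\[(?P<rank>\d+)\]"
    r"-localUserrank\[(?P<local_user_rank>\d+)\]"
    r"-localIpAddr\[(?P<local_ip>[^\]]*)\],\s*"
    r"remoteRank\[(?P<remote_rank>\d+)\]"
    r"-remoteUserrank\[(?P<remote_user_rank>\d+)\]"
    r"-remoteIpAddr\[(?P<remote_ip>[^\]]*)\]"
)


def parse_create_link(line):
    """Return a dict of endpoint fields, or None if the line does not match."""
    m = LINK_RE.search(line)
    return m.groupdict() if m else None
```

Grouping the parsed results by local and remote rank makes it easier to see which pair of ranks failed the check before comparing the operators they delivered.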
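Solution
Based on the error information, investigate at the service level why the two ends that failed the parameter consistency check delivered inconsistent operators. As the EI0005 message itself suggests, start by checking whether the training script and ranktable of each NPU are consistent.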