EJ0004 Error Printed to the Screen

Symptom

The training job fails to start and reports error EJ0004. In the logs, the first error reported by HCCP is as follows:

[ERROR] HCCP(1609,python3):2023-10-19-12:33:13.313.594 [ra_host.c:544]tid:1609,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(-16) the IP address(29.4.84.36) in the ranktable is inconsistent with the IP(29.4.76.85)address of the network adapter, please make sure they're consistent. num(1)

Search the logs for the related errors: grep -rn ERROR | grep HCCP | grep inconsistent

The search shows that, on the current node, the IP addresses in the ranktable do not match the device IP addresses, as shown below.
plog/plog-1601_20231019123251864.txt:21:[ERROR] HCCP(1601,python3):2023-10-19-12:33:11.645.514 [ra_host.c:544]tid:1601,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(192) the IP address(29.4.47.87) in the ranktable is inconsistent with the IP(29.4.76.155)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1590_20231019123251766.txt:21:[ERROR] HCCP(1590,python3):2023-10-19-12:33:11.313.520 [ra_host.c:544]tid:1590,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(160) the IP address(29.4.159.125) in the ranktable is inconsistent with the IP(29.4.67.170)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1612_20231019123251802.txt:21:[ERROR] HCCP(1612,python3):2023-10-19-12:33:11.365.521 [ra_host.c:544]tid:1612,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(161) the IP address(29.4.156.176) in the ranktable is inconsistent with the IP(29.4.69.45)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1585_20231019123251689.txt:21:[ERROR] HCCP(1585,python3):2023-10-19-12:33:11.377.587 [ra_host.c:544]tid:1585,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(-162) the IP address(29.4.59.254) in the ranktable is inconsistent with the IP(29.4.0.99)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1593_20231019123251798.txt:21:[ERROR] HCCP(1593,python3):2023-10-19-12:33:11.785.581 [ra_host.c:544]tid:1593,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(-80) the IP address(29.4.187.47) in the ranktable is inconsistent with the IP(29.4.132.121)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1609_20231019123251900.txt:21:[ERROR] HCCP(1609,python3):2023-10-19-12:33:13.313.594 [ra_host.c:544]tid:1609,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(-16) the IP address(29.4.84.36) in the ranktable is inconsistent with the IP(29.4.76.85)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1598_20231019123251821.txt:21:[ERROR] HCCP(1598,python3):2023-10-19-12:33:11.735.192 [ra_host.c:544]tid:1598,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(64) the IP address(29.4.146.67) in the ranktable is inconsistent with the IP(29.4.184.7)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1606_20231019123251824.txt:21:[ERROR] HCCP(1606,python3):2023-10-19-12:33:11.525.543 [ra_host.c:544]tid:1606,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(96) the IP address(29.4.21.78) in the ranktable is inconsistent with the IP(29.4.85.223)address of the network adapter, please make sure they're consistent. num(1)
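
When many processes report the same error, it can help to reduce each log line to the process ID and the two conflicting addresses. The following is a minimal sketch that assumes the log lines keep the exact wording shown above; the names `MISMATCH_RE`, `IpMismatch`, and `parse_mismatch` are hypothetical helpers introduced here for illustration and are not part of any Ascend tooling.

```python
import re
from typing import NamedTuple, Optional

# Pattern based on the HCCP message wording shown in the logs above.
# Assumption: the message text is stable across versions.
MISMATCH_RE = re.compile(
    r"HCCP\((?P<pid>\d+),.*?"
    r"the IP address\((?P<ranktable_ip>[\d.]+)\) in the ranktable "
    r"is inconsistent with the IP\((?P<adapter_ip>[\d.]+)\)"
)


class IpMismatch(NamedTuple):
    pid: int            # process ID printed after "HCCP("
    ranktable_ip: str   # address configured in the ranktable
    adapter_ip: str     # address actually set on the network adapter


def parse_mismatch(line: str) -> Optional[IpMismatch]:
    """Return the mismatch described by one log line, or None if absent."""
    m = MISMATCH_RE.search(line)
    if m is None:
        return None
    return IpMismatch(int(m["pid"]), m["ranktable_ip"], m["adapter_ip"])


if __name__ == "__main__":
    sample = (
        "[ERROR] HCCP(1609,python3):2023-10-19-12:33:13.313.594 "
        "[ra_host.c:544]tid:1609,ra_rdev_init_check_ip(544) : [check][ip]fail, "
        "ret(-16) the IP address(29.4.84.36) in the ranktable is inconsistent "
        "with the IP(29.4.76.85)address of the network adapter, "
        "please make sure they're consistent. num(1)"
    )
    print(parse_mismatch(sample))
```

The same function can be applied to each line of the grep output, since `re.search` ignores the file-name prefix that grep adds.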

Cause Analysis

This error is usually caused by an incorrect service configuration.

Solution

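Check the ranktable configuration for the following problems:
  • Duplicate rank IDs exist.

    If any rank ID is duplicated, correct it so that every rank ID is globally unique.

  • A configured device IP address does not match the actual IP address of the device's network adapter.

    Run the following command to query the device IP addresses:

    for i in `seq 0 7`; do echo "===================> dev$i, NPU$((i+1))"; hccn_tool -i $i -ip -g; done

    If they do not match, change the configured device IP address to the actual address of the network adapter.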
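
Both checks can be scripted. The sketch below assumes the ranktable is a JSON file laid out as a `server_list` of servers, each with a `device` list whose entries carry `device_id`, `device_ip`, and `rank_id`; adjust the field names if your ranktable version differs. The functions `find_duplicate_rank_ids` and `find_ip_mismatches` are hypothetical helpers, the addresses in `SAMPLE` are made up for illustration, and the actual device IPs are expected to be copied in by hand from the `hccn_tool` output rather than parsed automatically.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Made-up ranktable fragment (assumed layout) with a deliberately duplicated rank_id.
SAMPLE = {
    "server_list": [
        {
            "server_id": "192.0.2.1",
            "device": [
                {"device_id": "0", "device_ip": "192.0.2.10", "rank_id": "0"},
                {"device_id": "1", "device_ip": "192.0.2.11", "rank_id": "0"},
            ],
        }
    ]
}


def find_duplicate_rank_ids(ranktable: dict) -> List[str]:
    """Return every rank_id that appears more than once across all servers."""
    counts = Counter(
        dev["rank_id"]
        for server in ranktable.get("server_list", [])
        for dev in server.get("device", [])
    )
    return sorted(rank_id for rank_id, n in counts.items() if n > 1)


def find_ip_mismatches(
    server: dict, actual_ips: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Compare one server's configured device IPs with the queried ones.

    actual_ips maps device_id to the IP reported for that device.
    Returns (device_id, configured_ip, actual_ip) for each mismatch;
    devices missing from actual_ips are skipped.
    """
    mismatches = []
    for dev in server.get("device", []):
        actual = actual_ips.get(dev["device_id"])
        if actual is not None and actual != dev["device_ip"]:
            mismatches.append((dev["device_id"], dev["device_ip"], actual))
    return mismatches


if __name__ == "__main__":
    print("duplicate rank ids:", find_duplicate_rank_ids(SAMPLE))
    queried = {"0": "192.0.2.10", "1": "192.0.2.99"}  # values typed in by hand
    print("ip mismatches:", find_ip_mismatches(SAMPLE["server_list"][0], queried))
```

To check a real file, load it with `json.load` and pass the resulting dictionary in place of `SAMPLE`.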