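EJ0004 Error Printed to the Screen
Symptom
The training job fails to start and reports error EJ0004. In the logs, the first HCCP error is as follows: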
[ERROR] HCCP(1609,python3):2023-10-19-12:33:13.313.594 [ra_host.c:544]tid:1609,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(-16) the IP address(29.4.84.36) in the ranktable is inconsistent with the IP(29.4.76.85)address of the network adapter, please make sure they're consistent. num(1)
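Search the ERROR logs: grep -rn ERROR | grep HCCP | grep inconsistent
The output shows that the IP addresses in the ranktable for the current node are inconsistent with the device IP addresses, as shown below.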
plog/plog-1601_20231019123251864.txt:21:[ERROR] HCCP(1601,python3):2023-10-19-12:33:11.645.514 [ra_host.c:544]tid:1601,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(192) the IP address(29.4.47.87) in the ranktable is inconsistent with the IP(29.4.76.155)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1590_20231019123251766.txt:21:[ERROR] HCCP(1590,python3):2023-10-19-12:33:11.313.520 [ra_host.c:544]tid:1590,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(160) the IP address(29.4.159.125) in the ranktable is inconsistent with the IP(29.4.67.170)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1612_20231019123251802.txt:21:[ERROR] HCCP(1612,python3):2023-10-19-12:33:11.365.521 [ra_host.c:544]tid:1612,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(161) the IP address(29.4.156.176) in the ranktable is inconsistent with the IP(29.4.69.45)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1585_20231019123251689.txt:21:[ERROR] HCCP(1585,python3):2023-10-19-12:33:11.377.587 [ra_host.c:544]tid:1585,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(-162) the IP address(29.4.59.254) in the ranktable is inconsistent with the IP(29.4.0.99)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1593_20231019123251798.txt:21:[ERROR] HCCP(1593,python3):2023-10-19-12:33:11.785.581 [ra_host.c:544]tid:1593,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(-80) the IP address(29.4.187.47) in the ranktable is inconsistent with the IP(29.4.132.121)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1609_20231019123251900.txt:21:[ERROR] HCCP(1609,python3):2023-10-19-12:33:13.313.594 [ra_host.c:544]tid:1609,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(-16) the IP address(29.4.84.36) in the ranktable is inconsistent with the IP(29.4.76.85)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1598_20231019123251821.txt:21:[ERROR] HCCP(1598,python3):2023-10-19-12:33:11.735.192 [ra_host.c:544]tid:1598,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(64) the IP address(29.4.146.67) in the ranktable is inconsistent with the IP(29.4.184.7)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1606_20231019123251824.txt:21:[ERROR] HCCP(1606,python3):2023-10-19-12:33:11.525.543 [ra_host.c:544]tid:1606,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(96) the IP address(29.4.21.78) in the ranktable is inconsistent with the IP(29.4.85.223)address of the network adapter, please make sure they're consistent. num(1)
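When many processes report the same mismatch, it can help to condense the grep output into one pair of addresses per process. The following is a minimal sketch, not an official tool: the regular expression is derived only from the message wording in the sample lines above, and the function name find_ip_mismatches is hypothetical.

```python
import re

# Pattern built from the HCCP message wording seen in the sample logs.
# It captures the process ID, the ranktable IP, and the adapter IP.
MISMATCH_RE = re.compile(
    r"HCCP\((?P<pid>\d+),[^)]*\).*?"
    r"the IP address\((?P<ranktable_ip>[^)]+)\) in the ranktable "
    r"is inconsistent with the IP\((?P<nic_ip>[^)]+)\)"
)


def find_ip_mismatches(lines):
    """Return (pid, ranktable_ip, nic_ip) for every matching log line."""
    results = []
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            results.append(
                (
                    int(match.group("pid")),
                    match.group("ranktable_ip"),
                    match.group("nic_ip"),
                )
            )
    return results


# One line copied from the logs above, used as sample input.
sample = [
    "[ERROR] HCCP(1609,python3):2023-10-19-12:33:13.313.594 "
    "[ra_host.c:544]tid:1609,ra_rdev_init_check_ip(544) : [check][ip]fail, "
    "ret(-16) the IP address(29.4.84.36) in the ranktable is inconsistent "
    "with the IP(29.4.76.85)address of the network adapter, please make "
    "sure they're consistent. num(1)",
]

for pid, ranktable_ip, nic_ip in find_ip_mismatches(sample):
    print(f"pid {pid}: ranktable={ranktable_ip} adapter={nic_ip}")
# prints: pid 1609: ranktable=29.4.84.36 adapter=29.4.76.85
```

The same function can be fed the lines of each plog file, or the piped grep output, because the pattern uses search and ignores any file-name prefix.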
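Cause Analysis
This error is usually caused by a service configuration error: the device IP address recorded in the ranktable does not match the IP address actually configured on the network adapter. Correct the configuration so that the two are consistent.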
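To confirm such a configuration error before launching training, the ranktable entries can be compared against the addresses actually set on the devices. The sketch below is illustrative only and rests on several assumptions: the ranktable is assumed to be JSON with a server_list array whose entries carry server_id and a device list with device_id and device_ip fields; the actual device IPs are assumed to have been collected separately and passed in as a dictionary; the function name, the server ID "node-1", and the device ID "0" are hypothetical.

```python
import json


def ranktable_mismatches(ranktable, server_id, actual_ips):
    """List (device_id, ranktable_ip, actual_ip) entries that disagree.

    ranktable  : parsed ranktable JSON (layout assumed, see lead-in).
    server_id  : ID of the node being checked.
    actual_ips : dict mapping device_id to the IP set on that device,
                 gathered by whatever means the environment provides.
    """
    mismatches = []
    for server in ranktable.get("server_list", []):
        if server.get("server_id") != server_id:
            continue  # only check devices that belong to this node
        for device in server.get("device", []):
            device_id = device.get("device_id")
            ranktable_ip = device.get("device_ip")
            actual_ip = actual_ips.get(device_id)
            if actual_ip is not None and ranktable_ip != actual_ip:
                mismatches.append((device_id, ranktable_ip, actual_ip))
    return mismatches


# Hypothetical ranktable fragment; the IPs reuse one pair from the logs.
ranktable = json.loads(
    """
    {
      "server_list": [
        {
          "server_id": "node-1",
          "device": [
            {"device_id": "0", "device_ip": "29.4.84.36", "rank_id": "0"}
          ]
        }
      ]
    }
    """
)

for device_id, ranktable_ip, actual_ip in ranktable_mismatches(
    ranktable, "node-1", {"0": "29.4.76.85"}
):
    print(f"device {device_id}: ranktable={ranktable_ip} actual={actual_ip}")
```

An empty result means every checked device agrees with the ranktable; any returned entry points to the ranktable line or the device setting that needs to be corrected.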