EJ0004 Error Message
Symptom
Training fails to start and error EJ0004 is reported. The first HCCP error message in the log is as follows:
[ERROR] HCCP(1609,python3):2023-10-19-12:33:13.313.594 [ra_host.c:544]tid:1609,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(-16) the IP address(29.4.84.36) in the ranktable is inconsistent with the IP(29.4.76.85)address of the network adapter, please make sure they're consistent. num(1)
To find all related errors, search the logs with the following command:
grep -rn ERROR | grep HCCP | grep inconsistent
In the output, each affected rank reports that the IP address in the ranktable file is inconsistent with the device IP address:
plog/plog-1601_20231019123251864.txt:21:[ERROR] HCCP(1601,python3):2023-10-19-12:33:11.645.514 [ra_host.c:544]tid:1601,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(192) the IP address(29.4.47.87) in the ranktable is inconsistent with the IP(29.4.76.155)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1590_20231019123251766.txt:21:[ERROR] HCCP(1590,python3):2023-10-19-12:33:11.313.520 [ra_host.c:544]tid:1590,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(160) the IP address(29.4.159.125) in the ranktable is inconsistent with the IP(29.4.67.170)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1612_20231019123251802.txt:21:[ERROR] HCCP(1612,python3):2023-10-19-12:33:11.365.521 [ra_host.c:544]tid:1612,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(161) the IP address(29.4.156.176) in the ranktable is inconsistent with the IP(29.4.69.45)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1585_20231019123251689.txt:21:[ERROR] HCCP(1585,python3):2023-10-19-12:33:11.377.587 [ra_host.c:544]tid:1585,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(-162) the IP address(29.4.59.254) in the ranktable is inconsistent with the IP(29.4.0.99)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1593_20231019123251798.txt:21:[ERROR] HCCP(1593,python3):2023-10-19-12:33:11.785.581 [ra_host.c:544]tid:1593,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(-80) the IP address(29.4.187.47) in the ranktable is inconsistent with the IP(29.4.132.121)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1609_20231019123251900.txt:21:[ERROR] HCCP(1609,python3):2023-10-19-12:33:13.313.594 [ra_host.c:544]tid:1609,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(-16) the IP address(29.4.84.36) in the ranktable is inconsistent with the IP(29.4.76.85)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1598_20231019123251821.txt:21:[ERROR] HCCP(1598,python3):2023-10-19-12:33:11.735.192 [ra_host.c:544]tid:1598,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(64) the IP address(29.4.146.67) in the ranktable is inconsistent with the IP(29.4.184.7)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1606_20231019123251824.txt:21:[ERROR] HCCP(1606,python3):2023-10-19-12:33:11.525.543 [ra_host.c:544]tid:1606,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(96) the IP address(29.4.21.78) in the ranktable is inconsistent with the IP(29.4.85.223)address of the network adapter, please make sure they're consistent. num(1)
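With many ranks, the mismatched address pairs are easier to compare when they are pulled out of the log lines. The following is a minimal sketch that assumes the message wording matches the log lines shown above; the function name extract_ip_mismatches is hypothetical and not part of any HCCP tooling.

```shell
# Hypothetical helper: read HCCP log lines on stdin and print
# "<ranktable IP> <network adapter IP>" for every mismatch message.
# Assumes the message text has the same wording as the log lines above.
extract_ip_mismatches() {
    sed -n 's/.*the IP address(\([0-9.]*\)) in the ranktable is inconsistent with the IP(\([0-9.]*\))address.*/\1 \2/p'
}
```

It can be appended to the search pipeline, for example: grep -rn ERROR | grep HCCP | grep inconsistent | extract_ip_mismatches. Lines that do not contain the mismatch message produce no output.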
Possible Cause
This error is usually caused by incorrect configuration of the ranktable file.
Solution
Check the ranktable file configuration to determine whether the following problems exist:
- There are duplicate rank IDs.
If there are duplicate rank IDs, modify them to ensure that each rank ID is globally unique.
- The device IP address configured in the ranktable file is inconsistent with the actual IP address of the device.
You can run the following command to query the device IP address:
for i in `seq 0 7`; do echo "===================> dev$i, NPU$((i+1))"; hccn_tool -i $i -ip -g; done
If the two addresses differ, change the device IP address configured in the ranktable file to the actual IP address of the device.
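The duplicate rank ID check can also be scripted. The sketch below assumes that the ranktable is a JSON file in which each device entry carries a string-valued "rank_id" field; that layout, the function name find_duplicate_rank_ids, and the file name ranktable.json are assumptions for illustration, so adjust them to the actual file.

```shell
# Hypothetical helper: print every rank ID that appears more than once
# in the ranktable file given as $1.
# Assumes entries of the form "rank_id": "<value>" (an assumed layout).
find_duplicate_rank_ids() {
    grep -o '"rank_id"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" \
        | sed 's/.*"\([^"]*\)"$/\1/' \
        | sort | uniq -d
}
```

For example, find_duplicate_rank_ids ranktable.json prints nothing when every rank ID is unique, and prints each repeated ID once otherwise.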