EJ0004 Error Message

Symptom

The training job fails to start and reports the EJ0004 error. The first HCCP error message in the log is as follows:

[ERROR] HCCP(1609,python3):2023-10-19-12:33:13.313.594 [ra_host.c:544]tid:1609,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(-16) the IP address(29.4.84.36) in the ranktable is inconsistent with the IP(29.4.76.85)address of the network adapter, please make sure they're consistent. num(1)

To find all such errors, run the following command from the directory that contains the plog directory: grep -rn ERROR | grep HCCP | grep inconsistent

The output shows that, for each rank, the IP address configured in the ranktable file is inconsistent with the actual device IP address:
plog/plog-1601_20231019123251864.txt:21:[ERROR] HCCP(1601,python3):2023-10-19-12:33:11.645.514 [ra_host.c:544]tid:1601,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(192) the IP address(29.4.47.87) in the ranktable is inconsistent with the IP(29.4.76.155)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1590_20231019123251766.txt:21:[ERROR] HCCP(1590,python3):2023-10-19-12:33:11.313.520 [ra_host.c:544]tid:1590,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(160) the IP address(29.4.159.125) in the ranktable is inconsistent with the IP(29.4.67.170)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1612_20231019123251802.txt:21:[ERROR] HCCP(1612,python3):2023-10-19-12:33:11.365.521 [ra_host.c:544]tid:1612,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(161) the IP address(29.4.156.176) in the ranktable is inconsistent with the IP(29.4.69.45)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1585_20231019123251689.txt:21:[ERROR] HCCP(1585,python3):2023-10-19-12:33:11.377.587 [ra_host.c:544]tid:1585,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(-162) the IP address(29.4.59.254) in the ranktable is inconsistent with the IP(29.4.0.99)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1593_20231019123251798.txt:21:[ERROR] HCCP(1593,python3):2023-10-19-12:33:11.785.581 [ra_host.c:544]tid:1593,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(-80) the IP address(29.4.187.47) in the ranktable is inconsistent with the IP(29.4.132.121)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1609_20231019123251900.txt:21:[ERROR] HCCP(1609,python3):2023-10-19-12:33:13.313.594 [ra_host.c:544]tid:1609,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(-16) the IP address(29.4.84.36) in the ranktable is inconsistent with the IP(29.4.76.85)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1598_20231019123251821.txt:21:[ERROR] HCCP(1598,python3):2023-10-19-12:33:11.735.192 [ra_host.c:544]tid:1598,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(64) the IP address(29.4.146.67) in the ranktable is inconsistent with the IP(29.4.184.7)address of the network adapter, please make sure they're consistent. num(1)
plog/plog-1606_20231019123251824.txt:21:[ERROR] HCCP(1606,python3):2023-10-19-12:33:11.525.543 [ra_host.c:544]tid:1606,ra_rdev_init_check_ip(544) : [check][ip]fail, ret(96) the IP address(29.4.21.78) in the ranktable is inconsistent with the IP(29.4.85.223)address of the network adapter, please make sure they're consistent. num(1)
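
With many ranks, it can help to reduce these log lines to the two addresses that matter. The following is a minimal sketch, assuming the message wording is exactly as shown above; extract_ip_mismatches is a hypothetical helper name, not part of any Ascend tool.

```shell
# Hypothetical helper: reads HCCP log lines on stdin and prints one
# "<ranktable IP> <network adapter IP>" pair per mismatch message.
# Lines that do not match the message format are ignored.
extract_ip_mismatches() {
    sed -n 's/.*the IP address(\([0-9.]*\)) in the ranktable is inconsistent with the IP(\([0-9.]*\))address.*/\1 \2/p'
}

# Usage, from the directory that contains the plog directory:
#   grep -rh inconsistent plog/ | extract_ip_mismatches | sort -u
```

Each output line pairs the address to look for in the ranktable file with the address that the device actually reports.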

Possible Cause

This error is usually caused by incorrect configuration of the ranktable file.

Solution

Check the ranktable file configuration to determine whether the following problems exist:
  • There are duplicate rank IDs.

    If there are duplicate rank IDs, modify them to ensure that each rank ID is globally unique.

  • The configured IP address of the device is inconsistent with the actual IP address of the device.

    You can run the following command to query the device IP address:

    for i in `seq 0 7`; do echo "===================> dev$i, NPU$((i+1))"; hccn_tool -i $i -ip -g; done

    If they are inconsistent, change the device IP address configured in the ranktable file to the actual IP address of the device.
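
Both checks above can be scripted. The following is a rough sketch, not an official tool. It assumes the ranktable is a JSON file whose device entries carry "rank_id" and "device_ip" string fields, and that hccn_tool prints the address on a line starting with "ipaddr:". Verify both assumptions against your own files and output. The function names and the ranktable path are hypothetical.

```shell
# Assumption: device entries in the ranktable JSON contain string fields
# named "rank_id" and "device_ip"; adjust the names if your file differs.

# Print every rank_id value that occurs more than once.
# No output means that all rank IDs are unique.
find_duplicate_rank_ids() {
    grep -o '"rank_id"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" \
        | sed 's/.*:[[:space:]]*"\([^"]*\)"$/\1/' \
        | sort | uniq -d
}

# Succeed (exit status 0) if the given IP appears as a device_ip value.
ranktable_has_device_ip() {
    grep -o '"device_ip"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" \
        | sed 's/.*:[[:space:]]*"\([^"]*\)"$/\1/' \
        | grep -qxF -- "$2"
}

# Usage on a node with 8 NPUs (the path is a placeholder, and the
# "ipaddr:" prefix of the hccn_tool output is an assumption):
#   find_duplicate_rank_ids /path/to/ranktable.json
#   for i in $(seq 0 7); do
#       ip=$(hccn_tool -i "$i" -ip -g | sed -n 's/^ipaddr:\(.*\)$/\1/p')
#       ranktable_has_device_ip /path/to/ranktable.json "$ip" \
#           || echo "dev$i: $ip is not in the ranktable"
#   done
```

The sketch uses grep and sed rather than a JSON parser so that it runs on nodes without extra tools installed. It only confirms that an address appears somewhere in the file, not that it is assigned to the correct rank.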