What Do I Do If the HCCL Fails to Initialize the NIC and the HCCP Returns the Error Code ret[-17]?
Symptom
The Huawei Collective Communication Library (HCCL) fails to initialize the NIC, and the Huawei Collective Communication Process (HCCP) reports the error "ra rdev init failed, ret [-17]".

Possible Cause
During initialization, HCCL initializes the device NIC based on the device IP address in the ranktable file. If the device IP address used for initialization is different from the actual NIC IP address, HCCP fails to initialize the NIC and returns the error code -17.
Solution
- Obtain the rank ID of the device and its device_ip configuration in the ranktable file as follows:
In the user-mode host log (EVENT-level logging must be enabled), search for the keyword Entry-HcomInit. The value of the identify field is the rank ID. Then locate the entry for that rank ID in the ranktable file and read its device_ip value.
- Check the device IP address of the server. If the value of device_ip in the ranktable file is different from that in the query result, change the value of device_ip in the ranktable file to the query result.
You can use hccn_tool to view the device NIC information.
hccn_tool -i 0 -ip -g
hccn_tool -i 1 -ip -g
hccn_tool -i 2 -ip -g
hccn_tool -i 3 -ip -g
hccn_tool -i 4 -ip -g
hccn_tool -i 5 -ip -g
hccn_tool -i 6 -ip -g
hccn_tool -i 7 -ip -g
Or:
for i in {0..7}; do hccn_tool -i $i -ip -g; done
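The comparison described in the second step can be sketched as a small shell helper. This is a minimal illustration, not part of the product: the function names parse_ipaddr and check_device_ip are hypothetical, and the sketch assumes that hccn_tool -i <id> -ip -g prints the address on a line of the form "ipaddr:<address>". Adjust the pattern if the output on your system differs.

```shell
#!/usr/bin/env bash

# Extract the IP address from hccn_tool output read on stdin.
# Assumption: the output contains a line of the form "ipaddr:<address>".
parse_ipaddr() {
    sed -n 's/^ipaddr:[[:space:]]*\([0-9.]*\).*$/\1/p' | head -n 1
}

# Compare the actual NIC address of one device with the value taken
# from the ranktable file (hypothetical helper, for illustration only).
# Usage: check_device_ip <device_id> <device_ip from ranktable>
check_device_ip() {
    local dev="$1" expected="$2" actual
    actual="$(hccn_tool -i "$dev" -ip -g | parse_ipaddr)"
    if [ "$actual" = "$expected" ]; then
        echo "device $dev: OK ($actual)"
    else
        echo "device $dev: MISMATCH ranktable=$expected actual=${actual:-unknown}"
        return 1
    fi
}
```

For example, check_device_ip 0 <device_ip> returns a non-zero status when the address reported for device 0 differs from the device_ip value you copied from the ranktable file, which is the condition that leads to the ret [-17] failure.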