昇腾社区首页
中文
注册
开发者
下载

EI0015 Ranktable_Detecte_Failed

Symptom

The ranktable information detection failed, Reason:[%s].

Possible Cause

Failed to detect remote rank information, it may be due to a socket connection failure.

Solution

  1. Use or check whether the HCCL_SOCKET_IFNAME selects the correct and reachable network interface card.
  2. Check the rank service processes with other errors or no errors in the cluster.
  3. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable.