This error typically occurs during operator execution. The key error message in the on-screen log is "EI0002: The wait execution of the Notify register times out." For example:
Error. Message is EI0002:The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank [4].base information: [streamID:[14],taskID[6], taskType[Notify Wait], tag[AllReduce_6629421139219749105_0].] task information: [notify id:[0x000000700000058], stage:[fffffff], remote rank:[4].] …… there are(is) 1 abnormal device(s): serverId[192.168.100.111], deviceId[4], Heartbeat Lost Occurred, Possible Reason: 1. Process has exited, 2. Network Disconnected
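When many ranks report this error at once, it can help to pull the triage fields out of the message programmatically. The following is a minimal sketch, not part of HCCL: the function name `parse_ei0002` and the field names are hypothetical, and the regular expressions assume the message layout shown in the example above, which may differ between versions.

```python
import re

# Shortened copy of the example message above (layout assumed, not guaranteed).
SAMPLE_MESSAGE = (
    "Error. Message is EI0002:The wait execution of the Notify register times out. "
    "Reason: The Notify register has not received the Notify record from remote rank [4]."
    "base information: [streamID:[14],taskID[6], taskType[Notify Wait], "
    "tag[AllReduce_6629421139219749105_0].] "
    "there are(is) 1 abnormal device(s): serverId[192.168.100.111], deviceId[4], "
    "Heartbeat Lost Occurred"
)

# Hypothetical field names mapped to patterns inferred from the sample message.
FIELD_PATTERNS = {
    "remote_rank": r"remote rank \[(\d+)\]",
    "stream_id": r"streamID:\[(\d+)\]",
    "task_id": r"taskID\[(\d+)\]",
    "tag": r"tag\[([^\]]+)\]",
    "abnormal_server": r"serverId\[([^\]]+)\]",
    "abnormal_device": r"deviceId\[(\d+)\]",
}


def parse_ei0002(message: str) -> dict:
    """Return the first match for each field, or None if the field is absent."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, message)
        result[name] = match.group(1) if match else None
    return result


info = parse_ei0002(SAMPLE_MESSAGE)
print(info)
```

All values are returned as strings. A field that does not appear in a given message comes back as `None`, so the sketch tolerates messages without the abnormal-device summary.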
Searching the plog logs for Notify shows information similar to the following:
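The task of an HCCL operator runs on every device in the specified cluster, and the devices synchronize their state through notify. If any rank or communication link fails before or during execution, cluster synchronization fails and the remaining cards report a notify wait timeout. The common causes are: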
After collecting the plog logs from all cards, troubleshoot as follows:
Sample search command: grep -rn "err cqe" | grep HCCL
The key log entry is as follows:
19987:[ERROR] HCCL(85111,python3):2023-02-18-09:44:55.431.692 [heartbeat.cc:547][85111][94635][Heartbeat]cqe err status[12],time:[2023-02-18 09:44:55.369944],ip:[*.*.*.*]
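With logs collected from many cards, the same search can be scripted so that the status, timestamp, and IP of each CQE error are gathered in one place. The sketch below is an illustration under assumptions: the helper names `find_cqe_errors` and `scan_plog_dir` are hypothetical, the pattern follows the "cqe err status[...]" wording of the sample entry above, and the `*.log` file suffix is assumed for the collected plog files.

```python
import re
from pathlib import Path

# Sample entry copied from the key log above.
SAMPLE_LINE = (
    "19987:[ERROR] HCCL(85111,python3):2023-02-18-09:44:55.431.692 "
    "[heartbeat.cc:547][85111][94635][Heartbeat]cqe err status[12],"
    "time:[2023-02-18 09:44:55.369944],ip:[*.*.*.*]"
)

# Pattern inferred from the sample entry; other versions may log differently.
CQE_ERR = re.compile(
    r"cqe err status\[(?P<status>\d+)\],"
    r"time:\[(?P<time>[^\]]+)\],"
    r"ip:\[(?P<ip>[^\]]+)\]"
)


def find_cqe_errors(lines):
    """Return status/time/ip dicts for every HCCL CQE error in the given lines."""
    hits = []
    for line in lines:
        if "HCCL" not in line:  # same filter as the trailing `grep HCCL`
            continue
        match = CQE_ERR.search(line)
        if match:
            hits.append(match.groupdict())
    return hits


def scan_plog_dir(root):
    """Scan every *.log file under root (suffix is an assumption) for CQE errors."""
    results = {}
    for path in Path(root).rglob("*.log"):
        with open(path, errors="replace") as handle:
            hits = find_cqe_errors(handle)
        if hits:
            results[str(path)] = hits
    return results


print(find_cqe_errors([SAMPLE_LINE]))
```

Calling `scan_plog_dir` on the directory that holds the collected logs returns a mapping from file path to the errors found in that file, which makes it easier to see which card recorded the earliest error.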