While the training process is running, the Device reports the following errors:
[ERROR] HCCP(16381,hccp_service.bin):2023-11-06-17:27:19.734.960 [rs_rdma.c:683]tid:16419,rs_send_wr(683) : send exp failed qpn 67, ret -12
[ERROR] HCCP(16381,hccp_service.bin):2023-11-06-17:27:19.734.976 [ra_adp.c:576]tid:16419,ra_rs_send_wr(576) : send wr failed ret[-12].
The Host reports the following error:
[ERROR] HCCP(5713,python3.7):2023-11-06-09:11:08.576.793 [ra_hdc.c:952]tid:5713,ra_hdc_send_wr(952) : [send][ra_hdc_wr]ra hdc message process failed ret(-12), phy_id(0)
The corresponding driver message log shows a CQ overflow:
Nov 6 17:11:09 (none) kern.warn kernel: [306962.911468] hns3 0000:71:00.0 hns_0: CQ 0x3 overflow
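The CQ (completion queue) overflowed and the driver reported "CQ overflow", which caused posting of the WR (work request) to fail.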
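To look for this symptom in a saved driver message log, a filter along the following lines may help. This is only a minimal sketch: the function name find_cq_overflow is hypothetical, and the pattern is inferred solely from the sample log line above, so other driver versions may format the message differently.

```shell
# Hypothetical helper: print log lines that report a CQ overflow.
# The pattern is inferred from the sample line above and may need adjusting.
# With no arguments it reads stdin; otherwise it reads the given files.
find_cq_overflow() {
    grep -E 'CQ 0x[0-9a-fA-F]+ overflow' "$@"
}
```

It can be fed from a pipe (for example, dmesg | find_cq_overflow) or given a log file path as an argument. It returns a non-zero status when no matching line is found.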
Example command for checking the MTU of each device:
for i in $(seq 0 7); do echo "===================> dev$i, NPU$((i+1))"; hccn_tool -i "$i" -mtu -g; done
If the values are inconsistent, run the "hccn_tool [-i %d] -mtu -s [size %d]" command to set the MTU.
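Once the MTU value of each device has been read, the comparison itself can be automated. The following is a minimal sketch under stated assumptions: mtu_consistent is a hypothetical helper, and it expects one "<device_id> <mtu>" pair per line on stdin. Extracting the value from the hccn_tool output is deliberately left out, because that output format is not shown here.

```shell
# Hypothetical helper: read "<device_id> <mtu>" pairs from stdin and
# succeed only if all devices report the same MTU value.
mtu_consistent() {
    awk '
        NF >= 2 { seen[$2] = 1 }        # record each distinct MTU value
        END {
            n = 0
            for (v in seen) n++         # count the distinct values
            if (n > 1) {
                printf "MTU mismatch: %d distinct values\n", n
                exit 1
            }
        }
    '
}
```

A non-zero return status indicates that at least two devices differ, in which case the MTU can be set with the command mentioned above.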