retcode 11 Reported When the hostfile Does Not Match the Test Command

Symptom

In a multi-node scenario, running the HCCL Test tool fails with the error "This is an error in init_hcclComm." and retcode: 11, as shown below:

Authorized users only. All activities may be monitored and reported.
the minbytes is 8192, maxbytes is 67108864, iters is 20, warmup_iters is 5
hccl interface return err ./common/src/hccl_test_common.cc:503, retcode: 11
hccl interface return err ./common/src/hccl_test_common.cc:503, retcode:11
hccl interface return err ./common/src/hccl_test_common.cc:503, retcode: 11
This is an error in init_hcclComm.

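Cause Analysis

The hostfile configuration does not match the mpirun test command: the total number of devices, or the number of devices used per node, differs between the two.

Example mpirun command: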

mpirun -f hostfile -n 16 ./bin/all_reduce_test -b 8K -e 4G -f 2 -d fp32 -o sum -p 8

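Solution

Modify either the hostfile configuration file or the mpirun test command so that the two agree on both the total number of devices and the number of devices used on each node.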
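
The consistency check described above can be sketched in a few lines of Python. This is only an illustrative helper, not part of HCCL Test. It assumes an MPICH-style hostfile in which each line reads `host:process_count`, that `-n` in the mpirun command is the total number of processes, and that `-p` is the number of devices used per node. The function names and the IP addresses are hypothetical.

```python
def parse_hostfile(text):
    """Parse hostfile text into a list of (host, slots) pairs.

    Assumption: each non-empty line is 'host:slots' (MPICH style);
    a line without ':' is treated as one process on that host.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        host, sep, slots = line.rpartition(":")
        if not sep:
            host, slots = line, "1"
        entries.append((host, int(slots)))
    return entries


def check_consistency(hostfile_text, total_ranks, npus_per_node):
    """Return a list of mismatch descriptions; an empty list means consistent.

    total_ranks   -- value passed to mpirun as -n (assumed: total processes)
    npus_per_node -- value passed to the test binary as -p (assumed: devices per node)
    """
    problems = []
    entries = parse_hostfile(hostfile_text)
    total = sum(slots for _, slots in entries)
    if total != total_ranks:
        problems.append(
            f"hostfile lists {total} processes in total, but -n is {total_ranks}"
        )
    for host, slots in entries:
        if slots != npus_per_node:
            problems.append(
                f"{host} lists {slots} processes, but -p is {npus_per_node}"
            )
    return problems


# Hypothetical two-node hostfile checked against '-n 16 ... -p 8'.
example_hostfile = "10.0.0.1:8\n10.0.0.2:8\n"
for problem in check_consistency(example_hostfile, total_ranks=16, npus_per_node=8):
    print(problem)
```

For the example command shown earlier (`-n 16` and `-p 8`), a hostfile with two nodes of 8 processes each passes the check, while a hostfile whose second node lists only 4 processes would be reported twice: once for the total and once for that node.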