When the HCCL Test tool is run in a multi-node scenario, it fails with the error "This is an error in init_hcclComm" and retcode: 11, as shown below:
Authorized users only. All activities may be monitored and reported.
the minbytes is 8192, maxbytes is 67108864, iters is 20, warmup_iters is 5
hccl interface return err ./common/src/hccl_test_common.cc:503, retcode: 11
hccl interface return err ./common/src/hccl_test_common.cc:503, retcode:11
hccl interface return err ./common/src/hccl_test_common.cc:503, retcode: 11
This is an error in init_hcclComm.
Example mpirun command:
mpirun -f hostfile -n 16 ./bin/all_reduce_test -b 8K -e 4G -f 2 -d fp32 -o sum -p 8
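Modify the hostfile configuration file or the mpirun test command so that the two agree on both the total number of devices and the number of devices used on each node. In the example command, -n 16 starts 16 processes in total and -p 8 specifies 8 devices per node, so the hostfile must describe two nodes with 8 processes each.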
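The consistency check described above can be automated before launching the test. The following is a minimal sketch, not part of HCCL Test: `check_hostfile` is a hypothetical helper name, and it assumes an MPICH-style hostfile in which each non-comment line has the form `<node_ip>:<process_count>` (a line without a count is treated as one process). If your hostfile uses a different format, the parsing would need to be adapted.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not provided by HCCL Test): verifies that a hostfile
# agrees with the mpirun -n value (total processes) and the HCCL Test -p value
# (devices per node).
# Assumption: hostfile lines look like "<node_ip>:<process_count>".
check_hostfile() {
    local hostfile="$1" total="$2" per_node="$3"
    awk -F: -v total="$total" -v per_node="$per_node" '
        /^[[:space:]]*(#|$)/ { next }           # skip comments and blank lines
        {
            slots = ($2 == "" ? 1 : $2) + 0     # assume 1 process when no count is given
            sum += slots
            if (slots != per_node) {
                printf "MISMATCH: %s has %d processes, -p is %d\n", $1, slots, per_node
                bad = 1
            }
        }
        END {
            if (sum != total) {
                printf "MISMATCH: hostfile total is %d, -n is %d\n", sum, total
                bad = 1
            }
            if (!bad) print "OK"
            exit (bad ? 1 : 0)                  # non-zero exit status on any mismatch
        }' "$hostfile"
}
```

For the example command, `check_hostfile hostfile 16 8` would print `OK` only if the hostfile lists two nodes with 8 processes each; otherwise it prints one line per mismatch and returns a non-zero status, so it can gate the mpirun call in a script.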