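When the HCCL Test command is run, the HCCL Test tool starts successfully, but after it prints the table header for data size, time, and bandwidth, execution fails with the following error: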
the minbytes is 8192, maxbytes is 2147483648, iters is 20, warmup_iters is 5
hccl interface return errreturn err ./common/src/hccl_test_common.cc:538, retcode: 7
This is an error in init_hcclComm.
hccl interface return errreturn err ./common/src/hccl_test_common.cc:538, retcode: 7
This is an error in init_hcclComm.
hccl interface return errreturn err ./common/src/hccl_test_common.cc:538, retcode: 7
This is an error in init_hcclComm.
hccl interface return errreturn err ./common/src/hccl_test_common.cc:538, retcode: 7
This is an error in init_hcclComm.
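An hccl_test process that has not exited remains on a node in the cluster that communicates with the current node. The residual processes can be cleaned up on all nodes by using a hostfile such as the following: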
worker-0 slots=1
worker-1 slots=1
worker-2 slots=1
worker-3 slots=1
... ...
worker-510 slots=1
worker-511 slots=1
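Here, "worker-0" through "worker-511" are the hostnames of the nodes in the cluster, and "slots=1" means that only one process is started on that node. The hostfile must contain every node that participates in the collective communication. With this hostfile, the following command kills the residual hccl_test processes on all of the nodes: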
/usr/local/openmpi-4.1.5/bin/mpirun -hostfile hostfile_1 -n 512 pkill -9 -f hccl_test
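The hostfile used above lists 512 consecutively numbered nodes, so it does not have to be typed by hand. The following is a minimal sketch for generating it, assuming the hostnames really follow the "worker-N" pattern shown earlier. The function name gen_hostfile is hypothetical and not part of HCCL Test or Open MPI; only the file name hostfile_1 is taken from the command above.

```shell
# Hypothetical helper: print hostfile entries for worker-0 .. worker-(N-1),
# one process slot per node. Assumes the hostnames follow the "worker-N" pattern.
gen_hostfile() {
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'worker-%d slots=1\n' "$i"
        i=$((i + 1))
    done
}

# Write the 512-node hostfile referenced by the mpirun command.
gen_hostfile 512 > hostfile_1
```

If the cluster uses a different naming scheme or node count, adjust the printf pattern and the argument accordingly. The value passed to -n in the mpirun command should match the number of entries in the file.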