When the HCCL Test tool is run, it reports the error "This is an error in init_hcclComm", as shown below:
[root@node-xxx]# mpirun -n 8 ./bin/all_reduce_test -b 8K -e 4G -f 2 -o sum -p 8
the minbytes is 8192, maxbytes is 4294967296, iters is 20, warmup_iters is 5
hccl interface return err ./common/src/hccl_test_common.cc:503, retcode: 11
This is an error in init_hcclComm.
hccl interface return err ./common/src/hccl_test_common.cc:503, retcode: 11
hccl interface return err ./common/src/hccl_test_common.cc:503, retcode: 11
hccl interface return err ./common/src/hccl_test_common.cc:503, retcode: 11
This is an error in init_hcclComm.
This is an error in init_hcclComm.
This is an error in init_hcclComm.
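Some cards are occupied by other processes, so the HCCL Test tool cannot use them for testing.
In some scenarios, npu-smi info does not show a card as occupied, yet its on-chip memory usage is very high. This also causes the HCCL Test tool to fail.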
To check which processes are occupying the cards, query each device (0 to 7 here):
for i in {0..7}; do hccn_tool -i "$i" -process -g; done
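The one-line loop above prints the results for all devices back to back, which can make it hard to tell which card a given process belongs to. The following is a minimal sketch of the same query wrapped in a function that labels each device. The function name `check_cards` and the `HCCN_TOOL` override variable are hypothetical helpers introduced here for illustration; only the `hccn_tool -i <id> -process -g` invocation comes from the original command.

```shell
# Sketch: query the occupying processes per device, with a header for each card.
# check_cards and HCCN_TOOL are illustrative names, not part of hccn_tool.
check_cards() (
  count="${1:-8}"                  # number of devices to query (default: 8)
  tool="${HCCN_TOOL:-hccn_tool}"   # allows substituting the binary, e.g. for a dry run
  i=0
  while [ "$i" -lt "$count" ]; do
    echo "=== device $i ==="
    # Same query as the one-line loop; continue with the next card on failure.
    "$tool" -i "$i" -process -g || echo "query failed for device $i" >&2
    i=$((i + 1))
  done
)
```

Calling `check_cards 8` queries devices 0 to 7. A card that still fails in HCCL Test although no process is listed for it may fall under the high on-chip memory case described above.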