Error Message "retcode 11" Is Reported Due to a Mismatch Between the Hostfile and the Test Command

Symptom

When the HCCL Test tool is run in a multi-server scenario, the error message "This is an error in init_hcclComm.retcode: 11" is displayed, as shown in the following output:

Authorized users only. All activities may be monitored and reported.
the minbytes is 8192, maxbytes is 67108864, iters is 20, warmup_iters is 5
hccl interface return errreturn err ./common/src/hccl_test_common.cc:503, retcode: 11
hccl interface return errreturn err ./common/src/hccl_test_common.cc:503, retcode:11
hccl interface return errreturn err ./common/src/hccl_test_common.cc:503, retcode: 11
This is an error in init_hcclComm.

Possible Cause

  • The number of devices per node configured in the hostfile does not match the -p option of the test command, where -p indicates the number of devices used by each node (that is, the number of processes on each node).
  • The number of nodes configured in the hostfile is inconsistent with the -n option of the mpirun command, where -n indicates the total number of devices (that is, number of nodes x number of devices on each node).

The following is an example of the mpirun command:

mpirun -f hostfile -n 16 ./bin/all_reduce_test -b 8K -e 4G -f 2 -d fp32 -o sum -p 8
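
To make the relationship between -n and -p concrete, the following Python sketch derives the number of nodes that a command implies. The function name expected_nodes is hypothetical and is not part of HCCL Test or mpirun; it only encodes the rule stated above that -n equals the number of nodes multiplied by -p.

```python
def expected_nodes(total_devices, devices_per_node):
    """Return the node count implied by mpirun -n and the test option -p."""
    if devices_per_node <= 0:
        raise ValueError("-p must be a positive integer")
    nodes, remainder = divmod(total_devices, devices_per_node)
    if remainder != 0:
        # -n must be an exact multiple of -p for the two options to agree
        raise ValueError("-n is not a multiple of -p")
    return nodes


# Values taken from the example command: -n 16 and -p 8
print(expected_nodes(16, 8))
```

For the example command this yields 2, so the hostfile is expected to describe two nodes with eight devices each.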

Solution

Modify the hostfile or the mpirun test command so that the two are consistent: the number of devices per node in the hostfile must equal -p, and the number of nodes multiplied by the number of devices per node must equal -n.
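
Before rerunning the test, the consistency check can be automated. The following Python sketch is one possible way to do it, not a tool shipped with HCCL Test. It assumes that each hostfile line has the form "node address:number of devices"; this syntax is an assumption and depends on the MPI library in use, so adapt the parsing if your hostfile looks different. The function name check_hostfile and the sample addresses are hypothetical.

```python
def check_hostfile(hostfile_text, n, p):
    """Return a list of mismatches between a hostfile and the -n / -p values.

    ASSUMPTION: each non-empty, non-comment line has the form
    "<node address>:<devices on that node>".
    """
    problems = []
    total = 0
    for line in hostfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the last colon so the address itself may contain colons
        host, sep, count_text = line.rpartition(":")
        if not sep or not count_text.strip().isdigit():
            problems.append(f"cannot parse line: {line!r}")
            continue
        count = int(count_text)
        total += count
        if count != p:
            problems.append(f"{host}: hostfile lists {count} devices, but -p is {p}")
    if total != n:
        problems.append(f"hostfile lists {total} devices in total, but -n is {n}")
    return problems


# Hypothetical two-node hostfile checked against -n 16 and -p 8
sample = "192.0.2.1:8\n192.0.2.2:8\n"
for problem in check_hostfile(sample, n=16, p=8):
    print(problem)
```

An empty result means that the hostfile agrees with both options. Any returned message names the value to correct, either in the hostfile or in the command.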