"retcode: 7" Is Returned During the HCCL Test
Symptom
In a cluster scenario, the HCCL Performance Tester starts successfully when the HCCL test command is executed. However, after the table headers for data size, time, and bandwidth are printed, the following error messages are reported:
the minbytes is 8192, maxbytes is 2147483648, iters is 20, warmup_iters is 5
hccl interface return errreturn err ./common/src/hccl_test_common.cc:538, retcode: 7
This is an error in init_hcclComm.
hccl interface return errreturn err ./common/src/hccl_test_common.cc:538, retcode: 7
This is an error in init_hcclComm.
hccl interface return errreturn err ./common/src/hccl_test_common.cc:538, retcode: 7
This is an error in init_hcclComm.
hccl interface return errreturn err ./common/src/hccl_test_common.cc:538, retcode: 7
This is an error in init_hcclComm.
Possible Cause
Residual hccl_test processes are still running on a node that communicates with the current node in the cluster.
Solution
Use MPI to terminate the residual hccl_test processes.
- Prepare the hostfile (the file named hostfile that is defined in 3) that was configured when the HCCL Performance Tester was executed.
- Terminate hccl_test processes on all nodes in the cluster.
- In the MPICH installation scenario, the command example is as follows:
mpirun -f hostfile -n 512 pkill -9 -f "all_reduce_test|mpirun"
- -f: MPICH option that specifies the hostfile listing the nodes.
- -n: MPICH option that sets the number of processes to launch. Set it to the total number of NPUs whose processes are to be terminated, that is, Number of nodes × Number of NPUs participating in training on each node. Change the value as required.
- pkill: Linux command that sends a signal to the processes matching a pattern. The -f option that follows -9 makes pkill match the pattern against the full command line instead of only the process name. all_reduce_test is the HCCL test command; replace it with the command that was actually run.
- In the Open MPI installation scenario, the command example is as follows:
mpirun -hostfile hostfile -n 512 pkill -9 -f "all_reduce_test|openmpi"
- -hostfile: Open MPI option that specifies the hostfile listing the nodes.
- -n: Open MPI option that sets the number of processes to launch. Set it to the total number of NPUs whose processes are to be terminated, that is, Number of nodes × Number of NPUs participating in training on each node. Change the value as required.
- pkill: Linux command that sends a signal to the processes matching a pattern. The -f option that follows -9 makes pkill match the pattern against the full command line instead of only the process name. all_reduce_test is the HCCL test command; replace it with the command that was actually run.
- After the preceding steps are complete, run the HCCL Performance Tester again to perform the test.
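The value passed to -n in the commands above is Number of nodes × Number of NPUs on each node. The following is a minimal sketch of that arithmetic. It assumes a cluster of 64 nodes with 8 NPUs each (illustrative values that are not taken from this topic, chosen only because they multiply to the 512 used in the examples). The script prints the resulting commands for review and does not run them.

```shell
#!/bin/sh
# Assumed example values, not taken from this topic: adjust to your cluster.
nodes=64          # number of nodes listed in hostfile
npus_per_node=8   # NPUs participating in the test on each node

# -n = Number of nodes x Number of NPUs on each node
total=$((nodes * npus_per_node))

# Build the commands as strings so they can be reviewed before being run.
mpich_cmd="mpirun -f hostfile -n ${total} pkill -9 -f \"all_reduce_test|mpirun\""
openmpi_cmd="mpirun -hostfile hostfile -n ${total} pkill -9 -f \"all_reduce_test|openmpi\""

echo "MPICH:    ${mpich_cmd}"
echo "Open MPI: ${openmpi_cmd}"
```

Printing the command first is a deliberate choice here: pkill -9 terminates every process whose command line matches the pattern, so it is worth checking the pattern and the process count before running the command on all nodes.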