"retcode: 7" Is Returned During the HCCL Test

Symptom

When the HCCL test command is executed in a cluster scenario, the HCCL Performance Tester starts successfully. However, after the table headers for data size, time, and bandwidth are printed, the following error messages are reported:

the minbytes is 8192, maxbytes is 2147483648, iters is 20, warmup_iters is 5
hccl interface return err ./common/src/hccl_test_common.cc:538, retcode: 7 
This is an error in init_hcclComm.
hccl interface return err ./common/src/hccl_test_common.cc:538, retcode: 7 
This is an error in init_hcclComm.
hccl interface return err ./common/src/hccl_test_common.cc:538, retcode: 7 
This is an error in init_hcclComm.
hccl interface return err ./common/src/hccl_test_common.cc:538, retcode: 7 
This is an error in init_hcclComm.

Possible Cause

Residual hccl_test processes from an earlier run are still running on a node that communicates with the current node in the cluster.
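One way to confirm this cause on a suspect node is to list the processes whose command line matches the test name. The following is a minimal sketch, assuming that pgrep (from the procps package) is available on the node and that the test command is all_reduce_test. The function name list_residual is hypothetical and is not part of the HCCL test tools.

```shell
# List residual processes whose full command line matches a pattern.
# list_residual is a hypothetical helper name used only for illustration.
list_residual() {
    pattern="${1:-all_reduce_test}"
    # -f: match the pattern against the full command line.
    # -a: print the PID together with the full command line.
    if pgrep -a -f "$pattern"; then
        return 0
    fi
    echo "no process matching '$pattern'"
    return 1
}
```

Running list_residual all_reduce_test on each node prints the matching processes, if any. Note that pgrep -f also matches any other process whose command line contains the pattern, such as a shell that was started with the pattern as an argument.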

Solution

Use MPI to terminate the residual hccl_test processes.

  1. Prepare the hostfile that was configured when the HCCL Performance Tester was executed, that is, the hostfile defined in 4 (for example, hostfile).
  2. Terminate the hccl_test processes on all nodes in the cluster.
    • In the MPICH installation scenario, the command example is as follows:

      mpirun -f hostfile -n 512 pkill -9 -f "all_reduce_test|mpirun"

      • -f: MPICH command-line option, indicating the node list file (hostfile).
      • -n: MPICH command-line option, indicating the total number of processes to start. Set it to the total number of NPUs whose processes are to be terminated, that is, the number of nodes multiplied by the number of NPUs involved in training on each node. Change the value as required.
      • pkill: Linux command. -9 sends the SIGKILL signal. The -f that follows pkill is a pkill option that matches the pattern against the full command line of each process instead of only the process name. all_reduce_test in the command example is the HCCL test command executed earlier. Change the pattern as required.
    • In the Open MPI installation scenario, the command example is as follows:

      mpirun -hostfile hostfile -n 512 pkill -9 -f "all_reduce_test|openmpi"

      • -hostfile: Open MPI command-line option, indicating the node list file (hostfile).
      • -n: Open MPI command-line option, indicating the total number of processes to start. Set it to the total number of NPUs whose processes are to be terminated, that is, the number of nodes multiplied by the number of NPUs involved in training on each node. Change the value as required.
      • pkill: Linux command. -9 sends the SIGKILL signal. The -f that follows pkill is a pkill option that matches the pattern against the full command line of each process instead of only the process name. all_reduce_test in the command example is the HCCL test command executed earlier. Change the pattern as required.
  3. After the preceding steps are complete, run the HCCL Performance Tester again.
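
The value of -n in step 2 is the number of nodes multiplied by the number of NPUs used on each node. The following sketch derives that value and prints the matching cleanup command without running it, so that the command can be reviewed first. The function name build_kill_cmd and the figures of 64 nodes with 8 NPUs each are illustrative assumptions. The printed commands are the ones shown in step 2.

```shell
# Hypothetical helper: build (but do not run) the cleanup command.
# Arguments: MPI flavor (mpich|openmpi), hostfile path, node count, NPUs per node.
build_kill_cmd() {
    flavor="$1"
    hostfile="$2"
    nodes="$3"
    npus_per_node="$4"
    # -n is the number of nodes multiplied by the NPUs used on each node.
    total=$((nodes * npus_per_node))
    case "$flavor" in
        mpich)
            printf 'mpirun -f %s -n %s pkill -9 -f "all_reduce_test|mpirun"\n' "$hostfile" "$total"
            ;;
        openmpi)
            printf 'mpirun -hostfile %s -n %s pkill -9 -f "all_reduce_test|openmpi"\n' "$hostfile" "$total"
            ;;
        *)
            echo "unknown MPI flavor: $flavor" >&2
            return 1
            ;;
    esac
}

# Assumed example: 64 nodes with 8 NPUs each gives -n 512.
build_kill_cmd mpich hostfile 64 8
```

Printing the command instead of executing it keeps the sketch safe to run anywhere. Copy the printed line to the node where mpirun is normally started once the values are confirmed.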