"bash: orted: command not found" Error

Symptom

In a cluster scenario, running the mpirun command fails with the error message "bash: orted: command not found", as shown below:

bash: orted: command not found
--------------------------------------------------------------------------
A daemon (pid 8793) died unexpectedly with status 127 while attempting
to launch so we are aborting.
 
There may be more information reported by the environment (see above).
 
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.

Possible Cause

Some hccl_test processes still exist in the cluster.

Solution

Use MPI to terminate the residual hccl_test processes.

  1. Prepare the hostfile that was configured when the HCCL Performance Tester was executed, that is, the hostfile defined in 4 (for example, hostfile).
  2. Terminate the hccl_test processes on all nodes in the cluster.
    • In the MPICH installation scenario, the command example is as follows:

      mpirun -f hostfile -n 512 pkill -9 -f "all_reduce_test|mpirun"

      • -f: MPICH command-line option, indicating the node list file (hostfile).
      • -n: MPICH command-line option, indicating the total number of processes to start, one for each NPU whose test process is to be terminated, that is, the number of nodes multiplied by the number of NPUs involved in training on each node. Change the value as required.
      • pkill: Linux command that sends a signal to every process matching a pattern. -9 sends SIGKILL. The -f that follows is a pkill option that matches the pattern against the full command line instead of only the process name. all_reduce_test in the command example is the HCCL test command executed earlier. Change the pattern as required.
    • In the Open MPI installation scenario, the command example is as follows:

      mpirun -hostfile hostfile -n 512 pkill -9 -f "all_reduce_test|openmpi"

      • -hostfile: Open MPI command-line option, indicating the node list file (hostfile).
      • -n: Open MPI command-line option, indicating the total number of processes to start, one for each NPU whose test process is to be terminated, that is, the number of nodes multiplied by the number of NPUs involved in training on each node. Change the value as required.
      • pkill: Linux command that sends a signal to every process matching a pattern. -9 sends SIGKILL. The -f that follows is a pkill option that matches the pattern against the full command line instead of only the process name. all_reduce_test in the command example is the HCCL test command executed earlier. Change the pattern as required.
  3. After the preceding steps are complete, run the HCCL Performance Tester again.
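
The -n value used in step 2 can be derived from the hostfile instead of being typed by hand. The following is a minimal sketch, assuming the hostfile lists one node per non-blank, non-comment line and every node uses the same number of NPUs. count_nodes and total_procs are hypothetical helper names introduced here for illustration; they are not part of MPI or the HCCL Performance Tester.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for computing the -n value (illustration only).

# Count node entries in a hostfile: every line that is neither blank
# nor a comment starting with '#'. Assumes one node per line.
count_nodes() {
    grep -Ev '^[[:space:]]*(#|$)' "$1" | wc -l | tr -d '[:space:]'
}

# Total number of processes = number of nodes * NPUs per node.
total_procs() {
    echo $(( $1 * $2 ))
}

# Example usage (assumes a file named "hostfile" and 8 NPUs per node):
#   nodes=$(count_nodes hostfile)
#   n=$(total_procs "$nodes" 8)
#   echo "mpirun -f hostfile -n $n pkill -9 -f \"all_reduce_test|mpirun\""
```

The usage lines only print the MPICH form of the cleanup command so that it can be reviewed before it is run. For Open MPI, use -hostfile and the pattern from the Open MPI example above.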