"bash: orted: command not found" Error

Symptom

In a cluster scenario, running the mpirun command fails with the error message "bash: orted: command not found", as shown below.

bash: orted: command not found
--------------------------------------------------------------------------
A daemon (pid 8793) died unexpectedly with status 127 while attempting
to launch so we are aborting.
 
There may be more information reported by the environment (see above).
 
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.

Possible Cause

Some hccl_test processes have not exited and are still running in the cluster.
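
To confirm this cause on a given node, a local check along the following lines may help. This is a minimal sketch, not part of the HCCL Performance Tester: the has_residual helper is a hypothetical name, and the pattern all_reduce_test is an assumption that should be replaced with the test command actually used.

```shell
# Sketch: check the local node for residual test processes.
# Assumption: the residual processes contain "all_reduce_test" in their command line.
pattern="all_reduce_test"

# Hypothetical helper: succeeds if any process command line matches the pattern.
has_residual() {
    pgrep -f "$1" >/dev/null 2>&1
}

if has_residual "$pattern"; then
    echo "Residual processes found:"
    # -a prints the full command line next to each PID.
    pgrep -a -f "$pattern"
else
    echo "No residual processes found."
fi
```

If processes are listed, they can be terminated across the cluster as described in the solution below.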

Solution

Use MPI to terminate the residual hccl_test processes.

  1. Prepare the hostfile (the file named hostfile that is defined in 3) that was configured when the HCCL Performance Tester was executed.
  2. Terminate hccl_test processes on all nodes in the cluster.
    • In the MPICH installation scenario, the command example is as follows:

      mpirun -f hostfile -n 512 pkill -9 -f "all_reduce_test|mpirun"

      • -f: MPICH option that specifies the hostfile listing the nodes.
      • -n: MPICH option that specifies the total number of processes to launch. Set it to the total number of NPUs whose processes are to be terminated, that is, Number of nodes × Number of NPUs participating in training on each node. Change the value as required.
      • pkill: Linux command. -9 sends the SIGKILL signal, and the subsequent -f option matches the pattern against the full command line instead of only the process name. all_reduce_test is the HCCL test command; replace it with the command actually used.
    • In the Open MPI installation scenario, the command example is as follows:

      mpirun -hostfile hostfile -n 512 pkill -9 -f "all_reduce_test|openmpi"

      • -hostfile: Open MPI option that specifies the hostfile listing the nodes.
      • -n: Open MPI option that specifies the total number of processes to launch. Set it to the total number of NPUs whose processes are to be terminated, that is, Number of nodes × Number of NPUs participating in training on each node. Change the value as required.
      • pkill: Linux command. -9 sends the SIGKILL signal, and the subsequent -f option matches the pattern against the full command line instead of only the process name. all_reduce_test is the HCCL test command; replace it with the command actually used.
  3. After the preceding steps are complete, run the HCCL Performance Tester again to perform the test.
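
The -n value used in step 2 can be derived from the hostfile instead of being typed by hand. The following is a minimal sketch under stated assumptions: the hostfile lists one node per non-empty, non-comment line, each node uses 8 NPUs, and count_nodes is a hypothetical helper. The sample addresses are placeholders, and the script only prints the cleanup command rather than running it.

```shell
# Sketch: compute -n as (number of nodes) x (NPUs per node) and print the command.
# Assumptions (not from the procedure above): one node per hostfile line, 8 NPUs per node.
set -eu

# Hypothetical helper: count lines that are neither blank nor comments.
count_nodes() {
    grep -Ecv '^[[:space:]]*(#|$)' "$1"
}

# Build a throwaway sample hostfile so the sketch is self-contained.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
hostfile="$workdir/hostfile"
printf '%s\n' '# sample nodes' '192.0.2.10' '192.0.2.11' '' > "$hostfile"

npus_per_node=8
nodes=$(count_nodes "$hostfile")
total=$((nodes * npus_per_node))

echo "nodes=$nodes total=$total"
# Open MPI form; for MPICH, use -f instead of -hostfile and the pattern from step 2.
echo "mpirun -hostfile hostfile -n $total pkill -9 -f \"all_reduce_test|openmpi\""
```

With a real hostfile, point the hostfile variable at that file and set npus_per_node to the number of NPUs participating in training on each node.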