Minicluster: MPICH with Torque
O "Outro" Mpiexec
The Torque functionality comes as a new binary that replaces the mpiexec installed with MPICH2 (it has the same name). This "other" mpiexec - with the Torque functionality - is produced by the Ohio Supercomputer Center and only works together with Torque: users will not be able to use it without a qsub script.
So keep the original in a safe place for running without Torque, and install the new one on the worker nodes.
Preparation
Move the original mpiexec (at least on the worker nodes) so that it is in neither root's nor the users' path. Find it with
root# which mpiexec
/usr/lib64/mpich2/bin/mpiexec
and then set it aside under a safe name.
root# cd /usr/lib64/mpich2/bin
root# mv mpiexec mpiexec2
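This needs to happen on every worker node. As a sketch - assuming passwordless ssh between nodes, the same MPICH2 path everywhere, and a machines file listing the worker nodes (the Pbs_iff section below shows how to generate one) - a quick loop could do the renaming on all of them at once:
for x in `cat machines`; do ssh $x mv /usr/lib64/mpich2/bin/mpiexec /usr/lib64/mpich2/bin/mpiexec2; done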
Installation
Download the latest mpiexec archive from the downloads section of the OSC mpiexec page. Download the file below (or whatever the latest version is):
wget http://www.osc.edu/~pw/mpiexec/mpiexec-0.83.tgz
wget http://www.osc.edu/~djohnson/mpiexec/mpiexec-0.84.tgz
Unpack it and change into the subdirectory:
tar xvzf mpiexec*.tgz
cd mpiexec-0.84
Mpiexec follows the standard source installation paradigm. Run
./configure --help
for a list of options. Important options include:
--prefix=
- specify where you want to have the binaries installed. They need to be accessible by all of the worker nodes; an NFS mount would be a good choice.
--with-pbs=
- necessary to get the Torque functionality! Specify the location of the Torque installation. If you followed my Torque tutorial, it's located at /var/spool/pbs.
--with-default-comm=mpich2-pmi
- used to indicate which version of MPI is being used - here, MPICH2 with its PMI interface.
Next, run ./configure with all the necessary options. My command looked like this:
./configure --prefix=/shared --with-pbs=/var/spool/pbs/ --with-default-comm=mpich2-pmi
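Configuring is only the first half of the standard source installation paradigm; assuming the usual pattern, building and installing would then be:
make
make install
(Run make install as root if your prefix isn't writable by your user.)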
Pbs_iff
To do this next part, Torque will need to already be installed. Mpiexec requires that a file named pbs_iff
be on each one of the worker nodes. Normally, this file is only located on the head node and isn't installed as part of the pbs_mom installation, so it needs to be copied out from the head node to each of the other nodes.
There's an easy way to do this by scripting. The first requirement is to have a file with each of the worker nodes listed in it. Assuming Torque is running, this can be generated with
pbsnodes | grep -v = | grep -v '^$' >> machines
- grep -v = excludes all lines that have an equal sign in them
- grep -v '^$' contains a regular expression to delete all empty lines
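To see why both filters are needed, note that pbsnodes output typically looks something like the made-up listing below: each hostname is followed by indented attribute lines containing equal signs, with blank lines separating the records, so only the bare hostnames survive the pipeline.
node01
     state = free
     np = 2
     ntype = cluster

node02
     state = free
     np = 2
     ntype = cluster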
The "machines" file of all the worker node names can then be used in a quick script to copy pbs_iff to each of the worker nodes. Find the original file with
updatedb && locate pbs_iff
(If you receive an error, apt-get install locate
and then try again.) Then, replacing my locations below with the location you found on your cluster, run
for x in `cat machines`; do rsync /usr/local/sbin/pbs_iff $x:/usr/local/sbin/; done
Next, pbs_iff needs to have its permissions changed to setuid root. (This means the binary runs with root privileges, even when run by a different user.) Again, to do this across all the worker nodes at once, use a script and make sure the location is correct for your setup:
for x in `cat machines`; do ssh $x chmod 4755 /usr/local/sbin/pbs_iff; done
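To confirm the change took effect, you can list the file's permissions on each node; a setuid binary shows an s in the owner execute slot, i.e. -rwsr-xr-x:
for x in `cat machines`; do ssh $x ls -l /usr/local/sbin/pbs_iff; done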
Without these steps, users trying to run mpiexec will see errors like these:
pbs_iff: file not setuid root, likely misconfigured
pbs_iff: cannot connect to gyrfalcon:15001 - fatal error, errno=13 (Permission denied)
cannot bind to reserved port in client_to_svr
mpiexec: Error: get_hosts: pbs_connect: Unauthorized Request
Testing
If you try to run a program with mpiexec
outside of a Torque job, it will give an error:
mpiexec: Error: PBS_JOBID not set in environment. Code must be run from a PBS script, perhaps interactively using "qsub -I".
At least it's a helpful error! Therefore, in order to test it, mpiexec will need to be called from within a script. Continue on to the Torque and Maui: Submitting an MPI Job page to test.
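As a preview - a minimal sketch only, with made-up job name, node counts, and program name (./a.out) - such a submission script could look like this:
#!/bin/sh
#PBS -N mpitest
#PBS -l nodes=2:ppn=2
cd $PBS_O_WORKDIR
mpiexec ./a.out
Submitted with qsub, the OSC mpiexec reads the allocated nodes from Torque itself, so no -np flag or machine file is needed.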