Minicluster:Torque

From WikiLICC

Current revision as of 00:50, 25 May 2011

The scheduler and the queue are two essential parts for a cluster. Together, they transform a group of networked machines into a cluster, or at least something closer to one. They're what allow users, working only on the head node, to submit "jobs" to the cluster. These jobs are transparently assigned to different worker nodes, and then - without the user needing to know where the jobs were - the results are deposited back into the user's home directory.

This process requires software in two different roles: the resource manager, responsible for accepting jobs to the queue and running jobs on worker nodes, and the scheduler, responsible for deciding when and where jobs in the queue should be run in order to optimize resources. I'll be using Torque for the resource manager and Maui for the scheduler. Both of these are open source projects.

Note: Torque has a built-in scheduler that can be used instead of Maui. However, Maui integrates seamlessly and provides more options and customization than Torque's scheduler.

Part 1

Installation

Before setting up Torque and Maui, DNS must be working. If that's not an option, this requirement can be worked around by setting up /etc/hosts on the head node with an entry for each of the nodes and then copying this file out to each of the worker nodes. (See the Cluster Time-saving Tricks page for help with the copying.)
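
As a rough sketch of the /etc/hosts fallback, the helper below prints one hosts entry per worker node. The function name gen_hosts_entries, the 192.168.0. network prefix, and the idea that the last octet matches the node number are assumptions made for illustration; the cell108 style names and the matrix domain follow the examples used later on this page.

```shell
#!/bin/sh
# Print one /etc/hosts line per worker node.
# Assumption: node N has the address <net_prefix>N and the name <name_prefix>N.
gen_hosts_entries() {
    name_prefix=$1; first=$2; last=$3; net_prefix=$4; domain=$5
    n=$first
    while [ "$n" -le "$last" ]; do
        # Format: <address> TAB <fully qualified name> <short name>
        printf '%s%s\t%s%s.%s %s%s\n' \
            "$net_prefix" "$n" "$name_prefix" "$n" "$domain" "$name_prefix" "$n"
        n=$((n + 1))
    done
}

# Example: review the output before appending it to /etc/hosts by hand.
gen_hosts_entries cell 108 110 192.168.0. matrix
```

The function only prints, so nothing is changed until you redirect its output yourself.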

Torque needs to be installed in two parts. First, a pbs_server is set up on the head node and configured to know where all of the worker nodes are. Then, each of the worker nodes is set up to run pbs_mom, a sort of client, that will accept jobs from the pbs_server and run them on the worker node. A basic queue for Torque also needs to be configured.

Maui is installed only on the head node, and needs to be set up to interact with the pbs_server. It does not communicate with the worker nodes directly, but instead talks to them by way of the server.


Use and Features

After both are installed and working properly, you might want to look at

References

Part 2

About Torque

From the Cluster Resources page on Torque,

"TORQUE is an open source resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original *PBS project..."

Because torque branched off from PBS, it still retains a lot of the old commands and names. PBS stands for portable batch system, and from here, I'll still call it torque, but commands may have "pbs" in them rather than "torque".

Installing Torque

Before installing torque, install all the necessary compilers; otherwise the build will report errors about the ones that are missing.

Download the most recent version of torque from http://www.clusterresources.com/downloads/torque/

Download the file into the source directory and unpack it:

cd /usr/local/src
wget http://www.clusterresources.com/downloads/torque/torque-3.0.1.tar.gz
tar xvf torque-3.0.1.tar.gz

Change into the new directory and view the configure help:

cd torque-3.0.1
./configure -help

We use

./configure --with-default-server=one  --with-server-home=/var/spool/pbs --with-rcp=scp


  • --with-default-server specifies the head node, which runs the torque server (it did not work using --with-default-server=one; that produced an error: No permission: errno=15007)!
  • --with-server-home sets the directory torque will run from.
  • --with-rcp=scp sets the mechanism used to copy files. If it is not specified, another one may be tried (which we do not want).

When it finishes successfully, a line like this appears:

config.status: executing depfiles commands

Building components: server=yes mom=yes clients=yes
                    gui=no drmaa=no pam=no
PBS Machine type: linux
Remote copy: /usr/bin/scp -rpB
PBS home: /var/spool/pbs
Default server: one.matrix
Unix Domain sockets: no
Tcl: disabled
Tk: disabled

Run make

make

The end of the output looks something like...

make[3]: Leaving directory `/usr/local/src/torque-3.0.1/doc'
make[2]: Leaving directory `/usr/local/src/torque-3.0.1/doc'
make[1]: Leaving directory `/usr/local/src/torque-3.0.1/doc'
make[1]: Entering directory `/usr/local/src/torque-3.0.1'
make[1]: Nothing to be done for `all-am'.
make[1]: Leaving directory `/usr/local/src/torque-3.0.1'

Finally

make install

To verify that it installed correctly, try locating the binaries with

root# which pbs_server
/usr/local/sbin/pbs_server

If it is not found, try

root# ls /usr/local/sbin | grep pbs       
pbs_demux
pbs_iff
pbs_mom
pbs_sched
pbs_server

If it is there but which does not find it, edit /etc/login.defs. Locate the ENV_SUPATH line and add /usr/local/bin and /usr/local/sbin to it. The ENV_PATH line should be right below it; add /usr/local/bin to that one.
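
Before editing /etc/login.defs, it can help to confirm that the directories really are missing from the search path. The following is a minimal sketch; path_contains is a hypothetical helper name, not part of torque.

```shell
#!/bin/sh
# Return success if directory $1 is one of the components of the
# colon-separated PATH-style string $2.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Report whether the torque install directories are on the current PATH.
for dir in /usr/local/bin /usr/local/sbin; do
    if path_contains "$dir" "$PATH"; then
        echo "$dir is on PATH"
    else
        echo "$dir is missing from PATH"
    fi
done
```

Wrapping both strings in colons makes the match exact, so /usr/local/sbin-old is not mistaken for /usr/local/sbin.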

Configuring Torque

To start the torque server on the head node and create a new job database, run the commands below and verify that the server is running (to stop an existing server, use killall -KILL pbs_server):

pbs_server -t create
ps aux | grep pbs

However, running the command that lists the queues and their status

qstat -a

shows nothing, because the list is empty. Start the queue manager (qmgr)

qmgr

(or qmgr one.matrix) to configure the queues in interactive mode, or enter the commands below on the command line:

qmgr -c "set server scheduling=true"
qmgr -c "create queue batch queue_type=execution"
qmgr -c "set queue batch started=true"
qmgr -c "set queue batch enabled=true"
qmgr -c "set queue batch resources_default.nodes=1"
qmgr -c "set queue batch resources_default.walltime=3600"
qmgr -c "set server default_queue=batch"
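
If you expect to create more than one queue, the same seven commands can be generated from a template. This is only a sketch: print_queue_setup is a hypothetical helper, and it prints the commands instead of running them so they can be reviewed first.

```shell
#!/bin/sh
# Print the qmgr commands that create and enable an execution queue.
# $1 = queue name, $2 = default walltime in seconds.
# Nothing is executed here; pipe the output to sh on the head node to apply it.
print_queue_setup() {
    queue=$1; walltime=$2
    cat <<EOF
qmgr -c "set server scheduling=true"
qmgr -c "create queue $queue queue_type=execution"
qmgr -c "set queue $queue started=true"
qmgr -c "set queue $queue enabled=true"
qmgr -c "set queue $queue resources_default.nodes=1"
qmgr -c "set queue $queue resources_default.walltime=$walltime"
qmgr -c "set server default_queue=$queue"
EOF
}

# Reproduces the commands listed above for the batch queue.
print_queue_setup batch 3600
```

Once the output looks right, print_queue_setup batch 3600 | sh applies it as root on the head node.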

Adding the e-mail addresses did not work:

qmgr -c "set server operators = root@localhost"
qmgr -c "set server operators += dago@mat.ufrgs.br"

Sanity Check

Check the available queues:

root@one# qstat -q

server: one

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
batch              --      --       --      --    0   0 --   E R
                                               ----- -----
                                                   0     0

Note that the "batch" queue is empty (as expected). View the qmgr configuration with

qmgr -c "print server"

Now let's submit a job to the queue. Switch to another user (do not run this as root), submit a job that sleeps for 30 seconds, and check the queue:

 root# su - usuario
 usuario# echo "sleep 30" | qsub
 0.one.matrix
 usuario# qstat 

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
0.one                      STDIN            dago                   0 Q batch

Excellent, the job shows up! Unfortunately, it will not run... the state is "Q" (for "queued"), and it needs to be scheduled. That is what we will install Maui for later.
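
When checking the queue from a script, the state letter can be pulled out of the qstat listing. A minimal sketch, assuming the six-column layout shown above, where the fifth whitespace-separated field is the S column; job_state is a hypothetical helper name.

```shell
#!/bin/sh
# Read a qstat listing on stdin and print the state letter of job $1.
# Assumption: the job id is the first field and the state is the fifth,
# as in the listing above.
job_state() {
    awk -v id="$1" '$1 == id { print $5 }'
}

# Example: qstat | job_state 0.one
```

An empty result means the job is not in the listing at all.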

Introducing Torque to the Worker Nodes

We need to tell pbs_server which worker nodes are available and will run pbs_mom, a client that allows the server to give them jobs to run. Edit the file below, listing the machines and their number of processors:

vi /var/spool/pbs/server_priv/nodes
cell108 np=2
cell109 np=2
cell110 np=2
cell111 np=2
cell112 np=2
cell113 np=2
cell114 np=2
cell115 np=2
cell116 np=2
cell117 np=2
cell118 np=2
cell119 np=2
cell120 np=2
cell121 np=2
cell122 np=2
cell123 np=2
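
With sequentially numbered nodes, the list does not have to be typed by hand. A small sketch under the assumption that every node has the same processor count; gen_nodes_file is a hypothetical helper name.

```shell
#!/bin/sh
# Print one "<name> np=<count>" line per node, in the format used by
# /var/spool/pbs/server_priv/nodes.
# $1 = name prefix, $2 = first number, $3 = last number, $4 = processors per node.
gen_nodes_file() {
    prefix=$1; first=$2; last=$3; np=$4
    n=$first
    while [ "$n" -le "$last" ]; do
        echo "$prefix$n np=$np"
        n=$((n + 1))
    done
}

# Prints the 16 entries shown above, cell108 through cell123.
gen_nodes_file cell 108 123 2
```

Redirect the output to /var/spool/pbs/server_priv/nodes once it looks right.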

Installing Torque on the Worker Nodes

Now we need to install a smaller version of torque, called pbs_mom, on all of the worker nodes. Move back into the directory we untarred earlier, /usr/local/src/torque*. There's a handy way to create the packages for the torque clients. Run

make packages

and they'll be created for you. This time you'll get a confirmation message:

Done.

The package files are self-extracting packages that can be copied
and executed on your production machines.  Use --help for options.

You'll see some new files in the directory now if you run an ls. The one we're interested in is torque-package-mom-linux-*.sh where the * is your architecture. We need to copy that file to all the worker nodes. You can either copy it over to a shared NFS mount, or see my Cluster Time-saving Tricks on how to copy a file to all the nodes using the rsync command. I'm copying it over to my NFS mount with

cp torque-package-mom-linux-i686.sh /shared/usr/local/src/

Once it's on each worker node, they each need to run the script with

torque-package-mom-linux-i686.sh --install

You have a couple of options for doing this on each node. You can ssh over and run it manually, or you can check out my Cluster Time-saving Tricks page to learn how to write a quick script to run the command over ssh without having to log into each node. If you're going with the second route, the command to use is

for x in `cat machines`; do ssh $x /<full path to package>/torque-package-mom-linux-i686.sh --install; done

Before we can start up pbs_mom on each of the nodes, they need to know who the server is. You can do this by creating a file /var/spool/pbs/server_name that contains the hostname of the head node on each worker node, or you can copy the file to all of the nodes at once with a short script (assuming you've created a file at ~/machines with the hostnames of the worker nodes as outlined in the Cluster Time-saving Tricks page):

for x in `cat ~/machines`; do rsync -plarv /var/spool/pbs/server_name $x:/var/spool/pbs/; done

Next, if you're using an NFS-mounted file system, you need to create a file on each of the worker nodes at /var/spool/pbs/mom_priv/config with the contents

$usecp <full hostname of head node>:<home directory path on head node> <home directory path on worker node>

The path is the same for me on my head node or worker node, and my file looks like this:

$usecp gyrfalcon.raptor.loc:/shared/home /shared/home
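
One pitfall when creating this file from the shell: $usecp looks like a shell variable, so inside double quotes it would be expanded to an empty string. The sketch below keeps it literal by using single quotes around the format string; usecp_line is a hypothetical helper name.

```shell
#!/bin/sh
# Print a $usecp line for pbs_mom's config file.
# $1 = full hostname of the head node
# $2 = home directory path on the head node
# $3 = home directory path on the worker node
usecp_line() {
    # Single quotes keep "$usecp" literal instead of expanding a shell variable.
    printf '$usecp %s:%s %s\n' "$1" "$2" "$3"
}

# Reproduces the example line above.
usecp_line gyrfalcon.raptor.loc /shared/home /shared/home
```

Redirecting the output to a file named config gives something ready to copy into /var/spool/pbs/mom_priv/ on each node.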

Again, this file can be created on each of the worker nodes, or you can create it and copy it over to each of the nodes. If you're using the latter technique, assuming you've created a machines file with all the host names, and you've created a config file, the command to run from the head node is

for x in `cat ~/machines`; do rsync -plarv config $x:/var/spool/pbs/mom_priv/; done

After you've done that, pbs_mom is ready to be started on each of the worker nodes. Again, you can ssh in to each node and run pbs_mom, or the script equivalent is

for x in `cat ~/machines`; do ssh $x pbs_mom; done
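
The one-line loops above can be wrapped in a small function that also reports which nodes failed. This is a sketch only: run_on_all and the DRY_RUN variable are hypothetical names, and it assumes the machines file holds one hostname per line.

```shell
#!/bin/sh
# Run a command on every host listed in a machines file.
# $1 = machines file, remaining arguments = command to run remotely.
# With DRY_RUN=1 the ssh commands are printed instead of executed.
run_on_all() {
    machines=$1; shift
    while read -r host || [ -n "$host" ]; do
        [ -n "$host" ] || continue      # skip blank lines
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "ssh $host $*"
        else
            # -n stops ssh from reading the rest of the machines file on stdin
            ssh -n "$host" "$@" || echo "FAILED on $host" >&2
        fi
    done < "$machines"
}

# Example (prints the commands without contacting any node):
#   DRY_RUN=1 run_on_all ~/machines pbs_mom
```

The dry run is a cheap way to check the host list before touching every node.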

Everyone Playing Nice on Torque

Finally, it's time to make sure the server monitors the pbs_moms that are running. Shut down the running pbs_server with

qterm

and then start up the pbs server process again

pbs_server

Then, to see all the available worker nodes in the queue, run

pbsnodes -a

Each worker node should report something like

peregrine
     state = free
     np = 4
     ntype = cluster
     status = opsys=linux,uname=Linux peregrine 2.6.21-2-686 #1 SMP Wed Jul 11 0
     3:53:02 UTC 2007 i686,sessions=? 0,nsessions=? ,nusers=0,idletime=1910856,totme
     m=3004480kb,availmem=2953608kb,physmem=1028496kb,ncpus=8,loadave=0.00,netload=18
     0898837,state=free,jobs=,varattr=,rectime=1200191204
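
With more than a few nodes the full report gets long. The sketch below condenses it to one line per node; node_states is a hypothetical helper name, and it assumes the layout shown above, with node names starting in the first column and attributes indented with spaces.

```shell
#!/bin/sh
# Condense "pbsnodes -a" style output read from stdin into
# "<node> <state>" lines.
node_states() {
    awk '
        # An unindented, non-empty line carries the node name.
        /^[^ ]/ { node = $1 }
        # An attribute line of the form "state = <value>".
        $1 == "state" && $2 == "=" { print node, $3 }
    '
}

# Example: pbsnodes -a | node_states
```

The long status line also contains state=free, but it is not matched because its first field is not the word state on its own.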

Ready to continue? Move on to installing Maui, the scheduler.

References

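For problems, see

  • http://www.clusterresources.com/pages/resources/documentation/common-issues/torque.php
  • http://linux.die.net/man/8/pbs_mom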

Part 3

About Maui

The Maui Cluster Scheduler, or just Maui for short, is a cluster scheduler from Cluster Resources. Maui needs to be installed on just the head node, and then Torque is used to submit jobs to this scheduler. Maui manages the clients by way of the pbs_moms.

Installing Maui

To get Maui, first visit http://www.clusterresources.com/downloads/maui/temp/ and find the most recent version of it. At the time of this writing, that happens to be the 27-Jun-2007 snapshot. Copy the link for the location of the file. From /usr/local/src/, issue the following command for the most current file:

wget http://www.clusterresources.com/downloads/maui/temp/maui-3.2.6p20-snap.1182974819.tar.gz

Next, untar the file with

tar xvf maui-3.2.6p20-snap.1182974819.tar.gz

Move into the directory that was just created with cd maui-*. We're ready to run ./configure (as part of the Source Installation Paradigm, which you might want to check out if this seems unfamiliar to you). We'll add a number of arguments. To see all of the possible arguments, type ./configure -help. What we'll use is this:

./configure --with-pbs --with-spooldir=/var/spool/maui/
  • --with-pbs makes it compatible with Torque
  • --with-spooldir sets it to use /var/spool/maui as its home directory

If it finishes successfully, you'll see a message and a confirmation, as shown below.

configure: NOTE:  link 'docs/mauidocs.html' to your local website for access to 
user and admin documentation
NOTE:  latest downloads, patches, etc are available at 'http://supercluster.org/
maui'

configure successful.

Next, run

make

If it finishes without an error, the make was successful. Finally, run

make install

and again, if it finishes without an error, that's a success. In order for mine to work, I had to edit /var/spool/maui/maui.cfg. (If you didn't change your spool directory during ./configure, yours will be located at /usr/local/maui/maui.cfg.) You should have a line like

#RMCFG[HEADNODE] TYPE=PBS@RMNMHOST@

where HEADNODE is your head node's hostname in capital letters. Comment out this line by adding a pound symbol, #, in front of it. Then create a line below it:

RMCFG[headnode] TYPE=PBS

where headnode is your head node's hostname in lowercase letters.
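
The same edit can be scripted. The following is a sketch under a few assumptions: the generated line starts with RMCFG[ and is not yet commented out, appending the new line at the end of the file is acceptable instead of placing it directly below, and fix_rmcfg is a hypothetical helper name. Keep a backup of maui.cfg before trying it.

```shell
#!/bin/sh
# Comment out every active RMCFG[...] line in a maui.cfg file and append
# a lowercase replacement.
# $1 = path to maui.cfg, $2 = head node hostname in lowercase.
fix_rmcfg() {
    cfg=$1; head=$2
    # Prefix lines that start with "RMCFG[" with a pound symbol.
    sed -e 's/^RMCFG\[/#&/' "$cfg" > "$cfg.new" &&
    echo "RMCFG[$head] TYPE=PBS" >> "$cfg.new" &&
    mv "$cfg.new" "$cfg"
}

# Example, using the spool directory chosen during ./configure:
#   cp /var/spool/maui/maui.cfg /var/spool/maui/maui.cfg.bak
#   fix_rmcfg /var/spool/maui/maui.cfg gyrfalcon
```

Lines that are already commented out are left untouched, because they no longer start with RMCFG[.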

Starting Maui

Now maui can be started up on the head node. Maui installs the executable to /usr/local/maui/bin, so you'll want to add that as part of root's path. To do this, run

export PATH=$PATH:/usr/local/maui/bin:/usr/local/maui/sbin

(To make this a permanent addition, add the above line to your ~/.bashrc file.) Then run

maui

You won't get any output from it, but running

ps aux | grep maui

should show maui running now. In addition, running showq should give you a nice view of jobs in the queue waiting to be scheduled. Currently there are none.

gyrfalcon:/var/spool/maui# showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME


     0 Active Jobs       0 of    0 Processors Active (0.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

Sanity Check

By this point, you should have both torque and maui installed. Great! Continue onto the sanity check to make sure they're talking to each other.

References

Part 4

Torque/Maui Sanity Check: Submitting a Job

A job is one particular instance of running a particular script or program of code. You won't want to run a job as root, so first, on your head node, become one of your users. (For instance, su - kwanous.)

Jobs are submitted to the job queue run by torque. Maui monitors that queue and decides when and where each job should run; torque then tells the pbs_mom client on the worker node that maui picked to run the job. Jobs are submitted to torque with the qsub command.

Test: Sleep Job

An easy job to submit and monitor is just a sleep command.

As one of your users, enter the command that will create a job that simply sleeps for 30 seconds, as shown below:

echo "sleep 30" | qsub

Immediately afterward, run the torque command qstat to see the job appear in torque's queue, and then the maui command showq. You can even run

pbsnodes | grep -v status | grep -v ntype

to see which node the job is running on. A script of my output is shown below.

kwanous@gyrfalcon:~$ echo "sleep 30" | qsub
6.gyrfalcon

kwanous@gyrfalcon:~$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
6.gyrfalcon               STDIN            kwanous                0 R batch          
kwanous@gyrfalcon:~$ showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

6                   kwanous    Running     1     1:00:00  Wed Jan 23 14:00:24

     1 Active Job        1 of   28 Processors Active (3.57%)
                         1 of    7 Nodes Active      (14.29%)

... snipped ...

Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0
kwanous@gyrfalcon:~$ pbsnodes | grep -v status | grep -v ntype
eagle
     state = free
     np = 4

 ... snipped ...

peregrine
     state = free
     np = 4
     jobs = 0/7.gyrfalcon

Approximately thirty seconds later, the job should finish running. If you run qstat and showq again, you should no longer see the job (6.gyrfalcon, in my example) running.

Sleep Job Results

In the home directory of the user you've submitted the job as, you should now see two files, something like:

  • STDIN.o3
  • STDIN.e3

where 3 is the job ID. The file ending in .o# is all of the output in the form of standard out that came from the job. .e# is all the output from standard error. For our sleep job, both of these should be empty. sleep doesn't give any output to standard out or standard error.

Test: Standard Output vs Standard Error

Qsub can also take input in the form of files. These files can give all sorts of specifications to torque about how long the job will run and what resources it needs. (To learn more about qsub submission files, see Torque Qsub Scripts.) We'll write just a simple one. Open your favorite text editor and enter the contents of my Standard Output/Error For Loop Script and save this file to submission. This script has a simple for loop that runs from 1 to 10. If the number is less than 5, it will print a statement to standard output. If the number is greater than or equal to 5, it will print a statement to standard error.
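
The linked script is not reproduced on this page, so here is a minimal sketch of what such a submission file could look like, reconstructed only from the behavior described above. The function name report_numbers is an assumption, and the real script may differ.

```shell
#!/bin/sh
# Loop from 1 to 10: numbers below 5 are reported on standard output,
# the others on standard error.
report_numbers() {
    for i in 1 2 3 4 5 6 7 8 9 10; do
        if [ "$i" -lt 5 ]; then
            echo "$i is less than 5"
        else
            echo "$i is greater than or equal to 5" >&2
        fi
    done
}

report_numbers
```

Running the file directly with sh before submitting it is a quick way to confirm that it behaves as expected outside the queue.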

Submit the job with

qsub submission

where submission is the name of the script file.

Job Results

Again, you should have .o# and .e# files in your home directory, but this time they should start with the name of the file submitted to qsub (submission). This time, they should have content in them. Your output file should have the first four lines, which were printed to standard output:

1 is less than 5
2 is less than 5
3 is less than 5
4 is less than 5

and your error file should have the last six, which were printed to standard error:

5 is greater than or equal to 5
6 is greater than or equal to 5
7 is greater than or equal to 5
8 is greater than or equal to 5
9 is greater than or equal to 5
10 is greater than or equal to 5

Hmm...

If you didn't get the results described on this page, visiting the Troubleshooting Torque and Maui page might be of help.


  • To check whether pbs_mom is running
 ps -ef|grep pbs_mom