Speech:Torque Updates

This is a list of what's been done with Torque so far.

What's Working

 * Torque is installed on Caesar (pbs_server)
 * All nodes have the client installed and configured (pbs_mom)
 * Two batch execution queues have been created (batch, batch2)
 * When configured, Sphinx will submit the necessary jobs to the desired queue when executing RunAll.pl

What's Not Working

 * Jobs have trouble running once they are in the queue
 * Sometimes they error out, citing dependency issues (check the server logs and the mail log the user receives)
 * This causes RunAll.pl to wait indefinitely
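
One way to keep a caller from waiting forever on a job that errored out is to poll with a deadline. The following is only a minimal sketch of that idea in Python, not how RunAll.pl (a Perl script) actually waits. The function name wait_for_job and the is_finished callback are hypothetical; how to decide that a job has finished (for example by inspecting qstat output) is left to the caller and is not shown.

```python
import time


def wait_for_job(is_finished, timeout_s, poll_s=1.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Poll is_finished() until it returns True or timeout_s elapses.

    Returns True if the job finished, False if the wait timed out.
    is_finished is a hypothetical caller-supplied check; clock and
    sleep are injectable so the loop can be exercised without delays.
    """
    deadline = clock() + timeout_s
    while True:
        if is_finished():
            return True
        if clock() >= deadline:
            # Give up instead of blocking forever on a failed job.
            return False
        sleep(poll_s)
```

A caller would treat a False result as a signal to check the server logs rather than continuing to wait.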

What's Next

 * Torque is currently configured to allow all users (including root) to submit and run jobs. This can (and should) be changed. See section 1.3.2.6 (pg 22) in the PDF manual.
 * Job execution needs to be troubleshot so that jobs run properly
 * Use pbsnodes to manage which nodes are available
 * Also configure the available resources and parameters (if determined necessary)

Resources/Info

 * Torque PDF Documentation
 * Torque HTML Documentation
 * TORQUE_HOME is /var/spool/torque
 * Logs can be found in $TORQUE_HOME/server_logs/YYYYMMDD
 * Logs on the nodes can be found at /var/spool/torque/mom_logs
 * Server name is specified in $TORQUE_HOME/server_name (should be caesar)
 * Nodes are
 * In $TORQUE_HOME/mom_priv/config: $pbsserver caesar
 * Nodes are told to look for caesar in /var/spool/torque/server_name
 * Caesar looks for nodes based on $TORQUE_HOME/server_priv/nodes (one per line)
 * Tommy's Full Logs
 * Full walkthrough of a job (better resource)
 * Note: root is no longer necessary for many of these steps, and running as root causes problems with jobs (permission errors when moving files back to the experiment directory)
 * The nodes cannot have 2 IP addresses in the hosts file. This causes issues.
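
Two of the points above can be turned into small checks. The sketch below, in Python, builds the path of a day's server log from the TORQUE_HOME value listed above, and scans hosts-file text for names that appear with more than one IP address. The function names are hypothetical, and reading "2 IP addresses in the hosts file" as "one node name mapped to two addresses" is an assumption about what the last bullet means.

```python
from collections import defaultdict
from datetime import date

TORQUE_HOME = "/var/spool/torque"  # value given in the list above


def server_log_path(day, torque_home=TORQUE_HOME):
    """Return the server log path for a date (log file is named YYYYMMDD)."""
    return f"{torque_home}/server_logs/{day.strftime('%Y%m%d')}"


def hosts_with_multiple_ips(hosts_text):
    """Map each hostname listed with more than one IP to its sorted IPs.

    Assumes the usual hosts-file layout: an address followed by one or
    more names, with '#' starting a comment.
    """
    ips_by_name = defaultdict(set)
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue  # skip blank and comment-only lines
        ip, *names = line.split()
        for name in names:
            ips_by_name[name].add(ip)
    return {name: sorted(ips)
            for name, ips in ips_by_name.items() if len(ips) > 1}
```

For example, hosts_with_multiple_ips(open("/etc/hosts").read()) run on a node returns an empty dictionary when no name is listed under two addresses.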

Quick Command List

 * pbs_server - starts the server process*
 * pbsnodes - view the nodes*
 * qsub - manually submit a job
 * qstat - view queue*
 * qrun JOBID - force a queued job to run
 * pbs_mom - starts the client process on the nodes

* denotes that useful flags are available (see documentation)