Speech:Summer 2013 Tommy McCarthy:Prelim Torque Logs

Setup
This is the configuration being used.
//start server if it's not running
% pbs_server

//default queue settings
% qmgr -c "create queue batch queue_type=execution"
% qmgr -c "set queue batch started=true"
% qmgr -c "set queue batch enabled=true"
% qmgr -c "set queue batch resources_default.nodes=1"
% qmgr -c "set queue batch resources_default.walltime=3600"
% qmgr -c "set server default_queue=batch"
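The queue settings above can be collected into one small script so they are easy to re-apply. This is only a sketch: the names run_qmgr, setup_batch_queue and DRY_RUN are made up for illustration and are not part of TORQUE. By default it prints the qmgr commands instead of running them.

```shell
#!/bin/sh
# Sketch only: run_qmgr, setup_batch_queue and DRY_RUN are hypothetical names.
# With DRY_RUN=1 (the default) the qmgr commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run_qmgr() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'qmgr -c "%s"\n' "$1"
    else
        qmgr -c "$1"
    fi
}

# Apply the same six settings as the log, for the queue named in $1.
setup_batch_queue() {
    queue="${1:-batch}"
    run_qmgr "create queue $queue queue_type=execution"
    run_qmgr "set queue $queue started=true"
    run_qmgr "set queue $queue enabled=true"
    run_qmgr "set queue $queue resources_default.nodes=1"
    run_qmgr "set queue $queue resources_default.walltime=3600"
    run_qmgr "set server default_queue=$queue"
}

setup_batch_queue batch
```

Setting DRY_RUN=0 would run the commands for real, which needs an account with TORQUE manager privileges.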

Work Overview
This log is the first real, thorough test of using TORQUE and Sphinx to run a train.

To begin, I used Eric's new scripts (undocumented as of publishing, I believe).

Step 1
% /mnt/main/scripts/user/train_01.pl
Returns 0135 (the experiment number used in the steps below)

Step 2
% /mnt/main/scripts/user/train_02.pl -e 0135

Step 3
% /mnt/main/scripts/user/clone_exp.pl -t all -e 0135 -c 0089

Step 4
(Generate feats data)
% /mnt/main/scripts/train/scripts_pl/make_feats.pl -ctl /mnt/main/Exp/0135/etc/0135_train.fileids

Step 5
Customize etc/sphinx_train.cfg for TORQUE.
- denotes original line
+ denotes line change
LINE 122 (-) $CFG_NPART = 1;
LINE 122 (+) $CFG_NPART = 8;
LINE 165 (-) $CFG_QUEUE_TYPE = "Queue";
LINE 165 (+) $CFG_QUEUE_TYPE = "Queue::PBS";
LINE 168 (-) $CFG_QUEUE_NAME = "workq";
LINE 168 (+) $CFG_QUEUE_NAME = "batch";
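The three changes above could also be applied with sed rather than by hand. This is a minimal sketch, assuming the original lines appear in the file exactly as quoted above. The helper name patch_cfg is made up, and it matches on the variable names instead of line numbers 122/165/168, since those may shift between files.

```shell
#!/bin/sh
# Sketch only: patch_cfg is a hypothetical helper.
# Assumes the three original lines start exactly as quoted in the log.
patch_cfg() {
    cfg="$1"
    tmp="$cfg.tmp.$$"
    # Write to a temp file, then copy back so the file keeps its permissions.
    sed \
        -e 's/^\$CFG_NPART = 1;/$CFG_NPART = 8;/' \
        -e 's/^\$CFG_QUEUE_TYPE = "Queue";/$CFG_QUEUE_TYPE = "Queue::PBS";/' \
        -e 's/^\$CFG_QUEUE_NAME = "workq";/$CFG_QUEUE_NAME = "batch";/' \
        "$cfg" > "$tmp" && cat "$tmp" > "$cfg" && rm -f "$tmp"
}

# Demo on a throwaway file holding only the three original lines.
demo=$(mktemp)
printf '%s\n' '$CFG_NPART = 1;' '$CFG_QUEUE_TYPE = "Queue";' \
    '$CFG_QUEUE_NAME = "workq";' > "$demo"
patch_cfg "$demo"
cat "$demo"
rm -f "$demo"
```

On the real experiment this would be called as patch_cfg etc/sphinx_train.cfg from inside the experiment folder, ideally after making a backup copy.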

Step 6
We have some permission issues now, so we have to change the permissions of the experiment folder. (This is preliminary work; I need someone who knows *nix better!)
% cd /mnt/main/Exp/
% chmod -R 777 0135
% cd 0135
We cannot submit the job as root, which means we can't run RunAll as root either.
% su
% /mnt/main/scripts/train/scripts_pl/RunAll.pl
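chmod -R 777 opens the experiment folder to every account on the machine. A narrower option, sketched below, is to hand the folder to a shared group and give only that group write access. This has not been tried on this cluster: it assumes the accounts that run the jobs all belong to one group, and the helper name share_with_group is made up for illustration.

```shell
#!/bin/sh
# Sketch only: share_with_group is a hypothetical helper.
# Assumes the job-running accounts all belong to the given group.
# g+rwX gives the group read/write everywhere, and execute only on
# directories (and on files that are already executable).
share_with_group() {
    dir="$1"
    group="$2"
    chgrp -R "$group" "$dir" && chmod -R g+rwX "$dir"
}

# Demo on a throwaway directory, using the current user's primary group.
demo=$(mktemp -d)
touch "$demo/example.cfg"
share_with_group "$demo" "$(id -gn)"
ls -l "$demo"
rm -rf "$demo"
```

For the experiment above the call would look like share_with_group /mnt/main/Exp/0135 followed by the group name, which the log does not give.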

Step 7
Now the jobs are enqueued, which you can see by running qstat. Their status will be Q (queued) or H (held). We have to start them manually (for now) as follows:
% su root
% qrun ##.caesar
...where ## is the job number (from qstat)
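Running qrun once per job gets tedious when the train is split into several parts. The loop below is a sketch of how the queued job IDs could be pulled out of qstat and passed to qrun. The helper names are made up, the sample table is invented, and it assumes qstat's default layout, where the state is the second-to-last column. Held (H) jobs are skipped on purpose.

```shell
#!/bin/sh
# Sketch only: queued_job_ids and run_queued_jobs are hypothetical helpers.

# Read qstat's default table on stdin and print the ID of every job whose
# state column (second to last) is Q.
queued_job_ids() {
    awk 'NR > 2 && $(NF - 1) == "Q" { print $1 }'
}

# Start every queued job; needs the same privileges as the manual qrun.
# Defined here but not called, so the demo below is safe to run anywhere.
run_queued_jobs() {
    qstat | queued_job_ids | while read -r id; do
        qrun "$id"
    done
}

# Demo on made-up qstat output (job names and users are invented).
queued_job_ids <<'EOF'
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
12.caesar                 job_a            someuser        0        Q batch
13.caesar                 job_b            someuser        0        H batch
14.caesar                 job_c            someuser        00:01:02 R batch
EOF
```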

- You can see which nodes are working by running pbsnodes. If running, they'll show up as job-exclusive.
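
To list only the busy nodes, the pbsnodes output can be filtered. This is a sketch that assumes the usual layout, with each node name starting in the first column followed by indented "key = value" lines. The helper name busy_nodes and the sample node names are made up.

```shell
#!/bin/sh
# Sketch only: busy_nodes is a hypothetical helper.
# Reads pbsnodes output on stdin and prints the name of every node whose
# "state = ..." line contains job-exclusive.
busy_nodes() {
    awk '
        /^[^ \t]/ { node = $1 }
        $1 == "state" && $3 ~ /job-exclusive/ { print node }
    '
}

# Demo on made-up pbsnodes output (node names are invented).
busy_nodes <<'EOF'
node01
     state = job-exclusive
     np = 8

node02
     state = free
     np = 8
EOF
```

On the cluster this would be used as pbsnodes -a | busy_nodes.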