Speech:Summer 2013 Tommy McCarthy

July 17
A week ago or so, I scoured the web to find any examples, configurations, or tools that people have used, developed, or written about regarding running Sphinx over multiple machines. Unfortunately, I really did not find much. Torque is used for scheduling multiple processes across many machines, but doesn't really distribute the data. We care more about the data than the processes, since we only have so many processes running (presumably one per train, but I haven't watched yet). I turned to the forums for Sphinx at SourceForge, hoping to get some guidance. So far just a condescending response. (As if this guy had no idea what I was looking for...) Gave him a nice sarcastic response. Jerk.

Initial post:

Hey everyone, I'm currently working on a project at my university. We're using Sphinx3, and I need to look into ways to distribute the processing. Currently, we have one primary machine and 9 drones, all networked together. They each share the same mount point and applications, but cannot work together right now. It's my task to find a way that we can break up the processing among the networked machines (and, in theory, above and beyond!). I've searched around on here and Google and mostly found people pointing towards Torque. When I proposed that to my professor, he pointed out that Torque doesn't split up the data and process it in parallel; it schedules processing jobs across multiple machines, and therefore wouldn't provide the results we're trying to obtain. I found it strange that I hadn't come across anything yet, and was hoping to see if anybody has some resources, examples, or anything else that may be helpful to me! Thanks for your time.

Follow up:

We are looking to train models in parallel. We have ten machines and would like to cut our training time to 1/10th. The idea is that if we have 100 hours of training data...for the sake of argument let's assume that training runs in real-time...then instead of running on a single machine for 100 hours, we run on ten machines for 10 hours, plus some small amount of time per training iteration (since training is done in multiple iterations) to sync up and combine any separately generated new information (i.e. new separate models). My advisor has done this sort of thing with a proprietary system many years ago but doesn't know Sphinx that well, and was wondering if it has been done already. We realize that Torque is integrated with Sphinx, and we're curious whether someone has applied Torque to allow this type of parallelization, or is merely utilizing it to run things a bit more efficiently. What we want to do is really cut model-building time down to 1/10 or less, dictated by the number of machines we have. Any insight on this would be helpful. We can obviously split our 100 hours of data pretty easily into 10 equal chunks, but the post-iteration re-combination of models will require effort, and if either a) it's been done already or b) there's a different technique that accomplishes the same thing with Torque, then that would be great to know about and have a reference for.
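Splitting the data itself should be the easy part. Something like this is what I have in mind, assuming the training control file lists one utterance per line (the file names here are made up, and this uses GNU split):

# split the training control file into 10 equal chunks by line count
split -n l/10 -d train.fileids train.fileids.part
# -> train.fileids.part00 ... train.fileids.part09, one chunk per machine
# the matching transcription file would have to be split along the same line boundaries

The hard part, as noted above, is recombining the per-chunk statistics after each training iteration.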

I plan on searching more specific forums for examples and tools about what we can do to solve this problem. We'll see where this goes!

July 24
On Mike's request, I took to the IRC channel (#cmusphinx on freenode) to ask for more help after being ignored again.

Me: Hey everyone, I have a question regarding distributed processing and Sphinx. https://sourceforge.net/p/cmusphinx/discussion/sphinx4/thread/d9400973/ Didn't know if someone could take a look and point me in the right direction! :)
Them: the right direction is to ask question properly and then you will get an answer ;)
Me: I provided additional details in my follow-up post
Them: those have zero information as well. you can't expect answers asking questions this way. and sometimes it's easier to try than to ask
Me: We're simply not sure the best way to approach it.
Them: If you want to train on 10 machines you can use torque, it's supported
Me: Does that actually distribute the data or just the processes? As in, (theoretically) cut down 100 hours of realtime processing to 10?
Them: data is distributed over nfs, processing over torque

In today's meeting with Mike, we'll determine what direction we'll take.

We're going to install Torque and run two 5-hour trains to see how that goes.

August 7
There was a tarball of Torque within /mnt/main/install/tar. I unpacked it and installed it. I've taken notes of my steps for documentation (if needed). Eric helped me out getting the pbs_server command to work (turns out the libraries just needed to be refreshed). Will update shortly.
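Roughly, the build boiled down to the usual source-install steps. This is a sketch reconstructed from memory rather than my exact commands; the ldconfig line is my guess at what Eric's "library refresh" was, and the exact tarball name isn't recorded here:

cd /mnt/main/install/tar
tar xzf torque-*.tar.gz       # exact version/filename not noted
cd torque-*
./configure
make
make install
ldconfig                      # refresh the shared-library cache so pbs_server finds its libs
pbs_server -t create          # initialize a fresh server database on caesar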

August 12
After Eric and I worked out some kinks on Wednesday, today I got all of the drones working with Torque. However, as with obelix, I commented out their second IP addresses in /etc/hosts. Otherwise the main command, pbs_server, fails. :/ Could potentially be an issue.

As of right now, all the drones successfully show as available to torque to use for processing. To start the process, run "pbs_mom", and to stop, "momctl -s".
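For my own reference, the round trip looks roughly like this (assuming the Torque binaries are on the default PATH on each machine):

# on each drone
pbs_mom                 # start the MOM daemon so the node can accept work
momctl -s               # shut it down again when needed

# on caesar, confirm the drones are visible
pbsnodes -a             # full listing with state for every node
pbsnodes -l             # just the ones that are down or offline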

I went to run genTrans6.pl for experiment 0131, but qsub didn't take the command "qsub /mnt/main/scripts/user/genTrans6.pl /mnt/main/corpus/switchboard/first_5hr 0131". Instead, I wrote a shell script with that exact command in it, and put the shell script into the queue. I ran it, and it had an "E" status within a few seconds. "E" means "Job is exiting after having run." Something didn't go right here :/ I emailed Eric about this for any input. Will report back.
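The wrapper was essentially just this (the #PBS directive lines are extras I'd consider adding, not what was actually in the file; the script name is mine):

#!/bin/sh
# genTrans.sh - wrapper so qsub has a plain shell script to submit
#PBS -N genTrans6
#PBS -q batch
/mnt/main/scripts/user/genTrans6.pl /mnt/main/corpus/switchboard/first_5hr 0131

Submitted with: qsub genTrans.sh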

August 13
Today I got everything set up again and tried to make a shell script to execute:

/mnt/main/scripts/train/scripts_pl/make_feats.pl -ctl /mnt/main/Exp/0133/etc/0133_train.fileids

I submitted it to Torque with qsub FEATS.sh (my file name), and it didn't work. I ran it manually after Eric said it wouldn't be worth it to run that in batch. After more trial and error and discussion, it seems we may not need to make a shell script to submit a Sphinx job (i.e. RunAll.pl). In sphinx_train.cfg, I changed the following (line numbers from the file):

122: $CFG_NPART = 8;              # there are 8 drones now, Eric's using Asterix
165: $CFG_QUEUE_TYPE = "Queue::PBS";
168: $CFG_QUEUE_NAME = "batch";   # current name of our queue (changeable via qmgr)

After these configuration changes, RunAll.pl fails with:

Phase 3: Forward-Backward
Queue submission failed: no Job ID
Something failed: (/mnt/main/Exp/0133/scripts_pl/20.ci_hmm/slave_convg.pl)

(looking into that now...) After more work, we had to chmod the 0133 dir to 777 (yeah, I know that's bad). Qsub won't take a script from root, so I had to run as myself. After those changes, it's running! I came across an error, FATAL_ERROR: "main.c", line 627: errors normalizing, but that seems to be gone now. I ran qstat in a separate session and got this:

Job id       Name     User    Time Use  S  Queue
----------   ------   -----   --------  -  -----
9.caesar     bw.1.1   twh22   0         Q  batch
10.caesar    bw.1.2   twh22   0         Q  batch
11.caesar    bw.1.3   twh22   0         Q  batch
12.caesar    bw.1.4   twh22   0         Q  batch
13.caesar    bw.1.5   twh22   0         Q  batch
14.caesar    bw.1.6   twh22   0         Q  batch
15.caesar    bw.1.7   twh22   0         Q  batch
16.caesar    bw.1.8   twh22   0         Q  batch
17.caesar    norm.1   twh22   0         H  batch

Which means these are all enqueued! The process is currently running, so we'll see how it goes.
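So the working recipe, roughly, is the following. The chmod is the quick-and-dirty fix we used (chown to my user would be cleaner), and I'm assuming RunAll.pl sits in the experiment's scripts_pl directory, which I should double-check:

# run the training pipeline as a regular user, never as root
chmod -R 777 /mnt/main/Exp/0133        # ugly, but it unblocked the permission failures
su - twh22
cd /mnt/main/Exp/0133
perl scripts_pl/RunAll.pl              # submits the bw.* / norm.* jobs to the "batch" queue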

August 14
Please see this page for a new experiment fully documented.

After some fiddling, I got all but one job to run. Now I'm getting errors running any command. ps -u root shows pbs_server is still running, but I cannot access any commands. I'm greeted with:

caesar:~ # qstat
pbs_iff: cannot read reply from pbs_server
No Permission.
qstat: cannot connect to server caesar (errno=15007) Unauthorized Request

This is strange. Going to the meeting now, discussing it then.
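One thing I want to check (an assumption on my part from reading around, not something we've confirmed here): pbs_iff supposedly has to be setuid root for client commands like qstat to authenticate with the server. The install path below is a guess:

# check whether pbs_iff is setuid root
ls -l /usr/local/sbin/pbs_iff
# if the 's' bit is missing, this is the commonly suggested fix:
chown root /usr/local/sbin/pbs_iff
chmod u+s /usr/local/sbin/pbs_iff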

August 15
I ran the RunAll script again as myself (twh22), and then su'd back to root and submitted each job in the queue with qrun. All jobs completed, qstat reports nothing, and pbsnodes reports all nodes are available now. Unfortunately, the RunAll script is still listed in "ps -u twh22", and my PuTTY session times out. I reached out to Eric to see if he has any ideas on this. Either Torque isn't completing the jobs properly, or Sphinx isn't realizing that the queue is done and so isn't resuming.
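What I'm doing by hand amounts to something like this (a sketch; in practice I've just been feeding each job ID from qstat to qrun):

# as root: push every queued job through qrun, hand-cranking what the scheduler should do
for job in $(qselect -s Q); do
    qrun "$job"
done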

Later on
Here's a rundown of what I have come across since then. I've been working on it without updating my logs, mostly in conversation with Eric. At this point, I was getting a LOT of "post job processing error" messages in the Torque server logs, and Torque isn't very specific about them. After rebuilding and reinstalling Torque, recreating all the pbs_mom servers on the drones, and countless other things, I've gotten to a pretty decent point.
 * Do not create an experiment as root if you're looking to use Torque
 * Torque can't run jobs as root, and then a landslide of permission issues arises
 * Torque's logs aren't that specific. Check the user's mail.
 * It took me FOREVER to determine what the "post job processing error" was. The only way I figured it out was through the user email, which specified that scp was failing due to "permission denied". (Turns out my user couldn't move a file into an exp created by root.)

At this point, the jobs either complete properly or they don't. When a job doesn't succeed, it's because of a dependency issue: the dependent job "runs" and Torque then deletes it.

Raw Stuff
Me: Doing more investigative work, and I'm looking through the logs seeing "Post job processing error" When I Google that, I get this thread, which seems to say that you need to be able to ssh between servers without a password. Root is running the jobs, and cannot ssh among the servers without a password now. Do you think this is something worthy of pursuing? If so, I forget how we do the ssh trick to allow ssh'ing without passwords. Especially related to root. Would this be a security issue?

Eric: SSH allows for auto-authentication/login through a special file called 'authorized_keys', located within the .ssh directory in the user's home directory, which contains an RSA public key for a trusted host. What we do is essentially take the user's public key (I think it's id_rsa.pub) and rename it to authorized_keys, allowing the machine to log in to itself; since each user account on each batch machine uses the same home directory (on /mnt/main), each batch machine behaves similarly.

Now, root I think is local to each batch machine, so the trick won't work unless you manually spread the keys around to root's home directory (/root ?) on each batch machine. It's certainly an option, but I don't think it's the best option.
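For the record, the passwordless-SSH recipe Eric describes amounts to something like this (a sketch; I haven't verified this is exactly what's on our machines):

# generate a key pair if one doesn't exist (accept defaults, empty passphrase)
ssh-keygen -t rsa
# let the shared home directory trust its own key on every batch machine
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# for root, the same files would have to be copied into /root/.ssh on each drone by hand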

A more important issue is that the batch machines cannot write anywhere at or within /mnt/main as root. This is due to a security feature on the NFS share on Caesar (I think it's called 'root_squash' or something like that). We can certainly disable that (it's defined within the /etc/exports file), though doing so is, I think, a security/screwiness risk.
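A sketch of what the relevant line in /etc/exports on Caesar might look like (I haven't actually looked at the file, so the host pattern and options here are assumptions):

# current behavior: root writes from the drones get squashed to an unprivileged user
/mnt/main   *.unh.edu(rw,sync,root_squash)
# disabling it (the trade-off Eric mentions) would swap root_squash for no_root_squash;
# after editing, re-export with: exportfs -ra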

I think the best option would be either to set up a special 'torque' account that is used by torque, or to tell torque to use a user-level account.

Look at this manual: http://icme.stanford.edu/Computer%20Resources/docs/TORQUE_Administrator%27s_Guide.pdf Specifically, part 1.3.2.6 on pg. 22. We can set non-root administrators to start, configure, and manage the pbs_server daemon. I'm not sure if this will allow us to run torque as non-root users, though.

I forget, what part needs to be root to run? Was it the queue submission/execution stuff?

Me: Right now, the actual RunAll.pl script needs to be submitted by non-root. However, the actual items in the queue need to be run (qrun) by root. For some reason I still have to investigate, the queue won't auto start. So I manually submit each one. Should I try the suggestion to allow all users to be trusted as operators and managers?

Eric: That would make sense, it's possible that queues can't be started by non-operator/manager users (except for root).

Me: Okay, so it's not working, as usual. I added the + into the files, and explicitly added me (twh22@caesar.unh.edu) as a manager. I can now qrun as twh22. Still getting the error. And I can SSH among the servers as twh22 without a password. I'm still getting the same error, and every time I Google it, I see the ssh'ing issue. Why won't this just work for me :'( lol
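The qmgr equivalent of what I did (I made the change by editing the config files, so treat this as a sketch):

# grant twh22 manager and operator rights on the server
qmgr -c "set server managers += twh22@caesar.unh.edu"
qmgr -c "set server operators += twh22@caesar.unh.edu"
# confirm the change
qmgr -c "print server"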

I guess this is a mostly pointless email. But do you have any thoughts or other things we could try? (Yes, I did stop 'qterm' and do a cold restart 'pbs_server -t cold' after I made the changes)

Me: Still chugging away at this. I rebuilt Torque with a use-scp flag (see page 74, section 6.1) and I'm precisely where I was stuck before. The logs are still giving the same post job processing error. One thing I'm still curious about is section 6.2.1, which configures the mom nodes to use NFS for the jobs. This is the example they give for an NFS config:

mom_priv/config:
$usecp *:/home /home
$usecp *.fte.com:/data /usr/local/data

(In their example, /home is NFS-mounted on all hosts, and submission hosts in the domain fte.com should map the '/data' directory on the submit host to '/usr/local/data' on the compute host.)

On obelix (/var/spool/torque/mom_priv/config), I have the following (aside from the default config stuff):

$usecp *:/mnt /mnt
$usecp *.unh.edu:/data /usr/local/data

I'm not sure if this is even possibly the problem. The error is still useless to me. Anyway, I know the NFS share is mounted on each machine as /mnt, so that's the logic behind the first line. The second, I'm not sure at all about.

Me: ALRIGHT. New but better question. For some reason I looked at my mail file. And for some reason, Torque actually tells you what's wrong in the 'email' sent to me!

PBS Job Id: 70.caesar.unh.edu
Job Name:   bw.1.8
Exec host:  miraculix/0
An error has occurred processing your job, see below.
Post job file processing error; job 70.caesar.unh.edu on host miraculix/0

Unable to copy file /var/spool/torque/spool/70.caesar.unh.edu.OU to twh22@caesar:/mnt/main/Exp/0142/qmanager/bw.1.8.out
scp: /mnt/main/Exp/0142/qmanager/bw.1.8.out: Permission denied
Output retained on that host in: /var/spool/torque/undelivered/70.caesar.unh.edu.OU
 * error from copy
 * end error output

Unable to copy file /var/spool/torque/spool/70.caesar.unh.edu.ER to twh22@caesar:/mnt/main/Exp/0142/qmanager/bw.1.8.err
scp: /mnt/main/Exp/0142/qmanager/bw.1.8.err: Permission denied
Output retained on that host in: /var/spool/torque/undelivered/70.caesar.unh.edu.ER
 * error from copy
 * end error output

I ssh'd into miraculix and made a test file to copy with scp, and kept getting "permission denied" errors. I chown'd exp 0142 to my user (twh22) and tried again with no luck. AFAIK the key for caesar should be in authorized hosts. Though I think this is where I'm messing something up. Any ideas?
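The by-hand test was basically this (a sketch of what I tried from miraculix; the test filename is made up):

ssh miraculix
touch /tmp/scp_test
scp /tmp/scp_test twh22@caesar:/mnt/main/Exp/0142/qmanager/
# -> Permission denied, the same failure the job hits when delivering its output files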

Me: It's all permission errors. I made a new experiment, 0143, and now the jobs seemingly run properly. This is from the log for ONE job I ran with qrun:

09/19/2013 14:48:10;0100;PBS_Server;Job;116.caesar.unh.edu;enqueuing into batch2, state 1 hop 1
09/19/2013 14:48:11;0008;PBS_Server;Job;116.caesar.unh.edu;Job Queued at request of twh22@caesar.unh.edu, owner = twh22@caesar.unh.edu, job name = norm.1, queue = batch2
09/19/2013 14:48:16;0008;PBS_Server;Job;116.caesar.unh.edu;Job Run at request of twh22@caesar.unh.edu
09/19/2013 14:48:16;000d;PBS_Server;Job;116.caesar.unh.edu;Not sending email: User does not want mail of this type.
09/19/2013 14:48:17;000d;PBS_Server;Job;116.caesar.unh.edu;Not sending email: User does not want mail of this type.
09/19/2013 14:48:17;0010;PBS_Server;Job;116.caesar.unh.edu;Exit_status=2 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:00
09/19/2013 14:48:17;0100;PBS_Server;Job;116.caesar.unh.edu;dequeuing from batch2, state COMPLETE

Looks perfectly fine, right?! Well, sometimes when I run one, the last job (which is suspiciously in the queue with an "H" hold status) gets deleted:

PBS Job Id: 134.caesar.unh.edu
Job Name:   norm.1
Aborted by PBS Server
Job deleted as result of dependency on job 126.caesar.unh.edu
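My current understanding of why the held norm job gets deleted (the qsub form below is a guess at how the dependency is expressed, not something I've pulled out of the SphinxTrain scripts, and the wrapper name is hypothetical):

# norm.1 is presumably submitted with a dependency on the bw jobs, roughly:
qsub -W depend=afterok:126.caesar.unh.edu norm_wrapper.sh
# if the job it depends on exits non-zero (like the Exit_status=2 above),
# "afterok" can never be satisfied, so the server aborts the held job,
# which matches the "Job deleted as result of dependency" mail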