Speech:Spring 2011 Brian Log



Week Ending March 8th, 2011
My task this week was to enable universal login support for all of the servers using Kerberos. This would allow a login database to be created on one server that would then be used for authentication on all of the servers. Using the Kerberos system, the user only needs to log into the system once. After that, a security key is exchanged which can tie into SSH as well as the SUSE graphical login and a few other Linux authentication systems. After logging in on one system, if the user attempts to access another system, the security key is used so that the user does not have to authenticate again. This system is managed by a server containing a central database of login information, which issues the security keys to clients when a user logs in. This work has not yet been completed, but hopefully I will be able to complete it later tonight; I have an installation guide that I have emailed myself, I just have not yet had the chance to follow it. The Kerberos system should lend itself to handling this task nicely, since the SUSE operating system appears to natively support it as a login method. SUSE's YaST administration system supports authentication via Kerberos, and assuming this does not present any issues, the major task will be setting up a Kerberos server. There are several installation guides for this task available on the internet which should be fairly easy to follow.

As far as the other task that I have been assigned, I have not yet made very much progress. After doing some research on Google, I have come across a Sphinx command called lm_combine. This utility appears to have been created to merge Sphinx training models. Unfortunately, the majority of the information that I have found so far consists of several links to the source code and two links that briefly discuss the utility. As far as I can tell, the lm_combine utility is contained in the cmuclmtk package.
However, more research still needs to be done to see if it will actually be possible for us to use it. One of the two pages that discusses the utility stated that the data needs to be formatted in a specific way in order to be compatible with lm_combine. Further research should provide the information required to format the data for lm_combine.
 * Task:
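For reference, the central-database model described above is configured on each client machine through /etc/krb5.conf. A minimal sketch, where the realm name and KDC hostname are hypothetical placeholders rather than our actual setup:

```
[libdefaults]
    default_realm = SPEECH.EXAMPLE.EDU

[realms]
    SPEECH.EXAMPLE.EDU = {
        kdc = caesar.example.edu
        admin_server = caesar.example.edu
    }

[domain_realm]
    .example.edu = SPEECH.EXAMPLE.EDU
```

Every machine that should honor the single sign-on points at the same KDC; the KDC host is the one that would need the locked-down treatment.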

Tonight, I plan to follow the installation guide that I have found for Kerberos. Tomorrow, I plan to set up the servers to use Kerberos to authenticate their SSH sessions. After that, I will look into further authentication systems that Kerberos is compatible with, and discuss setting them up with the Linux group in order to decide if any of them are necessary for our application. Research on the Kerberos system is revealing possible security issues. Due to the fact that Kerberos uses a central repository to store its login information (to be expected, just not something that I had thought of before), one would normally want to limit access to the server containing the Kerberos login repository. The installation guide that I have found even suggests disabling all other services, including things like DHCP and the X11 (Linux graphical environment) servers, in order to reduce the number of security holes in the system. In our case, the Kerberos server will probably be installed on caesar, because caesar is the central repository for data on the server cluster. However, this may not be appropriate due to the previously mentioned security concerns that come into play when the central login system for the servers is placed directly on the internet. I will talk with Professor Jonas about these security concerns and see how he responds. Should time allow, I will also do further research on solutions to the merging issue that is presenting itself with Sphinx, as well as do some more research on the lm_combine command that I have found.
 * Plan:

At this time, there are no concerns with my assigned tasks. Available time has been presenting itself as an issue, but if lm_combine proves to be useful, then I should be able to get the parallelization system up and running within the next couple of weeks. As far as Kerberos is concerned, it should just be a matter of following the installation guide to get the system set up.
 * Concerns:

Week Ending March 22nd, 2011
My task this week was to set up the Kerberos system for universal login support across the 10 servers. This is similar to Active Directory, where each system reports to a central repository for login information. I have also been assigned the task of doing further research on implementing a parallelization system for training on the server queue. This includes research using Google, as well as the forums that are available for help. These tasks are required to be completed by March 29, and a working solution is planned for April 5.

This week, I attempted to implement the Kerberos login system using a few guides that I found online. Before I discuss the results of my attempts, I would like to talk about the two variations of the SUSE Linux operating system. The first variation is OpenSuse. This version is maintained by the open source community and, unlike SUSE Linux Enterprise Desktop (SLED), is not commercial. Because OpenSuse is not commercial, it lacks some of the features and support that Novell has implemented in their commercial offering.

I tried two variations of installation guides for Kerberos. The first variation was installation guides specifically designed for Suse. These guides specified Suse, but after attempting to follow them, I am assuming that they were talking about SLED rather than OpenSuse. The guides specified packages that were not available in the OpenSuse repositories and did not mention where to find them. Doing some research on the packages, I found the source code for the variation of Kerberos that the guides talked about (Heimdal). Attempts to compile this source code, however, resulted in errors. I also attempted to find packages online for Heimdal. All of the results that I found, however, pointed to the MIT variant of Kerberos (krb5). After I had issues following the guides for Heimdal, I tried to install the MIT variation of Kerberos.
I found some of the packages for krb5 in the repositories for Caesar and installed them. However, I was unable to figure out what to install to get some of the utilities that the krb5 installation guide mentioned; the guide did not say which packages to install. Finally, as a last attempt, I tried to find an installation guide for Ubuntu. Ubuntu has a large community, and it is usually pretty easy to find a well-written installation guide for whatever one may need. The first two results on Google brought up two very good installation guides. They specified which packages to install, how to install them (they were all in the Ubuntu package repositories), and how to perform any necessary configuration. I tried to substitute what I could find in the OpenSuse packages into this guide in order to follow it on Caesar, but I again had trouble finding all of the utilities that were required for the configuration. After reviewing the installation guides for Suse, my suggestion is to install Ubuntu on a spare computer and then just follow the Ubuntu installation guide available here: https://help.ubuntu.com/community/Kerberos. This is the guide mentioned above that I tried to follow on Suse. It is well written and detailed on configuration. I expect that it would take 15 or 20 minutes to install Ubuntu and that the installation guide would take another 15 or 20 minutes to follow, meaning that the entire Kerberos installation would take an hour or so.
 * Task:
 * Plan:
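The Ubuntu route suggested above boils down to a handful of commands. A dry-run sketch (commands are echoed rather than executed, since they need root; the package and tool names are from the Ubuntu guide, and the principal name is a placeholder):

```shell
#!/bin/sh
# Dry-run of the Ubuntu Kerberos server install suggested above.
# Swap the echo for "$@" to actually execute the steps (as root).
run() { echo "$@"; }

run apt-get install krb5-kdc krb5-admin-server  # KDC plus admin server
run krb5_newrealm                               # create the realm database
run kadmin.local -q "addprinc someuser"         # add a user principal
```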

Currently, there is a login issue with Caesar that was created by my attempts to install Kerberos. I believe that this may cause issues in the future. Also, I have been having trouble finding an installation guide for Kerberos that will work for our purposes.
 * Concerns:

Week Ending March 29th, 2011
This week I had a couple of tasks. My primary task for this week was to get Kerberos up and running on the server queue. After that task was complete, my job was to do some more research on parallelization using the asterix queue. The Kerberos task was required to be completed by the end of the week or as soon as possible, while the parallelization task, after talking to Professor Jonas, is due by April 22. On April 22, according to my understanding, I need to be able to explain some of the details behind the encoding mechanism used in Sphinx, including a detailed description of the method used. Also, by April 22, I should be able to state whether it is feasible to use the tools built into Sphinx (the binaries located in the SphinxTrain directory that are used in the training process) to parallelize encoding, or whether Professor Jonas and I will have to get together over the summer to program our own method of parallelization for the server queue using the encoding method that Sphinx itself uses.
 * Task:

After class last Tuesday, I spent an hour and a half or so repairing the damage that I had caused on Caesar while trying to set up Kerberos. The damage was caused by a component of the operating system being removed while I was reverting the changes from one of the installation methods before trying a different one. After Professor Jonas and I examined Caesar, we came to the conclusion that Caesar would need to be reinstalled. We reinstalled Caesar, and luckily the reinstallation did not overwrite any of the old files or settings, so no changes had to be made to Caesar after the reinstall.

After I had repaired Caesar, Professor Jonas and I discussed the issues that I had experienced while attempting to install Kerberos on Caesar. After careful consideration, we decided that user management would be implemented using a Perl script that reads in a text file line by line and creates user accounts from the data in that file. The script first performs this process on Caesar, after which it SSHes into each of the other servers and performs the same actions there. I am happy to say that I have managed to implement this script.

Over this next week, I plan on doing some more research on Sphinx. I will start by making a list of each of the binaries in the SphinxTrain binary directory and figuring out what each program does, using Google and any scripts contained in that directory for aid if need be. After I have done this, I will discuss my results with Professor Jonas. If neither of us feels that this has yielded an appropriate method of parallelization, then I will determine which process Sphinx uses to encode its data.
 * Results:
 * Plan:
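The account-management approach described above can be sketched roughly as follows. The real implementation is a Perl script; this is a shell sketch with an assumed one-username-per-line file format and assumed host names, and it echoes the commands instead of running them:

```shell
#!/bin/sh
# Sketch of the user-management flow: create accounts locally from a text
# file, then repeat the same work on each of the other servers over SSH.
run() { echo "$@"; }                 # echo only; swap for "$@" to execute

printf 'alice\nbob\n' > users.txt    # assumed format: one username per line
HOSTS="asterix miraculix verleihnix" # the other servers (names assumed)

create_users() {
    while read -r name; do
        [ -n "$name" ] && run useradd -m "$name"
    done < users.txt
}

create_users                         # first on the local machine
for h in $HOSTS; do                  # then repeat over SSH
    while read -r name; do
        [ -n "$name" ] && run ssh "root@$h" useradd -m "$name"
    done < users.txt
done
```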

None this week.
 * Concerns:

Week Ending April 5th, 2011

 * Task:

Wednesday:

Thursday: N/A

Friday: Read /root/speechtools/SphinxTrain-1.0/readme.txt. It didn't really seem to say anything but installation instructions, though it also said where to find a sample dictionary and such. Also read /root/speechtools/SphinxTrain-1.0/doc/tinydoc.txt: a brief illustration of the install process, but it looks like if I actually want to get any information I might have to go here instead: http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html.
 * Also uploaded dictionary to miraculix:~/decodeFiles
 * Put sw02001.sph on miraculix:~/decodeFiles. Will see if I can figure out what to do with it tomorrow.

Saturday:
 * Found http://www.isle.illinois.edu/sst/courses/minicourses/2009/lecture1.pdf. It described the process of converting sph to wav with sox; sox is not installed on any of the servers, which will need to change. Used sftp to transfer one of the files to my laptop and played around with sox on it. I developed this command to convert a section of the sph file to wav:
 * sox sw02001.sph sw02001.wav trim 0 00:50

where sw02001.sph is the input sph filename, sw02001.wav is the output filename, 0 is the start time, and 00:50 is the duration (sox's trim takes a start time and a duration; starting from 0, the duration is also the end time). Looked at the transcript file; the transcript does not seem to begin until a minute or so into the sph file. Will look at the transcript some more tomorrow and figure out where to go from there. The servers and foss seem to be really laggy today.
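The command above generalizes to cutting any utterance out of the conversation once the transcript gives a start time. A small sketch that just builds the sox command (echoed so it can be checked without sox installed; the example times and filenames are placeholders):

```shell
#!/bin/sh
# Build the sox command that extracts one transcript chunk from an sph file.
# trim's first argument is the start time, the second is the duration.
sph_segment_cmd() {   # usage: sph_segment_cmd in.sph out.wav start duration
    echo "sox $1 $2 trim $3 $4"
}

sph_segment_cmd sw02001.sph sw02001-0001.wav 62.5 7.25
```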

Sunday: Working on building this script. Plan to reference the script that I wrote for caesar, but both the cluster and foss are being laggy again. The first line is the first line of the transcript, and the rest is the script. Hopefully those of you whose homework it is can take advantage of it.

Monday: Some more work on the script to implement reading/parsing of the transcript. After this, it just needs filename fixing and debugging to see if everything works. Not tested yet, but the googling looks hopeful.

Did this and that. Figured out how to convert sph to wav given a start time and a duration, in order to grab just the chunk that a line of the transcript represents. Got a Sphinx dictionary to use for our decoding. Started writing a script to set up the experiment directories, parse the transcripts and grab the relevant data, and split the sph file into the correct sections of conversation. Theoretically, once all of this is done, all of the transcript directories will be set up with their correct transcripts and wav files. Then they will just have to be plugged into a Sphinx decode and that directory can be used.

Will put finishing touches on the script next week and use it to run through a train. After that I will have the transcript, dictionary, and wav files. I will check and see if there's anything else needed and then do a train. Will talk to Professor Jonas afterwards and see what he says.

None this week.
 * Results:
 * Plan:
 * Concerns:

Week Ending April 12th, 2011

 * Task:

Wednesday:

Thursday:

Friday:

Saturday: Looking at: http://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/4406613. I know that we aren't supposed to be doing any research any more,

COMMENT: Mike-jonas 19:55, 10 April 2011 (UTC) we still want to research questions, but we want to make sure they stem from digging into the stuff we have installed so we can formulate better questions (see below)

but looking for information on training yielded this, and it looked too good to pass up. The Sphinxtrain config file specifies two modes of training. In the first mode, Sphinxtrain uses only the local machine (Queue::POSIX in sphinx_train.cfg); in the second mode, Sphinxtrain uses a PBS/Torque queue, a server that distributes work among multiple computers. This would require installing a PBS/Torque server on one of the systems to distribute work between all of the systems. Further research would have to be done if Professor Jonas says that it is a viable option, but if the Sphinxtrain config file and that forum post are right, then it looks like parallel sphinx training will be easier than previously thought.

COMMENT: Mike-jonas 19:55, 10 April 2011 (UTC) Yeah, having looked at it, certainly the poser of the question wants what we want...the example wasn't clear cut as it shows multiple jobs queued up on the same machine. We need to look into this further, absolutely continue:)

Interesting how results come up when you aren't looking for them.

COMMENT: Mike-jonas 19:55, 10 April 2011 (UTC) This is what I'm getting at, by playing around with how training works you end up stumbling upon this...good work...it's all about decomposing problems and at one point you end up finding things that lead you to other things and now you end up doing more online searches to help you get a better answer!

Unfortunately this server isn't in the Suse repositories either, but like Kerberos it is in the Ubuntu repositories. :( How I wish that we were using Ubuntu instead of Suse. Oh well, the OS is set up and that's what we're staying with. Any input, Professor Jonas?

COMMENT: Mike-jonas 19:55, 10 April 2011 (UTC) Well, we went that route because we couldn't get Ubuntu installed on the PowerEdge machines. I personally had no attachment to SUSE and wished we had something that was more compatible. If we went with Ubuntu then the servers' RAID drives wouldn't work; with SUSE, Kerberos & Torque aren't easy to install...seems like six of one, half a dozen of the other:-)

Sunday: Doing some more research on Torque. Managed to find the email address of one of the people who posted on the forum from yesterday and dropped them an email. They are in Russia, so we will see when they respond. It looks like I should be able to just download Torque, compile it, and then start the service. Hopefully there aren't too many dependencies. The sphinxtrain config file wants a few changes if we use this: the server queue name to use, the training mode changed to PBS (the torque queue), and the number of parts to split into. Should we install the server on miraculix and see what happens?
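The download-compile-start plan above would look roughly like this as a dry-run sketch (commands echoed, not executed; the version number is a placeholder, and torque.setup is the helper script shipped in the torque tarball):

```shell
#!/bin/sh
# Dry-run of building torque from source and creating a default queue.
run() { echo "$@"; }      # echo only; swap for "$@" to actually build

run tar xzf torque-2.5.5.tar.gz   # version is a placeholder
run ./configure
run make
run make install
run ./torque.setup root           # creates the server DB and a default queue
```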

Monday: Torque PBS compiled and installed. Admin and queue set up on miraculix. Rebooted miraculix. Will have to look at torque some more tomorrow and see if I can figure out how to use/manage it. After that I will talk to Matt and see if we can pool our knowledge and run a train using torque. Ran through this setup guide, http://ubuntuforums.org/showthread.php?t=289767, and everything works except for the pbs_server part, which gives this error: PBS_Server: LOG_ERROR::process_host_name_part, no valid IP addresses found for 'miraculix' - check name service. I can list the contents of the queue and other stuff, though. Unsure whether a server will have to be installed on each system, but... This thread (http://www.supercluster.org/pipermail/torqueusers/2010-January.txt) seems to think it's a problem with the hosts file, but all looks good as far as I can tell.

As discussed above, a server has been set up that should allow us to run parallelized training using sphinx. Next week I will complete the setup of two systems using torque, and I will get Matt's help to run through a train on these systems in order to confirm that the training does work.

None this week.
 * Results:
 * Plan:
 * Concerns:

Week Ending April 19th, 2011
Summary: My task this week was to get a torque server running on miraculix and to get two of the systems connecting to it. I was also supposed to run a train on this queue.

My task this week was to get torque working on two of the systems.
 * Task:

Saturday: Installed torque on asterix as well and told it to use miraculix as the server. The two servers are not managing to talk to each other, for whatever reason. I am getting this error when I try to start the server on either system: PBS_Server: LOG_ERROR::process_host_name_part, no valid IP addresses found for 'asterix' - check name service

That is the version that I receive from asterix. From miraculix I receive the same, but with the name miraculix instead of asterix. Checked /etc/hosts. Tried adding 127.0.1.1 asterix to hosts and restarting network system to see if that would do anything. No result. Anyone else have any suggestions?

Sunday:

It appears that torque does not like multiple entries for the same host in the hosts file; there were three or four entries for the local system. On both miraculix and asterix I have replaced all of these with the following. This got the local system connecting on both, but the second system is still not connecting. I have disabled the firewall to see if that would help, but have not had any luck yet.

miraculix
    state = down
    np = 1
    ntype = cluster
    mom_service_port = 15002
    mom_manager_port = 15003
    gpus = 0

asterix
    state = free
    np = 1
    ntype = cluster
    status = rectime=1302993467,varattr=,jobs=,state=free,netload=732167,gres=,loadave=0.04,ncpus=4,physmem=4087048kb,availmem=301829824kb,totmem=297133980kb,idletime=1235,nusers=1,nsessions=6,sessions=1623 1632 1638 1803 1814 1935,uname=Linux asterix 2.6.34.7-0.7-default #1 SMP 2010-12-13 11:13:53 +0100 i686,opsys=linux
    mom_service_port = 15002
    mom_manager_port = 15003
    gpus = 0
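For the record, the hosts cleanup described above amounts to leaving exactly one entry per hostname. Purely as an illustration (these addresses are placeholders, not the cluster's real ones), the kind of layout torque is happy with looks like:

```
127.0.0.1     localhost
192.168.1.10  miraculix
192.168.1.11  asterix
```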

Monday: N/A

Tuesday: N/A. I'll put in some extra time this weekend to make up for today.

I have set up torque on two of the systems. Each is talking to itself, but for whatever reason they will not communicate with each other. This week I plan to get torque working properly and work with Matt to run a parallel train. Also, this week I plan on contacting the person from the forum mentioned last week or the week before to see if they know how the sphinx training system actually functions (how it divides work up among the systems: sending a few wavs off to each system in the queue, or processing them all in parallel). I will also do some research (possibly involving source code examination, since that appears to be the most common result for most of my queries) to see if I can solve this problem myself.

None this week.
 * Results:
 * Plan:
 * Concerns:

Week Ending April 26th, 2011
Summary: My task this week was to get the torque queue up and running, as well as to see if I could figure out how Sphinx parallelizes its training across a cluster.


 * Task:

Wednesday:

Thursday:

Friday:

Saturday: N/A

Sunday: For whatever reason, the sourceforge servers are having an issue when I try to send the guy from the torque forum an email (I click on the link and sourceforge gives me the error "Could not retrieve page id: Page ID '1' was referenced, but does not exist!"). Having trouble submitting a support ticket to sourceforge as well. Also did some research to see if I could find the person who was asking about torque, but could not find anything. I think that I found the same guy on LinkedIn, but was unable to send him an email without upgrading my LinkedIn membership.

Did some research on torque. (Limited information on (torque && sphinx) available.) Torque is a batch job system. Reading through this (http://www.democritos.it/activities/IT-MC/documentation/newinterface/pages/runningcodes.html) and a few other pages originally made me think that it divides the work up among all of the systems by splitting up the input wav files and submitting more input to each system as it finishes its quota. But looking at this page, http://www.bc.edu/offices/researchservices/cluster/torqueug.html, it looks like torque supports specifying a number of cpus, amount of memory, and other resources that are required for each job. After the job is submitted, torque takes the free resources that you specified and runs the process that way. In other words, torque supports distributing one job across the computing power of all of the systems in the queue, like a normal cluster would. While this doesn't settle whether Sphinx divides up the input files or runs everything as one job, it does show that either approach would be possible. Further research is required to determine which approach Sphinx actually uses.
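The per-job resource requests described above are expressed as #PBS directives in a job script handed to qsub. A sketch (the resource numbers are illustrative and the training command is a hypothetical placeholder; qsub is echoed rather than run, since no torque server is assumed here):

```shell
#!/bin/sh
# Write a torque job script that requests specific resources, then show the
# submission command. Only the file write is actually executed.
cat > job.pbs <<'EOF'
#PBS -N sphinx_train_part
#PBS -l nodes=2:ppn=1
#PBS -l mem=1gb
cd $PBS_O_WORKDIR
./run_one_training_part.sh
EOF
echo "qsub job.pbs"
```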

Interesting document that I have found: http://www.speech.cs.cmu.edu/sphinx/tutorial.html. Contains a lot of info: information on training, decoding, and a little bit on torque.

My guess, based on how/where it's specified in the sphinxtrain config ($CFG_QUEUE_TYPE = "Queue::PBS";), is that sphinx runs the train the way it normally would with multiple cpus (one massive process that contains all of the provided audio), but uses torque for this instead. (It runs the train as one big process that takes advantage of all allowed cpus in the queue.)
 * 1) Queue::POSIX for multiple CPUs on a local machine
 * 2) Queue::PBS to use a PBS/TORQUE queue
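In sphinx_train.cfg that choice is a one-line change. From memory (the second variable name should be checked against the actual file), the relevant lines look something like:

```
$CFG_QUEUE_TYPE = "Queue::PBS";   # was "Queue::POSIX" on a single machine
$CFG_NPART = 10;                  # number of parts to split the training into
```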

Monday: Got asterix and miraculix seeing each other on torque. Basically, miraculix runs pbs_mom, pbs_sched, and pbs_server. Each node runs pbs_mom and sets the server name to miraculix (or whatever the host server's name is); the pbs_mom process then connects to the pbs_server on the host machine. The guide that I found (http://wiki.hpc.ufl.edu/index.php/TorqueHowto) suggests that pbs_mom not be run on the server host, because the server host already has a lot of management work to do and putting further load on it could lead to unexpected consequences. Leaving pbs_mom running on miraculix for now. Also, each system's name needs to go in the server's nodes file on miraculix. Posted on Matt's page to see if he has been able to complete a train yet. If so, then we should be able to run a train on the queue and confirm or deny the theory that I presented yesterday.
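The daemon layout described above can be sketched as follows (paths use a scratch directory here; the real files live under torque's spool directory, commonly /var/spool/torque, and the daemon launches are echoed rather than executed):

```shell
#!/bin/sh
# Sketch of the torque topology: miraculix runs the server and scheduler and
# lists every node; each node runs pbs_mom pointed at miraculix.
TORQUE_HOME=./torque-demo             # stand-in for the real spool directory
mkdir -p "$TORQUE_HOME/server_priv"

echo "miraculix" > "$TORQUE_HOME/server_name"   # on every node
cat > "$TORQUE_HOME/server_priv/nodes" <<'EOF'
miraculix np=1
asterix np=1
EOF

run() { echo "$@"; }   # echo only
run pbs_server         # miraculix only
run pbs_sched          # miraculix only
run pbs_mom            # every node (the guide suggests skipping the server host)
```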

Tuesday:

Did some more work on transcript parsing perl script. Current contents are:

It needs work on the regex for the filename, parsing of the transcript file lines, and conversation parsing. Professor Jonas made a working one on caesar.

This week I figured out how to get the machines communicating on the torque queue. I also did some research on how torque itself works. Next week I plan on running a train on the queue. After that I will confirm or deny the theory that the train runs as one massive job across the cluster (one train rather than multiple divided ones). Then I should be in a position to write the parallelization section of the report.
 * Results:
 * Plan:

None this week.
 * Concerns:

Mattw - Hey brian I have had no luck with training as of yet. We can work on training across servers this week if you like?

Week Ending May 3rd, 2011

 * Task: My task is to write any scripts required to run a train as well as to run a train on the first 89 lines of the Switchboard transcript.

Friday: Following the instructions in this guide as far as running a train: http://www.speech.cs.cmu.edu/sphinxman/fr4.html. The first required item that it specifies is a feature file. It states that this can be found in the Sphinx-III training package, which I am assuming is sphinxtrain. Combining the previous guide with this: http://www.speech.cs.cmu.edu/sphinx/tutorial.html#prelimtraining. Hoping that between the two of those, I can get a train up and running with the data that Professor Jonas and I generated Tuesday.

COMMENT: C.Reekie, Admin 19:38, 2 May 2011 (UTC) What data? What did you do to get the data?

Saturday: Figured out how to generate the fileids file. Wrote this script for it:

#!/bin/sh
for i in /root/speechtools/SphinxTrain-1.0/train1/wav/*.wav ; do
    echo "${i%%.wav}" | sed 's#^.*/##'
done

Just pipe the output to a file with the extension .fileids and you should be good.

Tried running make_feats.pl using the fileids file, but got this error:

-cfg not specified, using the default ./etc/sphinx_train.cfg
-param not specified, using the default ./etc/feat.params
bin/wave2feat \
    -verbose yes \
    -alpha 0.97 \
    -dither yes \
    -doublebw no \
    -nfilt 40 \
    -ncep 13 \
    -lowerf 133.33334 \
    -upperf 6855.4976 \
    -nfft 512 \
    -wlen 0.0256 \
    -c etc/train1.fileIDs \
    -nist yes \
    -di ___BASE_DIR___/wav \
    -ei sph \
    -do ___BASE_DIR___/feat \
    -eo mfc

[Switch]        [Default]       [Value]
-help           no              no
-example        no              no
-i
-o
-c                              etc/train1.fileIDs
-nskip
-runlen
-di                             ___BASE_DIR___/wav
-ei                             sph
-do                             ___BASE_DIR___/feat
-eo                             mfc
-nist           no              yes
-raw            no              no
-mswav          no              no
-input_endian   little          little
-nchans         1               1
-whichchan      1               1
-logspec        no              no
-feat           sphinx          sphinx
-mach_endian    little          little
-alpha          0.97            9.700000e-01
-srate          16000.0         1.600000e+04
-frate          100             100
-wlen           0.025625        2.560000e-02
-nfft           512             512
-nfilt          40              40
-lowerf         133.33334       1.333333e+02
-upperf         6855.4976       6.855498e+03
-ncep           13              13
-doublebw       no              no
-warp_type      inverse_linear  inverse_linear
-warp_params
-blocksize      200000          200000
-dither         yes             yes
-seed           -1              -1
-verbose        no              yes
INFO: fe_interface.c(100): You are using the internal mechanism to generate the seed.
INFO: fe_sigproc.c(752): Current FE Parameters:
INFO: fe_sigproc.c(753): 	Sampling Rate:            16000.000000
INFO: fe_sigproc.c(754): 	Frame Size:               410
INFO: fe_sigproc.c(755): 	Frame Shift:              160
INFO: fe_sigproc.c(756): 	FFT Size:                 512
INFO: fe_sigproc.c(757): 	Lower Frequency:          133.333
INFO: fe_sigproc.c(758): 	Upper Frequency:          6855.5
INFO: fe_sigproc.c(759): 	Number of filters:        40
INFO: fe_sigproc.c(760): 	Number of Overflow Samps: 0
INFO: fe_sigproc.c(761): 	Start Utt Status:         0
INFO: fe_sigproc.c(763): Will add dither to audio
INFO: fe_sigproc.c(764): Dither seeded with -1
INFO: fe_sigproc.c(771): Will not use double bandwidth in mel filter
INFO: wave2feat.c(139): ___BASE_DIR___/wav/sw2001-0012.sph
ERROR: "wave2feat.c", line 655: Cannot read ___BASE_DIR___/wav/sw2001-0012.sph
FATAL_ERROR: "wave2feat.c", line 90: error converting files...exiting


Not sure why. It doesn't matter whether I have sph files in the directory or not.

Sunday:

Found out some more stuff. Not sure how much it helps us at this point or what to make of it, but the scripts, or at least the make_feats.pl script, actually specify sph files. Running the make_feats.pl script generates a folder called ___BASE_DIR___ and then complains about not being able to find the sph files. I created a wav folder inside of the ___BASE_DIR___ folder and copied all of the sph files that were used for our conversion into it. I then ran the script again and it worked. I found where the make_feats.pl script specifies sph and changed a copy of the script to wav to see what would happen. Running it still produced the same result. Will have to look at this more later.

Monday: Read through Chris's comment/page and wrote a paragraph for him on what I had found/accomplished this week. Tried to repeat his process, but it does not seem to create the folder or its contents. Don't really have any time to look into it further today. Tried to run both sphinx examples on Verleihnix and neither one seemed to generate the required directories and files for the test train. Then left that and continued working on my own train. I have managed to write a script to generate the fileids file as well as generate the features. I have found that the feature generation is hardcoded to look for sph files, but I believe that sphinx requires the wav files further on in the training.
 * Results:

Next week, I will continue working on accomplishing a train using the first 89 lines of the transcript.
 * Plan:

None this week.
 * Concerns:

Week Ending May 10th, 2011

 * Task:

Wednesday:

Thursday:

Friday:

Saturday:

Sunday:

Monday:

Tuesday:

Looked through Chris's entry real quick to see if I could find anything that he'd done but I hadn't. I realized that the an4 script doesn't actually generate anything and that everything is just in the .tar.gz. Ran it using the sph demo files (on Caesar, I know, bad me, but it was a proof of theory/instructions). Worked fine. Took maybe 5 minutes to get it training, 4:30 of which was waiting for the file to download. The config files it uses might be helpful in finishing the real train. Don't know, but basically you just download the tar.gz, extract it, run make_feats, and execute runall.
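The an4 demo recipe described above, as a dry-run sketch (commands echoed rather than executed; the archive name and script paths follow the an4 tutorial layout and should be checked against the actual download):

```shell
#!/bin/sh
# Dry-run of the an4 demo steps: extract the archive, make features, run all.
run() { echo "$@"; }    # echo only; swap for "$@" to actually run

run tar xzf an4_sphere.tar.gz
run perl scripts_pl/make_feats.pl -ctl etc/an4_train.fileids
run perl scripts_pl/RunAll.pl
```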

MODULE: 00 verify training files
O.S. is case sensitive ("A" != "a"). Phones will be treated as case sensitive.
Phase 1: DICT - Checking to see if the dict and filler dict agrees with the phonelist file.
Found 498 words using 39 phones
WARNING: This phone (SIL) occurs in the phonelist (/root/speechtools/SphinxTrain-1.0/train1/etc/train1.phone), but not in the dictionary (/root/speechtools/SphinxTrain-1.0/train1/etc/train1.dic)
Phase 2: DICT - Checking to make sure there are not duplicate entries in the dictionary
Phase 3: CTL - Check general format; utterance length (must be positive); files exist
Phase 4: CTL - Checking number of lines in the transcript should match lines in control file
Phase 5: CTL - Determine amount of training data, see if n_tied_states seems reasonable.
Total Hours Training: 1.56107777777778
This is a small amount of data, no comment at this time
Phase 6: TRANSCRIPT - Checking that all the words in the transcript are in the dictionary
Words in dictionary: 498
Words in filler dictionary: 0
WARNING: This word: was in the transcript file, but is not in the dictionary ( YES NOW YOU KNOW IF IF EVERYBODY LIKE IN AUGUST WHEN EVERYBODY'S ON VACATION OR SOMETHING WE CAN DRESS A LITTLE MORE CASUAL OR ). Do cases match?
WARNING: This word: was in the transcript file, but is not in the dictionary ( YES NOW YOU KNOW IF IF EVERYBODY LIKE IN AUGUST WHEN EVERYBODY'S ON VACATION OR SOMETHING WE CAN DRESS A LITTLE MORE CASUAL OR ). Do cases match?
WARNING: This word: was in the transcript file, but is not in the dictionary ( YEAH WELL MY UH MY UH PROBABLY ONE OF THE BIGGEST DECISIONS I THINK THAT WAS VERY STRENGTHENED FOR OUR FAMILY NOISE WAS RATHER THAN HAVE ONE CHILD MAKE THAT DECISION ). Do cases match?
WARNING: This word: was in the transcript file, but is not in the dictionary ( YEAH WELL MY UH MY UH PROBABLY ONE OF THE BIGGEST DECISIONS I THINK THAT WAS VERY STRENGTHENED FOR OUR FAMILY NOISE WAS RATHER THAN HAVE ONE CHILD MAKE THAT DECISION ). Do cases match?
WARNING: This word: was in the transcript file, but is not in the dictionary ( FINDING A PLACE AND EVERYBODY HAD DUTIES TO PERFORM YOU KNOW WHETHER IT WAS JUST YOU KNOW GIVING MONEY OR WHETHER IT WAS ACTUALLY TAKING PART IN IN A LOT OF THE DECISION MAKING YOU KNOW LIKE FINDING A A PROPER NURSING HOME ). Do cases match?
WARNING: This word: was in the transcript file, but is not in the dictionary ( FINDING A PLACE AND EVERYBODY HAD DUTIES TO PERFORM YOU KNOW WHETHER IT WAS JUST YOU KNOW GIVING MONEY OR WHETHER IT WAS ACTUALLY TAKING PART IN IN A LOT OF THE DECISION MAKING YOU KNOW LIKE FINDING A A PROPER NURSING HOME ). Do cases match?
WARNING: This word: was in the transcript file, but is not in the dictionary ( AND THEY I KNOW THEY AND WELL THEY HAD WELL THEY HAD THEY HAD SEEN IT COMING SO SO I MEAN NOISE IT I MEAN I I I I HARDLY I I TRULY WISH THAT IF SOMETHING LIKE THAT WERE TO HAPPEN THAT MY CHILDREN WOULD DO SOMETHING LIKE THAT FOR ME ). Do cases match?
WARNING: This word: was in the transcript file, but is not in the dictionary ( AND THEY I KNOW THEY AND WELL THEY HAD WELL THEY HAD THEY HAD SEEN IT COMING SO SO I MEAN ....

Looks like there are some dictionary errors here.

Oh. Oh. Lovely vim command to fix this: ":%s/ //g" or, to put it more generally, ":%s/term1/term1replacement/g" where g means global (replace every match on a line) and % applies the substitution to every line in the file. % can also be replaced with a range.
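The same global substitution can be scripted outside of vim. A minimal Python sketch (the sample text and the "_#" pattern are just illustrations, not the actual transcript files):

```python
import re

def strip_everywhere(text: str, pattern: str) -> str:
    """Remove every match of `pattern` on every line, like vim's
    :%s/pattern//g (% = whole file, g = all matches on each line)."""
    return "\n".join(re.sub(pattern, "", line) for line in text.splitlines())

# Toy example: strip the stray "_#" suffixes seen in the verify warnings.
sample = "MAKE THEM_# A GOOD CHILD CARER\nHOW ABOUT_# THE ROLLING STONES"
print(strip_everywhere(sample, "_#"))
```

For a one-off cleanup the vim command is faster, but a script like this is repeatable if the transcripts have to be regenerated.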

Ok. Trimmed it. That leaves me with this:

MODULE: 00 verify training files
O.S. is case sensitive ("A" != "a"). Phones will be treated as case sensitive.
Phase 1: DICT - Checking to see if the dict and filler dict agrees with the phonelist file.
Found 498 words using 39 phones
WARNING: This phone (SIL) occurs in the phonelist (/root/speechtools/SphinxTrain-1.0/train1/etc/train1.phone), but not in the dictionary (/root/speechtools/SphinxTrain-1.0/train1/etc/train1.dic)
Phase 2: DICT - Checking to make sure there are not duplicate entries in the dictionary
Phase 3: CTL - Check general format; utterance length (must be positive); files exist
Phase 4: CTL - Checking number of lines in the transcript should match lines in control file
Phase 5: CTL - Determine amount of training data, see if n_tied_states seems reasonable.
Total Hours Training: 1.56107777777778
This is a small amount of data, no comment at this time
Phase 6: TRANSCRIPT - Checking that all the words in the transcript are in the dictionary
Words in dictionary: 498
Words in filler dictionary: 0
WARNING: This word: THEM_# was in the transcript file, but is not in the dictionary ( LAUGHTER YEAH YEAH JUST BECAUSE THEY'RE GRANDPARENTS JUST YEAH JUST BECAUSE THEY'RE GRANDPARENTS THAT DOESN'T AUTOMATICALLY MAKE THEM_# A GOOD CHILD CARER ). Do cases match?
WARNING: This word: FEDERALDES was in the transcript file, but is not in the dictionary ( AS YOU KNOW THE THEY'RE ALLOWED TO COME ON SITE THE FEDERALDES ANYTIME THEY WANT DRIVE THROUGH AND SEE AND INSPECT SO IT'S A FULL TIME UH EVERYBODY HAS YOUR HOME PHONE NUMBER TYPE OF JOB UH ). Do cases match?
WARNING: This word: DUCTWORK was in the transcript file, but is not in the dictionary ( WHICH IS TOTALLY LEGAL BUT THE COST OF DOING THIS IS ASTRONOMICAL THEY ACTUALLY SHAVE UP DUCTWORK AND THINGS AND SO WE'RE UH VERY VERY UH COGNIZITIVE AND AWARE OF ALL THESE TYPE OF UH ). Do cases match?
WARNING: This word: COGNIZITIVE was in the transcript file, but is not in the dictionary ( WHICH IS TOTALLY LEGAL BUT THE COST OF DOING THIS IS ASTRONOMICAL THEY ACTUALLY SHAVE UP DUCTWORK AND THINGS AND SO WE'RE UH VERY VERY UH COGNIZITIVE AND AWARE OF ALL THESE TYPE OF UH ). Do cases match?
WARNING: This word: THEM_# was in the transcript file, but is not in the dictionary ( I PUT A STOP TO SOME OF THEM_# AS FAR AS THE DOOR TO DOOR EITHER RELIGIOUS GROUPS OR PEOPLE ). Do cases match?
WARNING: This word: CHOWPHERD was in the transcript file, but is not in the dictionary ( IT'S UH PART CHOW AND PART SHEPHERD AND IT AS I UNDERSTAND IT UH BOTH SIDES OF THE WERE THOROUGHBREDS SO SHE'S A GENUINE CHOWPHERD ). Do cases match?
WARNING: This word: ALBRIDGE was in the transcript file, but is not in the dictionary ( NOISE AT AMERICA'S SERVICE BY CARL ALBRIDGE IT TALKS ABOUT UM WHO THE CUSTOMER IS AND BEING CUSTOMER ORIENTED UH WHICH FALLS IN LINE WITH THE TI CULTURE HERE AT TEXAS INSTRUMENTS ). Do cases match?
WARNING: This word: SOUTHBEND was in the transcript file, but is not in the dictionary ( NO ORIGINALLY I'M FROM NEW MEXICO I WAS BORN IN NEW MEXICO AND WE LIVED IN UH SOUTHBEND FOR EIGHTY EIGHT YEARS AND UH THEN MOVED TO UH TENNESSEE ACTUALLY ). Do cases match?
WARNING: This word: ABOUT_# was in the transcript file, but is not in the dictionary ( OH OKAY SO SO THEY CAN GET LIKE THE DOORS AND LED ZEPPELIN YEAH THAT'S COOL AND HOW ABOUT_# THE ROLLING STONES ). Do cases match?
WARNING: This word: BECAUSE_# was in the transcript file, but is not in the dictionary ( OH I GUESS THE STUFF THAT WAS DONE MORE IN THE SEVENTIES LAUGHTER BECAUSE_# THAT'S ). Do cases match?
Phase 7: TRANSCRIPT - Checking that all the phones in the transcript are in the phonelist, and all phones in the phonelist appear at least once
WARNING: This phone (SIL) occurs in the phonelist (/root/speechtools/SphinxTrain-1.0/train1/etc/train1.phone), but not in any word in the transcription (/root/speechtools/SphinxTrain-1.0/train1/etc/train1_train.transcription)
Something failed: (/root/speechtools/SphinxTrain-1.0/train1/scripts_pl/00.verify/verify_all.pl)
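The Phase 6 check above can be reproduced by hand to get a full list of the missing words before re-running verify. A minimal sketch, assuming the usual CMU .dic layout of one word per line followed by its phones (the file contents here are toy examples, not the real train1 files):

```python
def missing_words(transcript_lines, dict_lines):
    """Return the transcript words absent from the dictionary,
    mimicking SphinxTrain's Phase 6 TRANSCRIPT check."""
    dictionary = {line.split()[0] for line in dict_lines if line.strip()}
    words = set()
    for line in transcript_lines:
        words.update(line.split())
    return sorted(words - dictionary)

# Toy example with made-up entries:
dic = ["DUCTWORK D AH K T W ER K", "THE DH AH"]
trans = ["THE DUCTWORK AND THE FEDERALDES"]
print(missing_words(trans, dic))  # → ['AND', 'FEDERALDES']
```

Running something like this over the real transcript would show every offending word at once instead of reading them out of the warnings one at a time.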

Some errors about words not being in the dictionary. Looks like it's reporting about 1.5 hours of training data.

 * Results: Did this and that. Accomplished this and that.
 * Plan: Next will do this.
 * Concerns: None this week.