Speech:Spring 2014 David Meehan Log


 * Home
 * Semesters
 * Spring 2014
 * Proposal
 * Report
 * Information - General Project Information
 * Experiments - List of speech experiments

Week Ending February 4th, 2014

 * Task:
 * My primary task for this iteration is to begin exploring Caesar. To begin training and decoding, the modeling group must first become oriented with the files, configurations, and inner workings of Sphinx. Last week I researched how speech recognition works in relation to Sphinx, so I have a decent understanding of what the different file extensions are and what they are used for. With this in mind, my first task is to locate and analyze the following files:
 * Phonetic Dictionary: .dic
 * Phoneset: .phone
 * Language Model: .lm.DMP
 * Filler Dictionary: .filler
 * File mapping: _train.fileids and _test.fileids
 * Training transcript: _train.trans and _test.trans
 * If the group feels comfortable with the system, I would like to attempt to initialize a train using the Tiny dataset.


 * Results:

2/1
 * I was able to successfully SSH into Caesar and change my password.
 * I reviewed the Experiment setup and Training guide. I also checked the revision history on both to make sure the information is current with the previous class' findings.
 * Analyzed the Exp directory to find where the various dictionaries and files are located. There are two primary places to find these files. The first is the directory /root/speechtools/SphinxTrain-1.0/, which acts as the base from which the train data is taken. In preparation for a train we will copy the files of one of the baseline trains in this directory to /mnt/main/Exp/. Inside the Exp directory, most of the core files can be found in /mnt/main/Exp//etc:
 * The .dic, .phone, .filler, .fileids and .trans
 * Analyzed the switchboard corpus data sets. The audio files are in the .sph format. This is relevant because some Sphinx settings vary slightly depending on whether we are using .sph or .wav. From Eric's logs it would also seem we are using 8 kHz sound files, which will also affect the settings we use (some audio settings such as lo and hi filtering could be affected).

2/2
 * Read logs

2/3
 * I spent a little more time analyzing the training and experiment documents.
 * It appears some changes were made to the configuration of Caesar without being modified in the wiki:
 * The biggest change being that there are no script files or Sphinx training directories in /root anymore
 * I found the scripts, which were located at /mnt/main/scripts/user.old/ and /mnt/main/scripts
 * The training data is located at /mnt/main/root/tools/SphinxTrain-1.0/train1
 * Began separate training process 0145 (Colby is working on 0144) using the Tiny corpus. The trainer produced an error at step 6, much like Colby's experiment. I am working on adding those items to the dictionary. I will also check whether we have a script available to parse the HTML output and grab all the words that failed; if not, I will write one. It appears many of the missing words have a quotation mark either before or after the word. I'll need to look into this more to determine how those words are represented in the dictionary (i.e., do we need to enumerate all words preceded and followed by a ").

2/4
 * Wrote process_missing_words.pl: Reads in an HTML error file produced by the training software and generates a text file with the missing words:

#!/usr/local/bin/perl
if($#ARGV != 0) {
    print "";
    exit -1;
}
$HTML = $ARGV[0];
$search = "WARNING: This word: ";
open my $MYFILE, $HTML or die "Could not open $HTML: $!";
while(my $line = <$MYFILE>) {
    if($line =~ /$search/) {
        $line =~ s/$search//g;
        $line =~ s/\s.*//g;
        print "$line\n";
    }
}

 * To run the script, type the following on the command line:

/mnt/main/scripts/train/scripts_pl/process_missing_words.pl /mnt/main/Exp/ / .html > missing_words.txt

 * I haven't tried to add any words to the dictionary yet. The quotation marks present in the error log lead me to believe something else went wrong. Tomorrow (or later today) I will retry the train and see if I get the same results.
 * I am going to try to strip the " characters from the transcript and train again. I'm not convinced these characters serve any purpose in the actual pronunciation of words. To remove the character I ran the following:

sed -i 's/"//g' 0145_train.trans

 * Reran the train (it still failed, but the list of missing words is much smaller and easier to add back in now).
 * Some words did not exist in the pronunciation database; for those words I combined several known words that collectively make up the sound.
 * If we plan to automate training, adding missing words will be the hardest part. I am looking at the pronunciation site, and thankfully it uses query strings to process words. With that in mind, if Perl can retrieve and parse a web page it could be entirely possible to automate even this step. I have done similar tasks in Java and PHP, retrieving and parsing HTML data, so I suspect a similar facility exists in Perl as well. The other hard part will be automating words that cannot be found there. This can be remedied using my technique: the script could prompt the user to enter several sub-words which collectively make up the missing word.
 * Train failed:

MODULE: 45 Prune Trees
Phase 1: Tree Pruning
FATAL: "main.c", line 167: Unable to open /mnt/main/Exp/0145//trees/0145.unpruned/AW2-0.dtree for reading; No such file or directory
MODULE: 50 Training Context dependent models
Phase 1: Cleaning up directories: accumulator...logs...qmanager...
Phase 2: Copy CI to CD initialize
Phase 3: Forward-Backward
Baum welch starting for 1 Gaussian(s), iteration: 1 (1 of 1)
0% FATAL_ERROR: "main.c", line 1054: initialization failed

 * It is likely that the failure was caused by the quotation marks.
 * I began running a train using the last 5 hour data model. I couldn't find a missing words document so I began building one using the CMU website tool provided in the wiki.
 * I wrote the modeling and introduction sections of the proposal.


 * Plan:
 * (Estimated Deadline: 2/2): Analyze the settings we are currently using for the train data. Make sure that the settings match the specifications provided by Sphinx given the file types and audio file details. Even if they differ from the recommended values, I will not change anything until we have completed a successful train. I am also curious about the density and senone settings currently set. As I understand it, 5 hours of audio requires 200 senones and a density of 8. A senone count that is too large will result in overly sensitive speech recognition which fails to account for diversity in speech patterns, while a senone count that is too small will not be discriminating enough to discern differences between words. Manipulating these settings could help improve our baseline.
 * (Estimated Deadline: 2/4): Run a Tiny train. Estimates suggest this will take about 30 minutes, not accounting for failures. We should allocate at least two hours to this task. If we are successful it would also be good to try and decode our data.
 * (Estimated Deadline: 2/3): Automate parsing the HTML error file.
 * (Estimated Deadline: TBD): Automate adding dictionary words by retrieving the correct pronunciations.
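The senone and density settings live in the experiment's SphinxTrain configuration file. As a rough illustration (the variable names below are from SphinxTrain's etc/sphinx_train.cfg and should be verified against our install; the values are the ones discussed above, not a recommendation):

```perl
# In etc/sphinx_train.cfg (SphinxTrain) -- illustrative values only
$CFG_N_TIED_STATES = 200;        # senones: 200 suggested here for ~5 hours of audio
$CFG_FINAL_NUM_DENSITIES = 8;    # Gaussian densities per senone
```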


 * Concerns:

Week Ending February 11, 2014

 * Task:
 * Resolve issues with Tiny and Mini data trains.
 * Improve dictionary2.pl for better performance.

2/6
 * Results:
 * Completed a 5 hour train using first_5hr.
 * Spoke with Pauline and Ray, told them about the problems we were encountering with Tiny trains, told them to use first_5hr.
 * Continued working on fixing the Tiny data corpus.
 * Colby and I worked with Pauline to show her how to run a train.
 * I created a language model and began decoding Exp 0150 (using the acoustic model from 0148).
 * To do this efficiently without wasting server resources, I created a new experiment (0150) and created symlinks pointing to the files located in experiment 0148 (the acoustic model).
 * This worked, except when I needed to run the decode. The decode asks for the experiment number of the acoustic model. When I gave it 0150 as the model, it attempted to access the files located at /mnt/main/Exp/0150/, which linked back to 0148. This worked, but because the files in the original directory contained the exp id 0148 in their names, not the 0150 I provided, the decoder failed to run. To fix this, I created a symlink in 0148 called LM (the directory created for the Language Model) which pointed to the LM directory in 0150. When I decoded, I told it the experiment number for the acoustic model was 0148, which was able to access the needed files as well as the Language Model via the symlink. The decode results were located in /mnt/main/Exp/0150/DECODE. I may clean out the LM symlink in 0148 to ensure that the directory is standalone, but before I do I want to make sure no files in 0148 were changed during the construction of the Language Model or decoding.
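The symlink arrangement above can be sketched like this (the 0148/0150/LM names are from the log; the AM link name and /tmp paths are stand-ins for illustration, not the real /mnt/main/Exp commands):

```shell
# Illustrative recreation of the cross-experiment symlinks
rm -rf /tmp/Exp
mkdir -p /tmp/Exp/0148 /tmp/Exp/0150/LM
# 0150 borrows the acoustic model files from 0148
ln -s /tmp/Exp/0148 /tmp/Exp/0150/AM
# 0148 gets an LM symlink pointing at 0150's language model directory
ln -s /tmp/Exp/0150/LM /tmp/Exp/0148/LM
readlink /tmp/Exp/0148/LM   # -> /tmp/Exp/0150/LM
```

This keeps 0150 lightweight (no copied model files) while letting the decoder resolve everything through the 0148 experiment id.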

2/7
 * The decode finished last night. I scored the decode.log file and got the WER. The results are archived in the Experiment log for experiment 0150. The WER was pretty bad (47%). I need to take a closer look and determine why it was so high. Perhaps it has something to do with the settings in the sphinx configurations file.

2/10
 * I noticed a number of people had created experiments over the weekends but nobody created an experiment wiki page (Experiments 0151-0155).
 * I created the log entry for those experiments, specifying the username and date information from checking each experiment with ls -all.
 * Created a new experiment, 0156. This is the same as 0148 except I changed the density to 8 and the senones to 200.

2/11
 * Spoke with Colby J. about the Tiny and Mini trains. If we could get them to work, we could do a brute-force approach with some of the known parameters to see what produced the best results. If we could do that and get Torque running this may actually be feasible. Colby found that the transcript appeared to be in a different order. It is unclear as to whether or not this makes a difference.
 * I created a new version of run_decode.pl (run_decode2.pl) which accounts for the number of senones. The previous script assumed there were 1000 senones, but if you change that number some of the files in model_params have different names. By adding an extra parameter to the script I was able to automate it.
 * I created the LM and decoded experiment 0158. For some reason the decode took over 8 hours and still had not completed. I ended up having to stop it. Despite this I was still able to score it, and everything seemed to be in order somehow, although the final scoring was quite abysmal.
 * I am not sure whether the decreased performance was a result of changing the senones or because I stopped the script. The fact that I could score suggests perhaps the former. Eric seemed to think that increasing the senones would improve the rate. The Sphinx guide says to use a smaller number for 5 hours, but based on the performance perhaps I will try increasing them next time. It does make some sense that using a higher senone count would improve the decode process as it is using the same data that was used to build the models. Perhaps the lower number is only viable for general use where different people will be using it.


 * Plan:
 * Continue running trains with different parameters. My next goal is to use a higher senone count to see if that helps.
 * Work with data group to get the tiny data fixed.
 * Concerns:
 * It would be much more efficient to test if we could use the two shorter data sets. Training and decoding on 5 hours takes a long time, and limits the amount of experimentation we can do.

Week Ending February 18, 2014

 * Task:
 * Improve the baseline for experiments by working with the senone and density variables.
 * Build a master dictionary containing all words present in the full corpus.
 * One problem we have to account for when training is missing words in the dictionary. Thus far we have been using the first_5hr data set because we have a txt file with all the missing words for that train already available. In the future, it will be important to merge all transcript words into one dictionary that works for any data set. I created an experiment 0179, where I will be working on finding and defining the pronunciation for all missing words. We have the 100 or so missing words from the first_5hr and I am currently filling in the 300 missing words for the 10hr (less since I merged the first_5hr words we already have).
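The merge step described above can be done with standard tools; a minimal sketch (the word-list file names here are hypothetical placeholders, and /tmp stands in for the real paths):

```shell
# Merge two word lists into one sorted, de-duplicated master list
printf 'cat\ndog\n' > /tmp/first_5hr_words.txt
printf 'dog\nfish\n' > /tmp/ten_hr_words.txt
sort -u /tmp/first_5hr_words.txt /tmp/ten_hr_words.txt > /tmp/master_words.txt
cat /tmp/master_words.txt   # -> cat, dog, fish (one per line)
```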


 * Results:

2/12 - 2/16
 * I created a new experiment 0160, which also uses the first_5hr data set. This time I am configuring the experiment with 2000 senones instead of 200. I read over more of Eric's experiment logs, and on one he posted a link to the Sphinx FAQ page. On that page there was a table expressing the relationship between data length and senone count. According to the page, 4-6 hours of data requires 2000 senones. I am a bit perplexed by this, as the Sphinx training guide shows a similar table but with different mappings. I suspect the difference is that the version Eric used was for Sphinx 3, while the version I looked at was for the most recent version of Sphinx, which could explain the discrepancy. I will try an experiment with 2000 senones and see if I can match Eric's run, which was 30% WER. My own WER may be somewhat higher, as first_5hr trains tend to have a higher WER than last_5hr.
 * Read logs, in particular, I looked into the results for the brute force trains Colby had started last Thursday. According to the results the ideal value is a senone value of and a density of 64, with a WER of 15%. I am unsure how well this would perform in an actual test, or if this configuration works well only with this data. When the data group builds the test data sets it will be important to test these results further.
 * I took a closer look at the tiny data set in comparison to the first_5hr. The lines present in both are almost identical except for the quotation marks. I started looking to make sure all the sounds were present, but it is still unclear how the mapping works. It appears as though multiple wav files have been compressed into one sph, which would explain why the dic file references sound files that don't exist.
 * I worked on the proposal, formatting it to follow Josh's template. We still need to determine the deadlines and task delegation.
 * I built a new experiment 0178, with the plan of running the optimal configurations from Colby's test on the 10hr train. So far the largest train we have run has been the first_5hr. I am not sure if an add.txt file exists for this data set, but if not I will proceed to generate the pronunciations for the words.

2/17
 * Began working on building the missing words dictionary for the 10hr train. Train 0127 was done using the 10hr data set, and appears to be the most recent 10hr train. There is a word list already done there, but the words do not seem to match the list I got. My guess is that the data set has changed since then, or perhaps the dictionary was updated with the new words. I have copied over 0127.dic. If this file contains all the missing words I need, and matches the transcript for the 10hr train, I will use it as the baseline for the 300 words I am missing.
 * I created experiment 0179. The purpose of this experiment is to build a master dictionary that contains all the missing words from the full transcript. This is one process that consistently slows us down when using any data other than the first_5hr train. First I will finish the 10hr data. When that is done I will find all the missing words, merge the 10hr words in, and fill in the rest. Depending on how many are missing I may attempt to automate this, but it may be too variable to actually do. Unfortunately many of the words are misspelled or incomplete, and must be entered by hand. If we can create a master dictionary we will no longer need to worry about which data set we are using, as all words will be present.
 * I discovered a Perl module called Net::SSH::Perl, which allows Perl to open SSH sessions and run remote processes. If we cannot get Torque running, it may be worthwhile to investigate this module. Most of our intensive processes are Perl scripts, such as genTrans and pruneDictionary. Example code for running this is as follows:

use Net::SSH::Perl;
my $host = "miraculix";
# Retrieve the current user
my $user = $ENV{LOGNAME} || $ENV{USER} || getpwuid($<);
my $passwd = "";
my $ssh = Net::SSH::Perl->new($host);
$ssh->login($user, $passwd);
# Can also use cmd($cmd, $stdin) to supply standard input
my($stdout, $stderr, $exit) = $ssh->cmd($cmd);
# Iterate over $stdout for the command's output
 * I began training experiment 0178.
 * I did more work and finally got the Tiny train running (experiment 0145). Strangely, changing the senone and density values allowed the train to run successfully. Normally it stops and errors out at module 45, due to missing dtree files in /mnt/main/Exp/ /trees/ _unpruned/
 * Created LM for 0145 (experiment 0184). After that was created, began to decode.

2/18
 * Colby created a new dictionary for the 10hr train, containing all the words. We will use this as the source dictionary for 10hr trains from now on.
 * The decode for experiment 0184 failed due to duplicate entries in either the base transcript or the hyp.trans file.
 * I ran the uniq command on both, but the problem was not resolved.
 * I began reviewing the experiments for trains that were not "test on train". It seems as though most were, although I found two experiments labeled as "test on dev": experiments 0111 and 0024. To help us further determine the success of configuration modifications, it will be important to decode not only on our training data but also on external data, to ensure our models are not too highly tuned to the data. I created a new experiment (0183), with the plan of exploring different data when decoding. My first step will be modifying the run_decode2.pl script I wrote last week to further allow decodes on different data.
 * I added a new section to the proposal for our test-on-development experimentation. I also added an introduction and added it to the working final version Josh has set up.
 * My decode keeps failing because of inactivity in the terminal (despite having set the wakeup signal). I found that the following command causes the process to run outside the context of the session:

nohup run_decode2.pl 0172 0172 3000 > outfile &

 * I logged out and ran the following command:

ps r | grep decode

 * This displayed a list of running processes matching the word decode. sphinx3_decode was in the process list.
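One caveat worth noting about the uniq step above: uniq only collapses adjacent duplicate lines, so an unsorted transcript can hide its duplicates. A quick sketch (the sample lines and /tmp file name are illustrative):

```shell
# uniq only sees ADJACENT duplicates, so sort first; -d prints each duplicated line once
printf 'sw1 a b\nsw2 c d\nsw1 a b\n' > /tmp/hyp.trans
uniq -d /tmp/hyp.trans          # prints nothing: the duplicates are not adjacent
sort /tmp/hyp.trans | uniq -d   # -> sw1 a b
```

This may explain why running uniq alone did not resolve the duplicate-entry failure.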


 * Plan:
 * Shift research focus to decoding using test data instead of test on train. There are currently a number of test data sets already built. I will review these to make sure they are in working condition. The data group is also working on building these tests sets so I will keep an eye on what progress they are making.
 * Concerns:

Week Ending February 25, 2014

 * Task:
 * Begin decoding Colby's 5 base experiments (0171, 0173, 0174, 0175, 0176) using the last_5hr test data.
 * Learn more about our corpus.
 * Start looking at factors other than sphinx_config parameters for why we are getting a poor WER. I have a strong suspicion there are problems with the transcript files, such as missing words.


 * Results:

2/19
 * Added conclusion to proposal
 * Wrote a Perl script to calculate the total length of the corpus. The total time it calculated was 308 hours, which explains where the other group got that number. I'm looking at train.trans now to find out why it produces that number:

#!/usr/bin/perl
if($#ARGV != 0) {
}
$length = $ARGV[0];
$main = "/mnt/main/corpus/switchboard/308hr/train/trans/train.trans";
open(MYINPUTFILE, "<$main") || die("Error");
$time = 0;
while(my $line = <MYINPUTFILE>) {
    @temp = split(' ', $line);
    $time += $temp[2] - $temp[1];
}
print("Total Time: " . $time . " Seconds!\n");
 * The 308 hours includes a large amount of overlap data, caused by using two channels. With this in mind I modified my script, counting the max number of seconds for each file, mixing the channels (noted by the highest second count value), and then adding them to a total. When I did this I got a total of 250 hours of data, which was the number Sam Workman found before. We should be seeing something along the lines of 97 hours, which means there is still likely something I am overlooking when doing these calculations. I did notice one potential problem though. I ran my script on other data such as the 10hr train and the first_5hr train, and found that the script produced expected results, i.e. they were 10 hours and 5 hours respectively. This means that either the data is in fact 250 hours long or the data subsets are timed incorrectly.
 * I did additional research on the switchboard corpus. It was released in a number of iterations. I found the following information in relation to the size of the corpus:
 * The Switchboard-1 Release 2 was released in 1993 (and re-released in 1997). It contains 2,400 sound files, with 543 speakers. These sound files consist of 1,155 labeled conversations of 5 minutes. That would leave us with 96 hours of data (Linguistic Data Consortium).
 * The initial report for Switchboard ("SWITCHBOARD: Telephone Speech Corpus for Research and Development"), released in 1992 by Texas Instruments, Inc. (Godfrey, Holliman, and McDaniel), states that the corpus consists of 500 speakers, for a total of 250 hours of speech. This seems to be the source most often cited when describing the size of Switchboard. Other sources I have looked at, such as "The Semi-Supervised Switchboard Transcription Project" (Subramanya, Bilmes) and "Investigation of Deep Neural Networks (DNN) for Large Vocabulary Continuous Speech Recognition: Why DNN Surpasses GMMs in Acoustic Modeling" (Pan, Liu, Wang, Hu, Jiang), suggest that this source is indeed referring to Switchboard-1 (noted above as 96 hours), which would make some sense since the Switchboard corpus had not yet been completed at the time of the 1992 report (making the 1993 release a probable actualization of the research still underway in the 1992 report). According to these other sources, the Switchboard audio files contain 320 hours total, 250 of which are recordings of actual speech (the rest is silence and other non-speech data). Why there is a discrepancy between most sources and the Linguistic Data Consortium estimate is still unclear. One answer may be that the LDC specified that 96 hours of data had been "labeled". I do not know what this means, or whether it differs from the actual total size of the data.
 * Also, according to the LDC overview page, 150 conversations were missing from the original release of the switchboard corpus. Conversations run for an average of 5 minutes each, which leaves us with 750 minutes of data, or 12.5 hours missing. Assuming what I previously found was true, this would make quite a bit of sense. If the total number of hours was 320, and 12.5 hours were missing from the initial release, assuming we were using the initial switchboard corpus we would have 308 hours total. This statement makes a number of assumptions, especially since we don't know if the 320 hour baseline was before or after the new conversations were added, or if the dual channels were accounted for somehow in this number. It is quite probable that this is just a coincidence that the new total would be 308, the same number we were getting for the full transcript.
 * Sources:
 * Well-known and influential corpora: A survey
 * The Semi-Supervised Switchboard Transcription Project
 * Investigation of Deep Neural Networks (DNN) for Large Vocabulary Continuous Speech Recognition: Why DNN Surpasses GMMs in Acoustic Modeling - Jump to the Experiment section.
 * Switchboard-1 Release 2
 * SWITCHBOARD: Telephone Speech Corpus for Research and Development - You will need to access IEEExplore via the UNH Library to read this article
 * Corpora Available from The Linguistic Data Consortium
 * Started adding words to the dictionary for the 308hr train we are doing.

2/20
 * Continued building the missing words dictionary. The pruneDictionary run Colby started last night finished; we are missing 3,500 words! The ten hour dictionary covered about 300 of the missing words, which leaves us with 3,200 words still.
 * To try and ease this process along, I wrote a utility script to try to fill in the missing words using the CMU pronouncing dictionary. The nice thing is that the website uses query strings for the word to search and for the flag determining whether to use stresses. After a few hours of development I got a working script. The final version was a simple HTML web page using JavaScript. I chose HTML/JavaScript because Ajax makes asynchronous processing of a large number of HTTP GET requests easy. The script worked but was not very effective. Overall, it only managed to find about 1/20 of the total words. The problem is that an overwhelming majority of the missing words are not actually words at all. They are either 1) numbers, 2) misspelled words, 3) last names, or 4) abbreviations of words. Because of this, CMU was very ineffective at finding most of the words. I could fine-tune the JavaScript to process numbers, but even then it wouldn't be worth the time, as there are only about 100 numbers.
 * With the above in mind, it does raise a red flag about our data. Based on the numerous misspellings and partial words, I would have to imagine this has a pretty strong negative impact on the effectiveness of our models. At a bare minimum it would likely result in a large number of decoding errors, as the decoder must decide which word was spoken when there are two to four versions of the exact same word with the same (or very similar) phonemes but a slightly different spelling. The Language Model might be able to offset this, but to what extent I am not sure. The other problem is that it is unclear whether the misspellings are actually related to the pronunciation in the audio or whether they are transcription errors.
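Subtracting the words we already have from the full missing-words list is a set difference, which comm handles once both lists are sorted. A sketch (the file names and sample words are hypothetical):

```shell
# Words still missing = full missing list minus words we already have
printf 'aardvark\nbanjo\ncymbal\n' > /tmp/missing_308hr.txt
printf 'banjo\n' > /tmp/have_10hr.txt
sort -o /tmp/missing_308hr.txt /tmp/missing_308hr.txt
sort -o /tmp/have_10hr.txt /tmp/have_10hr.txt
# comm -23 prints lines that appear only in the first file
comm -23 /tmp/missing_308hr.txt /tmp/have_10hr.txt
```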

2/21
 * Created experiment 0194 (All are using 3000 senones, last_5hr test data, genTrans5.pl)
 * Began decoding 0171 using last_5hr test in /mnt/main/Exp/0194/0171/
 * Began decoding 0173 using last_5hr test in /mnt/main/Exp/0194/0173/
 * Began decoding 0174 using last_5hr test in /mnt/main/Exp/0194/0174/
 * Decode WER results:

density: 8 - WER: 69%
density: 16 - WER: 68.7%
density: 32 - WER: 71.0%

 * Cleaned the Windows newline characters from the transcript file in experiment 0192 using the following command:

tr -d '\15\32' < 0192_train.trans > 0192_train.trans.new

 * Finalized corpusSize.pl, which calculates the total size of the corpus provided as arg 0 to the script (only the basename, e.g. /mnt/main/scripts/user/corpusSize.pl 10hr). The script now resides in /mnt/main/scripts/user. Based on these calculations, our corpus sizes are as follows:

full: 256.4 Hours
308hr: 256.4 Hours
100hr: 76.7 Hours
10hr: 10.0 Hours
first_5hr: 5.0 Hours
last_5hr: 4.9 Hours
mini: 11.6 Hours
tiny: 1.2 Hours
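The channel-merging idea behind the newer size calculation can be sketched in awk: keep the maximum end time per sound file rather than summing every line. This is a sketch of the idea, not corpusSize.pl itself, and it assumes transcript lines of the form "<file> <start> <end> ..." as in the earlier Perl script:

```shell
# Per-file maximum end time, then sum: avoids double-counting overlapping channel lines.
printf 'sw100 0 10\nsw100 5 12\nsw101 0 5\n' > /tmp/corpus.trans
awk '{ if ($3 > max[$1]) max[$1] = $3 }
     END { for (f in max) total += max[f]; print total }' /tmp/corpus.trans   # -> 17
```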

2/25/2014
 * I further analyzed the last_5hr test data to check for compatibility with the first_5hr. It appears that the first_5hr dictionary contains roughly 700 of the 1000 words in the last_5hr test dictionary, or around 70%. This would account for some of the quality loss I was experiencing, but it still seems a stretch that decoding with that data would produce 70% error.
 * Compared results with Colby. His decodes performed better (although still not great), using the 10hr acoustic model and LM. The larger data size seemed to make a pretty dramatic improvement. Future tests with even larger data should be done.
 * I wrote two scripts, cleanTrans.pl and pullFromTrans.pl, which respectively clean out all special characters from the trans (much like genTrans without any sph processing or fileids generation) and extract the filler words. These will be used when Colby and I attempt to add the filler words to the filler dictionary.


 * Plan:
 * Add missing words to train.trans in the corpus directories.
 * Fix misspellings in train.trans.
 * Run a larger acoustic/LM train and decode on those.
 * Concerns:

Week Ending March 4, 2014

 * Task:
 * Work with Colby to run tests dealing with the out of vocabulary words in the transcript. Our first test will be to build test data that does not contain any OOV lines (we are removing the line not the words). By eliminating these lines we will be getting a good perspective on what our results should be like when we find effective ways to clean them up.
 * Develop tools to provide us flexibility for training and decoding. We want to develop a script to strip all lines containing OOVs from the transcript (cleanTrans.sh).
 * Using our cleanTrans script we want to make clean transcripts for all the primary corpora.

3/1/2014
 * Results:
 * Worked with Colby J. and C. to begin preparation for our test trains (0199). We will be running a series of tests, replacing various out of vocabulary words.
 * Wrote bash script to strip all lines containing [] or _1 from the transcript, and replacing i- with i and {} with nothing. The script cleanTrans.sh is now in scripts/user
 * Built clean data transcript for the 10hr data.
 * Started preparing 9 trains 0200/[d8|d16|d32]/[s3000|s5000|s7000]/ using the last_5hr clean data transcript
 * We crashed Caesar due to running too many trains at once...
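The cleaning rules described above amount to something like the following (a sketch of the idea, not the actual cleanTrans.sh; the sample lines and /tmp file are illustrative):

```shell
# Drop lines containing [...] markers or _1, then replace i- with i and strip { }
printf 'hello [noise] there\nword_1 here\nokay i- mean {um} yes\n' > /tmp/t.trans
grep -v -e '\[' -e '_1' /tmp/t.trans | sed -e 's/i-/i/g' -e 's/[{}]//g'   # -> okay i mean um yes
```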

3/2/2014
 * Started working on incorporating the shell script I wrote into a more complex script which will generate a transcript file containing x hours of clean data.

3/3/2014
 * Caesar is up again. I will resume the trains that were not finished on 3/1.
 * Continued working on the script to build dictionary data. I added an extra parameter, called offset. The script takes in a transcript file (param1), a time in hours (param2), and an offset time in hours (param3). The script produces a new transcript, derived from the base transcript, that is param2 hours long, starting at param3. The script works great for general transcript files but does not work for cleaned transcript files, because the time-calculating algorithm uses a shortcut present in the base transcript to avoid complex calculations. My goal is to make a general-purpose script that can read any transcript file and generate a new transcript from it. Since we already have a full cleaned transcript file, it would be great to be able to build new 5 hour and 10 hour transcripts from it. The current clean transcripts do not match the stated time. For instance, the first_5hr cleaned transcript was generated from the first_5hr train with all [] lines removed, making the total time less than 5 hours (I believe it is 3.8 hours). This script will give us accurate shortened transcripts and will make the generation of test scripts very easy.
 * I wrote a new version of corpusSize.pl, which uses a more accurate algorithm that works on any transcript. The original script could only calculate unclean transcripts, and would error out on clean ones. The new script works for both, although it is a bit slower since we have to calculate the gaps and used time. According to the script we have 192 hours of clean audio.
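The accumulation idea behind createSubCorpus.pl can be sketched as follows (a sketch only, not the real script, and without the gap handling clean transcripts need; assumes "<file> <start> <end>" lines, each 3600 s here to keep the trace simple):

```shell
# Keep lines whose cumulative duration falls in [offset, offset+length) hours
printf 'a 0 3600\nb 0 3600\nc 0 3600\nd 0 3600\n' > /tmp/full.trans
awk -v off=1 -v len=2 '{
    if (t >= off*3600 && t < (off+len)*3600) print
    t += $3 - $2
}' /tmp/full.trans
```

With an offset of 1 hour and a length of 2 hours, this prints the second and third lines.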

3/4/2014
 * I finished the script createSubCorpus.pl which now resides in /mnt/main/scripts/user.
 * I built a new data set called 50hr, containing the 50 hours of data after the first 10 hours, in both clean and unclean versions. To test it, I am running a train on this data using a density of 16 and 5000 senones (exp0204).
 * There will likely be missing words for this data. I may not fill them in, since our immediate goal is to fix OOVs rather than to run large trains; this experiment is merely a test to demonstrate our new data and the script that produced it.
 * Ran trains for 0200
 * d8 - s3000, s7000 (s5000 encountered a problem)
 * d16 - s3000, s5000, s7000
 * d32 - s5000, s7000 (s3000 encountered a problem)
 * Built the LM and symbolically linked it.


 * Plan:
 * Finish experiment 0200 with decodes to compare against Colby's results for first_5hr. last_5hr contains minimal crosstalk, which may be the key to improving performance.


 * Concerns:
 * The data we are using for training is too small for successful decodes on test data. To decode test data successfully we need larger trains, which necessitates filling in missing dictionary words. If we can build a dictionary for the clean data corpus there should be far fewer missing words, as we will no longer have the incomplete words left behind by eliminating OOVs.
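Finding the missing dictionary words can be sketched as a word-list comparison. This is a hedged sketch, not one of our scripts: the file names are placeholders, and it assumes the transcript carries <s>/</s> markers and trailing (utt_id) tags, and that the dictionary marks alternate pronunciations with a (2)-style suffix, as cmudict does.

```shell
# List out-of-vocabulary words: transcript words with no dictionary entry.
find_oov() {
  trans=$1; dict=$2
  # word list from the transcript, minus <s>, </s> and the trailing (utt_id)
  sed -e 's/([^)]*)$//' -e 's#<s>##g' -e 's#</s>##g' "$trans" |
    tr ' ' '\n' | grep -v '^$' | sort -u > /tmp/words.txt
  # head word of each dictionary line, stripping (2)-style suffixes
  awk '{ w = $1; sub(/\([0-9]+\)$/, "", w); print w }' "$dict" |
    sort -u > /tmp/dictwords.txt
  comm -23 /tmp/words.txt /tmp/dictwords.txt
}
```

comm requires both inputs sorted, which is why each list goes through sort -u first.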

Week Ending March 18, 2014

 * Task:
 * For this week I continued working with Colby on refining the train process. First, I finished running the decodes for the 9 trains I did using the last_5hr data.
 * I also resumed working with decoding on foreign data. To start, I built a data set consisting of four hours of data outside of last_5hr and first_5hr.
 * Began decoding on the AM and LM for experiments 0200, 0199.

 * Results:

3/5/2014
 * The results of the decodes for the nine trains in experiment 0200 are in:
 * The biggest difference was in d8 and d16, which improved using last_5hr over first_5hr. A density of 32 actually decreased performance on last_5hr; my guess is that this model is over-trained, which would explain the drop. For five hours of data we should be using around 3000 senones and a density of 8 or 16.

3/6/2014
 * I finished scoring the last two experiments and added them above.
 * Extracted duplicate words and pronunciations from the dictionary file. According to a command I ran, there are over 11,000 pronunciations that appear in the dictionary at least twice, and often more. Additionally, using the following command I calculated that there are 15,968 duplicate pronunciations with the original pronunciation factored out:
 * sed "s/^[^ ]*\s //g" cmudict.0.7d | sort | uniq -c -d | sed "s/ /\n/g" | grep "^[0-9]" | tr '\n' "\+" | sed "s/\+/\-1\+/g" | sed 's/\+$/\n/g' | bc
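The pipeline above is hard to follow; the same calculation can be sketched more readably with awk. This is an equivalent sketch (not a replacement I have run against cmudict), assuming each dictionary line is a head word followed by a space-separated phone string.

```shell
# Count duplicate pronunciations in a dictionary: for every pronunciation
# string shared by N entries, count N-1 "extra" copies.
count_dup_prons() {
  sed 's/^[^ ]* //' "$1" |      # drop the head word, keep the phone string
  sort | uniq -c |              # count occurrences of each pronunciation
  awk '$1 > 1 { extra += $1 - 1 } END { print extra + 0 }'
}
```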

3/17/2014
 * I created a new test dataset in last_5hr called test2, containing 4 hours of test data outside of the last_5hr data. I chose 4 hours because Colby mentioned he had seen that test data should be no more than 4 hours; therefore, unless otherwise specified, all test data will be 4 hours for uniformity. I will be using this data to test the AMs built for experiments 0199 and 0200, which both produced WERs of less than 20%, the best being 15%. I still think this is over-trained, but I would like to see how we do in comparison to the decode I did on experiment 0175, which also got a WER of 15% but decoded at 70% on unknown data. I'm hoping these experiments will produce better decodes.
 * The last_5hr test2 data was generated from the full clean corpus, containing 4 hours of data starting at the 30 hour mark (this means that we can use this same test data to test on first_5hr without problems). I used the script I wrote, createSubTranscript.pl to build this.

3/18/2014
 * The decodes I started yesterday finished. The results were as follows:
 * Decode on 0200/d32/s3000: 76.4%
 * Decode on 0190/d32/s3000: 79.0%
 * Decode on 0200/d16/s5000: 81.0%
 * Decode on 0199/d16/s5000: error
 * I am not sure why our decodes are still doing so poorly. It's possible the models for 32 densities are over-trained, but a density of 16 should do better than this. For consistency I am beginning decodes for the d8 models we built (s3000, s5000, s7000). When these are done I will try the remaining d16s and d32s. Depending on the results, I may need to analyze the decoding process I am using and see if there is an error there.


 * Concerns:


 * The decodes are still doing terribly. The first thing I will try is decoding on our data with the smallest density (8). Next, I may try to build a new language model for the full data and use that instead of the LM for the last_5hr. I am still determining which files we need to use from the base experiment and which we need to generate for our test data.
 * The problem is most likely the dictionary, although it could be the language model. I have been using the dictionary for the new test data, since the decoder presumably only needs to find those words, but a bigger dictionary may make the decode do better.

Week Ending March 25, 2014

 * Task:
 * Continue analyzing the AM and LMs we built in previous experiments by running decodes against them using a 4 hour test data set I built from the middle of the full transcript.
 * Research decoder settings that may be useful for improving our scores.


 * Results:

3/19/2014
 * Started several decodes for the 0200 and 0199 d8 trains. Because the clean data is only 3.8 hours, it makes sense to use a lower density.

3/23/2014
 * Read logs

3/24/2014
 * I scored decodes for s3000 and s5000, the results are as follows:
 * 0200-d8-s3000 = 83.4
 * 0200-d8-s5000 = 86.3
 * I had assumed that the lower density trains would have done better, but in fact they did progressively worse than the larger density trains.
 * I also ran a test decode using the language model from experiment 0218 (the 100 train). Surprisingly the results were decently better (6 percent improvement).
 * 0200-d8-s3000(dict) = 77.7
 * With this in mind, I have begun to run new decodes for the d32 trains using the new language model.
 * While these are going, I am also running another test decode on 0200-d8-s3000 using the dictionary from 0218 as well as hard-coding the decoder's sample rate to 8000. The decoder may be defaulting to 16000, which would inflate the WER on our decodes. If that is the case, then the results for that experiment should be better than 77.7%.
 * I just looked at the decode.log file of one of my previous decodes. The decoder is indeed defaulting to a 16000 sample rate. This should be 8000, as the switchboard corpus uses 8 kHz audio. Hopefully the 8 kHz decode I am running now will improve the WER.
 * I started a new test-on-train decode for experiment 0200/d8/s3000 using a sample rate of 8000. Because these decodes use a density of 8, they should go quicker. I'm curious to see what our new test-on-train score will be using the correct sample rate; we got a 26% WER running it with a samprate of 16000.

3/25/2014
 * The decodes I ran yesterday have finished. The results were as follows:
 * 0200-d8-s3000-full_dict-samprate(test on eval) = 77.7%
 * 0200-d8-s3000-full_dict-samprate(test on train) = 29.9%
 * Strangely, changing the sample rate for the 0200-d8-s3000-full_dict(eval) had no impact on the WER at all, yet for the test on train variant it increased the WER quite a bit. I'm rerunning both of these decodes using the standard language models to try and get a better picture for what's happening. In retrospect I should have done this for the test on train experiment anyway since doing otherwise would add a new variable to the experiment (since the base used the standard LM).
 * I reran the decodes using the default language model, but once again the sample rate had no impact on the score.
 * With this in mind, I am now preparing to run a test on eval decode using the AM and LM for the 100 hour train. Maybe a larger model will produce better results.


 * Plan:
 * Run decodes to determine whether or not settings will improve the overall WER.
 * Concerns:
 * The decodes for all three density levels for our clean data decodes performed poorly against the test data.
 * Changing the sample rate decoder setting surprisingly had no effect on the final WER; 16,000 and 8,000 Hz performed identically.

Week Ending April 1, 2014

 * Task:
 * Continue improving decodes.

 * Results:

3/30/2014 - 3/31/2014
 * I created an experiment we will use for testing the construction of acoustic models using the 100 hour data set (Exp/0245). To simplify the process, I created symbolic links to the artifacts of 0192 and overwrote the sphinx configuration file. Before I proceed, I want to analyze the files more closely to ensure there are no other files that need to be unique per experiment.
 * I created another experiment, 0244, in which I will prepare all the eval data sets so we can run decodes without having to reconstruct the feats, dictionary and transcript files. Currently this process is held up because I first need to generate the eval and dev test data.
 * Read logs
 * Copied over files from 0216 (test data for last, first 5hr and 10hr data). It consists of 4 hours of data taken from the middle of the transcript.
 * Before I get much further, we need to decide what the size of our test data will be. This number should be consistent between all test data sets. Currently I am using 4 hours, but we may want to change that.
 * Began looking into the warnings and errors for training. I haven't found anything conclusive yet.

4/1/2014
 * Colby showed me the new decode files he found.
 * We attempted to run a decode using slave.pl as recommended by CMU, but the script produced an error:

MODULE: DECODE
Decoding using models previously trained
Aligning results to find error rate
Can't open /mnt/main/Exp/0249/result/0249-1-1.match
word_align.pl failed with error code 65280 at scripts_pl/decode/slave.pl line 172.

 * The problem is that the match file specific to our part/npart configuration (0249-1-1.match, where part is 1 and npart is 1) is not being created, so the file cannot be found.
 * The script starts by passing the $match_file variable into concat_hyp. $match_file contains the string 0249.match.
 * Colby and I isolated the problem to line 83 in concat_hyp, where Perl tries to open the .match file that is not found. The script cannot load the file, throws the exception, but continues to execute. Eventually it reaches line 172 (not exact), where the whole script errors out after tripping over the conditional if($?).
 * Upon researching, we found that $? contains the exit status of the last system command that ran.
 * I spent quite a bit of time looking through slave.pl. Colby found an online source that mentioned the specific error code 65280; it said the code is produced when the decoder itself fails. Upon looking closer, I discovered where the decoder is called. On lines 56-58 of the file the program runs the following:

56: for (my $i = 1; $i <= $ST::DEC_CFG_NPART; $i++) {
57:     push @jobs, LaunchScript('decode', [$ST::DEC_CFG_SCRIPT, $i, $ST::DEC_CFG_NPART]);
58: }

 * For each npart the script spawns a new process using the LaunchScript function. My guess is that this spawned script is supposed to produce the match file and fails, so the 0249-1-1.match file is never created.
 * The problem is that I do not know what file this is. The LaunchScript function takes in a string and uses it to run some script.
 * My first guess was that it was running s3decode.pl, located inside the scripts_pl/decode directory, but after running some tests I determined that this was not the script being called.
 * I discovered a way to run Perl's internal debugger on the slave.pl script using the -d option.
 * Perl's debugger stops execution after each line and lets the user run one of many commands. I found the following useful:

n = step over (next)
s = step into
. = print the current line
r = return from the current subroutine
 * Using the debugger I found that LaunchScript is declared in SphinxTrain::Util.
 * Using this I can read through the LaunchScript function and determine which script is being called.
 * According to online sources, whatever this script is, it should be putting output into a log file in logdir/decode but no such file exists. I need to determine what this file is before I can continue.
 * The actual file where LaunchScript is declared is scripts_pl/lib/SphinxTrain/util.pm. LaunchScript is declared on line 414.
 * The script starts by declaring a variable $scriptdir which is equal to $ST::CFG_SCRIPT_DIR (/mnt/main/Exp/0249/scripts_pl) concatenated with basename(dirname($0)).
 * Surprisingly, after parsing the results of LaunchScript, I realized that my first intuition was in fact correct: LaunchScript("decode") calls s3decode.pl. Looking closer, I now suspect the function runs the process silently, because none of the output from s3decode.pl is actually displayed, so I need to look at s3decode.pl again. This does make sense: I had originally concluded that LaunchScript("decode") was calling s3decode.pl because s3decode.pl's parameter list requires the part and npart, and the LaunchScript call passes two arguments, a part and an npart. Now I need to find out what is going wrong there.
 * s3decode.pl starts another process using the RunTool function (also declared in Util.pm) called sphinx3_decode.
 * According to s3decode.pl, sphinx3_decode should be writing out to a file logdir/decode/0249-1-1.log but it is not.

4/1/2014
 * In s3decode.pl, the script eventually runs a function called RunTool("sphinx3_decode", arguments...). I looked back in Util.pm and found the RunTool function. I still couldn't get my print statements to work, so I set up a simple log to output to. After running some tests, I found the problem: if no absolute path is given, RunTool looks for the file in /mnt/main/ /bin/, and there is no sphinx3_decode file in bin. I modified s3decode.pl and provided the absolute path /usr/local/bin/sphinx3decode as the first argument. The decoder started!
 * The final WER ended up being a bit worse than we were getting using the old method.
 * Colby and I set all the configuration parameters to match the old method and matched the WER, with a slight 0.1% difference in the other method's favor.
 * We successfully ran a decode using npart and cut the decode time by over half.
 * Colby tried to score one of the decodes using the 0249.match file instead of hyp.trans, but it did considerably worse, almost 8% higher. I ran some unix commands and determined that the 0249.match file is the same as hyp.trans without the </s> tags.
 * Setup experiment 0250 to test tune_senones.pl.
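The hyp.trans vs .match comparison above can be checked with a one-line sed/diff, sketched here as a function; the real files live in the experiment's result directory, so the paths here are placeholders.

```shell
# Check that a .match file is just hyp.trans with the </s> tags stripped.
# Exits 0 (success) when the two files agree after stripping.
match_equals_hyp() {
  sed 's# *</s>##g' "$1" | diff - "$2" > /dev/null
}
```

Usage would be `match_equals_hyp hyp.trans 0249.match && echo identical`.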


 * Plan:
 * Work with Colby J. to get the new decoding method to work.
 * Get the new decodes to perform as well if not better than the other method.


 * Concerns:
 * There were some changes that had to be made to s3decode.pl to get it to work properly.

Week Ending April 8, 2014

 * Task:
 * Start preparing data we can use to train on for future experiments
 * Modify the senone tuning script to be more dynamic and process all the models created by the decoder.
 * Work with the group to help improve our trains/decodes.


 * Results:

4/3/2014
 * I created a new version of the tuning script to account for other important parameters as well. There is still currently a bug in the script because of the way the decoder builds the output files. I am searching through script files to determine where the file in question is generated, and then I can modify it so that it works as expected.
 * Prepared data for the upcoming trains we plan to run.

4/4/2014
 * Colby mentioned that, according to Sphinx, the total time for the data I prepared for us to experiment on was off by 20%. Today I ran some tests on the corpus size script I wrote several weeks ago to figure out where this discrepancy came from; my script should be accurate to the second. I recreated the initial version of my script, now called corpusSize0.pl. This version calculates the total size of a data set while ignoring the fact that the sound files overlap with one another; it was the same version that told us the total corpus was 308 hours long. I ran it against the data I created on 4/3 and got the same result Colby said Sphinx got. To be sure this was the cause, I ran it against last_5hr/train. Colby and I had determined that, according to Sphinx, this data set was around 6 hours long, while corpusSize2.pl (the corpus time calculation algorithm used to generate sub-transcripts) calculated it to be exactly 5 hours. When I ran the same test using corpusSize0.pl it calculated last_5hr to be 5.9 hours, once again matching Sphinx's number. I can therefore conclude that Sphinx is not accounting for overlap between the conversations.
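The naive-sum vs overlap-aware difference can be illustrated with a toy sketch. The input format here (one "start end" pair per line, in seconds) is a simplified stand-in for the per-conversation times the corpusSize scripts actually read.

```shell
# Naive total: sums every interval, double-counting overlaps
# (what corpusSize0.pl and, apparently, Sphinx do).
naive_sum() { awk '{ s += $2 - $1 } END { print s }' "$1"; }

# Overlap-aware total: merge intervals before summing
# (the corpusSize2.pl approach, in miniature).
merged_sum() {
  sort -n "$1" | awk '
    NR == 1 { lo = $1; hi = $2; next }
    $1 > hi { total += hi - lo; lo = $1; hi = $2; next }  # disjoint: flush
    $2 > hi { hi = $2 }                                   # overlap: extend
    END     { print total + hi - lo }'
}
```

With intervals 0-10, 5-20 and 30-40, the naive sum reports 35 while the merged sum reports 30, which is the same kind of 5 hr vs 5.9 hr gap seen above.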

4/7/2014
 * I modified the senone tuning script again to calculate the current density value. I tried running the script with some of the earlier steps commented out to see if it created the files we needed. So far I have only seen one line generate a file with the needed extension, which was in slave.pl. I modified the name there to see if it was somehow connected to the generation of the other files, but it was not. Even so, if the primary experiment file is the same as the sub files, I may still be able to rename it with no consequence, so long as we can still score from it. I'm not 100% sure why there is a primary version of the file but also a - version.

4/8/2014
 * I fixed tune_senones.pl. The location I needed to modify was right in front of my face the whole time. The script should now iterate over all AMs built.
 * To answer my previous question, it does happen as I suspected. It creates .extension first and then renames the file to - .extension before proceeding to the next stage.
 * I worked with Colby, Josh, Brian and Mike on trying to get Speak up and running.
 * I created a new copySph script to handle creating the symbolic links to the sph files and running sox in our new corpus directory structure (audio -> [wav, sph]).


 * Plan:


 * Concerns:

Week Ending April 15, 2014

 * Task:
 * My task this week was to refine and finish implementing the new data arrangement.
 * Develop a script to automate the process.
 * Develop documentation on the new process.


 * Results:

4/9/2014
 * I wrote two new scripts, copySph2.pl and copySph3.pl
 * CopySph2.pl is the new way we will generate sph and wav files for a corpus subset. It creates the sph files in audio/sph and then creates symlinks to all the needed wav files in /mnt/main/corpus/switchboard/full/train/audio/wav.
 * CopySph3.pl should never be used. It generates the sph files like copySph2.pl, then uses sox to extract the wav files from the sph files, and then uses sox to convert the utterance wav files back into the sph format. I already ran this script and created the master wav file list in /mnt/main/corpus/switchboard/full/train/audio/wav; use copySph2.pl to build a new data subset.
 * I began using copySph2.pl to create the links in all the corpus subset directories. Before I proceed, I want to fully initialize one data set and successfully train/decode on it to make sure the sound files are correct. To do this, I am building on the first_5hr train data. I added the audio/utt/, audio/conv/, info/train.dic, info/train_train.fileids and info/train.phone files to it.
 * Because of these changes, I needed to rewrite genTrans since we no longer need sox. I created a new script buildData.pl. The script has been documented here.
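The linking step copySph2.pl performs can be sketched as follows. This is an illustration of the approach only: the directory layout matches the master wav path named above, but the fileids format (one utterance id per line) is an assumption.

```shell
# Sketch of the copySph2.pl idea: instead of re-extracting wav files
# with sox, point each subset at the master wav directory via symlinks.
link_wavs() {
  master=$1   # e.g. /mnt/main/corpus/switchboard/full/train/audio/wav
  subset=$2   # the subset's audio/wav directory
  fileids=$3  # one utterance id per line
  mkdir -p "$subset"
  while read -r id; do
    ln -sf "$master/$id.wav" "$subset/$id.wav"
  done < "$fileids"
}
```

The payoff is that a subset costs only symlinks, not a second sox pass over hours of audio.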

4/13/2014
 * I wrote a guide on using the new process for running a train. It took some time to debug an error I was getting when trying to actually run the train. The error was caused because the .dic, .trans, .fileids and .phone files were symbolic links, as intended when I wrote buildData.pl. The problem is that Perl cannot open these files normally because they are links, and an error is raised. To solve the problem I reorganized the experiment directory so we could still link to the files: I created a new directory in the experiment directory called data, which is a symbolic link to the /mnt/main/corpus/switchboard/ / /info directory. This change required that I modify the etc/cfg file to point to the data directory instead of etc. A variable containing the path was already in place, so I modified buildData.pl to change that line in the cfg as well as the file names for the dictionary, transcript and the other info files. Now, because the data directory points to the info directory where the actual files reside, the trainer can read them without a problem.
 * Another issue came up when I was getting John the steps for training with the new system. It seems the transcript for the 100hr train we are doing does not match the base transcript file. I'm not sure why, but there are about 100 extra lines in the transcript we had already generated, as well as extra files which are now missing from the fileids. This problem did not happen with first_5hr, so I need to figure out where they came from.

4/14/2014
 * Today I reran genTrans9.pl to recreate the edited transcript file. After a few hours the script finished and I checked the number of lines against the base transcript, and once again there were about 100 extra lines.

4/15/2014
 * I spoke with Colby briefly about the missing lines. He mentioned that some of the audio files are missing. He said I should extract out the lines from the transcript that do not have audio files. I can probably write a quick script to check this.
 * I wrote the script to pull out all lines from the generated transcript file that do not have a corresponding sph file. I am running into another problem because of the symlinks. Normally in Perl you can run a -e on a file to test whether it exists, but because the files are symlinks it doesn't work. -l is supposed to test for symlinks, but that too is failing. Once I can get this working and remove the needed lines, I can then proceed with generating the feats files, which will probably take a few hours to complete.
 * Despite the missing files I attempted to run the train. There are 12,700 missing audio files, which causes the trainer to error out. This was particularly strange, because the five hour train did not have any problems, and I was able to start training on it.
 * Worked with Colby to write the abstract for our poster presentation at the URC.
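The transcript-vs-audio check above can be sketched in shell. One hedged note on the -e confusion: both shell test -e and Perl's -e follow symlinks, so they report false for a dangling link (a link whose target is missing) even though the link itself exists, while -L/-l test the link itself; that behavior may be what made the Perl checks look broken. The paths and the trailing (utt_id) format here are assumptions.

```shell
# Print transcript lines whose utterance has no audio file on disk.
# -e follows symlinks, so a dangling link counts as "missing" here.
missing_audio() {
  trans=$1; sphdir=$2
  while read -r line; do
    id=$(printf '%s\n' "$line" | grep -o '([^)]*)$' | tr -d '()')
    [ -e "$sphdir/$id.sph" ] || printf '%s\n' "$line"
  done < "$trans"
}
```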


 * Plan:

4/9/2014
 * I have written the needed scripts to create the base utterance files as well as link to utterance files for all data sets other than full. I have also written the script needed to build an experiment using links rather than generating files. My plan now is to successfully run a train using this new script. If it works, I will delegate the task of building the remaining data sets to Pauline and Mitch.


 * Concerns:
 * 100 extra lines in the transcript are causing problems when we generate feats. These lines only appear to be in the 100hr data, as I did not have this problem with the 5 hour data.

Week Ending April 22, 2014

 * Task:
 * Finish the new experiment process using pruneDictionary4.pl and genTrans10.pl.
 * Begin training a 100hr model for tuning.

 * Results:

4/16/2014, 4/18/2014, 4/21/2014
 * I wrote two new scripts, genTrans10.pl and pruneDictionary4.pl. The former script is the same as genTrans9.pl except that it does not perform any sox commands. The latter script was a complete rebuild of the pruneDictionary2.pl script in an attempt to improve performance. What once could take hours now takes only a few seconds to complete.
 * I wrote a new script, prepareExperiment.pl, which automates all the steps of running an experiment up to feats generation and handles the creation of all symbolic links. To complement it I wrote another script, generateFeats.pl, which handles the remaining steps to complete a train. Most importantly, generateFeats.pl swaps the symbolic link after the feats are generated, preventing the trainer from hanging on module 20 because of symlinks. With this new process I successfully completed two trains with no problems.
 * There is one step missing from these scripts (although genTrans is technically supposed to handle it): removing [LAUGHTER], [NOISE] and [VOCALIZED-NOISE] from the transcript. I am running some tests related to this at the moment; when I finish I will add those steps to prepareExperiment.pl. What I am trying to do is add these words to the filler dictionary instead of stripping them from the transcript. I used sed to replace all instances of them with the corresponding tags I created in the filler dictionary (+laughing+ and +noise+) and added those to the phones file as well. With these changes I was able to successfully build a new acoustic model, and am now decoding it to see how we do. I am also running a control experiment, identical except that we simply remove these words as before. If all goes well, we should see a performance increase from keeping the fillers. My only concern is that I used first_5hr rather than last_5hr, but we should still see a gain, even if it is less pronounced.
 * I spent some time documenting all the scripts I have made over the semester which I admittedly have not been documenting outside my log.
 * The results of the filler dictionary test I ran above yielded a 2% improvement to our WER. Next I ran another test, this time replacing all words that follow the [LAUGHING-<WORD>] format with to see how removing those would work. Unfortunately we did not improve our WER at all, but actually increased it by a small fraction.
 * I modified prepareExperiment to keep the filler words in their original format, [LAUGHTER], and used that spelling in the filler dictionary. The 100hr train I started has finished, so now I can decode on it. Unfortunately I used the old form of prepareExperiment, which means the filler dictionary and transcript files must be modified to work properly.
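The filler-word substitution described above can be sketched with sed. The tag names (+laughing+, +noise+) come from the log; mapping [VOCALIZED-NOISE] to +noise+ and the exact transcript format are assumptions, not the exact prepareExperiment.pl behavior.

```shell
# Map bracketed noise markers in a transcript to filler-dictionary tags
# instead of deleting them, so the trainer can model them as fillers.
tag_fillers() {
  sed -e 's/\[LAUGHTER\]/+laughing+/g' \
      -e 's/\[NOISE\]/+noise+/g' \
      -e 's/\[VOCALIZED-NOISE\]/+noise+/g' "$1"
}
```

The matching filler dictionary would then carry entries for +laughing+ and +noise+, with the corresponding phones added to the .phone file.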

4/22/2014
 * I am now decoding on the 100hr train I started. The realtime factor is pretty bad, but I haven't tuned any of the decode parameters or the senones yet. Using the new test data I created seemed to drastically improve decode time: even with a realtime factor of 6+ I was able to run a decode in about an hour. If I can improve the realtime factor I'll do much better than this.
 * Plan:
 * Begin modifying the decode settings for the 100hr model I have built, though I will likely tune the senones first after my current decode finishes.
 * Concerns:
 * None at the moment. The realtime factor on my decode was high, but that's where tuning comes into play.

Week Ending April 29, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending May 6, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns: