Speech:Spring 2014 Avengers

We can use this page during the Spring 2014 Competition to note any "secret" items we are using.

Priorities
---(Colby J)---
1. Hard-code the changes we need in the specified files (I actually already did this).
2. Extract our transcription data. This involves pulling the last 100 hours of data out of the full transcription (I think David wrote a nice script for this). We also want to pull the last 5 hours out of that 100-hour set.
3. Modify the tune_senones.pl script to iterate the decode over a hard-coded list of density values (I'm thinking 16, 32, and 64 will be fine; we will be using senone values ranging from 5000-10000, although tune_senones already handles all senone permutations). A rough sketch of the loop is right after this list.
4. With tune_senones.pl customized, use it to find the optimal senone value for our transcription. We will sweep senone values from 5000 to 10000 in increments of 1000 and take the one with the best WER. If that doesn't take too long (I'm expecting the script may run for about 5-10 days), we can then repeat the sweep over a smaller range with a smaller increment. This will really fine-tune the density-to-senone pairing.
5. After we have a good baseline we will begin tuning other parameters. There is a great page I found (linked below) that goes into detail about which parameters should be changed in what order. This will be the most tedious and time-consuming process, but it's a bit down the road.
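
Here is a minimal sketch of what the item 3 loop could look like. The density list, senone range, and run_decode() wrapper are placeholders I made up for illustration; the actual tune_senones.pl in 0251/scripts is the real thing.

 #!/usr/bin/perl
 # Sketch of the density/senone sweep for item 3; not the real tune_senones.pl.
 # The density list, senone range, and run_decode() wrapper are placeholders
 # for whatever the script in 0251/scripts actually does.
 use strict;
 use warnings;

 my @densities = (16, 32, 64);               # hard-coded density list
 my @senones   = map { $_ * 1000 } 5 .. 10;  # 5000 .. 10000, step 1000

 foreach my $den (@densities) {
     foreach my $sen (@senones) {
         my $exp = "den${den}_sen${sen}";
         print "launching train/decode for $exp\n";
         run_decode($den, $sen, "$exp.align");
     }
 }

 # Placeholder: the real step would rewrite the .cfg values for this
 # combination and kick off the usual slave.pl / decode / word-align run,
 # leaving the WER in a per-combination align file.
 sub run_decode {
     my ($densities, $senones, $align_out) = @_;
     print "  would decode with densities=$densities, senones=$senones, ",
           "align file=$align_out\n";
 }
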
 * Keep in mind when setting up for the decode that the transcription and fileids references need to be changed from _test.fileids & _test.trans TO _train.fileids & _train.trans. (I didn't want to hard-code those changes because future semesters are not going to use these values for test on evals.)
 * Also, the LM will need to be changed from .lm.dmp to one of the following:
 * If your LM is going to be small, use tmp.arpa
 * If it will be large, use tmp.lm.DMP
 * The last 100 hours will not have nearly the cross talk that the first 3170 conversations do, which could be our key to success.
 * (David M) We now have a new 100hr data set in /mnt/main/corpus/switchboard/100hr/train2. This data contains the last 100 hours of the full data corpus. Our test data, which is the last five hours of the 100-hour set, now exists in /mnt/main/corpus/switchboard/100hr/test.
 * (David M) I figured out why Sphinx was telling us we had 120 hours of data despite the fact that the script I wrote gave us 100 hours: Sphinx is not accounting for overlap when it calculates the total time. I rebuilt the original version of my script (now corpusSize0.pl), which did not account for the overlap (the one that originally told us the total corpus was 308 hours). It calculated the 100hr/train data as 119.9 hours. (A toy sketch of the two ways of counting is after these notes.)
 * (David M) There is now a new version of tune_senones in 0251/scripts. The script iterates over all senone values and, for each, over density values in powers of two from 16 up. There is one bug I just noticed in the new script, though. One of the scripts, likely slave.pl or word-align.pl, creates the align files with a name that does not include the density value, so even though the script generates align files for all density values, each one ultimately gets overwritten by the next. I need to find where that align file is created and fix the name. There is an align file being created by slave.pl, but it is not being named with the senone value, which leads me to believe that either another script is generating the file or that the .cfg files being used by tune_senones are changing the exp name to include the senone value. (A possible stopgap is sketched after these notes.)
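
On the 100 vs. 120 hour discrepancy above: the gap comes from whether overlapping utterances (both speakers talking at once) are counted once or twice. Here is a toy sketch of the two ways of summing durations, assuming a plain list of [start, end] utterance times in seconds; the real corpusSize scripts derive these from the corpus files.

 #!/usr/bin/perl
 # Toy comparison of naive vs. overlap-aware corpus duration.
 # Utterance times below are made up for the example.
 use strict;
 use warnings;

 my @utts = ([0, 10], [5, 20], [30, 40]);    # one overlapping pair

 # Naive total: sum every utterance length. Overlapped speech is counted
 # twice, which is the ~120 hour style of answer.
 my $naive = 0;
 $naive += $_->[1] - $_->[0] for @utts;

 # Overlap-aware total: merge overlapping intervals first, then sum.
 # This is the ~100 hour style of answer.
 my @merged;
 for my $u (sort { $a->[0] <=> $b->[0] } @utts) {
     if (@merged && $u->[0] <= $merged[-1][1]) {
         $merged[-1][1] = $u->[1] if $u->[1] > $merged[-1][1];
     } else {
         push @merged, [ @$u ];
     }
 }
 my $aware = 0;
 $aware += $_->[1] - $_->[0] for @merged;

 print "naive: ${naive}s   overlap-aware: ${aware}s\n";   # prints 35s vs. 30s
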
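And for the align-file overwrite in the last note: until the script that writes the file is found and fixed, one possible stopgap is to have tune_senones copy the freshly written align file to a per-combination name right after each decode finishes. This is only a sketch; the directory layout, file names, and the save_align() helper are assumptions, not what is currently in 0251/scripts.

 # Stopgap sketch for the align-file overwrite: copy the generic align file
 # to a name that carries the senone and density values before the next run
 # clobbers it. Paths and names below are illustrative only.
 use strict;
 use warnings;
 use File::Copy qw(copy);

 sub save_align {
     my ($result_dir, $exp_name, $senones, $densities) = @_;
     my $src = "$result_dir/$exp_name.align";   # the file that keeps getting overwritten
     my $dst = "$result_dir/$exp_name.${senones}sen_${densities}den.align";
     copy($src, $dst) or die "could not copy $src to $dst: $!";
     return $dst;
 }

 # e.g. save_align("result", "0251", 5000, 16) after each decode + align pass.
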

Decoder tuning reference: http://www.cs.cmu.edu/~archan/s_info/Sphinx3/doc/s3_description.html#sec_dec_tune

If all goes well I think we will be looking at pretty accurate results. I don't want to settle for that, though. I have a task for Justin, Mitch, and John: get the PyLab stack installed on Fedora. I have reason to believe the repositories need to be updated. This could be a very simple task, but David and I had no luck with it. Having PyLab would allow us to train using MLLT, which is said to drop the WER by up to 25% as well as reduce the decode time.

I would like Josh, Brian, David, and myself to work on the experiments, as we know the process pretty well. Justin, John, and Mitch are encouraged to learn the process as well; if the Fedora issue turns out to be simple, you guys can help run more experiments. I would also like everyone to claim a machine as their own. I will be using majestix. By not all using the same machines we can hopefully expedite the training/decode time.

I would like all emails to include everyone on our team. I want everyone to stay in the loop, so PLEASE contribute to the conversation. If I am forgetting anyone, please let me know. Also, I am very confident with this system and can decipher a lot of the error messages we will run into, as I have run into plenty in my experience. Please feel free to ask me for insight; I am happy to help.

The tasks are subject to change depending on any discoveries in the coming weeks, so please keep everyone posted if you find anything that could be valuable. Also, Josh, if you could forward this to the people I missed, that would be great.

New Information
CMU's tutorial page has a table with recommended senone values based on data-set sizes.

Exp Sub Dirs for 0251
I think we should only post our experiment details here; this way we keep them secret. We should still post our results on the main Exp page under 0251, but with limited detail.


 * 001 Mini/train Tune Senones LM:tmp.arpa
 * 002 Mini/Train Tune Senones LM:tmp.lm.DMP
 * 003 ADD DESCRIPT
 * 004 Mini/Train Tune Senones even more LM:tmp.arpa
 * 005 Last_5hr Tune Senones
 * 006 Mini/Train Tune Senones LM:tmp.arpa
 * 007 Train: 100hr/train2 LM:tmp.lm.DMP Decode: 100hr/test2
 * 008 Experiment using the RunAll_CDMLLT.pl script
 * 009 Experiment testing new corpus structure
 * 011 100hr AM (3170/train)
 * 012 Final Results