Speech:Fall 2014 Justin Thibeault


Week Ending Sept 24, 2014
Last week I managed to train successfully on the first_5hr corpus, but the test failed. On Erol's advice, this week I worked through the CMU Robust Group tutorial in /mnt/main/Exp/0256/tutorial. I had a few failures, which I attribute to three different causes:
 * my failure to follow the instructions 'exactly'
 * missing and inconsistent directory changes in the tutorial
 * scripts that must be run from a specific folder but that also exist in other folders, and that don't fail obviously when run from the wrong folder

I was able to complete a decode on the tutorial data; results below.

SENTENCE ERROR: 60.0% (78/130)
WORD ERROR RATE: 19.4% (149/773)

Problems
I'm seeing a database error when I try to use the search function on FOSS. Is this a known issue, or something we should report?

Still having trouble getting a successful decode on the Switchboard data, so I'm documenting exactly what I do.

Experiment

 * 1) Following the instructions on Speech:Run_Train_Setup_Script, I jump into Exp/0256/001 and run the setup script, which seems to be successful.
 * 2) Again, success.
 * 3) This step takes a while, but completes successfully.
 * 4) Moving on to Speech:Create_LM. All of this seems to work correctly.
 * 5) Following the instructions at Speech:Run_Decode, I return to the experiment folder and run the decode script, which fails with the following message (see the sanity check after this list):
   MODULE: DECODE
   Decoding using models previously trained
   Aligning results to find error rate
   Can't open /mnt/main/Exp/0256/001/result/001-1-2.match
   Can't open /mnt/main/Exp/0256/001/result/001-2-2.match
   Can't open /mnt/main/Exp/0256/001/etc/001_test.fileids for reading
   [1]    Exit 2    scripts_pl/decode/slave.pl
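Before retrying the decode, a minimal sanity check I can run (assuming the standard Exp/0256/001 layout from above; the error suggests the test fileids list and the result directory are the things to verify):

cd /mnt/main/Exp/0256/001
ls -l etc/001_test.fileids    # slave.pl reads the list of test utterances from here
wc -l etc/001_test.fileids    # should match the number of test utterances
ls -ld result                 # slave.pl writes its 001-*.match hypothesis files here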

Other activity

 * Emailed Marcel with info/directions to the log file that I created when I was going through the Robust Group Tutorial
 * Emailed Erol to ask if he could take a look at the Experiment section above to figure out why I can't decode.

Week Ending Oct 8, 2014
Things I did


 * Emailed Erol again asking for help with my decode.
 * Backed up a copy of the Sphinx 3 adaptation documentation.

Things I need to know
 * If my task is to adapt the existing most successful experiment to Marcel's captured audio, I need reproducible steps to recreate and verify the most successful train.
 * I tried using the wiki to find this information. I found lots of decode results, often with configuration parameters, but the students were often using a script (one that sometimes changed partway through the semester), and the exact script they were using wasn't documented, which makes reproducing their results a guessing game.

Week Ending Oct 15, 2014
Ran a basic decode (test on train) on the first_5hr corpus.

I actually ended up running it twice because I accidentally deleted the log file and I want to go through it. After that, I messed around with the results to recreate the scoring process Dr. J showed us last week. I was successful, though I'm not 100% comfortable with the process yet; I'll document the steps next time now that I know I can decode and score.
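For when I write up those steps: I believe the scoring is done by SphinxTrain's word_align.pl, which compares the reference transcription against the decoder's .match hypothesis file. A rough sketch (the script path and the transcription file name are assumptions on my part; I haven't verified them on caesar yet):

# Hypothetical scoring step: align reference transcripts against decoder hypotheses.
perl scripts_pl/decode/word_align.pl etc/001_test.transcription result/001-1-2.match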

Results for test on train, first_5hr corpus: SER 95.7%, WER 51.3%.

Next Step: Reproduce Erol's best results and begin adaptation off of those?

Week Ending Oct 22, 2014
Ran the following command, adding the -lw, -beam, and -wbeam parameters. Trying to determine which parameter causes the slave script to fail. Logging to ...0256/001/logdir/decode/001b.log

nohup /usr/local/bin/sphinx3_decode \
  -hmm /mnt/main/Exp/0256/001/model_parameters/001.cd_cont_1000 \
  -dict /mnt/main/Exp/0256/001/etc/001.dic \
  -fdict /mnt/main/Exp/0256/001/etc/001.filler \
  -lm /mnt/main/Exp/0256/001/LM/001.lm.DMP \
  -ctl /mnt/main/Exp/0256/001/etc/001_test.fileids \
  -cepdir /mnt/main/Exp/0256/001/feat \
  -cepext .mfc \
  -hyp /mnt/main/Exp/0256/001/result/001-1-2.match \
  -lw 10 -beam 1e-80 -wbeam 1e-40 \
  >& logdir/decode/001b.log &

These options gave only a marginal improvement over 001a: WER 50.3%, SER 95.6%.

10/20/14 Emailed Marcel re: language models and dictionaries. He had a question about missing words when trying to adapt the existing model. I **think** any missing words will have to be added to both the .dic file and the language model. I **think** Sphinx will dump a list of missing words to make this easy, but I don't really know. Deferring to Erol, but I think this is new ground for the UNH Speech Academy.
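In the meantime, a rough way to check for mismatches myself (just a sketch: it assumes a text ARPA copy of the language model is available rather than only the .DMP, and the file names are placeholders):

# List 1-gram words from the ARPA LM that have no entry in the dictionary.
sed -n '/\\1-grams:/,/\\2-grams:/p' 001.lm | awk 'NF >= 2 {print $2}' | sort -u > lm_words.txt
awk '{print $1}' 001.dic | sed 's/(.*)$//' | sort -u > dic_words.txt
comm -23 lm_words.txt dic_words.txt    # words in the LM but not in the dictionary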

001c: added the rest of the parameters from the slave script except -ctlcount 0, which appears to tell sphinx3_decode to process 0 utterances (it defaults to 1 billion if you don't specify it; I couldn't find any documentation saying 0 means infinity).

nohup /usr/local/bin/sphinx3_decode \
  -hmm /mnt/main/Exp/0256/001/model_parameters/001.cd_cont_1000 \
  -dict /mnt/main/Exp/0256/001/etc/001.dic \
  -fdict /mnt/main/Exp/0256/001/etc/001.filler \
  -lm /mnt/main/Exp/0256/001/LM/001.lm.DMP \
  -ctl /mnt/main/Exp/0256/001/etc/001_test.fileids \
  -cepdir /mnt/main/Exp/0256/001/feat \
  -cepext .mfc \
  -hyp /mnt/main/Exp/0256/001/result/001-1-2.match \
  -lw 10 -beam 1e-80 -wbeam 1e-40 \
  -wip 0.2 -agc none -varnorm no -cmn current \
  >& logdir/decode/001c.log &

Week Ending Oct 29, 2014
Created symlinks for the NOAA split corpora and divided the transcript.

Reorganized my experiment documentation on FOSS and the 0256 folder on caesar to match the specifications Dr. Jonas gave last week.

Went through Marcel's experiments in 0257. They are much better organized than mine and self-documenting to a much higher degree, so I'm going to use his setup from now on.

My experiments in 0256 have 213 words in the language model that aren't present in the dictionary. These are mostly partial words in the transcript. Most (all?) of my missing partial words do appear to be present in Erol's dictionary in 0253/B12, so I'm left wondering how my dictionary was generated and where it came from. From the train, perhaps? Understanding whatever automated tools update the language model and dictionary seems like it would be important for understanding adaptation. The snippet below pulls the pronunciation for each transcript word out of cmu07.dic into my.dic:

rm -i my.dic
foreach word (`cat noaa.trans | sed 's/ /\n/g' | sort | uniq`)
  grep "^$word " /path_to/cmu07.dic >> my.dic
end

Week Ending Nov 5, 2014
Attempted to improve results in 0256/004 by using Erol's best language model; the results were not impressive.

Recreated Marcel's Experiment 0257/012. Matched his results, but I still need to dig a little deeper to understand exactly what is going on.

To do: more research about adaptation; outline the paper.

Week Ending Nov 12, 2014
Recreated the NOAA language model as part of a larger effort to make sure that experiment 0259 is entirely reproducible (confirmed that it works identically in 002).
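For the reproducibility write-up, this is roughly the CMU-Cambridge SLM toolkit pipeline I understand Speech:Create_LM to describe; a sketch only, with placeholder file names, and the DMP-conversion tool name is my assumption rather than something I've verified on caesar:

# Build an ARPA language model from the cleaned NOAA transcript (placeholder names).
cat noaa.trans | text2wfreq | wfreq2vocab > noaa.vocab
cat noaa.trans | text2idngram -vocab noaa.vocab -idngram noaa.idngram
idngram2lm -vocab_type 0 -idngram noaa.idngram -vocab noaa.vocab -arpa noaa.arpa
# Convert the ARPA model to the DMP format used by sphinx3_decode (tool name assumed).
lm3g2dmp noaa.arpa .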

Experiment 003 removes all references to 0257. I still have links to 0253.

I had been working off of Marcel's original transcript, which used numerals instead of spelled-out words (84 instead of eighty four), so I updated that. I still need to fix the transcripts in the noaa/half corpora, and I think I need to adjust those anyway because the break in terms of word count is around 33%/66% instead of 50%/50%. I will consult with Dr. Jonas this week.

Week Ending Nov 19, 2014
The NOAA full corpus appeared to have bad data in the first 70 utterances/20 minutes, so I created new subsets, noaa/40min_split/adapt and noaa/40min_split/test, using the data from the end.

I ran baselines on these new corpora and haven't done much else.

It doesn't look hopeful that I'll be able to continue my research next semester due to some major life changes.

Week Ending Nov 26, 2014
No meeting this week due to Thanksgiving break.

When I broke the NOAA_40min corpus into two pieces, the difference between the two was very high (1800 words vs 4600 words, and 25% vs 35% WER, when the corpus was split into approximately equal audio lengths). Spoke to Dr. J and Marcel about the corpora and the procedure for dividing a corpus. Marcel added that the NOAA full corpus represents several different recording sessions, so I am going to alternate utterances to make the split corpora.
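A minimal way to do the alternating split on the transcript (a sketch; noaa.trans, adapt.trans, and test.trans are placeholder names, and the fileids/audio lists would need the same treatment):

# Split the transcript into two sub-corpora by alternating utterances.
awk 'NR % 2 == 1' noaa.trans > adapt.trans   # odd-numbered lines
awk 'NR % 2 == 0' noaa.trans > test.trans    # even-numbered lines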

For all the work I've done, I'm still learning new things about Sphinx. For example, I've been linking to Marcel's feature files under the incorrect assumption that they were audio files. Now I understand that isn't the case, so for my experiments moving forward I'm going to generate my own features and analyze those.

I'm ending this entry with some code snippets I've been using so I can refer back to them later.

Vim command to delete every other line in a file: :g/^/+d

Using gawk to sum file sizes to help me group them appropriately. Example: > ls -la audio/wav/NOAA_162.450-1[0-5]* | gawk '{ total += $5 }; END { print total }'

Sed to remove html style tags: > sed -e 's/<[^>]*>//g' file.txt

Sed to remove everything inside parenthesis (and the parentheses too!): > sed 's/([^)]*)//g' file.txt

Using tcsh wildcards to create links to odd-numbered files in the 100-200 range: > ln -s /mnt/main/corpus/noaa/full/audio/wav/NOAA_162.450-{1,2}?[1,3,5,7,9].wav .

Week Ending Dec 10, 2014
Generating fileids: > ls -la /mnt/main/corpus/noaa/40min_split/test/audio/wav | gawk '{print $8}' | sed 's/\.wav$//' > etc/noaa_40min_split_test.fileids

With the new method of splitting the NOAA corpus (alternating lines between the two sub-corpora instead of putting the first chunk of lines in one corpus and the second chunk in the other), the baselines were better: the difference between the two sub-corpora was only 1%.

Setting up adaptation: I made a new 40min_split/adapt audio folder and generated feats using:

sphinx_fe -argfile /mnt/main/Exp/0253/B12/model_parameters/012.cd_cont_10000/feat.params \
  -samprate 16000 \
  -c /mnt/main/Exp/0259/etc/noaa_40min_split_adapt.fileids \
  -di . -do . -ei wav -eo mfc -mswav yes

Overview of the adaptation process as I understand it from the Sphinx documentation (a rough command sketch follows the list).


 * 1) Generate features for the adaptation data using sphinx_fe, the audio files, and the original model's feat.params.
 * 2) Generate statistics using bw, which doesn't seem to be on caesar.
 * 3) Update the acoustic model using mllr_solve or mixw, depending on the documentation.
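A sketch of what I expect steps 2 and 3 to look like, based on the CMU Sphinx adaptation documentation. The paths, the dictionary/transcription file names, and several flag values here are assumptions I can't verify until bw and mllr_solve are available on caesar:

# Step 2 (assumed): accumulate observation statistics over the adaptation data with bw.
bw \
  -hmmdir /mnt/main/Exp/0253/B12/model_parameters/012.cd_cont_10000 \
  -moddeffn /mnt/main/Exp/0253/B12/model_parameters/012.cd_cont_10000/mdef \
  -ts2cbfn .cont. -feat 1s_c_d_dd -cmn current -agc none -varnorm no \
  -dictfn etc/noaa.dic -fdictfn etc/noaa.filler \
  -ctlfn etc/noaa_40min_split_adapt.fileids \
  -lsnfn etc/noaa_40min_split_adapt.transcription \
  -cepdir feat \
  -accumdir bwaccumdir

# Step 3 (assumed): solve for an MLLR transform of the Gaussian means from those statistics.
mllr_solve \
  -meanfn /mnt/main/Exp/0253/B12/model_parameters/012.cd_cont_10000/means \
  -varfn /mnt/main/Exp/0253/B12/model_parameters/012.cd_cont_10000/variances \
  -outmllrfn mllr_matrix \
  -accumdir bwaccumdir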