Speech:Spring 2016 Benjamin Leith Log



Week Ending February 9, 2016
Wednesday, 2/3/16  Today, in conjunction with my work last week, the modeling group tried to run a train. Since I have probably the most experience in this area at the moment, I took a lead role in trying to make that happen.
 * Task:


 * Results:

Training
Things did not go entirely smoothly. We conducted four sub-experiments before one succeeded. The issue was due mostly to my own confusion rather than any inherent problem with the scripts. Through the first four attempts at the experiment I was using the wiki as my guide, and in the wiki the initial step in running a train, once all directories are created by the user, uses the following command:

/mnt/main/scripts/user/prepareTrainExperiment.pl switchboard first_5hr/train (from: Here)

About this, the wiki has the following to say: "It takes one argument, which is the data corpus to use."

This is not actually correct. As it turns out, it takes two arguments: one for the corpus to use, and one for the sub-path to the segmentation of that corpus you want to train on. While that much was fairly obvious at first glance, what was not so obvious, and what took several iterations to figure out, was that these sub-paths are RELATIVE PATHS from the primary corpus directory, meaning first_5hr/train in this case is a path from "/mnt/main/corpus/switchboard", where the script already knows the "switchboard" corpus resides. This is not documented and is not terribly obvious, and it tripped us up majorly for several iterations. I'd advise we fix the script, or at least amend the wiki to make this clearer.
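To spell out what the script actually does with those arguments (a sketch of the behavior we observed, not of the script's internals):

/mnt/main/scripts/user/prepareTrainExperiment.pl switchboard first_5hr/train
# arg 1 (switchboard): corpus name, resolved to /mnt/main/corpus/switchboard
# arg 2 (first_5hr/train): segmentation path RELATIVE to that directory,
#   i.e. the train ends up reading /mnt/main/corpus/switchboard/first_5hr/train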

Language Modeling
With this addressed, I was then able to take care of the language model creation process without further issue.

Decoding and Scoring
Here is where we ran into problems again. While my decode did complete, I wasn't able to score it successfully. I ran sclite with the wiki's suggested command switches, as follows:

sclite -r 005_train.trans -h hyp.trans -i swb >> scoring.log

I received only the following output in my scoring.log file:

[bgl1@caesar etc]$ sclite -r 005_train.trans -h hyp.trans -i swb
sclite: 2.3 TK Version 1.3
Begin alignment of Ref File: '005_train.trans' and Hyp File: 'hyp.trans'
SYSTEM SUMMARY PERCENTAGES by SPEAKER
Segmentation fault (core dumped)

More research is needed into why sclite is failing. I may need to attempt another decode and sift through its log file, or go digging through the one I've already created for clues on why sclite won't function. I'm not far enough along to be concerned yet.
 * Plan:
 * Concerns:

Thursday, 2/4/16  A quick note on training speed: our 5-hour example train, once it finally ran, completed Wednesday night in just under four hours. This represents a potentially enormous speed improvement on Caesar's new hardware. We may be doing train/decode cumulatively in better-than-real-time! I haven't dug up the numbers just yet, but based on my own experience I'd expect this to degrade as we add training hours... five is pretty small. More testing will be needed.

Friday, 2/5/16  I didn't run anything myself on Friday, but I helped Jon run his own train in a new experiment. The creation of the new semester user accounts toasted my access to the previous folder (EXP 281 and all sub-experiments), and rather than reset the group access, Jon made a new experiment and tried to run it himself. Unfortunately he wasn't able to get the actual training to start. I worked with him over text to troubleshoot, but didn't have enough time to review it in full myself.

Sunday, 2/7/16  After reviewing Jon's Friday log, I want to dig into his experiment and see if I can help him out. It looks like we dropped the experiment at whatever roadblock Jon hit, and nobody picked it back up after that. I'd like to get at least one train to run before Wednesday's class. I haven't spoken to Ryan or James since Wednesday. The modeling group has adopted a weekly group-status email that we're using as a global "one-shot" update of all group activities, goals, thoughts, and intentions. The last one went out on Thursday. These will eventually make it into our group logs, and our portion of the report will likely be written chiefly from a summary of their contents. I'll send another group-status email sometime tomorrow to bring everyone abreast of our new progress and make sure Ryan, James, Jon, and I all stay on board and on the same page. (If you're reading, guys, stand by for that.)

Sunday, 2/7/16 (Update 2)  It looks like Matt (not a Modeling group member) beat us to the punch on the train/decode yesterday. His scoring went better than mine and completed just fine. In any case, he's now our proof of concept, and we'll endeavor to duplicate his result ourselves. I'm getting a little worried about some of the group being left in the dust on the train/decode process and their general understanding of what's happening. I'll need to address that in the group update mentioned above.

On Matt specifically: I want to talk to him about what was going to be our next step - duplicating last year's best result. I'll see about contacting him now, and I'll get the rest of the group to look at his log.

(Since I'm pretty sure you'll read this, nice work Matt!)

Week Ending February 16, 2016
Wednesday, 2/10/16  Having a decent handle on train/decode following last week's testing (completed 1st 5hr train on Tuesday), we (modeling group) were prepared to make some progress towards improving the train/decode scores and process.
 * Task:

The trouble was, where to start?

The Pragmatist's Week 3 Guide to Sphinx
Moving into the week, since I've ended up the de facto "acting group manager", I had to devise a plan. To understand what I'm getting at with subsequent tests, you'll need to understand my approach. Here's my take: progress in speech modeling, training, and decoding can be broadly divided into three categories, not counting data modeling/processing/validation, which we'll touch on in a bit. They are:

1. Acoustic Modeling
This includes everything that ties into the Sphinx-included utility "SphinxTrain", including the all-important sphinx_train.cfg. The values fed into (and generated by) SphinxTrain and the training process together form a set of distributions, vectors, and a whole bunch of other stuff, collectively called the "acoustic model". These models form the basis for our system's understanding of what we've given it in the past, on which it bases its expectations for future input. In other words, we are "building the brain". (And a better brain makes better guesses.) This is where our changes happen this week!

2. Language Modeling
This is the way our language is understood by Sphinx. It consists of a dictionary (a mapping of words to sounds) and statistics about which word sequences are likely, and, among other things, is combined with the AM to form a picture of the language Sphinx wants to understand. We haven't touched this, and we won't yet. But it will be relevant later. Keeping with our brain analogy: even a crappy brain with good information makes better guesses more often than an amazing brain fed bad information. For improvement purposes, think of the LM as the language facts fed into the brain we built.

3. Decoding and Scoring Process
Decoding is the stage where the magic happens. We take away Sphinx's transcript files and test it on how well it understands our language. The percentage of wrong guesses forms our "Word Error Rate" (WER), the metric against which our results are judged. The ultimate goal is to contort the AM and LM in such a way as to reduce the WER the decode produces. We might be able to speed up decoding somehow, too. The relevant details are stored in "sphinx_decode.cfg".
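For reference, the textbook WER formula (standard across the field, not anything specific to our setup) counts three kinds of errors against the reference transcript:

WER = (substitutions + deletions + insertions) / (number of words in the reference transcript)

So a 25% WER means roughly one error for every four words of reference transcript.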

Some combination of those three things governs anything and everything you could want to change about our setup. Simple, right?

Ignoring the "Why" and breezing over the "How"; Here is the "What":
 * Plan:

Things we'll definitely do (see the config sketch just below):
1) Reduce the sphinx_train.cfg convergence ratio
2) Enable SphinxTrain's variance normalization

Things we might do:
1) Try 128 as our density value
2) Anything the group mates come up with :P
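Here's roughly what those changes look like in sphinx_train.cfg. The variable names below are from the stock SphinxTrain config template; treat this as a sketch and double-check against our actual file before copying values:

# sphinx_train.cfg (excerpt) - our intended edits
$CFG_CONVERGENCE_RATIO = 0.001;    # down from ~0.004; Baum-Welch iterates longer
$CFG_VARNORM = 'yes';              # was 'no'; enables variance normalization
# and maybe later:
# $CFG_FINAL_NUM_DENSITIES = 128;  # the density value we're considering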

Alright, let's do it then!
Nope. Stop right there.

First, we have a problem: we don't really have a good target to aim for. Last year's group got their best result on a 250-hour train, and 250 hours is going to be far too unwieldy for us to test on (for one thing, it just takes forever to run a train on a corpus that size). So our first step will be to establish a baseline using their testing criteria (see last year's report; we're copying their experiment's values) on our 125-hour test data. With that, we should be well-equipped to start modifying values one at a time and seeing their individual effects, as opposed to "shotgunning" values and simply watching to see what sticks. This systematic approach will be slower, but I think the targeted effects will pay dividends later.

'''[We fired off the train at about 4pm Wednesday, and I handed the execution reins over to James Schumacher. Check his log for progress there.]'''

Thursday, 2/11/16  Helped James get a new sub-experiment going. Something happened to our previous train, and he needed to start a new one. See his log for more.

Friday-Sunday, 2/12-14/16  This permissions issue continues to frustrate us. We'll bring it up at the next meeting and see what Systems can do for us; it's slowing us down. The train completed over the weekend, and James, with a lot of trouble along the way, kicked off a questionably large (25 hour?) decode late Sunday night. I helped him, answering some general Linux questions and guiding him through the process. See his log for detail.

Tuesday 2/16/16  I meant to make this entry yesterday, but I think I owe you folks some detail on what exactly I'm trying to accomplish with the lower convergence ratio and variance normalization. I'm far from a speech recognition expert, so try to forgive liberties where I take them.

Baum, Welch & Markov
I'm going to be super-simplistic about this; there are great guides that explain it better than I can. Here's one from MIT: http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-345-automatic-speech-recognition-spring-2003/assignments/assignment8.pdf (about halfway down page six).

Baum-Welch is an iterative algorithm run on a Hidden Markov Model (the mathematical model underlying basically all speech recognition). In our case, it's what re-estimates the acoustic model's parameters from the training data. The "convergence ratio" is the threshold that decides when to stop: once an iteration improves the model's likelihood by less than this ratio, we call it converged. Right now we're at about .004, but I think we can do better by moving down to .001 (the lowest value recommended by Sphinx for decent training). This will make Baum-Welch run longer before we reach convergence and may improve prediction accuracy, with the outside chance that we overtrain the models. MIT says that if we run more than 15 iterations, we're overtraining. We'll need an experiment to see how many iterations it takes to get there. It is possible to hard-cap the trainer at 15 if we need to, but apparently that will torch our word error rate.
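To make the stopping rule concrete, here's a toy illustration with made-up likelihood numbers (my own sketch, not SphinxTrain's actual code):

#!/usr/bin/perl
# Toy Baum-Welch stopping rule: stop once the relative likelihood
# improvement between iterations drops below the convergence ratio.
use strict;
use warnings;

my $target_ratio    = 0.001;    # our proposed convergence ratio
my @log_likelihoods = (-9.8e6, -9.3e6, -9.1e6, -9.05e6, -9.043e6);

for my $i (1 .. $#log_likelihoods) {
    my ($prev, $curr) = @log_likelihoods[ $i - 1, $i ];
    my $ratio = abs( ($curr - $prev) / $prev );    # relative improvement
    printf "iteration %d: improvement ratio %.5f\n", $i, $ratio;
    if ($ratio < $target_ratio) {
        print "converged; Baum-Welch would stop here\n";
        last;
    }
}

With a lower target ratio, the loop simply runs more iterations before stopping, which is exactly the longer-training/possible-overtraining trade-off described above.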

Week Ending February 23, 2016
Thursday, 2/18/2016 Short one today, just checking in. James hasn't touched his train yet, but I'm sure generateFeats is done running by now. I will check up on it later if he doesn't beat me to it.

Side note: REALLY impressed with the cleverness of the Experiment Group log, and would like to duplicate it for ours if possible.

Friday-Sunday, 2/18-21/2016 Our train and decode ran with our modified parameters. I have some stuff to say here.

Training Speed Improvements
Training in this experiment completed in about 54 hours. This represents an improvement of approximately 12.9% over our previous training time. This is a huge deal in an area where we haven't historically made much progress, and it will probably save us hundreds of hours of training time over the course of the semester. I put this down to our enabling of variance normalization on our training data in sphinx_train.cfg. See my previous logs and those of James S. for more details and documentation on why I think this.

Decode Score Improvements
Decoding in our experiment completed in 5 hours and 24 minutes, which is not a noteworthy speed improvement. The Word Error Rate (WER), however, improved by 1.5%. That's not a large improvement, but it's worth keeping. I would conjecture that the lower convergence ratio (0.004 -> 0.001) is the primary cause. For details on why I think this, check my previous logs and documentation.

Tuesday, 2/23/2016

Summary Overview
Here's our best result now!

Week Ending March 1, 2016
Wednesday, 2/24/2016 After our noteworthy-but-disappointing result last week, we've hatched a new scheme to drive down error rate.

The High-Level Plan
We noticed that, with lm_create.pl in its default configuration, our vocabulary is capped at the default size of 20,000 words. Words that appear more than ten times in our transcript end up in the vocabulary file (to be matched against our dictionary at decode time), but only up to 20,000 total words. Our dictionary, on the other hand, contains 27,000 words. This means that, for essentially no reason, the vocabulary can never cover the full dictionary, so some words could never be evaluated correctly. The fact that we consistently create vocab files (*.vocab) that hit this 20,000-word cap suggests to me that, without this artificial limit, a larger vocab would be generated, perhaps giving us better results.

lm_create.pl consists of two parts:
 * text2wfreq - produces a file detailing how frequently each word appears in the transcript.
 * wfreq2vocab - produces a vocab file: a list of the words that appeared more than ten times.

Documentation of this is found here: http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html#text2wfreq
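Per that documentation, the standalone equivalent of the pipeline looks roughly like this (the file names are my own illustration, not necessarily what lm_create.pl uses internally):

cat train.trans | text2wfreq > train.wfreq             # count word frequencies
cat train.wfreq | wfreq2vocab > train.vocab            # default cap: top 20,000 words
cat train.wfreq | wfreq2vocab -top 30000 > big.vocab   # the change we're proposing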

Process
We'll change the language model creation script to include an additional argument.

Edit lm_create.pl to change the line: system( $folder."wfreq2vocab < tmp.wfreq > tmp.vocab" );

To be instead: system( $folder."wfreq2vocab -top 30000 < tmp.wfreq > tmp.vocab" );

Friday-Sunday, 2/26-28/2016 Our decode finished late in the weekend. Here's our new score.

We're making steady (but slow) progress in WER and training time. Our target for now will be 25% WER overall.

Also, I've started a little data-munching side project to help us better identify "sweet spots" in the trade-offs between Word Error Rate, training time, decode time, and unseen-data WER. It's not done, but below are some sample graphs using this (small and fairly limited, for now) dataset.



Monday, 2/29/2016 I added pictures! (see above)

Week Ending March 8, 2016
Wednesday, 3/2/2016

Corpus Invalidity And Discovered Errors
In the class meeting on Wednesday, the Data group revealed that they have discovered a substantial number of errors in the switchboard corpus data. This suggests that a fair amount of past experimentation is, to put it lightly, compromised. After discussing it with them, I determined that there's at least some sanity to the early part of our data, and it's on this basis that I'll propose a solution to salvage what we can. To start with, the breakdown of errors goes like this:

Existing Errors in Corpus, Placement and Orientation
Audio files, as you may know, come in the form of conversations (".sph"). These are split into utterance files (".utt"), each of which maps to a single speaker and, more importantly, to a single line in the transcript file (.trans). In the Data group's "data verification" work (combing through lines one at a time to listen to their corresponding audio), they determined that the first file to "fail" verification was utterance #32603 (corresponding to line 32603 in the transcript). These errors persisted for nearly 11,000 subsequent utterances. See their group log for more specific information if you need it.

For example's sake, our Corpus, following their audit, looks like this: [Corpus Data Table Coming]

Side Note: Data Verification Methodology
The Data group segmented the data into "blocks" of 2,500 files and has been checking the first utterance of each block against the transcript to determine the validity of the data in the corpus. It is entirely possible that smaller runs of bad utterances (say, a stretch of failures that starts and ends inside a single block, so it never lands on a checked file) could be missed by this method. It's not a perfect method, but it's the only way for them to keep the data manageable.

Resolution, and/or, The New Plan
The proposed resolution, therefore, was to create a new corpus out of the first 30,000 utterances, which sit comfortably inside the verified region (the errors start at utterance #32603).
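In transcript terms, that's just a line-count cut. A minimal sketch, assuming one utterance per transcript line (file names hypothetical; the Data group's actual scripts are in their logs):

head -n 30000 switchboard_full.trans > first_30k.trans   # keep only the verified region
wc -l first_30k.trans                                    # sanity check: should print 30000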


 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending March 22, 2016

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending March 29, 2016
Wednesday, 3/23/2016

New Scripts, New Corpus, New Problems
James, Jon, and, to a lesser extent, Ryan and I put in work over spring break (plus or minus the week or so on either end) to try to create the best possible audio and trans data from which to generate our new corpus. We weren't entirely successful.

Nothing works?
On arriving today, I discovered that feature generation doesn't seem to work anymore?! generateFeats.pl fails with several "Unknown Header Data" and "Unknown Machine Endian" error messages.

What Happened?
At first, it was hard to say exactly. The issue was brought to my attention by several Systems group members wondering why their trains/decodes had failed when they followed the wiki exactly as written. Initially, I chalked it up to either human error or something related to the corpus changes undertaken by the Data group. Neither got me very far, and I was left scratching my head. Eventually, working with several Systems group members, I was able to deduce the following: as part of James' sox-based conversation-to-utterance script, the conversation files (.sph) get broken up into smaller utterance files (also .sph files). In addition, James made a few changes to the audio data intended to improve performance, including breaking each speaker out into their own channel and producing "mono" (one-channel) data for a single speaker from the two-channel source conversations.
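For illustration, the core sox operation described above looks roughly like this (a sketch, not James' actual script; the file names and times are made up):

# pull speaker A (channel 1) out of a two-channel conversation, and cut one
# utterance, 3.2 seconds long, starting 12.5 seconds into the recording
sox sw02001.sph sw02001-A_0001.sph remix 1 trim 12.5 3.2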

Those audio file headers, when James is done with them, look like this:

sample_count -i 84670
sample_n_bytes -i 1
channel_count -i 1
sample_byte_format -s1 1
sample_rate -i 8000
sample_coding -s4 ulaw
end_head

... The old headers, of the utterance files before we did anything, look like this:

sample_count -i 85000
sample_n_bytes -i 2
channel_count -i 2
sample_byte_format -s2 01
sample_rate -i 8000
sample_coding -s3 pcm
end_head

As you can see, the format differs in more ways than just channel count: the sample width dropped from 2 bytes to 1, and the encoding changed from pcm to ulaw.
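If you want to inspect a file yourself, the SPHERE header is plain text in the first kilobyte, so either of these will show it (file name hypothetical):

head -c 1024 sw02001-A_0001.sph   # dump the raw SPHERE header
soxi sw02001-A_0001.sph           # or ask sox what it thinks the file contains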

How? What? How do we Fix it?
Yeah, good question. From what I've pieced together, the old sox-based utterance-generation script must have converted these files to a different encoding. Maybe that was a default of a previous sox version, but whatever the reason, and however we ended up this way, generateFeats clearly cannot cope with the new audio type. To correct this, James, Ryan, Tom, and I worked to create a new, super-small corpus (first_10_sentences), where we would use sox to convert the audio files to the correct format and then attempt a train on it. If that worked, it would conclusively prove that our feature generation issue was caused by the different audio encoding.

Did we do it?
It took a LOT of doing to backstep through sox and find the right combination of switches, but eventually it was Tom who actually found it. It worked, and generateFeats ran successfully, without errors, like normal.

It's over!
Maybe. I'm waiting on confirmation from James and on a reply to an email to Mike. Both Team Captain and Team Stark are waiting on my signal to train again, so we're working as fast as we can. More on that soon.

Thursday, 3/24/2016

Isn't that kind of important?
I realized that I forgot to add this yesterday. Here's the command to get sox to output files in the format that wave2feat.c is expecting:

sox [source file] --bits 16 --encoding signed-integer [output file]
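And a batch version for converting a whole directory of utterances (a sketch; the corpus path is hypothetical, adjust to the real location):

for f in /mnt/main/corpus/first_10_sentences/train/*.sph; do
    sox "$f" --bits 16 --encoding signed-integer "${f%.sph}_fixed.sph"
done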

Today, Jon finished his corpus creation scripts and coached the Data group a bit on how to use them. I'm mostly following what he's done; his log has more.

Week Ending April 5, 2016
Wednesday, 3/30/2016

Teams!
With our corpus issues resolved, we split into teams this week. Although I wasn't totally prepared, I ended up essentially holding a brief "Here's how we, you know, do everything" sort of lecture/training session with the whole team. It was pretty productive I think, and we left with a set of tasks that I'm not going to put here (because enemy spies are certainly already among us).

 * Task: Win.
 * Results: Not winning yet.
 * Plan: To win. How can we win? What if we don't win?
 * Concerns:

Thursday, 3/31/2016 We've managed to improve [REDACTED] dramatically, and this has put us in a good position to start winning. We are currently verifying the thing.

Week Ending April 12, 2016
Thursday, 4/7/2016

Meeting, Plans?
Productive meeting yesterday, with a few more values and optimizations picked out of sphinx_train.cfg... I've got a few new leads and have been a lot more successful at getting other people to participate in the discovery process. Realistically, if all the research is my own, we're definitely going to lose. Jon and James are competent guys, and this is an uphill battle for sure.

That said, we have at minimum 2 trains running and about 5 more planned. I'd put detail here, but I won't.

What I will put here is some other research I've done.

Feature Generation Theory
Because I tend to be curious that way, I looked up a few academic papers on how feature generation actually works. The answer is predictably "mathy", and decidedly more complex than just measuring the height of the waveform as I'd previously assumed.

Here's one paper

Broadly-speaking, the important bits for us are:
 * Right now we're using MFCC (Mel-Frequency Cepstral Coefficients)
 * Sphinx supports others

According to this paper, there are alternatives that Sphinx could certainly be made to use, but they're not a whole lot better: in a perfect system they might move us one or two percent at most. Maybe a big deal later, but pretty small potatoes for now. It also looks like it would be a hell of a job to change the feature generation type. I can't fully follow all the math going on under the hood there, but I have at least an idea.

I shared it with the group, just for posterity. For now I expect nothing to come of it.

 * Task: Win.
 * Results: Not winning yet.
 * Plan: To win.

Week Ending April 19, 2016
Wednesday, 4/13/2016

Progress and Planning
Now that unseen decodes are running successfully, we have a few ideas for what to run next. We've achieved a pretty solid WER that I won't share at the moment, and our next task will be to brief the team on how to run an unseen decode in what we currently consider the "best possible" way (which has changed a bit). There also might be a trade; more details to follow.

Note: No trade. More progress.

Tuesday, 4/19/2016  Synced up with Jon and James on URC poster presentation plans. I plan to be available starting at 1pm.

Week Ending April 26, 2016
Wednesday, 4/20/2016 Met with group. Found good literature and potential new changes to try to improve WER. Will work more closely with Matt and others in the coming days to dole out responsibilities.

Week Ending May 3, 2016
Wednesday, 4/27/2016

Combined Team Capstone!
Today we combined Team Stark and Team America. This makes us approximately 300% as awesome as we were before. Matt and Ryan, who were not in class, were informed, as were Aaron, Nigel, and anyone else who missed the event. We were pretty short-handed today, so this is a welcome event for us. (Long live our glorious union.)

Future Plans
In conjunction with Jon and James, I'm working to create a new overall plan for all teams, which will enable us to allocate our resources as efficiently as possible in the coming weeks. We're also starting to discuss the finer points of creating the End of Semester report. I'll be using previous semesters as a reference.

Week Ending May 10, 2016

 * Task:


 * Results:


 * Plan:


 * Concerns: