Speech:Exps 0078

Description
Author: Tyler Martin
Date: 4/9/13 - 4/10/13

Purpose: To run a train, create a language model, decode it and score it using the new first_5hr transcript.

Details: The Spring 2013 Data group has found a complete set of transcripts for the Switchboard corpus. This set of transcripts brings our total amount of available audio data to about 308 hours. Our goal for this experiment is to take the first 5 hours of this 308 hours' worth of data and show that we can train, decode, and score a model on it.

Results

When creating the setup for the train, I noticed that in experiments 0074 and 0075 the other group had developed a new dictionary script. The promise of the script was that it runs faster and finds the words that are not in the main dictionary. The script was indeed faster, and it helpfully reported how many words I was missing and wrote a text file listing them. According to the script, I was missing 111 words from the dictionary. After about an hour or so I was able to find pronunciations for all of the missing words using the CMU Pronouncing Dictionary. (Note: It is best to try breaking words apart.)
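The missing-word check and the "break words apart" tip can be sketched in a few lines of Python. This is not the dictionary script from experiments 0074 and 0075, only a minimal illustration. It assumes a dictionary file with one entry per line (the word followed by its phonemes) and a transcript of whitespace-separated words. The function names are hypothetical.

```python
def load_dictionary(lines):
    """Map each word (upper-cased) to its phoneme list.

    Assumes one entry per line: WORD PH1 PH2 ...
    """
    pron = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # Keep the first pronunciation seen for a word.
        pron.setdefault(parts[0].upper(), parts[1:])
    return pron


def missing_words(transcript_lines, pron):
    """Return the sorted set of transcript words absent from the dictionary."""
    missing = set()
    for line in transcript_lines:
        for token in line.split():
            word = token.upper()
            if word not in pron:
                missing.add(word)
    return sorted(missing)


def split_pronunciation(word, pron):
    """Try to build a pronunciation by splitting a word into two known words.

    Returns the joined phoneme list for the first split that works,
    or None if no split does. The result is only a starting guess and
    should be checked by hand.
    """
    word = word.upper()
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in pron and right in pron:
            return pron[left] + pron[right]
    return None


dictionary = load_dictionary([
    "HELLO HH AH L OW",
    "WORLD W ER L D",
    "BOOK B UH K",
    "CASE K EY S",
])
todo = missing_words(["hello bookcase world"], dictionary)
guess = split_pronunciation("bookcase", dictionary)
```

Here `todo` holds the one word that still needs an entry, and `guess` is the pronunciation obtained by joining the entries for its two halves.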

The following words needed to be added to my dictionary:

When going to run the train, I ran into the same problem that the other combined groups did: the begin- and end-aside markers (the E probably standing for "end aside") were still in the transcript. Using nano, I was able to search for the asides and edit out just the markers. After removing these I was able to run the train.
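For a larger transcript, the same cleanup could be scripted instead of done by hand in nano. The sketch below is an assumption-laden illustration, not what was run in this experiment. It assumes the markers are spelled `<b_aside>` and `<e_aside>` (the exact spelling should be checked against the transcript), and that only the markers are removed while the spoken words between them are kept.

```python
import re

# Assumed marker spelling; adjust the pattern to match the actual transcript.
ASIDE_MARKER = re.compile(r"<[be]_aside>", re.IGNORECASE)


def find_aside_lines(lines):
    """Return the 1-based numbers of lines that contain an aside marker."""
    return [n for n, line in enumerate(lines, start=1)
            if ASIDE_MARKER.search(line)]


def strip_aside_markers(line):
    """Remove aside markers from one line and collapse leftover spaces."""
    cleaned = ASIDE_MARKER.sub(" ", line)
    return " ".join(cleaned.split())


sample = [
    "i think <b_aside> hold on <e_aside> it works",
    "no markers here",
]
flagged = find_aside_lines(sample)
cleaned = [strip_aside_markers(line) for line in sample]
```

Listing the flagged lines first makes it easy to review each aside before rewriting the file.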

After the train completed successfully, I moved on to creating the language model. This step was easy and produced no errors.

Moving on to the decode, I had no issues either and let it run all night. Once it was done, I moved on to the final step of scoring my experiment. To my surprise, no errors occurred as they had in earlier experiments, and the log was successfully created. The results of my scoring are below: