First_5Hr Train, LM, Decode and Score
Author: Tyler Martin
Date: 4/9/13 - 4/10/13
Purpose: To run a train, create a language model, decode, and score using the new first_5hr transcript.
Details: The Spring 2013 Data group has found a complete set of transcripts for the Switchboard corpus. This set of transcripts brings our total amount of available audio data to about 308 hours. The goal of this experiment is to take the first 5 hours of those 308 hours of data and train, decode, and score a model on it.
Results: While creating the setup for the train, I noticed that in experiments 0074 and 0075 the other group had developed a new dictionary script. The script promised to run faster and to find the words not in the main dictionary. It was indeed faster, and it helpfully reported how many words I was missing along with a text file listing those words. According to the script, 111 words were missing from the dictionary. After about an hour I had pronunciations for all of them, using the CMU Pronouncing Dictionary. (Note: It is best to try breaking words apart into pieces that can be looked up.)
The following words needed to be added to my dictionary:
007 Z IH1 R OW0 Z IH1 R OW0 S EH1 V AH0 N
101 W AH1 N Z IH1 R OW0 W AH1 N
60 S IH1 K S T IY0
ABC AH0 B IY1 S IY1
ABOUTS AH0 B AW1 T S
ABRODE AH0 B R OW1 D
ALBRIDGE AO1 L B R IH1 JH
ALRIGHTY AO2 L R AY1 T Y
ALWAY AO1 L W EY2
AMA AE1 M AH0
ANANIMOUS AE1 N AE1 N IH0 M AH0 S
AOR AH0 AO1 R
ATM AH0 T IY1 EH1 M
ATM'S AH0 T IY1 EH1 M EH1 S
BABYSITTINGWISE B EY1 B IY0 S IH2 T IH0 NG W AY1 Z
BASTROP B AE1 S T R OW1 P
BSING B IY1 EH1 S IH1 NG
CAMPANY K AH1 M P AH0 N IY0
CASSEROLES K AE1 S ER0 OW2 L Z
CBS S IY1 B IY1 EH1 S
CHAPE SH EY1 P
CHOWPHERD CH AW1 P ER0 D
COGNIZITIVE K AA1 G N AH0 Z IH0 T IH0 V
COMPOOTER K AH0 M P Y UW1 T ER0
CONDITIONER'S K AH0 N D IH1 SH AH0 N ER0 Z
CPA S IY1 P IY1 AH0
CONQUISTADORS K AA1 N K W IH1 S T AA0 D AO1 R Z
CULOTTES K AH1 L AA1 T S
DC D IY1 S IY1
DC'S D IY1 S IY1 EH1 S
DIESEL'S D IY1 S AH0 L Z
DISEAL D IY1 S IY1 L
DMC D IY1 EH1 M S IY1
DUCTWORK D AH1 K T W ER1 K
EDS IY1 D IY1 EH1 S
ELLSWORTH'S EH1 L Z W ER0 TH EH1 S
ENVIRONMENTALS IH0 N V AY2 R AH0 N M EH1 N T AH0 L S
ESPN IY1 EH1 S P IY1 EH1 N
EXERCISING'S EH1 K S ER0 S AY2 Z IH0 NG EH1 S
FEDERALDES F EH1 D ER1 AH0 L IY1 S
FINALIZATION F AY1 N AH0 L AY2 Z EY1 SH AH0 N
FREON'S F R IY1 AA0 N EH1 S
GESTALT JH IY1 AH0 S T EY1 T AA1 L T
GM JH IY1 EH1 M
GPA JH IY1 P IY1 AH0
GRE JH IY1 AA1 R IY1
GREENWARE G R IY1 N W EH1 R
GROUND'S G R AW1 N D EH1 S
GTE JH IY1 T IY1 IY1
HEARTRENDING HH AA1 R T R EH1 N D IH0 NG
HM EY1 CH EH1 M
HOUSEPAINTER HH AW1 S P EY1 N T ER0
IBM AY1 B IY1 EH1 M
IBS AY1 B IY1 EH1 S
JFK JH EY1 EH1 F K EY1
JUNKINS' JH AH1 NG K IH0 N Z
KGB K EY1 JH IY1 B IY1
KLIF K EY1 L IY1 EH1 F
LER L ER0
LEWISVILLE L UW1 IH0 S V IH1 L
LISTENABLE L IH1 S AH0 N EY1 B AH0 L
LOA L OW1 AH0
LUCKED L AH1 K T
MAHALS M AH0 HH AA1 L S
MARKOV M AA1 R K AH1 V
MAY'VE M EY1 V IY1
MOOSEWOOD M UW1 S W UH1 D
MOUNTAINEERING M AW1 N T IH0 N IH2 R IH0 NG
MTV EH1 M T IY1 V IY1
NBC EH1 N B IY1 S IY1
NONSOLICITED N AA1 N S AH0 L IH1 S IH0 T IH0 D
NOX EH1 N AA1 K S
OGEN OW1 JH EH1 N
PBS P IY1 B IY1 EH1 S
PC P IY1 S IY1
PCS P IY1 S IY1 Z
PHD P IY1 EY1 CH D IY1
POSTGRADUATE P OW1 S T G R AE1 JH AH0 W AH0 T
PRICEWISE P R AY1 S W AY1 Z
PTA'S P IY1 T IY1 AH0 Z
PURIFIES P Y UH1 R AH0 F AY1 Z
RV AA1 R V IY1
RVER AA1 R V IY1 ER0
SCYENE S K Y EH1 N IY0
SHANKAR SH AE1 NG K AA0 R
SHEETROCK SH IY1 T R AA1 K
SKORTS S K AO1 R T S
SMUSH S M AH1 SH
SOCIOPATHIC S OW1 S IY0 OW0 P AE2 TH IH0 K
SOUTHBEND S AW1 TH B EH1 N D
SPOOKIER S P UW1 K IY0 ER0
SUPERVISOR'S S UW2 P ER0 V AY1 Z ER0 Z
TACB T AE1 K B
TENSIONED T EH1 N SH AH0 N EH1 D
TEXTELLER T EH1 K S T EH1 L ER0
THANKSGIVING'S TH AE2 NG K S G IH1 V IH0 NG EH1 S
TI'S T IY1 EH1 S
TRANGED T R EY1 N JH D
TREE'S T R IY1 EH1 S
TVS T IY1 V IY1 S
UNAIR AH1 N EH1 R
UNSEASONAL AH1 N S IY1 Z AH0 N AH0 L
UNTYPICALLY AH1 N T IH1 P IH0 K L IY0
VACATION'S V EY0 K EY1 SH AH0 N EH1 S
VEGAN V EH1 G AH0 N
VET'S V EH1 T EH1 S
VOCALIZED V OW1 K AH0 L AY2 Z D
WANT'S W AA1 N T EH1 S
WEATHERWISE W EH1 DH ER0 W AY1 Z
YELLER Y EH1 L ER0
YUPPYISH Y AH1 P IY0 IH1 SH
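For anyone repeating this step, the kind of check the dictionary script performs can be sketched roughly as follows. This is not the script from experiments 0074 and 0075. The function names are hypothetical, and the sketch assumes a CMU-style dictionary (one "WORD PH1 PH2 ..." entry per line) and transcript lines of the form "<s> words </s> (utterance_id)".

```python
import re

def load_dictionary_words(dict_lines):
    """Collect the head words from CMU-style lines: WORD PH1 PH2 ..."""
    words = set()
    for line in dict_lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        # Alternate pronunciations are written WORD(2); strip that suffix.
        words.add(re.sub(r"\(\d+\)$", "", parts[0]).upper())
    return words

def find_missing_words(transcript_lines, dictionary_words):
    """Return the sorted transcript words that have no dictionary entry."""
    missing = set()
    for line in transcript_lines:
        # Assumption: each line may end with an utterance ID in parentheses.
        text = re.sub(r"\([^()]*\)\s*$", "", line)
        for token in text.split():
            if token.startswith("<") and token.endswith(">"):
                continue  # skip markers such as <s>, </s>, <sil>
            if token.upper() not in dictionary_words:
                missing.add(token.upper())
    return sorted(missing)

# Small in-memory example (made-up lines, not data from this experiment).
dictionary = load_dictionary_words([
    "HELLO HH AH0 L OW1",
    "WORLD W ER1 L D",
    "WORLD(2) W ER1 L D",
])
missing = find_missing_words(["<s> hello atm world </s> (utt_001)"], dictionary)
print(len(missing), missing)
```

The count and the word list printed at the end mirror the two things the real script reports: how many words are missing and which ones.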
When I went to run the train, I ran into the same problem the other combined groups did: the <B_ASIDE> and <E_ASIDE> tags (the E probably standing for "end aside") were still in the transcript. Using nano, I searched for the asides and edited out just those tags. After removing them I was able to run the train.
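The manual cleanup in nano could also be scripted. Below is a minimal sketch, assuming only the <B_ASIDE> and <E_ASIDE> markers themselves are removed and the words between them are kept. The function name and the sample line are hypothetical.

```python
import re

# Matches either aside marker plus any whitespace around it.
ASIDE_TAG = re.compile(r"\s*<[BE]_ASIDE>\s*")

def strip_aside_markers(line):
    """Remove <B_ASIDE>/<E_ASIDE> markers, keeping the words between them.

    The returned line has no leading or trailing whitespace, so add the
    newline back when writing the cleaned transcript to a file.
    """
    return ASIDE_TAG.sub(" ", line).strip()

# Made-up transcript line for illustration.
sample = "<s> i think <B_ASIDE> hold on <E_ASIDE> it was fine </s> (utt_001)"
print(strip_aside_markers(sample))
```

Running the same function over every line of the transcript and writing the results to a new file leaves the original untouched in case the edit needs to be redone.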
After the train completed successfully, I moved on to creating the language model. This step was easy and produced no errors.
The decode also ran without issues, and I let it run all night. Once it was done I moved on to the final step, scoring my experiment. To my surprise, no errors occurred this time, unlike in earlier attempts, and the log was created successfully. The results of my scoring are below:
|--------------------------------------------------------------------|
| SPKR    | # Snt   # Wrd |  Corr    Sub    Del    Ins    Err  S.Err |
|====================================================================|
| Sum/Avg |  4659   70274 |  68.1   24.1    7.8   13.8   45.7   97.0 |
|====================================================================|
| Mean    |  58.2   878.4 |  68.0   24.6    7.5   15.0   47.0   97.4 |
| S.D.    |  22.1   336.4 |   7.4    5.9    2.7    6.6   10.9    3.8 |
| Median  |  55.5   826.0 |  68.3   24.3    7.3   14.0   45.5   98.2 |
`--------------------------------------------------------------------'
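As a sanity check on the Sum/Avg row, the columns can be related with a little arithmetic. The sketch below assumes the usual scoring convention that Err = Sub + Del + Ins and that Corr + Sub + Del accounts for 100% of the reference words. The Sum/Avg figures above are consistent with both. The function names are hypothetical.

```python
def word_error_rate(sub, deletions, ins):
    """Word error rate in percent, assuming Err = Sub + Del + Ins."""
    return round(sub + deletions + ins, 1)

def reference_coverage(corr, sub, deletions):
    """Corr + Sub + Del should account for every reference word (100%)."""
    return round(corr + sub + deletions, 1)

# Percentages copied from the Sum/Avg row of the table.
corr, sub, deletions, ins = 68.1, 24.1, 7.8, 13.8

err = word_error_rate(sub, deletions, ins)
print("Err:", err)
print("Corr + Sub + Del:", reference_coverage(corr, sub, deletions))
```

Because insertions count as errors but not as reference words, Err can be larger than 100 minus Corr. Here Err is 45.7% while Corr is 68.1%.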