Author: Spring 2013: Groups B&C
Date: 4/8 - 4/9/13
Purpose: Create an acoustic model on a 5-hour set of new corpus data.
Details: The Spring 2013 Data group has found a complete set of transcripts for the Switchboard corpus. This set of transcripts brings our total amount of available audio data to about 308 hours. Our goal for this experiment is to take the last 5 hours of these 308 hours of data and create an acoustic model from them. This experiment also serves as a test for the new pruneDictionary2.pl and dictionary2.pl scripts, which drastically reduce the time it takes to create an experiment dictionary and simplify the process of resolving missing dictionary entries by producing a list of any missing words when the script completes.
The acoustic model created from this experiment will be tested in Experiment 0075. This is done to simplify the decoding process.
Results: This experiment was a group effort, with various group members completing steps in the process. Most steps ran without any unusual issues.
The new dictionary scripts were highly effective in cutting down the time needed to build the experiment dictionary; this will prove very helpful as the training sets get larger. In addition, the new script's ability to create a list of words missing from the dictionary is invaluable. The script reported that we were missing 103 words. By creating this list early in the week, we were able to assign a set of words to each team member to define, and we subsequently made quick work of the list.
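The missing-word list and the split across team members can be illustrated with a short Python sketch. This is not the logic of dictionary2.pl, which is not reproduced here; the function names `find_missing_words` and `assign_words` are hypothetical, and the sketch assumes whitespace-separated transcript tokens and case-insensitive matching.

```python
def find_missing_words(transcript_lines, dictionary_words):
    """Return a sorted list of transcript words that have no dictionary entry.

    Assumption: tokens are whitespace-separated and compared in upper case.
    """
    known = {word.upper() for word in dictionary_words}
    missing = set()
    for line in transcript_lines:
        for token in line.split():
            word = token.upper()
            if word not in known:
                missing.add(word)
    return sorted(missing)


def assign_words(missing, members):
    """Deal the missing words round-robin across team members (hypothetical helper)."""
    return {member: missing[i::len(members)] for i, member in enumerate(members)}
```

For example, `find_missing_words(["the ibm pc", "the vcr"], ["THE", "PC"])` yields `["IBM", "VCR"]`, and passing that list to `assign_words` with two member names gives each member one word to define.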
The Experiment dictionary required the following words to be added:
1500 F IH0 F T IY1 N HH AH1 N D R AH0 D
260 T UW1 S IH1 K S T IY0
286 T UW1 EY1 T IY0 S IH1 K S
2CI T UW1 S IY1 AY1
386 TH R IY1 EY1 T IY0 S IH1 K S
401K F AO1 R OW1 W AH1 N K EY1
9050 N AY1 N T IY0 F IH1 F T IY0
990 N AY1 N N AY1 N T IY0
AARP AH0 AH0 AA1 R P IY1
ABC AH0 B IY1 S IY1
ALZHEIMERS AE1 L Z IY1 HH AY1 M ER0 S
ANTEKS AE1 N T EH1 K EH1 S
ANTIAMERICANISM AE1 N T IY0 AH0 M EH1 R IH0 K AH0 N IH2 Z AH0 M
APLIPLATIONS AH0 P IH1 L P L AA1 T IY1 AY1 AH0 N EH1 S
ATM AH0 T IY1 EH1 M
AT&T AH0 T AH0 N D T IY1
AT&T'S AH0 T AH0 N D T IY1 EH1 S
AUTOMETER AO1 T OW0 M IY1 T ER0
BACKSEATS B AE0 K S IY1 T S
BITISH B IH1 T F IH1 SH
BLOW'S B L OW1 EH1 S EH1 S
CEO S IY1 IY1 OW1
COP'S K AA1 P EH1 S EH1 S
COTR K OW1 T IY1 AA1 R
CRX S IY1 AA1 R EH1 K S
CU S IY1 Y UW1
CWA S IY1 D AH1 B AH0 L Y UW0 AH0
DAHLMER D AA1 L M ER0
DC D IY1 S IY1
DCOM D IY1 K AA1 M
DEBITING D EH1 B IH0 T IH1 NG
DFM D IY1 EH1 F EH1 M
DPMA D IY1 P IY1 EH1 M AH0
DRAWED D R AO1 EH1 D
EINSTEINS AY1 N S T AY1 N S
EMULATIONS EH1 M Y AH1 L EY1 SH AH2 N S
EUROVAN Y UW1 R OW1 V AE1 N
FBI EH1 F B IY1 AY1
FEMTOMETER F EH M T OW s M IY T ER
FIDONET'S F AY1 D OW0 N EH1 T H1 S EH1 S
GABLERS JH IY1 AE1 B L AH0 AA1 R EH1 S
GEOS JH IY1 IY1 AA0 S
GLUING JH IY1 L UW1 IH1 NG
GM JH IY1 EH1 M
HM EY1 CH EH1 M
HMOS EY1 CH EH1 M OW1 S
IBM AY1 B IY1 EH1 M
INDEPTH IH0 N D EH1 P TH
IRS AY1 AA1 R EH1 S
ISDN AY1 EH1 S D IY1 EH1 N
ITERATIONS IH1 T ER0 EY1 SH AH0 N Z
LAOTIANS L EY1 OW1 SH AH0 N Z
MAJORLY M EY1 JH ER0 L IY1
MECHANIC'S M AH0 K AE1 N IH0 K Z
MIATAS M IY1 AH0 T AA1 S
MONONUCLEOSIS M OW1 N OW0 N UW1 K L IY1 OW0 S IH1 S
MORTICES M AO1 R T AY1 S EH1 S
NIH EH1 N AY1 EY1 CH
NONSALARY N AA1 N S AE1 L ER0 IY0
NUKING N UW1 K IH1 NG
PAVERS P EY1 V ER0 EH1 S
PAY'S P EY1 EH1 S EH1 S
PC P IY1 S IY1
PENTALTY P EH1 N AH0 L T IY0
PERSONALTIES P ER1 S AH0 N AE1 L T AY1 Z
PFM P IY1 EH1 F EH1 M
PLUSSES P L AH1 S S EH1 S
PPOS P IY1 P IY1 OW1 EH1 S
PRECARE P R IY1 K EH1 R
PREFIXES P R IY1 F IH0 K S EH1 S
PRERETIREMENT P R IY1 R IY0 T AY1 ER0 M AH0 N T
QUADRATRON K W AE1 D R AH0 T R AA1 N
RABBETING R AE1 B AH0 T IH1 NG
RCA AA1 R S IY1 AH0
RENOVATIVE R EH1 N AH0 V EY2 T IH0 V
ROTC AA1 R OW1 T IY1 S IY1
ROUTER R UW1 T ER0
SCARIOUSLY S K EH1 R IY0 AH0 S L IY0
SHEBANG SH IY0 B AE1 NG
SHEETROCK SH IY1 T R AA1 K
SHEETROCKING SH IY1 T R AA1 K IH1 NG
SIMULAR S IH1 M AH0 L ER0
SMATTERINGS S M AE1 T ER0 IH1 NG S
SPECTROGRAPHY S P EH1 K T R OW1 G R AE1 F AY0
SPLACE S P L EY1 S
SUBARUS S UW1 B ER0 UW0 S
SWEARED S W EH1 R EH1 D
TELECREDER T EH1 L IY0 K R EH1 D ER0
TR T IY1 AA1 R
TRESTING T R EH1 S T IH1 NG
TROPICALS T R AA1 P IH0 K AH0 L EH1 S
UCF Y UW1 S IY1 EH1 F
ULTRAWISE AH1 L T R AH0 W AY1 Z
UNDERGRADS AH1 N D ER0 G R AE1 D Z
UNFORSEEN AH1 N F AO1 R S IY1 N
UPROARIOUS AH1 P R AO2 R IY1 AH1 S
VCR V IY1 S IY1 AA1 R
VOCALIZED V OW1 K AH0 L AY2 Z D
VW V IY1 D AH1 B AH0 L Y UW0
WALKMAN'S W AO1 K M AE2 N . EH1 S . EH1 S
WEASONABLE W IY1 Z AH0 N AH0 B AH0 L
XT6 EH1 K S T IY1 EH1 S
Z248 Z IY1 T UW1 F AO1 R EY1 T
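Hand-entered pronunciations like those above are easy to mistype, so a quick sanity check on each entry may be worthwhile before training. The Python sketch below is only an illustration and is not part of the dictionary scripts. It assumes the CMUdict-style ARPAbet convention that the entries appear to follow, where every vowel carries a stress digit (0, 1, or 2) and consonants carry none; the phone sets and the function name `invalid_phones` are assumptions and should be adjusted to the phone list actually used in training.

```python
import re

# Assumed CMUdict-style ARPAbet phone set; adjust to the trainer's phone list.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}
CONSONANTS = {"B", "CH", "D", "DH", "F", "G", "HH", "JH", "K", "L", "M", "N",
              "NG", "P", "R", "S", "SH", "T", "TH", "V", "W", "Y", "Z", "ZH"}

# An upper-case phone symbol optionally followed by one stress digit.
PHONE_PATTERN = re.compile(r"([A-Z]+)([0-2]?)")


def invalid_phones(entry):
    """Return the phone tokens in 'WORD PH PH ...' that fail the assumed convention."""
    phones = entry.split()[1:]  # the first token is the word itself
    bad = []
    for phone in phones:
        match = PHONE_PATTERN.fullmatch(phone)
        if match is None:
            bad.append(phone)  # punctuation, lower case, or other stray text
            continue
        base, stress = match.groups()
        is_valid_vowel = base in VOWELS and stress != ""
        is_valid_consonant = base in CONSONANTS and stress == ""
        if not (is_valid_vowel or is_valid_consonant):
            bad.append(phone)
    return bad
```

Under these assumptions, `invalid_phones("IBM AY1 B IY1 EH1 M")` returns an empty list, while an entry containing a token such as `H1` or a stray `.` has those tokens returned for review.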
On first execution, we discovered that the transcript contained invalid text that neither the dictionary scripts nor genTrans2.pl filtered out. The offending tokens were "<A_ASIDE>" and "<B_ASIDE>"; we believe these are transcribers' notes that were accidentally left in.
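One way to handle such markers is to drop them from each transcript line before the dictionary and transcript scripts run. The Python sketch below is an assumption about how that could be done, not a description of what genTrans2.pl does; the name `strip_invalid_tokens` is hypothetical, and the sketch removes only the two marker tokens themselves, leaving any words spoken between them in place.

```python
# The two markers reported in this experiment; extend the set if others turn up.
INVALID_TOKENS = {"<A_ASIDE>", "<B_ASIDE>"}


def strip_invalid_tokens(line):
    """Remove marker tokens from one transcript line, keeping all other words."""
    kept = [token for token in line.split() if token.upper() not in INVALID_TOKENS]
    return " ".join(kept)
```

For example, `strip_invalid_tokens("I THINK <A_ASIDE> HOLD ON <B_ASIDE> SO")` returns `"I THINK HOLD ON SO"`. Whether the words inside an aside should also be removed is a separate decision that this sketch does not make.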
This experiment was finished in Experiment 0075.