Speech:Exps 0074


Last_5hr Train


Description

Author: Spring 2013: Groups B&C

Date: 4/8 - 4/9/13

Purpose: Create an acoustic model on a 5 hour set of new corpus data.

Details: The Spring 2013 Data group has found a complete set of transcripts for the Switchboard corpus, which brings our total amount of available audio data to about 308 hours. Our goal for this experiment is to take the last 5 hours of these 308 hours of data and create an acoustic model from it. This experiment also serves as a test for the new pruneDictionary2.pl and dictionary2.pl scripts, which both drastically reduce the time it takes to create an experiment dictionary and simplify the process of resolving missing dictionary entries by producing a list of any missing words when the script completes.

The acoustic model created from this experiment will be tested in Experiment 0075. This is done to simplify the decoding process.

Results

This experiment was a group effort, with various group members completing steps in the process. Most steps ran without any out-of-the-ordinary issues.

The new dictionary scripts were highly effective in cutting down the time needed to execute the script; this will prove very helpful as the training sets get larger. In addition, the new script's ability to create a list of words missing from the dictionary is invaluable. The script reported that we were missing 103 words. By creating this list early in the week, we were able to assign a set of words to each team member to define, and we subsequently made quick work of defining them.
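The missing-word report described above can be sketched roughly as follows. This is a minimal Python illustration, not the code of dictionary2.pl or pruneDictionary2.pl, whose internals are not shown on this page. The function name `find_missing_words`, the uppercasing of tokens, and the whitespace tokenization are all assumptions made for the example.

```python
def find_missing_words(transcript_lines, dictionary_words):
    """Return the sorted transcript words that have no dictionary entry.

    Assumes (hypothetically) that words are whitespace-separated and that
    the dictionary is keyed on uppercase spellings.
    """
    known = {word.upper() for word in dictionary_words}
    missing = set()
    for line in transcript_lines:
        for token in line.split():
            word = token.upper()
            if word not in known:
                missing.add(word)
    return sorted(missing)


if __name__ == "__main__":
    # Made-up transcript lines and a tiny made-up dictionary.
    transcript = ["the irs sent my 401k forms", "the vcr broke"]
    dictionary = ["THE", "SENT", "MY", "FORMS", "BROKE"]
    for word in find_missing_words(transcript, dictionary):
        print(word)
```

Collecting the missing words into a set and sorting once at the end gives a duplicate-free list that is easy to split among team members.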

The Experiment dictionary required the following words to be added:

1500  F IH0 F T IY1 N HH AH1 N D R AH0 D
260  T UW1 S IH1 K S T IY0
286  T UW1 EY1 T IY0 S IH1 K S
2CI  T UW1 S IY1 AY1
386  TH R IY1 EY1 T IY0 S IH1 K S
401K  F AO1 R OW1 W AH1 N K EY1
9050  N AY1 N T IY0 F IH1 F T IY0
990  N AY1 N N AY1 N T IY0
AARP  AH0 AH0 AA1 R P IY1
ABC  AH0 B IY1 S IY1
ALZHEIMERS  AE1 L Z IY1 HH AY1 M ER0 S
ANTEKS  AE1 N T EH1 K EH1 S
ANTIAMERICANISM  AE1 N T IY0 AH0 M EH1 R IH0 K AH0 N IH2 Z AH0 M
APLIPLATIONS  AH0 P IH1 L P L AA1 T IY1 AY1 AH0 N EH1 S
ATM  AH0 T IY1 EH1 M
AT&T  AH0 T AH0 N D T IY1
AT&T'S  AH0 T AH0 N D T IY1 EH1 S
AUTOMETER  AO1 T OW0 M IY1 T ER0
BACKSEATS  B AE0 K S IY1 T S
BITISH  B IH1 T F IH1 SH
BLOW'S  B L OW1 EH1 S EH1 S
CEO  S IY1 IY1 OW1
COP'S  K AA1 P EH1 S EH1 S
COTR  K OW1 T IY1 AA1 R
CRX  S IY1 AA1 R EH1 K S
CU  S IY1 Y UW1
CWA  S IY1 D AH1 B AH0 L Y UW0 AH0
DAHLMER  D AA1 L M ER0
DC  D IY1 S IY1
DCOM  D IY1 K AA1 M
DEBITING  D EH1 B IH0 T IH1 NG
DFM  D IY1 EH1 F EH1 M
DPMA  D IY1 P IY1 EH1 M AH0
DRAWED  D R AO1 EH1 D
EINSTEINS  AY1 N S T AY1 N S
EMULATIONS  EH1 M Y AH1 L EY1 SH AH2 N S
EUROVAN  Y UW1 R OW1 V AE1 N
FBI  EH1 F B IY1 AY1
FEMTOMETER  F EH M T OW s M IY T ER
FIDONET'S  F AY1 D OW0 N EH1 T H1 S EH1 S
GABLERS  JH IY1 AE1 B L AH0 AA1 R EH1 S
GEOS  JH IY1 IY1 AA0 S
GLUING  JH IY1 L UW1 IH1 NG
GM  JH IY1 EH1 M
HM  EY1 CH EH1 M
HMOS  EY1 CH EH1 M OW1 S
IBM  AY1 B IY1 EH1 M
INDEPTH  IH0 N D EH1 P TH
IRS  AY1 AA1 R EH1 S
ISDN  AY1 EH1 S D IY1 EH1 N
ITERATIONS  IH1 T ER0 EY1 SH AH0 N Z
LAOTIANS  L EY1 OW1 SH AH0 N Z
MAJORLY  M EY1 JH ER0 L IY1
MECHANIC'S  M AH0 K AE1 N IH0 K Z
MIATAS  M IY1 AH0 T AA1 S
MONONUCLEOSIS  M OW1 N OW0 N UW1 K L IY1 OW0 S IH1 S
MORTICES  M AO1 R T AY1 S EH1 S
NIH  EH1 N AY1 EY1 CH
NONSALARY  N AA1 N S AE1 L ER0 IY0
NUKING  N UW1 K IH1 NG
PAVERS  P EY1 V ER0 EH1 S
PAY'S  P EY1 EH1 S EH1 S
PC  P IY1 S IY1
PENTALTY  P EH1 N AH0 L T IY0
PERSONALTIES  P ER1 S AH0 N AE1 L T AY1 Z
PFM  P IY1 EH1 F EH1 M
PLUSSES  P L AH1 S S EH1 S
PPOS  P IY1 P IY1 OW1 EH1 S
PRECARE  P R IY1 K EH1 R
PREFIXES  P R IY1 F IH0 K S EH1 S
PRERETIREMENT  P R IY1 R IY0 T AY1 ER0 M AH0 N T
QUADRATRON  K W AE1 D R AH0 T R AA1 N
RABBETING  R AE1 B AH0 T IH1 NG
RCA  AA1 R S IY1 AH0
RENOVATIVE  R EH1 N AH0 V EY2 T IH0 V
ROTC  AA1 R OW1 T IY1 S IY1
ROUTER  R UW1 T ER0
SCARIOUSLY  S K EH1 R IY0 AH0 S L IY0
SHEBANG  SH IY0 B AE1 NG
SHEETROCK  SH IY1 T R AA1 K
SHEETROCKING  SH IY1 T R AA1 K IH1 NG
SIMULAR  S IH1 M AH0 L ER0
SMATTERINGS  S M AE1 T ER0 IH1 NG S
SPECTROGRAPHY  S P EH1 K T R OW1 G R AE1 F AY0
SPLACE  S P L EY1 S
SUBARUS  S UW1 B ER0 UW0 S
SWEARED  S W EH1 R EH1 D
TELECREDER  T EH1 L IY0 K R EH1 D ER0
TR  T IY1 AA1 R
TRESTING  T R EH1 S T IH1 NG
TROPICALS  T R AA1 P IH0 K AH0 L EH1 S
UCF  Y UW1 S IY1 EH1 F
ULTRAWISE  AH1 L T R AH0 W AY1 Z
UNDERGRADS  AH1 N D ER0 G R AE1 D Z
UNFORSEEN  AH1 N F AO1 R S IY1 N
UPROARIOUS  AH1 P R AO2 R IY1 AH1 S
VCR  V IY1 S IY1 AA1 R
VOCALIZED  V OW1 K AH0 L AY2 Z D
VW  V IY1 D AH1 B AH0 L Y UW0
WALKMAN'S  W AO1 K M AE2 N . EH1 S . EH1 S
WEASONABLE  W IY1 Z AH0 N AH0 B AH0 L
XT6  EH1 K S T IY1 EH1 S
Z248  Z IY1 T UW1 F AO1 R EY1 T
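Hand-typed entries like the ones above can be checked mechanically before they are merged into the dictionary. The sketch below is not one of the experiment's scripts. It assumes the entries follow the CMU Pronouncing Dictionary convention (the 39 ARPAbet phones, with a stress digit 0, 1, or 2 on every vowel and none on consonants); the function name `invalid_phones` is hypothetical, and the phone set used by the actual trainer may differ.

```python
import re

# ARPAbet phones as used by the CMU Pronouncing Dictionary.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}
CONSONANTS = {"B", "CH", "D", "DH", "F", "G", "HH", "JH", "K", "L", "M",
              "N", "NG", "P", "R", "S", "SH", "T", "TH", "V", "W", "Y",
              "Z", "ZH"}

# An uppercase phone name optionally followed by a stress digit.
PHONE_RE = re.compile(r"([A-Z]+)([012])?")


def invalid_phones(entry):
    """Return the phone tokens in one 'WORD PH1 PH2 ...' line that fail the check.

    Assumption: vowels must carry a stress digit and consonants must not.
    """
    bad = []
    for token in entry.split()[1:]:  # skip the word itself
        match = PHONE_RE.fullmatch(token)
        if match is None:
            bad.append(token)
            continue
        base, stress = match.groups()
        is_valid_vowel = base in VOWELS and stress is not None
        is_valid_consonant = base in CONSONANTS and stress is None
        if not (is_valid_vowel or is_valid_consonant):
            bad.append(token)
    return bad


if __name__ == "__main__":
    # Entries copied from the list above.
    for line in ["IBM  AY1 B IY1 EH1 M",
                 "FEMTOMETER F EH M T OW s M IY T ER",
                 "FIDONET'S F AY1 D OW0 N EH1 T H1 S EH1 S"]:
        print(line.split()[0], invalid_phones(line))
```

A check like this only catches malformed tokens (missing stress digits, stray punctuation, unknown phone names); it cannot tell whether a well-formed pronunciation is actually correct for the word.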

On first execution, we discovered that the transcript contained invalid text that neither the dictionary scripts nor genTrans2.pl filtered out. The tokens were "<A_ASIDE>" and "<B_ASIDE>"; we believe these are translators' notes that were somehow left in.
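One way to keep such tokens out of the training transcript is to strip anything in angle brackets before the dictionary and transcript scripts run. The following is a minimal sketch under that assumption; it is not the fix applied to genTrans2.pl, and the function name `strip_markup` and the sample sentence are made up for illustration.

```python
import re

# Matches angle-bracketed markup tokens such as <A_ASIDE> or <B_ASIDE>.
MARKUP_RE = re.compile(r"<[^<>\s]+>")


def strip_markup(line):
    """Remove angle-bracketed tokens and collapse the leftover whitespace."""
    without_tags = MARKUP_RE.sub(" ", line)
    return " ".join(without_tags.split())


if __name__ == "__main__":
    # Made-up transcript line containing the two tokens found in this experiment.
    print(strip_markup("well <A_ASIDE> hold on a second <B_ASIDE> anyway"))
```

Matching the general `<...>` shape rather than the two specific tokens is a design choice for the sketch: it would also catch other bracketed annotations, which may or may not be desirable for a given transcript.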

This experiment was completed in Experiment 0075.