Speech:Exps 0017


Title: 10hr Train w/Test on Train


Description

Author: Cedric Woodbury

Date: August 29, 2012

Purpose: To test a train and decode using 10 hrs worth of dialog

Details: This experiment was originally meant to use 90 hours of dialog. However, the master transcript appears to be incomplete: it covers only about 10 hours. I used soxi to calculate how much audio data we actually have:

soxi -Td /mnt/main/corpus/dist/Switchboard/flat/*.sph

This reported over 258 hours of audio, so a large part of the master transcript must be missing.

Instead of the planned 90 hours, I used the entire transcript that we have and ran a train and decode on that.

The corpus for this is under /mnt/main/corpus/switchboard/10hr/train

Results

I built the experiment as normal and started the train by executing the RunAll.pl script. As expected, it stopped because many words were missing from the dictionary. I added the following words:

20  T W EH1 N T IY0
401K  F AO1 R OW1 W AH1 N
7  S EH1 V AH0 N
ABC'S  EY1 B IY1 S IY1 Z
ALBRIDGE  AO1 B R IH1 JH
ALRIGHTY  AO2 L R AY1 T IY1
AMARETTA  AH0 M AA1 R EH1 T AH0
ANTISUPPORTERS  AE1 N T IY0 S AH0 P AO1 R T ER0 Z
APA  EY1 P IY1 EY1
APPLIQUES  AH0 P L IY1 K S
ARF  AA1 R F
ASSUMABLY  AH0 S UW1 M EY1 B L IY0
ASSUMINGLY  AH0 S UW1 M IH0 NG L IY0
ATCHAFALAYA'S  
BACKPACKING  B AE1 K P AE1 K IH0 NG
BACKYARD'S  B AE1 K Y AA2 R D Z
BAJAING  B AA1 HH AA2 IH0 NG
BALLADY  B AE1 L AH0 D IY0
<B_ASIDE>  SIL
BELTLINE  B EH1 L T S L AY1 N
BETTE  B EH1 T IY0
BISQUICK  B IH1 S K W IH1 K
BMW'S  B IY1 EH1 M D AH1 B AH0 L Y UW0 Z
BU  B IY1 Y UW1
CADDO  K AE1 D OW1
CAMGONIAS  K AE1 M G OW1 N Y AH0 Z
CANSEGO  K AE1 N S EY1 G OW1
CARPOOLS  K AA1 R P UW2 L Z
CARRADINE'S  K AA1 R AH0 D IY1 N Z
CBS  S IY1 B IY1 EH1 S
CEOS  S IY1 IY1 OW1 Z
CHEVELLE  SH AH0 V EH1 L
CHLORINATION  K L AO1 R IH1 N EY1 SH AHO N
CHLOROFAR  K L AO1 R AH0 F AA1 R
CHOWPHERD  CH AW1 P ER0 D
CIVINAL  S IH1 V AH0 N AH0 L
CMU  S IY1 EH1 M Y UW1
COGNIZITIVE  K AA1 G N AH0 Z IH0 T IH0 V
COMPRESSOR'S  K AH0 M P R EH1 S ER0 Z
COOPS  K OW0 AA1 P Z
CORONARIES  K AO1 R AH0 N ER2 R IY0 Z
COZUMEL  K AA1 Z UW1 M EH1 L
CRAWLERS  K R AO1 L ER0 Z
DADERS  D AE1 D ER0 Z
DADGUM  D AE1 D G AH1 M
DC  D IY1 S IY1
DEFINETELY  D EH1 F AH0 N AH0 T L IY0
DESCENDANCY  D IH0 S EH1 N D AH0 N S IY1
DETHATCH  D IY1 TH AE1 CH
DINGER'S  D IH1 NG ER0 Z
DIRKSON  D ER1 K S AH1 N
DJ  D IY1 JH EY1
DUCTWORK  D AH1 K T W ER1 K
<E_ASIDE>  SIL
EDS  IY1 D IY1 EH1 S
ENCHILADAS  EH0 N CH IH0 L AA1 D AH0 Z
EVIL'S  IY1 V AH0 L Z
EXPERIENCEWISE  IH0 K S P IH1 R IY0 AH0 N S W AY1 Z
EXTANSION  EH1 K S T AH0 N ZH AH0 N
FEDERALES - F EH1 D ER0 AE1 L IY0 Z
FLATLINERS  F L AE1 T L AY1 N ER0 Z
GLENROSE  G L EH1 N R OW1 Z
GM  JH IY1 EH1 M
GOMPHRENA  G OW1 M F R IY1 N AA1
GSI  JH IY1 EH1 S AY1
GTE  JH IY1 T IY1 IY1
GTO  JH IY1 T IY1 OW1
HARRISVILLE  HH ER1 R IH0 S V IH1 L
HISKEN'S  HH IH1 S K EH1 N Z
HMO'S  EY1 CH EH1 M OW1 Z
HMO  EY1 CH EH1 M OW1
HM  EY1 CH EH1 M
HOPELY  HH OW1 P L IY0
IBM'S  AY1 B IY1 EH1 M Z
IBM  AY1 B IY1 EH1 M
IE  AY1 IY1
IHOP  AY1 HH AA1 P
INCOMEWISE  IH1 N K AH2 M W AY1 Z
INSTINCTUAL  IH1 N S T IH0 NG K CH UW0 AH0 L
INSTRUCTOR'S  IH0 N S T R AH1 K T ER0 Z
JALAPENOS  HH AE2 L AH0 P IY1 N Y OW0 Z
JC  JH EY1 S IY1
KALACHANDJI'S  K AH0 L AA1 CH AE1 N D JH IY0
KIBITZING  K IH1 B IH1 T S IH1 NG
KID'LL  K IH1 D AH0 L
KVIL  K EY1 V IY1 IY1 EH1 L
LAURAL  L AO1 R AH0 L
LAVON  L AE1 V AA0 N
LIBBER  L IH1 B ER0
LOCKHAVEN  L AA1 K HH EY1 V AH0 N
LX  EH1 L EH1 K S
MARINADE  M ER1 R AH0 N EY1 D
MARYLANDER'S  M EH1 R IY0 L AE2 N D ER0 Z
MARYLANDER  M EH1 R IY0 L AE2 N D ER0
MAYPORT  M EY1 P AO1 R T
MCCALLS  M IH1 K AO1 L Z
MEATHEAD  M IY1 T HH EH1 D
MESQUITE'S  M EH1 S K IY2 T Z
MISCLASSIFIED  M IH0 S K L AE1 S AH0 F AY2 D
MOOSEWOOD  M UW1 S W UH2 D
MOTTA  M OW1 T AA1
MTV  EH1 M T IY1 V IY1
MYNEER  M AY1 N IH1 R
NBA  EH1 N B IY1 AH0
NFL  EH1 N EH1 F EH1 L
NONCOLORED  N AA1 N K AH1 L ER0 D
NONEXEMPT  N AA1 N IH0 G Z EH1 M P T
NONFILM  N AA1 N F IH1 L M
NONSMOKER'S  N AA1 N S M OW1 K ER0 Z
NRA  EH1 N AA1 R AH0
NUTRISYSTEM  N UW TR IY0 S IH1 S T AH0 M
OFFENCING  AH0 F EH1 N S IH0 NG
PACKARDS  P AE1 K ER0 D Z
PAPPASITO'S  P AA1 P AH0 S IY1 T OW0 Z
PBS  P IY1 B IY1 EH1 S
PCS  P IY1 S IY1 EH1 S
PC  P IY1 S IY1
PENSACOPLA  P EH2 N S AH0 K OW1 P L AH0
PE  P IY1 IY1
PHD  P IY1 EY1 CH D IY1
PLANOITE  P L AE1 N OY2 T
PREMED  P R IY1 M EH1 D
PROCTORED  P R AA1 K T ER0 D
PSYCHOS  S AY1 K OW0
PTO  P IY1 T IY1 OW1
PURDIS  P ER1 D IH1 Z
RALPHIE'S  R AE1 L F IY1 Z
RECARPET  R IY0 K AA1 R P AH0 T
REINJURING  R IY2 IH1 N JH ER0 IH0 NG
RETRIEVER'S  R IY0 T R IY0 V ER0 Z
SCHNAUZERS  SH N AW1 Z ER0 Z
SEDER'S  S EY1 D ER0 Z
SEVREN  S EH1 V R EH1 N
SHORTIES  SH AO1 R T IY0 Z
SKIDMORES  S K IH1 D M AO1 R Z
SMASHERS  S M AE1 SH ER0 Z
SMOCKS  S M AA1 K Z
SMOLDERS  S M OW1 L D ER0 Z
SMU  EH1 S EH1 M Y UW1
SNACKING  S N AE1 K IH0 NG
SOLBURNS  S OW1 L B ER1 N Z
SOUTHBEND  S AW1 TH B EH1 N D
SPORTSWISE  S P AO1 R T S W AY1 Z
SPRINTER'S  S P R IH1 N T ER0 Z
SQUASHES  S K W AA1 SH Z
SQUISH  S K W IH1 SH
STAIRMASTER  S T EH1 R M AE1 S T ER0
STICKSHIFT  S T IH1 K SH IH1 F T
STORLY  S T AO1 R L IY0
SUGARBAKERS  SH UH1 G ER0 B EY1 K ER0 Z
SUPERMOM  S UW1 P ER0 M AA1 M
SWITCHEROO  S W IH1 CH T ER0 UW1
SYNSI  S IH1 N S IY1
TAXWISE  T AE1 K S W AY1 Z
TCJC  T IY1 S IY1 JH EY1 S IY1
TEXINS  T EH1 K S IH1 N Z
THEIRSELVES  DH EH2 R S EH1 L V Z
THEM'S  DH EH1 M Z
IT'S  IH1 T Z
TOLLWAY  T OW1 L W EY2
UNCOOKED  AH1 N K UH1 K T
UNDERGRADS  AH1 N D ER0 G R AE1 D Z
UNDERGRADUATE'S  AH2 N D ER0 G R AE1 JH AH0 W AH0 T S
UNICOLOR  Y UW2 N AH0 K AH1 L ER0
UNIVERSALS  Y UW2 N AH0 V ER1 S AH0 L Z
USSR'S  Y UW1 EH1 S EH1 S AA1 R Z
UT  Y UW1 T IY1
VCR  V IY1 S IY1 AA1 R
VELA'S  V EH1 L AH0 S
VH1  V IY1 EY1 CH W AH1 N
VIP  V IY1 AY1 P IY1
VOCALIZED  V OW1 K AH0 L AY2 Z D
WELFOR  W EH1 L F AO1 R
WET'N  W EH1 T N
WF  D AH1 B AH0 L Y UW0 EH1 F
WJ  D AH1 B AH0 L Y UW0 JH EY1
WOODHOLLOW  W UH1 D HH AA1 L OW0
WR  D AH1 B AH0 L Y UW0 AA1 R

After adding those words I ran the train again. It failed again because more words were still missing. I added the following words in addition to those listed above:

20  T W EH1 N T IY0
30  TH ER1 D IY0
401K  F AO1 R OW1 W AH1 N K EY1
6S  S IH1 K S EH1 S
7  S EH1 V AH0 N
8S  EY1 T EH1 S
ANNUITITY  AH0 N UW1 IH0 T IH0 T IY0
ATCHAFALAYA'S  AE1 CH AO2 F AO1 L AY2 Y AA1
BETTLE  EH1 T L IY1
KLIF  K EY1 L AY1 EH1 F
LC  EH1 L S IY1
MOTHERBOARDS  M AH1 DH ER0 B AO1 R D Z
POLITICALNESS  P AH0 L IH1 T AH0 K AH0 L N AH0 S
RABBETING  R AE1 B AH0 T IH0 NG
REBILIHATATE  R IY0 B IH1 L HH AE1 T EY1 T
REBILITATE  R IY0 B IH1 L AE1 T EY1 T
REBIL  R IY0 B IH1 L
SCARIOUSLY  S K EH1 R IY0 AH0 S L IY0
SWIND  S W IH1 N D
TESTSES  T EH1 S T S EH1 S
TI'S  T IY1 AY1 Z
TRESTS  T R EH1 S T S
VCR'S  V IY1 S IY1 AA1 R Z
VH1  V IY1 EY1 CH W AH1 N
VINDICTIVELY  V IH0 N D IH1 K T IH0 V L IY1
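
Both rounds of missing words could, in principle, be found in a single pass before starting the train. The following is a minimal sketch of that idea, not part of the original experiment. It assumes a CMUdict-style dictionary ("WORD  PH1 PH2 ...", with alternates written as WORD(2)) and transcript lines of the form "<s> WORD WORD </s> (utterance-id)". The function names, sample lines, and utterance IDs are hypothetical.

```python
import re

def load_dict_words(dict_lines):
    """Collect head words from CMUdict-style lines ("WORD  PH1 PH2 ...")."""
    words = set()
    for line in dict_lines:
        line = line.strip()
        if not line or line.startswith(";;;"):  # skip blanks and comments
            continue
        head = line.split()[0]
        # Alternate pronunciations look like WORD(2); strip the suffix.
        words.add(re.sub(r"\(\d+\)$", "", head).upper())
    return words

def missing_words(trans_lines, dict_words):
    """Return the transcript words that have no dictionary entry, sorted."""
    missing = set()
    for line in trans_lines:
        # Assumed layout: "<s> WORD WORD </s> (utterance-id)".
        text = re.sub(r"\([^()\s]+\)\s*$", "", line)
        for token in text.split():
            if token in ("<s>", "</s>"):
                continue
            if token.upper() not in dict_words:
                missing.add(token.upper())
    return sorted(missing)

# Hypothetical sample data, for illustration only.
dictionary = ["IBM  AY1 B IY1 EH1 M", "IS  IH1 Z", "IS(2)  IH0 Z", "BIG  B IH1 G"]
transcript = [
    "<s> IBM IS BIG </s> (sw0001a-ms98-a-0001)",
    "<s> MTV IS BIG </s> (sw0001a-ms98-a-0002)",
]
print(missing_words(transcript, load_dict_words(dictionary)))
```

In practice the two lists would be read from the experiment's dictionary and training transcript files rather than defined inline.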

I tried again. This time I received a new error:

FATAL_ERROR: "mk_mdef_gen.c", line 127: Bad entry triphone file /mnt/main/Exp/0017/model_architecture/0017.phonelist
Something failed: (/mnt/main/Exp/0017/scripts_pl/20.ci_hmm/slave_convg.pl)

I looked at the 0017.phonelist file referenced in that error. It became apparent that I had made typos in the phones of the custom words. For example, I had typed AHO (with the letter O) instead of AH0 (with the digit zero). There were several such phone errors, and fixing them took several tries. I suspected that the <B_ASIDE> and <E_ASIDE> entries were also causing problems, so I removed them from my transcript, but I still got the same error. It turned out that I had a - after one of the words, and it was being treated as a phone. I removed it and at last the train began.
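
Phone typos of this kind can be caught by checking each added entry against the set of valid phones before training. The sketch below is an illustration under assumptions, not the procedure used in this experiment: it assumes the standard CMUdict phone set (15 vowels carrying a stress digit 0, 1 or 2, plus 24 consonants) and SIL, and the function name bad_entries is hypothetical. The sample lines are copied from the word lists above.

```python
# Assumed phone inventory: CMUdict vowels with stress digits, consonants, SIL.
VOWELS = "AA AE AH AO AW AY EH ER EY IH IY OW OY UH UW".split()
CONSONANTS = "B CH D DH F G HH JH K L M N NG P R S SH T TH V W Y Z ZH".split()
VALID = {v + s for v in VOWELS for s in "012"} | set(CONSONANTS) | {"SIL"}

def bad_entries(dict_lines):
    """Return (line number, word, problems) for entries with invalid phones."""
    problems = []
    for number, line in enumerate(dict_lines, start=1):
        fields = line.split()
        if not fields:
            continue
        word, phones = fields[0], fields[1:]
        if not phones:
            # An entry with no pronunciation at all is also flagged.
            problems.append((number, word, ["<no phones>"]))
            continue
        unknown = [p for p in phones if p not in VALID]
        if unknown:
            problems.append((number, word, unknown))
    return problems

sample = [
    "CHLORINATION  K L AO1 R IH1 N EY1 SH AHO N",  # letter O instead of zero
    "FEDERALES - F EH1 D ER0 AE1 L IY0 Z",         # stray hyphen read as a phone
    "ATCHAFALAYA'S  ",                             # missing pronunciation
    "SQUISH  S K W IH1 SH",                        # valid entry
]
for problem in bad_entries(sample):
    print(problem)
```

If the experiment is configured with a different phone set (for example, phones without stress markers), VALID would need to be adjusted to match.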

Running the decode proceeded without incident.

I attempted to score the decode and sclite reported errors. The following utterances are duplicated in the transcript and need to be removed:

    (sw2005a-ms98-a-0052)
    (sw2020b-ms98-a-0018)
    (sw2022a-ms98-a-0005)
    (sw2028a-ms98-a-0049)
    (sw2234a-ms98-a-0007)
    (sw2245a-ms98-a-0166)
    (sw2259a-ms98-a-0021)
    (sw2295b-ms98-a-0011)
    (sw2331a-ms98-a-0049)
    (sw2389b-ms98-a-0096)
    (sw2428a-ms98-a-0017)
    (sw2442b-ms98-a-0059)
    (sw2451b-ms98-a-0044)
    (sw2466b-ms98-a-0071)

I cleaned these up by using:

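uniq 0017_train.trans > 0017_train.trans.uniq

That produced a new file containing only unique entries. Note that uniq collapses only adjacent duplicate lines, so this relies on the repeated utterances sitting next to each other in the file. I did the same for the hyp.trans file, creating a new file called hyp.trans.uniq. Running sclite again produced another error: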

Error: double reference text for id '(sw2245a-ms98-a-0166)'

Two slightly different statements in the training transcript shared the same utterance ID, so I removed one of them. This time sclite ran successfully.
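
Both kinds of duplicate (identical repeated lines, and different text sharing one utterance ID) can be listed up front by grouping transcript lines by ID. This is a minimal sketch under the assumption that each line ends with the utterance ID in parentheses. The function name is hypothetical, and the sample sentences are invented for illustration; only the two IDs are taken from the list above.

```python
import re
from collections import defaultdict

# Assumed layout: the utterance ID is the final parenthesised token on the line.
ID_PATTERN = re.compile(r"\(([^()\s]+)\)\s*$")

def duplicate_ids(trans_lines):
    """Map each repeated utterance ID to "identical" or "conflicting"."""
    texts = defaultdict(list)
    for line in trans_lines:
        match = ID_PATTERN.search(line)
        if match:
            texts[match.group(1)].append(line[:match.start()].strip())
    report = {}
    for utt_id, variants in texts.items():
        if len(variants) > 1:
            # Identical copies are safe to collapse; conflicting ones need a manual choice.
            report[utt_id] = "identical" if len(set(variants)) == 1 else "conflicting"
    return report

# Invented sentences attached to two IDs from the list above.
lines = [
    "<s> YEAH I KNOW </s> (sw2005a-ms98-a-0052)",
    "<s> YEAH I KNOW </s> (sw2005a-ms98-a-0052)",
    "<s> WELL THAT IS TRUE </s> (sw2245a-ms98-a-0166)",
    "<s> WELL THAT'S TRUE </s> (sw2245a-ms98-a-0166)",
    "<s> UH-HUH </s> (sw2020b-ms98-a-0017)",
]
print(duplicate_ids(lines))
```

Unlike uniq, this grouping does not depend on the duplicates being adjacent, and the "conflicting" label marks the cases that sclite rejects as a double reference.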

Summary

sclite produced the following results:

                     SYSTEM SUMMARY PERCENTAGES by SPEAKER

      ,-----------------------------------------------------------------.
      |                         hyp.trans.uniq                          |
      |-----------------------------------------------------------------|
      | SPKR    | # Snt # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
      |=================================================================|
      | Sum/Avg | 1110  23028 | 63.2   28.4    8.4   10.3   47.0   97.7 |
      |=================================================================|
      |  Mean   |  2.8   58.7 | 62.5   30.2    7.3   16.4   53.9   98.3 |
      |  S.D.   |  1.8   45.0 | 15.7   14.3    6.1   24.0   27.8    8.4 |
      | Median  |  2.0   47.0 | 64.7   27.9    6.8    9.7   48.7  100.0 |
      `-----------------------------------------------------------------'
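
As a quick consistency check on the Sum/Avg row: sclite's error rate is the sum of substitutions, deletions and insertions, and the correct rate is what remains after substitutions and deletions. The short sketch below recomputes both from the reported percentages; the variable names are my own.

```python
# Sum/Avg row from the sclite summary above (percentages of reference words).
corr, sub, dele, ins, err = 63.2, 28.4, 8.4, 10.3, 47.0

# Word error rate counts substitutions, deletions and insertions.
err_from_parts = round(sub + dele + ins, 1)
# Every reference word is either correct, substituted or deleted.
corr_from_parts = round(100.0 - sub - dele, 1)

print("Err from parts: ", err_from_parts, "reported:", err)
print("Corr from parts:", corr_from_parts, "reported:", corr)
```

The recomputed error rate differs from the reported 47.0 by 0.1, which is consistent with each column having been rounded independently. In plain terms, the recognizer got roughly 63% of the reference words right even though it was tested on its own training data.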

Additional Notes

I created a new folder called custom under /mnt/main/corpus/dist. In it I put an original copy of cmudict.0.6d named cmudict.0.6d.original. I then made a copy called cmudict.0.6d.custom and added to it all the words I had to add to the dictionary. The idea is that other students can use this custom dictionary instead of re-adding the missing words each time. Any additional words they find can be added to it until we have eventually captured all the missing words. All of the words that I added are listed in the file added_words. I also created a file called letters.phone that has the phones for single letters, which is useful when pronouncing acronyms.
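
To illustrate how single-letter phones help with acronyms, here is a minimal sketch that spells an acronym out letter by letter. The contents and layout of letters.phone are not shown on this page, so the table below is an assumption based on the usual CMUdict letter pronunciations, and the function name spell_out is hypothetical.

```python
# Assumed letter pronunciations (CMUdict style); letters.phone may differ.
LETTER_PHONES = {
    "A": "EY1", "B": "B IY1", "C": "S IY1", "D": "D IY1", "E": "IY1",
    "F": "EH1 F", "G": "JH IY1", "H": "EY1 CH", "I": "AY1", "J": "JH EY1",
    "K": "K EY1", "L": "EH1 L", "M": "EH1 M", "N": "EH1 N", "O": "OW1",
    "P": "P IY1", "Q": "K Y UW1", "R": "AA1 R", "S": "EH1 S", "T": "T IY1",
    "U": "Y UW1", "V": "V IY1", "W": "D AH1 B AH0 L Y UW0", "X": "EH1 K S",
    "Y": "W AY1", "Z": "Z IY1",
}

def spell_out(acronym):
    """Build a dictionary line by pronouncing each letter of the acronym."""
    phones = []
    for letter in acronym.upper():
        if letter not in LETTER_PHONES:
            raise ValueError(f"no letter pronunciation for {letter!r}")
        phones.append(LETTER_PHONES[letter])
    # Two spaces separate the word from its phones, as in the lists above.
    return f"{acronym.upper()}  {' '.join(phones)}"

for word in ["IBM", "VCR", "WR"]:
    print(spell_out(word))
```

This only suits acronyms that are read letter by letter. Ones pronounced as words (IHOP in the list above, for example) still need a hand-written entry.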