Speech:Exps 0017



Description
Author: Cedric Woodbury

Date: August 29, 2012

Purpose: To test a train and decode using 10 hrs worth of dialog

Details: This experiment was originally supposed to use 90 hours of dialog. However, there appears to be a problem with the master transcript, as it only covers about 10 hours. I used soxi to calculate how much audio data we actually have:

 soxi -Td /mnt/main/corpus/dist/Switchboard/flat/*.sph

This indicated that we have over 258 hours of audio dialog, so we must be missing a large part of the master transcript.

So instead of the 90 hrs I will use the whole transcript that we have and attempt to run a train and decode on that.

The corpus for this is under /mnt/main/corpus/switchboard/10hr/train

Results
I built the experiment as normal and attempted to start the train by executing the RunAll.pl script. As I expected, it stopped because there were many missing words in the dictionary. I added the following words:

 20  T W EH1 N T IY0
 401K  F AO1 R OW1 W AH1 N
 7  S EH1 V AH0 N
 ABC'S  EY1 B IY1 S IY1 Z
 ALBRIDGE  AO1 B R IH1 JH
 ALRIGHTY  AO2 L R AY1 T IY1
 AMARETTA  AH0 M AA1 R EH1 T AH0
 ANTISUPPORTERS  AE1 N T IY0 S AH0 P AO1 R T ER0 Z
 APA  EY1 P IY1 EY1
 APPLIQUES  AH0 P L IY1 K S
 ARF  AA1 R F
 ASSUMABLY  AH0 S UW1 M EY1 B L IY0
 ASSUMINGLY  AH0 S UW1 M IH0 NG L IY0
 ATCHAFALAYA'S
 BACKPACKING  B AE1 K P AE1 K IH0 NG
 BACKYARD'S  B AE1 K Y AA2 R D Z
 BAJAING  B AA1 HH AA2 IH0 NG
 BALLADY  B AE1 L AH0 D IY0
 SIL
 BELTLINE  B EH1 L T S L AY1 N
 BETTE  B EH1 T IY0
 BISQUICK  B IH1 S K W IH1 K
 BMW'S  B IY1 EH1 M D AH1 B AH0 L Y UW0 Z
 BU  B IY1 Y UW1
 CADDO  K AE1 D OW1
 CAMGONIAS  K AE1 M G OW1 N Y AH0 Z
 CANSEGO  K AE1 N S EY1 G OW1
 CARPOOLS  K AA1 R P UW2 L Z
 CARRADINE'S  K AA1 R AH0 D IY1 N Z
 CBS  S IY1 B IY1 EH1 S
 CEOS  S IY1 IY1 OW1 Z
 CHEVELLE  SH AH0 V EH1 L
 CHLORINATION  K L AO1 R IH1 N EY1 SH AHO N
 CHLOROFAR  K L AO1 R AH0 F AA1 R
 CHOWPHERD  CH AW1 P ER0 D
 CIVINAL  S IH1 V AH0 N AH0 L
 CMU  S IY1 EH1 M Y UW1
 COGNIZITIVE  K AA1 G N AH0 Z IH0 T IH0 V
 COMPRESSOR'S  K AH0 M P R EH1 S ER0 Z
 COOPS  K OW0 AA1 P Z
 CORONARIES  K AO1 R AH0 N ER2 R IY0 Z
 COZUMEL  K AA1 Z UW1 M EH1 L
 CRAWLERS  K R AO1 L ER0 Z
 DADERS  D AE1 D ER0 Z
 DADGUM  D AE1 D G AH1 M
 DC  D IY1 S IY1
 DEFINETELY  D EH1 F AH0 N AH0 T L IY0
 DESCENDANCY  D IH0 S EH1 N D AH0 N S IY1
 DETHATCH  D IY1 TH AE1 CH
 DINGER'S  D IH1 NG ER0 Z
 DIRKSON  D ER1 K S AH1 N
 DJ  D IY1 JH EY1
 DUCTWORK  D AH1 K T W ER1 K
 SIL
 EDS  IY1 D IY1 EH1 S
 ENCHILADAS  EH0 N CH IH0 L AA1 D AH0 Z
 EVIL'S  IY1 V AH0 L Z
 EXPERIENCEWISE  IH0 K S P IH1 R IY0 AH0 N S W AY1 Z
 EXTANSION  EH1 K S T AH0 N ZH AH0 N
 FEDERALES - F EH1 D ER0 AE1 L IY0 Z
 FLATLINERS  F L AE1 T L AY1 N ER0 Z
 GLENROSE  G L EH1 N R OW1 Z
 GM  JH IY1 EH1 M
 GOMPHRENA  G OW1 M F R IY1 N AA1
 GSI  JH IY1 EH1 S AY1
 GTE  JH IY1 T IY1 IY1
 GTO  JH IY1 T IY1 OW1
 HARRISVILLE  HH ER1 R IH0 S V IH1 L
 HISKEN'S  HH IH1 S K EH1 N Z
 HMO'S  EY1 CH EH1 M OW1 Z
 HMO  EY1 CH EH1 M OW1
 HM  EY1 CH EH1 M
 HOPELY  HH OW1 P L IY0
 IBM'S  AY1 B IY1 EH1 M Z
 IBM  AY1 B IY1 EH1 M
 IE  AY1 IY1
 IHOP  AY1 HH AA1 P
 INCOMEWISE  IH1 N K AH2 M W AY1 Z
 INSTINCTUAL  IH1 N S T IH0 NG K CH UW0 AH0 L
 INSTRUCTOR'S  IH0 N S T R AH1 K T ER0 Z
 JALAPENOS  HH AE2 L AH0 P IY1 N Y OW0 Z
 JC  JH EY1 S IY1
 KALACHANDJI'S  K AH0 L AA1 CH AE1 N D JH IY0
 KIBITZING  K IH1 B IH1 T S IH1 NG
 KID'LL  K IH1 D AH0 L
 KVIL  K EY1 V IY1 IY1 EH1 L
 LAURAL  L AO1 R AH0 L
 LAVON  L AE1 V AA0 N
 LIBBER  L IH1 B ER0
 LOCKHAVEN  L AA1 K HH EY1 V AH0 N
 LX  EH1 L EH1 K S
 MARINADE  M ER1 R AH0 N EY1 D
 MARYLANDER'S  M EH1 R IY0 L AE2 N D ER0 Z
 MARYLANDER  M EH1 R IY0 L AE2 N D ER0
 MAYPORT  M EY1 P AO1 R T
 MCCALLS  M IH1 K AO1 L Z
 MEATHEAD  M IY1 T HH EH1 D
 MESQUITE'S  M EH1 S K IY2 T Z
 MISCLASSIFIED  M IH0 S K L AE1 S AH0 F AY2 D
 MOOSEWOOD  M UW1 S W UH2 D
 MOTTA  M OW1 T AA1
 MTV  EH1 M T IY1 V IY1
 MYNEER  M AY1 N IH1 R
 NBA  EH1 N B IY1 AH0
 NFL  EH1 N EH1 F EH1 L
 NONCOLORED  N AA1 N K AH1 L ER0 D
 NONEXEMPT  N AA1 N IH0 G Z EH1 M P T
 NONFILM  N AA1 N F IH1 L M
 NONSMOKER'S  N AA1 N S M OW1 K ER0 Z
 NRA  EH1 N AA1 R AH0
 NUTRISYSTEM  N UW TR IY0 S IH1 S T AH0 M
 OFFENCING  AH0 F EH1 N S IH0 NG
 PACKARDS  P AE1 K ER0 D Z
 PAPPASITO'S  P AA1 P AH0 S IY1 T OW0 Z
 PBS  P IY1 B IY1 EH1 S
 PCS  P IY1 S IY1 EH1 S
 PC  P IY1 S IY1
 PENSACOPLA  P EH2 N S AH0 K OW1 P L AH0
 PE  P IY1 IY1
 PHD  P IY1 EY1 CH D IY1
 PLANOITE  P L AE1 N OY2 T
 PREMED  P R IY1 M EH1 D
 PROCTORED  P R AA1 K T ER0 D
 PSYCHOS  S AY1 K OW0
 PTO  P IY1 T IY1 OW1
 PURDIS  P ER1 D IH1 Z
 RALPHIE'S  R AE1 L F IY1 Z
 RECARPET  R IY0 K AA1 R P AH0 T
 REINJURING  R IY2 IH1 N JH ER0 IH0 NG
 RETRIEVER'S  R IY0 T R IY0 V ER0 Z
 SCHNAUZERS  SH N AW1 Z ER0 Z
 SEDER'S  S EY1 D ER0 Z
 SEVREN  S EH1 V R EH1 N
 SHORTIES  SH AO1 R T IY0 Z
 SKIDMORES  S K IH1 D M AO1 R Z
 SMASHERS  S M AE1 SH ER0 Z
 SMOCKS  S M AA1 K Z
 SMOLDERS  S M OW1 L D ER0 Z
 SMU  EH1 S EH1 M Y UW1
 SNACKING  S N AE1 K IH0 NG
 SOLBURNS  S OW1 L B ER1 N Z
 SOUTHBEND  S AW1 TH B EH1 N D
 SPORTSWISE  S P AO1 R T S W AY1 Z
 SPRINTER'S  S P R IH1 N T ER0 Z
 SQUASHES  S K W AA1 SH Z
 SQUISH  S K W IH1 SH
 STAIRMASTER  S T EH1 R M AE1 S T ER0
 STICKSHIFT  S T IH1 K SH IH1 F T
 STORLY  S T AO1 R L IY0
 SUGARBAKERS  SH UH1 G ER0 B EY1 K ER0 Z
 SUPERMOM  S UW1 P ER0 M AA1 M
 SWITCHEROO  S W IH1 CH T ER0 UW1
 SYNSI  S IH1 N S IY1
 TAXWISE  T AE1 K S W AY1 Z
 TCJC  T IY1 S IY1 JH EY1 S IY1
 TEXINS  T EH1 K S IH1 N Z
 THEIRSELVES  DH EH2 R S EH1 L V Z
 THEM'S  DH EH1 M Z
 IT'S  IH1 T Z
 TOLLWAY  T OW1 L W EY2
 UNCOOKED  AH1 N K UH1 K T
 UNDERGRADS  AH1 N D ER0 G R AE1 D Z
 UNDERGRADUATE'S  AH2 N D ER0 G R AE1 JH AH0 W AH0 T S
 UNICOLOR  Y UW2 N AH0 K AH1 L ER0
 UNIVERSALS  Y UW2 N AH0 V ER1 S AH0 L Z
 USSR'S  Y UW1 EH1 S EH1 S AA1 R Z
 UT  Y UW1 T IY1
 VCR  V IY1 S IY1 AA1 R
 VELA'S  V EH1 L AH0 S
 VH1  V IY1 EY1 CH W AH1 N
 VIP  V IY1 AY1 P IY1
 VOCALIZED  V OW1 K AH0 L AY2 Z D
 WELFOR  W EH1 L F AO1 R
 WET'N  W EH1 T N
 WF  D AH1 B AH0 L Y UW0 EH1 F
 WJ  D AH1 B AH0 L Y UW0 JH EY1
 WOODHOLLOW  W UH1 D HH AA1 L OW0
 WR  D AH1 B AH0 L Y UW0 AA1 R

After adding those words, I ran the train again. It failed again because more words were still missing. I added the following words in addition to those listed above:

 20  T W EH1 N T IY0
 30  TH ER1 D IY0
 401K  F AO1 R OW1 W AH1 N K EY1
 6S  S IH1 K S EH1 S
 7  S EH1 V AH0 N
 8S  EY1 T EH1 S
 ANNUITITY  AH0 N UW1 IH0 T IH0 T IY0
 ATCHAFALAYA'S  AE1 CH AO2 F AO1 L AY2 Y AA1
 BETTLE  EH1 T L IY1
 KLIF  K EY1 L AY1 EH1 F
 LC  EH1 L S IY1
 MOTHERBOARDS  M AH1 DH ER0 B AO1 R D Z
 POLITICALNESS  P AH0 L IH1 T AH0 K AH0 L N AH0 S
 RABBETING  R AE1 B AH0 T IH0 NG
 REBILIHATATE  R IY0 B IH1 L HH AE1 T EY1 T
 REBILITATE  R IY0 B IH1 L AE1 T EY1 T
 REBIL  R IY0 B IH1 L
 SCARIOUSLY  S K EH1 R IY0 AH0 S L IY0
 SWIND  S W IH1 N D
 TESTSES  T EH1 S T S EH1 S
 TI'S  T IY1 AY1 Z
 TRESTS  T R EH1 S T S
 VCR'S  V IY1 S IY1 AA1 R Z
 VH1  V IY1 EY1 CH W AH1 N
 VINDICTIVELY  V IH0 N D IH1 K T IH0 V L IY1
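Finding every missing word in one pass, before starting the train, would avoid the repeated failed runs described above. The following is a minimal sketch, not part of the original experiment. It assumes transcript lines of the form `<s> WORD WORD </s> (utterance-id)` and dictionary lines of the form `WORD  PH PH PH`. The function names `load_dictionary_words` and `missing_words` are hypothetical.

```python
import re


def load_dictionary_words(dict_lines):
    """Return the set of head words found in CMU-style dictionary lines."""
    words = set()
    for line in dict_lines:
        fields = line.split()
        # Skip blank lines and ';;;' comment lines.
        if not fields or line.startswith(";;;"):
            continue
        # Alternate pronunciations look like WORD(2); strip that suffix.
        words.add(re.sub(r"\(\d+\)$", "", fields[0]).upper())
    return words


def missing_words(transcript_lines, dictionary_words):
    """Return a sorted list of transcript words absent from the dictionary."""
    missing = set()
    for line in transcript_lines:
        # Assumption: the utterance ID is a trailing "(...)" token.
        text = re.sub(r"\([^()\s]+\)\s*$", "", line)
        for token in text.split():
            # Assumption: <s> and </s> are sentence markers, not words.
            if token in ("<s>", "</s>"):
                continue
            if token.upper() not in dictionary_words:
                missing.add(token.upper())
    return sorted(missing)
```

To use it, read the transcript and dictionary files into lists of lines and pass them to these functions. Any non-speech tokens in the transcript would also be reported, so the output still needs a manual review.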

I tried again. This time I received a new error:

 FATAL_ERROR: "mk_mdef_gen.c", line 127: Bad entry triphone file /mnt/main/Exp/0017/model_architecture/0017.phonelist
 Something failed: (/mnt/main/Exp/0017/scripts_pl/20.ci_hmm/slave_convg.pl)

I looked at the 0017.phonelist file referenced in that error. It became apparent that there were typos in the phones I had used for the custom words. For example, instead of AH0 I had typed AHO. There were several such phone errors, and it took several tries to fix them all. I thought that the <s> and </s> entries were causing problems, so I removed them from my transcript, but I was still getting the same error. It turned out that I had a - after one of the words, and it was being treated as a phone. I removed it and at last the train began.
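Typos such as AHO for AH0, or a stray - in the phone column, can be caught before training by checking each hand-written entry against the expected phone set. The sketch below is an illustration rather than the procedure used in this experiment. It assumes the standard 39-phone CMU set with a stress digit (0, 1 or 2) on every vowel, plus SIL. The name `bad_phones` is hypothetical.

```python
# Assumption: the trainer expects the standard CMU phone set,
# with a stress digit on each vowel, plus the silence phone SIL.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}
CONSONANTS = {"B", "CH", "D", "DH", "F", "G", "HH", "JH", "K", "L", "M",
              "N", "NG", "P", "R", "S", "SH", "T", "TH", "V", "W", "Y",
              "Z", "ZH"}
VALID_PHONES = ({vowel + stress for vowel in VOWELS for stress in "012"}
                | CONSONANTS | {"SIL"})


def bad_phones(entry):
    """Return the tokens after the head word that are not valid phones."""
    fields = entry.split()
    return [phone for phone in fields[1:] if phone not in VALID_PHONES]
```

Running `bad_phones` over each line of the added-words list flags the entries that need fixing. An entry with no phones at all returns an empty list, so those need a separate check.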

Running the decode proceeded without incident.

I attempted to score the decode and sclite had errors. The following utterances are duplicates in the transcript that need to be removed:

 (sw2005a-ms98-a-0052)
 (sw2020b-ms98-a-0018)
 (sw2022a-ms98-a-0005)
 (sw2028a-ms98-a-0049)
 (sw2234a-ms98-a-0007)
 (sw2245a-ms98-a-0166)
 (sw2259a-ms98-a-0021)
 (sw2295b-ms98-a-0011)
 (sw2331a-ms98-a-0049)
 (sw2389b-ms98-a-0096)
 (sw2428a-ms98-a-0017)
 (sw2442b-ms98-a-0059)
 (sw2451b-ms98-a-0044)
 (sw2466b-ms98-a-0071)

I cleaned these up by using:

 cat 0017_train.trans | uniq >> 0017_train.trans.uniq

That produced a new file that only has unique entries. I did the same thing for the hyp.trans file, creating a new file called hyp.trans.uniq. Running sclite again produced another error:

 Error: double reference text for id '(sw2245a-ms98-a-0166)'

There were two slightly different statements in the training transcript that used the same utterance ID, so I removed one of them. This time sclite ran successfully.
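Note that uniq only collapses duplicate lines that are adjacent, and it cannot detect two different statements sharing one utterance ID. A minimal sketch of a stricter cleanup, keyed on the utterance ID, is shown here. It assumes each line ends with the ID in parentheses, and it keeps the first statement seen for each ID, which is a choice made for illustration only. The name `dedupe_by_utterance_id` is hypothetical.

```python
import re

# Assumption: the utterance ID is the final "(...)" token on each line.
UTT_ID = re.compile(r"\(([^()\s]+)\)\s*$")


def dedupe_by_utterance_id(lines):
    """Keep the first line per utterance ID and report conflicting IDs.

    Returns (kept_lines, conflicts), where conflicts lists each ID that
    appeared again with different text.
    """
    kept = []
    first_text = {}   # utterance ID -> text of its first occurrence
    conflicts = []
    for line in lines:
        line = line.rstrip("\n")
        match = UTT_ID.search(line)
        if match is None:
            # No ID found; keep the line unchanged.
            kept.append(line)
            continue
        utt_id = match.group(1)
        text = line[:match.start()].strip()
        if utt_id not in first_text:
            first_text[utt_id] = text
            kept.append(line)
        elif first_text[utt_id] != text and utt_id not in conflicts:
            conflicts.append(utt_id)
    return kept, conflicts
```

The IDs returned in `conflicts` correspond to cases like sw2245a-ms98-a-0166 above and are best resolved by hand, since the first statement is not necessarily the correct one.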

Summary
sclite produced the following results:

 SYSTEM SUMMARY PERCENTAGES by SPEAKER

 ,-----------------------------------------------------------------.
 |                         hyp.trans.uniq                          |
 |-----------------------------------------------------------------|
 | SPKR    | # Snt # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
 |=================================================================|
 | Sum/Avg | 1110  23028 | 63.2   28.4    8.4   10.3   47.0   97.7 |
 |=================================================================|
 |  Mean   |  2.8   58.7 | 62.5   30.2    7.3   16.4   53.9   98.3 |
 |  S.D.   |  1.8   45.0 | 15.7   14.3    6.1   24.0   27.8    8.4 |
 | Median  |  2.0   47.0 | 64.7   27.9    6.8    9.7   48.7  100.0 |
 `-----------------------------------------------------------------'
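For readers new to sclite output, the Err column is the word error rate: substitutions, deletions and insertions as a percentage of reference words. Corr, Sub and Del together account for all reference words. The short sketch below re-derives these from the Sum/Avg row. The function names are hypothetical, and because the inputs are already rounded to one decimal place, the recomputed error rate can differ from the reported 47.0 by about 0.1.

```python
def word_error_rate(sub_pct, del_pct, ins_pct):
    """Word error rate as the sum of the three error percentages."""
    return sub_pct + del_pct + ins_pct


def reference_coverage(corr_pct, sub_pct, del_pct):
    """Correct, substituted and deleted words should total about 100%."""
    return corr_pct + sub_pct + del_pct


# Values copied from the Sum/Avg row of the summary table.
err = word_error_rate(28.4, 8.4, 10.3)
coverage = reference_coverage(63.2, 28.4, 8.4)
```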

Additional Notes
I created a new folder called custom under /mnt/main/corpus/dist. There I put an original copy of cmudict.0.6d, named cmudict.0.6d.original. I also made a copy called cmudict.0.6d.custom and added to it all the words I had to add to the dictionary. The idea is that other students can use this custom dictionary so they do not have to keep re-adding these missing words. Any additional words they find can be added to this dictionary until we have eventually captured all the missing words. All of the words that I added are listed in the file added_words. I also created a file called letters.phone that has the phones for single letters, which is useful when pronouncing acronyms.
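As an illustration of how single-letter phones help with acronyms, the sketch below builds a pronunciation by concatenating the phones of each letter. The contents of letters.phone are not shown on this page, so the `LETTER_PHONES` table is an assumption based on the usual letter pronunciations and on entries such as IBM and VCR above. The name `spell_acronym` is hypothetical.

```python
# Assumption: one pronunciation per letter, in the style of the
# acronym entries listed earlier (e.g. IBM, VCR, WF).
LETTER_PHONES = {
    "A": "EY1", "B": "B IY1", "C": "S IY1", "D": "D IY1", "E": "IY1",
    "F": "EH1 F", "G": "JH IY1", "H": "EY1 CH", "I": "AY1", "J": "JH EY1",
    "K": "K EY1", "L": "EH1 L", "M": "EH1 M", "N": "EH1 N", "O": "OW1",
    "P": "P IY1", "Q": "K Y UW1", "R": "AA1 R", "S": "EH1 S", "T": "T IY1",
    "U": "Y UW1", "V": "V IY1", "W": "D AH1 B AH0 L Y UW0", "X": "EH1 K S",
    "Y": "W AY1", "Z": "Z IY1",
}


def spell_acronym(acronym):
    """Return a letter-by-letter pronunciation for an all-letter acronym."""
    phones = []
    for letter in acronym.upper():
        if letter not in LETTER_PHONES:
            raise ValueError("no phones known for %r" % letter)
        phones.append(LETTER_PHONES[letter])
    return " ".join(phones)
```

This only covers acronyms that are spelled out letter by letter. Acronyms pronounced as words, such as IHOP in the list above, still need a hand-written entry.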