Title: 10hr Train w/Test on Train
Author: Cedric Woodbury
Date: August 29, 2012
Purpose: To run a train and decode using 10 hours of dialog.
Details: This experiment was originally supposed to use 90 hours of dialog. However, there appears to be a problem with the master transcript, as it only covers about 10 hours. I used soxi to calculate how much audio data we actually have:
soxi -Td /mnt/main/corpus/dist/Switchboard/flat/*.sph
This indicated that we have over 258 hours of audio dialog, so we must be missing a large part of the master transcript.
So instead of the 90 hours, I will use the whole transcript that we have and attempt to run a train and decode on that.
The corpus for this is under /mnt/main/corpus/switchboard/10hr/train
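To make the "over 258 hours" figure easy to reproduce, the total that soxi -Td prints (hours:minutes:seconds) can be converted to decimal hours. The helper below is only a sketch: the name hms_to_hours is hypothetical, and it assumes the total is printed in hh:mm:ss.ff form.

```shell
# Hypothetical helper: convert an hh:mm:ss.ff duration string to decimal hours.
hms_to_hours() {
    printf '%s\n' "$1" | awk -F: '{ printf "%.1f\n", $1 + $2 / 60 + $3 / 3600 }'
}

# Example use with the command from this log (requires SoX to be installed):
#   hms_to_hours "$(soxi -Td /mnt/main/corpus/dist/Switchboard/flat/*.sph)"
```

The same helper can be applied to any other duration in that format, which makes it easier to compare the audio total against the amount of speech the transcript covers.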
I built the experiment as normal and attempted to start the train by executing the RunAll.pl script. As I expected, it stopped because many words were missing from the dictionary. I added the following words:
20 T W EH1 N T IY0
401K F AO1 R OW1 W AH1 N
7 S EH1 V AH0 N
ABC'S EY1 B IY1 S IY1 Z
ALBRIDGE AO1 B R IH1 JH
ALRIGHTY AO2 L R AY1 T IY1
AMARETTA AH0 M AA1 R EH1 T AH0
ANTISUPPORTERS AE1 N T IY0 S AH0 P AO1 R T ER0 Z
APA EY1 P IY1 EY1
APPLIQUES AH0 P L IY1 K S
ARF AA1 R F
ASSUMABLY AH0 S UW1 M EY1 B L IY0
ASSUMINGLY AH0 S UW1 M IH0 NG L IY0
ATCHAFALAYA'S
BACKPACKING B AE1 K P AE1 K IH0 NG
BACKYARD'S B AE1 K Y AA2 R D Z
BAJAING B AA1 HH AA2 IH0 NG
BALLADY B AE1 L AH0 D IY0
<B_ASIDE> SIL
BELTLINE B EH1 L T S L AY1 N
BETTE B EH1 T IY0
BISQUICK B IH1 S K W IH1 K
BMW'S B IY1 EH1 M D AH1 B AH0 L Y UW0 Z
BU B IY1 Y UW1
CADDO K AE1 D OW1
CAMGONIAS K AE1 M G OW1 N Y AH0 Z
CANSEGO K AE1 N S EY1 G OW1
CARPOOLS K AA1 R P UW2 L Z
CARRADINE'S K AA1 R AH0 D IY1 N Z
CBS S IY1 B IY1 EH1 S
CEOS S IY1 IY1 OW1 Z
CHEVELLE SH AH0 V EH1 L
CHLORINATION K L AO1 R IH1 N EY1 SH AHO N
CHLOROFAR K L AO1 R AH0 F AA1 R
CHOWPHERD CH AW1 P ER0 D
CIVINAL S IH1 V AH0 N AH0 L
CMU S IY1 EH1 M Y UW1
COGNIZITIVE K AA1 G N AH0 Z IH0 T IH0 V
COMPRESSOR'S K AH0 M P R EH1 S ER0 Z
COOPS K OW0 AA1 P Z
CORONARIES K AO1 R AH0 N ER2 R IY0 Z
COZUMEL K AA1 Z UW1 M EH1 L
CRAWLERS K R AO1 L ER0 Z
DADERS D AE1 D ER0 Z
DADGUM D AE1 D G AH1 M
DC D IY1 S IY1
DEFINETELY D EH1 F AH0 N AH0 T L IY0
DESCENDANCY D IH0 S EH1 N D AH0 N S IY1
DETHATCH D IY1 TH AE1 CH
DINGER'S D IH1 NG ER0 Z
DIRKSON D ER1 K S AH1 N
DJ D IY1 JH EY1
DUCTWORK D AH1 K T W ER1 K
<E_ASIDE> SIL
EDS IY1 D IY1 EH1 S
ENCHILADAS EH0 N CH IH0 L AA1 D AH0 Z
EVIL'S IY1 V AH0 L Z
EXPERIENCEWISE IH0 K S P IH1 R IY0 AH0 N S W AY1 Z
EXTANSION EH1 K S T AH0 N ZH AH0 N
FEDERALES - F EH1 D ER0 AE1 L IY0 Z
FLATLINERS F L AE1 T L AY1 N ER0 Z
GLENROSE G L EH1 N R OW1 Z
GM JH IY1 EH1 M
GOMPHRENA G OW1 M F R IY1 N AA1
GSI JH IY1 EH1 S AY1
GTE JH IY1 T IY1 IY1
GTO JH IY1 T IY1 OW1
HARRISVILLE HH ER1 R IH0 S V IH1 L
HISKEN'S HH IH1 S K EH1 N Z
HMO'S EY1 CH EH1 M OW1 Z
HMO EY1 CH EH1 M OW1
HM EY1 CH EH1 M
HOPELY HH OW1 P L IY0
IBM'S AY1 B IY1 EH1 M Z
IBM AY1 B IY1 EH1 M
IE AY1 IY1
IHOP AY1 HH AA1 P
INCOMEWISE IH1 N K AH2 M W AY1 Z
INSTINCTUAL IH1 N S T IH0 NG K CH UW0 AH0 L
INSTRUCTOR'S IH0 N S T R AH1 K T ER0 Z
JALAPENOS HH AE2 L AH0 P IY1 N Y OW0 Z
JC JH EY1 S IY1
KALACHANDJI'S K AH0 L AA1 CH AE1 N D JH IY0
KIBITZING K IH1 B IH1 T S IH1 NG
KID'LL K IH1 D AH0 L
KVIL K EY1 V IY1 IY1 EH1 L
LAURAL L AO1 R AH0 L
LAVON L AE1 V AA0 N
LIBBER L IH1 B ER0
LOCKHAVEN L AA1 K HH EY1 V AH0 N
LX EH1 L EH1 K S
MARINADE M ER1 R AH0 N EY1 D
MARYLANDER'S M EH1 R IY0 L AE2 N D ER0 Z
MARYLANDER M EH1 R IY0 L AE2 N D ER0
MAYPORT M EY1 P AO1 R T
MCCALLS M IH1 K AO1 L Z
MEATHEAD M IY1 T HH EH1 D
MESQUITE'S M EH1 S K IY2 T Z
MISCLASSIFIED M IH0 S K L AE1 S AH0 F AY2 D
MOOSEWOOD M UW1 S W UH2 D
MOTTA M OW1 T AA1
MTV EH1 M T IY1 V IY1
MYNEER M AY1 N IH1 R
NBA EH1 N B IY1 AH0
NFL EH1 N EH1 F EH1 L
NONCOLORED N AA1 N K AH1 L ER0 D
NONEXEMPT N AA1 N IH0 G Z EH1 M P T
NONFILM N AA1 N F IH1 L M
NONSMOKER'S N AA1 N S M OW1 K ER0 Z
NRA EH1 N AA1 R AH0
NUTRISYSTEM N UW TR IY0 S IH1 S T AH0 M
OFFENCING AH0 F EH1 N S IH0 NG
PACKARDS P AE1 K ER0 D Z
PAPPASITO'S P AA1 P AH0 S IY1 T OW0 Z
PBS P IY1 B IY1 EH1 S
PCS P IY1 S IY1 EH1 S
PC P IY1 S IY1
PENSACOPLA P EH2 N S AH0 K OW1 P L AH0
PE P IY1 IY1
PHD P IY1 EY1 CH D IY1
PLANOITE P L AE1 N OY2 T
PREMED P R IY1 M EH1 D
PROCTORED P R AA1 K T ER0 D
PSYCHOS S AY1 K OW0
PTO P IY1 T IY1 OW1
PURDIS P ER1 D IH1 Z
RALPHIE'S R AE1 L F IY1 Z
RECARPET R IY0 K AA1 R P AH0 T
REINJURING R IY2 IH1 N JH ER0 IH0 NG
RETRIEVER'S R IY0 T R IY0 V ER0 Z
SCHNAUZERS SH N AW1 Z ER0 Z
SEDER'S S EY1 D ER0 Z
SEVREN S EH1 V R EH1 N
SHORTIES SH AO1 R T IY0 Z
SKIDMORES S K IH1 D M AO1 R Z
SMASHERS S M AE1 SH ER0 Z
SMOCKS S M AA1 K Z
SMOLDERS S M OW1 L D ER0 Z
SMU EH1 S EH1 M Y UW1
SNACKING S N AE1 K IH0 NG
SOLBURNS S OW1 L B ER1 N Z
SOUTHBEND S AW1 TH B EH1 N D
SPORTSWISE S P AO1 R T S W AY1 Z
SPRINTER'S S P R IH1 N T ER0 Z
SQUASHES S K W AA1 SH Z
SQUISH S K W IH1 SH
STAIRMASTER S T EH1 R M AE1 S T ER0
STICKSHIFT S T IH1 K SH IH1 F T
STORLY S T AO1 R L IY0
SUGARBAKERS SH UH1 G ER0 B EY1 K ER0 Z
SUPERMOM S UW1 P ER0 M AA1 M
SWITCHEROO S W IH1 CH T ER0 UW1
SYNSI S IH1 N S IY1
TAXWISE T AE1 K S W AY1 Z
TCJC T IY1 S IY1 JH EY1 S IY1
TEXINS T EH1 K S IH1 N Z
THEIRSELVES DH EH2 R S EH1 L V Z
THEM'S DH EH1 M Z
IT'S IH1 T Z
TOLLWAY T OW1 L W EY2
UNCOOKED AH1 N K UH1 K T
UNDERGRADS AH1 N D ER0 G R AE1 D Z
UNDERGRADUATE'S AH2 N D ER0 G R AE1 JH AH0 W AH0 T S
UNICOLOR Y UW2 N AH0 K AH1 L ER0
UNIVERSALS Y UW2 N AH0 V ER1 S AH0 L Z
USSR'S Y UW1 EH1 S EH1 S AA1 R Z
UT Y UW1 T IY1
VCR V IY1 S IY1 AA1 R
VELA'S V EH1 L AH0 S
VH1 V IY1 EY1 CH W AH1 N
VIP V IY1 AY1 P IY1
VOCALIZED V OW1 K AH0 L AY2 Z D
WELFOR W EH1 L F AO1 R
WET'N W EH1 T N
WF D AH1 B AH0 L Y UW0 EH1 F
WJ D AH1 B AH0 L Y UW0 JH EY1
WOODHOLLOW W UH1 D HH AA1 L OW0
WR D AH1 B AH0 L Y UW0 AA1 R
After adding those words, I attempted to run the train again. It failed again because more words were still missing. I added the following words in addition to the ones noted above (a few entries from the first list are repeated or corrected here):
20 T W EH1 N T IY0
30 TH ER1 D IY0
401K F AO1 R OW1 W AH1 N K EY1
6S S IH1 K S EH1 S
7 S EH1 V AH0 N
8S EY1 T EH1 S
ANNUITITY AH0 N UW1 IH0 T IH0 T IY0
ATCHAFALAYA'S AE1 CH AO2 F AO1 L AY2 Y AA1
BETTLE EH1 T L IY1
KLIF K EY1 L AY1 EH1 F
LC EH1 L S IY1
MOTHERBOARDS M AH1 DH ER0 B AO1 R D Z
POLITICALNESS P AH0 L IH1 T AH0 K AH0 L N AH0 S
RABBETING R AE1 B AH0 T IH0 NG
REBILIHATATE R IY0 B IH1 L HH AE1 T EY1 T
REBILITATE R IY0 B IH1 L AE1 T EY1 T
REBIL R IY0 B IH1 L
SCARIOUSLY S K EH1 R IY0 AH0 S L IY0
SWIND S W IH1 N D
TESTSES T EH1 S T S EH1 S
TI'S T IY1 AY1 Z
TRESTS T R EH1 S T S
VCR'S V IY1 S IY1 AA1 R Z
VH1 V IY1 EY1 CH W AH1 N
VINDICTIVELY V IH0 N D IH1 K T IH0 V L IY1
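Rather than discovering missing words one failed train at a time, the transcript can be compared against the dictionary up front. The function below is a rough sketch, not part of the experiment scripts: the name missing_words is hypothetical, and it assumes a transcript with one utterance per line in the form "<s> WORD WORD </s> (utterance-id)" and a dictionary with one "WORD PHONE PHONE ..." entry per line, where alternate pronunciations carry a suffix such as WORD(2) and comment lines start with ;;; or ##.

```shell
# Hypothetical helper: list transcript words that have no dictionary entry.
# Usage: missing_words TRANSCRIPT DICTIONARY
missing_words() {
    awk '
        # First file (the dictionary): remember every headword.
        NR == FNR {
            if ($0 ~ /^(;;;|##)/ || NF == 0) next
            w = $1
            sub(/\([0-9]+\)$/, "", w)      # WORD(2) counts as WORD
            dict[w] = 1
            next
        }
        # Second file (the transcript): report each unknown word once.
        {
            for (i = 1; i <= NF; i++) {
                w = $i
                if (w == "<s>" || w == "</s>" || w ~ /^\(.*\)$/) continue
                if (!(w in dict) && !(w in seen)) { seen[w] = 1; print w }
            }
        }
    ' "$2" "$1" | sort
}
```

For this experiment the call would look something like missing_words 0017_train.trans cmudict.0.6d, with the exact file locations depending on how the experiment directory is laid out.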
I tried again. This time I received a new error:
FATAL_ERROR: "mk_mdef_gen.c", line 127: Bad entry triphone file /mnt/main/Exp/0017/model_architecture/0017.phonelist
Something failed: (/mnt/main/Exp/0017/scripts_pl/20.ci_hmm/slave_convg.pl)
I looked at the 0017.phonelist file referenced in that error. It became apparent that there were typos in the phones I had used for the custom words. For example, instead of AH0 (ending in a zero) I had typed AHO (ending in the letter O). There were several such phone errors, and it took several tries to find them all. I also suspected that the <B_ASIDE> and <E_ASIDE> entries were causing problems, so I removed them from my transcript, but I was still getting the same error. It turned out that I had a stray - after one of the words (FEDERALES), and the trainer was treating it as a phone. I removed it and at last the train began.
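Typos such as AHO or a stray - can be caught before training by checking every phone in the added entries against a list of valid phones. The following is a minimal sketch under stated assumptions: check_dict_phones is a hypothetical name, the first argument is assumed to be a file with one valid phone per line (for example a known-good phone list), and the second is a dictionary with one "WORD PHONE PHONE ..." entry per line.

```shell
# Hypothetical helper: flag dictionary entries with unknown or missing phones.
# Usage: check_dict_phones VALID_PHONES_FILE DICTIONARY
# Returns non-zero if any problem is found.
check_dict_phones() {
    awk '
        # First file: load the set of valid phones.
        NR == FNR { if (NF > 0) valid[$1] = 1; next }
        NF == 0   { next }
        # An entry with a word but no pronunciation.
        NF == 1   { printf "%s:%d: %s has no phones\n", FILENAME, FNR, $1; bad = 1; next }
        {
            for (i = 2; i <= NF; i++)
                if (!($i in valid)) {
                    printf "%s:%d: %s has unknown phone \"%s\"\n", FILENAME, FNR, $1, $i
                    bad = 1
                }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1" "$2"
}
```

Run against the first word list in this log, a check like this would be expected to report the AHO in CHLORINATION, the - after FEDERALES, and the ATCHAFALAYA'S entry that has no phones at all.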
Running the decode proceeded without incident.
When I attempted to score the decode, sclite reported errors. The following utterances are duplicated in the transcript and need to be removed:
(sw2005a-ms98-a-0052) (sw2020b-ms98-a-0018) (sw2022a-ms98-a-0005) (sw2028a-ms98-a-0049) (sw2234a-ms98-a-0007) (sw2245a-ms98-a-0166) (sw2259a-ms98-a-0021) (sw2295b-ms98-a-0011) (sw2331a-ms98-a-0049) (sw2389b-ms98-a-0096) (sw2428a-ms98-a-0017) (sw2442b-ms98-a-0059) (sw2451b-ms98-a-0044) (sw2466b-ms98-a-0071)
I cleaned these up by using:
cat 0017_train.trans | uniq >> 0017_train.trans.uniq
That produced a new file that only has unique entries. I did the same thing for the hyp.trans file and created a new file called hyp.trans.uniq. Running sclite again produced another error:
Error: double reference text for id '(sw2245a-ms98-a-0166)'
Two slightly different statements in the training transcript used the same utterance ID, so I removed one of them. This time sclite ran successfully.
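Note that uniq only collapses identical lines that are adjacent, so it cannot catch two different statements that share one utterance ID, which is what caused the "double reference text" error. A possible way to find and remove duplicates by ID instead is sketched below. The function names are hypothetical, and the sketch assumes every transcript line ends with the utterance ID in parentheses.

```shell
# Hypothetical helper: print each utterance ID that appears on more than one line.
dup_utt_ids() {
    awk 'NF { print $NF }' "$1" | sort | uniq -d
}

# Hypothetical helper: keep only the first line seen for each utterance ID,
# preserving the original line order.
dedupe_by_utt_id() {
    awk 'NF && !seen[$NF]++' "$1"
}

# Example use with the file names from this log:
#   dup_utt_ids 0017_train.trans
#   dedupe_by_utt_id 0017_train.trans > 0017_train.trans.uniq
```

Keeping the first occurrence is an arbitrary choice; when two statements with the same ID differ, it is still worth checking by hand which one matches the audio.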
sclite produced the following results:
SYSTEM SUMMARY PERCENTAGES by SPEAKER

,-----------------------------------------------------------------.
|                         hyp.trans.uniq                          |
|-----------------------------------------------------------------|
| SPKR    | # Snt # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
|=================================================================|
| Sum/Avg |  1110 23028 | 63.2   28.4    8.4   10.3   47.0   97.7 |
|=================================================================|
| Mean    |   2.8  58.7 | 62.5   30.2    7.3   16.4   53.9   98.3 |
| S.D.    |   1.8  45.0 | 15.7   14.3    6.1   24.0   27.8    8.4 |
| Median  |   2.0  47.0 | 64.7   27.9    6.8    9.7   48.7  100.0 |
`-----------------------------------------------------------------'
I created a new folder called custom under /mnt/main/corpus/dist. In it I put an original copy of cmudict.0.6d named cmudict.0.6d.original. I also made a copy called cmudict.0.6d.custom and added to it all the words I had to add to the dictionary. The idea is that other students can use this custom dictionary so they do not have to keep re-adding these missing words. Any additional words that they find can be added to this dictionary until we have eventually captured all the missing words. All of the words that I added are listed in the file added_words. I also created a file called letters.phone that has the phones for single letters, which is useful when pronouncing acronyms.
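As an illustration of how a letters file can be used for acronyms, the sketch below builds a dictionary entry by concatenating the phones of each letter. This is an assumption-laden example: spell_acronym is a hypothetical name, and the layout of letters.phone is assumed to be one "LETTER PHONE PHONE ..." entry per line, which may not match the actual file.

```shell
# Hypothetical helper: build a dictionary entry for an acronym from per-letter phones.
# Usage: spell_acronym ACRONYM LETTERS_FILE
spell_acronym() {
    awk -v word="$1" '
        # Load the phones for each letter (assumed format: LETTER PHONE PHONE ...).
        NF > 1 { letter = $1; $1 = ""; sub(/^ +/, ""); phones[letter] = $0 }
        END {
            out = toupper(word)
            n = length(out)
            for (i = 1; i <= n; i++) {
                c = substr(out, i, 1)
                if (!(c in phones)) {
                    printf "no phones for letter %s\n", c > "/dev/stderr"
                    exit 1
                }
                line = line (i > 1 ? " " : "") phones[c]
            }
            print out, line
        }
    ' "$2"
}
```

With letter entries such as "I AY1", "B B IY1" and "M EH1 M", spell_acronym IBM letters.phone would produce the same pronunciation as the IBM entry added by hand earlier in this log. Acronyms that are spoken as words, such as IHOP, still need a hand-written entry.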