Speech:Exps 0012



Title: Mini Train w/Test on Train #1


Description

Author: Cedric Woodbury

Date: August 10, 2012

Purpose: A training w/ decoding using an hour's worth of dialog (mini train)

Details: Using the steps followed from previous experiments, I was able to set up the experiment in the normal way. However, this time I needed to create and use the transcript and wav files located under /mnt/main/corpus/switchboard/mini/train. I created a new script called createTranscript.pl located under /mnt/main/scripts/user which allowed me to create a transcript based on a specified length of spoken dialog. Once that was created and placed in the trans dir, I made links to the sph files in the wav dir.
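
The listing below is a minimal sketch of the idea behind createTranscript.pl, not the actual script: it assumes each line of the master transcript starts with an utterance id followed by start and end times in seconds, and it copies lines until roughly the requested amount of speech (one hour here) has been collected. The file names and the 3600-second target are illustrative. The sph files for the selected conversations can then be linked into the wav dir as described above.

 #!/usr/bin/perl
 # Hypothetical sketch of pulling a fixed amount of dialog from the master
 # transcript. Assumes lines of the form: <utterance-id> <start> <end> <text>
 # with times in seconds. Paths and the 3600-second target are illustrative.
 use strict;
 use warnings;
 
 my $master = '/mnt/main/corpus/dist/Switchboard/transcripts/ICSI_Transcriptions/trans/icsi/ms98_icsi_word.text';
 my $target = 3600;   # one hour of speech
 my $total  = 0;
 
 open my $in,  '<', $master       or die "Cannot read $master: $!";
 open my $out, '>', 'train.trans' or die "Cannot write train.trans: $!";
 while (my $line = <$in>) {
     my ($id, $start, $end) = split ' ', $line;
     next unless defined $end;      # skip malformed lines
     print $out $line;
     $total += $end - $start;       # running total of speech time
     last if $total >= $target;     # stop once enough dialog is collected
 }
 close $in;
 close $out;
 printf "Wrote %.1f seconds of dialog to train.trans\n", $total;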

After that was completed, I created two scripts, pruneDictionary.pl and dictionary.pl, which build a custom dictionary containing only the words used in that transcript. From there I attempted to run the train, the decode, and the scoring using sclite.
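
As a rough illustration of the pruning idea (the real pruneDictionary.pl and dictionary.pl do this in two stages and use text2wfreq, as described under Results), here is a hypothetical single-script sketch: collect the set of words that occur in the transcript, then keep only the matching entries from the master dictionary. The file names, and the assumption that transcript lines look like <id> <start> <end> <words...>, are illustrative.

 #!/usr/bin/perl
 # Hypothetical sketch: prune the master dictionary down to the words that
 # actually occur in a transcript. File names are illustrative, and alternate
 # pronunciations (entries like WORD(2)) would need extra handling.
 use strict;
 use warnings;
 
 my %wanted;
 open my $trans, '<', 'train.trans' or die "Cannot read train.trans: $!";
 while (<$trans>) {
     my ($id, $start, $end, @words) = split;   # drop the id and times
     $wanted{uc $_} = 1 for @words;            # remember each word once
 }
 close $trans;
 
 open my $master, '<', 'master.dic' or die "Cannot read master.dic: $!";
 open my $out,    '>', 'pruned.dic' or die "Cannot write pruned.dic: $!";
 while (<$master>) {
     my ($word) = split;                       # first field is the word
     print $out $_ if defined $word && $wanted{uc $word};
 }
 close $master;
 close $out;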

Results

  • First I needed to create an hour's worth of dialog:
    • I created a script called createTranscript.pl that pulls a specified amount of dialog from the master transcript file. It is a modified version of the genTrans.pl script. The master transcript is located under /mnt/main/corpus/dist/Switchboard/transcripts/ICSI_Transcriptions/trans/icsi/ms98_icsi_word.text
  • I needed to create a dictionary that only contained the words used in the dialog, which meant creating a pruned version of the master dictionary.
    • I created two scripts, dictionary.pl and pruneDictionary.pl. Both scripts are located under /mnt/main/scripts/train/scripts_pl/
      • dictionary.pl is essentially the same as the existing createdict.pl script. There were some weird escape characters in the original that I couldn't get rid of, so I made the dictionary.pl file instead.
      • pruneDictionary.pl runs text2wfreq, which generates a list of unique words. It then crops out the statement ids and the numbers, leaving just the list of unique (non-repeated) words. The reason for this is that dictionary.pl goes through the entire master dictionary for each word; since the transcript contains thousands of words, that takes too long. By reducing the list to just the words we care about, the run time becomes manageable.
      • dictionary.pl will go through the list of words created by pruneDictionary.pl and generate a dictionary file with the words and their phones.
  • After the dictionary had been completed, I attempted to run the train.
    • I received errors of certain words not being in the dictionary.
    • I attempted to build the phone sequence for each missing word by looking up word pieces in the master dictionary and combining them. Here is what I came up with from the first run:
IBM  AY B IY1 EH1 M
FEDERALES  F EH1 D ER AH0 L IY1 S
DUCTWORK  D AH1 K T W ER1 K
COGNIZITIVE  K AA1 G N AH0 Z IH0 T IH0 V
CHOWPERD CH AW1 P ER0 D
ALBRIDGE A01 L B R IH1 JH
SOUTHBEND S AW1 TH B EH1 N D
VOCALIZED V OW1 K AH0 L AY2 Z D
MOOSEWOOD M UW1 S W UH2 D
UNDERGRAD AH1 N D ER0 G R AE1 D
GTE JH IY1 T IY1 IY1
MARYLANDER M EH1 R IY0 L AE2 N D ER0
MARYLANDER'S M EH1 R IY0 L AE2 N D ER0 Z
PLANOITE P L EY1 N OW0 AY0 T
    • I added these entries to the dictionary and attempted to run the train again. Yet again, it complained about additional missing words, so I added the following words to the dictionary. Where an entry notes CORRECT, I corrected what appeared to be a typo in the transcript itself.
FEDERALDES - CORRECT TO FEDERALES
CHOWPHERD - CORRECT TO CHOWPERD
DADGUM  D AE1 D G AH1 M
EXPERIENCEWISE  IH0 K S P IH1 R IY0 AH0 N S W AY1 Z
CANSEGO  K AE1 N S EY1 G OW1
HOPELY  HH OW1 P L IY0
STORLY  S T AO1 R L IY0
KID'LL  K IH1 D L
REINJERING  R IY2 IH1 N JH ER0 IH0 NG
REINJURING  R IY2 IH1 N JH ER0 IH0 NG
NFL  EH1 N EH1 F EH1 L
PE  P IY1 IY1
UNDERGRADS AH1 N D ER0 G R AE1 D Z
MARYLANDER'S  M EH1 R IY0 L AE2 N D ER0 Z
    • From there I ran the train again. Results under 0012_run1.html. It got past the dictionary piece but failed again.
      • The log shows where it failed:
FATAL: "main.c", line 167: Unable to open /mnt/main/Exp/0012/trees/0012.unpruned/AY-0.dtree for reading; No such file or directory 
      • However, the log also noted that it had processed that file.
      • I tried to fake it out by adding the file AY-0.dtree manually using touch. Still failed.
      • I went into the log for stage 40: Build Trees and found the entry for AY 0, which included this statement:
        INFO: main.c(320): 0 of 1 models have observation count greater than 0.000010
      • That got me thinking that the phone doesn't appear often enough. Sure enough, a grep for that phone returned only one entry: the IBM pronunciation I added. It appears that if a phone is a vowel, it has to carry a stress marker (0, 1, or 2). I will clean up what I have, put a stress marker on all vowels, and run it again to see what happens (a quick check for this kind of mistake is sketched after this list).
      • I modified my dictionary to remove the offending phone, rebuilt the phone list, and ran the train again. This time it worked.
  • I created the language model and ran the DECODE successfully.
  • I ran the parseDecode.pl script on decode.log and output the results to the etc dir as hyp.trans.
  • Then I attempted to score the results using sclite.
    • I copied the 0012_train.trans file to etc to use as the reference file.
    • I ran the command:
sclite -r 0012_train.trans -h hyp.trans -i swb >> scoring.log
    • Executing that command gave me the following error:
sclite: 2.3 TK Version 1.3
Begin alignment of Ref File: '0012_train.trans' and Hyp File: 'hyp.trans'
Error: double reference text for id '(sw2245a-ms98-a-0166)'
Error: Not enough Reference files loaded
Missing:
    (sw2005a-ms98-a-0052)
    (sw2020b-ms98-a-0018)
    (sw2022a-ms98-a-0005)
    (sw2028a-ms98-a-0049)
    (sw2234a-ms98-a-0007)
    (sw2245a-ms98-a-0166)
    • I saw this before when sitting with Prof Jonas. This is caused by duplicate entries in the transcripts. To fix this I executed these commands:
cat 0012_train.trans | uniq >> 0012_train_pruned.trans
cat hyp.trans | uniq >> hyp_pruned.trans
    • This eliminates adjacent duplicate entries. So I ran sclite again using the new pruned files. It failed again, with an error noting that there were two references to the utterance id sw2245a-ms98-a-0166 in the reference file. There are two entries for it, but the sentences are slightly different, so uniq could not remove them. I eliminated one of the sentences and ran sclite again.
    • This worked and it output the results to scoring.log.
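
Since the AY-0.dtree failure above traced back to a hand-added vowel phone with no stress digit, a quick sanity check on the dictionary could have caught it earlier. The following is a hypothetical sketch (the vowel list and file name are illustrative) that flags any CMU-style vowel phone appearing without a trailing 0, 1, or 2:

 #!/usr/bin/perl
 # Hypothetical check: flag vowel phones that are missing their stress digit
 # in a CMU-style dictionary (vowels should normally end in 0, 1 or 2).
 # A bare vowel becomes a separate, rarely seen phone, so its decision tree
 # (e.g. AY-0.dtree) may never get built. Vowel list and file name are
 # illustrative.
 use strict;
 use warnings;
 
 my %vowel = map { $_ => 1 } qw(AA AE AH AO AW AY EH ER EY IH IY OW OY UH UW);
 
 open my $dic, '<', 'pruned.dic' or die "Cannot read pruned.dic: $!";
 while (my $line = <$dic>) {
     my ($word, @phones) = split ' ', $line;
     for my $p (@phones) {
         print "$word: vowel '$p' has no stress marker\n" if $vowel{$p};
     }
 }
 close $dic;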

After building the Language Model I ran the decode without any problems.

I attempted to use sclite to score the results, but ran into a few problems. The transcripts have some duplicate entry lines, which sclite doesn't accept. I removed the duplicate lines and was able to run sclite and save the results to scoring.log under the etc directory.
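
One caveat with the uniq approach is that it only collapses adjacent duplicate lines, and it cannot help when the same utterance id appears twice with slightly different text. A hypothetical filter along the following lines would drop every repeated utterance id regardless of position, keeping the first occurrence; it assumes sclite-style lines that end with the utterance id in parentheses, as in the errors above (the script name is illustrative):

 #!/usr/bin/perl
 # Hypothetical filter: keep only the first line seen for each utterance id.
 # Assumes sclite-style lines ending in "(utterance-id)". The name is
 # illustrative; usage would be something like:
 #   dedupTrans.pl 0012_train.trans > 0012_train_pruned.trans
 use strict;
 use warnings;
 
 my %seen;
 while (my $line = <>) {
     my ($id) = $line =~ /\(([^()]+)\)\s*$/;   # grab the trailing (id)
     next if defined $id && $seen{$id}++;      # skip repeats of the same id
     print $line;
 }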

Summary

I was able to successfully run a train, decode, and scoring on a new section of the transcript. sclite produced the following output:

                     SYSTEM SUMMARY PERCENTAGES by SPEAKER                      

      ,-----------------------------------------------------------------.
      |                        hyp_pruned.trans                         |
      |-----------------------------------------------------------------|
      | SPKR    | # Snt # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
      |---------+-------------+-----------------------------------------|
      | Sum/Avg |  549  10919 | 81.7   11.6    6.7    6.9   25.2   90.3 |
      |=================================================================|
      |  Mean   |  2.9   57.8 | 80.1   13.8    6.1   12.6   32.5   92.6 |
      |  S.D.   |  1.9   45.0 | 14.0   12.3    7.0   21.0   25.9   16.7 |
      | Median  |  3.0   47.0 | 82.9   11.1    5.3    7.1   26.0  100.0 |
      `-----------------------------------------------------------------'

Addendum

The training log 0012.html indicates several different errors:

  • The first error appears several times during the Baum-Welch Gaussian passes:
    ERROR: "backward.c", line 431: final state not reached                
    
ERROR: "baum_welch.c", line 331: sw2092B-ms98-a-0069 ignored
    • The Sphinx FAQ has the following notes on this: Q. During force-alignment, the log file has many messages which say "Final state not reached" and the corresponding transcripts do not get force-aligned. What's wrong?

A. The message means that the utterance likelihood was very low, meaning in turn that the sequence of words in your transcript for the corresponding feature file given to the force-aligner is rather unlikely. The most common reasons are that you may have the wrong model settings or the transcripts being considered may be inaccurate. For more on this go to Viterbi-alignment

  • Some of the normalization processes generated this error a few thousand times:
     ERROR: "gauden.c", line 1700: var (mgau= 362, feat= 0, density=0, component=34) < 0 
    • Based on the info noted below from the Sphinx FAQ, it may be due to a number of reasons:

Q. The first iteration of Baum-Welch through my data has an error:

INFO: ../main.c(757): Normalizing var
ERROR: "../gauden.c", line 1389: var (mgau=0, feat=2, density=176, component=1) < 0

Is this critical?

A. This happens because we use the following formula to estimate variances:

variance = avg(x^2) - [avg(x)]^2

There are a few weighting terms included (the baum-welch "gamma" weights), but they are immaterial to this discussion. The *correct* way to estimate variances is

variance = avg[(x - avg(x))^2]

The two formulae are equivalent, of course, but the first one is far more sensitive to arithmetic precision errors in the computer and can result in negative variances. The second formula is too expensive to compute (we need one pass through the data to compute avg(x), and another to compute the variance). So we use the first one in the sphinx and we therefore get the errors of the kind we see above, sometimes.

The error is not critical (things will continue to work), but may be indicative of other problems, such as bad initialization, or isolated clumps of data with almost identical values (i.e. bad data).

Another thing that usually points to bad initialization is that you may have mixture-weight counts that are exactly zero (in the case of semi-continuous models) or the gaussians may have zero means and variances (in the case of continuous models) after the first iteration.

If you are computing semi-continuous models, check to make sure the initial means and variances are OK. Also check to see if all the cepstra files are being read properly.
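
As a small illustration of why the one-pass formula is fragile (this is just a toy example, not part of the training code): for a clump of nearly identical large values, avg(x^2) and [avg(x)]^2 are two almost-equal large numbers, so their difference can lose all its significant digits and even come out negative, while the two-pass formula stays non-negative.

 #!/usr/bin/perl
 # Toy demonstration (not part of the training code): one-pass vs. two-pass
 # variance on a clump of nearly identical large values.
 use strict;
 use warnings;
 
 my @x = map { 1_000_000 + $_ * 1e-4 } 0 .. 9;   # ten almost-identical values
 my $n = @x;
 
 my ($sum, $sumsq) = (0, 0);
 $sum   += $_      for @x;
 $sumsq += $_ * $_ for @x;
 my $mean     = $sum / $n;
 my $one_pass = $sumsq / $n - $mean * $mean;     # avg(x^2) - [avg(x)]^2
 
 my $two_pass = 0;
 $two_pass += ($_ - $mean) ** 2 for @x;
 $two_pass /= $n;                                # avg[(x - avg(x))^2]
 
 printf "one-pass variance: %.6e\n", $one_pass;  # can be tiny, zero, or negative
 printf "two-pass variance: %.6e\n", $two_pass;  # always >= 0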

  • The Sphinx FAQ can be found here