Speech:Spring 2014 Ramon Whitman Log/Train


 * Master Script Train Walkthrough

The goal is to create a train from scratch. To do this, we completed a master script, /mnt/main/scripts/user/master_run_train.pl, that walks through the process step by step.



When the master script is executed, a prompt appears that gives you two options: 1. create a master experiment, or 2. create a child experiment in a subfolder of an existing experiment.

Answer "M" for a master experiment or "C" for a child experiment. What this step actually does:


 * Part 1. Getting Information about this Experiment


 * 1) We need to find out whether this experiment is a new MASTER experiment or a new CHILD experiment.
 * 2) A MASTER experiment is one where someone creates a new experiment number in the /mnt/main/Exp directory.
 * 3) A CHILD experiment is one where someone creates a new experiment INSIDE a MASTER experiment, e.g. /mnt/main/Exp/0200/d12/s2000
 * 4) The reason to create a CHILD experiment is to use the same data and models as the MASTER experiment, but change the Density (d) and Senone (s) values in the sphinx_train.cfg file.
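The child experiment layout can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes that the "d12" and "s2000" path components in the example above stand for the Density and Senone values.

```python
def child_experiment_path(master, density, senones, root="/mnt/main/Exp"):
    """Build the directory of a CHILD experiment inside a MASTER experiment.

    Assumption: the child folders are named d<density>/s<senones>,
    as in the example path /mnt/main/Exp/0200/d12/s2000.
    """
    return f"{root}/{master}/d{density}/s{senones}"


print(child_experiment_path("0200", 12, 2000))
# prints /mnt/main/Exp/0200/d12/s2000
```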

Once the experiment is created, the script moves on to Part 2.




 * Part 2. Setting up Experiment Directory


 * 1) Scripts being used: exp_dir_setup.pl OR child_exp_dir_setup.pl
 * 2) From this point on, we first have to check whether the user chose a new MASTER or CHILD experiment.
 * 3) This matters because the information we have to give to sphinx_train.cfg and other important areas is completely different in each case.
 * 4) The script then follows the steps for whichever experiment type the user chose and adjusts accordingly.




 * Part 3. Configuring the Sphinx Configuration CFG File


 * 1) Scripts being used: exp_sphinx_config.pl OR child_exp_sphinx_config.pl
 * 2) This part is configuring the sphinx_train.cfg file.




 * Part 4. Generate the Transcripts
 * 1) Need to move to the base experiment folder we just created.
 * 2) Scripts being used: genTrans5.pl
 * 3) Arguments:
 * 4) REQUIRED: Transcript Dictionary name (e.g. first_5hr/train, 10hr/train, tiny/train)



We now need to generate the transcripts to be used.

Transcripts consist of two portions:

 * 1) The text transcript file: _train.trans
 * 2) The audio file ID list, which contains the list of audio files that make up the transcript: _train.fileids

To generate these, we need to do the following:

Determine a corpus subset to use. The main corpus subsets are found in /mnt/main/corpus/switchboard/. These directories represent different corpus subsets to use for various stages of making and testing models. Each of those directories contains both audio files and textual transcripts (though neither is in a format that we can use directly). For example, /mnt/main/corpus/switchboard/mini/ contains "./dev", "./eval", and "./train". "./train" would be used for training and "./eval" would be used for evaluating the resulting model in a subsequent experiment.

Note: Do not use /mnt/main/corpus/switchboard/mini/dev; it references missing audio files, which causes issues.

Pick one of the following corpus subsets to use:


 * 1) 100hr
 * 2) 10hr
 * 3) 308hr
 * 4) 3170
 * 5) 50hr
 * 6) first_5hr
 * 7) full
 * 8) last_5hr
 * 9) mini
 * 10) mini2
 * 11) tiny




 * Part 5. Create the Dictionary and Insert SIL Into


 * 1) Need to move to the /etc directory before doing anything.
 * 2) Scripts being used: pruneDictionary2.pl and genPhones.csh
 * 3) cd etc
 * 4) /mnt/main/scripts/train/scripts_pl/pruneDictionary2.pl _train.trans /mnt/main/corpus/dist/cmudict.0.6d .dic
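The idea behind pruning can be sketched in Python. This is not the code of pruneDictionary2.pl, which is not shown here; it is a minimal sketch that assumes pruning means keeping only the master dictionary entries whose word appears in the transcript. It also assumes the usual Sphinx transcript layout of "<s> WORDS </s> (utterance id)" and CMUdict-style alternate entries such as WORD(2). Both function names are hypothetical.

```python
import re


def transcript_words(transcript_lines):
    """Collect the words used in a transcript.

    Assumption: each line looks like "<s> SOME WORDS </s> (utterance-id)".
    """
    words = set()
    for line in transcript_lines:
        line = re.sub(r"\([^()]*\)\s*$", "", line)  # drop the trailing "(utterance id)"
        for token in line.split():
            if token not in ("<s>", "</s>"):
                words.add(token.upper())
    return words


def prune_dictionary(master_lines, words):
    """Keep only the dictionary entries whose word is in `words`."""
    kept = []
    for line in master_lines:
        parts = line.split()
        if not parts:
            continue
        # Treat an alternate pronunciation such as WORD(2) as the word WORD.
        head = re.sub(r"\(\d+\)$", "", parts[0])
        if head.upper() in words:
            kept.append(line.strip())
    return kept
```

The result would be written to the experiment dictionary, one entry per line.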



There are multiple dictionaries that you can choose from:


 * 1) cmudict.06d
 * 2) cmudict.0.7a
 * 3) cmudict.0.7c
 * 4) cmudict.0.6d
 * 5) cmudict.0.7b
 * 6) cmudict.0.7d





Generate the phone list.

Phones are the smallest components of a phonetic transcription code (such as Arpabet); they represent how each part of a word sounds.
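As a rough illustration of what a phone list is, the following Python sketch collects every distinct phone used in a dictionary. It is not the code of genPhones.csh. The function name is hypothetical, and adding SIL through the `extra` parameter is an assumption based on the title of this part.

```python
def phone_list_from_dictionary(dictionary_lines, extra=("SIL",)):
    """Return the sorted set of phones used in the dictionary entries.

    Each entry is "WORD PHONE PHONE ...". `extra` holds phones that never
    appear in a pronunciation but are assumed to be needed (here: SIL).
    """
    phones = set(extra)
    for line in dictionary_lines:
        parts = line.split()
        phones.update(parts[1:])  # parts[0] is the word itself
    return sorted(phones)


print(phone_list_from_dictionary(["SAW S AO1", "MILL M IH1 L"]))
# prints ['AO1', 'IH1', 'L', 'M', 'S', 'SIL']
```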






 * Start the Train!

NOW: We can finally start the train.

Run the following in your Base experiment folder:

% /mnt/main/scripts/train/scripts_pl/RunAll.pl

The first thing the trainer does is verify that it has everything it needs to build a model. It checks that:

 * 1) The transcript list is valid: it can find the audio files the transcript references, and vice versa.
 * 2) The experiment dictionary contains all the words used in the transcript.
 * 3) All phones used in the dictionary are defined in the .phone file.
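The "and vice versa" check between the transcript and the audio file list can be approximated before starting the train. The Python sketch below is not the trainer's own code. It assumes that each transcript line ends with "(utterance id)" and that the last path component of each .fileids line equals that id; the function names are hypothetical.

```python
import re


def utterance_ids(transcript_lines):
    """Extract the "(utterance id)" that is assumed to end each transcript line."""
    ids = []
    for line in transcript_lines:
        match = re.search(r"\(([^()]+)\)\s*$", line)
        if match:
            ids.append(match.group(1))
    return ids


def cross_check(transcript_lines, fileid_lines):
    """Return (ids only in the transcript, ids only in the file list)."""
    in_transcript = set(utterance_ids(transcript_lines))
    in_fileids = {
        line.strip().rsplit("/", 1)[-1]  # keep the last path component
        for line in fileid_lines
        if line.strip()
    }
    return sorted(in_transcript - in_fileids), sorted(in_fileids - in_transcript)
```

Two empty lists mean the transcript and the file list agree with each other.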

Please note: Trains will usually fail the first time you execute RunAll.pl! The trainer reports what is wrong on the terminal and also in an HTML file located in the experiment base directory. The file is called .html and can be opened with

% lynx .html

Resolve the issues and execute RunAll.pl again.

Usually these initial errors are caused by the trainer finding a word that is used in the transcript but not defined in the experiment dictionary (.dic). To resolve this, follow the instructions in the next section.

Trains will usually run for 1.5 to 2 times the length of the audio data provided, though this isn't an exact rule. Unlike the decoder, the trainer outputs a steady flow of status messages to the terminal, and it also writes them to the .html file in the base experiment directory for future reference. It runs a series of stages it calls "modules", numbered from 1 to 99: module 1 is the error checking, and module 99, the final one, is the Sphinx-II model conversion (which you have disabled by editing the sphinx_train.cfg file, right?). The most time-consuming portions of the train, the actual model-building parts, are modules 40-49. Please note that some module numbers are skipped, so there may not actually be 99 individual modules.
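The rule of thumb above is simple arithmetic. A small Python helper (hypothetical name, and only as exact as the rule itself) makes it easy to plan how long a server will be busy:

```python
def estimated_train_hours(audio_hours, low=1.5, high=2.0):
    """Rough run-time range for a train: 1.5 to 2 times the audio length."""
    return (audio_hours * low, audio_hours * high)


print(estimated_train_hours(10))
# prints (15.0, 20.0), i.e. a 10-hour corpus subset takes roughly 15 to 20 hours
```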

After completion, you have successfully created the acoustic model!

Common issues experienced while training and how to resolve them:


 * Issue 1:

The Trainer cannot find words referenced in the transcript within the dictionary!

 * Symptoms:

When looking at the trainer output (both the terminal output and the .html logfile), you will see errors similar to:

WARNING: This word: DUCTWORK was in the transcript file, but is not in the dictionary (<s> WHICH IS TOTALLY LEGAL BUT THE COST OF DOING THIS IS ASTRONOMICAL THEY ACTUALLY SHAVE UP DUCTWORK AND THINGS AND SO WE'RE UH VERY VERY UH COGNIZITIVE AND AWARE OF ALL THESE TYPE OF UH </s> ). Do cases match?


 * Summary:

This is perhaps the most common issue when starting a train. The transcripts will contain words which aren't found in the master dictionary (/mnt/main/corpus/dist/cmudict.0.6d), or even words which aren't spelled correctly!


 * Solution:

There are three steps needed to resolve this issue:

 * 1) Getting a list of words to be added to the dictionary.
 * 2) Generating the dictionary entries for these words.
 * 3) Inserting these entries into the experiment dictionary.


 * 1. Getting the list of words to be added to the dictionary.

The Sphinx trainer is fairly vocal in regards to missing word errors. It will spit out this list onto the terminal before quitting. That being said, if there are many words that are missing, the list may be longer than the terminal client's buffer, effectively cutting it off partway.

Thankfully, the trainer creates an HTML logfile in your base experiment directory with the name .html. This document contains everything that the trainer output to the screen. To take a look at it, use the terminal-based web browser lynx.

% lynx .html

Use the up and down arrows to scroll up and down. Press q then y to exit lynx.

The list of words that caused the issue is usually at the bottom of the output.

Please note: Each time you run the Sphinx trainer, the output is added to the end of this document. In other words, to get the list of words preventing the last executed train from running, scroll all the way down!

Before proceeding to the next step:

Open a text editor on your local desktop (Notepad, for example). Copy each word from the terminal and paste it into this document. Please note that some of these words may be slightly misspelled, so copy-paste is recommended over retyping.
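If the list is long, the words can also be pulled out of the log text automatically. The Python sketch below relies only on the wording of the warning shown under Symptoms; the function name is hypothetical, and it assumes the log text has already been read into a string. Because the logfile accumulates the output of every run, the result may include words from earlier runs that have since been fixed.

```python
import re

# Pattern taken from the warning text shown in the Symptoms section.
MISSING_WORD = re.compile(
    r"WARNING: This word: (\S+) was in the transcript file, but is not in\s+the dictionary"
)


def missing_words(log_text):
    """Return each missing word once, in the order it first appears."""
    seen = []
    for word in MISSING_WORD.findall(log_text):
        if word not in seen:
            seen.append(word)
    return seen
```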


 * 2. To get the phonetic spelling for a word:

There are three ways to get the phonetic spelling:

 * 1) Search for the word in the CMU Pronouncing Dictionary. Be sure to click on the "Show lexical stress" check-box before searching! The trainer expects these lexical stress indicators, which are the numbers 0 through 2 attached to the end of certain phones; they slightly modify how the phone is pronounced. If you are trying to find a number, type the number out as a word instead of as numeric characters (e.g. "seven" instead of "7"). Also, do not include the periods that the dictionary puts at the end of each word! They will cause the trainer to error out.
 * 2) Generate the phonetic spelling based on similar words. This method is especially useful for compound words. For example, to create the phonetic spelling for Sawmill, get the phonetic spellings of Saw (S AO1) and Mill (M IH1 L) from the CMU Pronouncing Dictionary and concatenate them to form S AO1 M IH1 L.
 * 3) Generate the phonetic spelling yourself. This way is a bit harder; I only recommend it if you can't find the word using the previous methods. Get the IPA spelling from a good dictionary. Then, using the IPA to Arpabet phoneme comparison list, translate each IPA symbol from the dictionary to the matching Arpabet symbol. You will need to add the stress values at the end of each stressed syllabic vowel.
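The compound-word approach amounts to joining known pronunciations end to end. A minimal Python sketch, with a hypothetical function name and a lookup table that holds only the two entries quoted above:

```python
def compound_pronunciation(parts, pronunciations):
    """Concatenate the known pronunciations of the parts of a compound word."""
    return " ".join(pronunciations[part.upper()] for part in parts)


known = {"SAW": "S AO1", "MILL": "M IH1 L"}
print("SAWMILL", compound_pronunciation(["saw", "mill"], known))
# prints SAWMILL S AO1 M IH1 L
```

The stress values are copied unchanged from each part, so it is worth reviewing them afterwards.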

For each word for which you now have a pronunciation, add an entry to a new file, either on the remote machine (call it add.txt or something like that; it's needed for the second dictionary update method) or on your local desktop (best for the first dictionary update method). The dictionary file is in the following format:

SOUTHBEND S AW1 TH B EH1 N D
VOCALIZED V OW1 K AH0 L AY2 Z D
MOOSEWOOD M UW1 S W UH2 D
UNDERGRAD AH1 N D ER0 G R AE1 D
GTE JH IY1 T IY1 IY1
MARYLANDER M EH1 R IY0 L AE2 N D ER0
MARYLANDER'S M EH1 R IY0 L AE2 N D ER0 Z
PLANOITE P L EY1 N OW0 AY0 T
DADGUM D AE1 D G AH1 M
EXPERIENCEWISE  IH0 K S P IH1 R IY0 AH0 N S W AY1 Z
CANSEGO  K AE1 N S EY1 G OW1
HOPELY HH OW1 P L IY0
STORLY S T AO1 R L IY0
KID'LL K IH1 D L

Notice how the entries in the dictionary are:

 * 1) Entirely upper-case.
 * 2) One word entry per line.
 * 3) Separated by a space (or two) between the grammatical spelling of the word and the first phone of its phonetic spelling.
 * 4) Written with stress indicators at the end of vowel phones, which are numbers ranging from 0 to 2.

This format is crucial; deviating from it is not recommended!
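The format rules above can be checked mechanically before touching the dictionary. The following Python sketch is not part of the training scripts. The phone sets are an assumption: they are the 39 Arpabet phones used by the CMU Pronouncing Dictionary, and the function name is hypothetical.

```python
import re

# Assumption: the phone set is the 39-phone Arpabet set used by CMUdict.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}
CONSONANTS = {"B", "CH", "D", "DH", "F", "G", "HH", "JH", "K", "L", "M", "N",
              "NG", "P", "R", "S", "SH", "T", "TH", "V", "W", "Y", "Z", "ZH"}


def entry_problems(line):
    """Return a list of format problems found in one dictionary entry."""
    parts = line.split()
    if len(parts) < 2:
        return ["entry needs a word followed by at least one phone"]
    problems = []
    if line != line.upper():
        problems.append("entry is not entirely upper-case")
    for phone in parts[1:]:
        match = re.fullmatch(r"([A-Za-z]+)([0-9]?)", phone)
        base = match.group(1).upper() if match else None
        stress = match.group(2) if match else ""
        if base in VOWELS:
            if stress not in ("0", "1", "2"):
                problems.append(f"vowel {phone} needs a stress indicator 0-2")
        elif base in CONSONANTS:
            if stress:
                problems.append(f"consonant {phone} must not carry a stress indicator")
        else:
            problems.append(f"unknown phone {phone}")
    return problems
```

An empty list means the entry passed every check in this sketch; it does not guarantee that the pronunciation itself is right.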

Important! Always keep a record of all additions you make to the dictionary! We can add them to the master dictionary, thus creating fewer problems for others when they try to run trains! Include this list along with the results of your experiment!


 * 3. To add the updated word list to the dictionary:

There are two ways we can proceed. The first is easiest if you have only a few additions and aren't updating any existing pronunciations. The second is less tedious and repetitive than the first, and thus MUCH more practical for adding lots of new dictionary entries. It also includes more error checking, as it looks for redundant dictionary entries in both the addition list and the dictionary; however, it requires more prep work.


 * Method 1: Use built-in Unix commands.

Go to your experiment's etc directory if you aren't already there. Make an initial backup of your dictionary. This step is optional, but is highly recommended in case you need to start over!

% cp ./.dic ./.dic.backup

Rename the dictionary file by executing the following in your experiment's etc directory:

% mv ./.dic ./.dic.old

For each line of pronunciations to be added, execute:

% echo " " >> <experiment #>.dic.old

This will append each line to the bottom of the dictionary. You really should only do this one entry at a time. IMPORTANT: Ensure that each line you enter follows the format described above! The trainer will NOT accept the newly added words otherwise.

Now sort the updated dictionary alphabetically by executing:

% sort <experiment #>.dic.old >> <experiment #>.dic

Now you have a nice updated dictionary! Start the train again (RunAll.pl) and repeat the process if necessary.


 * Method 2: Use updateDict.pl

See Speech:Spring_2013_updateDict.pl or execute % updateDict.pl -h for more information on the script and its usage. This script essentially merges two separate dictionary files together.

Make a new directory called "temp" (the name itself doesn't matter) in your <experiment #>/etc folder:

% mkdir temp

Move the dictionary file into the newly created "temp" directory:

% mv <experiment #>.dic temp

Go into <experiment #>/etc/temp:

% cd temp

Create or move the addition text file into etc/temp. Ensure that it is in the same format as the dictionary (see above). Copy the updateDict.pl script into etc/temp:

% cp -i /mnt/main/scripts/user/updateDict.pl .

Execute updateDict.pl

% ./updateDict.pl -m <experiment #>.dic <addition List>

The '-m' argument (short for 'merge') is required; not supplying it will result in script failure. updateDict.pl assumes that the dictionary will be given first, followed by the addition list. Reversing this order is not a good idea. After updateDict.pl is done, move the dictionary back to the level above.

% mv <experiment #>.dic ..

You may notice that updateDict.pl by default creates a new file called <experiment #>.dic.old in its current directory. This isn't truly a "new" file, but rather a backup of the initial dictionary you started with. It's useful in case you did something wrong and need to start over. The addition file is not edited by the program, and thus no backup file is needed.

This script is useful when updating existing dictionary entries as well. Just put the updated entry (with the updated pronunciation) into the addition text file. When updateDict.pl sees a redundant entry in the addition file but with a different pronunciation, it will prompt you as to which one to keep. You can force the script to assume that the pronunciation in the addition file is correct by adding 'f' to the list of parameters.
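The merge behaviour described above can be pictured with a short Python sketch. This is not the code of updateDict.pl. All names are hypothetical, the real script prompts on conflicts whereas this sketch only reports them, and the `force` flag merely mimics the "trust the addition file" behaviour.

```python
def parse_entries(lines):
    """Map each word to its pronunciation; a later duplicate overrides an earlier one."""
    entries = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            entries[parts[0]] = " ".join(parts[1:])
    return entries


def merge_dictionaries(dictionary_lines, addition_lines, force=False):
    """Merge additions into the dictionary and return (sorted lines, conflicting words)."""
    merged = parse_entries(dictionary_lines)
    conflicts = []
    for word, phones in parse_entries(addition_lines).items():
        if word in merged and merged[word] != phones:
            conflicts.append(word)
            if not force:
                continue  # keep the existing pronunciation
        merged[word] = phones
    lines = [f"{word} {phones}" for word, phones in sorted(merged.items())]
    return lines, conflicts
```

Returning the conflicts separately lets the caller decide what to do with them instead of silently overwriting an entry.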


 * Issue 2:

make_feats.pl and/or genTrans2.pl are Erroring out! Hey, there aren't any wavefiles in my <experiment #>/wav directory either!


 * Symptoms:

make_feats.pl is giving you an error message like:

INFO: fe_sigproc.c(771): Will not use double bandwidth in mel filter
INFO: wave2feat.c(139): /mnt/main/Exp/0030/wav/sw2001B-ms98-a-0012.sph
ERROR: "wave2feat.c", line 655: Cannot read /mnt/main/Exp/0030/wav/sw2001B-ms98-a-0012.sph
FATAL_ERROR: "wave2feat.c", line 90: error converting files...exiting

genTrans2.pl is giving you:

Error executing: <long Sox command> Is sox installed?

There aren't any wavefiles in <experiment #>/wav after running the old genTrans.pl script (not genTrans2.pl).


 * Summary:

The following servers are known to be affected by this:

Miraculix

If you experience any of the above symptoms, it is likely that Sox isn't installed on your specific server. Sox is used to extract wave files from the corpus's .sph files. genTrans.pl had an annoying issue: it expected Sox to be installed already and ignored any errors from attempting to execute it. As a result, the script would appear to run successfully without actually creating any of the audio files for the experiment. As make_feats.pl needs these audio files, it would error out.


 * Solution:

There are a few solutions to this issue:

 * 1) To prevent issues later in the training process, use genTrans2.pl. genTrans2.pl performs all the functions of genTrans.pl, except that it will stop itself and warn the user if Sox errors out. See [Speech:Spring_2013_genTrans2.pl] for more information. If you experience an error in genTrans2.pl, continue to the following steps.
 * 2) If make_feats.pl (or genTrans2.pl) errors out, verify that Sox is installed on the machine. Simply execute the following command:

% sox

If you get a command-not-found error, then Sox isn't installed. If it prints out a usage sheet, then Sox is installed.

 * 3) If Sox isn't installed: re-run the "Generate the transcript and its associated audio-file list" step while on a server that has Sox installed. You could use Caesar for this. Once you are done running genTrans2.pl, go back to your original server!
 * 4) If Sox is installed and you are still having issues: contact the modelling team.
 * 5) Once genTrans2.pl has run successfully, verify that you have wave files in <experiment #>/wav. If you do, re-run the "Generate Feats data" step, then start the train (assuming you have completed all the previous steps).

If you are going through this process after you had run the original genTrans.pl script, please note that you do not need to restart the experiment. Even though genTrans.pl didn't make the wave files for the experiment, it did everything else it needed to do, such as making the transcript.
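Both checks (is Sox on the path, and did the audio files actually get created) can be scripted. The Python sketch below uses only the standard library. The function names are hypothetical, and the file suffixes are an assumption: adjust them to whatever your wav directory is expected to contain.

```python
import os
import shutil


def sox_installed():
    """True if an executable named "sox" is found on the PATH."""
    return shutil.which("sox") is not None


def count_audio_files(wav_dir, suffixes=(".sph", ".wav")):
    """Count files in wav_dir whose names end with one of the suffixes.

    Assumption: the experiment's audio files use one of these suffixes.
    Returns 0 if the directory does not exist.
    """
    if not os.path.isdir(wav_dir):
        return 0
    return sum(1 for name in os.listdir(wav_dir) if name.lower().endswith(suffixes))
```

A count of 0 after running the transcript generation step is the sign that the audio extraction silently failed.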


 * Issue 3:

The Trainer can't find phones used in the dictionary!

 * Symptoms:

The Sphinx trainer is giving you errors similar to:

WARNING: This phone (AA) occurs in the dictionary (/mnt/main/Exp/0025/etc/0025.dic), but not in the phonelist (/mnt/main/Exp/0025/etc/0025.phone)


 * Summary:

Essentially, the issue is what the error message suggests: you use a phone in the dictionary that isn't in the phone list. This usually occurs after adding new entries with invalid phones to the experiment dictionary.
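The same comparison the trainer makes can be run by hand to see every offending entry at once. A minimal Python sketch with a hypothetical function name, assuming the dictionary lines and the phone list have already been read into lists:

```python
def phones_not_in_list(dictionary_lines, phone_list):
    """Map each phone missing from the phone list to the words that use it."""
    known = set(phone_list)
    missing = {}
    for line in dictionary_lines:
        parts = line.split()
        for phone in parts[1:]:  # parts[0] is the word itself
            if phone not in known:
                missing.setdefault(phone, []).append(parts[0])
    return missing
```

For example, an entry that uses AO without a stress indicator would be reported under the key "AO" when the phone list only contains AO0, AO1 and AO2.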

 * Solution:


 * Step 1:

For each phone that is having the issue:

 * 1) If the phone is a vowel, make sure it has a stress indicator at the end!
 * 2) Verify that the given phone is valid. Reference the phone list on Wikipedia's Arpabet page.


 * Step 2:

Once you determine what is wrong with the phones listed by the trainer:

 * 1) Open the experiment dictionary in a text editor. Remember, the dictionary is in your experiment's etc directory and has the name <experiment #>.dic
 * 2) Search for each instance of the phone given. In Vi, you can do this by hitting the forward-slash (/) key, typing in the search term, and pressing "Enter". Pressing the "n" key searches forward; pressing "Shift" and "n" searches backwards. Tip: Add a space at the end of the phone while searching. It will eliminate almost all results within the grammatical part of the dictionary entries.
 * 3) For each instance of the provided phone, fix the phone as determined in Step 1.
 * 4) Once you are finished fixing all the appropriate phones, restart the RunAll.pl script to retry the train. Repeat the above process if necessary.