Speech:Summer 2012 Create Experiment



Creating and Running an Experiment
This guide provides step-by-step instructions for creating corpora, setting up an experiment, running a train, running a decode, and scoring the results of a decode.

Corpus Setup
This part is not necessarily required for setting up an experiment as a corpus may have already been created. If so, the experiment just needs to use the appropriate corpus depending on the purpose of the experiment. If a corpus needs to be set up, this guide will help.

Currently the corpora are located under /mnt/main/corpus/switchboard. All the corpora used by the various experiments are created there. A corpus contains several subfolders:
 * train - for running a train and testing a train
 * dev - for testing training and decode settings
 * eval - used to evaluate training and decoding settings. This serves to confirm that we are not "tuning on our training data" in dev. The error rate should be about the same for dev as for eval; if it is not, we are most likely tuning to our dev data.

Each of these subfolders has two additional subfolders:
 * trans - contains the transcript
 * wav - contains the sph files referenced by the transcript
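The layout above can be created in one pass. This is only a sketch: the root path /tmp/corpus stands in for the real /mnt/main/corpus tree, and mini is just an example corpus name.

```shell
# Sketch: create the standard corpus skeleton - train/dev/eval,
# each with trans/ and wav/ subfolders. Paths are illustrative.
CORPUS_ROOT=/tmp/corpus/switchboard
for split in train dev eval; do
  mkdir -p "$CORPUS_ROOT/mini/$split/trans" "$CORPUS_ROOT/mini/$split/wav"
done
ls "$CORPUS_ROOT/mini/dev"
```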

To create a corpus to be used, these steps need to be followed:
 * Create the corpus folder under switchboard - for example, I used mini for an hour's worth of training.
 * Create the appropriate subfolder. train had already been created for training, so I created dev.
 * cd to dev and create the folders trans and wav.
 * Now we need to grab part of the master transcript so we can evaluate a smaller portion.
 * The createTranscript.pl script is used to accomplish this.
 * createTranscript.pl has 4 parameters:
 *  - the location of the master transcript
 *  - the location to save the new transcript
 *  - the amount of dialog to grab (in seconds)
 *  - how much of the transcript (in seconds) the script should skip before beginning.
 * To create a transcript that contains an hour's worth of spoken dialog, execute the following command: /mnt/main/scripts/user/createTranscript.pl /mnt/main/corpus/dist/Switchboard/transcripts/ICSI_Transcriptions/trans/icsi/ms98_icsi_word.text /mnt/main/corpus/switchboard/mini/dev/trans/train.trans 3600 3600
 * This reads the master transcript and places a copy of the new transcript in the mini/dev/trans/ folder. The new transcript skips over the first hour of dialog and grabs the following hour's worth.
 * Now that the transcript has been created for use with our experiments, we need to grab the corresponding sph files that the transcript calls for.
 * The copySph.pl script will accomplish this task
 * This script has one parameter:
 *  - the directory that has the corpus we need to grab sph files for
 * To grab the sph files for this transcript, execute the following command: /mnt/main/scripts/user/copySph.pl /mnt/main/corpus/switchboard/mini/dev
 * If you receive an error indicating no such file or directory, the sph file the transcript references does not exist. Unfortunately, there appear to be some missing files. To correct the issue, open the transcript file in the corpus directory, locate the lines that refer to that sph file, and delete them.
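The last step - deleting transcript lines that reference missing sph files - can be scripted. The following is only a sketch, not one of the site's scripts: it assumes each transcript line begins with an utterance ID, and that the conversation name before the first hyphen (e.g. sw2001a) matches the sph file name. All paths and data below are stand-ins.

```shell
# Sketch: drop transcript lines whose referenced .sph file is missing.
# /tmp/demo_corpus stands in for the real corpus directory.
DEV=/tmp/demo_corpus/dev
mkdir -p "$DEV/trans" "$DEV/wav"
printf 'sw2001a-ms98-a-0001 hello there\nsw2002a-ms98-a-0001 good bye\n' \
  > "$DEV/trans/train.trans"
touch "$DEV/wav/sw2001a.sph"          # sw2002a.sph is deliberately missing

while read -r line; do
  conv=${line%%-*}                    # e.g. sw2001a from sw2001a-ms98-a-0001
  [ -f "$DEV/wav/$conv.sph" ] && printf '%s\n' "$line"
done < "$DEV/trans/train.trans" > "$DEV/trans/train.trans.clean"

cat "$DEV/trans/train.trans.clean"   # only the sw2001a line survives
```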

Experiment Setup and Running the Train
The train itself is ultimately run with /mnt/main/scripts/train/scripts_pl/RunAll.pl; the steps below prepare the experiment first.
 * First create a new experiment folder under /mnt/main/Exp
 * cd Exp#
 * Execute the script that copies over the relevant files and creates the required folders by typing /mnt/main/root/tools/SphinxTrain-1.0/scripts_pl/setup_SphinxTrain.pl -task <exp#>
 * Next we need to modify the sphinx_train.cfg to work with the new experiment:
 * cd etc
 * edit sphinx_train.cfg (with vi, nano, etc)
 * Set $CFG_DB_NAME = "<exp#>"; (where <exp#> is your experiment number)
 * Set $CFG_BASE_DIR = "/mnt/main/Exp/<exp#>";
 * Set $CFG_SPHINXTRAIN_DIR = "/mnt/main/Exp";
 * Save the file and exit the editor
 * Now we need to copy over the scripts that we need
 * cp -i /mnt/main/scripts/user/genPhones.csh .
 * cd ..
 * Now we need to convert the transcript in the corpus directory to be in a format that training can use by executing genTrans.pl. This script takes 2 parameters:
 *  - the base directory that has the trans and wav folders
 *  - the experiment we are using
 * To set up experiment 0015 for example, type the following from the top level of exp 0015: /mnt/main/scripts/user/genTrans.pl /mnt/main/corpus/switchboard/mini/dev 0015
 * This may take some time to process if this is a long transcript.
 * Now we need to create a custom dictionary. We do not need all of the words in the master dictionary, so we prune it down to just the words that appear in the transcript, which speeds up training and decoding.
 * The script pruneDictionary.pl accomplishes this task and it takes 3 arguments:
 * - location of the input transcript (the one in the etc folder created by genTrans.pl)
 * - location of the master dictionary
 *  - name of the pruned dictionary - should be exp#.dic in the etc folder of your experiment
 * To create a custom dictionary for exp 0015, execute the following command from the etc dir of your experiment: /mnt/main/scripts/train/scripts_pl/pruneDictionary.pl 0015_train.trans /mnt/main/corpus/dist/cmudict.0.6d 0015.dic
 * This will take some time.
 * IMPORTANT: Be sure to use cmudict.0.6d for the master dictionary and NOT cmudict.06d or you will not be able to run the train or decode.
 * Once the dictionary has been created, copy over the filler dictionary: cp -i /mnt/main/root/tools/SphinxTrain-1.0/train1/etc/train1.filler <exp#>.filler
 * Where <exp#> is your experiment number.
 * Now that the dictionary has been prepared, the file that contains all the phones used in the dictionary needs to be built:
 * In the etc folder of the experiment execute the following command:
 * ./genPhones.csh <exp#>
 * Where <exp#> is your experiment number.
 * Edit the file it created - exp#.phone - and insert SIL where it belongs alphabetically. Then save and close the file.
 * Now the feats data needs to be built
 * Under the base experiment folder execute the following: /mnt/main/scripts/train/scripts_pl/make_feats.pl -ctl /mnt/main/Exp/<exp#>/etc/<exp#>_train.fileids
 * Where <exp#> is your experiment number.
 * Now you can run the train.
 * IMPORTANT: If you are using the training data (Acoustic Model) from another experiment stop here and skip to the section that creates the Language Model.
 * Now all that is left to do is build the Acoustic Model by executing the following from the base directory of your experiment: /mnt/main/scripts/train/scripts_pl/RunAll.pl
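As background for the dictionary step above: the idea behind pruneDictionary.pl (this is a sketch of the idea, not its actual implementation) is to keep only the master-dictionary entries whose head word occurs somewhere in the transcript. File names and contents below are stand-ins.

```shell
# Sketch of dictionary pruning: keep entries whose word appears in the
# transcript. The dictionary and transcript contents here are toy data.
cd /tmp
printf 'HELLO HH AH L OW\nWORLD W ER L D\nZEBRA Z IY B R AH\n' > master.dic
printf 'utt1 HELLO WORLD\n' > 0015_train.trans   # fields after the ID are words

# Pass 1: record every word in the transcript. Pass 2: print only
# dictionary lines whose head word was seen.
awk 'NR==FNR { for (i=2; i<=NF; i++) seen[$i]=1; next } $1 in seen' \
    0015_train.trans master.dic > 0015.dic
cat 0015.dic    # HELLO and WORLD survive; ZEBRA is pruned
```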

Create the Language Model

 * Setup the Language Model folder and copy over the unedited transcript
 * From your base Experiment folder make a folder called LM
 * cd LM
 * Copy over the transcript used from the corpus directory:
 * cp -i /mnt/main/corpus/switchboard/mini/dev/trans/train.trans trans_unedited
 * Execute the script that parses the transcript into the form the language model tools expect:
 * /mnt/main/corpus/dist/Switchboard/transcripts/ICSI_Transcriptions/trans/icsi/ParseTranscript.perl trans_unedited trans_parsed
 * Copy the script that creates the language model into the LM folder: cp -i /mnt/main/scripts/user/lm_create.pl .
 * Execute the script: ./lm_create.pl trans_parsed
 * The Language Model has been created
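lm_create.pl is a site script, so the following is only an illustration of the kind of statistics a language model is built from: word-frequency counts over the parsed transcript. The file contents below are stand-ins.

```shell
# Illustration only: unigram counts of the sort an LM tool starts from.
# trans_parsed here is toy data; the real file comes from ParseTranscript.perl.
cd /tmp
printf 'i think so\ni think not\n' > trans_parsed
tr ' ' '\n' < trans_parsed | sort | uniq -c | sort -rn
```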

Run Decode

 * Create the decode folder from your base Experiment directory
 * mkdir DECODE
 * cd DECODE
 * cp -i /mnt/main/scripts/user/run_decode.pl .
 * run_decode.pl takes two arguments:
 *  - your current experiment number
 *  - the experiment that has the training (Acoustic Model)
 * The following command runs the decode for experiment 0015 using the Acoustic Model from experiment 0012: ./run_decode.pl 0015 0012
 * Once you execute the command it may take some time depending on how large the transcript is.
 * Once completed you should have a file called decode.log - this will be used for scoring.

Scoring the Experiment
 * We use sclite to "score" the decode - that is, to determine how well it predicted what was spoken - by comparing its output (decode.log) with the original transcript.
 * The decode.log file has a lot of extra info in it that we do not need, so we extract the useful information from it. That is where the parseDecode.pl script comes in.
 * parseDecode.pl takes in two arguments:
 * - the location of the decode.log file
 * - the name of the hypothesis transcript to create
 * From the DECODE folder, execute this command: /mnt/main/scripts/user/parseDecode.pl decode.log ../etc/hyp.trans
 * This will place the hypothetical transcript hyp.trans in the etc folder of your experiment
 * Go to the etc folder - cd ../etc
 * Now we can use sclite to score the results by issuing the following command: sclite -r <exp#>_train.trans -h hyp.trans -i swb >> scoring.log
 * This will output the results to the file scoring.log
 * NOTE: you may get errors like this:
   Error: Not enough Reference files loaded Missing: (sw2259a-ms98-a-0021) (sw2295b-ms98-a-0011) (sw2331a-ms98-a-0049) (sw2389b-ms98-a-0096) (sw2428a-ms98-a-0017) (sw2442b-ms98-a-0059) (sw2451b-ms98-a-0044)
 * This is a list of duplicate utterance IDs. It means there are duplicate entries in the hyp.trans and <exp#>_train.trans files. You need to find these duplicates in both files. The easiest way is to do the following:
   cat hyp.trans | uniq >> hyp.trans_uniq
 * This removes the duplicate entries and saves the result to the file hyp.trans_uniq. You will need to do the same for <exp#>_train.trans.
 * The scoring.log file should have something that looks like this:

SYSTEM SUMMARY PERCENTAGES by SPEAKER

,-----------------------------------------------------------------.
|                         hyp.trans_uniq                          |
|-----------------------------------------------------------------|
| SPKR    | # Snt # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
|=================================================================|
| Sum/Avg |  524  11185 | 45.3   44.6   10.1   11.4   66.1   99.6 |
|=================================================================|
|  Mean   |  2.7   58.0 | 46.4   44.9    8.7   17.3   70.9   99.7 |
|  S.D.   |  1.7   44.0 | 15.4   14.7    7.2   23.9   25.5    3.4 |
| Median  |  2.0   45.0 | 45.5   44.4    7.7   10.4   68.9  100.0 |
`-----------------------------------------------------------------'
 * If it does, you have successfully scored your train/decode.
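For reference, the duplicate-removal trick used during scoring can be seen in miniature below. Note that uniq only removes adjacent duplicate lines, so this approach assumes repeated utterance entries appear back-to-back in the transcript; the data here is a stand-in.

```shell
# Sketch: uniq collapses adjacent duplicate lines, which is what removes
# repeated utterance entries from hyp.trans. Toy data, illustrative IDs.
cd /tmp
printf 'hello (sw2259a-ms98-a-0021)\nhello (sw2259a-ms98-a-0021)\nbye (sw2295b-ms98-a-0011)\n' > hyp.trans
uniq < hyp.trans > hyp.trans_uniq
wc -l < hyp.trans_uniq    # 2 lines remain
```

If duplicates were ever non-adjacent, uniq alone would miss them; sorting first would catch them but scrambles utterance order, so check the files before resorting to that.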