Speech:Run Decode Unseen Data


Run a Decode on Unseen Data
Read the following before starting:


 * 1) Replace every experiment-number placeholder (written on this page as <exp#> or <experiment #>) with your experiment number!
    * Experiment numbers are 4 digits long (including any leading zeros), from 0001 to 9999.
    * Do not include the '<' or '>'.
 * 2) Similarly, replace all items enclosed in < and > with the appropriate text.
    * Usually it's a filename/path.
    * Do not include the '<' or '>'.
 * 3) Pay attention to which directory you execute scripts in!
    * Certain scripts need to be executed in specific directories.
 * 4) DO copy and paste commands from this page. Do NOT copy and paste multiple commands from this page at once.
    * Most commands/scripts on this page need information specific to your experiment filled in. If you paste multiple commands at once into the terminal without adding this information, bad things may result.
 * 5) Percent signs (%) indicate a command to be executed in the shell.
    * Leave them out when copying a command from this page.
 * 6) Do NOT execute any of the following commands as root.
    * While it won't result in any of the consequences listed below, it does mess up the permissions for any directories and files created during the process.
       * This effectively blocks others from accessing the data derived from the experiment, which isn't a very nice thing to do.


 * Please note:
 * The Base Experiment directory is specific to each experiment, and refers to
 * The Root Experiment directory is generic to all experiments, and refers to


 * Failure to pay heed to the above may result in:


 * 1) At best: Script failure.
 * 2) At worst: Data deletion.
 * 3) Very annoyingly: Will create a mess.
 * 4) But most annoyingly: Will create a mess in a publicly used directory such as /mnt/main/Exp.

Steps for Running the Decode

 * Before starting the Decode/Scoring:
 * 1) First make sure that training completed successfully
 * 2) Build the Language Model

Setup the Decode Directory and Run the Decode

 * Now execute makeTest.pl.
 * makeTest.pl
 * Example:
 * makeTest.pl -d switchboard/30hr 0261/001 0280/001
 * Speech:MakeTest.pl has some additional info on other flags you can use.


 * If for some reason makeTest.pl does not work or you are making additional modifications outside of the scope of the script, then the following steps will allow you to set up the experiment for unseen data yourself, especially when you're using an acoustic model that is not in the current experiment.


 * First we need to create a subset to test on. We call it <taskName>_decode.fileids, and it lives in etc. From your base experiment directory, navigate into etc:
 * cd etc


 * Example:
 * head -1000 001_train.fileids > 001_decode.fileids
 * In the example above, 001 is the task (sub-experiment) and 0261/001 0283/018 is the path to the task. In this example we take the first 1000 audio files (about an hour), although you should do something more complex.
 * Alternative Example (better):
 * awk '{print $1}' /path/to/test/trans dir/dev.trans >> /path/to/etc dir/001_decode.fileids
 * Detailed Example Below:
 * awk '{print $1}' /mnt/main/corpus/switchboard/145hr/test/trans/dev.trans >> /mnt/main/Exp/0283/018/etc/018_decode.fileids
 * The alternative method is preferable because the .trans files under /test/trans/ are evenly sampled utterances from the /train/trans/train.trans file. This kind of decode provides better insight, as the decode is done across the whole corpus instead of just, say, the first 1000 utterances. The transcript taken from /test/trans/ can be either dev.trans or eval.trans. These two are unseen data transcripts, so they provide insight into real-world performance.
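
The text above notes that taking the first 1000 entries is a simplification. One possible alternative, shown here only as a sketch, is to take every Nth entry so that the subset spans the whole list. The helper name sample_every_nth and the step value are assumptions for illustration; they are not part of the project's scripts.

```shell
#!/bin/sh
# sample_every_nth is a hypothetical helper (not an existing project script).
# It prints every Nth line of standard input, starting with the first line.
sample_every_nth() {
    awk -v step="$1" 'step > 0 && (NR - 1) % step == 0'
}

# Hypothetical usage inside etc, with file names taken from the example above:
#   sample_every_nth 100 < 001_train.fileids > 001_decode.fileids

# Small demonstration on five made-up IDs: prints utt1, utt3 and utt5.
printf '%s\n' utt1 utt2 utt3 utt4 utt5 | sample_every_nth 2
```

Because the step is a parameter, the size of the subset can be tuned without editing the file list by hand.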


 * Now we go back up a level to set up additional directories and files.
 * cd ..


 * Check and make sure you have your language model. If the LM directory does not exist, then copy it from your training experiment as such...
 * cp -ir ../../ /LM .


 * Copy the binaries from the training experiment (these are needed for feat generation)...
 * cp -ir ../../ /bin .


 * ...and copy the scripts_pl directory from the training experiment for access to make_feats.pl (and other scripts that may need to be referred to)
 * cp -ir ../../ /scripts_pl .


 * The model_parameters from the training directory needs to be symlinked. That command is...
 * ln -s ../../ /model_parameters ../../ /model_parameters


 * The feat and wav directories should also be generated. For the feat directory, we will be generating the feats later on, but for now, we'll just make the directory.
 * mkdir feat
 * mkdir wav


 * Now navigate into the etc directory. We will be copying a few more required files here.
 * cd etc


 * You will be copying a series of files from the training experiment's etc directory to the destination's etc directory. They are...
 * cp -i ../../../ /etc/ .dic ./ .dic
 * cp -i ../../../ /etc/ .filler ./ .filler
 * cp -i ../../../ /etc/feat.params .
 * cp -i ../../../ /etc/sphinx_train.cfg .


 * In sphinx_train.cfg, change $CFG_DB_NAME to the current sub-experiment ID, and change $CFG_BASE_DIR to use the directory that your test experiment is located in.


 * Next, we need to copy the run_decode.pl script into this newly created directory.
 * cp -i /mnt/main/scripts/user/run_decode.pl .


 * Do not forget the period at the end of the above command. The period is a shortcut for the current directory.


 * At this point, execute the following statements regardless of whether or not you executed makeTest.pl.

 * Now we need to generate the feats. To do that, execute...
 * genFeats.pl -d


 * Now that we have the files copied and the feats are generated, we need to run it and specify a few parameters.
 * In the etc directory run the following command...
 * nohup run_decode.pl
 * Example:
 * nohup run_decode.pl 0261/001 0280/001 1000
 * 0261/001 is the train experiment, 0280/001 is the test experiment, and 1000 is the senone count.


 * If you do not know the senone count of your current experiment, then execute this command before running run_decode:
 * ls ../model_parameters
 * You will find a folder named <taskName>.cd_cont_ followed by the senone count (example: 001.cd_cont_1000), along with some other similarly named folders.
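
Reading the senone count off the directory listing can also be scripted. The sketch below assumes that the folder of interest is named <taskName>.cd_cont_ followed only by digits, as in 001.cd_cont_1000, and that any other similarly named folders have a non-numeric suffix. The helper name list_senone_counts is hypothetical.

```shell
#!/bin/sh
# list_senone_counts is a hypothetical helper (not an existing project script).
# Assumption: the folder of interest is named <taskName>.cd_cont_<digits>.
list_senone_counts() {
    for dir in "$1"/*.cd_cont_*; do
        [ -d "$dir" ] || continue          # skip non-directories and unmatched globs
        suffix=${dir##*.cd_cont_}          # keep the text after ".cd_cont_"
        case $suffix in
            ''|*[!0-9]*) ;;                # ignore suffixes that are not purely numeric
            *) printf '%s\n' "$suffix" ;;
        esac
    done
}

# Hypothetical usage from the etc directory:
#   list_senone_counts ../model_parameters
```

If more than one number is printed, the listing still needs a manual look to decide which model to decode with.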


 * Running this script will create a file called decode.log in the etc directory. This will be used in the following step to get a score from the decode.

Overview
Scoring refers to the process of rating the quality of the models created. We use a program called SCLite to generate the scores for us.

Essentially, the process compares two transcripts:
 * The reference transcript.
    * This will be <experiment #>_train.trans.
    * It is a 100% accurate transcription of the audio data sent into the decoder.
 * The hypothesis transcript.
    * This will start off as decode.log, but it needs to be prepped before scoring to become hyp.trans.

SCLite will compare each line of the hypothesis transcript with the reference transcript, counting the number of times a word is substituted for another, an unrelated word is added, or a word is deleted. These types of errors are called substitution, addition (shown as Ins, for insertion, in SCLite's output), and deletion errors respectively. Although none are good, certain types of errors are considered "worse" than other types; thus when SCLite is generating a score, each error type is weighted accordingly.

Instructions

 * Prepare the hypothesis transcript.
 * 1) Go to the directory that contains decode.log.
    * You may already be in there if you just decoded.
 * 2) Transform the decode.log file to hyp.trans:
    * The decoder combines output and status/error text into that single decode.log file. We need to strip out all the status/error text, leaving only the decoded sentences.
    * To do this, use parseDecode.pl:
    * % /mnt/main/scripts/user/parseDecode.pl decode.log ../etc/hyp.trans
       * This will place the newly created hypothesis transcript (hyp.trans) in your etc directory.
          * Which, conveniently enough, is where our reference transcript (<experiment #>_train.trans) is.
       * The parseDecode.pl script shouldn't take very long compared to other steps. But of course, it depends on the transcript size.

PLEASE NOTE: When running parseDecode.pl, it isn't uncommon for it to return:

 rm: cannot remove `../etc/hyp.trans': No such file or directory

This is normal and expected. The script should still have run successfully. It tries to remove an existing hypothesis file before making a new one, but if the script has never been run before, the hypothesis file doesn't exist yet, so rm reports an error. This is a bug that probably should be fixed.


 * Run the sclite scorer!
 * 1) Go to your etc directory.
    * % cd ../etc
 * 2) Run SCLite.
    * SCLite will return its results to standard output (the screen). We want them in a file for posterity's sake.
       * To do this, we need to redirect the results into a file by attaching >> scoring.log to the end of the command.
          * For more information on this, check out Wikipedia's explanation.
       * sclite -r <exp#>_train.trans -h hyp.trans -i swb >> scoring.log
          * In the above command, the results of the scoring will be appended to scoring.log in the same directory (etc).
             * Don't worry if scoring.log doesn't exist yet. The shell will create it if it isn't there, and will append to the file if it is.

Now, in a perfect world, SCLite will run nicely and you will get a nice long table detailing the amount of errors, words, and averages for each audio file and transcript, along with a total summary. It will look something like:

 SYSTEM SUMMARY PERCENTAGES by SPEAKER

 ,-----------------------------------------------------------------.
 |                         hyp.trans                               |
 |-----------------------------------------------------------------|
 | SPKR    | # Snt # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
 |---------+-------------+-----------------------------------------|
 |=================================================================|
 | Sum/Avg |  524  11185 | 45.3   44.6   10.1   11.4   66.1   99.6 |
 |=================================================================|
 |  Mean   |  2.7   58.0 | 46.4   44.9    8.7   17.3   70.9   99.7 |
 |  S.D.   |  1.7   44.0 | 15.4   14.7    7.2   23.9   25.5    3.4 |
 | Median  |  2.0   45.0 | 45.5   44.4    7.7   10.4   68.9  100.0 |
 `-----------------------------------------------------------------'

There will be a lot more stuff in between, of course. But you get the idea.
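
As a reading aid for the sample table, the Err column in the Sum/Avg row equals the sum of the Sub, Del and Ins percentages (44.6 + 10.1 + 11.4 = 66.1), and Corr, Sub and Del add up to 100. The sketch below just repeats that arithmetic; err_percent is a hypothetical helper, not part of SCLite.

```shell
#!/bin/sh
# err_percent is a hypothetical helper (not part of SCLite).
# It adds the substitution, deletion and insertion percentages.
err_percent() {
    awk -v s="$1" -v d="$2" -v i="$3" 'BEGIN { printf "%.1f\n", s + d + i }'
}

# Sum/Avg row of the sample table: Sub=44.6, Del=10.1, Ins=11.4
err_percent 44.6 10.1 11.4    # prints 66.1
```

The same check works on the Mean row (44.9, 8.7, 17.3 gives 70.9), which is a quick way to confirm that a pasted table has not been garbled.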

Unfortunately, we don't live in a perfect world. More than likely, the first run of SCLite will print errors to the terminal. Refer to the relevant entry in the following section, Issues commonly found when Decoding/Scoring and how to resolve them, to resolve this.

Issue 1:

 * SCLITE is returning an error!


 * Symptoms:
 * You receive any of the following error messages:

 Error: double reference text for id '(sw2479a-ms98-a-0071)'
 Error: Not enough Reference files loaded
 Missing: (sw2259a-ms98-a-0021) (sw2295b-ms98-a-0011) (sw2331a-ms98-a-0049) (sw2389b-ms98-a-0096) (sw2428a-ms98-a-0017) (sw2442b-ms98-a-0059) (sw2451b-ms98-a-0044)


 * Summary:

There are two types of errors that SCLite will commonly return. Their causes are slightly different yet similar: both come from redundant entries in the transcripts. To be more precise:
 * The transcript files have multiple identical entries.
 * The transcript files have multiple entries with the same transcript ID but slightly differing text.
 * This is similar to the first issue, except the transcripts differ in some way. For example, the offending lines (in either the hypothesis or reference transcript) could look like:
 * THE LAZY RED FOX JUMPED OVER THE FENCE (sw2460B-ms98-a-0066)
 * THE LAZY RED FOX JUMPED OVER THAT FENCE (sw2460B-ms98-a-0066)
 * THE LAZY RED FOX JUMPED OVER THAT FENCE (sw2460B-ms98-a-0066)

With either issue, there will be a redundant entry in the <experiment #>_train.fileids file as well. That being said, you don't have to worry about this file for decoding or scoring. It's only important to note if you wish to remove redundant entries before training (there is no tangible benefit in doing so; see the results of experiments Speech:Exps_0024 and Speech:Exps_0025 for more information).

Although standard terminal output SHOULD be going to scoring.log (due to the >> scoring.log attached to the end of the sclite command), error messages are treated differently: they go to standard error rather than standard output. The above messages therefore go straight to the terminal, and you shouldn't see them in the scoring.log file. In fact, perhaps the only thing you will see is: sclite: 2.3 TK Version 1. <experiment #>_train.trans' and Hyp File: 'hyp.trans'
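
The difference between the two output streams can be seen with a small stand-in command. In the sketch below, fake_scorer is a made-up function used in place of sclite, and scoring.err is a hypothetical file name; only the redirection operators are the point.

```shell
#!/bin/sh
# fake_scorer is a stand-in for sclite, used only for illustration.
# It writes one line to standard output and one line to standard error.
fake_scorer() {
    echo "SYSTEM SUMMARY PERCENTAGES by SPEAKER"
    echo "Error: double reference text for id '(sw2479a-ms98-a-0071)'" >&2
}

log_dir=$(mktemp -d)

# '>>' appends standard output only; '2>>' appends standard error to its own file.
fake_scorer >> "$log_dir/scoring.log" 2>> "$log_dir/scoring.err"

echo "Contents of scoring.log:"
cat "$log_dir/scoring.log"
echo "Contents of scoring.err:"
cat "$log_dir/scoring.err"

rm -rf "$log_dir"
```

With only >> scoring.log, as in the sclite command on this page, the error line would appear on the terminal and never reach the log.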


 * Solutions:

 * "Error: double reference text for id ''":

This is due to the existence of redundant transcript entries that differ in content. To resolve this issue, we need to edit either the hyp.trans or the <experiment #>_train.trans file, based on the type of message we get (if it says "reference" it's the latter, while if it says "hypothesis" it's the former). We then find the offending lines, choose one entry, and remove the other.

You may be asking why we can't just run each through uniq like the other solution. The problem is that uniq only removes identical lines; since the offending transcripts differ, they are unique, and thus uniq will leave them alone.

For this example, let's assume that we received the following message when running SCLite for the first time: Error: double reference text for id '(sw2479a-ms98-a-0071)'

Short version: For those who know what they are doing or wish to use something besides Vi/Vim.
 * 1) Open up the reference transcript file, i.e. <experiment #>_train.trans.
 * 2) Search for all instances of sw2479a-ms98-a-0071.
    * It is important to note that sometimes one of the alphabetic characters (especially the one immediately preceding the first dash) may appear lowercase in the error message but is uppercase in the transcripts.
       * For example, it may show up in the transcript as sw2479A-ms98-a-0071.
       * I know, it's weird. Try a case-insensitive search; see the long tutorial to see how to do this in vi.
 * 3) Pick one, and remove the other(s) by removing any trace of their existence.
 * 4) Save and quit.
 * 5) Check the other transcript to see if there are doubles in there.

Long version: Using vi/vim

Check out [this resource] for a quick refresher on vi.


 * 1) Go to your experiment's etc folder if you aren't already there.
 * 2) Make a backup of your reference transcript:
    * % cp <experiment #>_train.trans <experiment #>_train.trans.old
       * This ensures that you can restart if you accidentally do something wrong.
 * 3) Open up the reference transcript file in vi.
    * vi <experiment #>_train.trans
 * 4) Set vi to case-insensitive search mode.
    * Vi's search is case-sensitive by default. The problem is that the error message we get may have the wrong cases. Yes I know, it's weird.
    * Enter the following:
    * :set ignorecase
 * 5) Search for all instances of sw2479a-ms98-a-0071.
    * To search in vi, use the forward-slash key / followed by the search text, then press Enter.
    * /sw2479a-ms98-a-0071
    * It should bring you to the first instance it finds.
       * If it doesn't find anything, check your search terms and make sure that you set ignorecase in the previous step.
    * To move between entries: use n to go forward, and Shift + n to go backwards.
    * In our example, the offending lines are:
    * RIGHT SHE GOES TO SEMINARS AND UH SHE GETS HOME VISITATIONS BY THE STATE UH UH I DON'T KNOW THE STATE BOARDS I GUESS SOME OF THEM AND THEN SOME BY THE ASSOCIATION AND THEY (sw2479A-ms98-a-0071)
    * RIGHT SHE GOES TO SEMINARS AND UH SHE GETS HOME VISITATIONS BY THAT STATE UH UH I DON'T KNOW THE STATE BOARDS I GUESS SOME OF THEM AND THEN SOME BY THE ASSOCIATION AND THEY (sw2479A-ms98-a-0071)
 * 6) Pick one, and remove the other(s) by removing the entire line it's on.
    * If there is only one entry, skip this step and quit the editor by typing in:
    * :q
    * To delete an entire line quickly: while in command mode (press the ESC key if you aren't sure),
       * Move your cursor using the arrow keys to the line you wish to remove.
       * Then enter:
       * :d
 * 7) Save and quit by entering:
    * :wq
 * 8) Check the other transcript to see if there are doubles in there.
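
Before opening an editor, it can help to list every ID that occurs more than once. The sketch below assumes each transcript line ends with the utterance ID in parentheses, as in the examples on this page, and compares IDs case-insensitively because of the case mismatch described above. The helper name list_duplicate_ids is hypothetical.

```shell
#!/bin/sh
# list_duplicate_ids is a hypothetical helper (not an existing project script).
# Assumption: the last whitespace-separated field of each line is the "(id)".
# It reads the named files, or standard input when no file is given.
list_duplicate_ids() {
    awk 'NF > 0 { count[tolower($NF)]++ }
         END { for (id in count) if (count[id] > 1) print id }' "$@" | sort
}

# Hypothetical usage from etc:
#   list_duplicate_ids hyp.trans

# Demonstration with made-up lines: prints (sw2460b-ms98-a-0066)
printf '%s\n' \
    'OVER THE FENCE (sw2460B-ms98-a-0066)' \
    'OVER THAT FENCE (sw2460b-ms98-a-0066)' \
    'HELLO THERE (sw2460B-ms98-a-0067)' | list_duplicate_ids
```

The IDs are printed in lowercase, so the case-insensitive search described in the vi steps is still needed to find them in the file.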


 * "Not enough reference files loaded, Missing:"

This error is caused by duplicate identical transcript entries in the hypothesis transcript, the reference transcript, or both. Usually it is the hypothesis transcript that causes the error, so we will focus on that.


 * 1) Go to your experiment's etc directory if you aren't already there.
 * 2) Remove all redundant lines.
    * We use a built-in Unix tool called uniq to do this for us. The output of this tool needs to go to a new file.
    * % uniq hyp.trans > hyp.trans.uniq
 * 3) Restart SCLite while using the newly created hyp.trans.uniq file.
    * sclite -r <experiment #>_train.trans -h hyp.trans.uniq -i swb >> scoring.log

If you get the same error again: Repeat the above process, but for the <experiment #>_train.trans file. Be sure to specify the new <experiment #>_train.trans.uniq file where appropriate in the sclite statement.
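
One caveat: uniq only collapses identical lines that are next to each other, so repeated entries that are separated by other lines survive it. The sketch below shows one common alternative that keeps the first copy of each line and preserves the original order. The helper name drop_repeated_lines is hypothetical, and whether the decode output actually contains non-adjacent duplicates is an assumption to check on your own data.

```shell
#!/bin/sh
# drop_repeated_lines is a hypothetical helper (not an existing project script).
# It keeps the first copy of every line and drops later exact copies,
# even when the copies are not adjacent. Line order is preserved.
# It reads the named files, or standard input when no file is given.
drop_repeated_lines() {
    awk '!seen[$0]++' "$@"
}

# Hypothetical usage from etc:
#   drop_repeated_lines hyp.trans > hyp.trans.uniq

# Demonstration with made-up lines: the third line is dropped.
printf '%s\n' 'HELLO (utt-1)' 'WORLD (utt-2)' 'HELLO (utt-1)' | drop_repeated_lines
```

Lines that share an ID but differ in text are left alone, exactly as with uniq, so the manual fix described earlier still applies to those.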