Speech:Spring 2017 Data Group


Group Member Logs

 * Maryjean Emerson
 * Matthew Fintonis
 * Dylan Lindstrom
 * John Vey

Creating New Corpora
This semester, we updated last year's data group instructions on how to create corpora of new sizes. The following instructions are up to date as of 2/28/2017.

Note: If copy/pasting, do not copy the '#' at the beginning of commands. Please read through EACH bullet point before continuing on to the next step.

1) From /mnt/main/corpus/switchboard, run the 'makeCorpus.pl' script
 * Ex: # perl /mnt/main/scripts/user/makeCorpus.pl
 * (Ex: 5hr)
2) Navigate to the /info/misc directory
 * Ex: # cd /info/misc
3) Copy the full transcript file from /mnt/main/corpus/switchboard/full/train/trans to your /info/misc directory
 * Ex: # cp /mnt/main/corpus/switchboard/full/train/trans/train.trans .
 * Note the period at the end, which tells cp to copy into the current directory
4) Double check how many hours the copied transcript file covers
 * Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
 * > 311.761
 * This command tells you the length of the transcript file in hours
5) To trim the transcript down to the desired corpus size (in my case, 5 hours), we must do a bit of math:
 * Take the total number of hours in the full transcript (311 hours) and divide it by your desired corpus size plus 10 additional hours (~5 hours each for 'dev.trans' and 'eval.trans')
 * Ex: 311 / (5 + 10) = 20 (always round down to the closest whole number)
 * This number will act as our sample count for the following step
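
The arithmetic in steps 4 and 5 can be sketched in Python. This is only an illustration, not one of the scripts in /mnt/main/scripts/user. It assumes, as the awk command suggests, that field 2 of each transcript line is the start time and field 3 the end time, both in seconds. The function names and the made-up sample lines are hypothetical.

```python
import math

def transcript_hours(lines):
    """Sum (field 3 - field 2) over all lines and convert seconds to hours.

    Assumption: fields are whitespace-separated, field 2 is the start time
    and field 3 the end time, as implied by the awk one-liner.
    """
    total_seconds = 0.0
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines (awk would count them as 0)
        total_seconds += float(fields[2]) - float(fields[1])
    return total_seconds / 3600

def sample_count(full_hours, desired_hours, held_out_hours=10):
    """Full length divided by (desired size + held-out hours), rounded down."""
    return math.floor(full_hours / (desired_hours + held_out_hours))

# Two made-up 30-minute utterances add up to one hour.
print(transcript_hours(["utt1 0.0 1800.0 hello", "utt2 10.0 1810.0 hi"]))  # 1.0
# The example from step 5: a 5-hour corpus cut from the 311.761-hour transcript.
print(sample_count(311.761, 5))  # 20
```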

6) Take a sample of the transcript by running the 'sampleTrans.pl' script
 * Ex: # sampleTrans.pl
 * Ex: # perl /mnt/main/scripts/user/sampleTrans.pl 20 train.trans
 * This will create 'train.trans-sampled', which grabs every 20th line from the transcript. The optional '-r' flag also creates 'train.trans-remaining', which holds all the remaining lines
 * For this step, you do not need to include the -r flag in the command
7) Double check that the size of the sampled transcript matches your desired corpus size (plus ~10 hours)
 * Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled
 * > 15.513
8) Rename 'train.trans' to 'train.trans-full'
 * Ex: # mv train.trans train.trans-full
9) Rename 'train.trans-sampled' to 'train.trans'
 * Ex: # mv train.trans-sampled train.trans
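
To show what "grabs every 20th line" means, here is a minimal Python sketch of every-Nth-line sampling. It is not the code of 'sampleTrans.pl'. In particular, which line the real script starts counting from is an assumption here (this sketch takes lines N, 2N, 3N, ...), and the function name is hypothetical.

```python
def sample_every_nth(lines, n):
    """Split lines into (sampled, remaining), sampling every n-th line.

    Assumption: lines n, 2n, 3n, ... are sampled; the real script may use
    a different starting offset.
    """
    sampled, remaining = [], []
    for index, line in enumerate(lines, start=1):
        if index % n == 0:
            sampled.append(line)
        else:
            remaining.append(line)
    return sampled, remaining

# 100 made-up transcript lines, sampled with a count of 20.
lines = [f"utt{i}" for i in range(1, 101)]
sampled, remaining = sample_every_nth(lines, 20)
print(len(sampled), len(remaining))  # 5 95
```

With a sample count of 20, roughly 1/20 of the transcript ends up in the sampled part, which is why 311 hours shrinks to about 15.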

11) While still in the /info/misc directory, run the 'sampleTrans.pl' script again. However, this time, add the '-r' flag to create the 'train.trans-remaining' file and change the sample count to a value equal to the size of your current 'train.trans' file, divided by 5. (In my case, 5 + 10 = 15, and 15 / 5 = 3)
 * Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
 * > 15.513
 * Ex: # perl /mnt/main/scripts/user/sampleTrans.pl -r 3 train.trans
 * Note: In the following steps, we will be generating our 'dev.trans', 'eval.trans', and 'train.trans' files. Both 'dev.trans' and 'eval.trans' contain about 5 hours' worth of transcript each, which we will remove from our sampled transcript. 'train.trans' will also contain about 5 hours' worth of transcript, but that portion will not be removed from the sampled transcript. These files will be for people who want to train on unseen data.
12) Rename 'train.trans' to 'train.trans-old1'
 * Ex: # mv train.trans train.trans-old1
13) Move 'train.trans-sampled' into your /test/trans directory and rename it as 'dev.trans'
 * Ex: # mv train.trans-sampled ../../test/trans/dev.trans
14) Rename 'train.trans-remaining' to 'train.trans'
 * Ex: # mv train.trans-remaining train.trans

15) While still in the /info/misc directory, run the 'sampleTrans.pl' script again. Do not forget to change the sample count to a value equal to the size of your 'train.trans' file, divided by 5.
 * Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
 * > 10.2451
 * Ex: # perl /mnt/main/scripts/user/sampleTrans.pl -r 2 train.trans
16) Rename 'train.trans' to 'train.trans-old2'
 * Ex: # mv train.trans train.trans-old2
17) Move 'train.trans-sampled' into your /test/trans directory and rename it as 'eval.trans'
 * Ex: # mv train.trans-sampled ../../test/trans/eval.trans
18) Rename 'train.trans-remaining' to 'train.trans'
 * Ex: # mv train.trans-remaining train.trans
19) One last time, run the 'sampleTrans.pl' script. Do not forget to change the sample count to a value equal to the size of your 'train.trans' file, divided by 5.
 * Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
 * > 5.2134
 * Ex: # perl /mnt/main/scripts/user/sampleTrans.pl -r 1 train.trans
20) Move the 'train.trans-sampled' file to the /test/trans directory
 * Ex: # mv train.trans-sampled ../../test/trans/train.trans
21) Copy the 'train.trans' file to the /train/trans directory
 * Ex: # cp train.trans ../../train/trans/train.trans
22) Navigate to the /test/trans directory
 * Ex: # cd ../../test/trans
23) Double check that the size of each file (dev.trans, eval.trans, train.trans) is equal to roughly 5 hours
 * Ex: # awk '{total += $3 - $2} END {print total / 3600}' dev.trans
 * > 5.2679
 * Ex: # awk '{total += $3 - $2} END {print total / 3600}' eval.trans
 * > 5.03174
 * Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
 * > 5.2134
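
The "size divided by 5" rule used for the three splitting passes can be checked with a few lines of Python. This is a sketch of the arithmetic only. It assumes the result is rounded down, as in the earlier sample count step, and the function name is hypothetical.

```python
import math

def split_count(current_hours, target_hours=5):
    """Sample count for one pass: current transcript size / 5, rounded down."""
    return math.floor(current_hours / target_hours)

# Sizes of 'train.trans' before each of the three passes in the examples above.
for hours in (15.513, 10.2451, 5.2134):
    print(hours, split_count(hours))
# 15.513 3
# 10.2451 2
# 5.2134 1
```

Each pass peels off about 5 hours: every 3rd line of 15 hours, then every 2nd line of the remaining 10 hours, then every line of the last 5 hours.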

24) The final steps are to create the links to the utterance files. To do this, first navigate to the /test/audio/utt directory.
 * Ex: # cd <corpus_name>/test/audio/utt
25) Run the 'linkTransUtt.pl' script for 'dev.trans', 'eval.trans', and 'train.trans', as shown below. (Note the '/' at the end of the final path)
 * Ex: # perl /mnt/main/scripts/user/linkTransUtt.pl /mnt/main/corpus/switchboard/<corpus_name>/test/trans/dev.trans /mnt/main/corpus/switchboard/full/train/audio/utt/
 * Ex: # perl /mnt/main/scripts/user/linkTransUtt.pl /mnt/main/corpus/switchboard/<corpus_name>/test/trans/eval.trans /mnt/main/corpus/switchboard/full/train/audio/utt/
 * Ex: # perl /mnt/main/scripts/user/linkTransUtt.pl /mnt/main/corpus/switchboard/<corpus_name>/test/trans/train.trans /mnt/main/corpus/switchboard/full/train/audio/utt/
26) Run the 'ls -l' command to verify that the links are good
 * Ex: # ls -l
 * Note: If the links are good, the utterance file's font color (ex: sw4940B-ms98-a-0073.sph) should be Cyan. If they are not, delete all the links and try again, making sure your paths are typed correctly.
27) Navigate to the /train/audio/utt directory
 * Ex: # cd <corpus_name>/train/audio/utt
28) Run the 'linkTransUtt.pl' script for 'train.trans', as shown below. (Note the '/' at the end of the final path)
 * Ex: # perl /mnt/main/scripts/user/linkTransUtt.pl /mnt/main/corpus/switchboard/<corpus_name>/train/trans/train.trans /mnt/main/corpus/switchboard/full/train/audio/utt/
29) Run the 'ls -l' command to verify that the links are good
 * Ex: # ls -l
 * Note: If the links are good, the utterance file's font color (ex: sw4940B-ms98-a-0073.sph) should be Cyan. If they are not, delete all the links and try again, making sure your paths are typed correctly.
30) You have now successfully set up a new corpus! Try running a full train/decode on the corpus to ensure that everything works.
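
Checking link colors by eye gets tedious when a directory holds thousands of utterance links. As an alternative, the following Python sketch lists symbolic links whose targets do not exist. It is not part of the project scripts, and the function name is hypothetical.

```python
import os

def broken_links(directory):
    """Return the names of symlinks in `directory` whose targets are missing."""
    broken = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # islink() is true for any symlink; exists() follows the link,
        # so it is false when the target file is missing.
        if os.path.islink(path) and not os.path.exists(path):
            broken.append(name)
    return broken

# Run from inside the utt directory; an empty list means every link resolves.
print(broken_links("."))
```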

Helpful Links

 * CMUSphinx (Speech Recognition Information)
 * Corpus (General Corpus Information)
 * Run a Train (Scripts to run trains, use Steps 1, 2, and 3)
 * Accessing Drones (Should be done in class but this is useful in case you were absent or the process didn't work for you)
 * Count # of Files (The command 'ls -1 | wc -l' is used)
 * Moving Directories (The command 'mv [source] [destination]' is used)
 * Creating Soft Links (The command 'ln -s ../disk1/swb1/*.sph -t .' is used)
 * Videolan.org (Link to videolan)
 * Filezilla-project.org (Link to Filezilla free FTP)
 * CMU Sphinx Sourceforge Page - CMUDict 0.7b update (for help w/ dictionary file - found by DL 2/12/17)

Thanks to Dylan for finding this info. I just posted it here for everyone to see.

Data Team: Feel free to add further important links as you find them. - JFV 2/6/17

Regular Expression Changes for genTrans.pl Script
Working closely with Cody from the Experiments group, we added regular expressions that, in theory, should improve the word error rate (WER) of the train and decode. However, despite our best efforts, we were unable to get a working train and decode running with the new scripts. For more details, see Matt's or Dylan's logs.


{| border="1"
!Case !!Description !!What should stay/be removed
|-
| [laughter] || Laughter || Remove
|-
| [laughter-word] || Laughter while speaking || Remove the laughter tag so [laughter-word] becomes word
|-
| [noise] || Random noise || Remove
|-
| [vocalized-noise] || Vocal noise (Ex: pfft) || Remove
|-
| wo[rd]- || Person said “wo” but got cut off or stuttered. “Word” was the intended thing to be said || Remove “[rd]-” and keep “wo”
|-
| -[wo]rd || Person said “rd” but meant to say “word”. Usually occurs when clip begins with someone already speaking || Remove “-[wo]” and keep “rd”
|-
| [worm/word] || Person said “worm” but “word” makes sense in the context (misspeak) || Keep “worm” as the voice recognition will pick this up and not “word”
|-
| [laughter-wo[rd]-] || Combination of laughter and unfinished/cut off word || Keep “wo” and remove “[laughter-[rd]-]”
|-
| [wokd/word] || “Wokd” is what the person said but is not in the English language. “Word” makes sense in the context || Keep “wokd” as the voice recognition will pick this up and not “word”
|}

The regular expressions added to genTrans.pl are listed below. You can find them in /mnt/main/scripts/user/genTrans.new.pl:

$message =~ s/noise]//g;        #changed - [noise]
$message =~ s/\[laughter//g;    #added
$message =~ s/\[vocalized//g;   #added
$message =~ s/\w*\[\w*\]-//g;
$message =~ s/-\[\w*\]\w*//g;
$message =~ s/\[.*?\]-//g;
$message =~ s/-\[.*?\]//g;
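
For comparison, the keep/remove behavior described in the table can be sketched in Python. This is an illustration only: it is not the code in genTrans.pl, and the patterns are written from the table rather than translated from the Perl substitutions above. As listed, a pattern such as `s/\w*\[\w*\]-//g` appears to delete the whole token (including the "wo" the table says to keep), which may be worth checking when debugging the train and decode. The function name is hypothetical.

```python
import re

def clean_transcript(text):
    """Apply the tag rules from the table to one line of transcript text."""
    # [laughter], [noise], [vocalized-noise] -> removed
    text = re.sub(r"\[(?:laughter|noise|vocalized-noise)\]", "", text)
    # [laughter-word] -> word; the inner part may itself be a partial word
    # such as wo[rd]-, which the next rule then handles
    text = re.sub(r"\[laughter-((?:[^\[\]]|\[[^\[\]]*\])*)\]", r"\1", text)
    # wo[rd]- -> wo (keep what was actually spoken)
    text = re.sub(r"(\w+)\[\w+\]-", r"\1", text)
    # -[wo]rd -> rd
    text = re.sub(r"-\[\w+\](\w+)", r"\1", text)
    # [worm/word] -> worm (keep what was said, drop the intended word)
    text = re.sub(r"\[([^\[\]/]+)/[^\[\]]+\]", r"\1", text)
    # collapse the whitespace left behind by removed tags
    return re.sub(r"\s+", " ", text).strip()

print(clean_transcript("i [laughter] thi[nk]- think [laughter-so] [noise]"))
# i thi think so
```

Words containing apostrophes (for example a cut-off "don't") are not covered by `\w+` here and would need a wider character class.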