Speech:Spring 2017 Data Group



Groups


Group Member Logs


Creating New Corpora

This semester, we updated last year's data group instructions on how to create corpora of new sizes. The following instructions are up to date as of 2/28/2017.

Note: If copying and pasting, do not copy the '#' at the beginning of commands. Please read through EACH bullet point of a step before continuing on to the next step.

1) From /mnt/main/corpus/switchboard, run the 'makeCorpus.pl' script

  • Ex: # perl /mnt/main/scripts/user/makeCorpus.pl <corpus_name>
    • (<corpus_name> Ex: 5hr)

2) Navigate to <corpus_name>/info/misc directory

  • Ex: # cd <corpus_name>/info/misc

3) Copy the full transcript file from /mnt/main/corpus/switchboard/full/train/trans to your <corpus_name>/info/misc directory

  • Ex: # cp /mnt/main/corpus/switchboard/full/train/trans/train.trans .
    • Note the period at the end

4) Double check to see how many hours the copied transcript file covers

  • Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
    • > 311.761
  • This command tells you the length of the transcript file in hours
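For readers less familiar with awk, the same calculation can be sketched in Python. This is only an illustration: it assumes, as the awk command implies, that the second and third whitespace-separated fields of each line are the start and end times in seconds. The function name and the sample lines are invented for the example.

```python
def transcript_hours(lines):
    """Sum (end - start) over all transcript lines and convert seconds to hours."""
    total_seconds = 0.0
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        # fields[1] is the start time and fields[2] the end time ($2 and $3 in awk)
        total_seconds += float(fields[2]) - float(fields[1])
    return total_seconds / 3600


# Hypothetical lines, invented for illustration: <utterance_id> <start> <end> <words...>
example = [
    "utt-0001 0.00 1800.00 first half hour",
    "utt-0002 1800.00 5400.00 next full hour",
]
print(transcript_hours(example))  # 1.5
```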

5) In order to trim down the transcript to our desired corpus size (in my case, 5 hours), we must do a bit of math:

  • Take the total number of hours in the full transcript (311 hours) and divide it by your desired corpus size plus 10 additional hours (~5 hours each for 'dev.trans' and 'eval.trans')
    • Ex: 311 / (5 + 10) = 20.73, which rounds down to 20 (always round down to the nearest whole number)
    • This number will act as our sample count for the following step
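The arithmetic in this step can be written as a small helper. This is a sketch only; the function name and the 'held_out_hours' parameter are made up for illustration, and 10 is the extra allowance for 'dev.trans' and 'eval.trans' described above.

```python
import math


def sample_count(total_hours, desired_hours, held_out_hours=10):
    """Divide the full transcript length by the desired size plus the
    held-out allowance, then round down to a whole number."""
    return math.floor(total_hours / (desired_hours + held_out_hours))


print(sample_count(311.761, 5))  # 20
```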

6) Take a sample of the transcript by running 'sampleTrans.pl' script

  • Usage: # sampleTrans.pl <optional_flag> <sample_count> <transcript>
    • Ex: # perl /mnt/main/scripts/user/sampleTrans.pl 20 train.trans
      • This will create 'train.trans-sampled', which contains every 20th line of the transcript. The optional '-r' flag also creates 'train.trans-remaining', which contains all the remaining lines
      • For this step, you do not need to include the '-r' flag in the command
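To make the sampling idea concrete, here is a rough Python sketch of "take every n-th line, keep the rest separately". It is not the code of sampleTrans.pl, which is not reproduced on this page. In particular, whether the script starts counting at the first or the n-th line is an assumption here.

```python
def sample_every_nth(lines, n):
    """Split lines into (sampled, remaining), where sampled holds every n-th line."""
    sampled, remaining = [], []
    for index, line in enumerate(lines, start=1):
        # Assumption: lines n, 2n, 3n, ... are the sampled ones
        if index % n == 0:
            sampled.append(line)
        else:
            remaining.append(line)
    return sampled, remaining


lines = [f"line{i}" for i in range(1, 11)]
sampled, remaining = sample_every_nth(lines, 3)
print(sampled)         # ['line3', 'line6', 'line9']
print(len(remaining))  # 7
```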

7) Double check that the size of the sampled transcript matches your desired corpus size (plus ~10 hours)

  • Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled
    • > 15.513

8) Rename 'train.trans' to 'train.trans-full'

  • Ex: # mv train.trans train.trans-full

9) Rename 'train.trans-sampled' to 'train.trans'

  • Ex: # mv train.trans-sampled train.trans

11) While still in the <corpus_name>/info/misc directory, run the 'sampleTrans.pl' script again. This time, add the '-r' flag to create the 'train.trans-remaining' file, and set the sample count to the length of your current 'train.trans' file in hours, divided by 5 and rounded down. (In my case, 5 + 10 = 15, and 15 / 5 = 3)

  • Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
    • > 15.513
  • Ex: # perl /mnt/main/scripts/user/sampleTrans.pl -r 3 train.trans
    • Note: In the following steps, we will generate our 'dev.trans', 'eval.trans', and 'train.trans' files. 'dev.trans' and 'eval.trans' each contain about 5 hours of transcript that we remove from our sampled transcript, so they can be used by people who want to test on unseen data. 'train.trans' also contains about 5 hours of transcript, but that portion is not removed from the sampled transcript.

12) Rename 'train.trans' to 'train.trans-old1'

  • Ex: # mv train.trans train.trans-old1

13) Move 'train.trans-sampled' into your <corpus_name>/test/trans directory and rename it as 'dev.trans'

  • Ex: # mv train.trans-sampled ../../test/trans/dev.trans

14) Rename 'train.trans-remaining' to 'train.trans'

  • Ex: # mv train.trans-remaining train.trans

15) While still in the <corpus_name>/info/misc directory, run the 'sampleTrans.pl' script again. Do not forget to change the sample count to the length of your 'train.trans' file in hours, divided by 5 and rounded down.

  • Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
    • > 10.2451
  • Ex: # perl /mnt/main/scripts/user/sampleTrans.pl -r 2 train.trans

16) Rename 'train.trans' to 'train.trans-old2'

  • Ex: # mv train.trans train.trans-old2

17) Move 'train.trans-sampled' into your <corpus_name>/test/trans directory and rename it as 'eval.trans'

  • Ex: # mv train.trans-sampled ../../test/trans/eval.trans

18) Rename 'train.trans-remaining' to 'train.trans'

  • Ex: # mv train.trans-remaining train.trans

19) One last time, run the 'sampleTrans.pl' script. Do not forget to change the sample count to the length of your 'train.trans' file in hours, divided by 5 and rounded down.

  • Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
    • > 5.2134
  • Ex: # perl /mnt/main/scripts/user/sampleTrans.pl -r 1 train.trans
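The three sampling passes (counts 3, 2, then 1) each peel off roughly 5 hours. The Python sketch below simulates this on invented data to show why the count shrinks each time. It assumes that sampling takes every n-th entry, and all names in it are hypothetical. With a count of 1, the last pass simply takes every line.

```python
import math


def hours_of(utterances):
    """Total duration in hours of a list of (start, end) pairs given in seconds."""
    return sum(end - start for start, end in utterances) / 3600


def split_off(utterances, target_hours=5):
    """One sampling pass: pick the sample count, then take every n-th item."""
    n = max(1, math.floor(hours_of(utterances) / target_hours))
    sampled = [u for i, u in enumerate(utterances, start=1) if i % n == 0]
    remaining = [u for i, u in enumerate(utterances, start=1) if i % n != 0]
    return n, sampled, remaining


# Hypothetical pool: 15 hours made up of 5400 ten-second utterances.
pool = [(i * 10.0, i * 10.0 + 10.0) for i in range(5400)]
results = []
for name in ("dev", "eval", "train"):
    n, part, pool = split_off(pool)
    results.append((name, n, hours_of(part)))
    print(name, n, hours_of(part))
# dev 3 5.0
# eval 2 5.0
# train 1 5.0
```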

20) Move the 'train.trans-sampled' file into your <corpus_name>/test/trans directory and rename it as 'train.trans'

  • Ex: # mv train.trans-sampled ../../test/trans/train.trans

21) Copy the 'train.trans' file to the <corpus_name>/train/trans directory

  • Ex: # cp train.trans ../../train/trans/train.trans

22) Navigate to the <corpus_name>/test/trans directory

  • Ex: # cd ../../test/trans

23) Double check that the size of each file (dev.trans, eval.trans, train.trans) is equal to roughly 5 hours

  • Ex: # awk '{total += $3 - $2} END {print total / 3600}' dev.trans
    • > 5.2679
  • Ex: # awk '{total += $3 - $2} END {print total / 3600}' eval.trans
    • > 5.03174
  • Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
    • > 5.2134

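24) The final steps are to create the links to the utterance files. To do this, first navigate to the <corpus_name>/test/audio/utt directory. Since you are currently in <corpus_name>/test/trans, use the full path.

  • Ex: # cd /mnt/main/corpus/switchboard/<corpus_name>/test/audio/utt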

25) Run the 'linkTransUtt.pl' script for 'dev.trans', 'eval.trans', and 'train.trans', as shown below. (Note the '/' at the end of the final path)

  • Ex: # perl /mnt/main/scripts/user/linkTransUtt.pl /mnt/main/corpus/switchboard/<corpus_name>/test/trans/dev.trans /mnt/main/corpus/switchboard/full/train/audio/utt/
  • Ex: # perl /mnt/main/scripts/user/linkTransUtt.pl /mnt/main/corpus/switchboard/<corpus_name>/test/trans/eval.trans /mnt/main/corpus/switchboard/full/train/audio/utt/
  • Ex: # perl /mnt/main/scripts/user/linkTransUtt.pl /mnt/main/corpus/switchboard/<corpus_name>/test/trans/train.trans /mnt/main/corpus/switchboard/full/train/audio/utt/

26) Run the 'ls -l' command to verify that the links are good

  • Ex: # ls -l
    • Note: If the links are good, the utterance file names (ex: sw4940B-ms98-a-0073.sph) should be displayed in cyan. If they are not, delete all the links and try again, making sure your paths are typed correctly.
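If terminal colors are hard to judge, broken links can also be listed programmatically. The following Python sketch is not part of the group's scripts and the function name is made up. It reports every symbolic link in a directory whose target does not exist.

```python
import os


def broken_links(directory):
    """Return the names of symbolic links in `directory` whose targets do not exist."""
    broken = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # islink() is true for any symlink; exists() follows the link to its target
        if os.path.islink(path) and not os.path.exists(path):
            broken.append(name)
    return broken


# Run from inside the utt directory; an empty list means every link resolves.
print(broken_links("."))
```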

27) Navigate to the <corpus_name>/train/audio/utt directory, again using the full path

  • Ex: # cd /mnt/main/corpus/switchboard/<corpus_name>/train/audio/utt

28) Run the 'linkTransUtt.pl' script for 'train.trans', as shown below. (Note the '/' at the end of the final path)

  • Ex: # perl /mnt/main/scripts/user/linkTransUtt.pl /mnt/main/corpus/switchboard/<corpus_name>/train/trans/train.trans /mnt/main/corpus/switchboard/full/train/audio/utt/

29) Run the 'ls -l' command to verify that the links are good

  • Ex: # ls -l
    • Note: If the links are good, the utterance file names (ex: sw4940B-ms98-a-0073.sph) should be displayed in cyan. If they are not, delete all the links and try again, making sure your paths are typed correctly.

30) You have now successfully set up a new corpus! Try running a full train/decode on the corpus to ensure that everything works.

Helpful Links


Thanks to Dylan for finding this info. I just posted it here for everyone to see.


Data Team: Feel free to add further important links as you find them. - JFV 2/6/17


Regular Expression Changes for genTrans.pl Script

Working closely with Cody from the Experiments group, we added regular expressions that, in theory, should improve the word error rate (WER) of the train and decode. However, despite our best efforts, we were unable to get a working train and decode running with the new scripts. For more details, see Matt's or Dylan's logs.

Case | Description | What should stay/be removed
[laughter] | Laughter | Remove
[laughter-word] | Laughter while speaking | Remove the laughter tag so [laughter-word] becomes word
[noise] | Random noise | Remove
[vocalized-noise] | Vocal noise (Ex: pfft) | Remove
wo[rd]- | Person said “wo” but got cut off or stuttered; “word” was the intended word | Remove “[rd]-” and keep “wo”
-[wo]rd | Person said “rd” but meant to say “word”; usually occurs when the clip begins with someone already speaking | Remove “-[wo]” and keep “rd”
[worm/word] | Person said “worm” but “word” makes sense in the context (misspeak) | Keep “worm”, as the voice recognition will pick this up and not “word”
[laughter-wo[rd]-] | Combination of laughter and an unfinished/cut-off word | Keep “wo” and remove “[laughter-[rd]-]”
[wokd/word] | “Wokd” is what the person said, but it is not an English word; “word” makes sense in the context | Keep “wokd”, as the voice recognition will pick this up and not “word”
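To make the intended behaviour in the table concrete, here is a small Python sketch that applies each rule to one whitespace-separated token at a time. It is an illustration of the table only, not the code in genTrans.pl. The function names are invented, and it assumes the tags appear in lowercase exactly as written in the table.

```python
import re


def clean_token(token):
    """Apply the table's rules to a single transcript token."""
    # [laughter-word] -> word; [laughter-wo[rd]-] -> wo[rd]- (finished further down)
    token = re.sub(r"^\[laughter-(.+)\]$", r"\1", token)
    # [laughter], [noise], [vocalized-noise] -> removed entirely
    token = re.sub(r"^\[(laughter|noise|vocalized-noise)\]$", "", token)
    # [worm/word] -> worm (keep what was actually said)
    token = re.sub(r"^\[([^/\]]+)/[^\]]+\]$", r"\1", token)
    # wo[rd]- -> wo (keep the part that was spoken)
    token = re.sub(r"^([\w']*)\[[\w']*\]-$", r"\1", token)
    # -[wo]rd -> rd (keep the part that was spoken)
    token = re.sub(r"^-\[[\w']*\]([\w']*)$", r"\1", token)
    return token


def clean_line(text):
    """Clean every token in a line and drop tokens that became empty."""
    cleaned = (clean_token(tok) for tok in text.split())
    return " ".join(tok for tok in cleaned if tok)


print(clean_line("i [laughter] was [laughter-really] su[rprised]- [noise]"))
# i was really su
```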

Below are the regular expressions added to genTrans.pl. You can find them in /mnt/main/scripts/user/genTrans.new.pl:

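$message =~ s/noise]//g;          # changed - [noise]
$message =~ s/\[laughter//g;      # added
$message =~ s/\[vocalized//g;     # added
$message =~ s/\w*\[\w*\]-//g;     # matches a whole cut-off word such as wo[rd]-
$message =~ s/-\[\w*\]\w*//g;     # matches a whole word with a missing start such as -[wo]rd
$message =~ s/\[.*?\]-//g;        # matches any remaining bracketed text followed by '-'
$message =~ s/-\[.*?\]//g;        # matches any remaining '-' followed by bracketed text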
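When debugging, it can help to see what these seven substitutions do on their own. The Python sketch below transcribes them in the same order and runs them on the cases from the table. Note that it only reflects the lines shown above: genTrans.pl presumably contains other substitutions that are not listed here, so this is not the final output of the script. The function name is made up for the example.

```python
import re

# The seven substitutions listed above, in the same order (replacement is always empty).
ADDED_PATTERNS = [
    r"noise]",
    r"\[laughter",
    r"\[vocalized",
    r"\w*\[\w*\]-",
    r"-\[\w*\]\w*",
    r"\[.*?\]-",
    r"-\[.*?\]",
]


def apply_added_rules(message):
    """Apply only the added substitutions, one after another."""
    for pattern in ADDED_PATTERNS:
        message = re.sub(pattern, "", message)
    return message


for case in ["[noise]", "[laughter]", "[laughter-word]", "[vocalized-noise]",
             "wo[rd]-", "-[wo]rd", "[laughter-wo[rd]-]", "[worm/word]"]:
    print(repr(case), "->", repr(apply_added_rules(case)))
# '[noise]' -> '['
# '[laughter]' -> ']'
# '[laughter-word]' -> '-word]'
# '[vocalized-noise]' -> '-'
# 'wo[rd]-' -> ''
# '-[wo]rd' -> ''
# '[laughter-wo[rd]-]' -> '-]'
# '[worm/word]' -> '[worm/word]'
```

Taken by themselves, these lines leave stray brackets and hyphens behind and delete the cut-off words entirely instead of keeping the spoken part. This may be worth checking against the rest of the script when investigating why the train and decode did not work.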

Tasks