Speech:Spring 2016 Data Group



Group Member Logs

Authors: Brian Anker, Justin Gauthier, Brenden Collins, Brian Doehner

This log records the tasks that the Data Group finished this semester and explains how future Data Groups can achieve the same results.

Creating a specific corpus size:

The full Switchboard corpus has been determined to be 311 hours. With this knowledge, the Data Group restructured the entire /mnt/main/corpus/switchboard directory into the new corpora 300hr, 145hr, 30hr, dist, full, and old. old contains all corpora built in previous semesters that the Spring 2016 Data Group deemed unneeded. dist contains the original CD audio for the whole Switchboard corpus, and full contains the entire 311 hours of data, which is referenced when building new corpora. Each of the corpora contains an eval.trans, a dev.trans, and a test/train.trans. eval.trans and dev.trans are 5 hours each and have been removed from the corpus. This holds out "unseen" data that we can decode on. test/train.trans is 5 hours of sampled data that was not removed from the corpus and is used as "seen" data when running a basic decode. That is why the corpus is called 300hr: it was originally 311 hours, with 5 hours removed for eval.trans and 5 hours removed for dev.trans, leaving roughly 300 hours. All other corpora are structured in this manner, with the exception of those located in the old directory.

First, navigate to /mnt/main/corpus/switchboard. (Brenden Collins' 2016 Logs show this process in action.)

1. Run makeCorpus.pl <corpus_name>
2. Copy a transcript file to <corpus_name>/info/misc
3. cd into <corpus_name>/info/misc
4. Sample transcripts:

  • 1. Run sampleTrans.pl -r <sample_count> <transcript>
    • 1. This will create train.trans-sampled (every nth line) and train.trans-remaining (the remainder)
    • 2. Rename train.trans-sampled to dev.trans and move it into the <corpus_name>/test/trans directory
    • 3. Rename train.trans to train.trans-orig1 (archiving the untouched train.trans file)
    • 4. Rename train.trans-remaining to train.trans (this lets us repeat the sampling on a trans file that has the dev.trans lines removed from it).
  • 2. Run sampleTrans.pl -r <sample_count> <transcript>
    • 1. This will create train.trans-sampled (every nth line) and train.trans-remaining (the remainder)
    • 2. Rename train.trans-sampled to eval.trans and move it into the <corpus_name>/test/trans directory
    • 3. Rename train.trans to train.trans-orig2 (archiving the untouched train.trans file)
    • 4. Rename train.trans-remaining to train.trans (this lets us repeat the sampling on a trans file that has the dev.trans and eval.trans lines removed from it).
  • 3. Run sampleTrans.pl <sample_count> <transcript> (no -r flag here)
    • 1. This will create only train.trans-sampled; no train.trans-remaining will be created
    • 2. Move train.trans-sampled to <corpus_name>/test/utt/trans and rename it train.trans
  • 4. Copy <corpus_name>/info/misc/train.trans to <corpus_name>/train/trans/train.trans (this is the trans file remaining after all our samples; it is what we will use for training)
  • 5. Create Links to utterances
    • 1. The train/audio/utt files
      • 1. cd into train/audio/utt
      • 2. Run linkTransAudio.pl <path to train/trans/train.trans> <path to src utterances (such as /mnt/main/corpus/switchboard/full/train/audio/utt/)>
      • 3. Run ls afterward to verify that the links are good
    • 2. The test/audio/utt files

Repeat the same process as above in test/audio/utt three times, once each for eval.trans, dev.trans, and train.trans.
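The sampling steps above can be sketched in Python. This is only an illustration, assuming that the sample is every nth line and that the remainder becomes the new train.trans, as the steps describe. The function name split_every_nth and the sample lines are hypothetical, and this is not the actual sampleTrans.pl script.

```python
def split_every_nth(lines, n):
    """Split lines into (sampled, remaining): every nth line is sampled."""
    if n < 1:
        raise ValueError("n must be at least 1")
    sampled, remaining = [], []
    for index, line in enumerate(lines, start=1):
        # Lines n, 2n, 3n, ... go to the sample; all others stay behind.
        (sampled if index % n == 0 else remaining).append(line)
    return sampled, remaining


# Hypothetical transcript lines standing in for train.trans.
train = [f"utterance {i}" for i in range(1, 11)]

# First pass: carve out dev.trans, keep the remainder as the new train.trans.
dev, train = split_every_nth(train, 5)
# Second pass: carve out eval.trans from what is left.
eval_set, train = split_every_nth(train, 4)
```

Because each pass runs on the remainder of the previous one, dev, eval, and train never share a line, which is what keeps the dev and eval data "unseen".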

Where data audio files are located:

The Data Group listened to utterance audio files this semester. We found that many audio files did not match their transcripts, which allowed the Modeling Group to reload audio files and to remove from the transcripts some conversations for which we had no audio files. This finding helped decrease Word Error Rate, since we now had more accurate data to run our trains and decodes on.

The files are located in /mnt/main/corpus/switchboard/full/train/audio/utt. They are in .sph format, which can be played on Windows machines using VLC Media Player (http://www.videolan.org/vlc/).

Other corpora that were created this semester contain links to the utterance files located in this directory. This saves space, since the same utterance files are not copied again and again into multiple directories.
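The linking idea can be sketched in Python. This is only an illustration, not the actual linkTransAudio.pl script. It assumes that each transcript line ends with its utterance ID in parentheses, which this page does not state. The function name link_utterances, the transcript text, and the second utterance ID are invented for the demonstration.

```python
import os
import re
import tempfile

# Assumption: each transcript line ends with "(utterance-id)".
UTT_ID = re.compile(r"\(([^()\s]+)\)\s*$")


def link_utterances(trans_path, src_dir, dest_dir, ext=".sph"):
    """Symlink into dest_dir the audio file for each utterance in trans_path."""
    linked, missing = [], []
    with open(trans_path) as trans:
        for line in trans:
            match = UTT_ID.search(line)
            if not match:
                continue  # skip lines without an utterance ID
            name = match.group(1) + ext
            source = os.path.join(src_dir, name)
            target = os.path.join(dest_dir, name)
            if not os.path.isfile(source):
                missing.append(name)  # transcript line with no audio file
                continue
            if not os.path.lexists(target):
                os.symlink(os.path.abspath(source), target)
            linked.append(name)
    return linked, missing


# Small self-contained demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "full_utt")
    dest = os.path.join(root, "train_utt")
    os.makedirs(src)
    os.makedirs(dest)
    open(os.path.join(src, "sw2717B-ms98-a-0044.sph"), "wb").close()
    trans_file = os.path.join(root, "train.trans")
    with open(trans_file, "w") as out:
        out.write("<s> hello there </s> (sw2717B-ms98-a-0044)\n")
        out.write("<s> no audio for this one </s> (sw0000A-ms98-a-0001)\n")
    linked, missing = link_utterances(trans_file, src, dest)
    link_ok = os.path.islink(os.path.join(dest, "sw2717B-ms98-a-0044.sph"))
```

Returning the missing names as well makes it easy to spot transcript lines that have no matching audio file.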

sw2717B-ms98-a-0044.sph is an example of one utterance file. It corresponds to the line in the transcript file that transcribes the spoken audio.

How to listen to data audio files:

The Data Group used FileZilla to navigate to /mnt/main/corpus/switchboard/full/train/audio/utt and copied the randomly sampled audio files onto our personal desktops. We used VLC Media Player to play each of the utterance files and checked that they matched the transcript file. The transcript file that corresponds to the utterance files is /mnt/main/corpus/switchboard/full/train/trans/train.trans, which covers the entire 311 hours.

(Justin Gauthier's 2016 Logs contain some commands that were run to draw a random sample from the total of 250,330 audio files that were to be evaluated.)
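Drawing such a sample can be sketched in Python. This is only an illustration and not the commands from the logs. The function name pick_sample, the file names, the sample size, and the seed are all hypothetical.

```python
import random


def pick_sample(filenames, sample_size, seed=None):
    """Return sample_size distinct filenames chosen uniformly at random."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return sorted(rng.sample(list(filenames), sample_size))


# Hypothetical file names; the real directory holds 250,330 .sph files.
files = [f"utt-{i:06d}.sph" for i in range(1000)]
sample = pick_sample(files, 25, seed=2016)
```

Fixing the seed means every group member can regenerate the same list of files to listen to.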

How to generate per utterance scoring report:

The idea behind scoring the entire 300hr corpus is to identify utterance files that perform badly. If these audio files cannot be corrected, they would be omitted from the training and decoding process, thus decreasing total Word Error Rate (WER). The scoring logs currently generated with SCLite only score per speaker, which averages all utterances for that speaker into a single score. Justin Gauthier found a way to generate a Labeled Utterance Report (LUR), which lists every utterance and its score and will be the basis for grading every utterance. Future semesters will be responsible for deciding what a poor score is (maybe anything above 60%) and will have to investigate those audio files further. If an audio file proves to be mumbled or unclear, it is advised to remove the audio file and also the corresponding line in the transcript file.
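To make the per-utterance score concrete, here is a minimal Python sketch of the standard word-level WER calculation (substitutions, deletions, and insertions divided by the number of reference words) together with a 60% cutoff. It is not SCLite, whose alignment may differ in details. The function names, utterance IDs, and sentences are invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the ref words so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1] / len(ref)


def flag_poor_utterances(scores, threshold=0.60):
    """Return the utterance IDs whose WER is above the threshold."""
    return sorted(utt for utt, wer in scores.items() if wer > threshold)


# Invented reference/hypothesis pairs, keyed by hypothetical utterance IDs.
pairs = {
    "utt-a": ("i went to the store", "i went to the store"),
    "utt-b": ("i went to the store", "he want two more"),
}
scores = {utt: word_error_rate(ref, hyp) for utt, (ref, hyp) in pairs.items()}
poor = flag_poor_utterances(scores)
```

Because insertions count as errors, the WER of a single utterance can exceed 100%, which is worth keeping in mind when choosing a cutoff.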

To generate a Labeled Utterance Report (LUR), first go into the etc directory of your experiment if you ran a train on seen data, or into the DECODE directory of your experiment if the data is unseen. In that directory you may find one or more decode.log files. Run the following command (/mnt/main/scripts/user/parseDecode.pl decode.log ../etc/hyp.trans) to create a hyp.trans for one or all of the decode.log files. In our case the data was unseen. Once that is finished, run this scoring command (sclite -r <exp#>_train.trans -h hyp.trans -i swb -o all lur) for each of the hyp.trans files, using the correct train.trans. Each run of that command produces 4 different files. The most important file is hyp.trans.sys, which has the WER for each utterance and the total WER of that section of the corpus.
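When there are several hyp.trans files, the scoring command can be assembled once per file. The sketch below only builds the command lines and uses exactly the flags quoted above. The function name build_sclite_command is hypothetical, the hyp_1 to hyp_4 names are an assumption based on the report names of experiments 007 and 008, and the reference name is left as the placeholder from the text.

```python
import shlex


def build_sclite_command(ref_trans, hyp_trans):
    """Build the argument list for the sclite call quoted in the text."""
    return ["sclite", "-r", ref_trans, "-h", hyp_trans,
            "-i", "swb", "-o", "all", "lur"]


# One command per hypothesis file. The reference name is a placeholder and
# must be replaced with the real <exp#>_train.trans before running anything.
commands = [build_sclite_command("<exp#>_train.trans", f"hyp_{n}.trans")
            for n in range(1, 5)]
printable = [shlex.join(cmd) for cmd in commands]
```

Each argument list could then be passed to subprocess.run(cmd, check=True) from the directory that holds the transcript files.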

Location of the two-part 300hr full decode:

Two experiments were run to decode the whole 300hr corpus. Experiment 0284/007 decoded the first half of the 300hr corpus (121,165 utterances) and Experiment 0284/008 decoded the second half (the last 121,165 utterances). After creating the LURs, we came to the conclusion that the total WER of the full corpus averages ~41%. We did not go through each specific utterance, which is something that next year's capstone should look into. The LUR tables that were produced are too big to put on the wiki. The locations of the reports are as follows:

  • Experiment 007
    • /mnt/main/Exp/0284/007/etc/hyp_1.trans.sys 42.6% WER
    • /mnt/main/Exp/0284/007/etc/hyp_2.trans.sys 42.2% WER
    • /mnt/main/Exp/0284/007/etc/hyp_3.trans.sys 41.3% WER
    • /mnt/main/Exp/0284/007/etc/hyp_4.trans.sys 41.1% WER
  • Experiment 008
    • /mnt/main/Exp/0284/008/etc/hyp_1.trans.sys 41.8% WER
    • /mnt/main/Exp/0284/008/etc/hyp_2.trans.sys 44.3% WER
    • /mnt/main/Exp/0284/008/etc/hyp_3.trans.sys 41.0% WER
    • /mnt/main/Exp/0284/008/etc/hyp_4.trans.sys 38.9% WER
  • Total WER: 41.65%
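The total can be reproduced as the unweighted mean of the eight section WERs listed above. A short Python check:

```python
from statistics import mean

# Section WERs (percent) as listed for experiments 007 and 008.
wer_007 = [42.6, 42.2, 41.3, 41.1]
wer_008 = [41.8, 44.3, 41.0, 38.9]

# Unweighted mean over all eight sections, rounded to two decimals.
total_wer = round(mean(wer_007 + wer_008), 2)
print(total_wer)
```

This matches the 41.65% figure. A mean weighted by the number of reference words in each section could differ slightly, but those word counts are not listed here.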