Speech:Corpus


Speech Corpus
A speech corpus (or spoken corpus) is a database of speech audio files and their text transcriptions. Transcriptions, in the linguistic sense, are the systematic representation of language in written form. In speech technology, speech corpora are used, among other things, to create acoustic models. An acoustic model is created by taking audio recordings of speech and their text transcriptions, and using software to create statistical representations of the sounds that make up each word. A speech recognition engine then uses the acoustic model to recognize speech. There are two types of speech corpora (corpora is the plural of corpus):

Read Speech - which includes:
 * Book excerpts
 * Broadcast news
 * Lists of words
 * Sequences of numbers

Spontaneous Speech - which includes:
 * Dialogs - between two or more people
 * Narratives - a person telling a story
 * Map-tasks - one person explains a route on a map to another
 * Appointment-tasks - two people try to find a common meeting time based on individual schedules

See https://foss.unh.edu/projects/index.php/Speech:Readings for links to the Switchboard home page and also research that IBM and others have done on how to choose the speakers in the various corpora you create.

Creating New Corpora
To create new corpora, follow the instructions below. (Last updated 4/24/2017)

Note: If copy/pasting, do not copy the '#' at the beginning of commands. Please read through each bullet point before continuing on to the next step.

1) From /mnt/main/corpus/switchboard, run the 'makeCorpus.pl' script
 * Ex: # perl /mnt/main/scripts/user/makeCorpus.pl 
 * ( Ex: 5hr)
2) Navigate to the /info/misc directory
 * Ex: # cd /info/misc
3) Copy the full transcript file from /mnt/main/corpus/switchboard/full/train/trans to your /info/misc directory
 * Ex: # cp /mnt/main/corpus/switchboard/full/train/trans/train.trans .
 * Note the period at the end
4) Double check how many hours the copied transcript file covers
 * Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
 * > 311.761
 * This command tells you the length of the transcript file in hours
5) To trim the transcript down to the desired corpus size (in my case, 5 hours), we must do a bit of math:
 * Take the total number of hours in the full transcript (311 hours) and divide it by your desired corpus size plus 10 additional hours (~5 hours each for 'dev.trans' and 'eval.trans')
 * Ex: 311 / (5 + 10) = 20 (always round down to the closest whole number)
 * This number will act as our sample count for the following step
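The arithmetic in step 5 can be sketched as a small shell function. This is only an illustration: the name sample_count and its arguments are hypothetical and are not part of the project scripts. It assumes awk is available and defaults to the 10 extra hours mentioned above.

```shell
# sample_count FULL_HOURS TARGET_HOURS [EXTRA_HOURS]
# Hypothetical helper: prints FULL_HOURS / (TARGET_HOURS + EXTRA_HOURS),
# rounded down to a whole number. EXTRA_HOURS defaults to 10
# (~5 hours each for dev.trans and eval.trans).
sample_count() {
    awk -v full="$1" -v target="$2" -v extra="${3:-10}" \
        'BEGIN { print int(full / (target + extra)) }'
}

# 311.761 / (5 + 10) = 20.78..., which rounds down to 20
sample_count 311.761 5
```

The result is the sample count to pass to 'sampleTrans.pl' in the next step.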

6) Take a sample of the transcript by running the 'sampleTrans.pl' script
 * Ex: # sampleTrans.pl  
 * Ex: # perl /mnt/main/scripts/user/sampleTrans.pl 20 train.trans
 * This will create 'train.trans-sampled', which contains every 20th line of the transcript. The optional '-r' flag will also create 'train.trans-remaining', which contains all the remaining lines
 * For this step, you do not need to include the -r flag in the command
7) Double check that the size of the sampled transcript matches your desired corpus size (plus ~10 hours)
 * Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled
 * > 15.513
8) Rename 'train.trans' to 'train.trans-full'
 * Ex: # mv train.trans train.trans-full
9) Rename 'train.trans-sampled' to 'train.trans'
 * Ex: # mv train.trans-sampled train.trans
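To make the sampling idea concrete, here is a rough awk sketch of an every-Nth-line split. It is not the actual 'sampleTrans.pl' script: the function name sample_every_nth is hypothetical, and which line of each group of N gets selected (here the Nth, 2Nth, and so on) is an assumption.

```shell
# sample_every_nth N FILE
# Hypothetical stand-in for the behaviour described above: every Nth line
# of FILE goes to FILE-sampled, all other lines go to FILE-remaining.
# Assumption: lines N, 2N, 3N, ... are the ones sampled.
sample_every_nth() {
    awk -v n="$1" -v base="$2" '
        NR % n == 0 { print > (base "-sampled"); next }
                    { print > (base "-remaining") }
    ' "$2"
}

# Usage (writes train.trans-sampled and train.trans-remaining):
#   sample_every_nth 20 train.trans
```

Note that with a count of 1 every line is sampled, so this sketch never creates the -remaining file in that case.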

11) While still in the /info/misc directory, run the 'sampleTrans.pl' script again. However, this time, add the '-r' flag to create the 'train.trans-remaining' file and change the sample count to a value equal to the size of your current 'train.trans' file, divided by 5. (In my case, 5 + 10 = 15, and 15 / 5 = 3)
 * Note: In the following steps, we will be generating our 'dev.trans', 'eval.trans', and 'train.trans' files. Both 'dev.trans' and 'eval.trans' contain about 5 hours each worth of transcript that we will remove from our sampled transcript. 'train.trans' will also contain about 5 hours worth of transcript, but that portion will not be removed from the sample transcript. These files will be for people who want to train on unseen data.
 * Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
 * > 15.513
 * Ex: # perl /mnt/main/scripts/user/sampleTrans.pl -r 3 train.trans
12) Rename 'train.trans' to 'train.trans-old1'
 * Ex: # mv train.trans train.trans-old1
13) Move 'train.trans-sampled' into your /test/trans directory and rename it as 'dev.trans'
 * Ex: # mv train.trans-sampled ../../test/trans/dev.trans
14) Rename 'train.trans-remaining' to 'train.trans'
 * Ex: # mv train.trans-remaining train.trans

15) While still in the /info/misc directory, run the 'sampleTrans.pl' script again. Do not forget to change the sample count to a value equal to the size of your 'train.trans' file, divided by 5.
 * Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
 * > 10.2451
 * Ex: # perl /mnt/main/scripts/user/sampleTrans.pl -r 2 train.trans
16) Rename 'train.trans' to 'train.trans-old2'
 * Ex: # mv train.trans train.trans-old2
17) Move 'train.trans-sampled' into your /test/trans directory and rename it as 'eval.trans'
 * Ex: # mv train.trans-sampled ../../test/trans/eval.trans
18) Rename 'train.trans-remaining' to 'train.trans'
 * Ex: # mv train.trans-remaining train.trans
19) One last time, run the 'sampleTrans.pl' script. Do not forget to change the sample count to a value equal to the size of your 'train.trans' file, divided by 5.
 * Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
 * > 5.2134
 * Ex: # perl /mnt/main/scripts/user/sampleTrans.pl -r 1 train.trans
20) Move the 'train.trans-sampled' file to the /test/trans directory
 * Ex: # mv train.trans-sampled ../../test/trans/train.trans
21) Copy the 'train.trans' file to the /train/trans directory
 * Ex: # cp train.trans ../../train/trans/train.trans
22) Navigate to the /test/trans directory
 * Ex: # cd ../../test/trans
23) Double check that the size of each file (dev.trans, eval.trans, train.trans) is equal to roughly 5 hours
 * Ex: # awk '{total += $3 - $2} END {print total / 3600}' dev.trans
 * > 5.2679
 * Ex: # awk '{total += $3 - $2} END {print total / 3600}' eval.trans
 * > 5.03174
 * Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
 * > 5.2134
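The three size checks in the last step can be wrapped in one loop. The sketch below reuses the same awk expression as the examples above; the function name trans_hours is hypothetical, and it assumes, as those examples do, that fields 2 and 3 of each transcript line are the start and end times in seconds.

```shell
# trans_hours FILE
# Hypothetical helper: sums (end - start) over all lines of a transcript
# and prints the total in hours, to two decimal places.
trans_hours() {
    awk '{ total += $3 - $2 } END { printf "%.2f\n", total / 3600 }' "$1"
}

# Report the size of each split that exists in the current directory.
for f in dev.trans eval.trans train.trans; do
    if [ -f "$f" ]; then
        printf '%s\t%s hours\n' "$f" "$(trans_hours "$f")"
    fi
done
```

Run from the /test/trans directory, each file should come out at roughly 5 hours.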

24) The final steps are to create the links to the utterance files. To do this, first navigate to the /test/audio/utt directory.
 * Ex: # cd <corpus_name>/test/audio/utt
25) Run the 'linkTransUtt.pl' script for 'dev.trans', 'eval.trans', and 'train.trans', as shown below. (Note the '/' at the end of the final path)
 * Ex: # perl /mnt/main/scripts/user/linkTransUtt.pl /mnt/main/corpus/switchboard/<corpus_name>/test/trans/dev.trans /mnt/main/corpus/switchboard/full/train/audio/utt/
 * Ex: # perl /mnt/main/scripts/user/linkTransUtt.pl /mnt/main/corpus/switchboard/<corpus_name>/test/trans/eval.trans /mnt/main/corpus/switchboard/full/train/audio/utt/
 * Ex: # perl /mnt/main/scripts/user/linkTransUtt.pl /mnt/main/corpus/switchboard/<corpus_name>/test/trans/train.trans /mnt/main/corpus/switchboard/full/train/audio/utt/
26) Run the 'ls -l' command to verify that the links are good
 * Ex: # ls -l
 * Note: If the links are good, the utterance file's font color (ex: sw4940B-ms98-a-0073.sph) should be Cyan. If they are not, delete all the links and try again, making sure your paths are typed correctly.
27) Navigate to the /train/audio/utt directory
 * Ex: # cd <corpus_name>/train/audio/utt
28) Run the 'linkTransUtt.pl' script for 'train.trans', as shown below. (Note the '/' at the end of the final path)
 * Ex: # perl /mnt/main/scripts/user/linkTransUtt.pl /mnt/main/corpus/switchboard/<corpus_name>/train/trans/train.trans /mnt/main/corpus/switchboard/full/train/audio/utt/
29) Run the 'ls -l' command to verify that the links are good
 * Ex: # ls -l
 * Note: If the links are good, the utterance file's font color (ex: sw4940B-ms98-a-0073.sph) should be Cyan. If they are not, delete all the links and try again, making sure your paths are typed correctly.
30) You have now successfully set up a new corpus! Try running a full train/decode on the corpus to ensure that everything works.
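Terminal colors depend on the local 'ls' configuration, so a scripted check can be a useful complement to the color test. The sketch below lists symbolic links whose targets do not exist; the function name broken_links is hypothetical and is not one of the project scripts.

```shell
# broken_links DIR
# Hypothetical helper: prints every symbolic link directly inside DIR
# whose target does not exist. Prints nothing if all links are good.
broken_links() {
    for f in "$1"/*; do
        # -L is true for any symlink; -e follows the link, so it is
        # false when the link target is missing.
        if [ -L "$f" ] && [ ! -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage, from inside an utt directory:
#   broken_links .
```

An empty result means every link in the directory resolves to an existing file.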

Making the directories
To create the needed folders, we went on the command line to the location where they would live and then used the MD command (mkdir on most Linux systems) to create them, for example MD corpus. The command can create multiple folders at once as long as they are in the same location. An example of this is creating the full and mini folders:

/mnt/main/corpus/switchboard> MD full mini

Once the folders were created, their ownership was changed to root, using root access, with the chown command, which changes the ownership of a folder:

/mnt/main/corpus/switchboard> chown root full

The group ownership of the folders was also changed to cis790 using the chgrp command:

/mnt/main/corpus/switchboard> chgrp cis790 mini

The command ls -l can be used to view folder permissions and ownerships:

/mnt/main/corpus/switchboard> ls -l
total 8
drwxr-xr-x 5 root cis790 4096 2012-03-06 03:24 full
drwxr-xr-x 5 root cis790 4096 2012-03-06 03:25 mini

The Switchboard directory in /mnt/main/corpus/dist is a redirected directory from /media/data/Switchboard. We created it with the ln -s command, which makes a symbolic link.

Perl Scripts
1. The Perl script used to carry out all of the required processing is GenTrans (see Speech:GenTrans).

SOX
1. The syntax that will create a .wav file from a specified time range in the .sph file is: sox [INPUT].sph [OUTPUT].wav trim [SECOND TO START] [SECONDS DURATION]

As an example of this syntax, I used:

sox sw02001.sph 02001.wav trim 64 7

This created the file 02001.wav, containing the utterance from sw02001.sph that starts at 64 seconds and lasts 7 seconds.
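Because trim takes a start time and a duration, an utterance given by its start and end times needs one subtraction first. The sketch below only prints the sox command rather than running it. The function name sox_trim_cmd is hypothetical, and the assumption is that start and end are in seconds, like the time fields used in the awk examples earlier on this page.

```shell
# sox_trim_cmd IN.sph OUT.wav START END
# Hypothetical helper: prints (does not run) the sox command that extracts
# the span from START to END seconds, converting END into a duration.
sox_trim_cmd() {
    awk -v in_file="$1" -v out_file="$2" -v start="$3" -v end="$4" \
        'BEGIN { printf "sox %s %s trim %s %s\n", in_file, out_file, start, end - start }'
}

# prints: sox sw02001.sph 02001.wav trim 64 7
sox_trim_cmd sw02001.sph 02001.wav 64 71
```

Printing the command first makes it easy to inspect before running it against real audio.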

Commands
1. To create symbolic (soft) links from one file to another, use the ln -s command.

ln -s ../disk1/swb1/*.sph -t.

This command goes back one directory, navigates to disk1 and then to swb1, and grabs all the files ending with .sph. The -t option names the target directory, which in this case was the folder I was running the command from. The command was run from /mnt/main/corpus/dist/Switchboard/flat. Since it was run from the flat folder, it had to navigate backwards to reach the disk1 directory, hence the .. at the beginning of the path.
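The same linking pattern can be sketched as a function that works from any directory. This is an illustration only: the name link_sph_files is hypothetical, it assumes GNU ln (the -t option is not available in every ln implementation), and it resolves the source to an absolute path so the links do not depend on where the command is run.

```shell
# link_sph_files SRC_DIR DEST_DIR
# Hypothetical helper: creates one symbolic link in DEST_DIR for every
# .sph file in SRC_DIR. Assumes GNU ln for the -t (target directory) option.
link_sph_files() {
    # Resolve SRC_DIR to an absolute path so the links stay valid
    # regardless of the current working directory.
    src=$(cd "$1" && pwd) || return 1
    ln -s -t "$2" "$src"/*.sph
}

# Usage, mirroring the command above when run from the flat folder:
#   link_sph_files ../disk1/swb1 .
```

Unlike the original command, this creates absolute links, so they would need to be recreated if the source directory were moved.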