Speech:Corpus

From Openitware

Project Notes


Speech Corpus

A speech corpus (or spoken corpus) is a database of speech audio files and their text transcriptions. Transcriptions, in the linguistic sense, are the systematic representation of language in written form. In speech technology, speech corpora are used, among other things, to create acoustic models. An acoustic model is created by taking audio recordings of speech and their text transcriptions, and using software to create statistical representations of the sounds that make up each word. A speech recognition engine then uses the acoustic model to recognize speech.

There are two types of Speech Corpora (corpora is the plural of corpus):

Read Speech - which includes:

  • Book excerpts
  • Broadcast news
  • Lists of words
  • Sequences of numbers

Spontaneous Speech - which includes:

  • Dialogs - between two or more people
  • Narratives - a person telling a story
  • Map-tasks - one person explains a route on a map to another
  • Appointment-tasks - two people try to find a common meeting time based on individual schedules


See https://foss.unh.edu/projects/index.php/Speech:Readings for links to the Switchboard home page and also research that IBM and others have done on how to choose the speakers in the various corpora you create.

Creating New Corpora

To create new corpora, follow the instructions below. (Last updated 4/24/2017)

Note: If copy/pasting, do not copy the '#' at the beginning of commands. Please read through each bullet point before continuing on to the next step.

1) From /mnt/main/corpus/switchboard, run the 'makeCorpus.pl' script

  • Ex: # perl /mnt/main/scripts/user/makeCorpus.pl <corpus_name>
    • (<corpus_name> Ex: 5hr)

2) Navigate to the <corpus_name>/info/misc directory

  • Ex: # cd <corpus_name>/info/misc

3) Copy the full transcript file from /mnt/main/corpus/switchboard/full/train/trans to your <corpus_name>/info/misc directory

  • Ex: # cp /mnt/main/corpus/switchboard/full/train/trans/train.trans .
    • Note the period at the end

4) Double check to see how many hours the copied transcript file covers

  • Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
    • > 311.761
  • This command tells you the length of the transcript file in hours
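
Because the same awk command is reused in several later steps, it can be wrapped in a small shell function. This is only a convenience sketch: trans_hours is a hypothetical name, and it assumes (as the awk command above does) that field 2 of each transcript line is the start time and field 3 the end time, both in seconds.

```shell
# trans_hours FILE: print the total duration of a transcript in hours.
# Assumes field 2 = start time and field 3 = end time, in seconds.
trans_hours() {
    awk '{ total += $3 - $2 } END { print total / 3600 }' "$1"
}
```

With the function defined, running trans_hours train.trans gives the same number as the full awk command.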

5) In order to trim down the transcript to our desired corpus size (in my case, 5 hours), we must do a bit of math:

  • Take the total number of hours in the full transcript (311 hours) and divide it by your desired corpus size plus 10 additional hours (~5 hours each for 'dev.trans' and 'eval.trans')
    • Ex: 311 / (5 + 10) = 20 (always round down to the nearest whole number)
    • This number will act as our sample count for the following step
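
The arithmetic in this step can be sketched in shell. The function name sample_count is hypothetical, and it only handles whole-hour inputs because shell arithmetic works on integers. Integer division truncates, which matches the "round down" rule.

```shell
# sample_count TOTAL_HOURS CORPUS_HOURS HELDOUT_HOURS
# Prints TOTAL / (CORPUS + HELDOUT), rounded down by integer division.
sample_count() {
    echo $(( $1 / ($2 + $3) ))
}

sample_count 311 5 10   # prints 20
```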

6) Take a sample of the transcript by running the 'sampleTrans.pl' script

  • Ex: # sampleTrans.pl <optional_flag> <sample_count> <transcript>
    • Ex: # perl /mnt/main/scripts/user/sampleTrans.pl 20 train.trans
      • This will create 'train.trans-sampled', which contains every 20th line of the transcript. The optional '-r' flag also creates 'train.trans-remaining', which contains all the remaining lines
      • For this step, you do not need to include the -r flag in the command
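
To make the idea of "every Nth line" concrete, here is a minimal awk sketch of that kind of sampling. It is not the sampleTrans.pl script and may differ from it in detail. In particular, it assumes lines N, 2N, 3N, ... are the ones selected, and sample_every_nth is a hypothetical name.

```shell
# sample_every_nth N FILE
# Writes every Nth line of FILE to FILE-sampled and all other lines to
# FILE-remaining. Assumption: lines N, 2N, 3N, ... are selected; the
# real sampleTrans.pl may start at a different offset.
sample_every_nth() {
    awk -v n="$1" -v out="$2" '
        NR % n == 0 { print > (out "-sampled"); next }
                    { print > (out "-remaining") }
    ' "$2"
}
```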

7) Double check that the size of the sampled transcript matches your desired corpus size (plus ~10 hours)

  • Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled
    • > 15.513

8) Rename 'train.trans' to 'train.trans-full'

  • Ex: # mv train.trans train.trans-full

9) Rename 'train.trans-sampled' to 'train.trans'

  • Ex: # mv train.trans-sampled train.trans

11) While still in the <corpus_name>/info/misc directory, run the 'sampleTrans.pl' script again. However, this time, add in the '-r' flag to create the 'train.trans-remaining' file and change the sample count to a value equal to the size of your current 'train.trans' file, divided by 5. (In my case, 5 + 10 = 15, and 15 / 5 = 3)

  • Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
    • > 15.513
  • Ex: # perl /mnt/main/scripts/user/sampleTrans.pl -r 3 train.trans
    • Note: In the following steps, we will generate our 'dev.trans', 'eval.trans', and 'train.trans' files. 'dev.trans' and 'eval.trans' each contain about 5 hours of transcript that we will remove from our sampled transcript. 'train.trans' will also contain about 5 hours of transcript, but that portion will not be removed from the sampled transcript. The 'dev.trans' and 'eval.trans' files are for people who want to test on data that was not used for training.

12) Rename 'train.trans' to 'train.trans-old1'

  • Ex: # mv train.trans train.trans-old1

13) Move 'train.trans-sampled' into your <corpus_name>/test/trans directory and rename it as 'dev.trans'

  • Ex: # mv train.trans-sampled ../../test/trans/dev.trans

14) Rename 'train.trans-remaining' to 'train.trans'

  • Ex: # mv train.trans-remaining train.trans

15) While still in the <corpus_name>/info/misc directory, run the 'sampleTrans.pl' script again. Do not forget to change the sample count to a value equal to the size of your 'train.trans' file, divided by 5.

  • Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
    • > 10.2451
  • Ex: # perl /mnt/main/scripts/user/sampleTrans.pl -r 2 train.trans

16) Rename 'train.trans' to 'train.trans-old2'

  • Ex: # mv train.trans train.trans-old2

17) Move 'train.trans-sampled' into your <corpus_name>/test/trans directory and rename it as 'eval.trans'

  • Ex: # mv train.trans-sampled ../../test/trans/eval.trans

18) Rename 'train.trans-remaining' to 'train.trans'

  • Ex: # mv train.trans-remaining train.trans

19) One last time, run the 'sampleTrans.pl' script. Do not forget to change the sample count to a value equal to the size of your 'train.trans' file, divided by 5.

  • Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
    • > 5.2134
  • Ex: # perl /mnt/main/scripts/user/sampleTrans.pl -r 1 train.trans

20) Move the 'train.trans-sampled' file to the <corpus_name>/test/trans directory and rename it as 'train.trans'

  • Ex: # mv train.trans-sampled ../../test/trans/train.trans

21) Copy the 'train.trans' file to the <corpus_name>/train/trans directory

  • Ex: # cp train.trans ../../train/trans/train.trans

22) Navigate to the <corpus_name>/test/trans directory

  • Ex: # cd ../../test/trans

23) Double check that the size of each file (dev.trans, eval.trans, train.trans) is equal to roughly 5 hours

  • Ex: # awk '{total += $3 - $2} END {print total / 3600}' dev.trans
    • > 5.2679
  • Ex: # awk '{total += $3 - $2} END {print total / 3600}' eval.trans
    • > 5.03174
  • Ex: # awk '{total += $3 - $2} END {print total / 3600}' train.trans
    • > 5.2134
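
The three checks can also be scripted. The sketch below uses a hypothetical helper, check_hours, which succeeds only when a transcript's duration is within a chosen tolerance of the target. The 0.5 hour tolerance in the usage example is an assumption standing in for "roughly 5 hours".

```shell
# check_hours FILE TARGET TOLERANCE
# Prints FILE's duration and succeeds if it is within TOLERANCE hours of TARGET.
# Assumes field 2 = start time and field 3 = end time, in seconds.
check_hours() {
    awk -v target="$2" -v tol="$3" '
        { total += $3 - $2 }
        END {
            hours = total / 3600
            diff = hours - target
            if (diff < 0) diff = -diff
            printf "%s: %.4f hours\n", FILENAME, hours
            exit (diff <= tol ? 0 : 1)
        }' "$1"
}

# Usage, from the <corpus_name>/test/trans directory:
#   for f in dev.trans eval.trans train.trans; do
#       check_hours "$f" 5 0.5 || echo "$f is not close to 5 hours"
#   done
```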

24) The final steps are to create the links to the utterance files. To do this, first navigate to the <corpus_name>/test/audio/utt directory.

  • Ex: # cd /mnt/main/corpus/switchboard/<corpus_name>/test/audio/utt

25) Run the 'linkTransUtt.pl' script for 'dev.trans', 'eval.trans', and 'train.trans', as shown below. (Note the '/' at the end of the final path)

  • Ex: # perl /mnt/main/scripts/user/linkTransUtt.pl /mnt/main/corpus/switchboard/<corpus_name>/test/trans/dev.trans /mnt/main/corpus/switchboard/full/train/audio/utt/
  • Ex: # perl /mnt/main/scripts/user/linkTransUtt.pl /mnt/main/corpus/switchboard/<corpus_name>/test/trans/eval.trans /mnt/main/corpus/switchboard/full/train/audio/utt/
  • Ex: # perl /mnt/main/scripts/user/linkTransUtt.pl /mnt/main/corpus/switchboard/<corpus_name>/test/trans/train.trans /mnt/main/corpus/switchboard/full/train/audio/utt/
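
The three commands differ only in the transcript name, so they can be written as a loop. This is a sketch, not part of the original workflow: link_test_sets and the LINK_CMD variable are hypothetical, and LINK_CMD exists only so the commands can be previewed with LINK_CMD=echo before anything is linked. It should still be run from the test/audio/utt directory, as in the step above.

```shell
# link_test_sets CORPUS_NAME
# Runs linkTransUtt.pl once each for dev.trans, eval.trans and train.trans.
# LINK_CMD is a hypothetical override; set LINK_CMD=echo to preview the commands.
link_test_sets() {
    base=/mnt/main/corpus/switchboard
    for name in dev eval train; do
        ${LINK_CMD:-perl /mnt/main/scripts/user/linkTransUtt.pl} \
            "$base/$1/test/trans/$name.trans" \
            "$base/full/train/audio/utt/" || return 1
    done
}
```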

26) Run the 'ls -l' command to verify that the links are good

  • Ex: # ls -l
    • Note: If the links are good, the utterance file's font color (ex: sw4940B-ms98-a-0073.sph) should be Cyan. If they are not, delete all the links and try again, making sure your paths are typed correctly.
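
Terminal colors vary between setups, so a scripted check may be more dependable than reading the colors. The sketch below lists any symbolic link whose target does not exist. broken_links is a hypothetical name, and -maxdepth is a common find extension rather than part of POSIX.

```shell
# broken_links DIR: print every symbolic link in DIR whose target is missing.
# "test -e" follows the link, so it fails for a dangling link.
broken_links() {
    find "$1" -maxdepth 1 -type l ! -exec test -e {} \; -print
}
```

Running broken_links . in the utt directory should print nothing when all the links are good. The same check applies to any other directory of links created in this procedure.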

27) Navigate to the <corpus_name>/train/audio/utt directory

  • Ex: # cd /mnt/main/corpus/switchboard/<corpus_name>/train/audio/utt

28) Run the 'linkTransUtt.pl' script for 'train.trans', as shown below. (Note the '/' at the end of the final path)

  • Ex: # perl /mnt/main/scripts/user/linkTransUtt.pl /mnt/main/corpus/switchboard/<corpus_name>/train/trans/train.trans /mnt/main/corpus/switchboard/full/train/audio/utt/

29) Run the 'ls -l' command to verify that the links are good

  • Ex: # ls -l
    • Note: If the links are good, the utterance file's font color (ex: sw4940B-ms98-a-0073.sph) should be Cyan. If they are not, delete all the links and try again, making sure your paths are typed correctly.

30) You have now successfully set up a new corpus! Try running a full train/decode on the corpus to ensure that everything works.

Speech Data Setup

Making the directories

To create the needed folders, we navigated on the command line to the location where they would live and then used the MD command to create them; for example, MD corpus. The command can create multiple folders at once as long as they are in the same location. For example, MD full mini creates the full and mini folders.

 /mnt/main/corpus/switchboard> MD full mini

Once the folders were created, their ownership was changed to root with the chown command, which changes the owner of a file or folder. This requires root access.

 /mnt/main/corpus/switchboard> chown root full

The group ownership of the folders was also changed to cis790 using the chgrp command.

 /mnt/main/corpus/switchboard> chgrp cis790 mini

The command ls -l can be used to view folder permissions and ownerships.

 /mnt/main/corpus/switchboard> ls -l
 total 8
 drwxr-xr-x 5 root cis790 4096 2012-03-06 03:24 full
 drwxr-xr-x 5 root cis790 4096 2012-03-06 03:25 mini
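
As an aside, chown also accepts an owner:group pair, so the owner and group can be set in one command instead of using chown and chgrp separately. The wrapper below is a sketch with a hypothetical name, set_owner_group. Changing ownership to root still requires root access.

```shell
# set_owner_group OWNER GROUP DIR...
# Sets both the owner and the group of each DIR with a single chown call.
set_owner_group() {
    owner=$1
    group=$2
    shift 2
    chown "$owner:$group" "$@"
}

# Example (as root): set_owner_group root cis790 full mini
```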

The Switchboard directory in /mnt/main/corpus/dist is a symbolic link to /media/data/Switchboard. We created it with the ln -s command, which creates symbolic links.

Current Directory Setup

[Image: Filestructure.PNG]

Perl Scripts

1. The Perl script used to carry out the required processing is GenTrans.

SOX

1. The syntax that will create a .wav file from a specified time range in the .sph file is:

sox <old>.sph <new>.wav trim [SECOND TO START] [SECONDS DURATION]

For example, I used:

sox sw02001.sph 02001.wav trim 64 7

This created a .wav file containing the utterance in sw02001.sph that starts at 64 seconds and lasts 7 seconds.
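
The trim effect takes a start time and a duration, whereas an utterance is often described by its start and end times. A small helper can do the subtraction. This is a sketch only, and trim_args is a hypothetical name.

```shell
# trim_args START END: print the sox trim arguments for an utterance
# that runs from START to END seconds (duration = END - START).
trim_args() {
    awk -v s="$1" -v e="$2" 'BEGIN { print "trim", s, e - s }'
}

# Example: sox sw02001.sph 02001.wav $(trim_args 64 71)
```

For a start of 64 and an end of 71, the helper prints trim 64 7, which matches the example command above.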

Commands

1. To create symbolic (soft) links to files, use the ln -s command.

ln -s ../disk1/swb1/*.sph -t .

This command goes up one directory and then into disk1 and swb1, where it matches every file ending in .sph. The -t option names the target directory, which in this case was the folder I was running the command from. The command was run from /mnt/main/corpus/dist/Switchboard/flat, so it had to navigate up a level (hence the .. at the beginning of the path) to reach the disk1 directory.
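
The same pattern can be wrapped in a function so it does not depend on which directory you happen to be in. This is a sketch with a hypothetical name, link_sph. The source directory is given relative to the destination, because a relative symbolic link is resolved from the directory that contains the link. The -t option is a GNU ln feature.

```shell
# link_sph SRC_DIR DEST_DIR
# Links every .sph file in SRC_DIR into DEST_DIR.
# SRC_DIR must be written relative to DEST_DIR (for example ../disk1/swb1),
# because relative link targets are resolved from the link's own directory.
link_sph() {
    ( cd "$2" && ln -s "$1"/*.sph -t . )
}

# Example: link_sph ../disk1/swb1 /mnt/main/corpus/dist/Switchboard/flat
```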