Speech:Models Data Prep

Model Building: Data Preparations

Description of the initial setup and preparation of the data needed to build statistical language models, generate a robust set of acoustic models (training), and verify them by testing (decoding) on the trained corpus. For detailed steps on how to train and decode, see the sub-steps under Model Building.

General Overview

  • In order to successfully run a train and decode, all of the correct files need to be in place. The three main groups of files needed to accomplish this are: the actual audio files in .SPH format, a transcript of the audio files, and a working dictionary. The dictionary must contain the words used in the transcripts along with their phonetic spellings, including pronunciations for names (see the example entries just after this list).
  • Current copies of the .SPH audio files can be found on Caesar in the following directory: /media/data/Switchboard/disk1/swb1
  • A current copy of the transcripts can be found on Caesar in the following file: ~/speechtools/SphinxTrain-1.0/train1/etc/trans_unedited
  • The last item required for performing a train and decode is a working dictionary that contains all English words and their pronunciations. The problem with large dictionaries is that they can be hard to process. To address this problem, it's recommended that a dictionary be created which contains only the words that are used in the transcripts. The group of students who worked on speech in 2011 created a script to build such a transcript-specific dictionary, but it doesn't work perfectly. Ted began working on a new script and was making progress but faced an issue with outputting the results to a file. This may become part of the wiki in the next few weeks. In the meantime, the 2011 group's dictionary can be used. It is located on Caesar at ~/speechtools/SphinxTrain-1.0/train1/etc/train1.dic. Listed below are the steps for creating a new dictionary.
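For reference, each dictionary entry pairs a word with its phonetic spelling in ARPAbet phones, one entry per line, with alternate pronunciations marked by a number in parentheses. The lines below follow the cmudict convention and are purely illustrative:

ABANDON  AH B AE N D AH N
HELLO  HH AH L OW
HELLO(2)  HH EH L OW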

Create a new dictionary

  • Find a master dictionary. A master dictionary is essentially a list of English words paired with their phonetic spellings. One exists on CMU's site at https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/cmudict/cmudict.0.6d and a copy also currently resides on Caesar at caesar:~/speechtools/SphinxTrain-1.0/cmudict.06d
  • Next, create a new dictionary which is based on the master dictionary and contains a distinct list of the words found in the transcripts. This can be accomplished through a script. The script currently used by this class is written below this section.
  • Once these files are all in the same directory, change the name of the dictionary file to dictfile and the name of the transcript to wordfile. Once this is done, run the script. The code is below, after a short sketch of this setup step.
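As a rough sketch of that setup step (the directory name newdict is only a placeholder; the source paths are the Caesar locations given earlier on this page, and the trimming script, create.pl, should be copied into the same directory as well):

% mkdir newdict
% cd newdict
% cp ~/speechtools/SphinxTrain-1.0/cmudict.06d dictfile
% cp ~/speechtools/SphinxTrain-1.0/train1/etc/trans_unedited wordfile

With the master dictionary and transcript in place under those names, the script below can be run against them.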


#!/usr/bin/perl
# create.pl - print the dictionary entry for every word found in the transcript,
# flagging words that are missing from the master dictionary.
# Usage: perl create.pl wordfile dictfile
use strict;
use warnings;

# Expect exactly two arguments: the transcript (wordfile) and the master dictionary (dictfile).
@ARGV == 2 or die "Insufficient arguments: Need word file and dict file names";
my ($wordfile, $dictfile) = @ARGV;

open my $d, "<", $dictfile or die "Cannot open $dictfile: $!";
open my $w, "<", $wordfile or die "Cannot open $wordfile: $!";

# Build a hash mapping each dictionary word to the rest of its line (the phonetic spelling).
my %dict = map split( ' ', $_, 2 ), <$d>;
close $d;

# Walk through the transcript word by word, printing the dictionary entry for
# known words and a warning line for words that are not in the dictionary.
while ( <$w> ) {
  for my $word ( split ) {
    if ( exists $dict{ $word } ) {
       print "$word: $dict{ $word }";
       next;
    }
    print "$word is not in the dictionary\n";
  }
}
close $w;

Once you have the transcript, the larger dictionary, and the script to refine the dictionary all in one folder, type in this command to run it:

% perl create.pl wordfile dictfile | tee -a train1.dic

This command runs the Perl script on the two files it needs. The second part, | tee -a train1.dic, takes the output that would otherwise only be printed to the screen and appends it to an actual file, in this case named train1.dic, which is your new trimmed-down dictionary.
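For illustration only, here is the general shape of the output that tee captures. These particular lines are invented; the real output depends entirely on your transcript and dictionary and follows the print statements in the script above:

HELLO: HH AH L OW
ABOUT: AH B AW T
SWITCHVILLE is not in the dictionary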

A few things you might still need to know

  • After you have run the script to slim down the dictionary so that it only contains the words used in the transcript, you will notice that it also gives you the words which are not in the original larger dictionary. This is useful because you will need to add the phonetic spelling next to them. The downside is that every utterance in the transcript is wrapped in the markers <s> and </s>, which are not in the dictionary. The good news is that you can remove them with a simple sed command in Unix. Note that you want to run this on the transcript BEFORE you run the dictionary script above.
% cat wordfile | sed "s|<s>||g" | sed "s|</s>||g" > wordfile2

This will remove all of the stray <s> and </s> markers from the transcript and write the output to a new file without them. Notice that the original file was called wordfile and the new file is wordfile2. After the command finishes, delete wordfile and rename wordfile2 back to wordfile, as shown below. It is necessary that the transcript is named wordfile for the dictionary creation script to work.
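One way to do that delete-and-rename step, using standard Unix commands in the same directory:

% rm wordfile
% mv wordfile2 wordfile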