Speech:Models Data Prep

Class Notes

Speech System Model Building

  1. [Data Preparation]
  2. Language Modeling
  3. Building & Verifying Models

Data Preparation Steps

  • To run Train and Decode, all the correct files need to be in place. The three main things we need are the actual audio files in .SPH format, a transcript of those files, and a working dictionary that lists each word with its phonetic spelling (its pronunciation).
  • You can find a current copy of the audio files on Caesar under caesar:/media/data/Switchboard/disk1/swb1
  • You can find a copy of the transcripts under caesar:~/speechtools/SphinxTrain-1.0/train1/etc; it is the file called trans_uneditied.
  • The last piece you will need is a working dictionary that contains the words and their pronunciations. The problem with giant dictionaries is that they are too hard to process, so we need to shrink them down to only the words that appear in the transcripts. Last year's group created a script that did that, but it doesn't seem to work perfectly. I have begun working on a new script and am making progress, but have not gotten it to output to a new file yet. I hope to in the next few weeks and will update this part of the wiki then. In the meantime you can use the dictionary they created last year; it is under caesar:~/speechtools/SphinxTrain-1.0/train1/etc train1.dic. Below is what we need to do to create a more up-to-date dictionary.

Create a new dictionary

  • Find a master dictionary, meaning one with a huge list of words. You can currently find one on CMU's site at https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/cmudict/cmudict.0.6d A copy also currently resides on Caesar under caesar:~/speechtools/SphinxTrain-1.0 cmudict.06d
  • Next, create a new directory to hold a copy of the master dictionary, the transcripts, and the script we are running, which shrinks the dictionary down into a slimmer, more efficient version.
  • Once these are all in the same directory, rename the dictionary file to dictfile and the transcripts to wordfile. Once this is done you will want to run the script. The code is below.

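use strict;
use warnings;

# Usage: perl create.pl wordfile dictfile
@ARGV >= 2 or die "Insufficient arguments: need word file and dict file names\n";
my ($wordfile, $dictfile) = @ARGV;
open my $d, "<", $dictfile or die "Cannot open $dictfile: $!";
open my $w, "<", $wordfile or die "Cannot open $wordfile: $!";

# Load the master dictionary: first field is the word, the rest of the
# line (including its newline) is the pronunciation.
my %dict = map split( ' ', $_, 2 ), <$d>;
close $d;

# Look up every word of the transcript, one occurrence at a time.
while ( <$w> ) {
  for my $word ( split ) {
    if ( exists $dict{ $word } ) {
      print "$word: $dict{ $word }";
    } else {
      print "$word is not in the dictionary\n";
    }
  }
}
close $w;
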
  • Once you have the transcripts, the larger dictionary, and the script to refine the dictionary all in one folder, type this command to run it:

% perl create.pl wordfile dictfile |tee -a train1.dic

This command runs the Perl script and passes it the two files it needs. The second part, |tee -a train1.dic, takes the output that would otherwise only be printed to the screen and also writes it to a file, in this case train1.dic, which is your new trimmed-down dictionary. Note that -a appends, so if a train1.dic already exists in the directory, the new output is added to the end of it.
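Because the script prints one line per word occurrence and sends its "is not in the dictionary" notes to the same output, the resulting file can mix repeated entries with those notes. The following is only a rough sketch of one way to separate the two afterwards. The file names sample_train1.dic, sample_entries.dic and sample_missing.txt, and the sample lines themselves, are made up for illustration and are not part of the original workflow.

```shell
# Hypothetical sample of the script's combined output (not real project data).
printf '%s\n' \
  'HELLO: HH AH0 L OW1' \
  'WORLD: W ER1 L D' \
  'HELLO: HH AH0 L OW1' \
  'ZORP is not in the dictionary' > sample_train1.dic

# Dictionary entries: drop the "not in the dictionary" notes and
# keep a single copy of each remaining line.
grep -v ' is not in the dictionary$' sample_train1.dic \
  | sort -u > sample_entries.dic

# Missing words: keep only the notes, strip the trailing message,
# and keep a single copy of each word.
grep ' is not in the dictionary$' sample_train1.dic \
  | sed 's/ is not in the dictionary$//' \
  | sort -u > sample_missing.txt

cat sample_entries.dic
cat sample_missing.txt
```

The second file then serves as a to-do list of words that still need a phonetic spelling added by hand.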

A few things you might still need to know

  • After you have run the script to slim the dictionary down to only the words used in the transcript, you will notice it also reports words that are not in the original larger dictionary. This is useful because you will need to add the phonetic spelling next to them. The downside is that every transcript line begins with the letter s in brackets, which is not in the dictionary. The good news is you can remove it with a simple sed command in Unix. One caveat: run this on the transcript BEFORE you run the dictionary script above.

% cat wordfile | sed "s|<s>||g" | sed "s|</s>||g" > wordfile2

This removes all the stray bracketed s's from the transcript and writes the result to a new file. Notice the original file was called wordfile and the second file is wordfile2. After it is done, delete wordfile and rename wordfile2 to wordfile. The transcript must be named wordfile for the command that runs the dictionary creation script to work as written.
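As a small illustration of the clean-up and rename steps, here is a sketch run on a made-up one-line transcript. It assumes the bracketed markers are written <s> and </s>, which is an assumption about the transcript format; the file names sample_wordfile and sample_wordfile2 are hypothetical stand-ins so that a real wordfile is not overwritten.

```shell
# Hypothetical one-line transcript; real transcript lines will differ.
printf '<s> hello world </s>\n' > sample_wordfile

# Strip both markers in a single sed call (two -e expressions)
# instead of piping through sed twice.
sed -e 's|<s>||g' -e 's|</s>||g' sample_wordfile > sample_wordfile2

# Replace the original with the cleaned copy so the file keeps its name.
mv sample_wordfile2 sample_wordfile

cat sample_wordfile
```

The spaces around the removed markers are left behind, which should be harmless here because the dictionary script splits each line on whitespace.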