Speech:Models Data Prep

From Openitware
Revision as of 09:00, 10 April 2012 by Cpc2 (Talk | contribs)

Class Notes

Speech System Model Building

  1. [Data Preparation]
  2. Language Modeling
  3. Building & Verifying Models

Data Preparation Steps

  • In order to train and decode, we need to have all the correct files in place. The three main things we need are the actual audio files in .SPH format, a transcript of those files, and a working dictionary that contains the current words and the phonetic spelling (pronunciation) of each word.
  • You can find a current copy of the audio files on Caesar under caesar:/media/data/Switchboard/disk1/swb1
  • You can find a copy of the transcripts under caesar:~/speechtools/SphinxTrain-1.0/train1/etc; it is the file called trans_uneditied.
  • The last piece you will need is a working dictionary that contains the words and their pronunciations. The problem with giant dictionaries is that they are too hard to process, so we need to shrink them down to contain only the words that appear in the transcripts. Last year's group created a script that did that, but it doesn't seem to work perfectly. I have begun working on a new script and am making progress, but have not gotten it to output to a new file yet. I hope to in the next few weeks, and will then update this part of the wiki. In the meantime you can use the dictionary they created last year; it is the file train1.dic under caesar:~/speechtools/SphinxTrain-1.0/train1/etc. Below is what we need to do to create a more modern dictionary.

Create a new dictionary

  • Find a master dictionary, meaning one with a huge list of words. You can currently find one on CMU's site at https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/cmudict/cmudict.0.6d. A copy also currently resides on Caesar under caesar:~/speechtools/SphinxTrain-1.0 as cmudict.06d
  • Next, create a new directory that will contain a copy of the master dictionary, the transcripts, and the script we are running, which will shrink the dictionary down into a slimmer, more efficient version.
  • Once these are all in the same directory, rename the dictionary file to dictfile and the transcript file to wordfile. Then run the script, passing the word file name first and the dictionary file name second. The code is below.

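use strict;
use warnings;

# Usage: pass the word (transcript) file name first, then the dictionary file name.
@ARGV == 2 or die "Insufficient arguments: Need word file, and Dict file names\n";
my ($wordfile, $dictfile) = @ARGV;
open my $d, "<", $dictfile or die "Cannot open $dictfile: $!";
open my $w, "<", $wordfile or die "Cannot open $wordfile: $!";

# Map each dictionary word to its pronunciation (the trailing newline is kept).
# Lines without both a word and a pronunciation are skipped.
my %dict = map { my @f = split ' ', $_, 2; @f == 2 ? @f : () } <$d>;
close $d;

# Look up every whitespace-separated word of the transcripts.
while ( <$w> ) {
  for my $word ( split ) {
    if ( exists $dict{ $word } ) {
      print "$word: $dict{ $word }";
    } else {
      print "$word is not in the dictionary\n";
    }
  }
}
close $w;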

  • As of right now the script only prints the words and their pronunciations to your screen. Once I get it to write them to another file, we should have a more efficient dictionary. I will update this part of the wiki in the next few weeks when more information is available.
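
Until the script is updated, the missing step of writing the reduced dictionary to a file could look something like the sketch below. This is a Python illustration, not the Perl script from this page. The function name shrink_dictionary, the script name shrink_dict.py, and the output file argument are hypothetical. It assumes that each dictionary line is a word followed by whitespace and its pronunciation, and that transcript words are matched exactly as written (no case conversion).

```python
import sys


def shrink_dictionary(wordfile, dictfile, outfile):
    """Write to outfile only the dictfile entries whose word occurs in wordfile.

    Returns the sorted list of transcript tokens missing from the dictionary.
    (Hypothetical helper for illustration; not part of SphinxTrain.)
    """
    # Collect every whitespace-separated token in the transcripts.
    words = set()
    with open(wordfile) as w:
        for line in w:
            words.update(line.split())

    found = set()
    with open(dictfile) as d, open(outfile, "w") as out:
        for line in d:
            fields = line.split(None, 1)
            # Skip blank lines and lines without a pronunciation.
            if len(fields) < 2:
                continue
            if fields[0] in words:
                # Keep the entry, preserving the master dictionary's order.
                out.write(f"{fields[0]} {fields[1].strip()}\n")
                found.add(fields[0])

    return sorted(words - found)


if __name__ == "__main__":
    if len(sys.argv) != 4:
        sys.exit("usage: shrink_dict.py WORDFILE DICTFILE OUTFILE")
    for word in shrink_dictionary(*sys.argv[1:4]):
        print(f"{word} is not in the dictionary", file=sys.stderr)
```

Run this way, the reduced dictionary goes to the output file and the missing words go to the screen, so the two are not mixed. Any transcript token that is not a real word, such as an utterance marker, would simply be reported as missing.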