Speech:Models Data Prep



Project Notes


Model Building: Data Preparations

This page describes the initial setup and preparation of the data needed to build statistical language models, generate a robust set of acoustic models (training), and verify them by testing (decoding) on the trained corpus. For detailed steps on how to train and decode, see the sub-steps under Model Building.

General Overview

  • To run a train and decode successfully, all of the required files must be in place. Three groups of files are needed: the audio files in .SPH format, a transcript of the audio files, and a working dictionary. The dictionary must contain the words currently in use together with their phonetic spellings, including pronunciations for names.
  • Current copies of the .SPH audio files can be found on Caesar in the following directory: /media/data/Switchboard/disk1/swb1
  • Current copies of the transcripts can be found on Caesar in the following directory: ~/speechtools/SphinxTrain-1.0/train1/etc/trans_unedited
  • The last item required for a train and decode is a working dictionary that contains the words and their pronunciations. A dictionary of all English words is large and can be hard to process, so it is recommended to create a dictionary that contains only the words used in the transcripts. The group of students who worked on speech in 2011 wrote a script that builds such a transcript-specific dictionary, but it does not work perfectly. Ted began working on a new script and was making progress, but ran into an issue with writing the results to a file. This may become part of the wiki in the next few weeks. In the meantime, the 2011 group's dictionary can be used. It is located on Caesar in the directory /speechtools/SphinxTrain-1.0/train1/etc under the name train1.dic. The requirements for creating a new dictionary are listed below.
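The three groups of files can be checked before starting a train. The following is only a minimal Python sketch of such a check and is not part of the class scripts: the function name missing_inputs and its three parameters are made up for illustration, and it assumes the transcript and the dictionary are each a single file.

```python
from pathlib import Path


def missing_inputs(audio_dir, transcript, dictionary):
    """Return a list of problems found; an empty list means all inputs exist."""
    problems = []

    # Audio: expect at least one .sph file (any letter case) under the directory.
    audio = Path(audio_dir).expanduser()
    if not audio.is_dir():
        problems.append(f"audio directory not found: {audio}")
    elif not any(p.is_file() and p.suffix.lower() == ".sph"
                 for p in audio.rglob("*")):
        problems.append(f"no .sph files under: {audio}")

    # Transcript and dictionary: each must exist and must not be empty.
    for label, name in (("transcript", transcript), ("dictionary", dictionary)):
        path = Path(name).expanduser()
        if not path.is_file():
            problems.append(f"{label} file not found: {path}")
        elif path.stat().st_size == 0:
            problems.append(f"{label} file is empty: {path}")

    return problems
```

Call it with the audio directory, the transcript file and the dictionary file you intend to use, and print whatever it returns.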

Create a new dictionary

  • Find a master dictionary. A master dictionary pairs each English word with its phonetic pronunciation. One is available from CMU at https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/cmudict/cmudict.0.6d This master dictionary also currently resides on Caesar in caesar:~/speechtools/SphinxTrain-1.0 under the name cmudict.06d
  • Next, create a new dictionary that is based on the master dictionary and contains a distinct list of the words found in the transcripts. This can be done with a script. The script currently used by this class is given below this section.
  • Once the script, the master dictionary and the transcripts are all in the same directory, rename the dictionary file to dictfile and the transcript file to wordfile. Then run the script. The code is below.


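#!/usr/bin/perl
# Look up every word of the transcript in the master dictionary.
use strict;
use warnings;

# Both file names are required.
@ARGV == 2 or die "Insufficient arguments: need word file and dict file names\n";
my ($wordfile, $dictfile) = @ARGV;

open my $d, "<", $dictfile or die "Cannot open $dictfile: $!";
open my $w, "<", $wordfile or die "Cannot open $wordfile: $!";

# Map each word to its pronunciation (the rest of the line, newline included).
# Lines that lack a word or a pronunciation are skipped.
my %dict = map { my @f = split ' ', $_, 2; @f == 2 ? @f : () } <$d>;
close $d;

my %seen;    # words already handled, so that each word is printed only once
while ( <$w> ) {
  for my $word ( split ) {
    next if $seen{ $word }++;
    if ( exists $dict{ $word } ) {
       print "$word: $dict{ $word }";
       next;
    }
    print "$word is not in the dictionary\n";
  }
}
close $w;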

Once the transcripts, the larger dictionary, and the script that refines the dictionary are all in one folder, run the script with this command:
% perl create.pl wordfile dictfile | tee -a train1.dic

This command runs the Perl script (saved here as create.pl) and passes it the two files it needs. The second part, | tee -a train1.dic, copies the output that would otherwise only be printed to the screen into a file as well. Because of the -a option, tee appends to train1.dic if that file already exists, so remove any old copy first if you want a fresh file. The resulting train1.dic is your new, trimmed-down dictionary.
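Because the script prints the dictionary entries and the "is not in the dictionary" messages to the same output, both end up in train1.dic. The following Python sketch shows one way the same lookup could keep the two apart. It is an illustration under assumptions, not the class script: the names load_master and trim_dictionary are made up, and words are matched exactly as written (case-sensitive), as in the Perl script.

```python
def load_master(lines):
    """Map each word to its pronunciation; lines without both parts are skipped."""
    master = {}
    for line in lines:
        parts = line.split(None, 1)  # word, then the rest of the line
        if len(parts) == 2:
            master[parts[0]] = parts[1].strip()
    return master


def trim_dictionary(transcript_lines, master):
    """Return (entries, missing) for the words used in the transcript.

    entries maps each transcript word found in master to its pronunciation;
    missing lists the remaining words once each, in first-seen order.
    """
    entries, missing, seen = {}, [], set()
    for line in transcript_lines:
        for word in line.split():
            if word in seen:
                continue
            seen.add(word)
            if word in master:
                entries[word] = master[word]
            else:
                missing.append(word)
    return entries, missing
```

Both functions accept any iterable of lines, so open file objects for dictfile and wordfile can be passed directly. The entries can then be written to the dictionary file and the missing words to a separate list for adding pronunciations by hand.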

A few things you might still need to know

  • After you have run the script to slim the dictionary down to the words used in the transcript, you will notice that the output also lists words that are not in the original, larger dictionary. This is useful, because you will need to add the phonetic spelling next to each of them. The downside is that the beginning of every transcript contains the marker <s> (the letter s in angle brackets), which is not in the dictionary and is therefore reported as missing. The good news is that it can be removed, together with the closing marker </s>, with a simple sed command in Unix. Run this command on the transcript BEFORE you run the dictionary script above.
% cat wordfile | sed "s|<s>||g" | sed "s|</s>||g" > wordfile2

This removes all the <s> and </s> markers from the transcript and writes the result to a new file. Note that the original file is called wordfile and the new file is called wordfile2. When the command has finished, delete wordfile and rename wordfile2 to wordfile. The transcript must be named wordfile for the dictionary creation command shown above to work as written.
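The same cleanup, including the delete and rename step, can be sketched in Python. This is only an illustration: the function names strip_sentence_markers and clean_transcript are made up, and the sketch assumes the transcript is a text file that can be read as UTF-8.

```python
import os
import re
import tempfile

# Matches the literal markers <s> and </s>, the same text the sed command removes.
_MARKERS = re.compile(r"</?s>")


def strip_sentence_markers(line):
    """Remove every <s> and </s> from one line, leaving the rest unchanged."""
    return _MARKERS.sub("", line)


def clean_transcript(path):
    """Rewrite the file at path with the markers removed.

    The cleaned text goes to a temporary file in the same directory, which
    then replaces the original, so no separate rename step is needed.
    """
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, encoding="utf-8") as src, tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", dir=directory, delete=False
    ) as dst:
        for line in src:
            dst.write(strip_sentence_markers(line))
    os.replace(dst.name, path)
```

Calling clean_transcript("wordfile") leaves a file that is still named wordfile, ready for the dictionary script.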