Speech:Models Data Prep


Project Notes

 * Unix Notes
 * Speech Corpus Setup - Switchboard,  NOAA
 * Speech Recognition Related Readings
 * Experiment Setup
 * Scripts Page
 * Model Building - more info on [data prep], language models, &  building models
 * Step 1: Run a Train
 * Step 2: Create the Language Model
 * Step 3: Run a Decode

Model Building: Data Preparations
This page describes the initial setup and data preparation needed to build statistical language models, generate a robust set of acoustic models (training), and verify them by testing (decoding) on the trained corpus. For detailed steps on how to train and decode, see the sub-steps under Model Building above.

General Overview
To run a train and decode successfully, all of the correct files need to be in place. Three main groups of files are needed: the audio files in .SPH format, a transcript of the audio files, and a working dictionary. The dictionary must list the current words with their phonetic spellings, as well as pronunciations for names.

Current copies of the .SPH audio files can be found on Caesar in the following directory: /media/data/Switchboard/disk1/swb1

Current copies of the transcripts can be found on Caesar in the following directory: ~/speechtools/SphinxTrain-1.0/train1/etc/trans_unedited

The last item required for a train and decode is a working dictionary that contains English words and their pronunciations. Large dictionaries can be hard to process, so it is recommended to create a dictionary that contains only the words used in the transcripts. The group of students who worked on speech in 2011 wrote a script that builds such a transcript-specific dictionary, but it doesn't work perfectly. Ted began working on a new script and was making progress, but ran into a problem writing the results to a file. This may become part of the wiki in the next few weeks. In the meantime, the 2011 group's dictionary can be used. It is located on Caesar in the following directory: /speechtools/SphinxTrain-1.0/train1/etc train1.dic

The requirements for creating a new dictionary are listed below.

Create a new dictionary
 * Find a master dictionary. A master dictionary pairs each English word with its phonetic pronunciation. One is available from CMU at https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/cmudict/cmudict.0.6d and a copy currently resides on Caesar under caesar:~/speechtools/SphinxTrain-1.0 cmudict.06d
 * Next, create a new dictionary that is based on the master dictionary and contains a distinct list of the words found in the transcripts. This is done with a script. The class currently uses a Perl script for this (create.pl in the command that follows).
 * Put the transcripts, the master dictionary, and the script in the same directory. Rename the dictionary file to dictfile and the transcript file to wordfile, then run the script.

Once the transcripts, the larger dictionary, and the script that refines the dictionary are all in one folder, type this command to run it:

    % perl create.pl wordfile dictfile |tee -a train1.dic

This command runs the Perl script and passes it the two files it needs. The second part, |tee -a train1.dic, takes the output that would otherwise only be printed to the screen and also writes it to a file, in this case train1.dic, which is your new trimmed-down dictionary. Note that the -a option makes tee append, so an existing train1.dic is added to rather than overwritten.
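The create.pl script itself is not reproduced on this page. As a rough illustration of the trimming step it performs, here is a minimal Python sketch. It is not the class's script. It assumes the master dictionary uses the cmudict layout (one entry per line, the word first, alternate pronunciations written as WORD(2), comment lines starting with ;;;) and that transcript words are separated by whitespace. The function names and the sample entries are hypothetical.

```python
import re

def load_master(lines):
    """Group master-dictionary lines by their base word (assumed cmudict layout)."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):  # skip blanks and comment lines
            continue
        word = line.split()[0]
        base = re.sub(r"\(\d+\)$", "", word)    # READ(2) -> READ
        entries.setdefault(base.upper(), []).append(line)
    return entries

def trim_dictionary(transcript_lines, master_lines):
    """Return (dictionary lines for transcript words, words missing from the master)."""
    master = load_master(master_lines)
    # Distinct, sorted list of words that occur in the transcripts.
    words = sorted({w.upper() for line in transcript_lines for w in line.split()})
    found, missing = [], []
    for w in words:
        if w in master:
            found.extend(master[w])
        else:
            missing.append(w)   # these need a pronunciation added by hand
    return found, missing

# Illustrative data only; real input would be read from dictfile and wordfile.
master = [
    ";;; sample entries for illustration",
    "HELLO  HH AH0 L OW1",
    "READ  R EH1 D",
    "READ(2)  R IY1 D",
    "WORLD  W ER1 L D",
]
transcript = ["hello world", "read hello caesar"]
found, missing = trim_dictionary(transcript, master)
print("\n".join(found))
print("missing:", missing)
```

Keeping the missing words in a separate list makes it easier to see which entries still need a phonetic spelling before the dictionary is usable.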

A few things you might still need to know
 * After you run the script to slim the dictionary down to only the words used in the transcript, you will notice that it also reports words that are not in the original, larger dictionary. This is useful, because you will need to add the phonetic spelling for each of them. The downside is that every transcript begins with the letter s in brackets, which is not in the dictionary. The good news is that you can remove it with a simple sed command in Unix. Be sure to run it on the transcript BEFORE you run the dictionary script above.
 * The sed command removes all the stray s's from the transcript and writes the result to a new file. The original file is called wordfile and the new file is wordfile2. When it finishes, delete wordfile and rename wordfile2 to wordfile. The transcript must be named wordfile for the dictionary creation script to work.
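The sed command itself is not recorded on this page. As a stand-in, the following Python sketch does the same kind of clean-up. It assumes the bracketed s is the `<s>` marker (and its closing form `</s>`); if the transcripts use a different bracket style, the MARKER pattern would need to change. The names MARKER and strip_markers are hypothetical.

```python
import re

# Assumption: the stray marker looks like <s> or </s>; adjust if the
# transcripts use another form.
MARKER = re.compile(r"</?s>", re.IGNORECASE)

def strip_markers(lines):
    """Remove the marker from each transcript line and tidy the spacing."""
    cleaned = []
    for line in lines:
        without = MARKER.sub(" ", line)
        cleaned.append(" ".join(without.split()))  # collapse leftover spaces
    return cleaned

# Illustrative transcript lines only.
sample = ["<s> hello world </s>", "<s> okay"]
cleaned = strip_markers(sample)
print(cleaned)
```

To mirror the workflow described above, the cleaned lines would be written to wordfile2, which is then renamed to wordfile before the dictionary script is run.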