Speech:Spring 2011 Proposal



Introduction
The following proposal outlines a plan to develop robust speech models from the LDC Switchboard corpus using the CMU Sphinx Speech Recognition Toolkit. The goal is to develop these models along with a set of tools and experiment paradigms that make it easy to reproduce every result obtained.

Initial setup
Before training can begin, we need to first configure our set of servers, select tools, and develop an information infrastructure to record and store results for future retrieval. The categories below will be critical to achieving this goal. A brief description of required tasks, as well as a list of resources, participants, their specific tasks, and timelines, will be discussed. The team will subdivide into groups based on skill sets so that each required task can be accomplished optimally.


 * Note that for setup we will not split up time estimates by individual groups or members; instead, all initial setup tasks are planned to be completed by March 1st

Hardware Configuration
The members of the Hardware Group are Brian Avery, James Bartoldus, and Matt Wakim. Brian will be the team leader. The group will work collectively to accomplish the goals put forth: deciding whether or not to update the systems, choosing what to use for backups, and eventually developing a queue monitor.

The hardware consists of one PowerEdge 2650 fileserver named Caesar and nine PowerEdge 1750 queue clients named after characters in the French Asterix and Obelix comic book series (using the German translation names). The group has outlined a multilevel backup system. An initial full system image of Caesar and Asterix will be taken with available software as soon as initial setup of the systems has completed, and preferably before training or other modification of the initial systems has occurred, giving an image of a prepared vanilla Sphinx training setup that can be reverted to at any time. Asterix will be the only queue client receiving a full backup, since its hardware and software configuration matches the rest of the clients. This will be followed by weekly backups of the data folder, when the system is available, as well as backups of trained models. Two previous versions of the data folder and models will be kept at all times. All of these backups will be placed on a 300GB USB external hard drive connected directly to the server currently being backed up, and will only be performed when Justin is available. Audio files from Switchboard are already backed up on CDs and therefore will not be included in the backup plan. This plan may change if available storage space becomes a concern. Software for the full backups still needs to be chosen; Clonezilla is a possible solution, and it has been found to be compatible with the ext4 file system used by our servers, making it a viable alternative to Ghost.

There are currently no foreseen problems with using Clonezilla, but only a full backup will reveal whether any actually exist. The data folder and model backups will both be performed using a short batch script of two or three lines. After careful consideration, several limitations have been placed on system updates, because updates could affect training results. At this time, no general updates will occur on any system, and security updates will only be performed on Caesar at Professor Jonas's discretion. None of the systems aside from Caesar will be on the network, so it will not be possible for them to receive updates unless a repository is created for them on Caesar.
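
The two-or-three-line backup script described above might look like the following sketch. The paths are illustrative stand-ins (the real data folder and USB mount point on our servers will differ), and the rotation scheme simply keeps the two previous archive versions, as planned:

```shell
#!/bin/sh
# Sketch of the short data-folder/model backup script: rotate so that
# two previous versions are kept, then archive the source folder to the
# backup drive. Paths are illustrative, not the real mount points.
backup_rotate() {
    src=$1; dest=$2
    [ -f "$dest/data.1.tar.gz" ] && mv "$dest/data.1.tar.gz" "$dest/data.2.tar.gz"
    [ -f "$dest/data.tar.gz" ] && mv "$dest/data.tar.gz" "$dest/data.1.tar.gz"
    tar -czf "$dest/data.tar.gz" -C "$(dirname "$src")" "$(basename "$src")"
}

# Demo run against throwaway directories.
mkdir -p /tmp/demo_data /tmp/demo_usb
echo "model weights" > /tmp/demo_data/model.bin
backup_rotate /tmp/demo_data /tmp/demo_usb
ls /tmp/demo_usb
```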

Software Tools
The members of the Software Group are KC Ibey, Brian Avery, and James Bartoldus. The team leader will be KC Ibey. KC and James have the task of determining whether to use Python or Perl, and Brian will communicate with Chris to ensure the best tools for version control are being used.

The purpose of this group is to decide on the software tools that will be used for the Capstone Project. The chosen programming language must be able to script the link between the audio files and the Sphinx speech recognition software, as well as the link between Sphinx's output text files and the database that will house the experiment directory. The languages considered were Python and Perl. Both are high-level, general-purpose, interpreted scripting languages; however, Perl is believed to be the better choice for Capstone because it favors text-based manipulation. Its strong text-handling abilities make it useful for parsing, and it offers a wide range of modules as well as the ability to invoke other executables to accomplish tasks. A version control system will be used to back up code and merge the efforts of the various developers on the project. This will be done via Subversion (SVN); many IDEs, such as NetBeans, Eclipse, and Notepad++, have plugins that aid in working with SVN. A central repository will be used and will be located on Caesar. A local repository, versus one hosted publicly such as on SourceForge, will provide controlled accessibility for specific group members, isolation from public servers, and thus greater security for the project.

Speech Tools
The members of the Speech Tools Group are Matt Wakim, Nicholas Sandberg, and Corey Mooney. Matt will be the team leader. All members will learn about Sphinx and the CMU Language Modeling Toolkit. This group is responsible for understanding the speech tools and installing them. There is prerequisite software that must be installed before Sphinx: the Java SE 6 Development Kit, Ant 1.6.0 or later, and Subversion (SVN). Once the group installs this software, Sphinx can be downloaded and compiled. Sphinx is a speech recognition system written entirely in Java. Sphinx 4, the most current version of the software, will be used. Complete download and installation instructions for Sphinx can be found on the Sphinx website at


 * http://cmusphinx.sourceforge.net/sphinx4/#download_and_install.

The most current trainer will be used; it is interchangeable between v3 and v4 of Sphinx. It is important to note that the Sphinx developers are currently working on a rewrite of the trainer specifically for Sphinx 4, which may be considered for later projects. The group will also evaluate, and later incorporate, the CMU Language Modeling Toolkit. This toolkit is software for Unix that facilitates research in language modeling. The latest version of the CMU toolkit is Version 2, and it is compatible with Sphinx v3 and v4.

Experiment Database
The members of the Experiment DB Group are Nicholas Sandberg, Matt Wakim, and KC Ibey. The team leader will be KC Ibey. The group will work cooperatively with Chris Reekie to develop the most efficient system for recording experiments and their results. The database will be set up on the main server and will be remotely accessible by all teams. The rows in the table will represent individual experiments. The primary attribute of each row will be the Experiment-ID, which will serve as the primary key and hold an auto-generated integer value. Other table attributes will be the name of the experiment, the date it was created, the author who created it, and a description field allowing for a complete write-up of the experiment setup and results. Additional attributes will be added as the experiment database matures. A simple interface will be developed for ease of adding data to the database.
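
The experiment table described above can be sketched as follows. SQLite is used here purely for illustration (the proposal does not fix a database engine), and the column names and sample row are assumptions:

```shell
#!/bin/sh
# Illustrative schema for the experiment table. The auto-generated
# integer Experiment-ID is the primary key; name, creation date, author,
# and description columns match the proposal. Engine and names are
# placeholders, shown with SQLite for the demo.
DB=/tmp/experiments.db
rm -f "$DB"
sqlite3 "$DB" <<'SQL'
CREATE TABLE experiments (
    experiment_id INTEGER PRIMARY KEY AUTOINCREMENT,
    name          TEXT NOT NULL,
    created       TEXT DEFAULT CURRENT_TIMESTAMP,
    author        TEXT,
    description   TEXT
);
INSERT INTO experiments (name, author, description)
    VALUES ('mini-swb-baseline', 'kibey', 'First Mini Switchboard run.');
SQL
sqlite3 "$DB" "SELECT experiment_id, name FROM experiments;"
```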

Directory Structure
The members of the Directory Structure Group are Mike Jonas and Scott Innes. Mike will be the team leader, and will provide an overview of the Speech Corpus data directory structure and a detailed experiment structure.


 * create Switchboard corpus directory structure, completed by Mike Jonas on March 29th
 * create initial Experiment directory structure for both training and testing, completed by Mike Jonas on March 29th

Documentation
The members of the Documentation Group are Scott Innes and Chris Reekie. Scott Innes will be the team leader; he will draft the proposal and will also be in charge of maintaining an ongoing project report to be finalized at the semester's close. Chris will keep the wiki updated, and will be communicating closely with team leaders about their respective groups’ current progress. This group will be responsible for recording and translating progress, research, successes, and failures. They will also be integral in facilitating communication between group members and the client, and between the groups themselves. The communication amongst the various teams will be carried out through a wiki page located at: http://foss.unh.edu/mediawiki/index.php/Speech:Home

Building models
Building models requires several steps. First our Switchboard data set will need to be re-organized into a suitable format. Several subsets will be created: one to demonstrate proof of concept (called the Mini set) and another to allow generation of a workable baseline set of models (called our Full set).

In addition, Sphinx will need to be configured not only to generate acoustic models in batch mode but to do so in parallel on a queue of 9 machines all accessing a single fileserver. We will need to build a set of tools in Perl that let us not only accomplish this but also reproduce results and generate new experiments quickly and easily.

Data Group: Preparing Switchboard
This group, which we’ll call the Data Group, is led by James Bartoldus and also consists of KC Ibey, Nick Sandberg, and Scott Innes. The Switchboard corpus is made up of hundreds of hours of telephone conversations between native English speakers. This data will be used to generate models during the training phase and then to verify the performance of those models on a test set during the evaluation phase.

In order to prepare the Switchboard data, this group must first familiarize itself with the format of the Switchboard transcriptions. This will involve some independent studying of the system. The next step will be to parse the words from the text transcription file to pull out unique words and run comparisons against a standard dictionary that will be pulled from the web. The subset of unique words from the transcriptions will then be analyzed. All words not deemed useful to the training process will be removed.
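
A first cut at the word-extraction step described above can be sketched in a few lines of shell (the real tools will be written in Perl). The sample transcription lines and the tiny "dictionary" are fabricated for the demo:

```shell
#!/bin/sh
# Pull the unique words out of a transcription file and list those not
# found in a reference dictionary (candidates for removal or review).
# Both input files here are fabricated samples.
mkdir -p /tmp/swb_demo
printf 'okay so I went to the store\nso okay yeah\n' > /tmp/swb_demo/trans.txt
printf 'i\nokay\nso\nstore\nthe\nto\nwent\n' > /tmp/swb_demo/dict.txt

# Unique lowercased words from the transcription.
tr -s ' ' '\n' < /tmp/swb_demo/trans.txt \
    | tr '[:upper:]' '[:lower:]' | sort -u > /tmp/swb_demo/words.txt

# Words in the transcription but not in the dictionary.
comm -23 /tmp/swb_demo/words.txt /tmp/swb_demo/dict.txt   # prints: yeah
```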

Create Train Set
The Data Group will need to correspond with the Modeling Group in order to determine what is needed to create the different training sets. The Data Group will independently research Sphinx formatting. The Modeling Group, being more versed in the subject, will then assist in familiarizing the Data Group with the specific steps and requirements associated with Sphinx formatting to help facilitate their creation of sufficient train sets.

Mini Train Set
A “mini train” set comprising 1 hour of audio data will be created; it should be a subset of the full train set described below. This small set, broken off from the larger one, allows for much faster training runs, since a full train set takes much longer to process.


 * a Mini Switchboard train set will be created by James Bartoldus on April 19th

Full Train Set
After the group has created a mini train set, a full set will be created. We will take 90% of our entire Switchboard corpus for training, leaving 5% for each of the remaining development and evaluation test sets.


 * a full train set will be created by James Bartoldus on April 26th
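
The 90/5/5 bookkeeping above can be sketched as follows. A real split would be made by conversation, not by line order, and the 100-entry utterance list is fabricated; this only shows the arithmetic:

```shell
#!/bin/sh
# Split an utterance list 90/5/5 into train/dev/eval lists.
# The 100-entry list is fabricated; real splits go by conversation.
mkdir -p /tmp/split_demo
seq 1 100 | sed 's/^/sw_utt_/' > /tmp/split_demo/all.list

total=$(wc -l < /tmp/split_demo/all.list)
ntrain=$((total * 90 / 100))
ndev=$((total * 5 / 100))

head -n "$ntrain" /tmp/split_demo/all.list > /tmp/split_demo/train.list
tail -n +"$((ntrain + 1))" /tmp/split_demo/all.list \
    | head -n "$ndev" > /tmp/split_demo/dev.list
tail -n +"$((ntrain + ndev + 1))" /tmp/split_demo/all.list > /tmp/split_demo/eval.list

wc -l /tmp/split_demo/train.list /tmp/split_demo/dev.list /tmp/split_demo/eval.list
```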

Create Test Set
A collection of test sets will be created to facilitate evaluation of the trained models. A test set is similar to a train set, except that its data was not used during training and can therefore be used to judge the accuracy of the models during decoding. We will generate two test sets: a development set that will be used to tune our decoder, and an evaluation set that will be held out until the end to verify that we tuned the decoder rather than overfitting to our data.

Dev Set
The group will create a development test set consisting of the 5% of the full Switchboard corpus not used in creating the training set. An additional mini test set will be created using a 30-minute subset of this 5% test set. These sets will be used for testing the models created during training:


 * a mini development test set will be created by James Bartoldus on April 19th
 * a full test set will be created by James Bartoldus on April 26th

Eval Set
Similarly, an evaluation test set consisting of the remaining 5% of the Switchboard corpus will be created:


 * a full evaluation test set will also be created by James Bartoldus on April 26th

Building Data Tools
The Data Group will also create data manipulation tools, written in Perl, to automate the text processing of the Switchboard data. The following tools will be created:


 * a tool that will parse transcriptions from Switchboard to Sphinx will be created by Nick Sandberg on April 26th
 * a tool that will call on an application to down sample audio files will be created by KC Ibey on April 19th
 * a tool that will generate new experiment directories according to the experiment directory structure will be created by Scott Innes on April 26th

A strong Perl programming skill set will be necessary to create these tools. Ultimately, a method of storing the scripts will be implemented that allows everyone on the Capstone team to access them.
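
Of the tools above, the experiment-directory generator is the most self-contained and can be sketched now. The subdirectory names below are placeholders; the real layout will come from the Directory Structure Group:

```shell
#!/bin/sh
# Sketch of the experiment-directory generator. Subdirectory names are
# placeholder assumptions; the real layout comes from the Directory
# Structure Group's experiment structure.
new_experiment() {
    root=$1; name=$2
    dir="$root/$name"
    mkdir -p "$dir/train" "$dir/test" "$dir/models" "$dir/logs"
    printf 'name: %s\ncreated: %s\n' "$name" "$(date)" > "$dir/INFO"
    echo "$dir"
}

# Demo: create one experiment directory under a throwaway root.
new_experiment /tmp/experiments exp001_mini_baseline
ls /tmp/experiments/exp001_mini_baseline
```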

Modeling Group: Setting up Speech Tools
This group, which we’ll call the Modeling Group, is led by Matt Wakim and also consists of Corey Mooney, Nick Sandberg, and Chris Reekie, with Brian Avery as a peripheral member focusing on the task of parallelization. At this stage, Sphinx should be installed and ready to run both the trainer and the decoder. Since Sphinx comes with a demo, it will be used for an initial run-through of the system. After that, we will have to create our own training corpus and generate our own models.

Run Online Modules – example
This will involve installing and setting up the demo modules that are available with Sphinx. The purpose is to gain a basic understanding of how these are set up and installed before the full version is available.


 * Online Modules will be completed by Matt Wakim on March 22nd.

Run Mini Switchboard
Here we will determine the steps required to train new models. This includes determining any and all input data that training needs, including dictionaries and language models. We will use the Mini Switchboard training corpus to build our acoustic models and once successfully achieved, test them using the Mini Switchboard development test set.

Create Dictionary
We will use the CMU Pronouncing Dictionary, found at www.speech.cs.cmu.edu/cgi-bin/cmudict, to create a small training dictionary. We will look up every word in our training set and record its phoneme pronunciation. The CMU Pronouncing Dictionary is based on North American English and contains over 125,000 words and their transcriptions. The current phoneme set contains 39 phonemes, and vowels may carry lexical stress markers: 0 for no stress, 1 for primary stress, and 2 for secondary stress.


 * generate a dictionary of phonetic spelling from our Mini Switchboard training and test sets, completed by Corey Mooney on April 19th
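
Extracting the training-set entries from a cmudict-format file (one `WORD PHONEMES` line per entry) can be sketched as below. The three sample entries are quoted from memory and should be checked against the released dictionary:

```shell
#!/bin/sh
# Keep only the dictionary lines whose head word appears in the training
# word list. The sample cmudict entries are from memory; verify them
# against the released CMU Pronouncing Dictionary before real use.
mkdir -p /tmp/dict_demo
cat > /tmp/dict_demo/cmudict.sample <<'EOF'
HELLO  HH AH0 L OW1
STORE  S T AO1 R
WENT  W EH1 N T
EOF
printf 'STORE\nWENT\n' > /tmp/dict_demo/trainwords.txt

# First pass reads the word list; second pass filters the dictionary.
awk 'NR==FNR { want[$1]=1; next } ($1 in want)' \
    /tmp/dict_demo/trainwords.txt /tmp/dict_demo/cmudict.sample \
    > /tmp/dict_demo/train.dict
cat /tmp/dict_demo/train.dict
```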

Train Models
Once the dictionary is set up, we will need to determine what other inputs are required for a successful training run. One of these is a language model, which we can generate with the CMU Language Model Toolkit. There may be other inputs that training requires; these will be learned by reading online documentation on how training works and by searching for examples of actual training runs. Tasks for training include but are not limited to:


 * generate a language model using the CMU LM toolkit, completed by Nick Sandberg on April 12th
 * run Mini Switchboard training, completed by Matt Wakim on April 12th
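
The language-model build with the CMU-Cambridge toolkit is a short pipeline. The tool names below (`text2wfreq`, `wfreq2vocab`, `text2idngram`, `idngram2lm`) are the toolkit's binaries, but the flags and filenames are illustrative; since the toolkit may not be installed yet, this sketch only writes the commands out rather than running them:

```shell
#!/bin/sh
# Emit the CMU-Cambridge SLM v2 pipeline that would build an ARPA-format
# language model from a training transcription. Tool names are the
# toolkit's; flags and filenames are illustrative assumptions, and the
# commands are written to a runfile, not executed here.
mkdir -p /tmp/lm_demo
CORPUS=swb_train.txt
cat > /tmp/lm_demo/build_lm.sh <<EOF
#!/bin/sh
text2wfreq < $CORPUS | wfreq2vocab > swb.vocab
text2idngram -vocab swb.vocab < $CORPUS > swb.idngram
idngram2lm -idngram swb.idngram -vocab swb.vocab -arpa swb.arpa
EOF
cat /tmp/lm_demo/build_lm.sh
```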

Decode on Dev
The decoding process will need to be streamlined into a turn-key operation. To become more familiar with how Sphinx decodes audio, we will test already-decoded audio samples against their transcriptions. This will ensure there are few or no surprises when new audio is introduced. In order to decode audio samples, they must be converted to raw audio format. Using one or more Perl scripts, we can convert the audio from any source type (WAV, MP3, or AAC) to raw audio, which will then be passed to Sphinx for decoding. The output from Sphinx can then be compared to an actual transcription of the audio, a step called scoring, for which we will use sclite. This way we can be sure our system is set up correctly before we move to a larger scale of training with hundreds of hours of audio.


 * run decoding of the built Mini models, completed by Chris Reekie on April 12th
 * score output using sclite, completed by Corey Mooney on April 19th
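
The WAV-to-raw conversion step can be sketched as a small batch generator. The sox options shown (headerless 8 kHz, 16-bit, mono) are believed correct for sox 14.x but are stated as an assumption, and the commands are only printed here, not executed:

```shell
#!/bin/sh
# For each .wav file, emit the sox command that would convert it to
# headerless 8 kHz, 16-bit, signed, mono raw audio for the decoder.
# sox flags are an assumption (believed correct for sox 14.x); the
# commands are printed to a runfile rather than executed.
mkdir -p /tmp/audio_demo
touch /tmp/audio_demo/sw02001.wav /tmp/audio_demo/sw02002.wav  # empty stand-ins

for f in /tmp/audio_demo/*.wav; do
    echo "sox $f -t raw -r 8000 -e signed -b 16 -c 1 ${f%.wav}.raw"
done > /tmp/audio_demo/convert.sh
cat /tmp/audio_demo/convert.sh
```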

Parallelization
Development of models for this project will be run in parallel in order to decrease the development time required. The work will be divided up among the servers so that each can proceed independently; as each server finishes its portion, we will merge the resulting models into one combined model. This section discusses our approach to discovering how to do this.

Searching Google for “merging sphinx models”, as well as other similar queries, has so far yielded nothing in the first couple of pages of results. Other queries will be attempted until information is found that helps in understanding how the Sphinx models can be merged. Searches on “parallelizing sphinx training” have likewise yielded no viable results. Further research is necessary to find a practical solution, and approaches beyond online search, such as posting on an active forum, will be applied.

Since the searches have yielded very little so far, an attempt to understand how training actually works will also be pursued. The Sphinx training system is executed by several unique programs, each performing a particular piece of the process involved in training the acoustic models. Research on “sphinxtrain components” also yielded little; however, searching for documentation on the Sphinx training system produced the following website:


 * http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html#00.

This website illustrates the Sphinx training process. A solution for parallelization may be discovered by comparing this process to the names of the programs in the sphinxtrain binary directory.

The Sphinx training system spends the majority of its time translating the audio files it receives, in the form of vectors of features, into the mathematical models that will be used for speech processing. As such, the goal of this parallelization system will be to split the hundreds of hours of audio into 9 equal batches, one per machine, have training work separately on each batch, and then combine the results on our central fileserver, Caesar. This would constitute a single iteration of model training.
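
The batching step itself is straightforward and can be sketched now, independent of the open model-merging question. Here a fabricated 90-utterance list is dealt round-robin into one batch file per queue client:

```shell
#!/bin/sh
# Round-robin an utterance list into 9 per-machine batch files
# (batch0.list .. batch8.list), one per queue client. The 90-entry
# list is fabricated for the demo.
mkdir -p /tmp/batch_demo
seq 1 90 | sed 's/^/utt_/' > /tmp/batch_demo/all.list

awk '{ print > ("/tmp/batch_demo/batch" (NR % 9) ".list") }' /tmp/batch_demo/all.list
wc -l /tmp/batch_demo/batch*.list
```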

After each successful training iteration and recombination of our model, the new model will be distributed across all of the servers so that they do not experience the lag associated with accessing the models across a network.

Splitting this work into chunks is difficult by its nature, but the following attempts will be made:


 * further online searches using both Google and library catalogues on published work, completed by Brian Avery on March 29th
 * posting on relevant online forums for Sphinx, complete investigation by Brian Avery on March 29th
 * understand how training actually works to determine a solution without outside aid, completed by Brian Avery on April 5th

Tools to Help Training
Although many of the tasks in section 3.2 are done by hand, it is crucial to the success of this project that these methods be captured and automated. The following tools will be created in Perl:
 * the Wall Street Journal model will be downloaded by Matthew Wakim on March 29th
 * a tool that will run a training job given an experiment directory containing a training corpus will be created by Matthew Wakim on March 29th
 * a tool that will run a decoding job given an experiment directory containing a test corpus will be created by Chris Reekie on March 29th

Experiment: Building Switchboard Baseline Models
This part of the plan will be updated with a more detailed timeline once we’ve achieved some reasonable results in part 3 and are able to determine some estimates of how long each part of Sphinx may take. For now we will give an overview of the tasks needed to generate our initial baseline Switchboard acoustic model.

Training Models
Having configured a full training set in section 3.1.2 and developed a plan to split our Switchboard data up into parallel chunks to run on our queue of 9 machines, we now run the full training set to build our models. Both the Data and Modeling Groups will participate in:


 * setting up initial baseline experiment for full Switchboard training set
 * starting parallel training run

It’s difficult to come up with an initial estimate for completion, so this step will simply run until it finishes.

Monitoring Training
As this will be the first time we run a full set, the entire team will take turns monitoring the queue around the clock, 7 days a week. During this time we will also look into developing monitoring tools that can alert the team automatically if something goes wrong during the training process.
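
A minimal version of such a monitoring check might look like the sketch below. The log path and the "ERROR" line convention are hypothetical, and a real monitor would also check process liveness and mail the team rather than just printing:

```shell
#!/bin/sh
# Minimal queue-monitor check: scan a training log for error lines.
# The log path and "ERROR" convention are hypothetical; a real monitor
# would also check process liveness and send mail to the team.
check_log() {
    log=$1
    if grep -q 'ERROR' "$log"; then
        echo "ALERT: errors found in $log"
        return 1
    fi
    echo "OK: $log"
}

# Demo against fabricated logs, one bad and one clean.
mkdir -p /tmp/mon_demo
printf 'iter 3 done\nERROR: baum-welch diverged\n' > /tmp/mon_demo/train.log
printf 'iter 4 done\n' > /tmp/mon_demo/ok.log
check_log /tmp/mon_demo/train.log
check_log /tmp/mon_demo/ok.log
```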

Testing
Once models have been successfully generated, we will gauge our first baseline results.

Tuning on Dev Test Set
Initially we check our results and determine what tunable parameters we need to adjust to improve recognition. This will require:


 * 1. run full development test set using built models
 * 2. analyze results, reconfigure decoder and repeat step 1.

Final Results on Eval Test Set
Finally, after a sufficient number of iterations with our development set, we take our optimal configuration and


 * 3. run full evaluation set using optimal decoder configuration
 * 4. analyze results; if they match the optimal development-set results we are done, if not go back to step 1

Results
Final results will be posted in the experiment entries for these sets of experiments as well as in our final report.

Appendix A: Gantt Chart of Timeline
TBD