Speech:Spring 2014 Proposal


The following proposal is for the Spring 2014 semester. The proposal is broken up into sections based on the groups working in the class. The end goal of this project is to develop speech models to be used in conjunction with various tools to improve our speech recognition technology. The work of the previous semesters has resulted in a number of refined processes involved in the complex task of speech recognition. The starting hardware infrastructure for this semester includes a group of 11 machines: Caesar, the nine clients, and Rome. The speech system runs openSUSE, a Linux distribution, as well as the Sphinx speech recognition tools. The previous semester succeeded in making progress in each of the specialized subgroups. With this in mind, the core goal for this group is to analyze the processes used by previous semesters and continue to refine the speech system, improving the baseline for speech recognition on the existing system.

Like the previous semesters, work on the speech system has been broken down into a series of core sub-tasks. These groups are subdivided as follows:

  • Systems group
    • Arwa Hamdi
    • Valerie Therrien
    • Sinisa Vidic
  • Experiment group
    • Joshua Anderson
    • Brian Gailis
    • Ramon Whitman
    • Pauline Wilk
  • Tools group
    • Justin Alix
    • Matthew Leclerc
    • Justin Silva
  • Data group
    • Mitchell Dezak
    • John Kelley
    • Jared Rohrdanz
  • Modeling group
    • Colby Chenard
    • Colby Johnson
    • David Meehan
    • Forrest Surprenant

The purpose of task segregation is to allow each student to specialize in a particular aspect of the speech system. With a short timeframe to make progress, this degree of specialization is crucial in ensuring that the system can move forward. Ultimately, it is intended for all students to become knowledgeable about the overall functionality of the entire system. Furthermore, it is a project objective for each student to gain important IT skills such as using Linux, advanced system configuration, maintaining hardware and software, communication, and project management. The remainder of this proposal details the tasks and objectives of the five groups.

Modeling Group

The modeling group's primary role is constructing and analyzing the effectiveness of the models used in speech recognition. Speech recognition requires two primary models to function. The Language Model (LM) is a data model that maintains statistical data about the frequency and appearance of words. This data is useful because it provides a predictive advantage when decoding by determining the likelihood of words appearing next in the sequence. The Acoustic Model (AM) is the core model mapping the text in the transcription file to the audio files that correspond to it. Generating the acoustic model is the most arduous process, as there are many parameters involved in determining the mathematical relationship between the transcription and the audio. For the scope of this project we are primarily concerned with tuning the latter model, and thus most of our work will revolve around configuring the parameters used when generating the acoustic model. Our ultimate goal is to improve the baseline to a level that can be reliably used to progress further in the process.

Our secondary concern, outside the scope of generating and analyzing data models, is the refinement of the processes for generating these models. There are a number of ways this has been done in the past, including a series of Perl scripts that automate repetitive processes and reduce the likelihood of human error in generating models. The current script architecture is robust, providing automation for virtually all the core steps: preparing a train, building a Language Model, decoding, and scoring. One problem that persists is the wide distribution of the scripts used, which requires the user to enter the same parameters multiple times. Secondly, the modeling group is concerned with the operating speed of these processes, which is the single most severe bottleneck in building models. Therefore, it also falls under the modeling group's domain to find software solutions to improve the speed and performance of training.

Implementation Plan

Artifact Review and Cleanup

In previous semesters many of the processes used by the speech system were automated and improved. The end result of their work was a more efficient system capable of training data and handling many menial tasks that were once done by hand. In working with the speech system, the modeling group has begun to identify areas of continued inefficiency. In particular, it has been noted that the preparation time for running trains on the speech system is a source of conflict with our goal of optimizing current processes and improving the baseline. It would be beneficial to eliminate repetitive user input to improve overall efficiency when training. A number of existing processes will be explored and analyzed over the duration of two weeks.

The first step in the process will be to continue reviewing the existing infrastructure. By identifying the numerous working parts of the Sphinx system that relate to experimentation, we will be able to aggregate all useful scripts and files while filtering out those that no longer work. One area of inefficiency is the number of files on Caesar that are unused or broken. Files that were once used but are no longer needed should be deleted or archived (and noted as such). The first subject to explore is the current status of the numerous scripts spread across several directories on Caesar. In particular, we will determine which scripts in the /mnt/main/scripts/train/scripts_pl and /mnt/main/scripts/user directories are currently used, which are unused, and which no longer work. If necessary we will also update these scripts, aggregating functionality as needed to match changes in the system. We will also determine whether the scripts in /mnt/main/scripts/user.old/expdir_scripts are being used. Current resources suggest these scripts play a role in the system, but this information appears to be out of date. Lastly, there are a number of other scripts not mentioned in the main experimentation wiki resources that may help us automate processes further. Automation and modularization are the keys to making forward progress. We should spend as little time as possible setting up experiments; currently, with what we know, it takes anywhere from 30 minutes to an hour to set up and begin running a train. This has become cumbersome, especially since we will be running more experiments throughout the semester than any previous class.

Another task will be proper file organization. For instance, it is still unclear whether there is a central location for missing word .txt files related to each of the eight primary corpus subsets used. These files are crucial in reducing the time to run a train as manually parsing each word for the correct phones is a long, error-prone process. Dictionaries can be created by merging the CMU dictionaries with the added words .txt files. Once the missing words have all been created and added to the dictionary we could clean up the file structure by leaving only one complete dictionary as well as the original CMU Dictionary.0.7a (uploaded 2/4/2014). This would reduce the confusion and clutter on the server. The other process to analyze is the state of the tiny and mini corpus subsets. Attempts to train using those have been met with problems caused by quotes appearing in the transcript files. An attempt was made to strip the file of all quotes, but the respective train failed further along in the process. We need to resolve the problem with this data and ensure that the tiny and mini train data sets are up and working for when we begin teaching other groups how to train.
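As an illustration of the merge step described above (the file names below are stand-ins, not the actual paths on Caesar), duplicate entries can be dropped with a plain sort so each word appears once in the master dictionary:

```shell
# Sketch: merge the base CMU dictionary with a per-corpus missing-words
# file into one master dictionary. File names are placeholders; the real
# files live on Caesar.
set -e

# Sample stand-ins for the real files.
printf 'HELLO  HH AH L OW\nWORLD  W ER L D\n' > cmudict.sample
printf 'OPENITWARE  OW P AH N IH T W EH R\nHELLO  HH AH L OW\n' > missing_5hr.sample

# Concatenate, sort, and drop duplicate entries.
cat cmudict.sample missing_5hr.sample | sort -u > master.dic

wc -l < master.dic   # 3 unique entries
```

The same one-liner would extend to all eight missing-words files by listing them after the base dictionary.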

Collaboration On Script Automation

Once we have located and organized the scripts and files related to experimentation, we will proceed to write a script capable of executing the scripts needed to train, and potentially decode, using Sphinx. It should be noted that this task will fall mostly under the domain of the Experiment group, although we will need to work with them to relay our findings and to help develop a script flexible enough for what we are working on. The universal train script should be informative, displaying messages to the user describing what processes are transpiring and what input is needed. The script should also adapt to some degree of change in the system (e.g., configuration controls implemented within the script, selection of a different corpus subset, missing dictionary entries, configuration changes). Simplicity can be achieved by running one or two scripts that require a handful of parameters specifying the details of the experiment. Easily settable parameters will make documentation and the experimental process much simpler for the modeling group as well as the other groups (whom we will have to teach later in the semester). Many of the processes are already automated, so developing a universal script should be relatively simple. Many of the inputs are the same, requiring a script to execute, an experiment number, and a file to act upon. By taking input once, we should be able to execute all the scripts using the needed information, and by automating the process, the chance of errors is lessened.
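A minimal sketch of what such a single entry point could look like. The helper script names and the path below (prepareExperiment.pl, genTrans.pl, runTrain.pl) are hypothetical stand-ins for the real Perl scripts, not their actual names; DRY_RUN=1 only prints the plan:

```shell
# Sketch of a universal train script: take the parameters once, then
# feed them to each step. Helper names and paths are placeholders.
set -e

run_experiment() {
    exp_num="$1"    # experiment number, e.g. 0042
    corpus="$2"     # corpus subset, e.g. 5hr
    dict="$3"       # dictionary file, e.g. master.dic

    for step in \
        "prepareExperiment.pl $exp_num" \
        "genTrans.pl $corpus" \
        "runTrain.pl $exp_num $corpus $dict"
    do
        if [ -n "$DRY_RUN" ]; then
            echo "would run: $step"
        else
            perl /mnt/main/scripts/user/$step    # hypothetical location
        fi
    done
}

DRY_RUN=1
run_experiment 0042 5hr master.dic
```

The point of the sketch is the shape: one invocation, three parameters, every downstream script reusing them.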

A few current processes make this difficult. The most challenging aspect of automating a train is adding missing dictionary words. Because there are only a finite number of corpus subsets, it should be possible to create the missing words files for each and link them when running the universal script (or build a master dictionary with the missing words for all of the corpus subsets compiled together). This works but is strongly coupled to the data sets themselves. Should that fail, or should we decide on a more dynamic approach, it may also be possible to access the primary online resource we use to convert words to phones via an HTTP request and query string (Perl can make HTTP requests through modules such as LWP::UserAgent). We may also discover better alternatives that make the process run more smoothly. Should we successfully script training, we will then proceed to decoding. This is highly dependent on the success of scripting the training and on what problems we encounter when decoding.

Sphinx Configuration

The best error rate achieved by previous semesters' work on Capstone (specifically spring 2013) is roughly 30% on train/decodes. Compared to the estimated figures on the Sphinx information web page, this number is far higher than expected. The Sphinx documentation estimates an error rate of about 10% for a train/decode run of 10 hours, increasing to upwards of 30% for much larger runs. Taking our current progress into consideration, we are within the margin of a large train but are also using a smaller data size. Improving the word error rate and making our system more accurate is one of our primary goals. To do this, we will need to configure various settings in an attempt to improve the WER. Our initial configuration will revolve around settings such as the senone count (which in the past was set as high as 3000, much higher than is needed for a 5 hour train). Making improvements to our language and acoustic models will be necessary to move forward.

The data collected thus far has focused our attention in a few specific directions for achieving higher performance, drawing on both what Eric learned last semester and what we have completed up until now. The most dramatic performance gain comes from using higher densities. The senone value, originally thought to be the key, proved less useful and is better suited for fine tuning based on the length of the train. Our goal is to attempt a brute force approach on these parameters, training with a large array of density values to determine the most effective configuration. With enough data we can gain a true sense of where the best baseline lies. Another obvious deficiency is the dramatic difference in results between using the genTrans5 and genTrans6 files to prepare the transcription files. This could stem from another inefficiency: the lack of consistent data in the transcription files. Few of them have proved useful, and those that have show differences in WER upon decoding and scoring. Making these deficiencies a priority will be a step toward completing a long-term train and decode.
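The brute force sweep could be driven by a small loop. The sketch below launches no real trains (the actual train scripts live on Caesar); it only records the planned runs, and the density values shown are illustrative, not a decided schedule:

```shell
# Sketch of a brute-force sweep over per-state Gaussian density values.
# Only writes the planned runs to sweep.plan; a real run would launch a
# train per value and record the scored WER afterwards for comparison.
set -e

: > sweep.plan
for density in 8 16 32 64 128; do
    echo "plan: 5hr train, density=$density" >> sweep.plan
done

cat sweep.plan
```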

Training on Test Data

With our current progress, we will also be considering the differences between training on our own data and on different test data. This will be important as we improve the WER for decodes, since it will give us insight into whether our models are too highly tuned to our own data. The first point of research is determining what experiments have already been done in this vein, and what their WER is compared to test-on-train runs using the same data. It appears most experiments, if not all, were run as test on train. This is understandable, as the baseline has been too low to begin analyzing other data. Experiments 0111 and 0024 were described as test on dev; these may provide some insight into the current processes. Analyzing the experiments directory, only one appears to be labeled as test on dev. We are currently working to build a strong test-on-train infrastructure using our brute force approach; this will give us a large amount of data to retest using different data to see how our WER compares. If the WER follows a similar trend to our test-on-train experiments, and if it is within a reasonable scope, we can determine that the models are viable.

To do the above, the first step will be to determine the expected WER on data differing from the data trained on. This number is likely to depend on the quality of the data used, but if we can at least get a reference number it will give us an idea of how our own data tests are progressing. The plan is to prepare test data from the corpus subsets that is not used in the train. We can use these test data sets to analyze the effectiveness of the acoustic and language models we have been building using test on train. Along with the expected WER statistics, we will also want to analyze the Sphinx decoder further to determine the correct inputs needed to use different data, writing scripts as needed to automate complex processes.
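As a sketch of how held-out test data could be carved out of a corpus transcript (the file names are stand-ins, and the 1-in-10 split is an arbitrary example, not an established project convention):

```shell
# Sketch: hold out every 10th utterance of a transcript as test data so
# decodes can be scored on speech the models never saw during training.
set -e

# Build a small fake transcript: one utterance per line.
seq 1 100 | sed 's/^/utterance /' > transcript.sample

# Every 10th line goes to the test set; the rest stay in the train set.
awk 'NR % 10 == 0' transcript.sample > test.trans
awk 'NR % 10 != 0' transcript.sample > train.trans

wc -l < train.trans   # 90
wc -l < test.trans    # 10
```

The corresponding audio files would have to be partitioned the same way so the held-out transcripts line up with held-out recordings.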

Documentation

To supplement the changes we will make to the process of running experiments, it is also imperative that we improve the current documentation infrastructure. Many important resources are currently available, but this information is dispersed among a series of wiki articles, some of which are not up to date and now contain erroneous information. To improve the current information infrastructure, we will analyze the wiki articles currently available on the subject of experimentation. Doing this work upfront should help prepare the other groups for the impending trains by aggregating the important information while correcting any misinformation. In particular, it would be useful to add more debugging information to the current guides, since there is currently little support for errors and problems outside a small scope. With this information, future users will be better able to solve experimentation problems in a reasonable time.

It would also be helpful to create a section under the information page detailing what all the scripts are, what they do, how they are used in the training process, and what the most recent version of each is. There were a number of scripts on Caesar, not documented in the train guide, that automated useful processes. Having an archive of the tools we are using would be very helpful to others joining the project. It would also be a good place to record which scripts work and which don't, as we currently have to test unknown scripts ourselves, see what happens, and try to deduce their function from the source code.

Parallelization (Torque)

Note that the lead-in to this section is a bit weak...a paragraph on the computational expense of training and parallelization would have been helpful, as well as a quick discussion of how Torque, a separate utility, is integrated with Sphinx

By implementing parallelization, we could enhance the experimentation process by drastically reducing the time it takes to train and decode models. Research done by previous semesters suggests that Torque could fulfill this job by splitting the computational work of experimentation across the nine other machines, theoretically reducing our total time to one-ninth. Sphinx can be configured to use Torque, and there have been accounts of people using Torque to split the SphinxTrain process among multiple machines. That said, there is very little information available, and the exact extent of the configuration needed to make this work is still widely unknown. Further research and testing of Torque for parallelization with Sphinx will be needed to determine the viability of this option. Should it be determined viable, we will need to research further how to configure Torque properly and determine what effect Torque will have on our current experimentation processes.
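Should Torque prove viable, the setup would look roughly like the outline below. The host names are ours, but the paths and command syntax vary by Torque version, so every line here is an assumption to verify against the Torque documentation rather than a recipe:

```shell
# Configuration outline for a two-node Torque setup (Caesar as
# pbs_server, methusalix as pbs_mom). Paths vary by Torque version.

# On Caesar: declare the compute node, then create and enable a queue.
echo "methusalix np=2" >> /var/spool/torque/server_priv/nodes
qmgr -c "create queue train queue_type=execution"
qmgr -c "set queue train enabled=true"
qmgr -c "set queue train started=true"
qmgr -c "set server scheduling=true"

# On methusalix: point the mom daemon at the server and start it.
echo '$pbsserver caesar' >> /var/spool/torque/mom_priv/config
pbs_mom

# Back on Caesar: verify the node reports in before submitting jobs.
pbsnodes -a
```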

Known files that would need to be modified: etc/sphinx_train.cfg
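For reference, SphinxTrain's configuration file exposes queue-related settings along the following lines. The variable names and values should be verified against our installed SphinxTrain version before relying on them, and "Queue::PBS" assumes a working Torque/PBS installation:

```perl
# Fragment of etc/sphinx_train.cfg (assumed variable names - verify
# against the installed version before use).
$CFG_QUEUE_TYPE = "Queue::PBS";   # default "Queue::POSIX" runs locally
$CFG_NPART = 9;                   # split training into 9 parallel parts
```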

Estimated Timeline

Colby C.

  • Week Ending Feb 18th
    • Collaborate with Colby J in trying to optimize the baseline WER.
  • Week Ending Feb 25th
    • Continue conducting experiments to try and achieve the lowest error rate possible through creating as much data as possible.
    • Move past 10hr trains and conduct larger train experiments, to deliver more concrete data for continued analysis.
  • Week Ending Mar 4th
    • In addition to data analysis of trains, start researching and developing a method for refinement of the acoustic model.
  • Week ending Mar 18th - spring break Mar 10th - 14th
    • Hopefully by this time we will have a solid baseline to work with that will allow us to focus more on the acoustic model.
    • Re-evaluate based on progress and feedback in order to develop the best plan of attack for tuning the acoustic model for desired results.

Colby J.

  • Week Ending Feb 18th
    • Generate a Data curve that can lead to topics of interest
      • Run many trains to see relational data between different parameters
    • Work on attempting longer trains
      • Use parameters with previously successful results.
  • Week Ending Feb 25th
    • Run many more Trains
      • Show more data curves
      • Tune parameters
      • Use updated scripts and Dictionary files
    • Run additional longer trains
      • If 10hr is successful
    • Generate Data Curves
      • Provide elegant view of evidence found throughout training process.
  • Week Ending Mar 4th
    • Help others with improving existing scripts, Dictionary files, Transcription data.
      • Test efficiency of any updated files.
    • Attempt larger trains (308hr)
      • Assuming things have been going well
  • Week ending Mar 18th - spring break Mar 10th - 14th
    • Analyze data collected thus far
      • Make graphs
      • Outline results clearly
      • Show our stance on how we compare to what we know this software is capable of
    • Investigate future topics of improvement for the next semester's groups.


  • Week Ending Feb 18th
    • Research the steps needed to train on test data.
    • Determine which (if any) previous experiments were not test on train.
    • Begin preparing a decode using the last_5hr test data.
  • Week Ending Feb 25th
    • Use the 6 experiments Colby ran using different densities to test using alternate data.
    • Find the expected WER for trains using different test data.
  • Week Ending Mar 4th
    • If the previous experiments showed potential, work with Colby to continue improving baseline. If the new experiments have a WER below the expected, work with Colby to determine the optimal configuration for testing on different data.
  • Week ending Mar 18th - spring break Mar 10th - 14th
    • Run a decode on a larger data set (10hr, maybe higher depending on our success in training these larger models) using test data.


  • Week Ending Feb 18th
    • Run a successful 5hr train to better understand the training process.
  • Week Ending Feb 25th
    • Run a train using Torque: configure Caesar as the pbs_server and methusalix as a pbs_mom.
    • Clock the amount of time it takes to complete trains at various mixture counts (densities):
    • 16
    • 32
    • 64
    • 128
  • Week Ending Mar 4th
    • If Torque works with Caesar and methusalix and the time to complete a train decreases by at least 40%, continue expanding the drone network across the local network until all 10 servers share the load of a train and decode.
    • Otherwise, if the train time does not decrease by more than 40%, or if Torque cannot work with all the drones, start researching another solution for load distribution.
  • Week ending Mar 18th - spring break Mar 10th - 14th
    • Create detailed documentation for torque. Create tutorial for configuring torque.

Systems Group


Moving on to the Systems Group, our responsibilities this semester are to maintain and update the current systems infrastructure and the system wiki guides implemented by previous systems groups. Our main goal is to provide the other groups with quality performance from the systems infrastructure throughout the semester. (This doesn't belong in the proposal other than as a timeline task of doing it) We have put a lot of time and effort into reading the previous systems logs in order to build the knowledge of the current infrastructure that we can apply toward accomplishing this semester's goals. We plan one major update to the systems infrastructure: switching to a new operating system. Two minor goals are to find a solution to the key generator issue on the Rome machine (Fedora 19), which prevents passwordless login over SSH, and to perform a backup of the experiments directory.

Implementation Plan

This year the Systems Group has a few minor tasks and one major task to accomplish by the end of the semester. The minor tasks are straightforward: update the wiki guides concerning the systems infrastructure, find a solution to the Fedora SSH key generator issue, and back up the experiments directory. The major task of switching the operating system from openSUSE to Fedora is truly ambitious due to the heavy downtime that would occur, preventing other groups from accomplishing their own tasks (Note that I disagree here, as the primary task is to run an experiment on Rome and compare it to one run on openSUSE and share your result...the actual update would require a lot of downtime but wouldn't be something we would plan to do). Thus the Systems Group plans to thoroughly research and test Fedora using the Rome machine in order to make sure that Sphinx is fully stable on Fedora (Well, this is a very vague statement of what I said in the note before...you should be much clearer on this task). If the testing is a success, a proposal for a switch will be created and passed on to our Manager, Prof. Jonas, for final confirmation of the operating system switch.

Detailed plan

  • Update wiki guides
    • Information on current Hardware/Software/Networking guides is out of date and will be updated according to the new information
    • If time permits, a new guide will be created on how to fully set up Fedora to our needs, for future reference if the system ever needs to be rebuilt from the ground up
  • KeyGen solution to Fedora OS
    • Rome is currently running Fedora 19, but an issue with the SSH keys or a firewall configuration prevents users from logging in from Caesar without a password. This issue will need to be resolved before the switch to Fedora on other machines. Using VirtualBox, the Systems Group plans to create a virtual simulation of the Caesar and Rome machines in order to find a solution to this bug.
  • Switching to Fedora OS
    • Before the Systems Group does anything, research into Fedora v19 and v20 will be done. Next, Sphinx will be tested by running experiments on the Rome machine, which runs Fedora v19. The same experiment will be run on Automatix, which runs openSUSE v11.3; this will give us a good idea of the performance difference between the two operating systems running the Sphinx speech recognition software. Based on this information, a proposal will be created for or against a switch to Fedora. (Ok, here you are a bit more explicit but still somewhat vague. Which experiment? This is the proposal and you should state which one you would plan to run!)
    • If Manager Prof. Jonas gives a green light for the OS switch, the Systems Group will switch to Fedora OS at the end of the semester so that our current research is not jeopardized by technical issues that may come up during the process.
  • Backups
    • The Systems Group will install and format a backup hard drive (300 GB) on Rome for the Experiment Group to use for experiment directory backup. (Ok, but what about the existing backup system and making sure it is working for non-data backups?)

Estimated Timeline

Arwa Hamdi

  • Week ending Feb 18th
    • Update Hardware/Software/Networking guides.
  • Week ending Feb 25th
    • As a group, set up a backup hard drive on Rome for the Experiment Group to use for backing up the experiment directory.
    • Read how to create an experiment and how to set up and run a train. Possibly run one experiment on Automatix
  • Week ending Mar 4th
    • As a group run multiple experiments on Automatix and Rome and compare the performance of Sphinx on the two machines
  • Week ending Mar 18th - spring break Mar 10th - 14th
    • As a group write a proposal for or against Fedora OS switch

Sinisa Vidic

  • Week ending Feb 18th
    • Look into which file system is best for a Unix/Linux backup hard drive
    • Begin reading how to create an experiment and run a train
  • Week ending Feb 25th
    • As a group, set up a backup hard drive on Rome for the Experiment Group to use for backing up the experiment directory.
    • Run experiment on Automatix and Rome (Note that this is very vague and may actually need more time set aside since running experiments is the most time consuming!)
    • Fill in the rest of the team on my knowledge of running trains and creating experiments
  • Week ending Mar 4th
    • As a group run multiple experiments on Automatix and Rome and compare the performance of Sphinx on the two machines (Note that this is very vague and lacks details)
  • Week ending Mar 18th - spring break Mar 10th - 14th
    • As a group write a proposal for or against Fedora OS switch

Valerie Therrien

  • Week ending Feb 18th
    • Find a solution to KeyGen issue on Rome
  • Week ending Feb 25th
    • As a group, set up a backup hard drive on Rome for the Experiment Group to use for backing up the experiment directory.
    • Find a solution to KeyGen (if not found in the previous week)
    • Read how to create an experiment and how to set up and run a train. Possibly run one experiment on Automatix
  • Week ending Mar 4th
    • As a group run multiple experiments on Automatix and Rome and compare the performance of Sphinx on the two machines
  • Week ending Mar 18th - spring break Mar 10th - 14th
    • As a group write a proposal for or against Fedora OS switch

Tools Group


Next is the Tools Group, whose main responsibility this semester is to research possible updates to the software suite that comprises the Speech Recognition program. There are five main tools to be examined this semester, as per Prof. Jonas's instructions: the Trainer, the Decoder, the Language Model Toolkit, the Scorer, and a Java Wrapper program (which is not yet in use). Many of the tools currently in use remain outdated, as they worked accurately enough for normal use during previous semesters. Updating the tools has been considered in previous semesters, but it was determined that the downtime and the need to re-configure the system were not worth the minimal performance upgrade. This semester, with the costs and benefits of each possible update weighed, the decision to upgrade any of the tools will be carefully considered. Currently, the main hindrance to upgrading all tools is the continued use of openSUSE 11.3. Many of the tools require a more modern OS, or at the very least an update to openSUSE 13.x. The use of Fedora is also to be considered, as many of the updated tool versions support that operating system as well. As such, the Systems and Tools groups must collaborate on a possible system-wide update to bring the entire system up to date.

Implementation Plan

If all of the Tools Group's plans can be accomplished this semester, it will be a big step toward modernizing the entire Speech Software suite. A full list of tools, current versions, the most up-to-date versions available, and notes can be found by following this link to the System Information Page. The main tools of concern this semester are listed below.

Trainer

The first tool used in the speech recognition process is the trainer. This is the actual software procedure, run through a terminal interface, that performs "trains" on the provided data (or corpus). This procedure analyzes the supplied recordings to "learn" their characteristics; that is, to recognize patterns and acoustic occurrences in the audio files, determined by several user-defined variables. A user can run a train in many different ways and at several different lengths and sizes. Users who wish to run a train should refer to the main webpage, under the "information" link, where there are detailed instructions under the "Project Notes" section. Additionally, an index of previously run trains can be found under the "experiments" link. All trains MUST be documented regardless of outcome, even if they offer no new insight into the system or produce no worthwhile results. This application is used heavily by both the Modeling and Experiments groups. Please see the "Model Building" section of the documentation for more information (it can be found here). The currently installed trainer version is SphinxTrain 1.0, which has been in use for several semesters. The newest version is SphinxTrain 1.0.8, which is openfst-based and best used with Sphinx4 (although it MAY be used with Sphinx 3.7 - more research needs to be done). A parallel extraction feature is available, which proves intriguing, and there are several bug fixes as well. Further investigation is warranted.

Language Model Toolkit

Once a user has completed a successful train and documented it both in the Wiki and on the server, a Language Model must be built. A Language Model uses the results of the performed train to analyze the frequency of words within a corpus and predict their future occurrence. Each word is assigned an integer value as it is recorded, and the LM tracks its occurrences. Building a language model helps the software to "understand" how any word is used or said, and is necessary to run the decode procedure. Full documentation can be found here. Currently, the CMU-Cambridge Statistical Language Modeling Toolkit v2 is in use. Version 0.7 is the latest version and provides no real benefits other than minor bug fixes. It appears that the toolkit is no longer in active development (re: being updated), nor has a replacement been created.
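For reference, building an n-gram model with the CMU-Cambridge toolkit follows a four-step pipeline along these lines. The file names are placeholders, and the exact flags should be checked against the toolkit documentation installed on Caesar before use:

```shell
# Outline of the CMU-Cambridge SLM toolkit pipeline (v2 tool names).
text2wfreq < corpus.txt > corpus.wfreq                  # word frequencies
wfreq2vocab < corpus.wfreq > corpus.vocab               # vocabulary list
text2idngram -vocab corpus.vocab < corpus.txt > corpus.idngram
idngram2lm -vocab corpus.vocab -idngram corpus.idngram -arpa corpus.arpa
```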


Decoder

Once a Language Model has been established, the decode process can be run. The decoder takes the information generated through the Language Modeling and Acoustic Modeling steps (which were based on the trains performed earlier) and applies it to a set of audio files whose speech we wish to recognize. These decodes can range from as little as 1 hour to over 100 hours. Several user parameters can be set (all of which can be found in the documentation) and tweaked in an attempt to improve the decode's success. After the decode has run and the results are output (what the program actually interpreted the audio as), the results can be compared to the written transcript of the input audio file. That comparison is handled in the next step. The current decoder in use is Sphinx 3.7 (note: we need to confirm this is the version we are running), which has been used for quite some time and remains stable. The latest version is Sphinx 4.0, a complete overhaul of the engine in Java. It is faster, slightly more accurate, and more flexible. However, it can only be used with a suitable Java wrapper (covered below), which poses problems for our openSUSE system. It relies on the Java SE 6 Development Kit (or higher) and Ant 1.6, and the need for these dependencies is an unfavorable trade-off. Documentation of Sphinx 4.0 can be found here: Sphinx4 Documentation


Scorer

Once the decode has run, the user feeds the results into a scoring program. This program can display several variables the user defines, such as word total, error rate, word insertions, word deletions, etc. These results are the ultimate test of the accuracy of our speech recognition system. Several experiments have results posted under the "Experiments" link; to date the most accurate has a 25% word error rate. The current scorer in use is SCLite 2.3. The newest version is SCLite 2.9, which is installed with SCTK 2.4.8. The advantages are unclear, however, as the documentation has not kept up with revisions.
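The scorer's headline number can be reproduced by hand. As a sketch (the counts below are hypothetical, not taken from a real experiment), word error rate is the sum of substitutions, deletions, and insertions divided by the number of reference words:

```shell
# How a scorer's word error rate is computed (counts here are
# hypothetical, not from a real experiment):
#   WER = (substitutions + deletions + insertions) / reference words
SUB=15; DEL=5; INS=5; REF_WORDS=100
WER=$(( (SUB + DEL + INS) * 100 / REF_WORDS ))
echo "WER: ${WER}%"    # prints: WER: 25%
```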

Java Wrapper

While the user may find it satisfactory to perform the decode using Sphinx3, the latest version, Sphinx4, promises more accurate results from our speech recognition system. Unfortunately, using Sphinx4 requires several dependencies and upgrades to our system, as it is written in Java (as opposed to C). One way to manage these dependencies is a Java wrapper program. While this tool is not currently installed or in use on the system (and never has been in any previous semester), it provides the ability to wrap Java programs in a "shell" that allows them to run on a Linux-based system as a daemon process. This would allow the speech recognition system to be upgraded to Sphinx4 and subsequently SphinxTrain 1.0.8. The most popular version of this software is Java Service Wrapper by Tanuki Software: JSW Homepage. However, support for openSUSE is unknown, as it is only known to work with SUSE Enterprise Server. While this would be a large change, at least minor investigation is warranted, since a major upgrade could enhance the overall speech recognition system. All factors will have to be considered, including system downtime, upgrade procedures, dependencies, etc. An alternative, Yet Another Java Service Wrapper (found at YAJSW), provides the same functionality and has been tested on openSUSE 11.1. Although it is not as developed and supported as Java Service Wrapper, further investigation is warranted.

Estimated Timeline

Justin A.

  • Week ending Feb 18th
    • Re-write the proposal to more accurately reflect the project and conform to the model.
  • Week ending Feb 25th
    • Examine the Java Service Wrapper and get it running on a VM to determine if it is possible to use with Sphinx4.1 decoder.
  • Week ending Mar 4th
    • Continue with Sphinx 4.1 evaluation. Attempt to install and configure all tools on a Fedora VM and test performance (confer with Systems group).
  • Week ending Mar 18th - spring break Mar 10th - 14th
    • Test Sphinx4 installation on Traubadix, and implement on Caesar if successful.

Justin S.

  • Week ending Feb 18th
    • Determine if there is an update to the CMU Language Model Toolkit and if it should be used.
  • Week ending Feb 25th
    • Evaluate the SCLite v.2.9 scorer and its performance. Determine if it warrants installation.
    • Evaluate the SphinxTrain 1.0.8 trainer and its performance. Determine if it warrants installation.
  • Week ending Mar 4th
    • Test upgraded tools on a cloned copy of Caesar (on Traubadix) or a VM, if we proceed with the upgrades.
  • Week ending Mar 18th - spring break Mar 10th - 14th
    • Install new versions of tools on Caesar and test.



Data Group

The Data group is primarily concerned with maintaining, updating and future-proofing the following elements:

  • Transcripts
  • Word Alignment
  • Audio

We have been tasked with several important aspects of our research and development with speech technology. First and foremost, the transcripts were left a cluttered mess. We have to find and organize all the transcripts and record the actual total hours we have. Once we find them, we need to make sure they are organized correctly and actually usable. We also need to make sure the audio files are in the correct format and able to be used. The Data group is fully committed to gaining a complete understanding of our current and future positions regarding the data gathered. We have already begun learning the structure of our several corpora. With these and other tasks, such as checking the genTrans6.pl script, verifying dictionary completeness, and learning about .sph files, the Data group has a lot to keep track of.
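A rough sketch of such an inventory check (the file name and the one-utterance-per-line "&lt;s&gt; ... &lt;/s&gt;" transcript format are assumptions for illustration): utterances can be counted by counting end-of-sentence markers.

```shell
# Sketch of a transcript inventory (the path and the one-utterance-per-line
# "<s> ... </s>" format are assumptions for illustration).
printf '<s> HELLO WORLD </s>\n<s> GOODBYE </s>\n' > sample.trans

# Count utterances in a transcript file by counting end-of-sentence markers:
grep -c '</s>' sample.trans        # prints 2

# Across a whole corpus tree this could be, e.g.:
#   find /path/to/corpus -name '*.trans' | xargs grep -c '</s>'
```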

Implementation Plan

This semester, the Data Group fully intends to accomplish all of its tasks. Staying on top of our assignments is the key to our success. We need to fully understand the various scripts that have been previously written to maintain and count transcripts for future use. Once we understand our transcripts and audio files and get all of the kinks out of them, we can continue to make progress on our research. We also want to take this semester to learn Perl (Note that you don't really have the luxury to spend the semester learning Perl and need to find a way to figure this out as quickly as possible), and how we can utilize it to better perform our tasks.

Below is a detailed list of what we would like to accomplish this semester:

  • Collect and organize all of our transcripts so they are usable
    • Last year, several Perl scripts were written to count and organize the transcripts. The success of these scripts is not yet known. The scripts can be found here (http://foss.unh.edu/projects/index.php/Speech:Spring_2013_Matthew_Henninger_Log)
      • It seems as though these various scripts have collectively combined all transcripts into one transcript file. We still need to verify the accuracy of these statements.
  • Make sure eval and dev files are separate from train corpus
    • Professor Jonas spoke about the possibility of eval and dev files not being separate in all corpora
  • Add a lot more useful information than previous semesters regarding the data group
    • The data group has only been around once before, and unfortunately its members didn't keep very detailed logs, if logs were present at all.
    • We want to organize all of the perl scripts used, and the locations of our transcripts and audio into one central location so future Data groups aren't as lost as we were
      • A new page on the media wiki site dedicated to this would be very helpful
  • Read and understand the genTrans6.pl script
    • Taken from the media wiki page, "This is the Perl script that was created to do nearly everything you want. It cleans the transcripts and creates the wav files. It locates the .sph files from the specified directory and it converts each one to a .wav file. It then goes through the transcript and cleans it up. This means that it takes out the header and it leaves the < s > for the start. It also changes all characters to uppercase and deletes any [, ], {, }, and -, that it finds. This is done through the use of the "sed" command. It does this all the way through the script and it leaves the < /s > to show that it is the end of the line."
  • Verify completeness of our dictionary
    • We also need to locate the dictionary
  • Understand the experiment and train processes, and successfully run trains and test on trains
    • Media wiki pages are very helpful for this
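The cleanup pass that genTrans6.pl is described as doing can be sketched with standard tools. This is illustrative only: the real script also converts audio, and restoring the sentence markers after uppercasing is our own workaround here, not necessarily how the script handles it.

```shell
# Sketch of the transcript cleanup genTrans6.pl is described as doing:
# uppercase the text, delete [, ], {, } and -, and keep the <s>/</s>
# markers. Restoring the markers after uppercasing is our own workaround
# here, not necessarily how the real script handles it.
echo '<s> hello-world [noise] </s>' \
  | tr 'a-z' 'A-Z' \
  | sed -e 's|<S>|<s>|g; s|</S>|</s>|g' \
        -e 's/[][{}-]//g'
# prints: <s> HELLOWORLD NOISE </s>
```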

Estimated Timeline

(Note that all three timelines are very vague and light on detail compared to the other groups!)


  • Week ending Feb 18th
    • Understand trains and experiments
  • Week ending Feb 25th
    • Use Perl scripts to better understand how they work
      • Specifically genTrans6.pl
  • Week ending Mar 4th
    • Collaborate with group members to create a useful page on Mediawiki for other future Data Group members
  • Week ending Mar 18th - spring break Mar 10th - 14th
    • Finalize any last tasks


  • Week ending Feb 18th
    • Verify that the corpus is up to date.
  • Week ending Feb 25th
    • Create a way to determine the length of training transcripts (tiny, 5hr, etc).
  • Week ending Mar 4th
    • Find a way to convert .sph files to wav and determine total length.
  • Week ending Mar 18th - spring break Mar 10th - 14th
    • Verify transcripts match audio for each train.
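The .sph conversion and length-totaling tasks above can be sketched as follows. This assumes the sph2pipe and sox/soxi utilities are available; the file names are hypothetical, so the conversion lines are shown as comments.

```shell
# Converting NIST SPHERE audio and totaling its length (sketch; assumes
# the sph2pipe and sox/soxi utilities are installed -- file names are
# hypothetical, so the conversion lines are shown as comments):
#   sph2pipe -f wav sw02001.sph sw02001.wav   # SPHERE -> WAV
#   soxi -D sw02001.wav                       # duration in seconds

# Summing per-file durations (in seconds) into hours with awk:
printf '3600.0\n1800.0\n900.0\n' \
  | awk '{ total += $1 } END { printf "%.2f hours\n", total/3600 }'
# prints: 1.75 hours
```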


  • Week ending Feb 18th
    • Check the genTrans for functionality
  • Week ending Feb 25th
    • Improve any genTrans files that need work
  • Week ending Mar 4th
    • Convert working transcript files to audio
  • Week ending Mar 18th - spring break Mar 10th - 14th
    • Make the audio into one single file



Experiment Group

The Experiment group focuses much of its energy on providing a better experience for other members while producing a variety of different types of experiments. We spend time perfecting our knowledge of the detailed directory structure each new experiment implements, making sure we know what each part is in case any questions arise. Over the past semesters, the number of experiments has grown tremendously, proving just how important a smooth and efficient experiment-running process is. With the Spring 2014 semester already running a significant number of experiments, our group has great motivation to continue learning this process and to apply it by writing knowledgeable guides, detailed information about all parts of the experiment directory, and updates to bring some scripts up to date.

Implementation Plan

The Spring 2014 Experiment group's plan for this semester is ambitious. As our introduction states, we want to focus on the theory behind the Sphinx process, because once we grasp that, we will have a much better understanding of the experiment directory structure we are using. Once we have that understanding, we can apply it by creating a more instructive and intuitive guide on the Experiment Information wiki page. Depending on time, we may also be able to modify existing scripts to make the entire experiment-making process a bit smoother and more efficient for those who have to run a number of them.

Understanding the Process

As the group responsible for the Experiment directory on Caesar, we should become experts in what is happening inside our directory structure. Having many different folders inside a single experiment can be confusing for anyone who looks at it, which is why one of our primary focuses will be to learn the process of running a train, creating language and acoustic models, and finally decoding the data. Our detailed plan for this task is below:

  • Better understand the theory and application behind the structure when creating a new Exp. directory.

Artifacts Documentation

The Spring 2013 semester was the first to carry an Experiment group. Based on the current state of the main Experiment wiki page (http://foss.unh.edu/projects/index.php/Speech:Exp), it is in need of a major facelift. The information there is out of date and mostly unused. Another primary goal for this semester will be to completely re-format this page with up-to-date artifacts, including all the research we do regarding the process explained above, step-by-step guides to all the folders within an experiment directory explaining exactly what each file that gets created does, and, lastly, a sub-page for any scripts that are currently being used or will be used for experiment creation, detailing exactly what each one does.

We all agree that this project is purely run on documentation, and with Experimentation being such a large part of moving this project forward, we want to be sure when we leave the experiment page will have all the information needed for future semesters.

Below is a detailed outline of our plans:

(Note that perhaps giving some guidelines when new experiments are added as to what information should be in there would be useful, but I do not want old experiments that do not belong to individual group members modified. If they are in bad shape, so be it; it's an artifact that remains that way. More important is obviously getting SpEAK in shape...yes, I realize it's only 1/2-person time, but that's still a considerable investment.)

  • Totally re-format the Main Experiment wiki page: http://foss.unh.edu/projects/index.php/Speech:Exp
    • The current state is confusing and very out of date. Our goal should be to create a more informative and intuitive page to detail the Experiment process.
    • Once we have an understanding, we should organize this page with detailed information about each of the 8 folders that get created:
      • Why they are added
      • What they get used for
    • Detail any scripts that are currently being used and/or scripts that we create during the semester that have direct relationship to the experiment directory and the process as a whole.
    • Note all the different types of experiments that can be run based on previous entries. For each one, give a little summary of each and link to a past experiment that someone did.

Collaboration on Scripts

As the Modeling group mentioned in their section, we are going to work directly with them to modify and create new Perl scripts that consolidate the scripts already used when running a train. They noted that their time shouldn't be spent entering duplicate pieces of data multiple times just because the current scripts require it. For the Experiment group, the process of running a train to build an acoustic model includes creating a new experiment, which is why our attention is drawn here and we want to help expedite this process.

Below is an estimated plan of how we will approach this. This is subject to change as we need to make sure we're coding to what the Modeling group needs:

  • Re-visit the script that actually creates a new (empty) Experiment directory: create_expdir.pl
    • I think we can use this script, expand on it, and possibly use it as our base master script when we get to the point where we can start combining some of the existing pieces used when doing a new Run a Train experiment (as that process requires creating a brand-new experiment directory), including Eric's train_01.pl and train_02.pl (which already need to be combined) and all the other scripts Colby talked about in the email he sent Wednesday shortly after class.
  • Re-visit the move_to_expdir.pl script - it is not clear what the point of this script is or whether it is being used anywhere.

Estimated Timeline


  • Week ending Feb 18th
    • Finish and publish final proposal.
    • Gain understanding and theory behind the experiment directory structure.
  • Week ending Feb 25th
    • Revisit create_expdir.pl script
    • Understand what it's currently doing, and add any modifications to make it more widely used in the process of creating a new experiment - i.e. Run a Train process.
  • Week ending Mar 4th
    • Should have working knowledge of the create_expdir.pl script. Can write up a simple guide that can be added to the Experiment Wiki page.
    • Can now start on incorporating this into the Running a Train process by implementing the train_01 and train_02 scripts (if they choose to do so when running the create_expdir.pl script).
  • Week ending Mar 18th - spring break Mar 10th - 14th
    • Complete the create_expdir.pl script so that it is functional in creating a new experiment directory and can simply start the Run a Train process
    • Create a sub-page on the Experiment Wiki page that describes this script in detail.
    • Work with team to modify the Run a Train tutorial page to include the steps for this new script.


  • Week ending Feb 18th
    • Gain understanding and theory behind the experiment directory structure.
    • Create a description of all scripts in one place. See if we can create a wiki page for just describing scripts so that every time one is created, it can be documented and described in detail by its creator, because no one can provide a better description than the creator!
  • Week ending Feb 25th
    • Read through all of Eric Beikman's log entries if possible, to continue to help understand all the scripts he wrote.
  • Week ending Mar 4th
    • Run and document the step by step process of running an experiment from start to finish. Note where it would be helpful to have a script for automation.
  • Week ending Mar 18th - spring break Mar 10th - 14th
    • Have scripts that are used to automate the experimentation process fully described and enunciated. Make improvements if necessary.


  • Week ending Feb 18th
    • Gain understanding and theory behind the experiment directory structure.
  • Week ending Feb 25th
    • Log details regarding experiment, for example, the who, what, when, where scenarios
  • Week ending Mar 4th
    • Generate an experiment from start to finish
  • Week ending Mar 18th - spring break Mar 10th - 14th
    • Detail all aspects of an experiment, for example, what it is, what it does, and how to do it


  • Week ending Feb 18th
    • Gain understanding and theory behind the experiment directory structure.
  • Week ending Feb 25th
    • Run and document the step by step process of running an experiment from start to finish. Detail all aspects of an experiment
  • Week ending Mar 4th
    • Read through all of Eric Beikman's log entries if possible
  • Week ending Mar 18th - spring break Mar 10th - 14th
    • Look at the clone_exp.pl script and see what it does and how it can be improved on.


The objectives and plans outlined by each group above serve a collective purpose. Speech recognition is a complex process requiring many components, including Caesar, nine drone machines, Rome, Sphinx3 (the trainer and decoder), SCLite, the Switchboard data corpus and its subsets, Perl scripts for process automation, and above all else the documentation that has allowed us to continue where the prior group left off. Herein lies the problem: successful and effective speech recognition relies on each of these components functioning at its best, and inefficiencies in any component can result in overall performance lapses. For this reason, we have delegated tasks as outlined above. Breaking the complex system down gives each group the possibility of gaining more focused expertise in a subset of the system, which will become relevant when we regroup mid-semester and work together to educate one another and move forward in our collective task. It is our hope this semester to refine the system such that we can achieve a baseline by which the progress of this research can move forward. To do this, each group will need to meet its goals and improve the efficiency of the existing processes.

After the March 18th meeting, the structure of the class will change a bit. As you can see in each group's estimated timeline above, we go up until March 18th. At that point we will be split into different groups of 4-5 people, each consisting of one "expert" from each of the original groups, giving these new groups over 5 weeks of knowledge in all the areas this class is focusing on. For the following 2 weeks, each group will work through the entire process of creating acoustic models and language models and then running the decode. It is at this point that everyone's individual timelines will diverge drastically, as all of us learn the entire Sphinx process and become able to successfully retrieve results. After these 2 weeks, everyone should be familiar with the process, and we will again be split into new groups - this time 2 groups of 9 people. For the following 2 weeks, both groups will continue running decodes, but this time on larger model sets and dictionaries. This is when we will be purely focused on reaching our goal of establishing a baseline for our speech recognition project.

After the 2 weeks of both groups running Decodes, we will keep one of those 2 groups going in regards to building a baseline. The other group will begin writing the final report describing our progress we made this semester.

We all feel ready to move this project forward in many ways. Our primary goal is not only to build a strong baseline for the decode, but to make sure that the classes that follow us are positioned to build an even stronger baseline.