Speech:Spring 2015 Proposal
This document is the formal proposal for the Spring 2015 semester of the Computer Information Systems major's Capstone class. The goal of the Capstone project is to develop a world-class baseline using the Switchboard corpus. This proposal focuses primarily on the work taking place between February 4th and March 18th. The documentation created by previous semesters is in poor shape, and improving it is one of our primary objectives. However, learning how to use the system and build models is the ultimate intent of this project.
In keeping with previous semesters, the speech system has been divided into five principal areas of responsibility, each assigned to one group: Modeling, Systems, Tools, Data, and Experiment.
The primary reason for dividing tasks is to allow each student to become an "expert" in a specific area of the speech system. Because the project lasts only one semester, that degree of specialization is essential in order to advance the project. The majority of this proposal focuses on the five-group task, which has two intentions: to improve infrastructure and fix shortcomings in the documentation, and to learn how the Sphinx speech system works in terms of building and evaluating models.
The speech boot camp will be an intensive two-week program in which each member of Capstone will create and run an experiment. Professor Jonas will create and organize four new groups, each with one member from each original group of the five-group task; thus, each group will be composed of five experts, each with a unique specialty. The purpose of the speech boot camp is for students to understand how to build and evaluate models themselves with some sense of comfort.
In the final phase of the project, the 22 students will be randomly divided into two teams of 11. The two teams will compete to create the best speech model baseline, each building both an acoustic model and a language model. The losing team will be obligated to write the final report of the project, while the winning team updates the experiment documentation.
As noted above, the main part of this document discusses the tasks of the five main groups. Each is described in detail below.
The Modeling group's role is to create effective models for speech recognition. These models will be analyzed to measure their effectiveness and modified until a highly functional baseline is found. Speech recognition requires two types of models: the Language Model and the Acoustic Model. The Language Model captures statistical data about word frequency and word order. It helps predict which words may appear, in what order, and how frequently when decoding audio files. The Acoustic Model maps a transcript to the audio file from which it was derived. The Acoustic Model is the more difficult of the two to determine because of the many parameters that govern the relationship between the transcripts and the audio files. The main concern for our team is to finely tune the Acoustic Model to create a reliable baseline. We will find this baseline by varying the Acoustic Model's parameters and analyzing their effects on the results. Once that relationship is established, the parameters can be tweaked to lower the baseline error rate, which is needed to make further progress on the accuracy of the speech recognition software. As a second objective, our team will be responsible for cleaning up the preexisting Perl scripts. This includes making them more efficient and adding comments to the code so that future participants can learn what the scripts do.
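As a rough illustration of the kind of statistics a language model captures, the sketch below counts words and adjacent word pairs in transcript lines and estimates how likely one word is to follow another. The function names are illustrative assumptions; the project's actual language models are built with the CMU-Cambridge toolkit, not with this code.

```python
from collections import Counter

def count_ngrams(transcripts):
    """Count single words and adjacent word pairs in a list of transcript lines."""
    unigrams, bigrams = Counter(), Counter()
    for line in transcripts:
        words = line.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_probability(bigrams, first, second):
    """Relative-frequency estimate of P(second | first).

    Returns 0.0 when `first` was never followed by any word.
    """
    total = sum(count for (w1, _), count in bigrams.items() if w1 == first)
    if total == 0:
        return 0.0
    return bigrams[(first, second)] / total
```

Real toolkits add smoothing and back-off so that unseen word pairs do not get a probability of zero; this sketch deliberately leaves that out.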
Artifact Review and Cleanup
In earlier semesters, many scripts were written to automate parts of the speech system. These included scripts for generating transcripts, creating dictionaries, and running trains, among others. Last semester's Modeling group consolidated many of these automation scripts into one master script, which sets up directories for running trains and tunes Sphinx to its desired parameters. As a result of all these newly created automation scripts, many scripts on Caesar are no longer being used. Our team will review all the scripts that live on Caesar and decide which are still in use and which are not.
After reviewing the scripts on Caesar, our team will be responsible for updating MediaWiki to match the server structure. MediaWiki currently shows only about half of the scripts that are in use on Caesar. Scripts that are no longer in use will be marked as outdated on MediaWiki so that future groups know they are no longer useful.
Along with identifying the scripts, our team will review the code within them. When we find inefficiencies, we will tune the scripts to perform better. Many of the scripts are also not commented. It will be our job to add comments that identify what each part of a given script is doing. This will allow future groups to edit the scripts with greater ease and will eliminate some of the guesswork involved in that process.
In past semesters, much work has been done to create better models and achieve lower baselines. The most important breakthrough was finding parameters in Sphinx's configuration file that could be tuned to help create better models. The parameters discovered so far are senones and densities. These parameters have been finely tuned over past semesters and analyzed in great detail. The lowest baseline achieved with these parameters was a 38% error rate on test-on-train data for a 125-hour training corpus. This semester, the Modeling team will search vigorously for other parameters that could be adjusted to help lower the error rate of our models. Our goal is that, by discovering more parameters and analyzing their effects on the overall model, we will be able to improve the baseline to a usable level.
The current documentation on the wiki site is outdated and full of organizational and formatting flaws. It will be our team’s job to improve these pages so that future groups can follow the instructions with ease. Because many groups from multiple years have worked on these wiki pages, there are many different layout styles and design elements. For consistency, our team will revise these pages to give them a uniform style.
The scripts page needs a major overhaul as well. It currently gives brief explanations of what each script does. These explanations need to be expanded, as many of them do not give enough detail about the scripts' functions. Our group will also add comments to the scripts themselves so future groups will know what each script is doing line by line.
- Acclimate ourselves to the context and architecture of the project
- Connect to Caesar - 2/3 (Garrett, Sam, Zach, Zeb)
- Successfully run a train - 2/3 (Garrett, Sam)
- Successfully run a decode - 2/11 (Garrett, Sam)
- Compile a list of relevant scripts and any improvements that need to be made to these scripts - 2/11 (Sam, Zach)
- Devise a concrete set of improvements to be made on the documentation - 2/11 (Garrett, Zeb)
- Repair the current process and documentation of running a train
- Work with the experiment group (Nicholas) to determine the best way to organize the tutorials - 2/18 (Sam)
- Condense the existing tutorials into one fluid and working tutorial - 2/18, 2/25 (Garrett, Zach)
- Adjust scripts as necessary to fix recurring errors - 2/18, 2/25 (Sam, Zeb)
- Work with the data group as necessary to resolve missing directory references - 2/25 (Sam)
- Get our new member up to speed (Trevor)
- Ensure that he knows how to run a train and where to look for resources and tutorials - 2/25 (Garrett, Zach)
- Reduce the error rate below the current baseline of 38% by running experiments with varied senone and density values, which previous semesters found to be relevant
- Change the senone value (default is 1000) and run trains to determine a trend toward a most efficient value - 3/3 (Trevor, Zeb)
- Change the density (default is 8) and run trains to determine a trend toward a most efficient value - 3/3 (Sam, Zach)
- Attempt to find a most efficient combination of the two. If significant individual trends are found, start from there to see how they affect each other. - 3/10 (Garrett, Sam, Zach)
- Research other variables that may affect the error rate - 3/3, 3/10 (Trevor, Zeb)
- Determine why previous semester's error rates were proportional to the amount of data examined
- Higher amounts of data should yield lower error rates. Check for any errors in their process. - 3/3, 3/10 (Garrett)
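The senone and density experiments listed above amount to a small parameter sweep. The sketch below shows one way to enumerate the combinations and pick the best one once error rates have been measured. The function names and dictionary keys are illustrative assumptions. In a SphinxTrain configuration file these two settings usually appear as `$CFG_N_TIED_STATES` and `$CFG_FINAL_NUM_DENSITIES`, but that should be confirmed against the project's own configuration before relying on it.

```python
from itertools import product

def parameter_grid(senone_values, density_values):
    """Return one setting per (senones, densities) combination to train."""
    return [
        {"senones": senones, "densities": densities}
        for senones, densities in product(senone_values, density_values)
    ]

def best_setting(results):
    """Pick the setting with the lowest error rate.

    `results` is a list of (setting, error_rate) pairs, where each
    error rate comes from scoring the decode of one experiment.
    """
    setting, _error_rate = min(results, key=lambda pair: pair[1])
    return setting
```

Sweeping each parameter alone first, as the task list proposes, keeps the number of trains small; the full grid is only worth running around values where the individual trends look promising.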
Our responsibilities as the Systems Group are to maintain the systems and update the software used for the speech project. We must ensure that the servers and software run without error so that experiments can run without interruption and provide good results. Our group is responsible for any server downtime, so our peers will be looking to us to keep the servers running. We will also need to help other groups if they have problems connecting to the servers or need user accounts created on specific servers. The current infrastructure consists of the Caesar server (PowerEdge 2650); its 9 clients (PowerEdge 1750): Asterix, Obelix, Miraculix, Traubadix, Majestix, Idefix, Automatix, Methusalix, and Verleihnix; and the Lutetia (PowerEdge 2950) and Brutus (PowerEdge 2900) servers. Our group will replace Caesar with Brutus this semester, and Brutus will become the new Caesar server. Caesar and Asterix are currently running OpenSUSE, and the other servers are running RedHat. Our goal is to ensure a smooth transition from the Caesar server running OpenSUSE to the new server hardware, Brutus, running RedHat.
The Systems Group's primary task is to complete Caesar's migration from legacy hardware to the new Dell PowerEdge 2900 (Brutus), and from OpenSUSE to RedHat 6.6 x86. To do this, we will first perform a test installation of RedHat on a client machine, following the RedHat installation documentation, to make sure the documentation is complete and accurate. If we run into issues during the installation or find that the documentation needs more detail, we will revise it ourselves. Once the client installation is complete and we believe the documentation is accurate, we will make sure the new server (Brutus) is properly configured. The NICs must be properly configured, NFS must be set up, and user accounts must be created before we copy files from Caesar to Brutus. We will then run trains to establish a performance baseline for the old machines and compare it with the performance of the new server. If the data shows that the new server performs better, we can finalize the transfer of the Speech project to the new server. Our other tasks include updating the wiki with the most current information about the hardware, software, and other systems-related topics. There have been many changes in past semesters, and the systems information page has not been updated to reflect them. We may also want to discuss the reasons for switching from OpenSUSE to RedHat.
- RedHat Test Installation
- RedHat was installed last semester, but the documentation was not completed until the start of the Capstone course
- It is our job to do a test installation of RedHat on one of the servers using the documentation - 2/11 (Chris, Kyle, Melissa, Mohammad)
- Clearer documentation must be written for any parts of the installation that were not sufficiently documented prior to the start of the semester - 2/18 (Melissa)
- RedHat Test Configuration
- Documentation from the previous semester must be used to make network configurations - 2/11 (Chris, Kyle, Melissa, Mohammad)
- A network bridge must be configured to allow drones to access the Internet - 2/23 (Mohammad)
- The process of creating a network bridge must be thoroughly documented - 2/24 (Melissa, Mohammad)
- Test trains between Caesar, Brutus, and drones need to be done to justify making Brutus the new Caesar - 2/25 (Mohammad)
- Caesar Server Replacement
- Technical specifications and other relevant information about the new server specs and configurations need to be updated on the Wiki (Chris, Kyle)
- Brutus configuration needs to be verified before we are ready to start copying files. - 2/25 - 2/27
- All data needs to be copied from Caesar to Brutus. (copy /mnt/main) - 3/2 - 5:30 - 6:30 PM (Chris, Kyle, Melissa, Mohammad)
- Verify that everything is copied, nothing missing, all access rights and links are the same, verify disk usage. Need to log what directories are being verified.
- Run experiments, Brutus on Brutus, Verleihnix on Brutus.
- All drones must be configured to work the same way with Brutus as they had with Caesar.
- Brutus will be put online as the new Caesar and Caesar will be shut down. - 3/6 - 5:00 - 7:00 PM (Chris, Kyle, Melissa, Mohammad)
- Recopy 0260 - 0267 experiments and Spring 15 home directory.
- Brutus's IP becomes 126.96.36.199 and hostname changes to Caesar.
- Run 5th experiment, new Caesar to new Caesar.
More detailed breakdown: Migrate to Brutus
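The verification step in the list above (everything copied, nothing missing, access rights and links the same) could be partly automated. The following is a minimal sketch, assuming both directory trees are reachable from one machine, for example over NFS. The function names are hypothetical, and file ownership and file contents are not compared here.

```python
import os
import stat

def snapshot(root):
    """Map each relative path under `root` to (type, permission bits, detail).

    `detail` is the link target for soft links, the size in bytes for
    regular files, and None for directories.
    """
    entries = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            info = os.lstat(full)  # lstat describes a link itself, not its target
            rel = os.path.relpath(full, root)
            mode = stat.S_IMODE(info.st_mode)
            if stat.S_ISLNK(info.st_mode):
                entries[rel] = ("link", mode, os.readlink(full))
            elif stat.S_ISDIR(info.st_mode):
                entries[rel] = ("dir", mode, None)
            else:
                entries[rel] = ("file", mode, info.st_size)
    return entries

def compare_trees(source_root, copy_root):
    """Return (missing, extra, changed) relative paths, each list sorted."""
    src, dst = snapshot(source_root), snapshot(copy_root)
    missing = sorted(set(src) - set(dst))
    extra = sorted(set(dst) - set(src))
    changed = sorted(p for p in set(src) & set(dst) if src[p] != dst[p])
    return missing, extra, changed
```

Writing the three returned lists to a log file would also cover the requirement to record which directories were verified.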
The Tools Group is responsible for investigating the software tools used in speech recognition to determine the best update path and whether updating would be beneficial. Most of the software tools currently in use have newer versions available that would support a modernization of the entire system. The updated versions of our main software tools to be examined this semester are SphinxTrain 1.0.8, CMU-Cambridge Statistical Language Modeling Toolkit 0.7, Sphinx 4, and SCLite 2.9. In addition, the Systems group will be moving the machines to a different operating system this semester: RedHat. To implement Sphinx 4, though, we will need to look into the use of a Java Service Wrapper, since the OS and the program are written in different languages (C vs. Java). Our goal is to make sure that the software tools are not only compatible with this OS but are also more efficient and useful, and therefore worth the upgrade. We will also be installing a piece of software that is new to the system: Emacs, an extensible, customizable text editor. To compare the older and newer software tools, we have dropped the idea of installing a virtual machine on one of our personal computers. Instead, we will use the machine Obelix to conduct testing through training.
As previously mentioned, the speech software tools currently installed have updates worth investigating, which is the major task of the Tools group this semester. Upgrading of speech software tools, as well as the upgrading of the OS from the Systems group, would allow for an overall update of the speech recognition system as a whole. There are several newer versions that have been released, and our goal is to prove that each updated version is an improvement over the older versions. The speech software involved is as follows:
The trainer is the software that builds models. “Training” is the process of using samples of speech to fine-tune the Acoustic Model. The trainer learns the parameters of the models of the sound units from the sample speech signals. The version currently installed is SphinxTrain 1.0, which has been used for several semesters. The potential upgrade to the newest version, SphinxTrain 1.0.8, is supported by the following new features:
- Bug fixes (memory leaks and build issues)
- Package can be installed now just like any other application
- Single ‘sphinxtrain’ command to access all training process
- Increased reuse of sphinxbase functions
- Supported by Sphinx-4
Language Model Toolkit
A Language Model helps us to figure out how likely a word sequence is, by taking the training results and analyzing word occurrences in a particular corpus. This toolkit is a suite of software tools to facilitate the construction and testing of statistical language models and is meant for large amounts of training data. The newest available version of the CMU-Cambridge Statistical Language Modeling toolkit is version 0.7. This version, however, doesn’t seem to offer any real benefits that would make it worth installing, aside from minor bug fixes, so the current version in place (version 2) might be sufficient.
Decoding is the process of converting audio to text. The decoder currently in use is Sphinx 3.7, and the newest version is Sphinx 4. Sphinx 4 is a rewrite of the software in the Java programming language (Sphinx 3 was written in C), and it provides high flexibility, great accuracy, and great speed for small tasks. Sphinx 4 would provide a decent update to the system, since Sphinx 3 has been in use for several semesters. According to CMU, Sphinx 3 has been considered the most accurate decoder for large-vocabulary tasks and is intended more for researchers. With Sphinx 4, however, both the word error rate and the real-time factor (the ratio of processing time to audio time) have decreased. The implementation of Sphinx 4 will likely require a Java Service Wrapper.
Java Service Wrapper
We expect that upgrading to Sphinx 4 will also require a Java Service Wrapper (JSW). A JSW is a configurable tool that allows Java applications to be installed and run as a daemon on a Linux system. The latest stable release is version 3.5.25, provided by Tanuki Software.
SCLite is a tool for scoring and evaluating the output of speech recognition. The program compares the hypothesized text output by the speech recognizer to the correct text. The user can access and view the reported results (such as word error rate) that are produced by the scorer. According to last semester’s Tools group, the newest version is SCLite 2.9 and it is included in the Speech Recognition Scoring Toolkit (SCTK) version 2.4.8. Unfortunately, the differences between the versions were not documented and cannot be found because the readme is outdated and the developers do not seem to have kept up with documentation.
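To make the word error rate concrete, the sketch below computes it as the word-level edit distance (substitutions, deletions, and insertions) between the correct text and the recognizer's hypothesis, divided by the number of words in the correct text. This is a simplified illustration with a hypothetical function name, not SCLite's implementation; SCLite also handles alignment reports, text normalization, and per-speaker summaries.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two texts, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference text must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Because insertions count as errors, the rate can exceed 100% when the hypothesis is much longer than the reference.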
Emacs is an extensible, customizable text editor and more. Some of its features include:
- Content-sensitive editing modes, including syntax coloring, for a variety of file types including source code, HTML, and plain text
- Complete built-in documentation
- Full Unicode support for nearly all human languages and their scripts
- Highly customizable, using Emacs Lisp code or a graphical interface
- A large number of extensions: project planner, mail and news reader, debugger interface, calendar, and more.
The current stable release of GNU Emacs is 24.4.
- Speech Software Updates and JSW Implementation - 2/25 (combined group effort)
- Implement New Software - GNU Emacs - 2/25 (Refik)
- Training on Obelix - 2/18, 2/25
- Test current software tools for basic performance evaluation (Nate, Kayla)
- Once software updates have been pushed onto Obelix, continue on with training to compare results (Nate, Kayla)
- Determine whether updated software versions produced results that are more efficient or better in any way
- Construct a Proposal/Argument - 3/4, 3/10 (combined group effort)
- Create an argument, with supporting evidence, for why we should or should not upgrade Caesar with new software versions
- Update Wiki - 3/4, 3/10 (Kenneth)
- Edit the Software Information page to contain information about any new software in use
- Document our findings for the next Tools Group
There are two corpora of speech data: Switchboard and NOAA. Our main objective this semester is to clean up the audio files and provide much-needed documentation of where the files are located. We will make sure the data is efficiently structured and organized so that it is all centrally located. Some soft links have been created, but we found that most of the existing soft links have broken paths. We will fix these links and also create soft links for the remaining .wav files. The soft links will allow for a cleaner and more robust organization system that will benefit every team assigned to the project. We will clean up the structure of the Switchboard directory to match the current structure of the NOAA directory. Once the directories are laid out similarly, it will be easier to create the soft links and prevent future issues with broken links. We will also remove all duplicate copies of the .wav files.
We also want to settle the question of whether the audio files total 308 or 247 hours, and to figure out why there are two different numbers and where the confusion has come from.
The Data group will also become proficient in running experiments and trains. All members must be capable of these basic tasks, which form the basis of all research conducted in the speech project. This process will begin as soon as we are given access to the Caesar server.
This semester the Data group will be diligent with the documentation of the audio file cleanup and data structure implementation. We will fully and comprehensively document the organization and new structure of data storage, so that other teams can quickly take advantage of the structural changes. Any additional functionality that we develop in the form of scripts or other types of tools will be documented fully as well. Existing documentation will be reviewed and clarified if necessary.
We will create soft links for the .wav files that do not already have them. We also want a repeatable mechanism for creating the soft links, which may include a script to generate them.
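A script of the kind mentioned above might look like the following minimal sketch. It assumes the audio files sit directly in one directory and the links belong in another; both function names and the flat directory layout are illustrative assumptions rather than the project's actual structure.

```python
import os

def find_broken_links(root):
    """Return the sorted paths of soft links under `root` whose targets do not exist."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink checks the link itself; exists follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)

def link_wav_files(audio_dir, link_dir):
    """Create a soft link in `link_dir` for every .wav file in `audio_dir`.

    Broken links with the same name are replaced; working links and
    regular files are left alone. Returns the names of the links created.
    """
    os.makedirs(link_dir, exist_ok=True)
    created = []
    for name in sorted(os.listdir(audio_dir)):
        if not name.lower().endswith(".wav"):
            continue
        target = os.path.abspath(os.path.join(audio_dir, name))
        link = os.path.join(link_dir, name)
        if os.path.islink(link) and not os.path.exists(link):
            os.remove(link)  # drop a link whose target has gone missing
        if not os.path.lexists(link):
            os.symlink(target, link)
            created.append(name)
    return created
```

Running `find_broken_links` before and after `link_wav_files` gives a simple record of which links were repaired, which could go straight into the documentation.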
Experiments and trains will be run by each member to gain a better understanding of the system as a whole, and how we can best assist the other teams in achieving their objectives. We will maintain open communication with the other teams as the project evolves, so that we can provide the best data support possible for all those who are currently working on the speech project.
- Experiments creation for the Data Group (Task Leader: Stephen)
- The Data group will have an experiment directory where we will conduct our own experiments. - 2/16 (Stephen)
- This is created to familiarize all of the group members with running trains. - 3/4 (Stephen, Dakota, Russ, Krista)
- Update/Fix the soft links of the Switchboard Corpus (Task Leader: Dakota)
- There are a lot of broken links within the Switchboard corpus.
- New soft links will be created in order to repair the existing Switchboard corpus. – 3/4 (Dakota)
- This is needed in order to run trains right now, since all of the links in the Switchboard corpus are broken.
- Managing the .wav files, and deleting and documenting .wav files prior to spring 2014 (Task Leader: Krista)
- There are a lot of unnecessary .wav files left over from experiments from the spring 2013 semester and back.
- These files will be pruned while the experiments themselves will be left alone. – 3/4 (Krista)
- Switchboard Analysis (Task Leader: Russ)
- The Switchboard corpus is said to be 256 hours long, but there is evidence that it contains more.
- The Switchboard corpus will be analyzed for the total length of the corpus. – 3/4 (Russ)
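One way to measure the total length of the corpus is to sum the duration of every audio file. The sketch below does this with Python's standard `wave` module and hypothetical function names. It assumes the files are uncompressed PCM WAV files; if the Switchboard audio uses another encoding, `wave` will reject it and a different reader would be needed.

```python
import os
import wave

def wav_duration_seconds(path):
    """Duration of one PCM WAV file: number of frames divided by frames per second."""
    with wave.open(path, "rb") as audio:
        return audio.getnframes() / audio.getframerate()

def corpus_hours(root):
    """Total duration, in hours, of every .wav file found under `root`."""
    total_seconds = 0.0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".wav"):
                total_seconds += wav_duration_seconds(os.path.join(dirpath, name))
    return total_seconds / 3600.0
```

Soft links to files and duplicate copies are counted each time they appear, so pointing this at a directory that still contains duplicates would inflate the total, which may be one source of conflicting hour counts.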
The Experiment group’s primary objective during the five week project phase is to organize the scripts portion of MediaWiki and to simplify the experiment creation process.
Our main objective is to create a new script which interfaces with the MediaWiki API in order to simplify the process of creating an experiment. The script should automatically input the experiment number, allow a user to input the name of the experiment along with a brief description, and then post this information to the Experiments page of MediaWiki. The script will then prompt the user to create a corresponding directory on Caesar.
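The core logic of such a script can be sketched without any network access. In the sketch below, the `Exp NNNN` entry format, the function names, and the wording of the appended entry are all hypothetical, since the real layout of the Experiments page is not described here. The parameter names in the returned dictionary (`action`, `title`, `appendtext`, `summary`, `token`, `format`) follow the MediaWiki Action API's edit module, and actually posting them would additionally require logging in and fetching an edit token.

```python
import re

def next_experiment_number(wikitext, pattern=r"Exp\s+(\d{4})"):
    """Return one more than the highest experiment number found in the page text.

    `pattern` is an assumed entry format; returns 1 if no entry matches.
    """
    numbers = [int(digits) for digits in re.findall(pattern, wikitext)]
    return max(numbers) + 1 if numbers else 1

def build_edit_request(page_title, number, name, description, token):
    """Build the POST parameters for appending a new entry to the page."""
    entry = "\n* Exp {:04d}: {} - {}".format(number, name, description)
    return {
        "action": "edit",
        "title": page_title,
        "appendtext": entry,
        "summary": "Add experiment {:04d}".format(number),
        "token": token,
        "format": "json",
    }
```

Keeping the number parsing separate from the HTTP call makes the script easy to test against saved copies of the page before it is allowed to edit the live wiki.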
Some sub-tasks that we will also complete include categorizing all existing scripts as relevant or irrelevant, and creating collapsible menus for the Experiments page on MediaWiki in order to group the experiments by semester.
This semester the Experiment group intends to create a new script that interfaces with the MediaWiki API, with full documentation that is both comprehensive and easy to learn. We will organize and streamline existing scripts and experiments for ease of access and understanding. Any additional functionality that we develop in the form of scripts or other types of tools will be sufficiently documented as well.
Each member of the Experiment group will understand the process of creating an experiment with the existing scripts fully and comprehensively so that we can best understand where improvements to this process can be made. We will maintain open communication with the other teams as the project evolves, so that we can be a resource when creating experiments for all other students involved in the speech Capstone project.
- Condense Experiments Page on MediaWiki (Morgan 2/4)
- The goal is to condense the existing experiments page into expandable menus by semester for categorization and ease of navigation.
- Investigate use of MediaWiki API (Taylor 2/4)
- We researched the use of the MediaWiki API in order to create an experiment using a Perl script that would interface with MediaWiki.
- Create an Experiment (Nick 2/11)
- In order to understand the existing process, we felt that creating an experiment start to finish using the existing process would be beneficial.
- Documentation (Taylor 2/23)
- Write the proposal and create documentation for new scripts
- Write a Script using MediaWiki API (Ben, Nick 3/4)
- The goal of this script is to automatically pull the last existing experiment number from MediaWiki and increment it, and then prompt the user for an Experiment name and brief description.
- Test this script (Taylor 3/4)
- Categorize existing scripts (Ben, Morgan, Nick 3/4)
- By the time the new scripts to create an experiment are written, we would like to have a good idea of which of the existing scripts are relevant and which are not.