Speech:Spring 2015 Report



Introduction
This document is the formal concluding report for the Spring 2015 semester of the Computer Information Systems major's Capstone class. The goal of the Capstone project is to develop a world-class baseline speech recognition system using the Switchboard corpus.

Tasks were divided amongst the class so that each student could become an "expert" in a particular aspect of the speech system: Data, Tools, Experiments, Systems, or Modeling. These groups were later split into two competing teams, the Patriots and the Bruins, each containing an equal number of members from the original groups so that both rosters had the same mix of "experts". The task for these teams was to create the best speech model baseline. This report details the results from each of the five groups as well as the conclusions drawn at the end of the competition.

The competition results are outlined at the end of this document, while the results for each of the five initial groups are described below.

Modeling Group

Introduction
The modeling group was responsible for improving recognition accuracy. Our objective was to use the provided Sphinx speech recognition software to build a system that decodes audio data into text, and to improve its accuracy by altering the various elements Sphinx uses. The system has three major components that contribute to this accuracy: the dictionary, the language model, and the acoustic model. All three are built from a set of training audio files, and each can be refined to make the system more precise. Previous semesters tended to focus solely on the acoustic model, but our group explored all three. Taking advantage of every component allowed us to achieve better results than previous semesters on large amounts of training data.

Areas of Focus
From the start of this project we had two main goals. The first was to make the project easier for future semesters by cleaning up documentation. Every past semester has added its input to various pieces of documentation, creating a very large amount of information. The problem is that this information is scattered across many locations, making it difficult to understand how to perform even simple tasks. The second goal was to improve the word error rate of a train run on a large amount of audio data. Specifically, we wanted to get below a 50% word error rate with 256 hours of training audio; earlier groups had not achieved any successful results with more than 125 hours.
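
For context, the word error rate (WER) used throughout this report is the standard metric: WER = (substitutions + deletions + insertions) / reference words, so 50 errors against 100 reference words is a 50% WER. SCLite (one of the tools examined later in this report) is the usual scorer; a minimal invocation might look like the following sketch, where the file names are illustrative:

    # Hypothetical file names; each .trn line ends with an utterance id
    # in parentheses, e.g. "hello world (sw2001-A-0001)".
    sclite -r reference.trn trn -h decoded.trn trn -i rm -o sum stdout
    # The "Err" column of the sum report is the overall WER.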

Documentation
The biggest difference we have made in the documentation is to the tutorial on running a train. The wiki walks through the three-step process of building the two models and decoding to get results. The Spring 2014 semester restructured both the process and the tutorial to make it simpler, but in doing so made it much more difficult to apply necessary changes to the Sphinx configuration, and the process became almost entirely automated. That automation allowed people to get away with not learning what was actually happening. On top of these side effects, the new process no longer worked: it presumably ran fine last year, but so much has changed since then that the "simpler" process had become broken. We therefore brought back the old process, updating both the scripts and the tutorial. The newest tutorial on the wiki now has well-explained steps that make it easy for anybody to walk through the process while also learning what is happening behind the scenes. Along with this fix, we updated the links to properly direct the user through the tutorial; the older tutorials can now only be found on an archived page, to avoid confusion.

The scripts that actually run the train needed substantial changes to become functional. We are not sure how they became as broken as they were, but about half of them were unusable. We took it upon ourselves to fix all of them even though we never saw the working versions, and figuring out what each script was supposed to do, and what was causing the errors, was a real challenge. With prior experience running a train or using Sphinx it likely would have taken half the time, but this was not the case. Regardless, we got the process running smoothly before it was time to compete for better results.
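
The wiki's own scripts are not reproduced here, but for orientation, a bare SphinxTrain 1.0.8 session, independent of the wiki's scripts, looks roughly like this (the task name is hypothetical):

    # Stock SphinxTrain 1.0.8 workflow; "switchboard" is a made-up task name.
    sphinxtrain -t switchboard setup   # creates etc/sphinx_train.cfg and the directory tree
    # ...place audio, transcripts, and dictionary; edit etc/sphinx_train.cfg...
    sphinxtrain run                    # runs every training stage in order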

Acoustic Model
The acoustic model converts audio waveforms into words. During training it pairs the waveforms in the audio files with the corresponding transcripts to learn what each word sounds like; during decoding it uses that knowledge to convert raw audio into the text with the most similar waveforms.

The acoustic model is where most semesters have focused all of their time. Fortunately, it does appear to make the biggest impact on the final results; unfortunately, it also takes the longest to create, which makes it the most inconvenient to alter. Only a few variables yielded better results for us, but each had a significant impact.

The senone count is one we kept at 8,000 for a majority of our experiments; this is the maximum value recommended in various Sphinx documents, which also suggest using it for any data set exceeding 100 hours. The convergence ratio is another value we believe had a big impact: the default is 0.04, but we found that reducing it to 0.004 produced a better model. This modification does increase the real-time factor, so future semesters should try to find the optimal balance for this variable. The Gaussian density also varied throughout our experiments. Given large amounts of data, the density should be increased to a value that works well with the chosen senone count; documentation on density suggests it should not exceed 64, so most experiments conducted with 256 hours of data used that value. However, one experiment used a density of 128. It gave the lowest word error rate when decoding on seen data, but accuracy is likely to drop significantly on unseen data. We therefore recommend decoding that experiment on unseen data to decide whether the density should remain at 64.
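
For reference, these settings live in the experiment's Sphinx training configuration. A sketch of the relevant lines, assuming SphinxTrain's usual variable names (which may differ between versions):

    # Assumed excerpt of etc/sphinx_train.cfg; variable names may vary by version.
    $CFG_N_TIED_STATES = 8000;        # senone count, kept at 8,000
    $CFG_CONVERGENCE_RATIO = 0.004;   # tightened from the default 0.04
    $CFG_FINAL_NUM_DENSITIES = 64;    # Gaussian density; 128 risked over-training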

Language Model
The language model captures the context in which a word occurs. From the words before and after it, the model learns common sentence structure, which it uses to predict the next word in a sentence and to verify that the previous word makes sense in the given context.

The language model has been taken for granted by just about every semester before this one, which is exactly why we decided to look into what could be changed to improve it. Many variables can alter the accuracy of the language model, but we only looked into a few this semester. The biggest one we discovered was the default vocabulary cap. It is set to 20,000 words so that the machine building the model does not reserve more memory than needed; however, the 256 hours of data we trained on contained far more than 20,000 distinct words, meaning the vocabulary was being cut off and our language model was never completely built. Due to time constraints we could not run many tests to verify that raising the cap made a significant impact, but the few times we did, our results were better than expected. We recommend that future semesters investigate the rest of the language model to create the best one possible. Beyond modifications to the language model itself, there is a decoding parameter called the language weight that sets the balance between the acoustic and language models; changing it also affects how much of a difference the language model makes.
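
To make the vocabulary fix concrete, here is a sketch of rebuilding the language model with the CMU-Cambridge toolkit, raising the cap through wfreq2vocab's -top option; the file names are hypothetical:

    # Sketch only; file names are made up. The toolkit's default
    # vocabulary cap is the 20,000 words noted above.
    text2wfreq < train.trans > train.wfreq                # count word frequencies
    wfreq2vocab -top 64000 < train.wfreq > train.vocab    # raise the word cap
    text2idngram -vocab train.vocab < train.trans > train.idngram
    idngram2lm -vocab train.vocab -idngram train.idngram -arpa train.arpa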

Dictionary
The dictionary is probably the simplest component of speech recognition: it is essentially just a list of words with their pronunciations. That means there are really only three ways to try to improve it: increasing the number of words, decreasing the number of words, or changing the pronunciations. We believe the pronunciations are correct and should not be altered, which leaves changing how many words are listed. We attempted to trim the dictionary down to only the words contained in the audio files being decoded. This proved to have a very small impact on the results, but it did make a difference.
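
The class's trimming tool was a script; a rough shell equivalent of the idea, with hypothetical file names, might look like this:

    # Collect the words appearing in the decode set's transcript, then keep
    # only matching dictionary entries. Pronunciation variants such as
    # WORD(2) would need extra handling; this is a sketch only.
    tr -s ' \t' '\n' < decode.trans | tr 'a-z' 'A-Z' | sort -u > decode.words
    awk 'NR==FNR { keep[$1]; next } ($1 in keep)' decode.words full.dic > trimmed.dic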

Something a future semester should look into is increasing the size of the dictionary. Even though adding words may create conflicts that cause errors, there is also a possibility of error reduction. A large number of words appear in the audio transcripts but are not listed in the dictionary. These generally are not complete words; they are words that were cut off, marked with a dash (e.g. examp-). The transcripts we used contain a very large number of these part-words, which are likely causing a substantial portion of our errors.

Systems Group

Introduction
The systems group had two primary objectives for the Spring 2015 semester. The first was to migrate Caesar from its existing Dell PowerEdge 2650 hardware to newer Dell PowerEdge 2900 hardware. The second was to migrate Caesar's operating system from openSUSE 11.3, which is no longer supported by its developers, to Red Hat Enterprise Linux 6.6. This needed to be accomplished with as little downtime on Caesar as possible to minimize disruption to the other teams' work.

Pre-Migration
Before Caesar’s migration could occur, the systems group had to prove that the migration was worthwhile and eliminate as many unknowns from the migration process as possible in order to minimize downtime.

To test the new operating system, RedHat was first installed on the drone Verleihnix, and installation instructions were recorded on the systems group page. Network configuration changes were then made on the newly installed operating system, including editing the hosts file to make Verleihnix aware of the names of the other servers on the network. Verleihnix then had to be connected to the Internet indirectly through Caesar so that its copy of RedHat could be activated, a prerequisite for mounting Caesar's filesystem on Verleihnix. The systems group connected Verleihnix to the Internet by configuring the drone's network interface card, DNS settings, and kernel IP routing table. Once online, Verleihnix's copy of RedHat was activated and Caesar's filesystem was mounted on it.
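
The actual instructions are on the systems group page; an illustrative RHEL 6 sequence of the kind described, with all names and addresses made up, is:

    # All host names and addresses here are hypothetical.
    echo "192.168.1.10  caesar" >> /etc/hosts            # name the gateway server
    echo "nameserver 192.168.1.10" >> /etc/resolv.conf   # DNS settings
    route add default gw 192.168.1.10 eth0               # kernel IP routing table
    # Persistent NIC settings on RHEL 6 live in /etc/sysconfig/network-scripts/ifcfg-eth0.
    mount -t nfs caesar:/mnt/main /mnt/main              # mount Caesar's filesystem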

Once Verleihnix was successfully set up, the same operating system installation and network configuration were performed on the new server. The new Dell PowerEdge 2900 hardware was initially called Brutus, and it was on Brutus that all pre-migration work was done. The goal was to configure Brutus to the point where it could replace Caesar by having Caesar's filesystem copied over to it. This meant the drones needed to be able to connect to the Internet through Brutus just as they had through Caesar, which was accomplished by plugging Brutus into the UNH network, setting up a static IP address on it, and resetting the DHCP server.

Trains
After Brutus was properly configured, the following trains were run to compare the speeds of the two servers and justify the migration:
 * 5 hour train 0260-001 on Caesar
 * 5 hour train 0260-002 on Asterix with Caesar's filesystem
 * 5 hour train 0260-003 on Brutus
 * 5 hour train 0260-004 on Verleihnix with Brutus' filesystem

Trains run on Brutus were faster than those run on Caesar, which showed that migrating systems would increase the efficiency of speech research.

Migration
Migrating Caesar to Brutus involved creating a migration schedule to give the other teams insight into when server downtime could be expected so that they could plan accordingly. The migration took longer than expected because it was discovered at the last minute that the /home partition had 3TB of disk space while the / (root) partition had very little. For Caesar's filesystem to be copied over to /mnt/main without Brutus running out of space, the partitions first had to be resized.
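
For future reference, a stock RHEL 6 install uses LVM, so reshuffling space between /home and root looks roughly like the following; the device names and sizes are hypothetical, and the filesystem must be unmounted (and backed up) before shrinking:

    # Hypothetical devices and sizes; shrink the filesystem before the volume.
    umount /home
    e2fsck -f /dev/mapper/vg_brutus-lv_home
    resize2fs /dev/mapper/vg_brutus-lv_home 500G     # shrink the filesystem
    lvreduce -L 500G /dev/mapper/vg_brutus-lv_home   # then the logical volume
    lvextend -l +100%FREE /dev/mapper/vg_brutus-lv_root
    resize2fs /dev/mapper/vg_brutus-lv_root          # grow root into the freed space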

Extra care was needed when copying the filesystem over, because an ordinary cp (copy) command risked altering or dropping symlinks and file permissions. To ensure that the integrity of the filesystem was maintained, a combination of the ssh and tar commands was used to perform the copy. Once this step was completed, Brutus was renamed to Caesar and plugged into the UNH network, and Caesar's filesystem was mounted on each of the drones, completing the migration.
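
The general shape of such a copy, with hypothetical host and path names, is:

    # tar preserves permissions and symlinks that a naive cp could mangle.
    cd /mnt/main && tar cf - . | ssh brutus 'cd /mnt/main && tar xpf -'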

Tools Group

Introduction
The Tools group's main task was investigating the various software tools used in speech recognition, to determine the best update path and whether updating would be beneficial. Most of the tools currently in use have newer versions available that would support a modernization of the entire system. The updated versions we examined were SphinxTrain 1.0.8, CMU-Cambridge Statistical Language Modeling Toolkit 0.7, Sphinx 4, and SCLite 2.9. Since the Systems group moved the machines to a new operating system, we needed to make sure any updates would be compatible with the new OS. We also installed a new piece of software, Emacs, an extensible, customizable text editor, and spent considerable time doing literature searches on speech recognition and the software tools we were using.

Documentation
We added a page to the information section of the wiki for Speech Recognition Related Readings, uploaded all the PDFs we found during our literature search, and referenced them on this page. One article in particular compares Sphinx4 to Sphinx3 and states that Sphinx3 has the same performance as Sphinx4; the only real difference is that Sphinx4 is Java-based. Based on our extensive reading, we decided it was not worth updating from Sphinx3 to Sphinx4 at this time, since doing so would not enhance performance. After a lot of effort, we were also able to install Emacs on Obelix.

Data Group

Introduction
The data group was tasked with cleaning up the audio files and documenting where files are located. We also wanted to restructure and organize the Switchboard data so that all of the different-length corpus directories are set up the same way. Within the Switchboard corpus directory there are several of these directories: 125hr_3170, 256hr, first_5hr, and full. At the beginning of the semester, each had a completely different structure.

Areas of Focus
Organized existing data by creating new directory structure

Within the Switchboard corpus directory there are multiple data directories: 125hr_3170, 256hr, dist, first_5hr, full, and old. Dist contains the original audio files from each disk plus a directory called flat that holds soft links back to each disk. The 125hr_3170, 256hr, first_5hr, and full directories each had a different structure. We restructured each of them to contain a test directory and a train directory: the train directory is used for experiments that use the directory's entire dataset, whereas the test directory is for experiments that use a subset of it. The test directory contains an audio directory (with conv and utt subdirectories) and a trans directory. The train directory contains audio (with conv, OLD, and utt subdirectories), info, and trans directories. The various other directories that existed under each length were no longer in use, so they were moved under the OLD directory of each length.
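
Sketched as a tree, each length directory (125hr_3170, 256hr, first_5hr, full) now follows this layout:

    <length>/
        test/
            audio/
                conv/
                utt/
            trans/
        train/
            audio/
                conv/
                OLD/
                utt/
            info/
            trans/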

Fix broken soft links

The flat directory contains a soft link to each audio file on the original disks. Unfortunately, all of these soft links were broken, and the data group fixed them. Under each conv directory for each length of the corpus there were soft links into the flat directory; these were also broken and became another task to fix, as were the links in the utt directories.
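
Broken links of this kind can be located and repointed with standard tools; an illustrative pass, with hypothetical paths, is:

    # -xtype l matches symlinks whose targets no longer resolve (GNU find).
    find /mnt/main/corpus/switchboard -xtype l
    ln -sf /mnt/main/corpus/switchboard/dist/disk01/sw02001.sph \
           /mnt/main/corpus/switchboard/dist/flat/sw02001.sph   # repoint one link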

Eliminate old data in Experiments directory that is no longer necessary

When an experiment is created, a copy of the audio files is placed in the wav directory under that experiment. The data team went through all experiment folders from before Spring 2014 and deleted their wav directories to save hard drive space.
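
An illustrative version of that cleanup (the selection of pre-Spring-2014 experiments was done by hand; the paths here are hypothetical):

    # List candidate wav directories first, then delete after review.
    find /mnt/main/Exp -maxdepth 2 -type d -name wav
    find /mnt/main/Exp -maxdepth 2 -type d -name wav -prune -exec rm -rf {} +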

Documentation
Our process throughout the semester has been documented in our personal logs. Further information can be found on the Speech Corpus pages located on the wiki. Within the Switchboard Data Notes on the Information page, we added a diagram showing how the switchboard directories are now structured.



Experiments Group

Introduction
The experiment group's two main tasks were to organize the scripts portion of the MediaWiki and to simplify the experiment creation process. A new script was created that interfaces with the MediaWiki API to simplify creating an experiment: it automatically assigns the experiment number, lets the user enter the experiment's name and a brief description, and posts this information to the Experiments page of the MediaWiki. The script then prompts the user to create a corresponding directory on Caesar.
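
The script itself is Perl (createWikiExperiment.pl, below), presumably posting through the API's edit action. A hand-run equivalent with curl might look like the sketch here; the URL and experiment entry are hypothetical, and the required login and edit-token steps are omitted:

    # Hypothetical wiki URL; a real edit also needs a login session and token.
    curl -s 'https://wiki.example.edu/api.php' \
        --data-urlencode 'action=edit' \
        --data-urlencode 'title=Experiments' \
        --data-urlencode 'appendtext=* 0299-001 - brief description here' \
        --data-urlencode 'token=...'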

The scripts within the MediaWiki were also reorganized to be much cleaner. All existing scripts were sorted into relevant and irrelevant categories, and collapsible menus were created for the Experiments page to better group the experiments by their respective semesters.

Documentation
The createWikiExperiment.pl script is fully documented so that it is easy to understand and modify should future users need to make changes. Existing scripts and experiments were organized for ease of access and understanding.

Patriots

 * Members: Nathaniel Biddle, Melissa Bruno, Garrett Bryant, Krista Cleary, Trevor Downs, Dakota Heyman, Refik Karic, Taylor Kessel, Kyle Poirier, and Nicholas Tello

Description

 * Modified the language model to include 5,000,000 words in its vocabulary (using the "-top5000000" option)
 * Changed the dictionary to remove all words not found in the decode set provided by Professor Jonas. A Perl script was written to parse a transcript and create a dictionary containing only words from utterances that shared a fileid with the provided decode set.
 * Modified multiple parameters within the Sphinx configuration file:
 * Changed the convergence ratio from 0.04 to 0.004
 * Senone value was changed to 8,000
 * Density changed from 8 to 128

Results
We decided that our best result was a 32.2% word error rate, trained on a subset of the 256-hour data, with a real-time factor of 11.35. Our second-best result was also based on 256 hours; it had a higher error rate (40.9%) but a lower real-time factor (5.00). We judged the much better error rate to be the stronger result, since both real-time factors are relatively high.

Bruins

 * Members: Zachery Boynton, Adam Cheney, Kenneth Drews, Mohamed Fadlalla, Morgan Gaythorpe, Stephen Griffin, Benjamin Leith, Kayla Mackiewicz, Russ Sweet, Sam Sweet, and Christopher Teahan

Description

 * Language Weight:
 * The recognition process references predictions from both the language and acoustic models, and this value determines which model carries more weight: higher values tilt the system toward the language model, lower ones toward the acoustic model. The default value is 10, which was shown to be poor for any system with a reasonable language model. Research suggested that a near-ideal system has an optimal language weight of 27; in our system, however, 25 produced the best results over a test range of 25 to 35. This makes sense given how much more attention our acoustic model has received than our language model. (See the decoder sketch after this list.)
 * Beam Settings:
 * Adjusts how many hypotheses are pruned for poor word alignment, which occurs when the acoustic model cannot effectively match the phonemes of a processed audio chunk to known words. The CMU website suggested 1e-50 as an optimal value, which the previous semester's work confirmed.
 * Convergence Ratio:
 * In simple terms, this decides how much leniency is allowed as the model narrows in on a specific word probability. Less leniency leads to a more precise model but a higher real-time factor. The previous semester demonstrated that 0.004 is more effective than the default 0.04; however, more tuning may be possible here.
 * Sample Rate:
 * Changed to 8,000 Hz to match the sample rate of the test audio files.
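
For concreteness, settings like these are passed to the Sphinx-3 decoder as flags; an illustrative fragment (the real command takes many more arguments, such as the models and control file) is:

    # Illustrative flags only; the full sphinx3_decode command line is much longer.
    sphinx3_decode -lw 25 -beam 1e-50 ...
    # The 8,000 Hz sample rate is applied at feature extraction, e.g. sphinx_fe -samprate 8000.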

Results
Our best result was a 38.2% word error rate, trained on a subset of the 256-hour data, with a real-time factor of 7.11. Although we had lower WERs in shorter experiments, we felt it was important to show a competitive experiment from a larger corpus, since such experiments have failed to show positive results in past semesters. In addition, we achieved a lower error rate on a 125-hour train than the best result produced by last year's class. Our choice to run experiments across all three training sizes was specifically intended to produce results that conclusively demonstrate our research paid dividends. Our single result is really a culmination of our entire semester's work: we tested parameter changes on smaller experiments and carried the positive results over to the larger data sets.

Competition Results
After consultation with three judges, the unanimous winner of the Spring 2015 competition was determined to be the Bruins. The two teams' results were deemed similar in terms of word error rate versus real-time factor; however, the Bruins' research yielded information that called the Patriots' results into question.

From the CMU website it was determined that setting the density value too high (like the 128 used by the Patriots) would "over-train" the speech models. To clarify, we test our models on the same data used to train them, known as testing on "seen" data, but the goal of speech recognition is successful prediction on unseen data. In simple terms, over-training produces a situation in which the error rate on the seen test data is lower, but the error rate increases drastically when unseen data is encountered in real-world situations. The Bruins hypothesized that the density of 128 could have artificially lowered the Patriots' error rate by as much as 10%.

This theory could not be proven through experimentation, but it cast doubt on the Patriots' results. Regardless of results, the Bruins' final report was more thorough in its reasoning, and that was ultimately the deciding factor in determining a victor.

Future Semesters
The competition produced many findings that could help future semesters. As of now, the effectiveness of tuning the language model is undetermined: the Patriots did a lot of work on it only to find marginal gains, while the Bruins concluded that it is better to optimize the recognition process itself first, so that the full potential of the language model is realized before altering it. If future semesters decide to research the language model further, then the language weight can be adjusted, since focus will be split more evenly between the acoustic and language models. Future semesters will also have a major head start by taking the configuration of the winning experiment and tuning from there rather than starting from scratch.

One question that has lingered across semesters is why the word error rate increases as the amount of data goes up. Upon listening to the actual audio files, the Bruins found that the 125- and 256-hour data sets contain a great deal of variety in accents, whereas the 5-hour data set includes no such variety. This would account for the higher error rates in larger trains, and future groups may want to take it into consideration.

The documentation produced this semester should allow the next class to understand the entire process more quickly and effectively, and the results we achieved should allow future students to make better progress than any previous semester.