Speech:Spring 2017 Report



Overview
During the Spring 2017 semester of the Speech Recognition Capstone Project, our goal has been to create a world-class baseline that lays the foundation for future speech recognition research. This was achieved chiefly through improvements to the practical Word Error Rate (WER) on both seen and unseen data, via changes to the acoustic model, language model, and dictionary, and through verification of the Switchboard Corpus speech data.
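WER, used throughout this report, is the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of reference words. A minimal sketch of the metric (for illustration only; the project's actual scoring was done by the Sphinx tooling):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with standard edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

A score of 0.0 means a perfect transcript; one wrong word out of three gives roughly 33%.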

Work has also been undertaken to streamline, optimize, or replace the supporting hardware, software, and automation, improving the overall efficiency of the project.

Team Sub-Groups and Process Improvements
The work in the Spring 2017 Capstone Project was divided into five primary categories, each headed by a team with a specific area of chief responsibility. Each team researched process improvements, new processes, and script implementations within its subject area.

For more information on each group's sub-tasks, see the corresponding sub-section for that group. For more information on each member, see their personal log.

Team Members

 * Vitali Taranto
 * Alexander Turner
 * Gregory Tinkham
 * Jonathan (Tucker) Cleary

Introduction
This year's modelling group chose to focus on minimizing the word error rate with the transcripts and data in their current form. At the same time, the group recognized that there are many different approaches to speech modelling, so introducing neural networks into the language model was chosen as a possible bridge to other techniques, or to combinations of them, if needed. This year's group also put a heavy emphasis on using larger data sets for its experiments, particularly in the competition, where 300-hour experiments were run. This offered a better look at how well the results scale.

It was important to us that our progress was scored on unseen data, so that it could accurately be compared to the results achieved by the 2016 speech capstone class. Last year, they achieved a baseline with a WER of 45.4% on unseen data. Our goal was to produce an experiment that would score better than last year's baseline, and we pursued it through research into and integration of the RNNLM (Recurrent Neural Network Language Model) toolkit, LDA (Linear Discriminant Analysis), and manipulation of the configuration properties used by Sphinx.

RNNLM
A neural network is a computing system loosely inspired by the functionality of the human brain. It consists of a large collection of connected nodes (neurons) that learn to associate inputs with the best output. A recurrent neural network functions similarly to a traditional neural network, but it keeps a memory of what has been calculated so far. This is especially important for speech modeling because greater context can be incorporated into the model.

To incorporate this into our language model, we first trained an RNN-based language model on the transcript. Then we used the RNN model to generate random sentences. Lastly, we trained our final language model on that generated text as an n-gram model that the Sphinx decoder can interpret.
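The last step can be illustrated with a toy bigram count over generated sentences (a sketch only; the real pipeline used the RNNLM toolkit's output and a full ARPA-format n-gram model):

```python
from collections import Counter

def bigram_counts(sentences):
    """Count bigrams over sentences sampled from the RNN language model;
    these counts are the raw material for the n-gram model that the
    Sphinx decoder can actually load."""
    counts = Counter()
    for sentence in sentences:
        # pad with sentence-start/end markers, as n-gram toolkits do
        words = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(words, words[1:]))
    return counts
```

The counts are then smoothed and normalized into conditional probabilities to form the final n-gram model.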

The first experiments run with recurrent-neural-network-based language models have not yet proven more effective than the previously used, purely n-gram-based models.

LDA
LDA is a supervised dimensionality reduction method. It works by finding the projection (a lower-dimensional hyperplane) that preserves as much of the separation between classes as possible while discarding dimensions that carry little discriminative information. In that sense it is a close companion of linear classification rather than its opposite: it chooses the directions along which classes are easiest to tell apart.
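The idea can be sketched from scratch with NumPy (an illustration of the math only, not the LDA implementation built into SphinxTrain):

```python
import numpy as np

def lda_transform(features, labels, n_components):
    """Project features onto the directions that maximize
    between-class scatter relative to within-class scatter."""
    dim = features.shape[1]
    overall_mean = features.mean(axis=0)
    s_within = np.zeros((dim, dim))
    s_between = np.zeros((dim, dim))
    for c in np.unique(labels):
        class_rows = features[labels == c]
        class_mean = class_rows.mean(axis=0)
        centered = class_rows - class_mean
        s_within += centered.T @ centered          # spread inside the class
        diff = (class_mean - overall_mean).reshape(-1, 1)
        s_between += len(class_rows) * (diff @ diff.T)  # spread between classes
    # leading eigenvectors of pinv(Sw) @ Sb give the projection directions
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(s_within) @ s_between)
    order = np.argsort(eigvals.real)[::-1]
    return features @ eigvecs[:, order[:n_components]].real
```

In the acoustic model, the "classes" are the tied states, and the projection reduces each feature frame to a smaller number of dimensions.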

LDA proved significantly effective in testing. Utilizing LDA brought the Word Error Rate down to 31% on unseen data in a 30-hour experiment, down from 37% on our previous best model.

Sphinx Configurations

 * Final Number of Densities : This is the final number of Gaussian densities used when building features. Based on the 30-hour experiments, which achieved their best results with 16, and on the CMU documentation, 32 was chosen as the final value for this parameter.
 * Number of States Per Hidden Markov Model : As the name suggests, this is the number of states in each hidden Markov model built in the acoustic model. Our final number here was 5; with 7, our models became too flexible.
 * Skip State : Better results were achieved in our 30-hour experiments when the hidden Markov models were allowed to skip a state, giving the model the ability to bypass a state entirely when needed.
 * Senone Count : The senone count needed much more flexibility because of the nature of the 300-hour corpus. More specifically, the large number of speakers in the 300-hour corpus creates a lot of variance across the senones. 8000 was chosen because increasing the count to 4000 on five-hour trains, or to 10000 on thirty-hour trains, overtrained the models.


 * Linear Discriminant Analysis : The only parameter available for linear discriminant analysis is the number of dimensions to which LDA reduces the features. 32 was chosen based on research and a marked improvement in the 30-hour experiments.
 * Recurrent Neural Network Language Models : Recurrent neural network language models were used to generate text from which n-gram language models were built, but they did not improve the word error rate in either the five- or thirty-hour experiments. This is partly because an n-gram model must be built from sentences generated by the RNN model, since the Sphinx decoder cannot accept RNN models directly due to their structure.
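Gathered together, the winning settings map onto sphinx_train.cfg entries roughly as follows. The variable names here follow the usual SphinxTrain conventions and should be checked against the actual configuration file before use:

```perl
# Excerpt from a sphinx_train.cfg reflecting the parameters above (names assumed)
$CFG_FINAL_NUM_DENSITIES = 32;   # Gaussian densities per state
$CFG_STATESPERHMM = 5;           # states per hidden Markov model
$CFG_SKIPSTATE = 'yes';          # allow HMMs to skip a state
$CFG_N_TIED_STATES = 8000;       # senone count for the 300-hour corpus
$CFG_LDA_MLLT = 'yes';           # enable the LDA feature transform
$CFG_LDA_DIMENSION = 32;         # dimensions after LDA reduction
```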

Results
The results achieved are the best to date on unseen data at 300 hours, for both test/train.trans and test/eval.trans. The 41.3% word error rate on the eval.trans transcript for 300 hours is a great improvement over the 47% achieved by the 2016 group on 145 hours. Additionally, the 28.4% word error rate on test/train.trans improves on the 30.2% achieved by the Spring 2016 group on a 145-hour corpus. However, with the s tags removed, this score rises to 33.0%, which still trails some of the other experiments run by last year's groups, which scored 31% and 33%.

These large gains can mainly be attributed to utilizing Linear Discriminant Analysis, changing the number of states per model, and allowing skipped states. This let our acoustic model capture the wide variance in the different speakers' senones while the smaller number of states kept it from overtraining to the limited vocabulary of the phone conversations. One parameter that was explored but not used in these results is the convergence ratio; a smaller convergence ratio could offer better results, but would significantly increase the real time of training. Further improvements on 300-hour experiments could come from other modelling techniques such as vocal tract length normalization, improvements to the transcripts used in the corpora, and research into new model types such as long short-term memory networks or convolutional neural networks for both acoustic and language models. Additionally, further research into the adaptability of the Sphinx decoder could determine its compatibility with new model types.

Team Members

 * Sharayah Corcoran
 * Jeremy Beal
 * Jeffrey Gancarz
 * Huong Ha

Introduction
The tools group was responsible for researching existing speech tools, determining whether existing tools need to be upgraded, and installing and configuring new software tools to enhance our world-class speech recognition system. This section documents the progress made in accordance with the proposal found at Tools Group Proposal Spring 2017.

Main Speech Tools
The first part of our research was to determine the current versions of the main speech software. The main speech-related tools we looked into were Sphinx 3, the Sphinx Decoder, and the Sphinx Trainer. Software and installations were documented; for more information, please visit the Tools 2017 Group log or the Speech Software Functionality page.

PocketSphinx
Although we did not end up getting a chance to upgrade the current Sphinx software, we created a proposal to upgrade Sphinx 3 to PocketSphinx in a future semester. Due to time constraints, we did not have enough time to work on PocketSphinx and had to direct our attention to higher-priority software tools. Additionally, choosing to postpone installing PocketSphinx let all team members focus on making improvements to Sphinx 3. We nevertheless proposed that PocketSphinx eventually replace Sphinx 3 because it is one of the fastest decoders available.

Additional Tools
Our group also worked on analyzing and installing GCC and G++, a pair of compilers for Unix systems that were needed for the work of other groups.

GCC
GCC is a compiler system that supports various programming languages. The Spring 2016 Tools Group installed it on Majestix, and we went through the installation process again on Obelix to find out how installing GCC would affect the system.

G++
G++ is a C++ compiler that is usually operated through the command line; it requires GCC to be installed first. G++ was also installed on Obelix this semester, and on different drones this year, in order to test LDA.

Training before and after GCC and G++
A thirty-hour train was run upon completing the GCC installation on Obelix, and another thirty-hour train was run after installing G++. The before-and-after results showed only very slight Word Error Rate (WER) differences. As such, GCC and G++ are both safe installations for Caesar, since installing this software neither improves nor degrades WER.

Results
Comparison snapshots were taken before and after the GCC and G++ installations, and numerous trains and decodes were run to compare Sphinx 3's performance following those installations. For these reasons, GCC and G++ should be installed on Caesar for the benefit of the upcoming semesters.

Thorough research was done on PocketSphinx, the mobile-friendly decoder. Although PocketSphinx is one of the fastest decoders available, we thought that undertaking the implementation would take away from the improvements made to the Sphinx 3 decoder. For these reasons, and for lack of time before the end of the semester, PocketSphinx was not utilized.

Team Members

 * Jake Sprague
 * Nick Bielinski
 * Cody Roberge
 * Zachary Dudek

Introduction
Our original aim when we started Capstone this year was to "create the tools necessary to easily generate trains, decodes, and experiments, as well as ensure that we have all the tools required to be as efficient as possible in our progress towards a world class baseline. We want to simplify and improve existing scripts, merge scripts together that are often used in tandem, and create new scripts to reduce boilerplate activity". We ended up sticking fairly close to that original goal: we improved a couple of existing scripts and created or merged several new ones. Hopefully the work we did this semester will be useful to other groups in the future.

Areas of Focus
Our main area of focus this semester was updating and creating scripts to make the experiment process easier as a whole; we worked on three scripts specific to this goal. Our group also worked closely with the data group to write a script aimed at producing better data models so future semesters could get better results. Lastly, we worked on smaller things like cleaning up scripts, organizing the scripts folder, and improving the documentation so it is up to date with the current scripts and filled out with better information.

Results
The first script we worked on was addExp.pl. It was changed slightly to align with the school's updated login system: it no longer uses the wildcat directory and now only allows sign-in via Active Directory. The script was also adjusted so that when a group creates its root experiment page on the wiki, an 001 sub-experiment is created as well.

The second script we worked on was copyExp.pl, another script meant to make the experiment process easier. It lets you copy the train or decode from one directory to another, or copy an entire experiment. We told the program which files and folders to look for depending on what was being moved, and then copied those specific files and folders to the specified directory. We also had to update all hard-coded links in the configuration files for sphinx_train and sphinx_decode; otherwise the copy would keep using the previous directory's files instead of the current directory's. Finally, every file and folder carrying the old sub-experiment number is renamed to the new sub-experiment number once copied. For example, if files labeled 001_xxxx are copied to the 002 directory, they all become 002_xxxx. This is mostly cosmetic; the file names probably do not have to match the sub-experiment number, it just makes more sense that way.
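The copy-and-rename behavior can be sketched in Python along these lines (an illustrative analogue, not the actual Perl code; the prefix scheme and .cfg rewriting are simplified):

```python
import os

def copy_experiment(src_dir, dst_dir, old_id="001", new_id="002"):
    """Copy experiment files, renaming the sub-experiment prefix and
    rewriting hard-coded paths inside the .cfg files."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        src_path = os.path.join(src_dir, name)
        if not os.path.isfile(src_path):
            continue
        # 001_xxxx -> 002_xxxx, purely cosmetic but consistent
        new_name = name.replace(old_id + "_", new_id + "_", 1)
        with open(src_path) as f:
            text = f.read()
        if name.endswith(".cfg"):
            # otherwise the copy keeps pointing at the old directory
            text = text.replace(src_dir, dst_dir)
        with open(os.path.join(dst_dir, new_name), "w") as f:
            f.write(text)
```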

The last two scripts were created at roughly the same time, so we split up the group to work on them. createExp.pl was one of the final scripts we created this semester, meant to make the experiment process quicker and easier. Instead of following the wiki step by step to create an experiment, we turned the wiki steps into a program that guides you along: you only need to enter information such as your switchboard size and directory, and it runs all the scripts and applies the settings for you. This is a good script to use after everyone has done their initial experiment by hand, so they know what is happening in the background. Alongside createExp.pl, our group also worked on genTrans.new.pl, the script the experiment group worked on in cooperation with the data group. New regular expressions were added so that some words would be kept and some words or characters removed, so that the data model would be better overall and produce better results. We were not able to verify with the data group whether the new regular expressions improved the score.

Our group also worked on various other things, such as updating the steps for making an experiment. Some areas were lacking, such as which directory a command should be run in; that may not mean much to someone who has run many full experiments, but it can be confusing for beginners. We also moved some older scripts, organized the scripts folder a little more, commented our scripts almost line by line, and wrote detailed documentation for them on the wiki page.

Team Members

 * Matthew Fintonis
 * Maryjean Emerson
 * Dylan Lindstrom

Introduction
The data group is responsible for the Switchboard Corpus data: maintaining its integrity, keeping it up to date, and supplying the other areas of the project with good-quality data so that they can run their experiments and models on it efficiently.

Areas of Focus
Our semester goals were to make sure that the data we have is accurate and up to date. To achieve this, we worked on the regular expressions in the genTrans.pl script and modified the dictionary used for the experiments.

For the regular expressions, our goal was to modify them so that the non-speech tags [laughter], [laughter-word], [vocalized-noise], etc. were removed from the transcripts while the words were kept. An example is a tag that combines laughter with a partial word, such as [laughter-he]. By keeping the partial word, we could decrease the WER and get a better decode score.
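The intent can be demonstrated with Python's re module (a hypothetical simplification of what the actual genTrans.pl regular expressions do, with made-up tag patterns covering only the cases described above):

```python
import re

def clean_utterance(message):
    """Drop non-speech tags but keep partial words embedded in them."""
    # [laughter-he] -> he : keep the partial word, drop the tag
    message = re.sub(r"\[laughter-([^\]]+)\]", r"\1", message)
    # [noise], [vocalized-noise], bare [laughter] -> removed entirely
    message = re.sub(r"\[[^\]]+\]", "", message)
    # collapse the whitespace left behind
    return " ".join(message.split())
```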

Once that script was modified then we wanted to take the partial words that were created and add them to the dictionary so that they would be recognized. We also wanted to add to the dictionary the words that were not currently in it. We hoped this addition would improve the WER.

Another goal was to improve the master dictionary by adding new words. The current master dictionary has roughly 40,000 words, whereas the latest CMU dictionary release has over 130,000. CMU suggests that the dictionary size should realistically be between 200 and 500 thousand words. The small size of our master dictionary may have been intended to reduce train and decode times by limiting the amount of data to work with, but for real-world results a much larger dictionary should be used to improve recognition.

Our last goal was to look through the text files of the transcripts for any anomalies that might be causing a poor score report and WER.

Improving Macros In Scripts
The listing below outlines the changes we made to the genTrans.pl script. All of the changes were successfully added to the script in the form of regular expressions; however, we were never able to use them in a full train and decode, for reasons explained further down.

Below are the regular expressions added to genTrans.pl:

$message =~ s/noise]//g;        # remove the "noise]" portion of noise tags
$message =~ s/\[laughter//g;    # remove the "[laughter" portion of laughter tags
$message =~ s/\[vocalized//g;   # remove the "[vocalized" portion of vocalized-noise tags
$message =~ s/\w*\[\w*\]-//g;   # drop word[tag]- sequences
$message =~ s/-\[\w*\]\w*//g;   # drop -[tag]word sequences
$message =~ s/\[.*?\]-//g;      # drop any remaining [tag]- prefixes
$message =~ s/-\[.*?\]//g;      # drop any remaining -[tag] suffixes

Conclusion for Improving Macros:

While we were able to successfully add and test the regular expressions in the script, we were never able to fully use them in a train and decode. One of the main issues was the partial words: the partial words created by the regular expressions would not be in the dictionary, and even if they were manually added, the phonemes the program uses to pronounce each word are not automatically created. We thought of a few solutions to this issue, but did not have the remaining time to fully implement and test them; this is something a future data group could take up.

Improved Dictionary
To improve the dictionary, we created a script called 'getNewWords.pl' which grabs newly-added words from the Oxford English Dictionary website and adds them to a text file, or "word file". This word file can then be uploaded to the CMU Lexicon Tool, where the pronunciations for each word will be generated and saved in a format that is readable by Sphinx. This word file can then be uploaded to the server and then added to the master dictionary by running the 'addNewWords.pl' script, which adds the new words, removes duplicate entries, and then sorts them alphanumerically.
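The merge step performed by addNewWords.pl can be sketched as follows (a simplification; the real script operates on CMUdict-format "WORD  PH1 PH2 ..." lines and this helper name is illustrative):

```python
def merge_dictionary(master_lines, new_lines):
    """Add new pronunciation entries to the master dictionary,
    dropping exact duplicates and sorting alphanumerically."""
    entries = set(line.strip() for line in master_lines + new_lines)
    entries.discard("")  # ignore blank lines
    return sorted(entries)
```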

Over the semester, we added over 9,000 words to the latest dictionary release (0.7b) from CMU. This dictionary is located on the server as 'test.dic', in the same directory as the master dictionary. However, similar to the improved 'genTrans.pl' script, we ran into errors with the 'verify_all.pl' script when trying to run a train.

Team Members

 * Andrew George
 * Mark Tollick
 * Julian Consoli
 * Bonnie Smith

Introduction
The Systems group is tasked with keeping Caesar and all of the drones running smoothly for the semester. This semester we were tasked with fixing backups, researching and implementing Torque/Maui, rebuilding and maintaining the servers, supporting requests from the other groups and teams, gaining management access to the Enterasys switch and configuring multiple VLANs, cleaning up the cable management on the server rack, and giving all drones access to the internet via a wireless dongle.

Areas of Focus
Registering Red Hat on all drones, installing Torque, getting internet access to all drones, fixing backups, and facilitating other groups' needs as they arose.

Results
Registering Red Hat on all the drones

We were initially unaware that the drones had unregistered versions of Red Hat. We talked with Bruce from UNH IT and obtained the activation keys for Red Hat; however, they did not work at first. We were able to register Red Hat by editing the domain server so it could reach cinnabar.unh.edu, then following the guide Bruce sent us to install the certificate needed to connect to cinnabar.unh.edu. Once that was figured out, we registered Red Hat on all the drones that had Internet connectivity, which allowed us to move forward with our group tasks.

Installation of Torque

First we did a lot of research on what Torque is and how to install it, reading through much of the documentation before deciding on a version. After initially being told the wrong drones to use as the main node and the compute node, and then uninstalling Torque from them, we finally installed it on the correct drones: Rome as the main node and Majestix as the compute node. The first leg of the Torque installation went smoothly; we followed the documentation on the Adaptive Computing website with no issues. We then installed Torque on the compute node (Majestix). We ran into a few issues, documented in Andrew George's log on 3/19/2017, but once those were resolved the installation went smoothly.

Internet connectivity to all drones

Professor Jonas provided a wireless dongle to help connect all the drones to the internet. The intent was to set it up on Rome and turn Rome into a wireless router, routing traffic to the drones. The disc provided with the dongle contained a Linux installation, which was good news, as it allowed us to utilize the dongle. We extracted the files from the disc but ran into issues building the drivers, which pointed to missing kernel packages. After specifying the correct kernel version to download, running a yum update, and rebooting the machine, the drivers installed successfully. With this success, the dongle was able to see the UNH-Public SSID. This was problematic, as the public WiFi stops working after 30 minutes. Setup for the wireless dongle is documented at https://foss.unh.edu/projects/index.php/Speech:Hardware#Setting_up_the_Wireless_Dongle. Professor Jonas informed us that we could use the switch in the server room, and after eventually gaining access to it we were able to put the drones' secondary interfaces on a separate VLAN so that Rome could route the traffic on a different subnet (172.16.0.0/24). The installation of the wireless dongle was completed by following the steps in Julian's April 9th log entry. There were still issues pinging Google, but those were resolved after an old firewall entry was deleted. Once that was fixed, Rome finally had Internet access.

Fix backup system

The backup system runs rsync snapshots and was set up correctly; however, there was substantial packet loss between Rome and the backup VM located in the tech consultants' work room. After some troubleshooting, we determined that the issue was not the physical connection but the hardware of the machine in the consultants' room: the VM was not seeing the network adapter. The physical backup machine (a Dell T105) was swapped for a replacement machine Professor Jonas had in the IT storage room (a Dell T100), and the HDD was moved from the original machine into the replacement. At that point the connection was tested and showed zero packet loss. The replacement's fan ran loud, though, raising concern about the noise level for the consultants in the room, so the fan from the original T105 was placed into the replacement machine. This caused a front-fan error at startup. After reassembling everything and putting the machine back in its place, the connection was tested again and showed significant packet loss; rsync was restarted, and after it came back up there was only about twenty-five percent packet loss. Professor Jonas cleaned and oiled the original fan and put it back into the current rsync server, after which the front-fan error disappeared.

Documentation

The drones were, unknown to us, running an unregistered version of Red Hat; in order to perform yum installs, they had to be registered. Documentation on how to do this can be found at https://foss.unh.edu/projects/index.php/Speech:Spring_2017_Systems_Group. Documentation was also added on how Torque was installed, covering some of the issues that occurred during installation and how they were overcome, as well as the installation of Maui, which is used as a scheduler with Torque. The documentation for setting up the wireless dongle can be found at https://foss.unh.edu/projects/index.php/Speech:Hardware#Setting_up_the_Wireless_Dongle; it will be useful when the drones are upgraded and incredibly helpful for troubleshooting any issues that may arise.

Empire

 * Members: Alex Turner, Andrew George, Cody Roberge, Dylan Lindstrom, Huong Ha, Jake Sprague, Jeffrey Gancarz, Julian Consoli, Maryjean Emerson, Vitali Taranto

Plan
The Empire team decided that the best way to start would be to build on the results of Spring 2016 and then add our own new technologies on top in the hope of improving them. We decided it would take too long to test our theories on the 300-hour data, and instead used the 30-hour corpus for testing and the 300-hour corpus once we were finished. It was important to us to always score our experiments on unseen data, since seen data would lead us to overfit the model, and we didn't want to be surprised when it came time to test on unseen data. The Empire sought to achieve world-class results through new technologies such as VTLN and MMIE, which had not been used by previous capstones, but we were ultimately unsuccessful in those attempts.

Results
Using our best experiment, we achieved a 45% word error rate on the 300-hour corpus, compared to the 50% error rate achieved by Spring 2016.

Rebels

 * Members: Jonathan (Tucker) Cleary, Greg Tinkham, Matt Fintonis, Nick Bielinski, Zach Dudek, Mark Tollick, Bonnie Smith, Sharaya Corcoran, Jeremy Beal

Plan
The framework that brought us to our results was to use as much data as possible in our experiments and to create models flexible enough to adapt to the wide range of speakers without over-fitting on them. To do this, we used a build-up approach: all preliminary tests were run on 30-hour corpora, and from there we tweaked our parameter changes to adapt to the 300-hour corpus. This allowed us to run more experiments over the course of the competition than running everything on 300-hour corpora would have, while still testing our models on a larger set of data than the 5-hour corpus.

Results
The results achieved are the best to date on unseen data at 300 hours, for both test/train.trans and test/eval.trans. The 41.3% word error rate on the eval.trans transcript for 300 hours is a great improvement over the 47% achieved by the 2016 group on 145 hours. Additionally, the 28.4% word error rate on test/train.trans improves on the 30.2% achieved by the Spring 2016 group on a 145-hour corpus. However, with the s tags removed, this score rises to 33.0%, which still trails some of the other experiments run by last year's groups, which scored 31% and 33%.

Future Semesters
Modeling

The modeling group has a number of recommendations for future semesters. The first is to pick up where this year's group left off with the parameter tweaks used in the competition. Secondly, the usefulness of the RNNLM toolkit on larger experiments, such as 145 and 300 hours, has yet to be determined. The language model portion of the project should also gain more attention in future semesters, including a look at how to use language models that can capture more context than simple n-gram models. Finally, a close look at how much more performance can be gained from the current setup, in terms of the transcripts and the splitting of the corpus into seen and unseen data, could reveal similarities or differences in the distributions of each set. Other areas to look into include other acoustic model methods (various neural networks or changes to hidden Markov models), speaker-independent models, new language models, and incorporating outside data into the system (other large corpora are available, as well as many large data sets for building language models).

It would be wise to update Sphinx from Sphinx 3 to Sphinx 4, as doing so would allow advanced methods such as VTLN and MMIE to be used. VTLN especially has the potential to improve WER as much as LDA did, if not more.

Tools

Due to time constraints, we were unable to get GCC and G++ installed on Caesar. This can, and should, be completed by the following semester. To help the future Tools Group, we have compiled the installation instructions for both GCC and G++ on the Tools 2017 Group Wiki. Future semesters should also consider installing and implementing PocketSphinx. However, we believe that more time and energy should be spent on improving Sphinx 3: due to the large differences between PocketSphinx and Sphinx 3, testing both and creating accurate train comparisons could be time-consuming and take away from Sphinx 3 train configuration improvements. The pros and cons of installing PocketSphinx are covered in a separate proposal; that proposal and its supporting documents can also be found on the Tools 2017 Group Wiki. Finally, the existing table of software installed on Caesar should be reviewed and updated to reflect any new installations.

Experiments 

The main thing we would have liked to work on further is the createExp.pl script. We wanted to add a feature for changing different configurations in the .cfg files, which would make the script more useful in the future, especially when teams compete against each other at the end of the semester: instead of manually editing the .cfg file, you would edit it through the script. As of right now, you can only create an experiment using one of the switchboard sizes and the default settings. The script also only runs the decode for seen data, not unseen data. Had we had more time, these are the two things we would have wanted to finish, and they would be a good place for a future group to start in order to get into the project.
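That configuration-editing feature might work along these lines (a sketch only; the function name is hypothetical, and the file format is assumed to follow sphinx_train.cfg's Perl-style `$CFG_* = value;` assignments):

```python
import re

def set_cfg_value(cfg_text, key, value):
    """Replace a single $KEY = ...; assignment in .cfg file text."""
    pattern = rf"^(\${key}\s*=\s*).*?;"
    return re.sub(pattern, rf"\g<1>{value};", cfg_text, flags=re.MULTILINE)
```

A wrapper script could read the file, apply one call per parameter the user wants to change, and write it back before kicking off the train.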

Data

Our recommendation for future classes working on the data is to work with the experiment team on the verify_all.pl script. We were unable to fix the errors we received once the new regular expressions were added to the genTrans.pl script, and similarly, the improved dictionary does not work due to errors in the verify_all.pl script. Some of the core files released by CMU may need to be edited, so a group with programming knowledge might be the best fit for this in the future. The Sequence-to-Sequence G2P toolkit with TensorFlow might also be worth looking into for future groups.

Systems

The future systems team needs to finish the backup server and make sure it syncs backups on a regular schedule. Then the focus should turn to getting Torque installed on all of the drone machines and figuring out how to make them work together to run trains and decodes. Next, the WiFi dongle will need troubleshooting, either to connect it to the UNH Secure network or to keep the connection from timing out after 30 minutes; giving every machine an Internet connection will help keep the systems up to date and patched. Lastly, we recommend taking a look at the hard drives in each drone to make sure there are separate partitions and that the boot drives can be cloned. During their semester, the future class will also need to focus on maintaining the systems and supporting the needs of the other groups when asked.