Speech:Spring 2017 Proposal

From Openitware

Overview

The overall goal for this project is to work towards achieving a world class baseline for our speech recognition technology. We will achieve this through research and experimentation using a variety of tools and data sets running on systems that will be maintained to ensure reproducibility, consistency, and ultimately, forward progress.

The data that is used will be refined by reviewing the corpus transcripts and creating a more complete collection of the spoken word, ensuring minimal loss of information between the audio recordings and the written transcripts. In conjunction with this refinement, the language model will be a point of focus and research, as we implement state-of-the-art techniques intended to greatly reduce the Word Error Rate (WER). This progress will be sustained by maintaining stable systems and installing the tools required to achieve our overall goal. Furthermore, we will develop scripts, as needed, to make efficient use of the time allotted for this project. All areas of progress, as well as regression, will be documented and logged so as to decrease the time needed for future development.

The project, as a whole, can be defined in five sections: Modeling, Tools, Experiments, Data, and Systems. Each section includes an overview of its respective responsibilities, the goals we want to accomplish, and finally a timeline of when and who, specifically, will work towards those goals.

Group Membership

Modeling: Vitali, Alexander, Gregory, Jonathan
Tools: Sharayah, Jeremy, Jeffrey, Huong
Experiments: Jacob, Nicholas, Cody, Zachary
Data: Matthew, Maryjean, Dylan, John
Systems: Andrew, Mark, Julian, Bonnie

Modeling

Team Members

Overview

While it is possible to continue the progress of past semesters by concentrating largely on the acoustic model, the best options need to be considered when working toward a world class baseline. More specifically, we will look to other types of acoustic and language models to gain better word error rates, focusing on testing against unseen data. Last year's modeling gained better results from tweaking the language model; this year, a greater focus will be placed on implementing other options for the language and acoustic models. Some of the best language models are built with neural networks, a type of model fully supported by the RNNLM (Recurrent Neural Network Language Model) Toolkit. This will be a large focus in the coming semester, as there have been significant advances in language modeling in recent years. Progress was also made last year by increasing the number of words in the dictionary; although important, this is a smaller target for improvement this semester. Utilizing Linear Discriminant Analysis (LDA) for acoustic models will also be a major focus of the modeling work. LDA support is already partially implemented in Sphinx, but it depends on Python with the SciPy and NumPy packages, which are not currently installed. Using LDA should yield a significantly lower word error rate. Currently, the best Word Error Rate on unseen data is 45.4%, scored by Experiment 007, the final experiment of Spring 2016, found here [1]. According to this article [2], that places the project's error rates at roughly the level the field reached in the late 1990s. From these two sources, [3], http://mi.eng.cam.ac.uk/projects/cued-rnnlm/papers/Interspeech15.pdf, error rates in the low 30s can be expected from properly tuned acoustic models utilizing LDA together with language models built by recurrent neural networks; this will be our ultimate goal for the semester.
Hopefully, with the introduction of neural networks into the Capstone program, future semesters can work towards the industry standards achieved using deep neural networks.
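For reference, the Word Error Rate discussed above is the word-level edit distance between the reference transcript and the decoder's hypothesis, divided by the number of reference words. A minimal sketch of the computation (in Python, purely illustrative; actual scoring in the project is done by the Sphinx tools):

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat"))            # 0.0
print(wer("it is a test", "it is test"))            # 0.25 (one deletion / 4 words)
```

A WER of 45.4% thus means that, on average, roughly 45 insertions, deletions, or substitutions occur per 100 reference words.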

Goals

The goals of this semester can be summarized as continuing the improvements that the Spring 2016 semester made. As stated above, Linear Discriminant Analysis in acoustic models, recurrent neural networks for language models, and expanding the dictionary while cleaning up transcripts should together provide a vastly better WER.

  • Establish a baseline model with the current toolset for testing on unseen data
  • Use cleaned transcripts from the data to test models on
  • Coordinate in order to clean the data transcripts efficiently and effectively
  • Install and test the necessary libraries (Scipy and Numpy) for LDA on Idefix and report the results of the installation of those tools
  • Install and test G++4.5 and the RNNLM toolkit and report the results of the installation of those tools
  • Create a new baseline model using LDA and RNNLM for testing on unseen data
  • Improve new baseline model, document future areas for improvement including other model types

Plan

Implementation Timeline

March 1st, 2017

  • Install Numpy and Scipy packages on Idefix (Tucker, Vitali)
  • Install RNNLM toolkit on Idefix (Greg, Alex)
  • Use settings from experiment 0288 011 to recreate the experiment to act as a baseline for this semester (Tucker, Greg, Alex, Vitali)

March 8th, 2017

  • Implement LDA on the Idefix drone machine (Tucker, Vitali)
  • Build language model as a proof of concept (Greg, Alex)

March 15th, 2017

  • Test switching on LDA option in the sphinx_train.cfg file (Tucker, Vitali)
  • Run experiment with LDA and compare with newest experiments up to that point (Tucker, Greg, Alex, Vitali)
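Switching on the LDA option is expected to be a small edit to sphinx_train.cfg. A sketch of the relevant settings, based on SphinxTrain's documented LDA/MLLT support (the variable names and dimension value are assumptions to be verified against the version installed on Idefix):

```perl
# sphinx_train.cfg -- enable LDA/MLLT feature transformation
# (requires Python with NumPy and SciPy on the training machine)
$CFG_LDA_MLLT = 'yes';     # compute an LDA/MLLT transform during training
$CFG_LDA_DIMENSION = 29;   # reduced feature dimension after the transform
```

If the variables are absent from the installed config, that would itself be useful evidence about which SphinxTrain version is deployed.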

March 22nd, 2017

  • Compare results against baseline test (Greg, Alex)
  • Run experiments with the updated data from the data transcripts (Tucker, Greg, Alex, Vitali)
  • Report feedback of the results using the new data transcripts (Tucker, Greg, Alex, Vitali)
  • Incorporate RNN-based language model into an experiment (Greg, Alex)

Continuous Throughout Semester

  • Maximize CPU time on the drone machines available to test out new experiments; this will be taken on by those not directly involved with researching or implementing a new package or tool onto the machine (Tucker, Greg, Alex, Vitali)

Tools

Team Members

Overview

Our aim is to have a thorough understanding of the software features currently being used for the UNH Manchester Speech project, as well as to provide software and/or update suggestions when necessary. Understanding and documenting the purpose, versions, and benefits of the project's software features is vital for creating a world class baseline and for our overall success in this project.

GCC, the GNU Compiler Collection, is currently installed on Majestix. We need to determine whether it is worthwhile to install G++, the GNU C++ compiler, and whether it is backwards compatible with GCC. We also plan to look into PocketSphinx and compare it to the current decoder, Sphinx 3.7. Performance advantages, as well as the tools that come with PocketSphinx, will be taken into consideration before installation. We will also remove any duplicate tools as necessary, and will keep and/or install any new tools that are desired. The artifacts regarding our tools research, and the software currently being utilized, can be found on the Speech Software Functionality wiki page.

Goals

  • Make improvements to the tools that were implemented by the Spring 2016 class
  • Determine if it is beneficial to update or install new software
  • Identify if G++ is backwards compatible with GCC
  • Install G++ on an available machine for testing and comparison purposes
  • If G++ installation benefits are significant, install it on Caesar upon receiving proper approval
  • Research PocketSphinx and compare it to Sphinx 3.7
  • Install PocketSphinx on an available machine for testing and comparison purposes
  • Gain a comprehensive understanding of current system software and features
  • Provide software updates and/or installations to support all of our needs
  • Document findings on the Speech Software Functionality page

Plan

Implementation Timeline

Feb 8th, 2017

  • Research and get a base understanding of Pocket Sphinx (Jeremy, Huong, Jeff, Sharayah)
  • Continue writing proposal draft (Jeremy, Huong, Jeff, Sharayah)
  • Run a test train (Jeff, Sharayah)

Feb 15th, 2017

  • Research installation and features of G++ (Jeremy, Huong, Jeff, Sharayah)
  • Compare G++ to GCC (Jeremy, Huong, Jeff, Sharayah)

Feb 22nd, 2017

  • Continue investigation of G++ and Pocket Sphinx (Jeremy, Huong, Jeff, Sharayah)
  • Finalize report on GCC and G++ comparison (Jeremy, Huong, Jeff, Sharayah)
  • Run a train for comparison purposes before G++ install (Jeff, Sharayah)

Mar 1st, 2017

  • Run a train for comparison purposes after G++ install (Jeremy, Jeff, Sharayah)
  • Check on possibility of installing G++ on Majestix (Jeff, Huong)
  • If it is installed, check to see if it can be put on Caesar (Jeff, Huong)

Mar 8th, 2017

  • Create proposal for Pocket Sphinx installation on Majestix (Jeremy, Huong, Jeff, Sharayah)
  • Install Pocket Sphinx on Majestix (Jeremy, Sharayah)
  • Determine if software is required to be installed on other machines (Jeremy, Huong, Jeff, Sharayah)

Mar 22nd, 2017

  • Run a train for comparison purposes after Pocket Sphinx install (Jeremy, Huong)
  • Create proposal for Pocket Sphinx installation on Caesar (Jeremy, Huong, Jeff, Sharayah)

Mar 28th, 2017

  • Install Pocket Sphinx on Caesar (Jeremy, Sharayah)

Experiments

Team Members

Overview

Our aim is to create the tools necessary to easily generate trains, decodes, and experiments, and to ensure that we have everything required to work as efficiently as possible towards a world class baseline. We want to simplify and improve existing scripts, merge scripts that are often used in tandem, and create new scripts to reduce boilerplate activity.

Goals

  • Organize and archive the old experiments in the Exp folder
  • Go through all the scripts and make sure they are still relevant and useful; make sure each script is accurately reflected and clearly described on the wiki page
  • Look through all the existing scripts, note the most commonly used ones, and look for areas of improvement, such as making them more user friendly, bringing them up to date with our needs, and merging scripts that are used one after another
  • Fix MediaWiki to get rid of the WILDCAT domain
  • Fix addExp.pl by removing the WILDCATS domain and adding AD as the default, as well as forcing the script to create a sub experiment when an initial root experiment is created
  • Understand "copy experiment" for decode and train improvements
  • Copy decode (copyDecode.pl will be created)
  • Copy train (copyTrain.pl will be created)
  • Make decode (makeTest.pl will be checked out and updated if need be)
  • Make train (makeTrain.pl will be checked out and updated if need be)
  • Run experiments to determine how the scripts will need to function and determine areas where manual tasks can be automated

Plan

Implementation Timeline

Feb 8th, 2017

  • Look through older scripts and determine how they are coded, how they interact with the system, etc. (all)
  • Learn Linux commands, how to log in to the servers, and how to interact with our scripts (all)
  • Look at last year's experiment logs in order to gain a better understanding of the previous progress made (all)

Feb 15th, 2017

  • Run some of the more prominent scripts to see where, or if, we can make any improvements (all)
  • Fix addExp.pl to default to AD and make it auto force a 001 sub experiment (Jake, Cody)
  • Work on finishing final proposal (Nick)

Feb 22nd, 2017

  • Begin creating/updating makeTrain.pl, makeTest.pl (makeDecode.pl), copyTrain.pl, and copyDecode.pl (Nick, Zack, Jake, Cody)

Mar 1st, 2017

  • makeTrain.pl and makeTest.pl will be completed, whether they need to be updated or not (Nick, Zack)
  • Begin work on copyTrain.pl and copyDecode.pl, if not already started (Jake, Cody)

Mar 8th, 2017

  • copyTrain.pl and copyDecode.pl will be completed; if more time is needed, this task will take another week (Jake, Cody)
  • Look into combining scripts to increase usability and efficiency, if the previously mentioned scripts are completed (Nick, Zack)
  • Start improving the relevant wiki pages by ensuring everything is up to date, add new information, as needed, add more detail, etc. (Nick, Zack)

Mar 22nd, 2017

  • copyTrain.pl and copyDecode.pl will be finished by this week (Jake, Cody)
  • MediaWiki will be up to date on all scripts (Nick, Zack)
  • Older experiments will be archived and in the proper folders (Nick, Zack)

Data

Team Members

Overview

Our aim is to ensure that all types of data are accurate and up to date. One way we hope to accomplish this is by making improvements to the transcript file. As it is now, every word in the transcript that is enclosed in brackets is removed by the genTrans.pl script. For example, because the speaker is laughing while talking, the second half of this line is lost:

it's so uh okay very good i guess we've kind of covered our subject matter since [laughter-neither] 
[laughter-one's] [laughter-really] [laughter-into] [laughter-gardening] [laughter-are] [laughter-we]

Our aim is to change this script so that the [laughter] tags, and any other tags, are removed while the words themselves remain preserved. This will help to improve the Word Error Rate. Another way we hope to keep data up to date is by adding new words to the dictionary file. To achieve this, we will create a word file containing only new words, along with their predicted pronunciations as generated by the CMU Lexicon Tool. As a reference for new words, we will use the Oxford English Dictionary, to which new words are added four times per year. A hand file, which will contain self-made corrections for pronunciations, will also be maintained in the event that the pronunciations provided by the Lexicon Tool are not accurate.
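The intended change to the tag handling can be sketched as follows (shown in Python for illustration; the actual genTrans.pl script is Perl, and the assumption that all non-laughter bracketed markers should simply be dropped needs to be confirmed against the transcripts):

```python
import re

def clean_line(line):
    # [laughter-word] -> word : keep the spoken word, drop the tag
    line = re.sub(r"\[laughter-([^\]]+)\]", r"\1", line)
    # remove any remaining standalone bracketed markers, e.g. [noise]
    line = re.sub(r"\[[^\]]+\]", "", line)
    # collapse the whitespace left behind by removed markers
    return " ".join(line.split())

sample = ("it's so uh okay very good i guess we've kind of covered our "
          "subject matter since [laughter-neither] [laughter-one's] "
          "[laughter-really] [laughter-into] [laughter-gardening] "
          "[laughter-are] [laughter-we]")
print(clean_line(sample))
# -> ... subject matter since neither one's really into gardening are we
```

The same two substitutions translate directly into Perl `s///` operations inside genTrans.pl.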

Goals

  • Update the genTrans.pl script to preserve words enclosed in brackets using regular expressions
  • Routinely study the transcript file to identify new cases that are to be addressed in the script
  • Create and maintain a word file/hand file for adding new words to the dictionary
  • Create a script that will add only new words from the word file to the dictionary then sort them alphabetically
  • Improve quality of data to raise percentage of words correctly interpreted
  • Update broken 'linkTransAudio.pl' script and revise documentation regarding creation of new corpus sizes
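The dictionary-update script described above could work along these lines (a Python sketch; the real script will likely be Perl, and the WORD-followed-by-phones line format is an assumption based on CMU dictionary conventions):

```python
def merge_new_words(dict_lines, word_lines):
    """Add entries from the word file whose headwords are not already in
    the dictionary, then return all entries sorted alphabetically.

    Each line is assumed to look like:  WORD  PHONE PHONE ...
    (hypothetical format; check against the project's dictionary file).
    """
    known = {line.split()[0] for line in dict_lines if line.strip()}
    merged = [line.strip() for line in dict_lines if line.strip()]
    for line in word_lines:
        if line.strip() and line.split()[0] not in known:
            merged.append(line.strip())
            known.add(line.split()[0])
    return sorted(merged)

print(merge_new_words(["CAT K AE T", "DOG D AO G"],
                      ["CAT K AE T", "EMOJI IH M OW JH IY"]))
# -> ['CAT K AE T', 'DOG D AO G', 'EMOJI IH M OW JH IY']
```

Keying on the headword alone means an existing pronunciation is never silently overwritten; hand-file corrections would be applied as a separate, later pass.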

Plan

Implementation Timeline

Feb 22nd, 2017

  • Look through transcript file to find other types of bracket markings (MJ)
  • Begin work on the revised 'genTrans.pl' script (Matt, John)
  • Begin adding new words with pronunciations to the word file (MJ, Dylan)
  • Revise corpus creation documentation/scripts (Dylan)

Mar 1st, 2017

  • Continue looking through transcript file for other types of bracket markings (MJ)
  • Figure out regular expressions to be used in revised 'genTrans.pl' script (Matt, John)
  • Continue adding new words to the word file (MJ, Dylan)
  • Begin work on script for adding words from the word file to the dictionary file (Dylan, MJ)

Mar 8th, 2017

  • Continue looking through transcript file for other types of bracket markings (MJ)
  • Test and revise a new version of the 'genTrans.pl' script (Matt, John)
  • Continue working on dictionary script (Dylan, MJ)

Mar 15th, 2017

  • Continue looking through transcript file for other types of bracket markings (MJ)
  • Finalize 'genTrans.pl' script (Matt, John)
  • Continue working on dictionary script, add sorting functionality (Dylan, MJ, Matt, John)

Mar 22nd, 2017

  • Finalize the dictionary script (Dylan, MJ, Matt, John)
  • Add the new words and pronunciations to the dictionary file (Dylan, MJ, Matt, John)
  • Test new dictionary file (Dylan, MJ, Matt, John)

Continuous Throughout Semester

  • Update and improve documentation from past semesters
  • Any tasks that other groups may need

Systems

Team Members

Overview

We need to ensure proper maintenance and updates of the environment (or system) in which our project is housed and run. The previous year was primarily focused on the upgrade of the Dell PowerEdge servers which were outdated, to say the least. Now that the servers have been fully upgraded, it is imperative that we inspect and preserve the state of the servers to ensure proper functionality as intended.

Most of the maintenance involved in these tasks will include error checking and file structure integrity; this includes both critical and non-critical errors. We know that it is important to have a complete understanding of what occurs in the environment so that we can effectively document the compatibility and functionality for the installation of future programs/features.

As a side project, we are testing new ways to make communication more effective and secure, such as the implementation of an IRC server (which the previous year installed but did not configure), as well as a private server for document sharing. This would remove the need to use outside resources such as Google or Office365 for communication.

Goals

Our main objective, in regard to the systems, is to perform a system scrub, making sure that the environments are free from errors.

  • Analyze errors on the Dell PowerEdge 1950’s
  • Check logs for any residual errors and document our findings
  • Determine whether or not to update Red Hat and other features to newer versions
  • Configure the IRC for efficient communication
  • Setup a temporary private server for file sharing
  • Setup a monitoring system
  • Setup a configuration management system

Plan

Implementation Timeline

Feb 15th, 2017

  • Research open source system monitoring of hardware resources (Andrew, Mark)
  • Research open source configuration and system management software (Andrew, Mark)
  • Troubleshoot servers that are not accessible (Andrew, Mark, Julian & Bonnie)
  • Assess errors on servers to determine if any maintenance is necessary (Andrew, Mark, Julian & Bonnie)

Feb 22nd, 2017

  • Have a solid status for each machine and know if/how it needs to be reconfigured (Andrew, Mark, Julian & Bonnie)
  • Upon completion, discuss the needs of each machine that requires repair (Andrew, Mark, Julian & Bonnie)
  • Discuss creating a private file share system with our Client (Professor Jonas), along with the security it would require (Andrew, Mark, Julian & Bonnie)
  • Log any current errors and what fixes were applied to address them (Andrew, Mark, Julian & Bonnie)

Mar 1st, 2017

  • Have the non-working drones fixed and fully functioning (Andrew, Mark, Julian & Bonnie)
  • Start moving forward on the hardware requirements for the file share server (Andrew, Mark, Julian & Bonnie)
  • Retrieve hardware information to update MediaWiki documentation (Andrew & Julian)

Mar 8th, 2017

  • Have the file share fully functional, as well as the IRC server; ensure the other systems are functioning as expected (Andrew, Mark, Julian & Bonnie)
  • Research Torque and install procedures (Bonnie)
  • Make sure RSync backup service is running properly (Andrew, Mark, Julian & Bonnie)

Mar 15th, 2017

  • Research hardware requirements for setting up OwnCloud server to allow sharing of files securely, if necessary (Andrew, Mark, Julian & Bonnie)
  • Update documentation (Andrew, Mark, Julian & Bonnie)
  • Ensure that all servers are up and running as expected (Andrew, Mark, Julian & Bonnie)

Mar 22nd, 2017

  • Document our findings on errors (Andrew, Mark, Julian & Bonnie)
  • Document any installs and the effect on the servers (Andrew, Mark, Julian & Bonnie)
  • Ensure that all servers are up and running as expected (Andrew, Mark, Julian & Bonnie)

Continuous Throughout Semester

  • Support the needs of the project during time allotted (Andrew, Mark, Julian & Bonnie)