Speech:Spring 2016 Report


Overview

During the Spring 2016 semester of the Speech Recognition Capstone Project, our goal has been to develop a "world-class" baseline result to lay the foundation for future research into speech recognition. Our system uses Carnegie Mellon University's "Sphinx 3" speech recognition software, and our data comes from the approximately 300-hour "Switchboard" voice corpus.

The "baseline" refers to achievement of the lowest-possible Word Error Rate (WER) on "unseen" data (data that the system hasn't been exposed to yet).

Team Sub-Groups and Improvements

The class was initially divided into five groups of approximately equal size (3-4 persons). Each group was assigned an area of responsibility pertaining to some aspect of the system or the services and hardware that support it.

Broadly described, these groups were:

Modeling Group: responsible for understanding and building Acoustic Models and Language Models, and serving in an advisory role to others on training/decoding

Systems Group: responsible for maintenance and improvements to the physical hardware on which the Sphinx CMU system runs

Data Group: responsible for verifying the integrity of the corpus data and correcting any problems that were uncovered

Tools Group: responsible for maintenance and improvements to the software the Sphinx CMU system depends on to run

Experiments Group: responsible for refinements of the scripts and tools which are used to run experiments

Group Membership

Modeling Group: Ben, James, Jonathan S., Ryan
Systems Group: Aaron, Michael, Neil, Saverna
Tools Group: Daisuke, Jonathan T., Nigel, Thomas
Data Group: Brenden, Brian A., Brian D., Justin
Experiments Group: Kevin, Matthew, Meagan, Peter

Competitive Groups and Improvements

Two competitive groups were formed with approximately six weeks remaining in the semester. The purpose of the competition was to achieve the lowest WER on the 5-hour dev.trans transcript drawn from the 300-hour corpus. Details on the competition can be found further down in the report.

Modeling

Introduction

The modeling group is responsible for speech research and for running experiments based on that research to achieve progressively lower word error rates on unseen data. The goal of this process is to establish a new world-class baseline for the Switchboard corpus; the current best Switchboard baseline is 25.2% as of 2015 (click this link for more information). Over the spring semester, the modeling group made many strides in speech research and developed several scripts that the data group can use to regenerate the audio utterances of the Switchboard corpus and create new corpora based on the full Switchboard corpus.

To better understand the results achieved by the modeling group, laid out below is a crash course in speech recognition covering the various components of Sphinx and the general steps involved to decode audio to text.

Miscellaneous Words to Understand

  1. Feature: a feature is typically a 10ms slice of audio data
  2. Phoneme: a phoneme is a distinct unit of sound, composed of multiple features

Components

  • Acoustic Model
The acoustic model, generally speaking, is a mapping of features to phonemes. To create the acoustic model, training must be performed; training essentially finds the best match between features and phonemes. The process of training (i.e., creating the acoustic model) also requires the help of a dictionary. More on this below.
  • Language Model
The language model provides context to the acoustic model. For example, in an English sentence you would not normally expect to see a verb immediately following another verb; you would expect a noun, for instance. This is what the language model helps with. Without it, no such restriction is placed on the acoustic model, and the decoded results may not be what you expected.
  • Dictionary
The dictionary is a mapping of words to phonemes. Word recognition is limited to the scope of the dictionary: words that are not in the dictionary will not be recognized. The larger the vocabulary being trained and decoded on, the larger the dictionary required. For example, a small dictionary can be used when the expected vocabulary is small, say when giving commands to a dog; such a dictionary could consist of only simple words such as "sit", "stay", "speak", "down", "come", etc. A dictionary for conversational speech must contain many more words. A few example entries are shown below.
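
To make the format concrete, here is what a few entries look like in the CMUdict/ARPAbet style that Sphinx uses (word on the left, phoneme sequence on the right). These particular lines are illustrative, not copied from the project's dictionary:

  SIT    S IH T
  STAY   S T EY
  SPEAK  S P IY K
  DOWN   D AW N
  COME   K AH M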

Process of Speech Recognition

  1. Generate the dictionary. Remember, this needs to be used in the process of training.
  2. Utilizing the dictionary, start training a.k.a. generating the acoustic model
  3. Using the combination of the acoustic model, language model, and dictionary, decode audio data to text (a command-level sketch of these steps follows this list)
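
As a rough illustration, the steps above map onto commands like the following. This is a hedged sketch only: the experiment path is hypothetical, the training command assumes the SphinxTrain 1.0 layout used on the drones, decoding is normally driven through the project's run_decode.pl wrapper, and the sclite reference transcript is a placeholder (see the tutorials for exact arguments).

  cd /mnt/main/Exp/0288/003                 # hypothetical experiment directory
  perl scripts_pl/RunAll.pl                 # step 2: train the acoustic model from train.trans
  # step 3: decode unseen audio with the acoustic model, language model, and dictionary
  # (run via run_decode.pl) to produce hyp.trans, then score the hypothesis:
  sclite -r <reference>.trans -h etc/hyp.trans -i swb -o all lur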

As mentioned above, this has been a crash course in speech recognition; it is not meant to tell you everything about how speech recognition works, but it should give you enough knowledge and context to better understand the results below.

Areas of Focus

The primary goal of this semester was to achieve a world-class baseline (25.2%) on unseen data. Previous semesters had not reached the point where they could test on unseen data, so we did not have a class baseline on unseen data to build upon. In the first few weeks we read tutorials and logs in order to understand what speech recognition was, how modeling plays a role, and what infrastructure was available to us via Caesar and the drones. During that time, the data group discovered possibly bad data in some of the utterance or conversation files. Our goal for the next couple of weeks was to identify those files, generate clean data, and perform tests to confirm the data was fixed. We wrote a script to regenerate the audio files using sox, converted the audio from dual-channel to single-channel, fixed some minor bugs (due to discrepancies in the sampling sizes of audio files), and verified the data.
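
The per-utterance regeneration boils down to a sox conversion similar to the one below. This is a hedged sketch: the file names are placeholders and the exact options used by the class's script may differ, but channel/rate handling of this kind is standard sox usage.

  # extract channel A of a two-channel Switchboard conversation as a mono, 8 kHz, 16-bit file
  sox sw02001.sph -b 16 -r 8000 sw02001-A.wav remix 1
  # channel B would use "remix 2"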

Next, we needed to generate new corpora utilizing the cleaned audio. We wrote several scripts that made generating a corpus, sampling a transcript, and generating audio files easier. More information can be found in the logs of Jon Shallow and Brenden Collins. Most notably, the sampling scripts allowed us to "remove" a dev.trans and an eval.trans transcript file from the main corpus train.trans file. This allowed us to train the acoustic model on "seen" data (the train.trans) but then decode on "unseen" data, data that the software had never touched until that point (dev/eval).
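
In spirit, the held-out split works like the shell sketch below. The real sampling scripts live under /mnt/main/scripts and select utterances more carefully; the commands and line counts here are purely illustrative.

  # shuffle the master transcript, hold out ~5 hours of utterances, keep the rest for training
  shuf train.trans > shuffled.trans
  head -n 5000 shuffled.trans > dev.trans          # held-out "unseen" set (illustrative size)
  tail -n +5001 shuffled.trans > train.seen.trans  # remaining "seen" training transcript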

At that point, we split into groups for the competition. During that time, both teams pursued very similar strategies. This is where we discovered the world-class baseline mentioned earlier of 25.2%. In our research we also found a presentation from CMUSphinx that went into detail about more advanced training methods such as VTLN, LDA/MLLT, Force Alignment, etc. Unfortunately, many missing dependencies prevented us from implementing these. Ultimately, we achieved a 48.4% WER on unseen data. We also discovered the npart setting in sphinx_train.cfg that allows the training process to be multi-threaded, which significantly reduced training time. When we applied the same changes to sphinx_decode.cfg, we realized that sphinx_decode.cfg is never actually read by the software. See the Competition Report for more details.

Finally, in the last weeks of the semester, the data group discovered inconsistencies with the <s> and </s> silence tags when decoding. SCLITE would always recognize the silences and score them as correctly recognized words. This skewed our results by artificially decreasing the reported WER: the experiment that scored 48.4% WER had scored 41.8% before this was discovered.
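
One way to keep the tags from inflating the score is simply to strip them from the hypothesis transcript before running sclite, as in the hedged one-liner below (whether the class's fix was implemented exactly this way is not recorded here):

  # remove <s> and </s> tokens from the hypothesis transcript before scoring
  sed -e 's|<s>||g' -e 's|</s>||g' hyp.trans > hyp.nosil.trans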

The final area of focus was figuring out a way to properly use the sphinx_decode.cfg file, so that decoding parameters could be experimented with instead of just training parameters. To do this, we built the decode_config.pm Perl module. It allows future semesters to change any of the parameters the sphinx decoder uses, and it should be looked into in the future. The README file in the module explains it in further detail.

Documentation

The modeling group documentation consists of the Modeling Group Wiki Log, the Experiment log, and separate individual group member logs (Ryan, James, Jon, Ben). Further documentation consists of tutorials for running a train experiment, decoding, and scoring. These required documenting Linux commands and Sphinx configurations in the Wiki tutorials in order to give users a better sense of how to modify their own experiments and compare results against documented experiments.

Systems Group

Introduction

The Systems group is responsible for the care and maintenance of the systems upon which the Sphinx Software runs. This semester's Systems group was tasked with the upgrade of five servers (Asterix, Obelix, Idefix, Miraculix, and Majestix) from old Dell Poweredge 1750s to newer model Dell Poweredge 1950s, as well as the configuration of Rome, the future home of a dedicated IRC Server for the UNH Manchester Speech Project.

Areas of Focus

The installation, configuration and testing of the five new servers, and the configuration of a sixth, Rome, were the primary focus of the Systems Group this semester.
The team had to take care not to disturb two PE1750 servers that were being used by grad students to research multi-processing.

Replacing old equipment

  • The five PE1750s that were in the racks at the beginning of the semester needed to be removed and replaced with newer PE1950 models. This was accomplished entirely in the second week of class. The installation was fairly simple for all servers: they are rack-mounted with rail kits, so the swap was very straightforward. Once the units were in place and recabled, the software phase could begin.

Redhat installation

  • Once the units were in place, we began by installing Redhat on Asterix. This process was very straightforward, and is fully documented on the System Software Setup page.
  • The rest of the servers each ran into some issues during installation.
    • Obelix was initially installed CLI-only. The installation itself went fine, but the wrong option was selected during installation, so the desktop version was reinstalled.
    • Idefix, Majestix, and Miraculix all had foreign-configured disks because the disks we were using had previously been configured for a different system. This required accessing and configuring the SAS BIOS to wipe the previous configuration that was preventing the servers from booting. This threw us off the trail for over a week; we initially thought the problem had to be the "ROMB Battery Power" warning on the LED screen on the front of the servers that were giving us issues. All of this is outlined on this page.

Network configuration

  • This is a fairly simple process. The servers are all running a version of Redhat. By editing the files /etc/hosts, /etc/sysconfig/network, /etc/resolv.conf, and /etc/sysconfig/network-scripts/ifcfg-eth1, you are able to assign an IP that will allow the server to communicate with the network. This process is documented fully here.
    • /etc/hosts should be configured to hold the IPs and hostnames of the other servers on the LAN
    • /etc/sysconfig/network can be used to change the hostname
    • /etc/resolv.conf should be configured to hold the DNS of the server
    • /etc/sysconfig/network-scripts/ifcfg-eth1 is the configuration of the local network interface card. Put the local server IP in this file (an example is shown below)
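
For reference, a minimal interface file looks roughly like the following; the device name matches the list above, but the addresses are placeholders rather than the project's real LAN settings:

  # /etc/sysconfig/network-scripts/ifcfg-eth1 (illustrative values)
  DEVICE=eth1
  BOOTPROTO=static
  IPADDR=192.168.10.12
  NETMASK=255.255.255.0
  ONBOOT=yes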

Mounting Caesar's /mnt/main

  • Since we don't want to install Sphinx on every drone, Caesar is set up to host /mnt/main to share resources and save disk space. This is a fairly simple process, but it was so poorly documented until now that we spent many hours trying to figure out how to do it properly. We were extremely careful during this process: Caesar's /mnt/main is the brain of the project, and damage to it would be a catastrophe. This process is also detailed at the bottom of this page, and a hedged example of the mount itself appears below.
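
Assuming the share is exported over NFS (the usual mechanism for this kind of setup, though the exact export options are documented on the page linked above), mounting it on a drone looks roughly like this:

  # one-off mount of Caesar's shared directory on a drone
  mount -t nfs caesar:/mnt/main /mnt/main
  # or make it persistent by adding a line like this to /etc/fstab:
  # caesar:/mnt/main   /mnt/main   nfs   defaults   0 0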

Unique Servers

Of the five servers installed and the sixth configured (Rome), two were unique to the project. Further information on these unique servers, Majestix and Rome, is detailed below.

Majestix

Majestix was set up and configured to be a "Tools Group Sandbox" to work on Emacs, rsync, and any miscellaneous tools that need testing without risking Caesar's file system in any way. It did not have Caesar's /mnt/main directory mounted to it for this reason. Once Redhat was installed on Majestix, it needed to be configured to use the internet. Once we isolated a cable that had a connection to the net, we began configuring the network card to use it. Unfortunately, this also became a multi-week process, as the network configurations that Caesar was using weren't working on Majestix. We toiled over this for hours. We finally brought the Prof. in to try and get some insight, and he found that someone or something had changed Caesar's DNS entries. Very mysterious indeed. The configurations were eventually hashed out after a power-outage brought some of the servers down over a late-semester weekend. Jonathan T ended up installing his software by FTPing compressed archives through Caesar, so the internet connection was returned to the server it came from at the end of the semester.

Rome

Rome was not used for speech experiments. Like Majestix, Rome's /mnt/main initially remained local. This ensured that any installations made to Rome would not hinder the speech progress being made. The server was initially intended to be a backup for Caesar and a host for a "Capstone IRC Channel". When we first got our hands on the unit, it was running OpenSUSE from semesters long gone by. Our first task was to nuke the OS and install Redhat, like the other drones on the LAN. Not long after installing the OS, we found that the server was running on only 2GB of RAM. At that point, Rome was put on the back burner, with the intent of installing an IRC server at the least. Come the end of the semester, that plan was shelved for a future date. Tom R of the Tools Group needed to take control of Rome in order to set up rsync so that Caesar would have at least some form of backup. This was accomplished in the final weeks of the semester.

Documentation

Our documentation is currently split between our log and the info page. The log has our documentation of the HDD errors and our solutions, while the info page has documentation on the network bridge, and the creation of the banners. The info page will also contain updated documentation of the progress made on the IRC Server, Rome.

Tools Group

Introduction

The tools group was responsible for researching existing speech tools, and the installation and configuration of new utilities to enhance the systems. This section will be a report documenting the progress made in accordance with the proposal found at Tools Group Proposal Spring 2016.

Main Speech Tools

The first part of our research was to determine the current versions of the main speech software. The main speech-related tools consist of Sphinx Trainer, CMU Dictionary, CMU Language Model Toolkit, Sphinx Decoder, SCLITE, and SOX.


Sphinx Trainer

Sphinx Trainer is used to train models for processing audio. The version that Capstone was using was Sphinx Trainer 1.0. After some research a newer version was found, Sphinx Trainer 1.0.8 (https://sourceforge.net/projects/cmusphinx/files/sphinxtrain/1.0.8/). The differences between version 1.0 and 1.0.8 were small: the new version added a single sphinxtrain command to access all training processes and also fixed some memory leaks and build issues. The latest version of Sphinx Trainer is intended for Sphinx4, which we are currently not using. For these reasons, the recommendation was not to upgrade.


CMU Dictionary

CMU Dictionary is a list of words and their phonetic spellings. The version that Capstone was using was CMU Dictionary 0.6. The latest version is CMU Dictionary 0.7 (http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b), which had also been available last year. The new version has some additional words as well as a new file format. The recommendation was not to upgrade.


CMU Language Model Toolkit

The CMU Language Model Toolkit is a set of Unix software tools designed to facilitate language modeling. Some of these tools process general textual data into useful speech files, and others use the resulting language models to compute various speech statistics. We are currently using the most recent version, so there was no need to upgrade.


Sphinx Decoder

Sphinx Decoder is used to decode seen or unseen audio files. The version that the previous Capstone class was using was Sphinx Decoder 3.7. A newer version of Sphinx 3 was found, Sphinx Decoder 3.8, but we chose not to upgrade. The Sphinx 4 decoder (version 4.5) was also an upgrade option, but we chose not to upgrade to it after research showed that the Java-based Sphinx 4 decoder did not increase train and decode speed. The CMU Sphinx site also specified that the Sphinx 4 decoder is not faster than the Sphinx 3 decoder unless the system it runs on is optimized, and even then it offers only a slight performance increase.


SCLite

SCLite is a tool for scoring and evaluating the output of speech recognition systems, i.e., rating the quality of the models. The current version is 3.9.15 (32-bit) and the latest version is 4.3.1 (http://www.nist.gov/itl/iad/mig/tools.cfm). No significant differences between these versions could be found. The recommendation was not to upgrade.


Sox

Sox is an audio processing tool used to manipulate sections of audio before use in the speech recognition system. At the beginning of the semester Prof. Jonas indicated that although Sox was installed, it was no longer working as a result of an OS change from OpenSUSE to Red Hat. Upon researching we found there were a few missing dependencies and patched the system using the hotfix documented at https://foss.unh.edu/projects/index.php/Speech:Function_Sox .

Additional Tools

When looking at what tools might be added to the system, we took several factors into consideration. First we looked for things that would have a positive impact on the general programming side of the project, and second at the team aspect of the project. We also considered the amount of system modification needed to add each tool.

Professor Jonas indicated at the start of class that he would like Emacs installed and used. Emacs is a more convenient and feature-rich text editor, used mostly on Linux systems, that can make scripting or coding much easier.

After researching some options we recommended the installation of three additional tools. The first is screen, a console session sharing tool that can be useful for long-distance collaboration. This tool allows more than one user connected to a remote system to see the same console; a group could use it to review scripts together in real time, as well as leave sessions running and come back to them at a later point. The second tool the group recommended for installation is tree, a program that allows the user to view the structure of a file system in a tree-like recursive fashion, without running the "ls" command inside each directory separately. Here are the tree usage instructions.
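
Typical usage of the two utilities looks like this (the directory shown is just an example path on the shared filesystem):

  tree -L 2 /mnt/main/Exp        # show the experiment directory structure two levels deep
  screen -S capstone             # start a named screen session
  screen -x capstone             # a second user attaches to the same live session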

One of the more obvious areas for improvement was system security. Every user of the project has root (full administrator) rights on these research systems. This is dangerous because it can lead to accidental deletion or editing of files that were not fully backed up on a regular basis. The group recommended the installation of an incremental backup system to enable version history and prevent data loss.
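
To illustrate what "incremental backup" means in practice, a minimal rsnapshot-style configuration sketch is shown below; the paths, hosts, and retention counts are placeholders, not the settings actually deployed on the backup VM (note that rsnapshot's config file is tab-separated):

  snapshot_root   /backup/snapshots/
  retain  daily   7
  retain  weekly  4
  backup  root@caesar:/mnt/main/   caesar/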

All tools were successfully installed and documented.

Tree

Emacs

Screen

Rsnapshot Backup System

Recommendations for Future

Although great progress was made implementing new tools this semester, it appears the work will have to be reapplied, as all of the new clone servers were installed with 32-bit Red Hat. Although this is not a problem for the tools we installed, it does not match the main server Caesar, so the hotfixes needed to be applied a bit differently and extra lib files had to be included. It is our recommendation that all servers be moved to the same architecture in the future to improve compatibility for future changes.

The backup system is also currently being run from a fairly resource-limited VM in room 124. I would recommend monitoring this system during backups to see if it is CPU or memory bound and adjusting accordingly. The system is also connected to Rome through a 10/100 switch. This will certainly prove to be a bottleneck for large backups and should be upgraded to a gigabit (10/100/1000) interface.

In addition, I would recommend setting up a mail relay for the backup system. This would allow a nightly email of file changes to be sent to the professor automatically, which could help identify more quickly whether important system files were changed that perhaps should not have been.

For group meetings, Screen should be used so that each student can share what they are currently doing, whether they are in the same classroom or working from home. It allows multiple users to join and see the same command screen, which makes it useful for group meetings, especially for small groups that may be meeting in any situation.

For editing text, Emacs should be used as it is more powerful than vi or nano. Macros can be utilized for faster and more reliable editing/file creation. Emacs also utilizes enriched text, bringing color to the terminal.

Data

Introduction

The Data Group is responsible for the Switchboard Corpus data that is used to run trains and decodes to establish a low word error rate and create a world-class baseline for speech recognition. The Switchboard Corpus is composed of 256 hours of audio telephone conversations, broken into conversation files (containing the audio between two individuals) and utterance files (which further segment the audio into files capturing a specific phrase or sentence spoken by a single individual). All audio files correspond with a transcript file, which textualizes the spoken conversations and utterances to ensure proper analysis when creating a baseline.

The Data Group manages, organizes and ensures the integrity of these audio files to guarantee accurate and true results for word error rate calculations that will help determine a world-class baseline. While tweaks to software configurations and longer trains on data will decrease WER (Word Error Rate) percentages, if the Capstone Project trains on inaccurate and incomplete data, the results produced from training and decoding will be greatly flawed.

During the 2016 Spring semester the Data Group focused on validating corpus audio files, correcting corrupt corpus audio files, building optimal corpus sizes and scoring with SCLite per utterance rather than per speaker. Below is a thorough explanation of our areas of focus with our findings.

Areas of Focus

Validating corpus audio files

  • The main area of focus this semester was to ensure correct and valid audio files were being used in the training and decoding process. Previous semesters had never focused on the data itself, only on the organization of it. We started by identifying where the data was located, how we were going to get a random sample, how we were going to listen to it, and how we were going to evaluate that it was good data. The entire Switchboard corpus contains 250,330 utterance files, which would need to be listened to and matched to the transcript file. We decided to evaluate a 1% sample, or roughly 2,500 utterances, because manpower and time constraints for the semester made trying to listen to all 250,330 files impossible. The first week of audio review revealed nothing out of the ordinary, but the second week of analysis showed that a large portion of one group member's audio files were completely wrong compared to the transcript. The following weeks turned up more errant audio files, bringing the estimated total of incorrect audio files to around 25,000. Any error in the data will flaw the Word Error Rate, and with roughly 10% of the data being incorrect, we knew that finding and correcting this would greatly improve Capstone's results compared to the previous 4 years of research.

Correcting corpus audio files

  • With the help of the Modeling Group, audio files were reloaded onto the system to correct the suspected 25,000 incorrect utterance files. The Modeling Group had also identified portions of the transcript that didn't have any corresponding audio files, so those portions were omitted from the transcript. After the reload was completed, the Data Group performed another random sampling of the entire corpus, listened to those audio files as well as the previously incorrect audio files, and felt confident that all utterance files now corresponded with their transcript file entries.

Building optimal corpus sizes

  • The Switchboard corpus is advertised as containing approximately 256 hours of audio, but we discovered that the actual total is 311 hours. With this information, the Data Group was tasked with building new corpora that would better reflect the whole corpus size, as well as removing segments to allow for decoding on 'unseen data'. Segments named eval.trans and dev.trans, each containing roughly 5 hours of data that was not trained on, were removed from each of the corpora that were created. The largest corpus built was named 300hr: 311 hours minus 10 hours for eval.trans and dev.trans. For decoding on 'seen data', another 5-hour segment called test/train.trans was created; however, those utterance files were sampled from the trained transcript and not removed like eval.trans and dev.trans.

Scoring with SCLite per utterance rather than per speaker

  • When scoring completed decodes, the report that is generated shows how each speaker in a conversation scores rather than how every utterance file is scored. To generate a Labeled Utterance Report (LUR), first go into either the etc directory of your experiment if you ran a train on seen data, or the DECODE directory of your experiment if the data is unseen. In that directory you may find one or more decode.log files. Run the command (/mnt/main/scripts/user/parseDecode.pl decode.log ../etc/hyp.trans) to create a hyp.trans for one or all of the decode.log files. In our case the data was unseen. After that is finished, run the scoring command (sclite -r <exp#>_train.trans -h hyp.trans -i swb -o all lur) for each of the hyp.trans files using the correct train.trans. That command produces 4 different files each time it is run; the most important is hyp.trans.sys, which has the WER for each utterance and the total WER of that section of the corpus. A condensed walkthrough of these steps appears after this list.
  • Two experiments were run to decode the whole 300hr corpus. Experiment 0284/007 ran the first half of the 300-hour corpus, or 121,165 utterances, and Experiment 0284/008 ran the second half, or the last 121,165 utterances. After creating the LUR we came to the conclusion that the total WER of the full corpus averages to ~41%. We did not go through each specific utterance, which is something next year's Capstone should look into. All of the LUR tables that were produced are too big to put on the wiki. The locations of each report are as follows:
  • Experiment 007
    • /mnt/main/Exp/0284/007/etc/hyp_1.trans.sys 42.6% WER
    • /mnt/main/Exp/0284/007/etc/hyp_2.trans.sys 42.2% WER
    • /mnt/main/Exp/0284/007/etc/hyp_3.trans.sys 41.3% WER
    • /mnt/main/Exp/0284/007/etc/hyp_4.trans.sys 41.1% WER
  • Experiment 008
    • /mnt/main/Exp/0284/008/etc/hyp_1.trans.sys 41.8% WER
    • /mnt/main/Exp/0284/008/etc/hyp_2.trans.sys 44.3% WER
    • /mnt/main/Exp/0284/008/etc/hyp_3.trans.sys 41.0% WER
    • /mnt/main/Exp/0284/008/etc/hyp_4.trans.sys 38.9% WER
  • Total WER: 41.65%
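
Putting the steps above together, a typical unseen-data scoring pass looks roughly like the following; the experiment path is one of the examples above, and <exp#>_train.trans stands for the reference transcript of that experiment:

  cd /mnt/main/Exp/0284/007/DECODE
  /mnt/main/scripts/user/parseDecode.pl decode.log ../etc/hyp.trans      # build the hypothesis transcript
  cd ../etc
  sclite -r <exp#>_train.trans -h hyp.trans -i swb -o all lur            # writes hyp.trans.sys with per-utterance WER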

Documentation

Documentation exists in the form of Data Group member logs and the Data Group Wiki Page Speech:Spring_2016_Data_Group, which highlights some of the Data Group's accomplishments and allows future semesters to pick up where the Spring 2016 Data Group left off. The Data Group consisted of Brian Anker, Brenden Collins and Justin Gauthier.

Experiments

Introduction

The experiment group was one of five groups within the Capstone class, which worked on creating a world-class baseline for speech recognition software. The group decided that the best way to contribute to this goal was to make documentation and organization tasks on foss.unh.edu and the server Caesar easier to complete. We created two Perl scripts, addExp.pl and makeTest.pl, as well as an easier archiving system (for current and past Perl scripts). The last thing created was a document explaining the purpose of each Perl script so that current and future users have a better understanding of the scripts' uses.

Areas of Focus

addExp.pl

  • The purpose of addExp.pl was to simplify how a user documents the creation of an experiment as well as the creation of a sub-experiment. It took two existing scripts, createWiki_Experiment.pl and createWiki_Sub_Experiment.pl, and combined them into one. It also allows the user to specify whether they are creating an experiment or a sub-experiment and, if it is a sub-experiment, which main experiment to create it under, all from the command line. This streamlines the process of documenting the creation of an experiment or sub-experiment on foss.unh.edu.

makeTest.pl

  • The purpose of makeTest.pl was to set up the series of files needed for a decode. The script evolved to a state where it could handle source acoustic models that were trained in a different experiment, and it proved to be very useful in setting up an experiment using unseen data. It copies only the files that are required for a decode, and it uses a softlink from the source for the model parameters, provided that the folder doesn't already exist.

Documentation

  • The experiment group consisted of four members: Matthew Heyner, Peter Ferro, Meagan Wolf, and Kevin Soucey. Each log contains that member's personal "journey" through Capstone, including how each of us went through learning the different aspects of speech recognition, from starting a train to performing a decode; it can all be found here.
  • The top portion describes how the software repository is set up and how to properly maintain it for years to come. The Scripts page contains the information needed to understand which scripts are important for a specific system. The Spring 2016 Experiment group spent a great deal of time configuring how this page is presented, providing important information on how each script works from the inside out. If you need to know what the script developer's intent was for a script and how it is used, go to this link and take a look.

Team Strategies

Initial Teams

Captain America

  • Members: Saverna Ahmad, Brian Anker, Peter Farro, Justin Gauthier, Daisuke Matsukura, Thomas Rubino, Michael Salem, James Schumacher, Jon Shallow, and Meagan Wolf

Iron Man

  • Members: Neil Champagne, Brenden Collins, Matthew Heyner, Benjamin Leith, Aaron Miller, Ryan O'Neal, Kevin Soucey, Nigel Swanson, and Jonathan Trimble

At the start of this competition, both teams were competing against one another, attempting to get as low a WER on unseen data as possible. However, with two weeks left to go, after both teams had exhausted ideas and/or run into roadblocks, we decided to join forces to see if, together, we could do even better than what we had accomplished separately.

Final Team

Capstone

  • Members: Saverna Ahmad, Brian Anker, Neil Champagne, Brenden Collins, Peter Farro, Justin Gauthier, Matthew Heyner, Benjamin Leith, Daisuke Matsukura, Aaron Miller, Ryan O'Neal, Thomas Rubino, Michael Salem, James Schumacher, Jon Shallow, Kevin Soucey, Nigel Swanson, Jonathan Trimble, and Meagan Wolf

When trying to create the most accurate models, team Capstone narrowed down a few parameters that need to be altered depending on the state of the data being used. The parameters we scrutinized through multiple experiments are listed under Training Parameters and Decoding Parameters below.

Team Capstone first established a baseline experiment. This was the control experiment which we would analyze our further experiment results against. 0288/003 (and re-verified in 0294/004) was used as our control. We then used a strategy revolving around the scientific method in order to achieve our goal:

Team Capstone Scientific Approach to Speech Recognition Experimentation

  1. Ask a question
    1. What does CFG_FINAL_NUM_DENSITIES affect?
  2. Do background research
    1. Research and cite sources
  3. Construct a hypothesis
    1. Based on our research of sources A & B, we believe CFG_FINAL_NUM_DENSITIES does X and increasing it to Y will do Z
  4. Test Hypothesis
    1. Run an experiment with all variables matching the control experiment except CFG_FINAL_NUM_DENSITIES, which is set to Y
    2. The control experiment was established prior to this; the control experiments used all default CMUSphinx settings (configuration files were not altered)
  5. Analyze your data and draw a conclusion
    1. The experiment with CFG_FINAL_NUM_DENSITIES set to Y had a 1.5% (simulated) improvement in Word Error Rate over the control experiment. This supports our hypothesis.
  6. Communicate Your Results
    1. Publish results on Foss Wiki and personal logs. Update any readme files or tutorials as appropriate.

Training Parameters

  • $CFG_NPART (npart) - The number of parts/threads the training is split into.
  • $CFG_N_TIED_STATES (senones) - The whole variety of sound detectors can be represented by a small number of distinct short sound detectors. Usually about 4000 distinct short sound detectors are used to compose detectors for triphones; those detectors are called senones. A senone's dependence on context can be more complex than just left and right context; it can be a rather complex function defined by a decision tree, or in some other way (senone).
  • $CFG_CONVERGENCE_RATIO - The ratio of the difference in likelihood between the current and the previous iteration of Baum-Welch (an iterative algorithm used to find hidden parameters within an HMM, a Hidden Markov Model) to the total likelihood in the previous iteration.
  • $CFG_FINAL_NUM_DENSITIES (density) - The number of Gaussians to be considered during speech processing. (A hedged excerpt showing these settings appears after this list.)
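
Since sphinx_train.cfg is itself a Perl file, the quickest way to see these settings for a given experiment is to grep for them. The command below is safe to run; the values shown in the comment are illustrative defaults, not the class's tuned settings:

  grep -E 'CFG_NPART|CFG_N_TIED_STATES|CFG_CONVERGENCE_RATIO|CFG_FINAL_NUM_DENSITIES' etc/sphinx_train.cfg
  # typical output looks like:
  #   $CFG_NPART = 8;                   # split Baum-Welch into 8 parts (one per core)
  #   $CFG_N_TIED_STATES = 4000;        # number of senones
  #   $CFG_CONVERGENCE_RATIO = 0.1;     # Baum-Welch stopping criterion
  #   $CFG_FINAL_NUM_DENSITIES = 8;     # Gaussians per senone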

The decoding side was handled slightly differently in terms of what parameters we altered. We used a script named run_decode.pl that was created by Michael Jonas during the summer of 2015. This script calls the s3decode binary directly with a set of parameters that it gathers from the arguments the user passes when invoking it.

Decoding Parameters

  • Language Weight (LW) - Allows you to fine tune balance between acoustic model prediction impact and language model prediction impact.
  • CTLOffset - The line offset into the control file at which decoding starts (for example, start decoding at line 250). Used with CTLCount to partition a decode across multiple threads.
  • CTLCount - The number of control-file lines to decode, starting from CTLOffset (for example, decode from line 250 to line 500). Used with CTLOffset to partition decodes into multiple threads. (A small sketch of how these offsets partition a decode appears after this list.)
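
To make the partitioning concrete, the short shell sketch below computes the -ctloffset/-ctlcount pairs for splitting one control file across several parallel decodes. The utterance count and number of parts are made-up examples; the resulting pairs are what get passed through to the decoder (for instance via run_decode.pl):

  TOTAL=4316; PARTS=4                              # illustrative control-file size and split
  CHUNK=$(( (TOTAL + PARTS - 1) / PARTS ))
  for i in $(seq 0 $((PARTS - 1))); do
      OFFSET=$(( i * CHUNK ))
      COUNT=$(( TOTAL - OFFSET < CHUNK ? TOTAL - OFFSET : CHUNK ))
      echo "part $((i + 1)): -ctloffset $OFFSET -ctlcount $COUNT"
  done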

Results

Team Capstone discovered methods to use all 8 cores of a machine, leveraging multi-threading for trains and decodes. This resulted in a training speed increase of 69.7% (0288/003 control compared to 0288/021 experiment) and a decode speed increase of 66.2% (0288/003 control compared to 0288/024 experiment). Note that the decode speed increase is not linear and varies depending on decode configuration settings; for example, higher-senone decodes see the speed improvement drop from 66.2% to approximately 50%.

It was also discovered that the file used for decode configuration settings was never being used by our decoding methods. Ultimately, changing the decode configuration settings did nothing for the actual decoder. We fixed this by writing a Perl module where a speech scientist can decide which variables the decoder uses and properly set them in accordance with experiment settings. The module allows access to manipulate all 152 decode configuration settings. This will make it much easier for future Capstone classes to dig further into the decode process.

Finally, Team Capstone attempted several advanced techniques such as Force Alignment, LDA/MLLT, VTLN, and others found in the Sphinx Benchmark Report from CMU. These experiments continually failed due to unmet dependencies. Further research comparing our sphinx3 software against the most current builds found on the CMUSphinx GitHub showed that we were using outdated software and, in some cases, were missing entire Python libraries (needed for VTLN) to make these methods work.

Conclusion

Competition Results

Because both teams had accomplished similar feats and been stopped in similar places, Team Captain America and Team Iron Man decided to combine forces for the last two weeks of class to see if the combined brain power could lead to a breakthrough. Unfortunately, we did not make any further strides, as the techniques that could reduce WER required dependencies that we did not have and/or could not supply.

Future Semesters

Future semesters have plenty of tasks to execute to help arrive at a lower WER on unseen data. The following bullet points outline the various tasks that need/should be executed.

  • (Modeling Group, Tools Group, Experiments Group) Find a way to implement LDA-MLLT, VTLN, MMI, SAT, CMLLR, and MLLR in the training process (dependencies will be the big hurdle here but the payoff will be worth it as WER will drop significantly)
  • (Systems Group) Continue work on the IRC server that will serve as a communications hub so that any information being passed around is kept in-house
  • (Data Group) Listen to audio files and cut out corrupt or unusable files