Speech:Spring 2019 Report


Introduction

The goal of the Spring 2019 Capstone project was to build on the work of previous classes, with a focus on maintenance and documentation to improve the overall functionality of the speech recognition project. To that end, the class was split into five groups, each specializing in a certain area of the project to ensure that all areas of concern were properly addressed. Each group researched the previous classes' progress in its area, looked into the current issues present in its field, and chose areas of focus to further the progress of the project.

Consistent with previous semesters, the class was divided into five groups and two teams with the following members:

Modeling Group: Christian Khoshatefeh, George Harvey, Kevin Richardson, Vladimir Kazarin
Data Group: Aashirya Kaushik, Brandon Peterson, Monica Pagliuca, Peter Baronas
Experiments Group: Brooke Brown, Dilpreet Singh, Ethan Jarzombek, Nicholas Klardie
Software Group: Adam Harney, Anthony Toscano, Travis Deschenes, Wesley Krol
Systems Group: Donald Combs, Naina Prasai, Scott Hughes
Team Alliance: Aashirya Kaushik, Adam Harney, Brooke Brown, Ethan Jarzombek, George Harvey, Nicholas Klardie, Peter Baronas, Scott Hughes, Wesley Krol
Team First Order: Anthony Toscano, Brandon Peterson, Christian Khoshatefeh, Dilpreet Singh, Donald Combs, Kevin Richardson, Monica Pagliuca, Naina Prasai, Travis Deschenes, Vladimir Kazarin


The tasks each group chose to undertake reflected an eagerness both to build on the progress made by past classes and to improve the reliability and maintainability of each group's area. The aim was to allow future classes to grasp the material more easily and to further improve the efficiency and progress of the project.

To meet this objective, each group set a specific goal for its area. Modeling focused on achieving a better Word Error Rate (WER) than the 2017 class's rate of 41.3% by improving the implementation of Linear Discriminant Analysis (LDA) and a Recurrent Neural Network (RNN). Data planned to improve the experimentation process by adding one hour and three hundred hour corpora and to complete the work of the previous class. The Experiments group looked into creating and improving the scripts needed to run experiments to make the process more seamless, documenting the steps they took, and reviewing and revising the documentation of past groups. Software examined the speech tools currently in use, whether it would be beneficial to update any of them, and the current status of Torque. Lastly, the Systems group worked on improving the reliability of the system by getting the backup server operational and fixing hardware/BIOS issues present on a few drones.

Modeling Group

Overview

During the Spring 2019 semester the Modeling group set out with the primary goal of reducing the word error rate to less than 40% on seen data. Our plan was to implement Linear Discriminant Analysis (LDA) and a Recurrent Neural Network (RNN), and to manipulate the phonetic spellings of specific utterances in the dictionary. By the end of the semester we had accomplished our primary goal, achieving a word error rate of 18.8% on seen data with a 145 hour corpus. We successfully implemented both LDA and an RNN; however, the RNN did not reduce the word error rate, so it was not used in our final experiments. Aside from LDA and the RNN, we also investigated and configured what we believe to be, at this time, the most effective senone and mixture weight values for specific corpora. Due to time constraints, and because our efforts were at times drawn away from our initial intentions, we did not look into the phonetic reconstruction of utterances in the dictionary. Additionally, another change that we believe improved results was incorporating the Summer 2018 finding on the best random seed; we had the best results using the random seed from experiment 0312/006.

Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a process that reduces the number of dimensions the data is represented in. Data is separated into various categories, and the number of features translates to the number of dimensions in which the entire data set is represented. The more dimensions there are, the harder, and potentially less accurate, it becomes to normalize the data or find a linear combination of it. LDA reduces the number of dimensions the data is represented in without clustering data from different categories together, maintaining adequate separation between categories. This makes finding the "line" (more appropriately, the plane) of best fit easier without losing accuracy.
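As a purely illustrative sketch of what dimensionality reduction with LDA looks like (this uses scikit-learn on made-up toy data and is not how Sphinx applies LDA internally):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy data: six 4-dimensional feature vectors belonging to three classes.
X = np.array([[1.0, 2.0, 0.5, 0.1],
              [1.1, 1.9, 0.4, 0.2],
              [3.0, 0.5, 2.0, 1.0],
              [3.2, 0.4, 2.1, 0.9],
              [0.2, 4.0, 1.5, 3.0],
              [0.1, 4.2, 1.4, 3.1]])
y = np.array([0, 0, 1, 1, 2, 2])

# Project the 4-dimensional features down to 2 dimensions while keeping
# the three classes as separated as possible.
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)
print(X_reduced.shape)  # (6, 2): same samples, fewer dimensions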

We were able to successfully implement LDA in our experiments, and it proved to be a very beneficial technique. Prior to our semester, the most recent and best baseline was for a 30 hour corpus, which achieved a word error rate of 26.3%. We repeated this experiment with LDA added, and it reduced the word error rate from 26.3% to 21.6%, an improvement of 4.7 percentage points. Implementing LDA on the longer 145 hour corpus showed continued success, as our best 145 hour experiment yielded a word error rate of 18.8% on seen data.

In order to incorporate LDA into an experiment, the following changes need to be made during both training and decoding:
- Prior to training, in etc/sphinx_train.cfg, CFG_LDA_MLLT should be set to 'yes' and CFG_LDA_DIMENSION should be set to 32.

- When decoding, to our knowledge the run_decode and run_decode_lda scripts do not currently work properly with LDA, so the Sphinx decoder must be run manually with the appropriate flags.
- Below is the command, with flags, that we used to run the Sphinx decoder.

 /usr/local/bin/sphinx3_decode -lda /mnt/main/Exp/<your_main_exp#>/<decode_sub_exp#>/model_parameters/<#>.mllt_cd_cont_<senone_value>/feature_transform -hmm 
 /mnt/main/Exp/<your_main_exp#>/<decode_sub_exp#>/model_parameters/<#>.mllt_cd_cont_<senone_value> -lm /mnt/main/Exp/<your_main_exp#>/<decode_sub_exp#>/LM/tmp.arpa -dict 
 /mnt/main/Exp/<your_main_exp#>/<decode_sub_exp#>/etc/<source_sub_exp#>.dic -fdict /mnt/main/Exp/<your_main_exp#>/<decode_sub_exp#>/etc/<source_sub_exp#>.filler -ctl 
 /mnt/main/Exp/<your_main_exp#>/<decode_sub_exp#>/etc/<source_sub_exp#>_decode.fileids -cepdir /mnt/main/Exp/<your_main_exp#>/<decode_sub_exp#>/feat -cepext .mfc >& decode.log &

Recurrent Neural Network (RNN)

A Recurrent Neural Network (RNN) is a neural network in which memory from previous inputs is essentially stored: the output of a specific neuron is fed back in as input when producing the next output. RNNs have grown popular for machine learning problems where the data changes over time and/or an output depends on previous inputs. The last semester to implement this technology was Spring 2017. That class incorporated an RNN into the language model to generate random words and then used the results to train the Sphinx decoder. However, the RNN did not prove more effective at improving performance. Current research does show that RNNs are popular for speech recognition and similar technologies.

Our plan was to implement an RNN differently than the previous semester did, to see if we could get positive results. Our implementation was successful in the sense that we were able to incorporate it into the process and have it work. However, we found that when the RNN was implemented it severely degraded the resulting word error rate and showed no improvement. As was concluded with the Spring 2017 implementation, the RNNLM still has no working benefit over a non-RNNLM setup, based on all the results in the RNN sub-experiments under experiment 0316, but further improvements may be possible. Based on outside research, work may also need to be done on the trans_parsed file before an RNNLM can achieve improved results. For instance, most RNNLM toolkits require one sentence per line, and none mention the placeholders [in brackets like these]. Further research is required on RNNLM toolkits beyond what is set up on the servers currently.
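As a starting point for that future work, below is a minimal sketch of the kind of preprocessing an RNNLM toolkit would want (Python for illustration only; the trailing utterance-id format assumed here may differ from the actual transcript files):

import re

def prepare_for_rnnlm(in_path, out_path):
    """Write a transcript as one plain sentence per line for an RNNLM toolkit."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            # Drop a trailing utterance id such as "(sw2001-A_000123-000456)", if present.
            line = re.sub(r"\(\S+\)\s*$", "", line)
            # Remove bracketed placeholders such as [LAUGHTER] or [NOISE].
            line = re.sub(r"\[[^\]]*\]", " ", line)
            words = line.split()
            if words:
                dst.write(" ".join(words) + "\n")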

A guide on how to incorporate an RNN into an experiment is provided here: https://foss.unh.edu/projects/index.php/Speech:RNNLM.


Senone and Mixture Weights / Sphinx Configuration

On top of the implementation of LDA, we also looked at which Sphinx configuration would yield the lowest word error rate while using LDA. The two primary values we focused on were the senone count and the mixture weights. Both are defined within sphinx_train.cfg.

Although there is room for further research into what exactly senones are and how they work, our current understanding is that senones act as detectors for specific sections of audio data. These detectors are assigned to represent a specific sequence of data during training; at decode time they are used to identify the sequences of data they were trained to represent, and the combination of senones eventually results in the detection of a specific sound or phoneme. As training takes place, the sequence of data a senone represents is continually adjusted through an averaging process over the incoming data vectors. By the end of training, each senone is a representation of a specific vector of data.

A certain number of senones is suggested depending on the length of training. Using too many senones can result in over-fitting the data, and not using enough can result in under-fitting it. The best senone values we found were the following: for a 5 hour corpus, 1000 senones; for a 30 hour corpus, 3000 senones; and for any train over 100 hours, 8000 senones. The senone value is set in sphinx_train.cfg by the variable CFG_N_TIED_STATES.

The mixture weights come into play because they are used to adjust the vector tied to a senone during training. As more data comes in and the senones are re-fitted to account for more data of the same type, the mixture weights act as multipliers used when averaging the vectors together. The mixture weight values are created at run time and only exist in compiled files, and therefore cannot be hard coded. However, inside sphinx_train.cfg the CFG_FINAL_NUM_DENSITIES variable defines the number of densities behind the mixture weights.

As with the senone value, there is an appropriate CFG_FINAL_NUM_DENSITIES value depending on the length of the training. Over-fitting and under-fitting can occur if the CFG_FINAL_NUM_DENSITIES value is too high or too low. The best CFG_FINAL_NUM_DENSITIES values we found were the following: for a 5 hour corpus, 16; for a 30 hour corpus, 32; and for any train over 100 hours, 64. The First Order team conducted extensive research to find the best combination of senones/densities for each corpus and the best WER on seen data for each corpus. The data gathered helped us better understand how senones/densities affect speech recognition on seen data, but additional research is needed to see how the trained models perform on unseen data. This will help us find the best senone/density values for our corpus sizes, identify tendencies in the data, and predict the best values for other corpus sizes.
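For example, a train longer than 100 hours would use the following two settings in etc/sphinx_train.cfg (taken from the values above; a 5 hour train would use 1000/16 and a 30 hour train 3000/32):

$CFG_N_TIED_STATES = 8000;
$CFG_FINAL_NUM_DENSITIES = 64;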

Lastly, the only other aspect that was changed from the default settings was the hard coding of the random seed in mllt.py. Based on the Summer 2018 experiments, it was found that instead of using a random number generator, the seed value in mllt.py should be hard coded to whatever yields the best result, in order to always achieve the best word error rate and maintain consistency across experiments. The best seed value found during that semester was 2038113770. This is also the seed value that gave us the best results and the one we used in our final experiments. The value is set in python/sphinx/mllt.py: the code for the random integer generator must be commented out and replaced with the single line 'r = 2038113770'. This occurs at approximately line 96 of mllt.py.
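The change looks roughly like the following (the original random-number line is paraphrased as a comment, since its exact code is not reproduced in this report):

# r = ...                      # original random-seed generation in mllt.py, commented out
r = 2038113770                 # fixed seed from experiment 0312/006, per the Summer 2018 findings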

Transcript Files Parsing

We also continued last year's research into parsing the transcript files (<exp>_train.trans), recorded in experiments 0305/010 and 0305/011. We removed certain annotations from the transcript files that were used to note how words were pronounced, or to mark a word that wasn't said but was meant in the context of the conversation. For example, in the following phrase the speaker stumbled over the word "ammunition" and the transcriber recorded the complete pronunciation: WELL UP UH UP IN NEW ENGLAND WHERE I'M FROM UH Y- YOU HAD TO GET A PERMIT BEFORE YOU COULD BUY ANY AMMUIT[ION]- AMM- AMMUNITION. Based on the scripts in 0305/010/etc/scripts, we created improved scripts in 0314/011/etc/scripts and 0314/010/etc/scripts that parse the transcript files and remove brackets and dashes from sentences like these. This approach helped us create much more effective models: the resulting WER improved by more than 10 points, scoring 21.7 WER against last year's 32.7 WER (0305/010 vs 0314/015). We followed with a series of experiments which concluded that parsing the transcript files results in a 2-3% WER improvement on average.
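The two cleanup rules can be illustrated with the following minimal sketch (Python for readability; the actual scripts in 0314/010/etc/scripts and 0314/011/etc/scripts are the authoritative versions, and the exact regular expressions they use may differ):

import re

def clean_utterance(text):
    text = re.sub(r"\[[^\]]*\]", "", text)          # remove [ ... ] annotations
    text = re.sub(r"(\w+)-(\s|$)", r"\1\2", text)   # strip trailing dashes on word fragments
    return " ".join(text.split())

print(clean_utterance("AMMUIT[ION]- AMM- AMMUNITION"))
# -> "AMMUIT AMM AMMUNITION"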

Data Group

Overview

The primary tasks for the Data group this semester were: performing experiments on which type of transcript annotation leads to improved word error rates (stripping [] and keeping - in LM/train/dic, keeping both [] and - in LM/train/dic, or stripping both [] and - in LM/train/dic); creating new shorter corpora; and creating new transcripts in order to have protected files for testing, so that there was no danger of accidentally using seen data in unseen experiments. We managed to complete all of these tasks this semester.

Creating Shorter Corpora

We completed the creation of the new shorter corpora quite quickly. After they were created, however, the new corpora did not get used very much, because we quickly realized that a shorter corpus was not as useful as we first thought it would be. While a shorter corpus allows experiments to complete in significantly less time, the results are of limited value because the system does not process enough data to make the resulting model accurate to any reasonable or useful degree. The shorter corpora did prove useful as a way to verify that we could create new corpora and that they would work the way they were supposed to.

Annotation Experimentation

We worked to complete the experiments with different types of annotations in the transcripts. However, we found that we needed help from the Modeling group to complete them, because there was a large number of long experiments that needed to be run. This required a lot of scripting work, and the effort proved to be a bit too much for one group, especially when the regular expressions we were using to edit the transcripts were not always completely accurate and sometimes produced problematic results. Engaging the Modeling group provided the necessary additional expertise and resources. Overall, with their help, we were able to complete all the experiments and get useful data.

We needed to re-run last year's experiments with different annotation removals on 5 and 30 hours of data. These included 0305/011, 0305/012, and 0305/013, all of which were 5 hour experiments. First, we re-ran last year's 0305/012 experiment, which keeps [] and - on 5 hours of data. The word error rate improved from 34.4% (0305/012) to 33.6%. However, it is still worse than 0305/011 (WER of 32.8%) and 0305/013 (WER of 32.7%), suggesting that removing either [] or both [] and - may give an improvement. To address this we needed to re-run all three experiments (i.e. 0305/011, 0305/013, and 0314/004) with 30 hours of data to see if the pattern continues (note that 0314/009 was the 30 hour repeat of this experiment). For 0314/009, the purpose was to keep [] and -. The result of this 30 hour experiment, which was a redo of the 0314/004 baseline, was a 43.9% WER. This was worse than the baseline, as well as last year's 0305/011 and 0305/013 5 hour experiments. After a lot of struggling, we were able to complete a 30 hour experiment based on last year's 0305/011 5 hour experiment. This experiment used a modification of the regular expression scripts from 0314/013 to remove certain information in brackets. The WER came to 21.2%. We also did a 30 hour run of experiment 0305/013, which was designed to remove [] and - from the transcript. The Sphinx settings were a 5000 senone count and final_num_densities = 16, and the mllt.py file was not edited to use the preferred LDA seed. The WER came to 26.2%.

Creating New 300-Hour Corpus

We managed to create two new transcripts. One transcript covers 290 hours of audio that will be used for training the system. The other covers 10 hours of audio that will be used only for testing, with five hours dedicated to unseen testing and five hours to seen testing. The transcripts were created by finding the file IDs that correspond to the protected audio files, removing them from the original transcript, and placing them in the testing-only transcript. The majority of this task was finding which audio files had which IDs and then using the grep command to create the new transcripts.
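Conceptually, the split works like the sketch below (a hypothetical Python illustration; the real work was done with grep against the file-id lists, and the file names and the prefix-matching rule here are assumptions):

protected = set()
with open("protected_fileids.txt") as f:       # ids of the 10 test-only hours
    for line in f:
        protected.add(line.strip())

with open("original.trans") as src, \
     open("train.trans", "w") as train, \
     open("test.trans", "w") as test:
    for line in src:
        # Each transcript line ends with its utterance id in parentheses.
        utt_id = line.rsplit("(", 1)[-1].rstrip(")\n ")
        if any(utt_id.startswith(p) for p in protected):
            test.write(line)
        else:
            train.write(line)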

Software Group

Overview

The three main goals of the Software group were to determine if any software should be upgraded, determine the install location of Sphinx, and determine the status of Torque. Determining what software should or shouldn't be upgraded produced a plan for what should happen during the summer. Finding the active installation path of Sphinx helps the summer team figure out which files can safely be removed from Caesar to reduce unnecessary clutter. Getting Torque running will make it possible for students to fully implement it, allowing much quicker job times for tasks such as experiments. We gathered a lot of information in each of these three categories that will significantly help future students.

List of Installed Software Versions

For this task we went through the preexisting software list on the Foss page and cross-referenced it with the software installed on the servers. While going through each piece of software, we added more information and links to properly describe each software's purpose. We also added a section evaluating whether it is worth upgrading to the newest version.

Operating System

RedHat - Caesar is currently running RedHat Enterprise 6.8 as its operating system. The newest version is 8, but that is still in beta; the most current stable version is 7.6. RedHat 6 will reach its end of life on November 30th, 2020. We should seriously consider moving everything to at least RedHat 7.0, mainly because of the approaching end-of-life date, after which these systems would no longer be secure.

Software

Sphinx Decoder - Sphinx is a large vocabulary, speaker independent speech recognition code base and suite of tools. Sphinx 3 can use fully continuous observation densities. The current version on Caesar is 3.7 while the newest is 5.0. The major difference with the newer versions is that they are written in Java instead of C, along with some improvements regarding word error rates. This is not enough reason to update to the newest version, though; as of right now, Sphinx is doing everything that we need it to.

CMU Language Model Toolkit - The Carnegie Mellon Statistical Language Modeling Toolkit is a set of Unix software tools designed to facilitate language modeling work in the research community. The version running on Caesar is Toolkit version 2 while the newest is 7. The major differences are various bug fixes and documentation fixes. There is no major reason to update, as we have not had any issues that require those bug fixes.

Sphinx Trainer - This is part of the CMU toolkit and is used to run experiments based on the user's specifications. The current version Caesar is running is 1.0 while the newest is 1.0.8. The major difference between the versions is that the newer version is meant to be used with Sphinx 4. Since we are not running Sphinx 4 yet, it does not make sense to update this, as doing so would likely cause major issues.

CMU Dictionary - The Carnegie Mellon University Pronouncing Dictionary is an open-source machine-readable pronunciation dictionary for North American English that contains over 134,000 words and their pronunciations. The current version Caesar is running is 0.6 while the newest is 0.7. The newest version has more words and a new dictionary file format. Having more words would be very beneficial, but the new file format could cause issues with other programs that rely on this dictionary and expect the old format. Unless we update the other software at the same time, this needs to stay at the current version.

Sclite - The program sclite is a tool for scoring and evaluating the output of speech recognition systems. The current version on Caesar is 2.3 while the newest available is 2.4.1. The differences in the newer version are mainly bug fixes. Since this is currently working as intended, there is no major reason to update the software.

Emacs - This is an extensible, customizable, free/libre text editor. The current version on Caesar is 23.1 while the newest is 26.1. It is not known what the updates to this software include beyond minor bug fixes, which is why we should not update it.

Screen - Screen is a full-screen window manager that multiplexes a physical terminal between several processes, typically interactive shells. The current version on Caesar is 3.9.15 while the newest available is 4.6.2. We are unsure what the newest version has to offer besides minor bug fixes, which is why we will not update the software.

SOX - SOX is a command line utility that can convert various formats of computer audio files into other formats. The current version on Caesar is 14.3.1 while the newest is 14.4.2. The major benefits of the new version are that it now supports multi-channel LADSPA plugins and optional latency compensation, and it includes many other bug fixes. As these features do not directly affect what we are using it for, we do not have a good reason to update the software.

Tree - Tree is a recursive directory listing command that produces a depth-indented listing of files. The current version on Caesar is 1.7.0 while the newest is 1.8.0. The newer version mainly has bug fixes without adding any features we would need. For this reason, we should stay on the current version.

The only piece of software that should be upgraded relatively soon is RedHat, the operating system itself, and it was important to note this so that everyone is aware. This will have to be addressed soon, likely over the summer, to avoid any downtime during the following semesters.

Cleanup of Sphinx Installation

For this task we had to go through the files on Caesar and figure out which files are actively being used by Sphinx. Over the years, there have been multiple installations of Sphinx which means there are files scattered all around the server. In order to clean this up without affecting the software, we had to search the server to figure out what is active and what is not.

File Tree

(Image: File tree.png)

How Sphinx Was Found

Within the "scripts_pl" directory, there is a script named "copy_setup.pl." In that script there are a few lines of code referencing a file in another path. Those lines of code go as follows:

if (!defined($cfg_file)) {
  if (-e "etc/sphinx_decode.cfg") {
    $cfg_file = "etc/sphinx_decode.cfg";
  } elsif (-e "etc/sphinx_train.cfg") {
    $cfg_file = "etc/sphinx_train.cfg";
  }
}

Though we would not call this search for the file path names incredibly thorough, the search for references to "sphinxtraindir" was relatively thorough. This script appears to establish the creation of the "sphinxtraindir" variable. Following the path names it gives, we are brought to two configuration files in the etc directory: "sphinx_decode.cfg" and "sphinx_train.cfg" are the two configuration files mentioned in "copy_setup.pl." Although "sphinx_train.cfg" did not seem to have any pertinent information related to the "sphinxtraindir" variable, we did see an intriguing variable called "CFG_SPHINXTRAIN_DIR", which is assigned the path "/root/speechtools/SphinxTrain-1.0." This being inside of the Exp directory led us to believe that it might not be very helpful. However, in "sphinx_decode.cfg", the "sphinxtraindir" variable is finally given a value through the new name "DEC_CFG_SPHINXDECODER_DIR." In this file it lists the path name "/mnt/main/root/sphinx3."

Evidence Proving Validity

Searching the "copy_setup.pl" script for other relevant information turns up nothing noteworthy. There is definitely no references to a sphinx install path within the "python" directory, so that directory can be safely stricken. Searching the "scripts_pl" directory for path names using grep did turn up far more relevant information, but it almost always started with the variable "SphinxTrain_Dir", which made finding the sphinx path name that is currently being used much more difficult. Almost all other mentions of this variable seemed to be irrelevant to naming the location of its initialization. "copy_setup.pl" appeared to be the only script implying any amount of initialization, so we assumed following it would bring us to the path name. Checking both configuration files named in the "copy_setup.pl" script, only one seemed to have anything related to sphinx path name. This path name is the one we have provided as the path to sphinx and it does appear that the config file is setting up a variable with the path as a string which would make sense considering we are looking for a string variable. That path name is "/mnt/main/root/sphinx3".

Status of Torque Software

For this task, we first had to determine the status of Torque based on previous years' notes. After gathering as much information as we could about Torque, we did some more hunting on the servers to figure out where previous students had left off. As we learned more about it, we eventually got Torque set up on test servers; this setup will eventually make its way to the main servers to actually be utilized by students.

What is Torque?

Torque is an open source resource manager providing control over batch jobs and distributed compute nodes. One can set up a home or small office Linux cluster and queue jobs with this software. For the purposes of this course, Torque is used to queue experiments to the servers in order to drastically reduce the amount of time the experiments take.

A cluster consists of one head node and many compute nodes. The head node runs the torque-server daemon and the compute nodes run the torque-client daemon. The head node also runs a scheduler daemon. While Torque has a built-in scheduler, pbs_sched, it is typically used solely as a resource manager, with a separate scheduler making requests to it. Resource managers provide the low-level functionality to start, hold, cancel, and monitor jobs. Without these capabilities, a scheduler alone cannot control jobs.

While Torque is flexible enough to handle scheduling a conference room, it is primarily used in batch systems. Batch systems are a collection of computers and other resources (networks, storage systems, license servers, and so forth) that operate under the notion that the whole is greater than the sum of the parts. Some batch systems consist of just a handful of machines running single-processor jobs, minimally managed by the users themselves. Other systems have thousands of machines executing users' jobs simultaneously while tracking software licenses and access to hardware equipment and storage systems. Pooling resources in a batch system typically reduces technical administration while offering a uniform view to users. Once configured properly, batch systems abstract away many of the details involved with running and managing jobs, allowing higher resource utilization. For example, users typically only need to specify the minimal constraints of a job and do not need to know the individual machine names of each host on which they are running. With this uniform, abstracted view, batch systems can execute thousands of jobs simultaneously. Batch systems are comprised of four components: (1) Master Node, (2) Submit/Interactive Nodes, (3) Compute Nodes, and (4) Resources.
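For context, a minimal Torque job script and the commands to submit and monitor it might look like the following (illustrative only; the job name, resource request, and script contents are assumptions, not the configuration running on our servers):

#!/bin/bash
#PBS -N sample_train         # job name
#PBS -l nodes=1:ppn=2        # request one compute node with two processors
#PBS -j oe                   # merge stdout and stderr into one log file

cd $PBS_O_WORKDIR            # start in the directory the job was submitted from
echo "running on $(hostname)"

# Submit with:   qsub job.sh
# Monitor with:  qstat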

Our Process

Throughout this semester we worked on finding Torque, learning more about it, and installing it on the test servers the Systems group set up specifically for us. When we were first learning about Torque, it took a lot of research, as there was limited documentation from previous classes beyond a few brief mentions here and there. From what we could find, it had not been running properly for a while and was simply abandoned. While researching, we found a lot of great resources, including the manual for the exact version installed on the servers. Reading these and getting familiar with the commands was very helpful in understanding how Torque worked.

As we got further into the research process, we started searching Rome for any files related to Torque. We found a number of active log files showing errors about not being able to communicate with the other servers. We dug deeper and eventually found the files we needed to get Torque up and running. After speaking with Professor Jonas and the Systems group, we had two test machines (1750s) set up for us to start experimenting on, named Astronomix and Obelodalix. Our first goal was to move the installation files generated from Rome, which is the main server in the Torque configuration, over to the two test servers. These were not connected to /mnt/main, so it was tricky to get the files we needed over. Once we had those files, we learned how to successfully install Torque on each server; we documented the steps on the Spring 2019 Software Group Torque page.

Once we had Torque installed on each server and saw that the pbs and Torque processes were running, we tried to figure out how to configure them so they would be able to communicate with Rome. Unfortunately, this is where we left off, as we were running low on time at the end of the semester. We were getting errors when trying to open certain ports that would allow the machines to speak to each other. We hope that next year's class will review our notes and pick up where we left off so they can get Torque to a fully working state.

Systems Group

Overview

The primary goal for the Systems group was to ensure the overall health of the Caesar server along with the drones connected to it. Previous Systems groups focused on upgrading the hardware and updating the software tools. The most recent group, in 2018, focused more on stability, and we mirrored their concerns in this area, as it required our immediate attention to ensure the health of the system. Currently, the backup server is working properly, and multiple issues related to permissions and the PATH environment have been handled. Our efforts focused on a combination of system stability and expanding system resources.

Improve Backup Process

Lutetia is now the backup server (previously named "capstonebackup"). Rome serves as the go-between to back up Caesar to Lutetia. This is done by having Caesar and Lutetia mounted on Rome at /mnt/main and /mnt/backup_solution respectively. A script was written to create the directory structure and complete the initial copy for the rsync process. The script can be edited to fit your environment, since at the end of each year the 03xx directories are placed into a directory for that semester (e.g. sp19 will be filled with 03xx directories). An archival rsync then determines the difference between the initial copy and the current state of the directory. A special file is created to add an entry to the crontab so that the process runs automatically at a set time.
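As a rough sketch of the approach (the actual script lives on Rome; the paths follow the mounts described above, but the rsync flags and the schedule shown here are illustrative assumptions):

# Initial copy, and the same command re-run for each archival sync,
# from the Caesar mount to the Lutetia backup mount on Rome:
rsync -a /mnt/main/ /mnt/backup_solution/

# Example crontab entry to run the sync automatically every night at 2 AM:
0 2 * * * rsync -a /mnt/main/ /mnt/backup_solution/ >> /var/log/backup_rsync.log 2>&1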

Server Maintenance

We discovered that the current servers did not have their times set correctly, nor were they being properly synced as they should be. We set all the servers to the correct standard time before exploring the possibility of automation. After manually setting the times on the servers that were not set up correctly and addressing the problems on the drones, we planned to sync the servers together (all drones and Caesar) and link them to a trusted time server using the Network Time Protocol (NTP), but due to time constraints we were not able to complete this task. The research has been done, however, and the possible setup can be found in the individual Systems group members' logs. Looking at the ntp.conf file on Methusalix, we can see that the public servers are already added, but the restricted portion that would allow the rest of the drones is commented out.

In the ntp.conf file:

# Hosts on local network are less restricted.
#restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap

# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
server 0.rhel.pool.ntp.org iburst
server 1.rhel.pool.ntp.org iburst

The line:

restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap

would allow access from the LAN to the NTP server by adding the gateway IP address. Each server would also need its config file changed to point to the local NTP server.
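For example, each drone's /etc/ntp.conf would then contain a line such as the following (using Methusalix's address as the local NTP server; this is a proposed setup, not a change that has been made):

server 192.168.10.9 iburst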

Once the NTP server is established, the following commands can be used to check its status:

ntpstat #If this displays an output then your NTP is working in that instance. For Example:

synchronized to NTP server (x.x.x.x) at (NTP Server Name)
time correct to within 399 ms
polling server every 64 s

ntpq -p #shows the list of clients connected to the host NTP server and their status. For Example:

remote           refid           st t  when poll reach   delay   offset  jitter
================================================================================
+Methusalix      192.168.10.9     2 u   15  128  377     88.649   5.946   6.876
-Astronomix      192.168.10.14    3 u  133  128  377    182.673   8.001   1.278
*Obelodalix      192.168.10.15    2 u   68  128  377     29.377   4.726   11.887
+Asterix         192.168.10.2     2 u   31  128  377     28.586   -1.215   1.435

Permissions Issue

The issue stemmed from a combination of group, shell, and umask settings: if the groups did not match properly, the umask was set to 022, which did not allow directories and files to be created with the (drwxrwsr-x., .rwxrwxr-x.) rights in the Exp directory structure. These rights allow everyone in CIS 790 to complete tasks on other group members' directories and files. They are set in the /etc/profile, /etc/bashrc (for the bash shell), and /etc/csh.cshrc (for the tcsh shell) files. We have marked each change in each file with a remark that Prof. Jonas authorized it. It goes without saying that this is a security risk, but trust and honesty will prevail.

What Was Changed

In /etc/profile, /etc/bashrc and /etc/csh.cshrc

From
# By default, we want umask to get set. This sets it for login shell
# Current threshold for system reserved uid/gids is 200
# You could check uidgid reservation validity in
# /usr/share/doc/setup-*/uidgid file
if [ $UID -gt 199 ] && [ "`/usr/bin/id -gn`" = "`/usr/bin/id -un`" ]; then
    umask 002
else
    umask 022
fi
To
# By default, we want umask to get set. This sets it for login shell
# Current threshold for system reserved uid/gids is 200
# You could check uidgid reservation validity in
# /usr/share/doc/setup-*/uidgid file
if [ $UID -gt 199 ] && [ "`/usr/bin/id -gn`" = "`/usr/bin/id -un`" ]; then
    umask 002
else
    umask 002 #Change Per Prof. Jonas dmcombs
fi

New Servers

This semester we were able to add three more drones to the LAN: Methusalix (Dell R610), Astronomix (Dell 1750), and Obelodalix (Dell 1750). RedHat Enterprise Linux 6.6 was installed on each server. The two Dell 1750 servers were given to the Software group for their continued research and downloads without affecting the rest of the drones. Each new server is mirrored to the other drones, with Methusalix being the fastest of all the drones.

R610 - Methusalix  (192.168.10.9)
1750 - Astronomix (192.168.10.14)
#old Asterix
1750 - Obelodalix (192.168.10.15)
#old Obelix

Documentation

In the process of getting the backup running, we added documentation about the steps we took to make it fully operational, and we revised the existing documentation for the setup process as well as how to properly diagnose certain issues. Furthermore, we documented the new servers that were added to the LAN. Finally, in our group log, we documented the PATH environment issues encountered when installing new servers and the permission issues that were affecting every server.

Experiments Group

Overview

The goal of the Experiments group was to create, improve, and expand on the scripts that make running experiments, trains, and decodes seamless. In doing this, we aimed to give the other groups a more fluid system to build progress upon. Maximizing efficiency and reducing redundancy for the entire speech project workforce propels us forward as a group and simplifies our path to success.

Our primary points of concern began with making adjustments to the Add-Experiment and Copy-Experiment scripts to help eliminate bugs and warnings, as well as adding features to improve usability. The Add-Experiment script remains one of the most commonly used scripts, and it has been updated to give a smoother user experience and a cleaner code base, which will allow future students to more easily add features and address bugs. Improvements were made to the Copy-Experiment script as well to ensure it works as expected. However, feedback from students reveals that there is still a lack of trust in the script, and future students will still have to work out its actual requirements.

In addition to updating these existing scripts, we had a goal of creating two new scripts, known as the Make-Experiment script and the Running-Jobs script. These would integrate the Make-Train and Make-Decode scripts, and capture all the training and decoding jobs running on our machines to display who is working on what and when. These two improvements could make running experiments significantly simpler and much more organized.

Finally, we reviewed, updated, removed, and rearranged the documentation as needed. Some old documentation, scripts, and project resources were still listed on the wiki page and might cause confusion for current or future groups. It is important that we keep the documentation as detailed and thorough as possible, but keeping it lean and free of expired information is just as important to reduce false leads and wasted time.

Fix existing issues with Add-Experiment Script

The Add-Experiment script remains one of the most utilized scripts in the entire speech project. There were some issues with the script which we fixed first. In its previous state, creating a root experiment forced you to create a sub-experiment, which you would later have had to remove from the wiki by hand; this unnecessary step was made optional. The Add-Experiment script also had restrictions on what you could and could not type as a name for your experiment, such as trailing white space and special characters. Small nuisances like this can complicate formatting and diminish simplicity. Investigation revealed that these issues were due to poor practices introduced into older versions of the script and never revised. The code was heavily modified to handle HTTP requests in a more correct way, which eliminates all issues around special characters and will make the script easier to work with in the future.

Implement quality of life improvements for Add-Experiment Script

After making those adjustments, we turned our attention to adding a couple of features as well. We implemented a feature that automatically recognizes your full name when you run the Add-Experiment script, simplifying the process so that the system asks for less information and lets you get on your way quicker. This information is already on Caesar, so the script can look it up rather than asking you for it. We also considered a feature that not only adds the information to the wiki but also creates the directory in /mnt/main/Exp for you, which would have lessened redundancy and eliminated the possibility of the directory not being created, but we never got around to it.

Address issues with Copy-Experiment Script

The Copy-Experiment script is another commonly used script which, as it stands, is a good program but could still use some improvements. Currently, when you run the program it copies absolutely everything, including all the prior models that were created. We would like to make this optional, allowing users to decide whether they want this extra information. The user would simply append (-a) for all information, or (-s) for a skeleton version that contains only a summary of the most important information needed to run the current experiment. Producing only the bare bones of the essential data from Copy-Experiment would be a huge step in advancing our experiment data organization. In addition, there are issues with the regular expressions used in copyExp.pl that cause the script to incorrectly modify configurations, which we will look to fix.

Create new Make-Experiment and Running-Jobs scripts

A couple of things that we believed would be both beneficial and attainable within the scope of this semester were two new scripts: one to combine certain scripts that are usually run together into one execution, and one to keep track of who is running which experiments. Two scripts that are known to be run together are the makeTrain.pl and makeDecode.pl scripts. Merging these into one script called makeExp.pl can make execution much more efficient and save time. The second script would keep a log of who is running which tests and when. This script, spyExp.pl, would capture all the training and decoding jobs running on our machines and display the results in a table for anyone to see. The results captured would include things like the program being run, the experiment number, the user, and the machine it was run on.
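To illustrate the idea behind spyExp.pl, the sketch below scans the process list for Sphinx jobs and prints a simple table (Python for illustration only; the real script would likely be Perl like the others, and the process names it matches are assumptions):

import subprocess

# List every process on this machine, keep the ones that look like
# Sphinx training or decoding jobs, and print a simple table.
ps = subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout
print(f"{'USER':<12} {'PROGRAM':<20} COMMAND")
for line in ps.splitlines()[1:]:
    fields = line.split(None, 10)
    if len(fields) < 11:
        continue
    user, command = fields[0], fields[10]
    if any(name in command for name in ("sphinx3_decode", "sphinxtrain", "RunAll.pl")):
        program = command.split()[0].rsplit("/", 1)[-1]
        print(f"{user:<12} {program:<20} {command[:60]}")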

Update documentation

We are not the last year of capstone students who will look at these scripts, so it is important that we leave behind appropriate documentation for each of them so that people in the future won't be left scratching their heads about what we did. Further, we made an effort to ensure all existing documentation, including the Scripts page, is up to date.

Team Strategies

Alliance

Members

Team Members
Aashirya Kaushik
Adam Harney
Brooke Brown
Ethan Jarzombek
George Harvey
Nicholas Klardie
Peter Baronas
Scott Hughes
Wesley Krol

Goal

Our goals for this project were to leverage areas that we felt weren't completely utilized by past groups and to see whether any reduction in word error rate could be achieved by looking into them. To do this, we investigated past semesters' work and identified the areas that warranted further investigation, based on the results achieved and prior research into areas that seemed promising. From our research, we determined that the dictionary needed to be further investigated to see if incorrect data in this area was preventing the word error rate from being reduced further.

Description

The primary changes we made to the process were the implementation of linear discriminant analysis (LDA) and adjusting the mixture weight and senone count to be the most appropriate for a specific corpus length. LDA was implemented successfully, and it did prove to reduce the word error rate. LDA is a process of reducing the number of dimensions the data is represented in without compromising the distance between the various categories of data. Reducing the number of dimensions allows the creation of a "line" of best fit that is more apt to be a good representation of data from multiple categories. To implement LDA we changed the following configuration settings in sphinx_train.cfg:

$CFG_LDA_MLLT = 'yes'
$CFG_LDA_DIMENSION = 32

Senones were a large focus with respect to what affects the final word error rate. Senones act as detectors and are associated with a specific segment of audio data. There is a correlation between the length of the training and the number of senones used. If not enough senones are used, then a senone will be correlated with too much data, causing it to be under-fitted. On the other side, if too many senones are used, they will be over-fitted and there will not be enough distinction between senones. Either case results in a higher word error rate, and there is a "happy medium" value that we found to work best for each corpus length: for a five hour train, 1000 senones; for a thirty hour train, 3000 senones; and for anything over 100 hours, 8000 senones. The senone value is adjusted in sphinx_train.cfg through the variable CFG_N_TIED_STATES. Lastly, the mixture weight was the other main configuration setting that we touched. The mixture weight is a value that is ultimately created when the program files are compiled. The mixture weights are used to associate a specific senone with a specific piece of audio data, and manipulating them can change the fitting of a specific senone. Since the actual values are generated at compile time, they cannot be hard-coded. However, we can adjust the CFG_FINAL_NUM_DENSITIES parameter, which takes an integer that is a multiple of eight. This parameter is what affects the final mixture weight values. Similarly to the senone values, we saw a correlation between this and the length of the train. The best values we have found so far are 8 for a five hour train, 16 for a 30 hour train, and 64 for anything longer than 100 hours.

Results

As of now, the results of our final 300 hour experiment have not been finalized; at the time of this submission, the experiment is still decoding. Currently, our best experiment has been a 145 hour train, tested with seen data. The changes we made from the default configurations were the following. In sphinx_train.cfg:

$CFG_N_TIED_STATES = 8000
$CFG_FINAL_NUM_DENSITIES = 64
$CFG_LDA_MLLT = 'yes'
$CFG_LDA_DIMENSION = 32

Additionally, we implemented LDA during the decoding process as well, which required us to call the Sphinx decoder manually, passing it the following flags:

$lda
$hmm
$lm
$dict
$fdict
$ctl
$cepdir
$cepext

This configuration, using a 145 hour train and testing on seen data, resulted in a word error rate of 18.8%. We did not conduct any further experiments on unseen data; however, this was the best result on seen data that we received.

Best Submission:

Corpus Information
 Training Corpus Name: 145hr LDA with bracket removal
 Training Corpus Size (in hours/minutes): 145hrs
 Test Corpus Name: 145hr LDA with bracket removal
 Test Corpus Size (in hours/minutes): 145hr

Results
 Word Error Rate: 18.8

Runtime
 PowerEdge 1950 used: Miraculix
 Real-time factor: 2x

Summary

At this point, we feel confident that we have found the best configuration for the mixture weights and senone count. Implementing LDA definitely reduced the word error rate, and combining it with senone and mixture values of 8000 and 64 for any experiment over 100 hours should, at this point, produce the best word error rate. We also used a parsing script that was created to remove additional brackets from the transcript. Lastly, we had begun to look at manipulating the dictionary to further define utterances such as [laughter], so that Sphinx would be able to decode more of the sounds that are common in speech, in turn reducing the word error rate. This was mainly thought to help in cases where an utterance precedes or directly follows words that are defined: since the first part of the sound is not defined, the entire word is missed. However, we were unable to make any significant progress in implementing this theory. We recommend that following semesters continue with it, as it could possibly improve the WER.

First Order

Members

Team Members
Donald Combs
Travis Deschenes
Vladimir Kazarin
Christian Khoshatefeh
Monica Pagliuca
Brandon Peterson
Naina Prasai
Kevin Richardson
Anthony Toscano
Dilpreet Singh

Goal

Our main goal was to achieve a lower WER than last year's best result of 43.3% on unseen data. Our team also researched new approaches to improving the WER while observing and recording new data from our experiments.

Description

Our team researched how WER is affected by different combinations of densities and senones on different corpus sizes, without any other enhancements like LDA or improvements to the training data. The settings in sphinx_train.cfg for densities/senones are $CFG_FINAL_NUM_DENSITIES and $CFG_N_TIED_STATES. Our idea was to use a range of densities from the default value of 8 up to 64, with senone counts of 1000, 2000, 5000, and 8000. The purpose was not just to find the best resulting combination but also to find any correlation of densities/senones with the corpus size. This will allow us to better understand how these settings affect speech recognition and predict the best combinations for larger corpus sizes.

Additionally, one of the main goals this year was to improve upon the scripts used by last year's class, which removed certain annotations from the transcript files. The corpus of audio data we are using for this project is a 300 hour sample of random phone conversations that were manually transcribed by humans. The transcribers had a system for making notes about how certain words were said, or corrections when someone mispronounced a word. This is not useful to us in a speech recognition context, so those notes have to be dealt with in some way. This is particularly important for the construction of the language model, which is a statistical model of the probabilities of words occurring in the context of each other. When the annotation notes are left in the transcript when the language model is created, the model will have slightly skewed probabilities, because the annotated words occur rarely, most likely only once. By removing them from the transcript, the language model becomes more accurate. It was determined that removing words contained in brackets and trailing hyphen symbols provided the most significant improvement in WER. With this in mind, we trained a model using 300 hours of data. The language model was created from a transcript produced by taking the original transcription of the data and replacing the aforementioned text using a series of complex regular expressions. Once the acoustic and language models were created, they were used to decode five hours of unseen data. This decode took under 24 hours to complete and resulted in a 58.1 percent word error rate. This is still higher than last year's result, but when combined with LDA, it should produce a more accurate result.

Results

Despite not having all the data we had initially set out to collect as a group, some interesting trends appeared in the data we did have the chance to collect. Our work with the 5 hour corpus indicated that there are optimum settings for the senone and density counts. In general, a higher density and senone count has a beneficial impact on producing experiments with a low word error rate; however, there seems to be a point in most data sets where the model becomes overtrained, depending on the size of the data set. There are a number of controls that could have been addressed that might have given us a more consistent trend across the different sized data sets. For example, some experiments used the preferred seed found by Hannah Yudkin in the Summer 2018 semester.

(Images: 5hours senones.png, 30hours senones.png, 145hours senones.png. Note: the 16/5000 and 16/8000 results were proved to be invalid.)

The best result we got was in experiment 0319/053, using the 300hr corpus with the senone/density combination that proved effective for this corpus size (8000 senones with 64 densities), together with our parsing scripts for the transcript files. The results didn't beat the goal WER on unseen data, but they are a promising base for the experiments we couldn't finish: our 300hr LDA experiment failed to score and, due to time constraints, we were unable to finish it, and we also didn't have the resources to do a 300hr LDA + parsing experiment. This can be a great start for next year's team to run 300hr/LDA and 300hr/LDA + parsing.


Best Submission: 0319/053

Corpus Information
 Training Corpus Name: switchboard/300hr/train/trans/train.trans
 Training Corpus Size (in hours/minutes): 300hrs
 Test Corpus Name: /mnt/main/corpus/switchboard/300hr/test/trans/test.trans
 Test Corpus Size (in hours/minutes): 5hrs (on seen data)

Results
 Word Error Rate: 29.4 (seen), 58.1 (unseen)

Runtime
 PowerEdge 1950 used: Obelix
 Real-time factor: N/A

Summary

This was our best result because of the specific configuration changes we made. If we had had more time, we might have experimented with different configuration settings to get better results. Given the baseline aspect of this project, we attempted to create several baseline experiments, and some of them may be more accurate than others. We used a spreadsheet to keep track of all the different 5hr, 30hr, 145hr, and 300hr experiments with different senones and densities, as a way to find out what settings might be best for getting the lowest WER. We had some difficulty with unseen data experiments and probably could have gotten better results if we hadn't run into those issues. By the time we were able to figure out the issue with unseen experiments, it was getting close to the end of the semester. However, we were able to get some unseen experiment results that can prove helpful for future semesters. We were also able to complete several seen experiments, which confirm the results of previous semesters.

The final results submitted here are our best 300 hour seen results. These were collected using the scripts which remove annotations from transcripts. The plan was to submit the results of an unseen experiment, but those experiments were found to take much longer than expected. As such, we chose to submit our seen data results to suggest where our research could have gone if we had more time. The intent was to run one final experiment that incorporated both LDA and the parsing script, which could be useful for future semesters.