Speech:Spring 2013 Report
Speech recognition addresses one of the more complex problems in computing: understanding the intricacies of human speech and converting it to a digital format. A few speech recognizers have gained prominence, but no system yet transcribes speech perfectly. Understanding and using speech tools is becoming a skill that gives an edge in the computing sector because of what highly accurate speech recognition could make possible. CMU’s open source Sphinx speech recognition software was the toolset used throughout the semester to work toward the semester goals.
The Capstone project is designed to immerse students in group-based projects that emulate a real-world working environment. Students are required to understand advanced concepts of speech recognition while learning to navigate a command-line Linux environment and balancing group dynamics to achieve results. Early in the semester the class was split into many groups, each charged with an individual task such as hardware configuration or software tools. As each group made progress in understanding its contribution to our speech recognition program, the class organically evolved into two large groups competing against one another, each attempting to generate a solid dictionary, a clean speech model, and low word error rates.
Goal of the Semester
(Jake/Charlie) The main goal of the semester was first to modify any areas of the Speech experiment process that could be made more efficient and more reliable. Doing so would also familiarize us with the technology and processes involved in speech modeling. Our class was larger than that of any previous semester. This brought both advantages and disadvantages: we were able to distribute group work more widely than before, but not all members of the class could contribute to experiments simultaneously.
As part of the main goal, we also set out to attempt a 50-hour train and decode at some point during the semester. A large portion of the semester was spent training the entire class to create and run experiments on pre-made corpuses, using older 1-hour experiments as examples. This greatly helped the class understand what issues can arise and how to troubleshoot them. As a precursor to the 50-hour train and decode, the class was split into two groups, each of which attempted a 5-hour train and decode to see how well our refined processes worked. Though not entirely successful, these experiments revealed further issues that called for more investigation and troubleshooting. The 50-hour goal was not achieved this semester, but the foundation for future classes is much stronger than it was when the Spring 2013 semester began.
The current configuration of the systems and file structure is described below, along with what was changed to improve the system.
(Mike Bianchi) The first thing the systems group [the team that worked on updating and maintaining the Capstone file-servers] did was learn the technical specifications of the Dell PowerEdge file-servers [named Caesar, Asterix, Obelix, Miraculix, Traubadix, Majestix, Idefix, Automatix, Methusalix, and Verleihnix]. The systems group also briefly researched which of the servers was connected to the Internet and could be logged into through SSH. The chart from Spring 2012 showed that the server meeting both conditions was Caesar.
As of the writing of this final report, the only things that have not changed about the hardware since Spring 2012 are that the servers still run OpenSUSE 11.3 and that all of the file-servers still have the same processor speed. The other changes were made because the systems group decided that improvements could be made that would benefit the hardware, the operating system, and the Capstone Project as a whole. The expected benefit of the updated operating system [discussed in the Operating Systems section below] is that it would be more up to date, run faster, and be able to process more user tasks (such as training, decoding, and scoring).
The second thing the systems group did was replace two of the CPU fans in the Caesar server. The fans were not spinning, even after the group had taken them apart and tried to clear out whatever was keeping the fan mechanism from rotating. Next, the systems group had to deal with an unplanned change: they installed a new 73GB hard drive in the Caesar server after its existing hard drive failed on Feb. 12, as noted here: [Insert Tyler’s log when added to wiki]. The last improvement the systems group made was to increase the memory in the client machines, which use the Speech Recognition project data stored on Caesar.
The systems group brought all of the client machines up to 4GB of memory so that they can handle more user activity at one time. Caesar, however, still has only 3GB of memory: the systems group found that one of the memory sticks was defective, which left only enough sticks to upgrade the client machines (a decision we made because the client machines handle more activity than Caesar does).
The Caesar server [and its client machines] currently run a Linux operating system known as OpenSUSE, one of many open source Linux distributions. However, the version that Caesar is running, 11.3, is outdated and no longer supported by its developers. Research by a member of the systems group also found that its security features were very basic and not particularly useful: the developers themselves stated in the 11.4 update guide that 11.3 has only a partially functional version of a security program named AppArmor. The drivers for the system are outdated as well.
The Spring 2013 systems group decided to look for a more stable system upgrade that would both benefit the project in Spring 2014 and work with the current server setup. The group wanted a system that would work with the existing RAID drive, have LDAP [Lightweight Directory Access Protocol] and Active Directory capabilities (which are used for handling user logins), and have a newer Linux kernel and updated drivers. The group’s initial decision was to test the newest OpenSUSE release, 12.2, since it fulfilled these requirements and also had improved security features. However, when 12.2 was tested on Caesar after the hard drive failure, none of the needed programs worked well with it. The system also could not connect to the network we had set up for Caesar, and the OS itself ran very slowly.
After temporarily downgrading back to OpenSUSE 11.3, the systems group decided to research other Linux OS options and started to look at the system specs for the newest versions of Fedora and Ubuntu. Research conducted by one of the group members indicated that Fedora had all of the latest driver updates as well as the newest Linux kernel. The research also indicated that Fedora was more secure than both Ubuntu and OpenSUSE because it uses the much stronger SELinux [Security-Enhanced Linux], which can divide the system into “sandboxes”: SELinux confines each program to its own protected part of the system so that a problem in one program does not affect the system as a whole.
Fedora also had the LDAP and RAID capabilities that the systems group wanted, as well as the newest version of Perl, the language used by the data and modeling groups to write scripts for running the training process. After also finding out that UNHM had a special deal with Fedora’s developers, Red Hat, the systems group decided to test Fedora 18 (the newest Fedora release) as the candidate for an OS update and see whether it would work better than OpenSUSE 12.2. Despite a rocky start, which included a low-resolution screen and very small text, the system test on the swappable drives of Caesar was successful: Fedora 18 ran much more smoothly than OpenSUSE 12.2. The system will be updated either over the summer or sometime before the next class in Spring 2014.
Experiment Directory and Setup
The experiment folders generated for training and decoding are located within the /mnt/main/Exp/ directory. Each experiment is numbered in a 4-digit sequence starting from 0001. To record the details of each individual experiment, it is necessary to create experiment logs on the class wiki so that other group members and future classes can understand what the purpose of each experiment was and what issues or successes resulted. The logs are also important for understanding exactly what is being trained and what type of train is run, along with the amount of time that the file was trained. Much of this is accomplished by giving each entry in the experiment section of the class wiki a detailed title that describes the purpose of the experiment and what was used.
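The numbering convention above can be illustrated with a short sketch. It is written in Python purely for illustration (the project's own scripts are Perl), and it assumes that each experiment is a directory directly under the experiment root whose name is exactly four digits. The function name next_experiment_id is hypothetical and is not one of the class scripts.

```python
import os
import re


def next_experiment_id(exp_root):
    """Return the next unused 4-digit experiment number under exp_root.

    Assumption: experiments are directories named with exactly four
    digits (0001, 0002, ...). Anything else in exp_root is ignored.
    """
    four_digits = re.compile(r"\d{4}")
    used = [
        int(name)
        for name in os.listdir(exp_root)
        if four_digits.fullmatch(name)
        and os.path.isdir(os.path.join(exp_root, name))
    ]
    # Start at 0001 when no experiment exists yet.
    return "{:04d}".format(max(used, default=0) + 1)
```

Calling next_experiment_id("/mnt/main/Exp") would return the number to use for a new experiment folder and its wiki log entry, provided the directory follows the layout assumed here.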
The layout of each experiment folder follows the directories that the Sphinx software requires when initiating a train and decode. Previous classes had written their own scripts to create these folders for each experiment; however, it proved more reliable to use the script that ships with Sphinx for creating the appropriate folders: setup_SphinxTrain.pl. We also noticed that some of the scripts that populate experiment folders could be condensed, but this should be done in segments so that any errors in file creation can be weeded out. As an example, genPhones.pl and Make_feats.pl could be combined to run consecutively and help speed up part of the process. Another improvement could be a script that copies entire experiment folders, which would make it easier to duplicate an experiment and run variations of it to see the effects on train and decode. Different portions of each established experiment could be copied so that trial and error goes faster in situations where it is needed to solve an issue.
Sphinx is an open source software package used for speech recognition. The Sphinx toolkit includes software for both training and decoding. The Sphinx website is regularly updated with news related to the toolkit.
Currently Sphinx version 3 is being used. An upgrade proposal has been written that examines the flaws and advantages of the current version and of the new version, Sphinx 4. This document can be found here (Insert link to upgrade proposal here, Bego). Testing the new version of Sphinx is advisable before fully upgrading the system from Sphinx 3 to 4. Since Sphinx 4 is a complete rewrite of the system, existing scripts may not work with it, and the overall process for speech may have to be completely redone. (Bego)
Working with Experiments
The process of training on different corpuses, including the new scripts created, and the decoding and scoring of experiments are discussed below.
(Tyler) This semester the Capstone class has expanded the number of experiments, both by using the corpuses already built on the machines and by creating new, longer ones. Previous semesters got to the point of testing the mini corpus, which represents about one hour of data. This semester we have gone even further, testing 5-hour corpuses built from the first and last five hours of data.
Currently the number of experiments is past 90, which means that we have added about 70 different experiments in trying to improve the system. New scripts were also developed this semester to improve performance in creating and running experiments.
New scripts include:
- pruneDictionary2.pl -- The original pruneDictionary.pl gathers unique words from a dictionary, removes all non-word characters and entries, and feeds the result to the dictionary.pl script, which actually gets the pronunciations. pruneDictionary2.pl is a slight modification of the original that passes its results to dictionary2.pl instead of the old dictionary.pl.
- dictionary2.pl -- This script is a more efficient and more useful version of the original dictionary.pl. The original dictionary.pl kept looking for a word match even after it had found one; dictionary2.pl stops searching once a matching word is found, which dramatically decreases execution time. More importantly, the script keeps track of the words it cannot find pronunciations for and writes them to a file called “add.txt”. If a file called “add.txt” already exists in the working directory, the script appends a number to the end of the filename to ensure that it is unique. The add.txt file streamlines the process of adding words to the experiment dictionary, since one no longer has to run the train, have it fail, and extract each unique word from the train’s logs.
- genTrans2.pl -- Improves upon genTrans.pl by adding better handling of Sox errors. The original script always assumed that Sox completed successfully, so the user would never know if it failed. Because genTrans2.pl prints the error message it gets from Sox and then exits, it is much easier to troubleshoot why it is failing.
- genTrans3.pl -- Improves upon genTrans2.pl with more regular expressions that remove items from the transcripts that would otherwise increase error rates.
- genTrans4.pl -- Improves upon genTrans3.pl by fixing some regular expressions and adding more. Also, the progress dots printed to the command line were replaced with a percentage.
- genTrans5.pl -- Fixes some regular expressions in genTrans4.pl and adds more regular expressions at the request of the project manager.
- updateDict.pl -- This script is used for adding words to the dictionary. It takes an existing .dic file and a .txt file that contains the words to be added. The script does some error checking to make sure the .txt file matches the .dic format, with both words and their pronunciations, and checks for redundant entries. It also updates existing words in the dictionary with a new pronunciation and sorts the dictionary alphabetically before finishing. This script works best when used in conjunction with the “add.txt” output of pruneDictionary2.pl/dictionary2.pl; once the missing words list is generated, pronunciations for the missing words can be added to the file and given to updateDict.pl to add into the experiment’s dictionary.
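The lookup behaviour described for dictionary2.pl can be sketched as follows. This is a Python illustration, not the Perl script itself. It assumes that each master dictionary line is a word followed by whitespace and its phones, and all function names are hypothetical. The report does not say exactly how the number is appended to “add.txt”, so the “add.txt.1”, “add.txt.2” naming used here is an assumption.

```python
import os


def find_pronunciation(word, master_lines):
    """Scan the master dictionary and stop at the first matching word."""
    for line in master_lines:
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == word:
            return parts[1].strip()  # early exit: no further scanning
    return None


def build_experiment_dictionary(words, master_lines):
    """Split words into those with a pronunciation and those without."""
    found, missing = {}, []
    for word in words:
        pronunciation = find_pronunciation(word, master_lines)
        if pronunciation is None:
            missing.append(word)  # candidates for the add.txt list
        else:
            found[word] = pronunciation
    return found, missing


def unique_path(directory, base="add.txt"):
    """Return a path that does not overwrite an existing add.txt.

    Assumption: the suffix format (add.txt.1, add.txt.2, ...) is
    illustrative only.
    """
    candidate = os.path.join(directory, base)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(directory, "%s.%d" % (base, counter))
        counter += 1
    return candidate
```

The missing list is what a group would divide among its members to look up pronunciations before handing the completed file to updateDict.pl.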
Mini Train (1 hour)
For the first 8 weeks of the class, the modeling group worked on familiarizing themselves with the process using the tiny corpus. After becoming comfortable with running a train, they started working on the mini corpus, which represents about 1 hour of data. For the mini train corpus, 25 words needed to be added to the dictionary in order for the train to run. The existing process of adding words to a dictionary was slow and tedious. Although a series of Unix commands would ease this process tremendously, Sphinx expects each entry in the experiment dictionary to follow very specific formatting rules; Unix commands would not catch formatting errors and may even exacerbate existing issues. To ease the process of updating the experiment dictionary with missing entries, a new script called updateDict.pl was created that not only adds entries to the dictionary but also provides error checking for both the added entries and the resulting updated dictionary.
After the first 8 weeks, the groups were reconfigured and the modeling group was tasked with teaching the other students how to run the train process. The mini train was again used for this, so most of the new experiments were built from this corpus. The average error on the mini train corpus was 29.4%.
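The kind of checked merge that updateDict.pl performs might look like the sketch below. It is a Python illustration, not the actual Perl script. The only format rule checked is that every entry has a word followed by at least one phone; the real script's checks are not detailed in this report. The function names and the tab separator in the output are assumptions.

```python
def parse_entry(line):
    """Split a dictionary line into (word, pronunciation) or raise."""
    parts = line.split()
    if len(parts) < 2:
        raise ValueError("entry needs a word and a pronunciation: %r" % line)
    return parts[0], " ".join(parts[1:])


def update_dictionary(dic_lines, add_lines):
    """Merge add_lines into dic_lines and return sorted dictionary lines.

    New words are added, existing words get the new pronunciation, and
    repeated entries collapse because each word is stored only once.
    """
    entries = dict(parse_entry(line) for line in dic_lines if line.strip())
    for line in add_lines:
        if not line.strip():
            continue  # ignore blank lines in the additions file
        word, pronunciation = parse_entry(line)
        entries[word] = pronunciation
    return ["%s\t%s" % (word, entries[word]) for word in sorted(entries)]
```

A malformed line such as a word with no phones stops the merge with an error instead of silently producing a dictionary that would make the train fail later.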
Five Hour Train
(Tyler) This semester we developed two new corpuses for testing: two 5-hour corpuses built from the first and last five hours of the new data. The four modeling groups created midway through the semester were combined into two larger modeling groups, one working on the first five hours and the other on the last five hours. With groups of eleven, it proved a little difficult to find training work for everyone each week. The area where the large groups were most beneficial was dictionary creation. When the new pruneDictionary2.pl script ran, it produced larger lists of words that needed to be added to the dictionary for the train to run. Some experiments needed over 150 words added before they would run. Here division of labor was used: each group member found the pronunciations for a selected portion of the missing words.
Once trains had been created on the 5-hour data sets, language models were created for use with the 30-minute tests. Once a 5-hour language model was created, a new experiment was created using the first thirty minutes of the corpus being tested; in one group’s case, this was the first 30 minutes of the first 5 hours.
Decoding is the process by which we try to recognize the words spoken in the audio files. It is performed by Sphinx 3 and is currently initiated by the run_decode.pl script. The experiment’s files are passed to sphinx3_decode, which logs the process and any errors to a decode.log file that it creates.
Sphinx allows us to decode an experiment with an acoustic model from a different experiment rather than the experiment’s own model. The model to use is given as the second parameter of the run_decode.pl script. This makes it possible to compare different models on the same experiment, or the same model on different experiments.
Decoding can take a while to run; how long depends on a multitude of factors, including the length of the audio and the models used to decode. Decoding is usually the longest part of the modeling, decoding, and scoring process.
Once the decoding is finished, we proceed onto scoring the success of the decoding process -- how well it interpreted the audio based on the acoustic model utilized.
Scoring refers to the process of rating the quality of the models created. The groups used a program called SCLite to generate the scores. SCLite is a tool for scoring and evaluating the output of speech recognition systems. The program compares the hypothesized text output by the speech recognizer to the correct, or reference, text. The two transcripts compared were <experiment #>_train.trans (reference) and hyp.trans (hypothesis). SCLite compares each line of the hypothesis transcript with the reference transcript, counting the number of times a word is substituted for another, an unrelated word is inserted, or a word is deleted.
Before scoring can take place, hyp.trans must be generated from the decode.log file. The decoder combines its output and its status/error text in that single decode.log file. All of the status/error text has to be stripped out, leaving only the decoded sentences. The parseDecode.pl script was used to do this. It places the newly created hypothesis transcript (hyp.trans) into the user’s etc directory.
The scores generated by the class left room for improvement. Word error rates ranged from 22.3 to 67.2 for the one-hour train tests and from 31.9 to 57.6 for the five-hour tests.
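The counting described above is the standard word error rate calculation: substitutions, insertions, and deletions divided by the number of reference words. The Python sketch below shows the idea with a minimum-edit alignment. It is not SCLite's implementation and ignores details such as utterance IDs; the function names are hypothetical.

```python
def align_counts(ref, hyp):
    """Return (substitutions, insertions, deletions) for one
    minimum-edit alignment of two word lists."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # cost[i][j] = (total errors, subs, ins, dels) for ref[:i] vs hyp[:j]
    cost = [[None] * cols for _ in range(rows)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        cost[i][0] = (i, 0, 0, i)  # only deletions
    for j in range(1, cols):
        cost[0][j] = (j, 0, j, 0)  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]  # words match, no error
                continue
            t, s, n, d = cost[i - 1][j - 1]
            substitution = (t + 1, s + 1, n, d)
            t, s, n, d = cost[i][j - 1]
            insertion = (t + 1, s, n + 1, d)
            t, s, n, d = cost[i - 1][j]
            deletion = (t + 1, s, n, d + 1)
            cost[i][j] = min(substitution, insertion, deletion)
    return cost[-1][-1][1:]


def word_error_rate(ref_text, hyp_text):
    """Word error rate in percent for one reference/hypothesis pair."""
    ref, hyp = ref_text.split(), hyp_text.split()
    subs, ins, dels = align_counts(ref, hyp)
    return 100.0 * (subs + ins + dels) / len(ref)
```

For the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat today", three errors against six reference words give a word error rate of 50.0. Because insertions are counted, the rate can exceed 100 when the hypothesis is much longer than the reference.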
This semester’s class was the first group able to run experiments consistently. By the time we began our work, all major hardware, software, scripts, and procedures had already been put in place by previous groups; only minor improvements to certain scripts, documentation, and procedures were needed. We started more than 82 experiments this semester, 4 times more than all previous semesters combined. Of those 82 experiments, about half were designed to familiarize ourselves with the Sphinx system and the existing processes; the remainder were real experiments attempting to improve the quality of our models.
These training experiments focused on three corpuses: the hour-long “mini/train” corpus, the 5-minute “mini/eval” corpus, and the 15-minute “tiny/train” corpus. In a real-world scenario, these corpuses are too small to produce accurate and reliable models; however, because they are small, they are well suited to learning and testing the processes behind creating models.
Once we had learned, documented, and refined the model creation and testing procedures, we created two 5-hour train corpuses, along with a corresponding half-hour testing corpus derived from each of them. These sets of experiments were designed to increase the accuracy of the created models; we have had mixed success in this regard. Our most successful experiment using a 5-hour corpus had an average word error rate of about 31.9; this is not a bad score for models created from 5 hours of audio.
(Vinnie G) Overall, as a group we were able to make several improvements to the project as a whole. The hardware has been significantly improved in many respects, including memory upgrades and replacement of failed or failing hardware. The systems group was also able to pave the way for future groups to move to a newer operating system, Fedora 18, in place of the one currently in use. In addition to the hardware improvements, our team made several improvements to the scripts used in the process of running a train. pruneDictionary2.pl, dictionary2.pl, and genTrans5.pl (plus 3 earlier versions) are the current updated scripts.
With these upgrades (plus various other changes), we were able to make further strides than previous semesters, most visibly in the number of experiments we created. The results of these experiments were also a significant improvement, with our best score being a 31.9 word error rate. In conclusion, we believe our group has produced significant results and accomplished our goals.
Recommendations for Future Capstone Groups
(Mike B.) Look into LDAP [Lightweight Directory Access Protocol] authentication for the new OS. This would keep future classes from having to keep making new accounts on Caesar in order to conduct new training experiments.
(Scott A.) Write a script that runs the entire train process. Rather than requiring many commands to set up a train, a simple script taking a few parameters could make the process easier. sed can be used to edit files outside of an editor, which would solve the Sphinx path setup portion.
(Drew) Add the words that were needed for running the various corpus sets but do not yet exist in the current master dictionary.
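As a starting point for the train script recommendation above, the sed-style path edit could be done with a small substitution routine like the Python sketch below. It assumes the configuration file holds Perl-style assignments of the form $NAME = "value"; and the key name used in the usage note (CFG_BASE_DIR) is an assumption that should be checked against the experiment's actual configuration file. The function name set_cfg_value is hypothetical.

```python
import re


def set_cfg_value(cfg_text, key, value):
    """Replace the quoted value of a `$KEY = "..."; ` assignment.

    Assumption: the setting is a double-quoted string on one line.
    Raises KeyError when no such assignment is found, so a typo in
    the key name does not pass silently.
    """
    pattern = re.compile(
        r'^(\s*\$' + re.escape(key) + r'\s*=\s*)"[^"]*"(\s*;)',
        re.MULTILINE,
    )
    new_text, count = pattern.subn(
        lambda m: m.group(1) + '"' + value + '"' + m.group(2), cfg_text
    )
    if count == 0:
        raise KeyError(key)
    return new_text
```

A wrapper script could read the configuration file, call set_cfg_value(text, "CFG_BASE_DIR", "/mnt/main/Exp/0091") with the new experiment's folder, and write the result back before starting the train.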