Speech: Spring 2012 Report
By Sky Swendsboe
Speech is the newest frontier of computer word processing and external device peripherals. From the keyboard to the mouse to the headset, speech recognition is becoming popular with computer users everywhere. The Capstone class this semester is taking on the task of implementing speech recognition with the Sphinx tools. The goal of this project is to have Caesar and the nine accompanying servers take recorded voice in SPHERE (.sph) files, run it through Sphinx, and output text-file transcripts of the words spoken in each file.
However, this is not just another capstone project: it is one the professor has a strong background in and a strong desire to see succeed. Although he could complete the necessary work himself in less than a tenth of the time, he is using his expertise to help the class complete as much of the project as possible, guiding them all the way through. There is still some trial and error, but the class is making steady progress.
The capstone class is more than just a requirement: it is a challenge to learn something outside the conventional lecture-style class. It makes students think, assigns tasks that must be completed by certain dates, and is goal-oriented throughout. It pushes students to take the initiative, work through scenarios, and reach results under their own power, with the teacher stepping in to provide assistance now and then.
In short, the Capstone class exposes its students to real-world challenges, tasks, and goals, with deadlines, expected results, and professional proof of their work. The class is tough, but it offers categories of work suited to all walks of IT professionals. The whole class is expected to participate in every area, however, so that no one is left without some competence in the project. Overall, the Capstone project is real-world experience without the fear or intimidation of real-world consequences. Approach the class with hard work and teamwork and you will do all right.
Overview of Speech
(Matt) Speech recognition software is a long-term effort to emulate language comprehension by a machine. The idea of a machine that can "listen to" and interpret the spoken word has been thrown around by Hollywood and scientists alike since the 1950s. Progress has been slow but steady, and without much research one can see advancement in the field simply by observing consumer products and their features.
Many cars now offer "hands-free" control over various dashboard features, including the ability to place phone calls to a member of a contact list by speaking predefined commands to the computer. Smartphones also accept voice commands, and some will even transcribe voice into text messages.
PC software such as Rosetta Stone uses speech recognition to teach users new languages and check pronunciation effectively. Nuance's Dragon speech software allows dictation into popular products such as Microsoft Outlook and Microsoft Word. Most operating systems also come with built-in speech recognition software offering similar functionality.
Despite all of these advances, speech recognition has not yet reached its full potential. Each of the aforementioned implementations has its flaws. Some products, such as those used in cars, are designed to "listen" for predetermined speech patterns. This is much easier to implement than true recognition software, and these types of systems will often accept words that are merely "close" to those in their library.
To address these flaws, more research needs to be done. Systems must be trained to listen for a large variety of sounds, known as phonemes. This is a complex, time-consuming process. The goal of students in the Capstone course at the University of New Hampshire in Manchester has been to learn about this process: to perform a training process of their own and gain experience with a tool built specifically for the task, known as Sphinx.
(Evan) The first order of business in getting the hardware situated was to take inventory of the available resources. The students determined what hardware was already installed in each system, then constructed the following table containing each system's specifications and other useful information. This information is valuable because it shows how much storage space each system has and what computing power it possesses, in an easy-to-read, easy-to-access format. Caesar was the main computer for this project, but since a full train would eventually be run across all of the systems, it was necessary to ensure that every system had enough power to keep up.
The table mentioned can be found here.
Setup of the servers involved software updates, as well as moving files to where they needed to be. The complete setup we performed ensured that all servers had the correct hardware, were running the same OS, and shared a networked drive so that the student accounts could be used on each machine. Having the servers communicate with each other is key because the train requires each server to communicate with the main server.
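As a concrete illustration of the shared networked drive, an NFS mount of this kind is typically declared in /etc/fstab on each batch machine. The one-line fragment below is a hypothetical sketch only: the server name "caesar" and the export path are assumptions, not values taken from the class's notes.

```
# /etc/fstab entry on a batch machine (hypothetical sketch)
caesar:/mnt/main   /mnt/main   nfs   defaults   0 0
```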
A network bridge was set up using openSUSE 10.3 and openSUSE 11.3. openSUSE is a free Linux-based OS that was well suited to this procedure. Two different versions were used in order to understand the differences between them and to become familiar with the newer version. The hardware on the two test servers consisted of two NICs on the primary system and a single NIC on the secondary system. The steps taken to accomplish this are located here in the April 9th entry. The bridge was necessary because none of the other servers had an Internet connection, and because UNH did not want each server to have its own connection via a switch, the other servers had to use one server as their point of connection.
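On openSUSE, a bridge of this kind can be declared with a sysconfig interface file. The fragment below is a hedged illustration, not the class's actual configuration: the interface names eth0/eth1 and the use of DHCP are assumptions.

```
# /etc/sysconfig/network/ifcfg-br0 (hypothetical sketch)
BOOTPROTO='dhcp'
STARTMODE='auto'
BRIDGE='yes'
BRIDGE_PORTS='eth0 eth1'
BRIDGE_STP='off'
```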
(Damir) With support for 32-bit and 64-bit systems, openSUSE 11.3 is packed with features, and it is free. All nine of the servers we are using for the Speech project run openSUSE 11.3. For a speech processing system to work, the software environment requires computers capable of handling high processing loads. Because openSUSE requires minimal configuration, it was exactly what we needed. The following requirements should be met to ensure smooth operation of openSUSE 11.3:
- Pentium III 500 MHz or higher processor (Pentium 4 2.4 GHz or higher, or any AMD64 or Intel EM64T processor, recommended)
- Main memory: 512 MB physical RAM (1 GB recommended)
- Hard disk: 3 GB available disk space (more recommended)
- Sound and graphics cards: supports most modern sound and graphics cards, 800 x 600 display resolution (1024 x 768 or higher recommended)
- Booting from a CD/DVD drive or USB stick for installation, or support for booting over the network (you need to set up PXE yourself; see also Network install), or an existing installation of openSUSE; more information at Installation without CD.
The current configuration meets and exceeds these requirements on all systems. Here is one example of a current system:
- 2x Intel® Xeon™ 3.06 GHz processors
- Main memory: 2 GB
- ATI Technologies Inc Rage XL (rev 27) with 16MB video memory
Announced by the openSUSE project about seventeen months ago, the openSUSE 11.3 operating system will reach end of life (EOL) on January 16th, 2012. Through Marcus Meissner, the openSUSE developers announced on November 30th that there will be no more updates for openSUSE 11.3 after that date: the project will stop supplying security fixes, critical fixes, and software updates. The current version of openSUSE is very customizable and easy to use, and roughly every eight months a new version can be downloaded at no charge.
Do we need to upgrade our system? It is not certain that our speech tools would become more efficient or more stable, or that an upgrade would resolve any performance issues. The current system operates well under openSUSE 11.3, for now. However, future Capstone classes must be aware that openSUSE 11.3 will no longer be supported, and the long-term impact of running without support or updates needs to be monitored and considered as the project moves toward completion.
Training & Decoding
The Spring 2012 Capstone class was given the task of running a mini train and decode from a shared drive, without the Sphinx-3 speech recognition software installed locally on the batch machines that would run it. This was made possible by having Caesar, the file server for the batch machines, share part of its hard drive with them. The class set up an NFS shared drive on Caesar and installed the speech recognition software on it, in a directory layout better suited to the class's experiments, along with the resources the batch machines would need. With a soft link from each batch machine's /usr/local directory to the shared drive's /mnt/main/local, the batch machines had access to the proper executables to run the speech recognition software.
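The soft-link step itself is a one-liner. The sketch below reproduces it with throwaway paths under /tmp so it can be run harmlessly; on the real systems the link pointed each batch machine's /usr/local at /mnt/main/local on the NFS share.

```shell
# Stand-in for the shared drive's local directory
mkdir -p /tmp/speech_demo/mnt/main/local/bin

# Point a "usr_local" link at the shared location
# (real command: ln -sfn /mnt/main/local /usr/local)
ln -sfn /tmp/speech_demo/mnt/main/local /tmp/speech_demo/usr_local

# Confirm where the link resolves
readlink /tmp/speech_demo/usr_local   # prints /tmp/speech_demo/mnt/main/local
```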
The Spring 2012 capstone class used the work of the Spring 2011 capstone class and the CMU toolkit documentation (link located below) to learn the steps that were required to create a language model. A test directory was created in /mnt/main/home/sp12/eli54 with all the needed files copied to it. These files are ParseTranscript.pl, lm_create.pl, and test.trans. This was done so no file would be accidentally erased or overwritten.
The first step in creating a language model is to parse the raw transcript file to remove all unnecessary characters. The ParseTranscript.pl script takes the raw transcript file, which in this case is test.trans, and creates a new, clean text file called tmp.text that is needed for the next step. The second script called is lm_create.pl, which runs four executables to create the various files needed for the language model. First, "text2wfreq" is called as "text2wfreq < tmp.text > tmp.wfreq". It uses a hash table to efficiently count the total number of occurrences of each word found in the tmp.text file; unfortunately, the resulting list is non-alphabetical due to the randomness of the hash. "wfreq2vocab" is called next, in the form "wfreq2vocab < tmp.wfreq > tmp.vocab"; it takes the newly created tmp.wfreq and produces an alphabetically ordered list of the words found in the filtered transcript. The next command is "text2idngram", used in the form "text2idngram -vocab tmp.vocab -n 3 -write_ascii < $infilename > tmp.idngram"; replacing words with numeric IDs enables more n-grams to be stored and sorted efficiently in memory. Since the "-write_ascii" flag is used, an ASCII version of the file is created instead of the binary version, so that users can read it. The last command, the one that actually creates the language model itself, is "idngram2lm". The full command is "idngram2lm -idngram tmp.idngram -vocab tmp.vocab -arpa tmp.arpa -ascii_input". It uses tmp.vocab and the ASCII version of tmp.idngram to create tmp.arpa, the actual language model, in a form that users can read.
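The effect of the first two toolkit commands can be illustrated with standard Unix tools. This is only a sketch of what text2wfreq and wfreq2vocab produce (a word-frequency list, then an alphabetized vocabulary), not the CMU toolkit itself, and the tiny input file is invented for the demonstration.

```shell
# Build a tiny stand-in for tmp.text
printf 'hello world\nhello speech\n' > /tmp/tmp.text

# Rough equivalent of text2wfreq: count occurrences of each word
tr -s ' \t' '\n\n' < /tmp/tmp.text | sort | uniq -c > /tmp/tmp.wfreq

# Rough equivalent of wfreq2vocab: an alphabetized word list
awk '{print $2}' /tmp/tmp.wfreq | sort > /tmp/tmp.vocab
cat /tmp/tmp.vocab   # prints hello, speech, world (one per line)
```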
The language model is then checked for accuracy by the decoder.
The Spring 2012 Capstone class used the information gathered by the Summer 2011 independent study group to prepare the data needed for training on the batch machine Traubadix. A new corpus directory was created in the machine's /mnt/main to hold the folders needed for training and decoding. The Switchboard directory in the corpus was created as a soft link to the data's original location; the soft link was used instead of copying the data to reduce the risk of the Switchboard data being damaged or altered.
The data preparation also included cleaning the transcripts and separating out the individual .sph files from the Switchboard disks. This was accomplished using a series of Perl scripts, chiefly GenTrans.pl, which handles everything from cleaning the transcripts to creating the .wav files needed to run the train. The script still has a few flaws, such as not deleting the words used as placeholders for background noise. This is a problem because it causes the train to crash: the trainer cannot make sense of those tokens and fails.
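A cleanup pass of the kind GenTrans.pl is missing could be a simple stream filter. The sketch below assumes the noise placeholders are bracketed upper-case tokens such as [NOISE] or [LAUGHTER]; the actual marker format in the Switchboard transcripts may differ.

```shell
# Invented sample transcript with bracketed noise markers
printf 'hi there [NOISE] how are you\n[LAUGHTER] fine thanks\n' > /tmp/raw.trans

# Strip the bracketed placeholders, then tidy up leftover spaces
sed -e 's/\[[A-Z]*\]//g' -e 's/  */ /g' -e 's/^ //' -e 's/ $//' \
    /tmp/raw.trans > /tmp/clean.trans
cat /tmp/clean.trans
```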
A directory called flat was also created to store all the .sph files from the Switchboard disks. It was created so that eventually the Perl scripts can look for the .sph files in this one directory rather than having to sort through the rest of the data.
The Spring 2012 Capstone class used the work of the Summer 2011 independent study group as a guide to run a train from the batch machine Automatix. On April 24th, the groups ran into a problem with the first Perl script they tried to run, setup_SphinxTrain.pl. With the help of Dr. Jonas, the class changed the line of the script where the $SPHINXTRAINDIR variable is set, from $0 to /mnt/main/root/SphinxTrain-1.0/scripts_pl/setup_SphinxTrain.pl. With this change the class was able to continue running the steps of the train.
The groups were then combined into the final group, the Modeling group. The first problem this group ran into was that the make_feat.pl script was giving an error. Looking through the script, the group found that make_feat.pl was looking for files in the feat directory. After checking the feat directory, the group realized that when the directory had been copied over, its files had not been copied with it. Once the group copied the files into feat and ran make_feat.pl again, the script ran successfully.
The Modeling group ran into another problem when they tried to run the RunAll.pl script, which calls 14 other Perl scripts. The group had to go through each script, starting with verify_all.pl. After searching for the cause of the first error, the group found on a message board that RunAll.pl was being run in the wrong directory: it must be run from the main experiment directory, which for this group was /mnt/main/Exp/0001. After that was fixed, the script got through two phases of the train before hitting another error. From the error message and the script itself, the group determined that verify_all.pl was looking for 0001_train.fileids. After locating the file in /mnt/main/Exp/0001/etc, the group found it was misnamed 0001.fileids. Once the group renamed it to 0001_train.fileids, the train ran to Phase 7, where it hit another error.
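The fix amounted to renaming the fileids file and launching the script from the experiment root. The sketch below recreates those two steps with throwaway paths under /tmp (the real directory was /mnt/main/Exp/0001); the RunAll.pl invocation is left commented out since SphinxTrain is not installed here.

```shell
# Recreate the situation: etc/ holds a mis-named fileids file
mkdir -p /tmp/Exp/0001/etc
touch /tmp/Exp/0001/etc/0001.fileids

# Rename it to the name verify_all.pl expects
mv /tmp/Exp/0001/etc/0001.fileids /tmp/Exp/0001/etc/0001_train.fileids

# RunAll.pl must be launched from the experiment root
cd /tmp/Exp/0001
# perl scripts_pl/RunAll.pl   # (real invocation; requires SphinxTrain)
```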
The Modeling group then added SIL to the 0001.phone file. Once again there was an error, this time a syntax error. Comparing the original script with the one they were running, they noticed two things: first, dirname was missing from the catdir line of the script; second, the path was pointing to the script rather than the directory. The group replaced the path with $0 in all the slave scripts. After that, the train ran until the deleted_interpolation.pl script; the group changed the path in that script and in make_s2_models.pl to $0 as well, and the train then ran through successfully.
A mini train and decode were completed several times with different data by following these steps. The purpose of this task is to take conversations, saved in .wav format, together with their corresponding transcripts, and create a speech recognition tool. The trainer takes the .wav files, the phoneme dictionary, the word dictionary, and the transcripts of the conversations, and matches the audio against the corresponding transcripts. For the trainer to do this correctly, it needs a dictionary containing every word in the transcript, as well as an accurate phoneme dictionary covering every word used in the dictionary.
The decoder is a simple two-step process that checks whether the trainer actually worked and how accurate it is. It first creates a language model and then runs one script, run_decode.pl, which grabs the transcripts and the language model and matches them against each other to check validity. (Decode is NOT finished and might not get finished.)
For the first half of the semester, the class was divided into several groups, each assigned tasks to research the system that was in place and possible future modifications to it.
Hardware configuration was explored to see how the current system was set up using Caesar and the batch servers. The results can be found on the Information page Hardware Configuration.
The system software setup was explored to see how openSUSE 11.3 is used and what differences there are in openSUSE 12.1. The results can be found on the Information page, System Software Setup. The speech software tools that were in place were researched, and online locations of current and newer versions were found. A cloud storage repository for the project was also created on Google Code. The results can be found on the Information pages, Speech Software Functionality and Data Backup.
A network bridge was created to give the nine servers access to the Internet via Caesar for use in future Capstone classes. The original installation was under the root directory on Caesar; since a goal of the class was a better Sphinx install, the speech software was moved under the shared directory on Caesar so that the individual machines could access it. The speech corpus setup group worked on making directories, cleaning up transcripts, converting files to the correct formats, and copying files to the correct locations. The results of this work, including the commands that were run and the steps taken to complete the tasks, can be found at Speech Corpus Setup. The GenTrans information page explains how this Perl script works, what it does, and why it is so useful.
(Johnny Mom) The experiment setup group worked on creating a new experiment directory so that experiments could be run from "/mnt/main/Exp" instead of the "/root" directory, where the previous experiments had been run under the root account. The idea was to be able to run multiple experiments on various servers (methusalix, asterix, miraculix, etc.) on the shared "/mnt" directory, using the user accounts rather than the root account. A script was also created to move existing trains from the "/root" directory to the new experiment directory. The steps required can be found at Experiment Setup.
The last group worked on model building and was further broken up into three subgroups. The data preparation subgroup prepared the data so that a train and decode could be run successfully with the correct files and formats; its results, including how to create a new dictionary, can be found at Data Preparation. The language modeling subgroup documented the steps required to create a language model, with explanations of the scripts needed; its results can be found at Language Model Building Steps. The last subgroup worked on building and verifying the acoustic models, creating a list of steps to complete a mini train and a table explaining all the scripts used to verify an acoustic model; its results can be found at Building and Verifying Acoustic Models.
The second half of the semester was spent getting a mini train to run. First the class was divided into three groups: two were tasked with performing a mini train, which ran successfully with root access on Caesar, while the third finished their individually assigned tasks. The class then worked on getting a mini train to run on the individual machines using individual access, dividing into groups of three students. A few groups were semi-successful and got the first couple of scripts to run. At that point the groups were divided again: half the class took on the responsibility of writing the final report, while the other half worked on getting a mini train to run. The final modeling group succeeded; its results can be found at Modeling Group Log.
As a class of fourteen enthusiastic CIS students, we attempted the ambitious goal of learning the hardware configuration, software setup, and system tools of the speech recognition toolkit. Ultimately, the class's success was driven and shaped by our ability to learn and understand the speech tools our Capstone class set out to master over the semester. As the system tools and setup were housed in a Linux environment, a strong working knowledge of the command line was a must. For some this created a challenging learning curve, but it ultimately forced us to sharpen our skills. In our initial proposal, we emphasized clearly defining the system components and tools for future Capstone classes. Additionally, we intended to run a "mini", or possibly even a "full", train by the end of the Spring 2012 semester.
During the first few weeks of the semester, our class spent the necessary time learning and understanding speech recognition, the tools we were going to use, and the products we intended to produce with those tools. The following weeks were spent defining our system down to every atomic component and software tool. This in itself was a key learning opportunity for the class, as well as a supreme opportunity to refine and polish our knowledge of the system.
Over the last few weeks of the class, we broke up into concentration groups to focus on actually installing and running the speech tools on the eight batch servers rather than the main server, "Caesar", where this had previously been accomplished through the effort of the Summer 2011 class. Using the highly refined software setup procedure, our teams hammered at the eight machines with the ultimate goal of getting one team to accomplish the "mini" train function of the speech tool. During the final two weeks of the semester, the modeling group was able to successfully run a "mini-mini" train and decode on one of the Caesar machines! This was a huge accomplishment for the class and a true testament to its effort and dedication.
Overall, the Capstone class was a great opportunity to learn and understand speech tools and just how complicated the speech recognition software and its setup procedure are. The fruits of our labor were showcased through our persistence and drive to understand the system and to move this massive project forward as outlined in our proposal. Ultimately, this semester pushed the Capstone project further than any previous semester and set the bar for future classes. The project now has an obtainable conclusion, with tremendous real-world experience for anyone who takes the time to learn it and push it to its final phase.