Speech:Spring 2011 Report



Introduction
The University of New Hampshire at Manchester Capstone experience aims to involve students in a research group. The students work collaboratively on individual assignments to accomplish a specific goal. Tasks are delegated by the students themselves, while the professor acts as the project leader. The capstone project was designed to challenge students and to help them develop skills necessary to excel in the workplace. Students work together to understand the overall problem, and then outline specific goals and benchmarks to solve it. The remainder of the project is spent researching, developing, testing, and implementing methods to reach the desired solution.

Professor Michael Jonas chose speech recognition as the subject of the capstone experience. The goal was to construct a system that converts recorded audio into readable text. Speech recognition, although not an entirely new concept, has inspired an exciting field of research with much room for growth and improvement. Developing a speech recognition system is thus a great project, as it takes students out of their comfort zone and pushes them to innovate.

Overview of Speech
It has been an overarching goal of researchers in the field of speech recognition to develop software that is fast and reliable. Presently, few technologies provide the functionality that the field of speech processing demands. Better software design and an increase in hardware capabilities have fostered advances in the field, but improvements and innovation are necessary to realize a technology that fully meets speech recognition professionals’ expectations.

The field of speech processing, at its inception, utilized very expensive hardware. Despite its great cost, the hardware available did not provide for a reliable system, which meant speech recognition, at the time, was not commercially viable. But technology has since improved. The continual development of speech processing algorithms, combined with the exponential increase in the processing power of computers, has helped spawn speech recognition systems affordable to the average computer user. There is now a wide range of software offerings associated with speech processing systems. They range in price from free to very expensive, and vary in complexity to suit users of all skill levels. The field of speech processing has come a long way, but there is, and always will be, room for improvement.

For a speech processing system to work, the software environment requires computers that are capable of handling high processing loads. First, this means using a high-speed CPU; most high-end consumer processors are fast enough to perform small tasks. Next, ample RAM is required to process the data in a timely manner. Finally, a good sound card is necessary to digitize the analog audio signal so that the system can recognize it.

In a basic speech recognition system, a microphone is used to record voice. The microphone captures the sound as an analog signal, which the sound card then converts into a digital signal. The digital signal is binary, consisting of 1s and 0s; this is what the system actually recognizes, and it is how the computer "views" sound.

Speech recognition software then uses acoustic models to convert voice into separate elements called phonemes. Phonemes are the small segments of sound that make up spoken words (for example, "dog" breaks down roughly into "d", "aw", "g"). The system also has to filter the voice, removing background noise and keeping only the words that were actually said; this is what the system converts to text. In sum, voice is converted to analog, then to digital, and then, through the software, to phonemes. Once the voice is split into phonemes, it is compared against a "dictionary" stored in the system's memory. The dictionary has all of its words split into phonemes as well. Once the system finds a match for the digital phonemes, it displays that word for the user to see.
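
To make the dictionary idea concrete, entries in a CMU-style pronunciation dictionary map each word to its sequence of phonemes, along these lines (stress markers omitted; the entries below follow the CMUdict convention):

    DOG        D AO G
    SPEECH     S P IY CH
    RECOGNIZE  R EH K AH G N AY Z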

Materials
Professor Jonas has significant experience developing speech technologies and is familiar with common tools that make speech processing more manageable.

A bank of nine servers was provided by Professor Jonas to house the speech software. Each server contains a quad-core Xeon processor with * GB of random access memory, and openSUSE Linux was installed on each system. Professor Jonas took the liberty of naming each server after the popular international comic "The Adventures of Asterix". The servers are named Caesar, Asterix, Obelix, Miraculix, Troubadix, Majestic, Idefix, Automax, Methusalix, and Verlehnix, and each was configured with the same login and password. The primary access to these systems is via SSH; Secure Shell makes access from the outside world (the Internet) possible without allowing personal data to be extracted. This is a console-based interface, so there is a steep learning curve, but there is plenty of information available about how to use a terminal session with Linux.

Sphinx is a widely used open source toolkit developed at Carnegie Mellon University. This software suite includes tools for processing virtually any type of sound into readable text, and there is a large amount of information about how to use the tools. The best resource for information on downloading, installing, and using Sphinx is the CMU Sphinx website.

Server Setup
The students began work on the project by setting up the physical servers that would serve as the environment for the speech recognition software. The nine servers provided for this project were set up individually, and each student was assigned a dedicated server where they could test, develop, and explore. The servers run a Linux distribution called openSUSE. Sphinx 4 was installed on each server as the primary program for speech recognition. It is written entirely in Java and is thus compatible with multiple operating systems, making it easy for users with operating systems other than Linux to still use Sphinx. All servers were initially expected to be connected directly to the Internet. This proved to be problematic, though: the costs associated with the number of ports required led the team to seek an alternative, since the Internet connection had to go through one port. The initial solution was to use a switch connected to that single port and to share the connection with the other servers by incorporating tools such as port forwarding. This idea was eventually dismissed, as port forwarding is not supported by the ISP. Due to these constraints, the primary server, Caesar, was connected to the Internet and networked with the other servers. SFTP was then utilized to move necessary files from one server to the next.
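
As a concrete illustration of that file movement (the hostname, username, and paths here are hypothetical), a corpus archive could be pushed from Caesar to another server over the internal network with a single command:

    scp switchboard_audio.tar.gz student@obelix:/home/student/speech/

or transferred interactively with the sftp client:

    sftp student@obelix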

Speech Tools Setup
In order to run training, decoding, and ultimately testing of a speech recognition system, it is important to have software tools in place to facilitate this research. The software tools developed consist of Perl scripts written by various team members. These scripts are discussed in more detail in the Training and Decoding sections. Ultimately, their purpose is to automate tasks that would otherwise take extensive amounts of time to perform, such as cleaning up transcripts, creating a dictionary, and recreating an experimental directory structure for recording experiments.

Demo
Sphinx 4 comes with various demos that can be run to test whether the software is working properly. The most basic is the Hello World demo. A device for recording sound is required; the test consists of speaking into the device and the software translating what was said into readable text. For example, the demo gives you a set of words used to compose a greeting, asks for some combination of them, and then attempts to interpret the greeting that was spoken. The main purpose of the demos is, first, to demonstrate program functionality and, second, to provide insight into how the software works, so that you can change the outputs and manipulate it as you see fit for your current project.
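
As a sketch of how the demo is typically launched (the jar path follows the layout of the standard Sphinx 4 binary release and should be verified against the installed version):

    java -mx256m -jar bin/HelloWorld.jar

The program then prompts for speech and prints its best hypothesis of what was said.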

Concepts
Parallelization is the practice of breaking a complex procedure down into smaller components and distributing the processing load across multiple CPUs or computers. By allowing multiple parts of a lengthy procedure to run simultaneously, the overall time to complete a process can be reduced significantly. At the start of the capstone course there was a prevailing notion that, if it could be done, the nine servers could be configured to share the computational load of the lengthy training process, which might have shortened it by as much as a day.

One major method of parallelization is to split a larger process into its individual components and have each component run in its own thread, a technique called multithreading. These component threads operate in a synchronized fashion to push the process through each of its computing steps in an orderly manner. Without significant planning, bottlenecks can form in a parallelized program, especially if the threads' computing loads differ significantly. If a process cannot be split into smaller components without crippling bottlenecks, multithreading can be totally useless, which seemed to be the case here: the training process for Sphinx is an enormous monolithic process that has not been optimized for multithreaded operation.

The second option for parallelizing the training process was to break the dataset into smaller chunks and run the training independently on all nine systems. By running the process independently, the overall time should theoretically be brought down to a fraction of the original length, about 1/9th. This idea relies on the notion that each acoustic model is a set of distributions that could somehow be merged. This concept, while promising, was not explored because a complete training run was never finished. Had a train been completed, it might have confirmed the feasibility of a merge process, most likely through the use of a script.
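
As an illustration of the second approach, splitting the training file list into nine roughly equal chunks, one per server, might look like the hypothetical Perl sketch below (this is not one of the team's scripts):

    #!/usr/bin/perl
    # split_fileids.pl -- hypothetical sketch: divide a .fileids listing into
    # N roughly equal chunks so each server can train on its own slice.
    use strict;
    use warnings;

    my ($fileids, $n) = @ARGV;
    die "usage: $0 <fileids> <num_chunks>\n" unless $fileids && $n;

    open my $in, '<', $fileids or die "cannot open $fileids: $!";
    my @lines = <$in>;
    close $in;

    for my $i (0 .. $n - 1) {
        # round-robin assignment keeps chunk sizes within one line of each other
        open my $out, '>', "chunk$i.fileids" or die "cannot write chunk$i: $!";
        print {$out} @lines[grep { $_ % $n == $i } 0 .. $#lines];
        close $out;
    }

Each server would then train on its own chunk, leaving open the question of how the resulting models would be merged.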

Technology (Torque)
After researching parallelization for several weeks, we found a possible option for parallelizing the training process sitting right under our noses: the Sphinx training system has support for the Torque clustering system. In this system, several computers are clustered together using the Torque PBS software. The program or user that uses the cluster provides a task and the resources it requires, and Torque allocates those resources across the cluster. In this case, the user modifies the configuration file for the SphinxTrain utility that we use to train our acoustic models, specifying that the Torque queue should be used rather than the cores of the local system, along with the name of the queue to use. The SphinxTrain software then requests and schedules its required tasks on the Torque queue. Unfortunately, because we did not find this option until late in the semester, and then had issues connecting the systems to the queue due to our lack of familiarity with the software's configuration, we did not have time to explore this option. Whether this option works, how it works, and to what extent it speeds up the training process all remain unknowns to which answers will hopefully be found this summer.
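
Based on our understanding of the SphinxTrain configuration, the change amounts to a couple of lines in the sphinx_train.cfg file; the variable names below reflect that understanding and should be verified against the documentation for the installed version:

    # etc/sphinx_train.cfg -- settings as we understood them (verify against
    # the SphinxTrain documentation before relying on them)
    $CFG_QUEUE_TYPE = "Queue::PBS";   # schedule jobs on the Torque/PBS queue
                                      # rather than on the local cores
    $CFG_QUEUE_NAME = "speechq";      # hypothetical name of the cluster queue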

Language Models
There are two different types of models involved in training: acoustic and language. Acoustic models are created by taking an audio recording of speech and its text transcription counterpart; Sphinx is then used to create statistical representations of the different sounds that make up each of the words within the transcription. The result is the acoustic model. Language models use statistical analysis to predict what the next word in a sequence is going to be, and are also used to correctly interpret words that may not have been clearly pronounced. Language models and acoustic models are used in tandem to complete the task of speech recognition.
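
As a hypothetical illustration of the statistics behind a language model, the short Perl sketch below counts adjacent word pairs (bigrams) in a transcript; counts like these are the raw material for estimating how likely one word is to follow another:

    #!/usr/bin/perl
    # bigram_counts.pl -- illustrative sketch: tally adjacent word pairs in a
    # transcript, the basis for the probabilities in an n-gram language model.
    use strict;
    use warnings;

    my %bigram;
    while (my $line = <>) {
        my @words = split ' ', lc $line;
        $bigram{"$words[$_] $words[$_ + 1]"}++ for 0 .. $#words - 1;
    }

    # print the pairs from most to least frequent
    print "$_\t$bigram{$_}\n"
        for sort { $bigram{$b} <=> $bigram{$a} } keys %bigram;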

Training
Training is essentially teaching the computer to recognize spoken words. This sounds like a fairly straightforward process, but doing it from scratch would take quite a bit of time and effort. Fortunately for us, we are using Sphinx, and it does this quite well given the correct setup. To train Sphinx, you feed the trainer recorded spoken words along with the text representation of those words. To decode (discussed in more detail in the Decoding section), you take the same kind of recorded spoken words and have the computer output the text that represents them. This process is especially tricky when different people, who may be saying the same words, either pronounce the words differently or speak at a different pace.

Training proved to be one of the most difficult tasks to complete. Training and decoding were completed using the AN4 data set, and some team members attempted to complete a train using original sources. All involved came away with a greater understanding of the training process and how to navigate through it; this work will be expanded on, and hopefully completed, this summer. In order to run a train, several configuration files had to be created. We wrote a number of scripts, each of which is able to generate one of these files when given another file as input.

One such script, the SwitchboardToDecoderConverter script, was written for the purpose of processing the transcriptions. The transcriptions provided by Switchboard were not in a format that the Sphinx decoder could interpret. The format Switchboard provided takes the following general shape (the times and utterance text below are illustrative, reconstructed from the description that follows):

    sw2039A-ms98-a-0086 0.98 4.27 okay so "they" said [noise] that was fine

The first part of this example (sw2039A-ms98-a-0086) is the utterance ID. The capital A in the utterance ID refers to the speaker; because the transcripts were taken from telephone conversations, the speaker can be one of two people, either A or B. After the utterance ID, the first decimal number is the start time of the utterance and the second decimal number is the stop time. Immediately after the stop time is the actual utterance. Some words were encased in quotes; these quotes marked the words for which phonetic transcriptions had already been created. The Capstone class, however, was not using those previously made transcriptions and had no need for the quotes.

To convert the Switchboard transcript into the format that the decoder needed, the Perl script read the transcript file line by line. A lot of data on each line needed to be removed, such as the quotes previously mentioned and any brackets, periods, or background noise represented in the text file by [noise]. The start time and stop time also had to be removed from the transcript. However, before these times could be removed, they had to be used to generate a smaller audio file containing only the current line of the transcription. After all of the superfluous characters had been removed, a new file was created containing the revised transcript utterance. The file name, a shortened version of Switchboard's utterance ID, was appended to the end of each utterance.
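
A minimal sketch of that cleanup logic is shown below. This is a simplified illustration, not the team's actual script, and the shortened-ID rule and output format follow the usual SphinxTrain transcript convention of "<s> words </s> (file_id)":

    #!/usr/bin/perl
    # clean_transcript.pl -- simplified sketch of the Switchboard cleanup:
    # strip noise markers, quotes, brackets, and periods; set the times aside;
    # append a shortened utterance ID in SphinxTrain transcript style.
    use strict;
    use warnings;

    while (my $line = <>) {
        chomp $line;
        my ($utt_id, $start, $stop, $text) = split ' ', $line, 4;
        next unless defined $text;

        $text =~ s/\[noise\]//g;     # drop background-noise markers
        $text =~ s/["\[\]().]//g;    # drop quotes, brackets, and periods
        $text =~ s/\s+/ /g;          # collapse leftover whitespace
        $text =~ s/^ | $//g;

        # $start and $stop would be used here to cut the per-line audio file

        my $short_id = $utt_id;
        $short_id =~ s/-ms98-a-/-/;  # hypothetical shortening of the ID

        print "<s> $text </s> ($short_id)\n";
    }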

The FileID script generated a file with the extension ".fileids". This file contained a listing of all of the audio files and their paths in the system, and was used by Sphinx as a lookup table to find the audio file represented by each line of the transcript.
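
A .fileids file is simply one path per line, written without the audio file extension; the paths below are illustrative:

    swb1/sw02039/sw2039A-0086
    swb1/sw02039/sw2039A-0087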

Another script that was written was the Dictionary script. This script's job was to make a pruned copy of the large dictionary we found on the CMU Sphinx website, containing only the word and its phonemes for each unique word in the transcript.
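
A condensed sketch of that pruning logic (again illustrative, not the team's exact script):

    #!/usr/bin/perl
    # prune_dict.pl -- illustrative sketch: keep only the dictionary entries
    # for words that actually occur in the transcript.
    # usage: prune_dict.pl big_dictionary transcript > pruned.dic
    use strict;
    use warnings;

    my ($dict_file, $trans_file) = @ARGV;

    # collect the unique words used in the transcript
    my %wanted;
    open my $trans, '<', $trans_file or die "cannot open $trans_file: $!";
    while (<$trans>) {
        $wanted{uc $_} = 1 for split ' ';
    }
    close $trans;

    # pass through only the dictionary lines whose head word is wanted
    open my $dict, '<', $dict_file or die "cannot open $dict_file: $!";
    while (my $entry = <$dict>) {
        my ($word) = split ' ', $entry;
        print $entry if defined $word && $wanted{uc $word};
    }
    close $dict;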

Finally, the last script was the Phoneme script. Its entire job was to generate a listing of the unique phonemes in the dictionary. This file is most likely used by Sphinx so that it has a reference of which phonemes it is training.
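
The resulting phone list is just one phoneme per line; following the AN4 example, it also includes a silence phone, along these lines (abbreviated):

    AA
    AE
    AH
    ...
    SIL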

After all of these scripts were run, there were still a few minor modifications to make before a train could be run. First, multiple lines needed to be edited in the sphinx_train.cfg file, including specifying a project name and an exact path to the project in the system. Also, a filler dictionary (which, in the standard SphinxTrain setup, maps non-speech markers such as <s>, </s>, and <sil> to silence) had to be created. After both of these steps were done, each of the created files had to be renamed according to the specification in the sphinx_train.cfg file so that Sphinx would be able to find them when the training was run.
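
For reference, a standard filler dictionary, following the convention of the AN4 example files, looks like:

    <s>    SIL
    </s>   SIL
    <sil>  SIL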

After all of the above configuration was done, the project was finally in a state where training could be run. First, the script make_feats.pl had to be run with the fileids file as an argument (scripts_pl/make_feats.pl -ctl fileidsfile). This completed the first part of the training by generating, from the .sph audio files, the features that would be used. After this had been done, the RunAll script could be run to complete the rest of the training process.

While we did not get the training of Switchboard to a runnable state during the class, we were able to run one of the demo trains provided by CMU. To do this, we had to run "perl scripts/setup_tutorial.pl an4" to copy the directory structure for a train (certain files had to be in certain folders). Once this had been done, the AN4 files were downloaded from the CMU website and extracted in this directory, after which the make_feats.pl script and the RunAll.pl script could be run to perform a train. This demonstration was useful as a test that Sphinx was set up properly. The configuration files provided with AN4 also served as a useful reference for how the configuration files for a train should look, and these examples have informed the design and production of the scripts discussed above.
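
In summary, the demo train amounted to the following sequence (script paths as we used them; the download step is a placeholder):

    perl scripts/setup_tutorial.pl an4
    # download the AN4 data from the CMU website and extract it here
    perl scripts_pl/make_feats.pl -ctl <fileids file>
    perl scripts_pl/RunAll.pl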

Decoding
Within the development of an accurate decoder there are two closely related processes: training and decoding. Decoding depends on training in that an accurate decoding session relies upon the results of an extensive training process. The decoding process is essentially what comes to mind when the layman hears the phrase "speech recognition": it is the process of a computer converting speech audio into its textual equivalent. Speech decoding is not a new technology, but it is still a rather fascinating man-machine interface that is seeing ever-increasing application in the technology of today. In fact, there are a variety of applications available on the market today that provide decoding functionality. The technology used in the UNHM capstone is quite advanced and highly capable; one of the most powerful aspects of the Sphinx platform is the ability to customize the decoder through user-controlled training procedures.

Conclusion
Despite various roadblocks throughout the length of the course, the teams were able to establish a framework for future Capstone classes to perform research in speech recognition. Learning to navigate and operate a Linux shell was a major skill emphasized throughout the course. Many of the capstone students came from an exclusively Windows-based background, which made the transition challenging, but perseverance rewarded them with experience and knowledge of some of the most powerful operating systems available to an end user. Although not all goals outlined in the initial proposal were fully realized, substantial steps were taken to further UNHM's ability to facilitate experiments in speech recognition.

Nine servers were set up with the Sphinx speech tools. They were loaded with audio files from the LDC Switchboard corpus along with the correlating textual transcripts. A test database for recording each step involved in testing was developed. Perl scripts were produced to massage this data into a usable format for interpretation by Sphinx: a script to separate the audio files into portions that align with their text representations, which can be used to transform output into a format essential for training; a script to produce a language model of the transcripts, which will greatly speed up training in the future; and a script to pull all of the unique words from the transcripts, along with their pronunciations, which can be used to produce a viable dictionary. These accomplishments will aid in running full-scale training during later semesters. The achievements are well documented on the Wiki so that future researchers can benefit from them. Classes in years to come will be able to improve upon these utilities, use them to move further through the process of developing speech recognition tools, and conduct experiments.

Speech recognition has been around for some time now, and many students can visualize it being involved in our everyday lives. From Star Trek to The Matrix, it has seemed like a novelty. The truth is that, with such a diversity of languages and such cultural diversity within a single language, speech recognition is still a cutting-edge science. This project was a huge learning experience for everyone involved, and hopefully it built a good foundation for future Capstone students to work from.