
Welcome to the Speech Project Pages

Here you'll find information about the UNH Manchester Speech Project, an undergraduate research project hosted both by the spring Capstone course (COMP 790) and, over the summer, by independent study through the Summer Speech Academy.

History of Speech Recognition

IBM engineer William Dersch demonstrates the Shoebox, 1961.

"In other early recognition systems of the 1950’s, Olson and Belar of RCA Laboratories built a system to recognize 10 syllables of a single talker and at MIT Lincoln Lab, Forgie and Forgie built a speaker-independent 10-vowel recognizer. In the 1960’s, several Japanese laboratories demonstrated their capability of building special purpose hardware to perform a speech recognition task. Most notable were the vowel recognizer of Suzuki and Nakata at the Radio Research Lab in Tokyo, the phoneme recognizer of Sakai and Doshita at Kyoto University, and the digit recognizer of NEC Laboratories. The work of Sakai and Doshita involved the first use of a speech segmenter for analysis and recognition of speech in different portions of the input utterance. In contrast, an isolated digit recognizer implicitly assumed that the unknown utterance contained a complete digit (and no other speech sounds or words) and thus did not need an explicit “segmenter.” Kyoto University’s work could be considered a precursor to a continuous speech recognition system.

In another early recognition system, Fry and Denes, at University College in England, built a phoneme recognizer to recognize 4 vowels and 9 consonants. By incorporating statistical information about allowable phoneme sequences in English, they increased the overall phoneme recognition accuracy for words consisting of two or more phonemes. This work marked the first use of statistical syntax (at the phoneme level) in automatic speech recognition." — B.H. Juang & Lawrence R. Rabiner, Automatic Speech Recognition – A Brief History of the Technology Development

In 1962, at the World’s Fair in Seattle, IBM displayed a device they built called the “Shoebox,” boasting the capability to recognize 16 spoken words: the digits 0–9 plus “plus”, “minus”, and “total”. It could also communicate with an adding machine that processed and printed simple addition and subtraction problems.

The hidden Markov model (HMM) approach to speech recognition was invented by Lenny Baum of Princeton University and shared with ARPA contractors including IBM. The HMM is a complex mathematical pattern-matching strategy that was eventually adopted by many of the leading speech recognition companies, including Dragon Systems, IBM, and AT&T.


The challenge that speech recognition presents is deceptively complex. How hard could it be, right? You understand people even when they have heavy accents or their words are not necessarily in the right order. We humans take for granted an innate ability to communicate that has developed over the span of a hundred thousand years. Computers, on the other hand, are literal machines that can only understand things as being on or off. A computer performing speech recognition does not listen for entire words, but instead listens for segments of words broken up into smaller parts called phones. The reality is that natural speech is not so cut and dried. Depending on the word, its context, or the speaker, the sound of a phone can be drastically different even when its meaning is the same.
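To make the phone-based idea concrete, here is a toy sketch of the HMM decoding mentioned above: phone states, invented transition and emission probabilities, and a Viterbi search that picks the most likely phone sequence for a run of acoustic frames. Every number, phone label, and frame symbol here is made up for illustration; a real recognizer trains these probabilities from audio and uses continuous acoustic features, not three hand-picked symbols.

```python
# Toy HMM phone decoder (all probabilities are invented for illustration).
# States are a tiny hypothetical phone inventory; frames are discretized
# acoustic observations like "quiet" (silence), "burst" (a stop consonant
# release), and "voiced" (a vowel-like sound).

states = ["sil", "k", "ae", "t"]
start = {"sil": 1.0, "k": 0.0, "ae": 0.0, "t": 0.0}

# P(next phone | current phone) — which phone sequences are plausible.
trans = {
    "sil": {"sil": 0.2, "k": 0.8, "ae": 0.0, "t": 0.0},
    "k":   {"sil": 0.0, "k": 0.3, "ae": 0.7, "t": 0.0},
    "ae":  {"sil": 0.0, "k": 0.0, "ae": 0.4, "t": 0.6},
    "t":   {"sil": 0.5, "k": 0.0, "ae": 0.0, "t": 0.5},
}

# P(acoustic frame | phone) — the same phone can emit different sounds,
# which is exactly the variability the paragraph above describes.
emit = {
    "sil": {"quiet": 0.9,  "burst": 0.05, "voiced": 0.05},
    "k":   {"quiet": 0.1,  "burst": 0.8,  "voiced": 0.1},
    "ae":  {"quiet": 0.05, "burst": 0.05, "voiced": 0.9},
    "t":   {"quiet": 0.3,  "burst": 0.6,  "voiced": 0.1},
}

def viterbi(frames):
    """Return the most likely phone sequence for a list of acoustic frames."""
    # v[s] = probability of the best path ending in state s so far.
    v = {s: start[s] * emit[s][frames[0]] for s in states}
    back = []  # back[i][s] = best predecessor of s at step i+1
    for frame in frames[1:]:
        prev_v, v, ptr = v, {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prev_v[p] * trans[p][s])
            v[s] = prev_v[best_prev] * trans[best_prev][s] * emit[s][frame]
            ptr[s] = best_prev
        back.append(ptr)
    # Trace the best path backwards from the most probable final state.
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["quiet", "burst", "voiced", "burst"]))
# → ['sil', 'k', 'ae', 't']  (roughly, silence then the phones of "cat")
```

The point of the sketch is the shape of the computation, not the numbers: the recognizer never matches whole words, it scores competing phone sequences frame by frame and keeps whichever explanation of the audio is most probable.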