Speech:Switchboard



Audio Corpus Data
The audio comes from a set of corpus DVDs containing 2,438 audio files that amount to 259 hours of audio. The files are two-channel, 8 kHz, SPH (NIST SPHERE) files. The files vary in size and duration because each one represents a single phone conversation between two people over a telephone line.

The audio data is stored in 23 directories on Caesar (the release used 23 CDs). The audio data can be found here: /mnt/main/corpus/switchboard/dist
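Since the corpus files are NIST SPHERE (.sph), each one begins with a plain-text header that records properties like channel count and sample rate. A minimal sketch of reading that header is below; the `channel_count`/`sample_rate` keys are standard SPHERE header fields, and the example header bytes are synthetic (constructed here for illustration, not read from the corpus):

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the plain-text NIST SPHERE header that starts every .sph file."""
    lines = raw.decode("ascii", errors="replace").splitlines()
    assert lines[0].strip() == "NIST_1A", "not a SPHERE file"
    fields = {}
    for line in lines[2:]:              # line 2 of the header is its size (e.g. 1024)
        if line.strip() == "end_head":  # marks the end of the header
            break
        key, typ, value = line.split(None, 2)
        # -i marks an integer field; other types are kept as strings here
        fields[key] = int(value) if typ == "-i" else value
    return fields

# Synthetic header for illustration (a real Switchboard file is 2-channel, 8 kHz):
example = b"NIST_1A\n   1024\nchannel_count -i 2\nsample_rate -i 8000\nend_head\n"
hdr = parse_sphere_header(example)
print(hdr["sample_rate"], hdr["channel_count"])   # -> 8000 2
```

In practice you would read the first kilobyte or so of an .sph file under /mnt/main/corpus/switchboard/dist and pass those bytes in.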

LDC released a newer version of the corpus in 1997. The new version contained error corrections to the data and updated the header of the sphere files to reflect the new release. Speech is currently using the second release.

Source: http://www.elsnet.org/list/sep97/4.01Sep97.html

Note: our current audio corpus (as of 3/26/2018) has 311 hours' worth of audio, split into a 5-hour dev test set, a 5-hour eval test set, and four training sets of 5, 30, 145, and 300 hours.

Audio Corpus Related Readings
Catalog page with more information on the switchboard audio data: https://catalog.ldc.upenn.edu/LDC97S62 (current as of 3/26/2018)

IBM reports a WER of 5.5%. https://www.ibm.com/blogs/watson/2017/03/reaching-new-records-in-speech-recognition/

The IBM article noted that most of the speakers in their training set also appeared in their test sets. Two papers offered differing opinions on whether this could be regarded as cheating (and contain other useful information on speech recognition in general):

https://arxiv.org/pdf/1708.08615.pdf

https://arxiv.org/pdf/1703.02136.pdf

Transcription Corpus Data
The transcription text files we have represent the latest manually corrected release (1/29/03) of the Telephone Speech Corpus and can be downloaded here: transcript. Once this file is extracted, the data is organized similarly to the audio, with folders containing subfolders. Each subfolder contains four files: a transcript for channels A and B, and word files for channels A and B. Transcript files are organized by utterance, and word files are organized by word. For the purposes of capstone, the word files are not used.

It is unclear how the data was organized on Caesar prior to Spring 2014. As of Spring 2015, a copy of the 3/21/01 release is located here: /mnt/main/corpus/switchboard/dist/master_trans.

The transcription comes to around 518 hours of transcripts, which should be roughly twice the amount found in the audio files because the transcript is split into A and B channels. A single channel's transcript is 259 hours long, but the audio data is 255 hours long. This is because the audio files and the transcript files come from different sources, so for the purposes of capstone the following transcript files have to be removed.

Transcript file numbers to remove for capstone: 2289, 2716, 2717, 2718, 2719, 2720, 2721, 2722, 2723, 2724, 2725, 2726, 2727, 2728, 2729, 2730, 2731, 2732, 2733, 2734, 2735, 4361, 4379
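One way to apply the remove list above is to filter transcript filenames by their conversation number. A small sketch, assuming the `sw####X-...` filename pattern used in the transcript release (the exact file naming is an assumption about the release layout):

```python
import re

# Conversation numbers this page says to drop so the transcripts match the audio.
REMOVE = {2289, 2716, 2717, 2718, 2719, 2720, 2721, 2722, 2723, 2724, 2725,
          2726, 2727, 2728, 2729, 2730, 2731, 2732, 2733, 2734, 2735, 4361, 4379}

def keep_transcript(filename: str) -> bool:
    """Return True unless the sw#### conversation number is on the remove list."""
    m = re.match(r"sw(\d{4})", filename)
    return m is None or int(m.group(1)) not in REMOVE

print(keep_transcript("sw4927B-ms98-a-trans.text"))  # -> True
print(keep_transcript("sw2289A-ms98-a-trans.text"))  # -> False
```

You could walk the extracted transcript tree with `os.walk` and skip any file for which `keep_transcript` returns False.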

For a more in-depth breakdown of transcript data compared to audio data, an Excel spreadsheet can be found on Caesar in /mnt/main/corpus/Transcript_Spreadsheet/

Transcript Files
One transcript file represents half of a spoken conversation. Each line in a transcript file can represent one utterance from a speaker, or a specified amount of time the speaker is silent. An utterance can be a line of dialog or a noise that the speaker is making or both.

Example of a dialog utterance:
sw4927B-ms98-a-0007 50.531500 53.172375 the idea itself of service is good

Example of a noise utterance:
sw4927A-ms98-a-0070 297.363875 297.858000 [vocalized-noise]

Example of silence:
sw4927B-ms98-a-0026 190.269875 192.835625 [silence]

Each line of dialog starts with a header containing the file name, corpus, and line number. The next two items on the line are the start and stop times of the utterance in seconds. The rest of the line is the transcript of the utterance.

Ex: sw4927B-ms98-a-0007 50.531500 53.172375 the idea itself of service is good
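Because every transcript line follows this fixed shape (id, start time, stop time, words), it can be split with whitespace alone. A minimal parsing sketch:

```python
def parse_utterance(line: str):
    """Split one transcript line into (utterance id, start sec, end sec, text)."""
    utt_id, start, end, *words = line.split()
    return utt_id, float(start), float(end), " ".join(words)

line = "sw4927B-ms98-a-0007 50.531500 53.172375 the idea itself of service is good"
utt_id, start, end, text = parse_utterance(line)
print(round(end - start, 6))   # utterance duration in seconds -> 2.640875
```

Summing `end - start` over every line of a transcript file is one way to cross-check the per-channel hour counts quoted above.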

The transcript contains notation for unintelligible or unspoken dialog; these are usually contained between brackets, but they can also indicate partial words.

Transcript Notation      Spoken Audio
-[ha]ppy                 ppy
-[p]oppin[g]-            oppin
[laughter-bongs]         bongs said while laughing
[compooter/computer]     compooter
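A rough sketch of mapping these notations to what was actually spoken; the rules below are inferred from the examples in the table above and are not a complete implementation of the transcription conventions:

```python
import re

def spoken_form(token: str) -> str:
    """Map one transcript token to (an approximation of) what was spoken."""
    if token in ("[silence]", "[noise]", "[vocalized-noise]"):
        return ""                          # non-speech markers: nothing spoken
    m = re.fullmatch(r"\[laughter-(.+)\]", token)
    if m:
        return m.group(1)                  # word said while laughing
    m = re.fullmatch(r"\[(.+)/(.+)\]", token)
    if m:
        return m.group(1)                  # pronounced variant, not the intended word
    # Partial words: drop the unspoken bracketed part and surrounding hyphens.
    return re.sub(r"\[[^]]*\]", "", token).strip("-")

print(spoken_form("-[ha]ppy"))              # -> ppy
print(spoken_form("[laughter-bongs]"))      # -> bongs
print(spoken_form("[compooter/computer]"))  # -> compooter
```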

The transcript also has notation for made-up words; these are represented by a word surrounded by {}. Ex: {chowser}

Other notations found in the transcripts include: _1, i-

Corpus File Structure
Below is the file structure that all switchboard data corpora follow:



WER Benchmark
Using Gaussian Mixture Models, Switchboard has a benchmark WER of 25.2% as of 2015.

Source: http://recognize-speech.com/acoustic-model/knn/benchmarks-comparison-of-different-architectures


 * ***ADD IMAGE OF RESULTS - Jon Shallow***

Information on the 2011 improvements that significantly reduced WER (using artificial neural networks). ***ADD THIS IN - Jon Shallow***