Speech:Models LM Build
- Model Building - more info on data prep, [language models], & building models
Model Building: Language Model Building Steps
This page describes the initial setup and data preparation needed to build statistical language models, generate a robust set of acoustic models (training), and verify them by testing (decoding) on the trained corpus. For detailed steps on how to train and decode, see the sub-steps under Model Building above.
Parsing the Transcript
The first step in building a language model is to clean up the raw transcript file by removing all unwanted characters from it. To do this, run the parseNLTrans.pl script from the /mnt/main/corpus/dist/Switchboard/transcripts/ICSI_Transcriptions/trans/icsi directory. The command to execute the script is "perl parseNLTrans.pl test.txt tmp.text", where test.txt is an example name for a raw transcript file and tmp.text is the filtered transcript. Both the input and the output have to be text files, because the text2wfreq command that is called in lm_create.pl requires a text file as input. The result of running this script is a copy of the transcript that contains only what was said in the audio files.
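The exact filtering rules in parseNLTrans.pl are not listed on this page, so the following Python sketch is only an illustration of the kind of cleanup described above. The rules it applies (dropping bracketed markers such as [noise] and any character that is not a letter, apostrophe, hyphen, or space) and the function names clean_line and clean_transcript are assumptions, not the script's actual behavior.

```python
import re

# Assumed rules, for illustration only; parseNLTrans.pl may filter differently.
BRACKETED = re.compile(r"\[[^\]]*\]")      # bracketed markers such as [noise]
UNWANTED = re.compile(r"[^A-Za-z' \-]")    # anything but letters, ', -, and spaces


def clean_line(line: str) -> str:
    """Return the line with markers and unwanted characters removed."""
    line = BRACKETED.sub(" ", line)
    line = UNWANTED.sub(" ", line)
    return " ".join(line.split())          # collapse runs of whitespace


def clean_transcript(in_path: str, out_path: str) -> None:
    """Write a filtered copy of a raw transcript, skipping empty lines."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for raw in src:
            cleaned = clean_line(raw)
            if cleaned:
                dst.write(cleaned + "\n")
```

Under these assumed rules, calling clean_transcript("test.txt", "tmp.text") would play the same role as the perl command above: plain text in, plain text out.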
Creating the Language Model
The next step in creating a language model is to run the lm_create.pl script. This Perl script calls four executables in turn. The first of these is text2wfreq, invoked as "text2wfreq < tmp.text > tmp.wfreq". text2wfreq reads the filtered transcript file that parseNLTrans.pl created and uses it to create a file that contains the frequency of every word in the transcript. After that, wfreq2vocab is executed in the form "wfreq2vocab < tmp.wfreq > tmp.vocab". wfreq2vocab takes tmp.wfreq as input and creates an alphabetical list of every word that was found in the transcript. The next command is text2idngram, used in the form "text2idngram -vocab tmp.vocab -n 3 -write_ascii < $infilename > tmp.idngram", where $infilename holds the name of the input text file. text2idngram replaces each word with its numeric ID from the vocabulary and writes out the resulting ID n-grams (trigrams here, because of -n 3) with their counts. This is done to enable more n-grams to be stored efficiently in memory. The last command, idngram2lm, is the executable that actually creates the language model, and it is used in two forms. The first produces a model in ARPA text format that can be read by people: "idngram2lm -idngram tmp.idngram -vocab tmp.vocab -arpa tmp.arpa -ascii_input". The second produces a binary model that is used by the computer: "idngram2lm -idngram tmp.idngram -vocab tmp.vocab -binary tmp.binlm -ascii_input". In both forms, -ascii_input tells idngram2lm that tmp.idngram was written as text, matching the -write_ascii option given to text2idngram.
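To make the intermediate files easier to picture, here is a minimal Python sketch of what the first three commands compute conceptually. It does not call the toolkit and does not reproduce its file formats; the function names are hypothetical, and the ID numbering (vocabulary words numbered from 1, with 0 for words outside the vocabulary) is a choice made for this sketch.

```python
from collections import Counter


def word_frequencies(lines):
    """Count how often each word occurs (the role text2wfreq plays)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def build_vocab(freqs):
    """Return the words in alphabetical order (the role wfreq2vocab plays)."""
    return sorted(freqs)


def id_ngram_counts(lines, vocab, n=3):
    """Map words to integer IDs and count n-grams of IDs
    (the role text2idngram plays)."""
    # Sketch assumption: IDs start at 1, and 0 marks out-of-vocabulary words.
    word_id = {word: i for i, word in enumerate(vocab, start=1)}
    ids = [word_id.get(w, 0) for line in lines for w in line.split()]
    return Counter(tuple(ids[i:i + n]) for i in range(len(ids) - n + 1))


if __name__ == "__main__":
    text = ["a b a b", "a b c"]
    freqs = word_frequencies(text)
    vocab = build_vocab(freqs)
    print(freqs)                          # word counts
    print(vocab)                          # alphabetical vocabulary
    print(id_ngram_counts(text, vocab))   # trigram counts over word IDs
```

Storing small integers instead of repeated word strings is what makes the n-gram counts compact, which is the reason given above for the text2idngram step. idngram2lm then turns these counts into probabilities, which this sketch does not attempt.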
List of Scripts Used to Create a Language Model
Note: The underscore between ICSI and Transcriptions in the directory path given under Parsing the Transcript is not actually there.