Speech:Models LM Build

From Openitware


Project Notes


Model Building: Language Model Building Steps

This page describes the initial setup and preparation of the data needed to build statistical language models, generate a robust set of acoustic models (training), and verify them by testing (decoding) on the trained corpus. For detailed steps on how to train and decode, see the sub-steps under Model Building above.

Parsing the Transcript

The first step in building a language model is to clean up the raw transcript file by removing all unwanted characters. To do this, call the parseNLTrans.pl script from the /mnt/main/corpus/dist/Switchboard/transcripts/ICSI_Transcriptions/trans/icsi directory. The command to execute the script is "perl parseNLTrans.pl test.txt tmp.text", where test.txt is an example name for a raw transcript file and tmp.text is the filtered transcript. Both the input and the output must be text files because the text2wfreq command called in lm_create.pl requires a text file as input. The result of running this script is a copy of the transcript that contains only what was said in the audio files.
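parseNLTrans.pl itself is not reproduced here, but the kind of filtering it performs can be sketched in a few lines. The following Python sketch is illustrative only: the specific markers it strips (bracketed noise tags, angle-bracket markup, stray punctuation) are assumptions, not the actual rules in parseNLTrans.pl.

```python
import re

def clean_transcript_line(line):
    """Strip annotation markers so only the spoken words remain.
    The patterns removed here are illustrative; the real
    parseNLTrans.pl has its own filtering rules."""
    line = re.sub(r"\[[^\]]*\]", " ", line)   # bracketed noise tags, e.g. [laughter]
    line = re.sub(r"<[^>]*>", " ", line)      # angle-bracket markup
    line = re.sub(r"[^A-Za-z' ]", " ", line)  # drop remaining punctuation and digits
    return " ".join(line.split()).upper()

print(clean_transcript_line("so, [laughter] i think <b_aside> it's fine."))
```

The output is a line of plain words, which is the form text2wfreq expects.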

Creating the Language Model

The next step in creating a language model is to run the lm_create.pl script. This Perl script calls four different executable commands. The first of these commands is text2wfreq, run as "text2wfreq <tmp.text> tmp.wfreq". text2wfreq takes the filtered transcript file that parseNLTrans.pl created and uses it to produce a file that contains the frequency of every word in the transcript. After that, the command wfreq2vocab is executed in the form "wfreq2vocab <tmp.wfreq> tmp.vocab"; it takes tmp.wfreq as input and creates an alphabetical list of every word that was found in the transcript. The next command is text2idngram, used in the form "text2idngram -vocab tmp.vocab -n 3 -write_ascii <$inFilename> tmp.idngram". It replaces each word with an integer id so that more n-grams can be stored efficiently in memory. The last command, idngram2lm, is the executable that actually creates the language model, and it has two forms: one that produces a model users can read, "idngram2lm -idngram tmp.idngram -vocab tmp.vocab -arpa tmp.arpa -ascii_input", and one that produces a binary model used by the computer, "idngram2lm -idngram tmp.idngram -vocab tmp.vocab -binary tmp.binlm -ascii_input".
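The first three stages of this pipeline can be sketched in Python to show what each intermediate file holds. This is a conceptual illustration of what text2wfreq, wfreq2vocab, and text2idngram compute, not the toolkit's actual file formats; the sample sentence is made up.

```python
from collections import Counter

text = "the cat sat on the mat the cat slept"
words = text.split()

# text2wfreq: count word frequencies with a hash table
# (which is why its raw output is unordered)
wfreq = Counter(words)

# wfreq2vocab: an alphabetically sorted list of the distinct words
vocab = sorted(wfreq)

# text2idngram: replace each word with its integer vocab id and
# count n-grams over the ids (n=3, matching the "-n 3" option)
word_id = {w: i for i, w in enumerate(vocab, start=1)}
ids = [word_id[w] for w in words]
trigrams = Counter(zip(ids, ids[1:], ids[2:]))

print(vocab)
print(sorted(trigrams.items()))
```

Storing integer ids instead of strings is what lets text2idngram keep many more n-grams in memory at once.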

List of Scripts Used to Create a Language Model

Script Details

lm_create.pl (located in /root/SCRIPTS)
  • The following is a brief description of what each command does and its input and output files:
    • text2wfreq <tmp.text> tmp.wfreq:
      • Input: Filtered transcript text file
      • Output: A file that lists every word found in the transcript and the number of occurrences of each word.
      • What it does: Uses a hash table to count the words efficiently. The words in the list are out of order due to the randomness of the hash.
    • wfreq2vocab <tmp.wfreq> tmp.vocab:
      • Input: Word unigram file produced by text2wfreq
      • Output: Vocab file
      • What it does: Produces an alphabetically ordered list of all the words from the transcript.
    • text2idngram -vocab tmp.vocab -n 3 -write_ascii <$inFilename> tmp.idngram:
      • Input: Parsed transcript file and vocab file
      • Output: File that lists all id n-grams that occur in the text file, and the total number of occurrences of each.
      • What it does: Matches each word from the transcript with an integer to enable more n-grams to be stored and sorted in memory. If the "-write_ascii" option is used, an ASCII version of the file is created instead of the binary version that Sphinx needs to create the language model.
    • idngram2lm -idngram tmp.idngram -vocab tmp.vocab -arpa tmp.arpa -ascii_input:
      • Input: Id n-gram file (either the binary or the ASCII version) and a vocab file.
      • Output: Language model file in ARPA format.
      • What it does: Creates a language model using the idngram and vocab files. With the "-arpa tmp.arpa" option, a human-readable language model file is created.
    • idngram2lm -idngram tmp.idngram -vocab tmp.vocab -binary tmp.binlm -ascii_input:
      • Input: Id n-gram file (either the binary or the ASCII version) and a vocab file.
      • Output: Language model file in binary format.
      • What it does: Creates a language model using the idngram and vocab files. With the "-binary tmp.binlm" option, a binary file is created that is used by the "evallm" command.
    • Reference: http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html

parseNLTrans.pl (located in /mnt/main/corpus/dist/Switchboard/transcripts/ICSI_Transcriptions/trans/icsi)
  • Input: Raw transcript file
  • Output: Filtered transcript file
  • What it does: Takes in a raw transcript file and strips it of all unwanted characters.

Note: The underscore between ICSI and Transcriptions is not actually there.
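The last stage, idngram2lm, can be pictured as turning id n-gram counts into conditional probabilities. The sketch below is a heavily simplified illustration using made-up counts: the real idngram2lm also applies discounting and backoff weights, which are omitted here.

```python
from collections import Counter

# Illustrative id trigram counts (word-id triple -> count),
# like a tiny tmp.idngram; the numbers are invented.
idngram = Counter({(6, 1, 4): 2, (6, 1, 5): 1, (1, 4, 3): 3})

# idngram2lm, conceptually: estimate P(w3 | w1, w2) as the trigram
# count divided by the total count of its two-word context.
context_totals = Counter()
for (w1, w2, w3), c in idngram.items():
    context_totals[(w1, w2)] += c

prob = {(w1, w2, w3): c / context_totals[(w1, w2)]
        for (w1, w2, w3), c in idngram.items()}

print(prob[(6, 1, 4)])  # 2 / (2 + 1)
```

An ARPA-format model stores (log) probabilities like these, plus backoff weights, for each n-gram order.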